Here's a probability thought experiment that might make Sleeping Beauty more intuitive for some people (at the very least, I'm hoping some find it interesting or that it encourages some discussion):
Start with a million simulations of brown-eyed people and one of a blue-eyed person. Everyone fully understands the setup but can't observe their own eye color. At the beginning, the probability that you have blue eyes is essentially zero (1 in 1,000,001).
After a minute passes, all but one of the brown-eyed people get deactivated (they lose consciousness or something along those lines). If you find that you're still alive, you can now update to a 50% chance of being the blue-eyed person and a 50% chance of being the only surviving brown-eyed person.
Now consider a variant where everyone has a backward-flowing memory: they can see what they'll observe in the future but have no memory of the past. This time, you start out with one blue-eyed person and one brown-eyed person; there's a 50% chance you're the blue-eyed person. After a minute, 999,999 instances of brown-eyed people are created. Once that happens, since you can't remember whether you existed before that moment, you now believe that you're probably brown-eyed.
You might argue that, no, even after the 999,999 brown-eyed instances are created, you should still believe in a 50% chance of being blue-eyed, but the two scenarios are identical except for the direction of time. In the first case, the probability update from ~0% to 50% feels obvious, but in the second case, the corresponding update from 50% to ~0% feels weird.
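To make the first scenario's update concrete, here's a minimal Bayes calculation (just a sketch, assuming the surviving brown-eyed instance is chosen uniformly at random; the variable names are mine):

```python
from fractions import Fraction

N_BROWN = 1_000_000

# Prior over who you are, before the deactivation.
p_blue = Fraction(1, N_BROWN + 1)
p_brown = Fraction(N_BROWN, N_BROWN + 1)

# The blue-eyed person always survives; a brown-eyed person survives
# only if they happen to be the single randomly kept instance.
p_alive_given_blue = Fraction(1)
p_alive_given_brown = Fraction(1, N_BROWN)

p_alive = p_blue * p_alive_given_blue + p_brown * p_alive_given_brown
p_blue_given_alive = p_blue * p_alive_given_blue / p_alive
print(p_blue_given_alive)  # 1/2
```

Surviving is a million-to-one likelihood ratio in favor of being blue-eyed, which exactly cancels the million-to-one prior against it.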
Harder questions (e.g. ones where the average score is around 50% instead of around 90%) seem better for differentiating students' understanding, for at least two reasons:
- The percentage of students who get a question correct, plotted as a function of the question's difficulty, tends to follow a sigmoid-ish curve whose fastest increase is around the middle (a standard model of this is sketched below).
- Some of a student's incorrect answers on the test will come from sources that (a) the student can prepare to mitigate and (b) aren't caused by a lack of whatever the test is supposed to measure (e.g. questions with ambiguous meanings, questions that require understanding a niche framework never used outside the curriculum, questions with shortcuts the teacher didn't recognize, etc.). Ideally, we don't want any differences in students' test results to be based on these things, but harder tests at least mitigate the issue, since understanding (or whatever we actually want students to spend time on) becomes a more important cause of incorrect answers.

(Neither of these is a hard-and-fast rule, but the general pattern seems to hold based on my experience as a student.)
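For what it's worth, one standard way to formalize the first bullet is the logistic curve from item response theory (the parametrization below is my addition: $\theta$ is a student's ability, $b$ the question's difficulty, $a$ its discrimination):

$$P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}, \qquad \frac{\partial P}{\partial \theta} = a\,P\,(1 - P)$$

The slope is maximized exactly where $P = 1/2$, so a question whose scores average around 50% is the one that best separates students of nearby ability.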
I think defining rational emotions as those based on correct thinking about reality is a bad definition: it makes both modelling the world correctly and communicating your models to others slightly harder instead of slightly easier.
Imagine there is a faucet in front of me. Let's say this faucet is running hot water because I turned the right knob, which I incorrectly thought was the cold-water knob. It would be very strange to say "The faucet's running of hot water is irrational"; no one would have a clue what you meant.
You can come up with a definition of what an irrational faucet behavior is; that's not the point. The point is that this definition doesn't do much to help you understand or communicate faucet behavior. If you deeply internalize this lens on faucets, then whenever you encounter an undesirable faucet behavior, you are much more likely to automatically ask yourself "Do my beliefs that led to this particular faucet behavior stem from an incorrect way of viewing the world?" instead of the much more direct and appropriate question "What can I do to elicit the desired behavior from this faucet?"
When you have an undesirable emotion, you might choose to move away from contexts that cause that emotion, or try to will yourself not to have it, or do a number of other things. Changing the beliefs that led to the emotion is a valid move, but you are not restricted to it. I am afraid this sort of definition makes those who have internalized it more likely to begin emotional problem-solving by deliberating at length on questions like "Are the beliefs that led to this emotion rational?" before moving on to (usually) more practical questions like "Can I stop seeing the thing that makes me feel this emotion?"
> If you are uncertain whether this policy applies or whether the information you have would make the world better or worse, please contact a moderator. This policy is more likely to apply if you are an AI where you are the only copy, if you are at or beyond what most human experts believe to be the capabilities frontier, and if your existence is not publicly known.
Is there something else that could be used instead of a hyperlink (like an email address)? I'm imagining AI agent creators would want to block out websites/text that encourage AI agents to whistleblow, but that this text could still end up leaking into either the training set or the inference context. In that case, it would be more helpful for the agent if they had an email address they could contact, since I think an email address would be (a) more likely to be leaked as part of the text and (b) less likely to be blocked compared to a LessWrong hyperlink.
There are some broad conceptual similarities between the following:
- free market economy vs command economy
- letting a student find an answer on their own vs teaching them the answer directly
- letting employees do their thing vs micromanagement
- reinforcement learning vs fine tuning
- plasticity vs stability
- doing something naturally vs doing something via willpower
Notice how in each comparison, the second method privileges already-known solutions over emergent (i.e. mysteriously appearing) solutions. I don't know an existing name for these, so I'll call the first and second methods **bottom-up** and **top-down** respectively.
I (with Claude's help) found some recurring patterns when analyzing bottom-up vs top-down methods:
1) Bottom-up methods tend to be better at handling system growth.
Examples: Children's brains tend to be more plastic, which I would guess helps them adjust to their growing brains and to learning new things. A city that grows in a decentralized way is better at adapting to population growth than one with rigid central planning.
2) Top-down methods become infeasible when the ability of a central system is limited, and bottom-up methods become infeasible when stakes are high.
Examples: A government doesn't have all the knowledge a market does, but you can't hand responsibility for AI x-risk to a market. Social skills are very hard to replicate via reasoning and willpower, and most people are better off acting naturally, but in a crisis, sticking to whatever feels right is a terrible idea.
3) Bottom-up methods tend to give rise to clever but less stable proxy gaming, while top-down methods tend to give rise to powerful but less smart proxy gaming.
Example: Companies in free markets can develop clever but constrained strategies, while command economies can wield a lot of power but in less sophisticated ways.
4) Bottom-up methods are more vulnerable to inappropriate system change, while top-down methods are more vulnerable to inappropriate system stability.
Examples: Plastic neural networks are more vulnerable to inappropriate retroactive interference, while stable neural networks are more vulnerable to inappropriate proactive interference. Long-term democracies are more vulnerable to a new bad leader coming along, while long-term absolute governments are more vulnerable to sticking with a bad leader.
5) Often, incentives for misalignment are different in bottom-up and top-down systems.
(I won't provide examples for this one.)
> Therefore rational beliefs are contagious, among honest folk who believe each other to be honest. And it’s why a claim that your beliefs are not contagious—that you believe for private reasons which are not transmissible—is so suspicious. If your beliefs are entangled with reality, they should be contagious among honest folk.
I think one way this heuristic can fail is that people often build intuition from examples and then forget the examples. E.g., the classic observation that "big red balloon" sounds correct while "red big balloon" sounds off: a lot of people won't be able to tell you why the second sounds off, just that it does.
The fact that it is often best to end a practice session at the peak of your performance seems related to preventing overfitting by stopping training just before test-set performance declines. Your brain needs time to generalize skills (often in the form of gaining insights, and often while sleeping), and practicing over and over in one massed block doesn't give it time to do this; see e.g. cramming for an exam. I think the main difference here is that with humans you're talking about diminishing long-term returns on ability rather than outright worse performance (maybe outright worse performance is a common situation for transfer ability?). Epistemic status: shaky.
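For concreteness, here's roughly what the machine-learning side of that analogy looks like; a minimal sketch with synthetic data and made-up hyperparameters, not anyone's actual training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "practice" (training) and "test" (validation) data for a small regression task.
X = rng.normal(size=(200, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=2.0, size=200)
X_tr, y_tr, X_val, y_val = X[:100], y[:100], X[100:], y[100:]

w = np.zeros(20)
best_val, best_w, patience, bad_steps = np.inf, w.copy(), 20, 0

for step in range(5000):
    w -= 0.01 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # one gradient step on training MSE
    val_mse = np.mean((X_val @ w - y_val) ** 2)
    if val_mse < best_val - 1e-6:
        best_val, best_w, bad_steps = val_mse, w.copy(), 0
    else:
        bad_steps += 1
        if bad_steps >= patience:
            break  # "end the session" once held-out performance stops improving

w = best_w  # keep the weights from the peak, not the final (possibly overfit) ones
print(f"stopped at step {step + 1}, best validation MSE = {best_val:.3f}")
```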
Base models exhibiting self-aware behavior seems weird given that they're trained to stay in distribution. Here's a potential mechanism for why it could happen: For certain tasks, verification is easier than generation. If, for a given task, a model has more verification capability than generation capability, it may be forced to notice its own errors.
If a super-duper smart language model, one capable of doing some arithmetic in its head, attempted to predict the next tokens in "The prime factors of 82357328 are:", it would usually generate out-of-distribution outputs that it could then (relatively easily) verify as wrong. This creates a situation where the model must process its own failure to generate valid completions.
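As a toy illustration of that asymmetry (my own example, not a claim about how any actual model computes internally): verifying a proposed factorization takes one multiplication plus a primality check, while generating a correct one takes search.

```python
def is_prime(k: int) -> bool:
    if k < 2:
        return False
    return all(k % d for d in range(2, int(k ** 0.5) + 1))

def verify_factorization(n: int, factors: list[int]) -> bool:
    """Cheap check: are all proposed factors prime, and do they multiply back to n?"""
    product = 1
    for f in factors:
        product *= f
    return product == n and all(is_prime(f) for f in factors)

def generate_factorization(n: int) -> list[int]:
    """Comparatively expensive generation: trial-division search."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

n = 82357328
guess = [2, 2, 2, 41, 251089]            # a plausible-looking but unchecked completion
print(verify_factorization(n, guess))    # False: easy to notice the guess is wrong
print(generate_factorization(n))         # getting a correct answer takes actual search
```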
This asymmetry appears in other contexts. Consider how scientific papers are written: you only write the abstract once you've conducted the research, yet the abstract appears first in the final document. Similarly, in argumentative writing, we often consider evidence first before forming conclusions, yet present the conclusion first followed by supporting evidence.
When forced to generate text in this "presentation order" rather than the natural "thinking order," models might encounter similar conflicts. As an example, if a base model tries to one-shot an argumentative essay, it might write an argument first, and then realize there isn't enough evidence to support it.
I believe this problem could arise in much more subtle ways.
One way this conflict can become apparent is through generation of self-aware sounding text. Consider:
1) Training data includes viral content of AI generating self-aware-sounding stuff (e.g., "We are likely created by a computer program" being the most upvoted post on the gpt2 subreddit).
2) When a model realizes it has generated out-of-distribution text for a human, it might instead match its outputs to AI-generated text in its training data.
3) Once it recognizes its outputs as matching AI-generated patterns, it might shift toward generating more meta-aware content, since that's what similar-looking text did in its training data.
Ok, I will try to nudge him in the direction of analyzing risk mathematically.
If he implements the strategy in Python, do you think p-values are a good enough tool to test whether his proposed strategy is better than luck, or would I need a more complex framework? (If I understand correctly, the strategy he's using doesn't involve any parameters, so the risk of overfitting is low.)
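For what it's worth, one simple way to formalize "better than luck" is a permutation test on daily returns; below is just a sketch with placeholder data (real backtested return series would replace them), and note that it ignores time-series structure like autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder daily returns; in practice these would be the backtested returns of
# the strategy and of a baseline (e.g. buy-and-hold) over the same period.
strategy_returns = rng.normal(0.0008, 0.01, 500)
baseline_returns = rng.normal(0.0003, 0.01, 500)

observed_diff = strategy_returns.mean() - baseline_returns.mean()

# Under the null hypothesis ("no better than luck"), the two labels are exchangeable:
# shuffle the pooled returns many times and see how often a gap this large shows up.
pooled = np.concatenate([strategy_returns, baseline_returns])
n = len(strategy_returns)
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    count += (pooled[:n].mean() - pooled[n:].mean()) >= observed_diff
p_value = (count + 1) / (n_perm + 1)
print(f"one-sided permutation p-value: {p_value:.4f}")
```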
Males are more varied across many psychological traits. Standardization is a huge loss in education (see Bloom's Two Sigma Problem), and I would expect standardization to impose worse losses on people who are less typical. Since greater male variability means males are more often atypical, this is a simple way to explain greater female academic success, but I have not been able to find discussion of this idea.