Here's my current understanding of what Scott meant by "just a little off".
I think exact Bayesian inference via Solomonoff induction doesn't run into the trapped prior problem. Unfortunately, bounded agents like us can't do exact Bayesian inference via Solomonoff induction, since we can only consider a finite set of hypotheses at any given point. I think we try to compensate for this by recognizing that our list of hypotheses is incomplete, and adding new hypotheses to it whenever our current hypotheses seem to be doing a sufficiently terrible job of explaining the input data.
One side effect is that if the true hypothesis (e.g. "polar bears are real") is not among our currently considered hypotheses, but our currently considered hypotheses are doing a sufficiently non-terrible job of explaining the input data (e.g. if the hypothesis "polar bears aren't real, but there's a lot of bad evidence suggesting that they are" is included, and the data is noisy enough that this hypothesis is reasonable), we just never end up considering the true hypothesis. Likelihood ratios in favor of polar bears wouldn't accumulate, because actual polar bears were never considered in the first place.
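To make this concrete, here is a minimal Python sketch of the kind of algorithm I have in mind. Everything specific in it is an assumption for illustration, not a claim about how brains work: observations are binary ("credible-looking sighting" or not), each hypothesis is just a probability of a sighting, priors are uniform, and the names `bounded_agent`, `reserve`, and `min_avg_loglik` are made up. The agent only pulls in unconsidered hypotheses when its best current hypothesis explains the data worse than a fixed threshold.

```python
import math

def log_likelihood(p_sighting, data):
    """Total log-likelihood of binary observations under one hypothesis."""
    return sum(math.log(p_sighting if x else 1.0 - p_sighting) for x in data)

def posterior(hypotheses, data):
    """Exact Bayesian posterior over the given hypotheses (uniform prior)."""
    logs = {name: log_likelihood(p, data) for name, p in hypotheses.items()}
    top = max(logs.values())  # subtract the max for numerical stability
    weights = {name: math.exp(v - top) for name, v in logs.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def bounded_agent(considered, reserve, data, min_avg_loglik):
    """Update only over `considered`, adding `reserve` hypotheses if the
    best considered hypothesis does a sufficiently terrible job
    (average log-likelihood per observation below `min_avg_loglik`)."""
    best_avg = max(log_likelihood(p, data) for p in considered.values()) / len(data)
    expanded = best_avg < min_avg_loglik
    if expanded:
        considered = {**considered, **reserve}
    return posterior(considered, data), expanded

# Hypothetical data: 9 out of every 10 observations look like a sighting.
data = ([1] * 9 + [0]) * 20
reserve = {"bears are real": 0.9}

# "Bad evidence" explains the data well enough, so the true hypothesis
# never enters the posterior at all.
trapped, expanded = bounded_agent(
    {"no bears, no evidence": 0.05, "no bears, bad evidence": 0.85},
    reserve, data, min_avg_loglik=-0.5,
)
print(expanded, trapped)

# Without the "bad evidence" hypothesis, the fit is terrible enough
# that the agent goes looking for new hypotheses.
untrapped, expanded = bounded_agent(
    {"no bears, no evidence": 0.05}, reserve, data, min_avg_loglik=-0.5,
)
print(expanded, untrapped)
```

Note that calling `posterior` directly on all three hypotheses does favor "bears are real" on this data, which is the sense in which exact inference avoids the trap while the bounded version doesn't.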
I think something similar is happening with phobias. For example, for someone with a phobia of dogs, I think the (subconscious, non-declarative) hypothesis "dogs are safe" doesn't actually get considered until the subject is well into exposure therapy, after which they've accumulated enough evidence that's sufficiently inconsistent with their prior hypotheses of dogs being scary and dangerous that they start considering alternative hypotheses.
In some sense this algorithm is "going out of its way to do something like compartmentalization", in that it's actively trying to fit all input data into its current hypotheses (/ "compartments") until this method no longer works.
Yep! I addressed this point in footnote .
I just want to share another reason I find this n=1 anecdote so interesting -- I have a highly speculative inside view that the abstract concept of self provides a cognitive affordance for intertemporal coordination, resulting in a phase transition in agentiness only known to be accessible to humans.
Hmm, I'm not sure I understand what point you think I was trying to make. The only case I was trying to make here was that much of our subjective experience which may appear uniquely human might stem from our language abilities, which seems consistent with Helen Keller undergoing a phase transition in her subjective experience upon learning a single abstract concept. I'm not getting what age has to do with this.
Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it's evidence that architectural changes matter a lot.
Not necessarily. For example, it may be that language ability is very important, but that most of the heavy lifting in our language ability comes from general learning abilities + having a culture that gives us good training data for learning language, rather than from architectural changes.
I remembered reading about this a while back and updating on it, but I'd forgotten about it. I definitely think this is relevant, so I'm glad you mentioned it -- thanks!
I think this explanation makes sense, but it raises the further question of why we don't see other animal species with partial language competency. There may be an anthropic explanation here - i.e. that once one species gets a small amount of language ability, they always quickly master language and become the dominant species. But this seems unlikely: e.g. most birds have such severe brain size limitations that, while they could probably have 1% of human language, I doubt they could become dominant in anywhere near the same way we did.
Can you elaborate more on what partial language competency would look like to you? (FWIW, my current best guess is on "once one species gets a small amount of language ability, they always quickly master language and become the dominant species", but I have a lot of uncertainty. I suppose this also depends a lot on what exactly is meant by "language ability".)
This seems like a false dichotomy. We shouldn't think of scaling up as "free" from a complexity perspective - usually when scaling up, you need to make quite a few changes just to keep individual components working. This happens in software all the time: in general it's nontrivial to roll out the same service to 1000x users.
I agree. But I also think there's an important sense in which this additional complexity is mundane -- if the only sorts of differences between a mouse brain and a human brain were the sorts of differences involved in scaling up a software service to 1000x users, I think it would be fair (although somewhat glib) to call a human brain a scaled-up mouse brain. I don't think this comparison would be fair if the sorts of differences were more like the sorts of differences involved in creating 1000 new software services.
That's one of the "unique intellectual superpowers" that I think language confers on us:
On a species level, our mastery of language enables intricate insights to accumulate over generations with high fidelity. Our ability to stand on the shoulders of giants is unique among animals, which is why our culture is unrivaled in its richness and sophistication.
(I do think it helps to explicitly name our ability to learn culture as something that sets us apart, and wish I'd made that more front-and-center.)
I'm still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they're in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
I didn't understand what you wrote about verification well enough to have anything to say.
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it's in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?
The inner process may nevertheless use TDT if TDT doesn't diverge from CDT on the training distribution, or it might learn to use TDT but "look nice" so that it doesn't get selected against.
This was what I was intending to convey in assumption 3.