Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Editor’s note: I’m experimenting with having a lower quality threshold for just posting things even while I’m still confused and unconfident about my conclusions, but with this disclaimer at the top.

This post is a follow-up to my earlier post.

If ELK is impossible in generality, how could we solve it in practice? Two main ways I can think of:

  • Natural abstractions: maybe the complexity of direct translators is bounded in practice because the model's ontology is not so different from ours, violating the worst-case assumption.
  • Science being easy: maybe there isn't any hard ceiling on how good you can get at doing science, and doing science isn't actually that much more expensive, so it's feasible to just scale it up.

For the first one, I see reasons both for and against it being true. The main argument for, especially in the context of LMs, is that human abstractions are baked into language, so there's a good chance LMs learn those abstractions too. I used to be a lot more optimistic that this is true, but I've since changed my mind. For one, in the limit, natural abstractions are definitely not optimal: the optimal thing in the limit is to simulate all the air molecules, or something like that. So even if the natural abstractions hypothesis holds at some scale, it must eventually stop being optimal, and the abstractions get weirder and weirder past that point. The question then becomes not just whether a sweet spot where natural abstractions hold even exists, but whether that sweet spot overlaps with the kinds of models that are dangerous and might kill us. I still think this is plausible, but kind of iffy.

I think this is worth looking into empirically, but I'm cautious about reading too much into the results: even if we see increasingly natural abstractions as model size increases, the trend could reverse before our models are sufficiently powerful. I'm moderately worried that someone will overclaim the significance of results in this vein. Also, my intuition about language models is that despite communicating in language, they learn and think in pretty alien ways.
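As one illustration of the kind of empirical test I have in mind (this is my own sketch, not a method from the ELK report): compare a model's internal representations against human-interpretable features using a representation-similarity metric such as linear CKA (centered kernel alignment, from Kornblith et al.'s "Similarity of Neural Network Representations Revisited"), and track whether the similarity rises and then falls with scale. A minimal toy sketch, with random matrices standing in for real activations:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices of shape (n_examples, n_features). Returns a value in
    [0, 1]; 1 means the representations match up to rotation and
    isotropic scaling."""
    X = X - X.mean(axis=0)  # center each feature column
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    denom = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / denom

# Toy stand-ins: `human_feats` plays the role of human-interpretable
# features of some inputs; the two `*_acts` matrices play the role of
# two models' activations on the same inputs.
rng = np.random.default_rng(0)
human_feats = rng.normal(size=(200, 16))

# A representation that is a rotation of the human features (plus a
# little noise) should score near 1, since CKA is invariant to
# orthogonal transforms...
rotation = np.linalg.qr(rng.normal(size=(16, 16)))[0]
aligned_acts = human_feats @ rotation + 0.1 * rng.normal(size=(200, 16))

# ...while an unrelated ("alien") representation should score low.
alien_acts = rng.normal(size=(200, 16))

print(linear_cka(human_feats, aligned_acts))  # close to 1
print(linear_cka(human_feats, alien_acts))    # close to 0
```

In a real experiment the feature matrices would come from model activations at several scales and from some operationalization of human abstractions (e.g. labeled concept features), both of which are the hard part; the metric itself is the easy part.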

For the second one, if it's true, then it defeats most of the point of doing ELK in the first place. If you can just scale up the science-doing, then ELK becomes merely an optimization, the same way reward modelling is merely an optimization over asking humans every single time you need a reward. In that case, doing something vaguely ELK-like might be more efficient, but it wouldn't let us do anything fundamentally new. That might still be hugely important in practice, but it would be a very different goal from what people mean when they talk about ELK right now.

As for the other point, I'm personally not hugely optimistic about any of the existing science-doing methods, but this opinion isn't based on very much thought, and I expect my stance to be fairly easily changeable. I think this is a useful thing for me to spend more time thinking about at some point.

Future directions: empirical work testing natural abstractions; figuring out how to do science better.

1 comment

my intuition about language models is that despite communicating in language, they learn and think in pretty alien ways.


Do you think you can elaborate on why you think this? Recent interpretability work such as Locating and Editing Factual Associations in GPT and In-context Learning and Induction Heads has generally made me think language models have more interpretable/understandable internal computations than I'd initially assumed.