Consider an agent reasoning: "What kind of process could have produced me?" If the agent is literally the argmax of some simple scoring function f, then the selection process must have enumerated all possible agents, evaluated f on each, and picked the maximum. This is physically unrealizable: it requires resources exceeding what's available in the environment. So the agent concludes that it wasn't generated by the argmax.
This is the invalid step in the reasoning: for AIXI agents, the environment is allowed by construction to have unlimited resources and to be arbitrarily complicated, so there are environments in which the literal search procedure can in fact be carried out.
This is why AIXI is usually considered in an unbounded setting, where we give it unlimited memory and time, as with a universal Turing machine, and grant it certain oracular powers so that it can actually be used for inference or planning.
You underestimate how complicated and resource-rich environments are allowed to be.
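For concreteness, here is a minimal sketch of the object being discussed, in Hutter's standard notation (nothing here is specific to the quoted argument). The "literal argmax over agents" picture is

$$\pi^{*} \;:=\; \arg\max_{\pi \in \Pi} f(\pi),$$

while AIXI itself selects actions by an unbounded expectimax over a Solomonoff mixture of all computable environments (programs $q$ for a universal machine $U$, with horizon $m$):

$$a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \bigl[r_k + \cdots + r_m\bigr] \sum_{q \,:\, U(q,\,a_{1:m}) = o_{1:m} r_{1:m}} 2^{-\ell(q)}.$$

Both expressions quantify over every candidate (every policy, every program), which is exactly the kind of search that only an unboundedly resourced setting can host.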
Another gloss: we can't define what it means for an embedded agent to be "ideal" because embedded agents are messy physical systems, and messy physical systems are never ideal. At most they're "good enough". So we should only hope to define when an embedded agent is good enough. Moreover, such agents must be generated by a physically realistic selection process.
This depends heavily on what the rules of the environment are, and embedded agents can be ideal in certain environments.
I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid.
I want to flag here that the version of great man theory that was debunked by modern sociology is the claim that big impacts on the world are always or almost always caused by great men, not the claim that great men can't have big impacts on the world.
For what it's worth, I actually disagree with this view, and think one of the bigger things LW gets right is that people's impact in a lot of domains is pretty heavy-tailed, and that certain things matter far more than others under a given utility function.
I do agree that people can round the impact of rare geniuses off to infinity, and there is a point to be made about LWers overvaluing theory- and curiosity-driven work compared to just using simple baselines and doing what works (a critique I agree with). But the appreciation of heavy-tailed impact is one of the things I most value about LW, and while real problems do stem from it, I think it's important not to damage that appreciation too much in the course of solving them (assuming the heavy-tailed hypothesis is true, which I largely believe).
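As a toy illustration of what "heavy-tailed" means quantitatively (my own made-up numbers, just to make the claim concrete): under a lognormal impact distribution with a fat enough spread, a small fraction of people accounts for most of the total.

```python
import random

# Toy sketch: sample "impacts" from a lognormal and see how much of the total
# the top 1% holds. sigma is made up; it only controls how fat the tail is.
random.seed(0)
impacts = sorted((random.lognormvariate(0, 2.5) for _ in range(100_000)), reverse=True)
top_1_percent = sum(impacts[: len(impacts) // 100])
print(f"top 1% share of total impact: {top_1_percent / sum(impacts):.0%}")
# With sigma = 2.5 this typically lands somewhere around 50-65%; with a thin
# tail (sigma = 0.5) the top 1% would hold only a few percent.
```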
especially as compute will only keep scaling until ~2030, and then the amount of fuel for exploring algorithmic ideas won't keep growing as rapidly
Technical flag: compute scaling will slow down to the historical Moore's-law trend plus historical fab buildout times rather than stopping completely, which means it goes from roughly 3.5x per year to roughly 1.55x per year. But yes, this does take some wind out of the sails of algorithmic progress (though it's worth noting that even after LLM scaling slows, we'll plausibly be able to simulate human brains passably by the late 2030s, speeding up progress to AGI).
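To make the difference between those two rates concrete (just compounding them over five years, my arithmetic only):

$$3.5^{5} \approx 525\times \qquad \text{vs.} \qquad 1.55^{5} \approx 9\times,$$

so the same five-year window delivers roughly 60 times less additional compute to fuel the exploration of algorithmic ideas.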
Another potential implication is that we should be more careful when talking about misalignment in LLMs, as misalignment might be due to the model being gaslighted into believing that it's capable of doing something it isn't.
This would affect the interpretation of the examples Habryka gave below:
The main reason I was talking about RNN/neuralese architectures is that they fundamentally break the assumption that legible CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work happens, which is what makes it pretty easy to monitor.
To be precise, I'm claiming this part of your post provides little evidence for safety post-LLMs:
I have gone over a bunch of the evidence that contradicts the classical doom picture. It could almost be explained away by noting that capabilities are too low to take over, if not for the fact that we can see the chains of thought and they’re by and large honest.
This also weakens the influence of the pre-training prior, because a continually learning AI keeps drifting away from it, unlike today's AIs, which stop learning after training (hence their knowledge cutoff dates). That means we can't rely on the prior to automate alignment (though the prior learned from human data is actually surprisingly powerful, and I expect models built on it to be able to complete 1-6 month tasks by 2030, which is when scaling most likely starts to slow down, absent TAI arising before then, which I think is reasonably likely).
Agree that interpretability might get easier, and this is a useful positive consideration to think about.
So I've become more pessimistic over the last year about this sort of outcome happening. A lot of that comes down to my believing in longer timelines, which means I expect the LLM paradigm to be superseded, and the most likely capability-increasing successors unfortunately look like more neuralese/recurrent architectures. To be clear, they're hard enough to make work that I don't expect them to outperform transformers so long as transformers can keep scaling. Even after the immense scale-up of AI slows to the Moore's-law trend by 2030-2031, I still expect neuralese to be at least somewhat difficult, possibly difficult enough that we survive because AI never reaches the capabilities we expect.
(Yes, a weak form of the scaling hypothesis will survive, but I don't expect the strong versions to work; my reasons for believing this are available on request.)
It's still very possible this happens, but I wouldn't put much weight on this for planning purposes.
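To make the CoT-vs-neuralese distinction concrete, here's a toy sketch of my own (made-up function names, no real model or library involved): a chain-of-thought agent hands its intermediate reasoning forward as human-readable text that a monitor can actually read, while a recurrent/neuralese agent hands forward an opaque vector that gives a monitor nothing legible to check.

```python
def cot_agent_step(task: str, scratchpad: list[str]) -> list[str]:
    # A chain-of-thought agent passes its intermediate reasoning forward as text.
    scratchpad.append(f"Step {len(scratchpad) + 1}: reasoning about {task!r} in plain language")
    return scratchpad

def neuralese_agent_step(task: str, hidden: list[float]) -> list[float]:
    # A recurrent/neuralese agent passes an opaque latent vector forward instead.
    return [0.5 * h + 0.1 * len(task) for h in hidden]

def monitor(scratchpad: list[str], banned: set[str]) -> bool:
    # A CoT monitor can literally read the reasoning and flag suspicious content.
    return any(word in line for line in scratchpad for word in banned)

# With CoT, the monitor has something human-readable to inspect:
trace: list[str] = []
for _ in range(3):
    trace = cot_agent_step("book a flight", trace)
print(monitor(trace, banned={"exfiltrate", "deceive"}))  # False, and a human can check why

# With neuralese, the intermediate state is just numbers; there is no analogous
# cheap check, which is the property the comments above are pointing at:
state = [0.0] * 8
for _ in range(3):
    state = neuralese_agent_step("book a flight", state)
print(state)  # nothing legible to monitor
```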
I agree with a weaker version of the claim: the AI safety landscape is looking better than people thought 10-15 years ago, with AI control probably being the best example here. To be clear, I do think this is actually meaningful and does matter, primarily because these approaches focus less on the limiting cases of AI competence. But I'm currently not as optimistic that some of the important LLM properties relevant for alignment will survive into newer AI designs, which is why I disagree with this post.
You should actually tag @Vladimir_Nesov instead of Vladimir M, as Vladimir Nesov was the original author.
Or early AGIs convince/coerce humanity into not rushing to superintelligence before it's clear how to align it with anyone's well-being (including that of the early AGIs).
BTW, this sort of thing (where the AI itself has an interest in slowing down progress) is one of the reasons why AI safety plans that depend on a certain level of capability being hit might not fall apart: slowing AI down lets us stay in the sweet spot longer.
This does rely on the assumption that the alignment problem is very hard to solve even for AGIs, which my models of the world don't give much likelihood, but this sort of thing could very well prevent human extinction even in worlds where AI alignment is very hard and we don't get much regulation of AI progress from here on.
Another reason people tend to grow organizations, especially in middle-management positions, is that coordination is a key constraint, and anything that loosens this constraint, even if it damages a lot of other things, is often worth it, because coordination is one of the few areas where diminishing returns don't set in as early.
This is part of a more general trend of middlemen being more important than ever (and they're actually necessary to run modern societies).
So any solution to this sort of problem would implicitly be a solution to coordination problems in general.
Links to long comments that I want to pin but are too long to be pinned:
https://www.lesswrong.com/posts/Zzar6BWML555xSt6Z/?commentId=aDuYa3DL48TTLPsdJ
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
https://www.lesswrong.com/posts/DCQ8GfzCqoBzgziew/?commentId=RhTNmgZqjJpzGGAaL