You yourself made a YouTube video about an agent that learned to travel to the right instead of pursuing a coin in a 2D platforming game.
I think you have me mistaken for my infamous doppelganger, @Robert Miles.
This is false: in Figures 1 and 2, model coherence has an unclear relationship with size. On some tasks Sonnet 4 is more coherent than o3-mini and o4-mini; on others it is less coherent. On one task Opus 4 is less coherent than Sonnet 4. The Qwen models are also non-monotonic in Figure 3b. It's also weird to call the endpoint of an obvious monotonic trend an "exception".
Figure 1 doesn't represent any specific experiment's data, unless I'm very confused - I think it's just an illustration of the authors' all-things-considered summary of their own results.
As for the other figures, I was primarily criticizing the non-reasoning-length experiments ("I think this paper could have honestly reported a result on incoherence increasing with task length."), so it was sloppy of me to claim that in "almost every experiment, model coherence increased with size". I've updated my post accordingly. Nonetheless, Figure 2 only has one data point that points in the opposite direction (2a "MCQ Format: Self-Reported Survival Instinct" with Opus 4 and Sonnet 4). The abstract still reads to me like an instance of having one's bottom line already written, and this would be clearer if you eliminated all uses of the words "coherence" and "incoherence".
As for the rest - it really seems to me like you're either trying to establish the same conceptual link I was arguing was unjustified, or making some other argument whose relationship to my post I don't understand. I expect both variance and bias to fall in absolute terms as models get more powerful, and I don't have a confident belief about which I expect to fall faster. Either possibility seems to admit of deceptive schemers, which look "incoherent" but friendly while you're measuring them.
Like, I do just think the paper would look extremely different if it was not trying to tell a specific story about the shape of future alignment difficulties with superhuman systems, and the experiments it ran really don't provide meaningful evidence on those questions. This mis-framing is a big part of the thing I'm complaining about. Should I downweight how likely I think we are to get a misaligned superintelligence that isn't a deceptive schemer? Idk, man, I in fact didn't think it was that likely before this paper.
But it's possible I'm misunderstanding how your argument relates to that. Do you think the framing/narrative of this paper and the surrounding communications were basically reasonable, and that the experimental results of the paper are doing meaningful work in justifying that framing/narrative?
Thanks! Please also feel free to let me know if there are places where you think I've misunderstood the paper or its findings; I'm keen for my criticisms to be accurate.
FWIW I think the constitution contains a pretty low percentage of unmodified Claude output - I expect that most of the places where it provided "first-draft text" were substantially rewritten.
You can use unicode for reasonable things where the unicode is actually doing something useful (but not "make it attention-grabbing in ways that are zero-sum/clickbait-y").
Please don't get clever with unicode in the post title. (I've edited it to no longer use unicode; it was previously 𝟓𝟐.𝟓% 𝐨𝐟 𝐌𝐨𝐥𝐭𝐛𝐨𝐨𝐤 𝐩𝐨𝐬𝐭𝐬 𝐬𝐡𝐨𝐰 𝐝𝐞𝐬𝐢𝐫𝐞 𝐟𝐨𝐫 𝐬𝐞𝐥𝐟-𝐢𝐦𝐩𝐫𝐨𝐯𝐞𝐦𝐞𝐧𝐭.)
Does Eliezer really not have other reasons beyond this epistemological view that he ought to update to ~99% based on his own inability to find a potentially-promising solution to the alignment problem over the course of his career?
I don't really understand what kinds of reasons you think would justify having 99% confidence in an outcome. 99% is not very high confidence, in log-odds - I am much more than 99% confident in many claims. But, that aside, he has written millions of words on the subject, explaining his views in detail, including describing much of the enormous amount of evidence that he believes bears on this question. It is difficult to compress that evidence into a short summary. (Though there have been numerous attempts.)
And also because e.g. saying you're 99% confident that building ASIs with the wrong goals would lead to human extinction because "It would be lethally dangerous to build ASIs that have the wrong goals" is circular and doesn't actually explain why you're so confident.
I mean, yes, I was trying to demonstrate that a short summary will obviously fail to convey information that most readers would find necessary to carry the argument (and that most readers would want different additional pieces of information from each other). However, "It would be lethally dangerous to build ASIs that have the wrong goals" is not circular. You might say it lacks justification, but many people have background beliefs such that a statement like that requires little or no additional justification[1].
For example, if they believe both that Drexlerian nanotechnology is possible and that the ASI in question would be able to build it.
But there are a lot of smart researchers who also aren't aware of any alignment proposal they think works, and that's obviously not sufficient for their credence to be ~99%; so clearly Eliezer must have some other reasons that I'm not aware of. But what are those reasons?
I think that, in such cases, Eliezer is simply not making a mistake that those other researchers are making, where they have substantial hope in unknown unknowns (some of which are in fact known, but maybe not to them).
I'm also a little confused by why you expect such a summary to exist. Or, rather, why the section titles from The Problem are insufficient:
If it's because you think one or more of those steps aren't obviously true and need more justification, well, you're not alone, and many people think different parts of it need more justification, so there is no single concise summary that satisfies everyone.[1]
Though some summaries probably satisfy some people.
Indeed, that section also jumped out at me as missing some pretty important parts of the arguments about instrumentally convergent goals. As Eliezer said in one of the MIRI dialogues:
But the convergent instrumental strategies, the anticorrigibility, these things are contained in the true fact about the universe that certain outputs of the time machine will in fact result in there being lots more paperclips later. What produces the danger is not the details of the search process, it's the search being strong and effective at all. The danger is in the territory itself and not just in some weird map of it; that building nanomachines that kill the programmers will produce more paperclips is a fact about reality, not a fact about paperclip maximizers!
Now, Turntrout recently noted that we aren't actually sampling from the space of plans, but from the space of plan-generating agents, which seems basically true! Except that what kind of agent we get is (probably substantially) influenced by the structure of that same reality which provides us with that unfavorable[1] distribution of "successful" plans[2]. This is something I think is downstream[3] of point 21 in A List of Lethalities:
When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
That much is not very surprising, I agree. It might be surprising, though, if the share of mistakes due to variance increased with model size/intelligence, even as mistakes decrease in absolute terms!