The Standard Analogy

Zack_M_Davis

[Scene: a suburban house, a minute after the conclusion of "And All the Shoggoths Merely Players". Doomimir returns with his package, which he places by the door, and turns his attention to Simplicia, who has been waiting for him.]

Simplicia: Right. To recap for [coughs] no one in particular, when we left off [pointedly, to the audience] one minute ago, Doomimir Doomovitch, you were expressing confidence that approaches to aligning artificial general intelligence within the current paradigm were almost certain to fail. You don't think that the apparent tractability of getting contemporary generative AI techniques to do what humans want bears on that question. But you did say you have empirical evidence for your view, which I'm excited to hear about!

Doomimir: Indeed, Simplicia Optimistovna. My empirical evidence is the example of the evolution of human intelligence. You see, humans were optimized for one thing only: inclusive genetic fitness—

[Simplicia turns to the audience and makes a face.]

Doomimir: [annoyed] What?

Simplicia: When you said you had empirical evidence, I thought you meant empirical evidence about AI, not the same analogy to an unrelated field that I've been hearing for the last fifteen years. I was hoping for, you know, ArXiv papers about SGD's inductive biases, or online regret bounds, or singular learning theory ... something, anything at all, from this century, that engages with what we've learned from the experience of actually building artificial minds.

Doomimir: That's one of the many things you Earthlings refuse to understand. You didn't build that.

Simplicia: What?

Doomimir: The capabilities advances that your civilization's AI guys have been turning out these days haven't come from a deeper understanding of cognition, but from improvements to generic optimization methods, fueled with ever-increasing investments in compute. Deep learning not only isn't a science, it isn't even an engineering discipline in the traditional sense: the opacity of the artifacts it produces has no analogue among bridge or engine designs. In effect, all the object-level engineering work is being done by gradient descent.

The autogenocidal maniac Richard Sutton calls this the bitter lesson, and attributes the field's slowness to embrace it to ego and recalcitrance on the part of practitioners. But in accordance with the dictum to feel fully the emotion that fits the facts, I think bitterness is appropriate. It makes sense to be bitter about the shortsighted adoption of a fundamentally unalignable paradigm on the basis of its immediate performance, when a saner world would notice the glaring foreseeable difficulties and coordinate on doing Something Else Which Is Not That.

Simplicia: I don't think that's quite the correct reading of the bitter lesson. Sutton is advocating general methods that scale with compute, as contrasted to hand-coding human domain knowledge, but that doesn't mean that we're ignorant of what those general methods are doing. One of the examples Sutton gives is computer chess, where minimax search with optimizations like α–β pruning prevailed over trying to explicitly encode what human grandmasters know about the game. But that seems fine. Writing a program that thinks about tactics the way humans do rather than letting tactical play emerge from searching the game tree would be a lot more work for less than no benefit.

A broadly similar moral could apply to using deep learning to approximate complicated functions between data distributions: we specify the training distribution, and the details of fitting it are delegated to a network architecture with the appropriate invariances: convolutional nets for processing image data, transformers for variable-length sequences. There's a whole literature—

Doomimir: The literature doesn't help if your civilization's authors aren't asking the questions we need answered in order to not die. What, specifically, am I supposed to learn from your world's literature? Give me an example.

Simplicia: I'm not sure what kind of example you're looking for. Just from common sense, it seems like the problem of aligning AI is going to involve intimate familiarity with the nitty-gritty empirical details of how AI works. Why would you expect to eyeball the problem from your armchair and declare the whole thing intractable on the basis of an analogy to biological evolution, which is just not the same thing as ML training?

Picking something arbitrarily ... well, I was reading about residual networks recently. Deeper neural networks were found to be harder to train because the gradient varied too quickly with respect to the input. Being the result of a many-fold function composition, the loss landscape in very deep networks becomes a mottled fractal of tiny mountains, rather than a smooth valley to descend. This is mitigated by introducing "residual" connections that skip some layers, creating shorter paths through the network which have less volatile gradients.

I don't understand how you can say that this isn't science or engineering. It's a comprehensible explanation for why one design of information-processing system works better than alternatives, grounded in observation and mathematical reasoning. There are dozens of things like that. What did you expect the science of artificial minds to look like, exactly?
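As a rough illustration of the "shorter paths" point Simplicia makes about residual connections, here is a toy scalar sketch. It is not the analysis from the residual-network literature: the one-unit `tanh` layers, the shared `weight`, and the function names are all assumptions chosen for brevity. It only shows how the skip path keeps the end-to-end derivative from collapsing, which is a simpler effect than the rapidly varying gradients she describes.

```python
import math


def plain_gradient(x, weight, depth):
    """Derivative w.r.t. x of a chain of `depth` layers h <- tanh(weight * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        h = math.tanh(weight * h)
        grad *= weight * (1.0 - h ** 2)  # chain rule through one layer
    return grad


def residual_gradient(x, weight, depth):
    """Same chain, but each layer adds its input back: h <- h + tanh(weight * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        t = math.tanh(weight * h)
        grad *= 1.0 + weight * (1.0 - t ** 2)  # the leading 1.0 is the skip path
        h = h + t
    return grad


if __name__ == "__main__":
    for depth in (5, 15, 30):
        print(depth, plain_gradient(0.3, 0.5, depth), residual_gradient(0.3, 0.5, depth))
```

With a positive weight below one, every factor in the plain chain is smaller than one, so the product shrinks geometrically with depth, while every factor in the residual chain is at least one.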

Doomimir: [incredulous] That's your example? Resnets?

Simplicia: ... sure?

Doomimir: By conservation of expected evidence, I take your failure to cite anything relevant as further confirmation of my views. I've never denied that you can write many dissertations about such tricks to make generic optimizers more efficient. The problem is that that knowledge brings us closer to being able to brute-force general intelligence, without teaching us about intelligence. What program are all those gradient updates building inside your network? How does it work?

Simplicia: [uncomfortably] People are working on that.

Doomimir: Too little, too late. The reason I often bring up human evolution is because that's our only example of an outer optimization loop producing an inner general intelligence, which sure looks like the path your civilization is going down. Yes, there are differences between gradient descent and natural selection, but I don't think the differences are relevant to the morals I draw.

As I was saying, the concept of fitness isn't represented anywhere in our motivations. That is, the outer optimization criterion that evolution selected for while creating us, bears no visible resemblance to the inner optimization criteria that we use when selecting our plans.

As optimizers get more powerful, anything that's not explicitly valued in the utility function won't survive edge instantiation. The connection between parental love and inclusive fitness has grown much weaker in the industrial environment than it was in the EEA, as more options have opened up for humans to prioritize their loved ones' well-being in ways that don't track allele frequencies. In a transhumanist utopia with mind uploading, it would break entirely as we migrated our minds away from the biological substrate: if some other data storage format suited us better, why would we bother keeping around the specific molecule of DNA, which no one had heard of before the 19th or 20th century?

Of course, we're not going to get a transhumanist utopia with mind uploading, because history will repeat itself: the outer loss function that mad scientists use to grow the first AGI will bear no resemblance to the inner goals of the resulting superintelligence.

Simplicia: You seem to have a basically ideological conviction that outer optimization can't be used to shape the behaviors of the inner optimizers it produces, such that you don't think that "We train for X and get X" is an allowable step in an alignment proposal. But this just seems flatly contradicted by experience. We train deep learning systems for incredibly specific tasks all the time, and it works fantastically well.

Intuitively, I want to say that it works much better than evolution: I don't imagine succeeding at selectively breeding an animal that speaks perfect English the way LLMs do. Relatedly, we can and do train LLMs from a blank slate, in contrast to how selective breeding only works with traits already present in the wild type; it's too slow to assemble adaptations from scratch.

But even selective breeding basically works. We successfully domesticate loyal dogs and meaty livestock. If we started breeding dogs for intelligence as well as being loyal and friendly to us, I'd expect them to still be approximately loyal and friendly as they started to surpass our intelligence, and to grant us equity in their hyperdog star empire. Not that that's necessarily a good idea—I'd rather pass the world on to another generation of humans than a new dominant species, even a friendly one. But your position doesn't seem to be, "Creating a new dominant species is a huge responsibility; we should take care to get the details right." Rather, you don't seem to think we can exert meaningful control over the outcome at all.

Before the intermission, I asked how your pessimism about aligning AGI using training data was consistent with deep learning basically working. My pet example was the result where mechanistic interpretability researchers were able to confirm that training on modular arithmetic problems resulted in the network in fact learning a modular addition algorithm. You said something about that being a fact of the training distribution, the test distribution, and the optimizer, which wouldn't work for friendly AI. Can you explain that?

Doomimir: [sighing] If I must. If you select the shortest program that does correct arithmetic mod p for inputs up to a googol, my guess is that it would work for inputs over a googol as well, even though there is a vast space of possible programs that are correct on inputs less than a googol and incorrect on larger inputs. That's a sense in which I'll affirm that training data can "shape behavior", as you put it.

But that's a specific claim about what happens with the training distribution "mod arithmetic with inputs less than a googol", the test distribution "mod arithmetic with inputs over a googol", and the optimizer "go through all programs in order until you find one that fits the training distribution." It's not a generic claim that the inner optimizers found by outer optimizers will want what some humans who assembled the training set optimistically imagined they would want.
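The "go through all programs in order" optimizer can be sketched on a drastically shrunken scale. Everything below is a hypothetical stand-in: the hand-written `CANDIDATES` list plays the role of program space, source length plays the role of program order, and the training and test ranges are tiny substitutes for "below a googol" and "above a googol".

```python
# Hypothetical miniature "program space": Python expressions in a, b and p.
CANDIDATES = [
    "a", "b", "a+b", "a*b", "(a+b)%p", "(a*b)%p",
    "(a+b)%p if a<50 else 0",  # correct on small inputs only
]


def run(program, a, b, p):
    """Evaluate one candidate expression with no builtins available."""
    return eval(program, {"__builtins__": {}}, {"a": a, "b": b, "p": p})


def fits(program, examples, p):
    """True if the program reproduces every (a, b, target) example."""
    return all(run(program, a, b, p) == target for a, b, target in examples)


def shortest_fitting_program(candidates, examples, p):
    """Return the first candidate, in order of source length, that fits."""
    for program in sorted(candidates, key=len):
        if fits(program, examples, p):
            return program
    return None


P = 7
TRAIN = [(a, b, (a + b) % P) for a in range(50) for b in range(50)]
TEST = [(a, b, (a + b) % P) for a in range(1000, 1010) for b in range(1000, 1010)]

if __name__ == "__main__":
    found = shortest_fitting_program(CANDIDATES, TRAIN, P)
    print(found, fits(found, TEST, P))
```

In this toy setup the longer conditional program also fits the training examples but fails on the larger inputs. The search never reaches it because a shorter fitting program comes first, which is the narrow sense in which the choice of optimizer, and not the training data alone, determines how the result generalizes.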

In the case of human evolution—again, our only example of outer optimization producing general intelligence—we know as a historical fact that the first program found by the optimizer "greedy local search of mutations and recombinations" for the training task "optimize inclusive genetic fitness in the environment of evolutionary adaptedness" did not generalize to optimizing inclusive genetic fitness in the test distribution of the modern world. Likewise, your claim that selective breeding allegedly "basically works" is problematized by all the times when it doesn't work—like when selecting for small subpopulation sizes in insects results in cannibalism of larvæ rather than restricted breeding, or when selecting chickens that lay the most eggs in a coop gets you more aggressive chickens who make their neighbors less productive.

Simplicia: [nodding] Uh-huh. With you so far.

Doomimir: I don't believe you. If you were really with me so far, you would have noticed that I just disproved the naïve mirroring expectation that outer optimizers training on a reward result in inner optimizers pursuing that reward.

Simplicia: Yeah, that sounds like a really dumb idea. If you ever meet someone who believes that, I hope you manage to talk them out of it.

Doomimir: [frustrated] If you're not implicitly assuming the naïve mirroring expectation—whether you realize it or not—then I don't understand why you think "We train for X and get X" is an allowable step in an alignment proposal.

Simplicia: It depends on the value of X—and the value of "train". As you say, there are facts of the matter as to which outer optimizers and training distributions produce which inner optimizers, and how those inner optimizers generalize to different test environments. As you say, the facts aren't swayed by wishful thinking: someone who reasoned, "I pressed the reward button when my AI did good things, therefore it will learn to be good," will be disappointed if it turns out that the system generalizes to value reward-button pushes themselves—what you would call an outer alignment failure—or any number of possible training correlates of reward—what you would call an inner alignment failure.

Doomimir: [patronizingly] With you so far. And why doesn't this instantly sink "We train for X and get X" as an allowable step in an alignment proposal?

Simplicia: Because I think it's possible to make predictions about how inner optimizers will behave and to choose training setups accordingly. I don't have a complete theory of exactly how this works, but I think the complete theory is going to be more nuanced than, "Either training converts the outer loss function into an inner utility function, in which case it kills you, or there's no way to tell what it will do, in which case it also kills you," and that we can glimpse the outlines of the more nuanced theory by carefully examining the details of the examples we've discussed.

In the case of evolution, you can view fitness as being defined as "that which got selected for". One could argue that farmers practicing artificial selection aren't "really" breeding cows for milk production: rather, the cows are being bred for fitness! If we apply the same standards to Nature as we do to the farmer, then rather than saying humans were optimized solely for inclusive genetic fitness, we would say they were optimized to mate, hunt, gather, acquire allies, avoid disease, &c. Construed that way, the relationship between the outer training task and the inner policy's motivations looks a lot more like "We train for X and get X" than you're giving it credit for.

That said, it is true that the solutions found by evolution can be surprising to a selective breeder who hasn't thought carefully about what selection pressures they're applying, as in your examples of artificial selection failures: the simplest change to an insect that draws on existing variation to respond to selection pressure for smaller subpopulations might be to promote cannibalism; the simplest change to a chicken to lay more eggs than neighboring chickens might be to become a bully.

Doomimir: Is this a troll where you concede all of my points and then put on a performance of pretending to somehow disagree? That's what I've been trying to teach you: the solutions found by outer optimization can be surprising—

Simplicia: —to a designer that hasn't thought carefully about what optimization pressures they're applying. Responsible use of outer optimization—

[Doomimir guffaws]

Simplicia: —doesn't seem like an intractable engineering problem, and the case for deep learning looks a lot more favorable than for evolution. The seemingly tenuous connection between the concept of inclusive fitness and humanity's "thousand shards of desire" can be seen as a manifestation of sparse rewards: if the outer optimizer only measures allele frequencies and is otherwise silent on the matter of which alleles are good, then the simplest solution—with respect to natural selection's implied simplicity prior—is going to depend on a lot of contingencies of the EEA, which would be surprising if you expected to get a pure DNA-copy maximizer.

In contrast, when we build AI systems, we can make the outer optimizer supply as much supervision as we like, and dense supervision tightly constrains the solutions that are found. In terms of the analogy, it's easy to micromanage the finest details of the "EEA". We're not limited to searching for a program that succeeds at some simple goal and accepting whatever weird drives happened to be the easiest way to accomplish that; we're searching for a program that approximates the billions of expected input–output pairs we trained it on.

It's believed that the reason neural nets generalize at all is because the parameter–function map is biased towards simple functions: to a first approximation, training is equivalent to doing a Bayesian update on the observation that a net with randomly initialized weights happens to fit the training data.

In the case of large language models, it seems like a reasonable guess that the simplest function that predicts the next token of webtext, really is just a next token predictor. Not a next-token predicting consequentialist which will wirehead with easily-predicted tokens, but a predictor of the webtext training distribution. The distribution-specificity that you consider an inner alignment failure in the case of human evolution is a feature, not a bug: we trained for X and got X.

Doomimir: And then immediately subjected it to reinforcement learning.

Simplicia: As it happens, I also don't think RLHF is as damning as you do. Early theoretical discussions of AI alignment would sometimes talk about what would go wrong if you tried to align AI with a "reward button." Those discussions were philosophically valuable. Indeed, if you had a hypercomputer and your AI design method was to run a brute-force search for the simplest program that resulted in the most reward-button pushes, that would predictably not end well. While a weak agent selected on that basis might behave how you wanted, a stronger agent would find creative ways to trick or brainwash you into pushing the button, or just seize the button itself. If we had a hypercomputer in real life and were literally brute-forcing AI that way, I would be terrified.

But again, this isn't a philosophy problem anymore. Fifteen years later, our state-of-the-art methods do have a brute-force aspect to them, but the details are different, and the details matter. Real-world RLHF setups aren't an unconstrained hypercomputer search for whatever makes humans hit the thumbs-up button. It's reinforcing the state–action trajectories that got reward in the past, often with a constraint on the Kullback–Leibler divergence from the base policy, which blows up on outputs that would be vanishingly unlikely from the base policy.
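To make the Kullback–Leibler constraint concrete, here is a minimal sketch over a three-outcome toy distribution. The distributions, the penalty weight `beta`, and the function names are illustrative assumptions, not any particular RLHF implementation. Real setups work with per-token probabilities from a language model, and the sketch also assumes the base policy gives every output nonzero probability.

```python
import math


def kl_divergence(policy, base):
    """KL(policy || base) in nats over a shared finite set of outputs."""
    return sum(q * math.log(q / p) for q, p in zip(policy, base) if q > 0.0)


def penalized_objective(expected_reward, policy, base, beta):
    """Expected reward minus a KL penalty for drifting from the base policy."""
    return expected_reward - beta * kl_divergence(policy, base)


# The third output is vanishingly unlikely under the base policy.
BASE = [0.5, 0.5 - 1e-12, 1e-12]
NUDGED = [0.6, 0.4 - 1e-12, 1e-12]   # mild re-weighting of likely outputs
EXTREME = [0.1, 0.1, 0.8]            # piles mass onto the rare output

if __name__ == "__main__":
    print(kl_divergence(NUDGED, BASE))
    print(kl_divergence(EXTREME, BASE))
```

Re-weighting outputs the base policy already finds likely costs very little, while moving most of the probability mass onto the rare output incurs a penalty hundreds of times larger, so a higher raw reward there can still lose after the penalty is applied.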

If most of the bits of search are coming from pretraining, which solves problems by means of copying the cognitive steps that humans would use, then using a little bit of reinforcement learning for steering doesn't seem dangerous in the way that it would be dangerous if the core capabilities fell directly out of RL.

It seems to be working pretty well? It just doesn't seem that implausible that the result of searching for the simplest program that approximates the distribution of natural language in the real world, and then optimizing that to give the responses of a helpful, honest, and harmless assistant is, well ... a helpful, honest, and harmless assistant?

Doomimir: Of course it seems to be working pretty well! It's been optimized for seeming-good-to-you!

Simplicia, I was willing to give this a shot, but I truly despair of leading you over this pons asinorum. You can articulate what goes wrong with the simplest toy illustrations, but keep refusing to see how the real-world systems you laud suffer from the same fundamental failure modes in a systematically less visible way. From evolution's perspective, humans in the EEA would have looked like they were doing a good job of optimizing inclusive fitness.

Simplicia: Would it, though? I think aliens looking at humans in the environment of evolutionary adaptedness and asking how the humans would behave when they attained to technology would have been able to predict that civilized humans would care about sex and sugar and fun rather than allele frequencies. That's a factual question that doesn't seem too hard to get right.

Doomimir: Sane aliens would. Unlike you, they'd also be able to predict that RLHF'd language models would care about <untranslatable-1>, <untranslatable-2>, and <untranslatable-3>, rather than being helpful, harmless, and honest.

Simplicia: I understand that it's possible for things to superficially look good in a brittle way. We see this with adversarial examples in image classification: classifiers that perform well on natural images can give nonsense answers on images constructed to fool them, which is worrying, because it indicates that the machines aren't really seeing the same images we are. That sounds like the sort of risk story you're worried about: that a full-fledged AGI might seem to be aligned in the narrow circumstances you trained it on, while it was actually pursuing alien goals all along.

But in that same case of the image classification, we can see progress being made. When you try to construct adversarial examples for classifiers that have been robustified with adversarial training, you get examples that affect human perception, too. When you use generative models for classification rather than just training a traditional classifier, they exhibit human-like shape bias and out-of-distribution performance. You can try perturbing the network's internal states rather than the inputs to try to defend against unforeseen failure modes ...

I imagine you're not impressed by any of this, but why not? Why isn't incremental progress at instilling human-like behavior into machines, incremental progress on AGI alignment?

Doomimir: Think about it information-theoretically. If survivable futures require specifying 100 bits into the singleton's goals, then you're going to need precision targeting to hit that trillion trillion trillionth's part of the space. The empirical ML work you're so impressed with isn't on a path to get us that kind of precision targeting. I don't dispute that with a lot of effort, you can pound the inscrutable matrices into taking on more overtly human-like behavior, which might or might not buy you a few bits.

It doesn't matter. It's like trying to recover Shakespeare's lost folios by training a Markov generator on the existing texts. Yes, it has a vastly better probability of success than a random program. That probability is still almost zero.
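The arithmetic behind the information-theoretic framing can be checked directly. This is a small sketch with hypothetical helper names. It treats "specifying n bits" as hitting a uniformly random target occupying a 2^-n fraction of the space, which is a simplifying assumption about what Doomimir means.

```python
import math


def target_fraction(bits):
    """Fraction of a uniform space singled out by specifying `bits` bits."""
    return 2.0 ** -bits


def bits_for_fraction(fraction):
    """Number of bits needed to single out a given fraction of the space."""
    return -math.log2(fraction)


if __name__ == "__main__":
    print(target_fraction(100))        # size of a 100-bit target
    print(bits_for_fraction(1e-36))    # bits implied by a trillion cubed
    print(target_fraction(100 - 5))    # target size after gaining 5 bits
```

Under that reading, 100 bits corresponds to roughly 8 parts in 10^31, while a literal "trillion trillion trillionth" (10^-36) would be closer to 120 bits, so the spoken figure is best taken as a rhetorical order of magnitude. Gaining a handful of bits enlarges the target by a factor of two per bit, which is small against either number.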

Simplicia: Hm, perhaps a crux between us is how narrow of a target is needed to realize how much of the future's value. I affirm the orthogonality thesis, but it still seems plausible to me that the problem we face is more forgiving, not so all-or-nothing as you portray it. If you can reconstruct a plausible approximation of the lost folios, how much does it matter that you didn't get it exactly right? I'm interested to discuss further—

Doomimir: I'm not. Your mother named you well. I see no profit in laboring to educate the ineducable.

Simplicia: But if the world is ending either way?

Doomimir: I suppose it's a way to pass the time.

Simplicia: [to the audience] Until next time!

Simplicia: I understand that it's possible for things to superficially look good in a brittle way. We see this with adversarial examples in image classification: classifiers that perform well on natural images can give nonsense answers on images constructed to fool them, which is worrying, because it indicates that the machines aren't really seeing the same images we are. That sounds like the sort of risk story you're worried about: that a full-fledged AGI might seem to be aligned in the narrow circumstances you trained it on, while it was actually pursuing alien goals all along.
But in that same case of the image classification, we can see progress being made. When you try to construct adversarial examples for classifiers that have been robustified with adversarial training, you get examples that affect human perception, too. When you use generative models for classification rather than just training a traditional classifier, they exhibit human-like shape bias and out-of-distribution performance. You can try perturbing the network's internal states rather than the inputs to try to defend against unforeseen failure modes ...
I imagine you're not impressed by any of this, but why not? Why isn't incremental progress at instilling human-like behavior into machines, incremental progress on AGI alignment?

The problem with this dialogue (or maybe it's not a problem at all; after all, Zack isn't advertising this as The One True Alignment Debate to End All Alignment Debates; it's just a fictional dialogue) is that Simplicia is arguing that the real-world existence of incremental progress at instilling human-like behavior into machines should make Doomimir feel more optimistic, but it doesn't appear that she gets what the argument is for why that should make him less doomy. Or, more concretely, she is using arguments that veer away from the most salient part and the upstream generator of disagreements between doomy and less-doomy alignment researchers (one could uncharitably say she is being somewhat... simple-minded here). To use a Said Achmiz phrase, the following section lacks a certain concentrated force which this topic greatly deserves:

Simplicia: As it happens, I also don't think RLHF is as damning as you do. Early theoretical discussions of AI alignment would sometimes talk about what would go wrong if you tried to align AI with a "reward button." Those discussions were philosophically valuable. Indeed, if you had a hypercomputer and your AI design method was to run a brute-force search for the simplest program that resulted in the most reward-button pushes, that would predictably not end well. While a weak agent selected on that basis might behave how you wanted, a stronger agent would find creative ways to trick or brainwash you into pushing the button, or just seize the button itself. If we had a hypercomputer in real life and were literally brute-forcing AI that way, I would be terrified.
But again, this isn't a philosophy problem anymore. Fifteen years later, our state-of-the-art methods do have a brute-force aspect to them, but the details are different, and the details matter. Real-world RLHF setups aren't an unconstrained hypercomputer search for whatever makes humans hit the thumbs-up button. It's reinforcing the state–action trajectories that got reward in the past, often with a constraint on the Kullback–Leibler divergence from the base policy, which blows up on outputs that would be vanishingly unlikely from the base policy.
If most of the bits of search are coming from pretraining, which solves problems by means of copying the cognitive steps that humans would use, then using a little bit of reinforcement learning for steering doesn't seem dangerous in the way that it would be dangerous if the core capabilities fell directly out of RL.
It seems to be working pretty well? It just doesn't seem that implausible that the result of searching for the simplest program that approximates the distribution of natural language in the real world, and then optimizing that to give the responses of a helpful, honest, and harmless assistant is, well ... a helpful, honest, and harmless assistant?
Doomimir: Of course it seems to be working pretty well! It's been optimized for seeming-good-to-you!
Simplicia, I was willing to give this a shot, but I truly despair of leading you over this pons asinorum. You can articulate what goes wrong with the simplest toy illustrations, but keep refusing to see how the real-world systems you laud suffer from the same fundamental failure modes in a systematically less visible way. From evolution's perspective, humans in the EEA would have looked like they were doing a good job of optimizing inclusive fitness.

What Simplicia should have focused on at this very point in the conversation (maybe she will in a future dialogue?) is what the empirical information coming from recent SOTA models (their modularity, their lack of agentic activity, the moderate effectiveness of RLHF, etc.) should imply about how accurately we should expect the very specific theory of optimization and artificial cognition that the doomy worldview is based around to track the territory:

Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all"), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence) of even an "alien mind" that's sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility [over future world states] according to its world model to pursue a goal that can be extremely different from what humans deem good.

I think the particular framing I gave this issue in my previous (non-block quote) paragraph reveals my personal conclusion on that topic, but I expect significant disagreement and pushback to appear here. "And All the Shoggoths Merely Players" did not key in on this consideration either, certainly not with the degree of specificity and force that I would have desired.

I hope this shows up at some point in future sequels, Zack, and I of course hope that such sequels are on their way. The dialogues have certainly been fun to read thus far.

And conversely, Doomimir does a lot of laughing, crowing, and becoming enraged, but doesn't do things like point out the importance of situational awareness for RL being dangerous, either before RLHF being brought up (which would be a more moderate thing) or only as a response to Simplicia bringing up RLHF (which probably is more in-character).

Simplicia: The thing is, I basically do buy realism about rationality, and realism having implications for future powerful AI—in the limit. The completeness axiom still looks reasonable to me; in the long run, I expect superintelligent agents to get what they want, and anything that they don't want to get destroyed as a side-effect. To the extent that I've been arguing that empirical developments in AI should make us rethink alignment, it's not so much that I'm doubting the classical long-run story, but rather pointing out that the long run is "far away"—in subjective time, if not necessarily sidereal time. If you can get AI that does a lot of useful cognitive work before you get the superintelligence whose utility function has to be exactly right, that has implications for what we should be doing and what kind of superintelligence we're likely to end up with.

I find it ironic that Simplicia's position in this comment is not too far from my own, and yet my reaction to it was "AIIIIIIIIIIEEEEEEEEEE!". The shrieking is about everyone who thinks about alignment having illegible models from the perspective of almost everyone else, of which this thread is an example.

Curated. I valued reading this dialogue and the two preceding ones in the series. Fictional dialogues have had periods of popularity, with varying quality, and I feel like Zack is producing the kind that justifies the genre: an educational contrasting of the sides of an argument in a way you don't get from an author just arguing their side in prose, together with the coherence that comes from a single author (rather than the unintentional talking past each other of real dialogues). The dialogues are amusingly written, and while the portrayal of Doomimir feels a bit unfair to those advocating that brand of doomy worldview (myself included), it feels helpful there too for understanding how the arguments and statements come across. And I appreciate Zack's unusually high level of scholarship (i.e. linking to sources). Overall thanks, this was good, I feel like it's helped me better understand positions and intuitions other than my own.

Why isn’t incremental progress at instilling human-like behavior into machines, incremental progress on AGI alignment?

It kind of is, but unfortunately treating others badly when you have lots of power is also part of human nature. And there's no real limit to how bad it could get; see the Belgian Congo, for example.

(Thanks to John Wentworth for playing Doomimir in a performance of this at Less Online yesterday.)

Simplicia: Hm, perhaps a crux between us is how narrow of a target is needed to realize how much of the future's value. I affirm the orthogonality thesis, but it still seems plausible to me that the problem we face is more forgiving, not so all-or-nothing as you portray it.

I agree that it's plausible. I even think a strong form of moral realism (denial of orthogonality thesis) is plausible. My objection is that humanity should figure out what is actually the case first (or have some other reasonable plan of dealing with this uncertainty), instead of playing logical Russian roulette like it seems to be doing. I like that Simplicia isn't being overconfident here, but is his position actually that "seems plausible to me that the problem we face is more forgiving" is sufficient basis for moving forward with building AGI? (Does any real person in the AI risk debate have a position like this?)

I agree that it's plausible. I even think a strong form of moral realism (denial of orthogonality thesis) is plausible.

Those are good and highly salient points (with the added comment that I would not go as far as to say possibility #1 in your post is "plausible", as that seems to suggest a rather high subjective probability of it being true compared to what the mere "possible" does; Roko's old post and the associated comment thread are highly relevant here). Nevertheless, I think the situation is even trickier and more confusing than you have illustrated here. Quoting Charlie Steiner's excellent Reducing Goodhart sequence:

Humans don't have our values written in Fortran on the inside of our skulls, we're collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It's not that there's some pre-theoretic set of True Values hidden inside people and we're merely having trouble getting to them - no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like "which atoms exactly count as part of the person" and "what do you do if the person says different things at different times?"
The natural framing of Goodhart's law - in both mathematics and casual language - makes the assumption that there's some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty gritty of how to model humans and infer their values.

Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters (I am not referring to you in particular here, since you have already signaled an appropriate level of confusion about this). Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.

What counts as human "preferences"? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don't seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?

How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?

In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren't preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for "what the human actually wants" to be well-defined?

On the topic of agency, what exactly does that refer to in the real world? Do we not "first need a clean intuitively-correct mathematical operationalization of what "powerful agent" even means"? Are humans even agents, and if not, what exactly are we supposed to get out of approaches that are ultimately all about agency? How do we actually get from atoms to agents? (note that the posts in that eponymous sequence do not even come close to answering this question) More specifically, is a real-world being actually the same as the abstract computation its mind embodies? Rejections of souls and dualism, alongside arguments for physicalism, do not prove the computationalist thesis to be correct, as physicalism-without-computationalism is not only possible but also (as the very name implies) a priori far more faithful to the standard physicalist worldview.

What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like "CEV" probably doesn't make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?). We can think about a continuous personal identity through the lens of mutual information about memories, personalities etc, but our current understanding of these topics is vastly incomplete and inadequate, and in any case the naive (yet very widespread, even on LW) interpretation of "the utility function is not up for grabs" as meaning that terminal values cannot be changed (or even make sense as a coherent concept) seems totally wrong.

The way communities make progress on philosophical matters is by assuming that certain answers are correct and then moving on. After all, you can't ever get to the higher levels that require a solid foundation if you aren't allowed to build such a foundation in the first place. But I worry, for reasons that have been stated before, that the vast majority of the discourse by "lay lesswrongers" (and, frankly, even far more experienced members of the community working directly on alignment research; as a sample illustration, see a foundational report's failure to internalize the lesson of "Reward is not the optimization target") is based on conclusions reached through informal and non-rigorous intuitions that lack the feedback loops necessary to ground themselves to reality because they do not do enough "homework problems" to dispel misconceptions and lingering confusions about complex and counterintuitive matters.

I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you have been signaling throughout your recent posts and comments, it doesn't seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.

Perhaps if a group of really smart philosophy-inclined people who have internalized the lessons of the Sequences without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like and which seem to be contradicted by the modularity, lack of agentic activity, moderate effectiveness of RLHF etc (overall just the empirical information) coming from recent SOTA models were to be given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection, something interesting would come out. But that is quite a long stretch at this point.

In my view, the main good outcomes of the AI transition are 1) we luck out, AI x-safety is actually pretty easy across all the subproblems 2) there's an AI pause, humans get smarter via things like embryo selection, then solve all the safety problems.

I'm mainly pushing for #2, but also don't want to accidentally make #1 less likely. It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, and it turns out the problems could have been solved with a little more effort.

"it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us" is a bit worrying from this perspective, and also because my "effort spent on them" isn't that great. As I don't have a good approach to answering these questions, I mainly just have them in the back of my mind while my conscious effort is mostly on other things.

BTW I'm curious what your background is and how you got interested/involved in AI x-safety. It seems rare for newcomers to the space (like you seem to be) to quickly catch up on all the ideas that have been developed on LW over the years, and many recently drawn to AGI instead appear to get stuck on positions/arguments from decades ago. For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism. Do you have any insights about this? (How were you able to do this? How to help others catch up? Intelligence is probably a big factor which is why I'm hoping that humanity will automatically handle these problems better once it gets smarter, but many seem plenty smart and still stuck on primitive ideas about AI x-safety.)

It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, and it turns out the problems could have been solved with a little more effort.

This isn't unreasonable, but in order for this to be a meaningful concern, it's possible that you would need to be close to enough people working on this topic to the point where you misleading them would have a nontrivial impact. And I guess this... just doesn't seem to be the case (at least to an outsider like me)? Even otherwise interested and intelligent people are focusing on other stuff, and while I suppose there may be some philosophy PhDs at OpenPhil or the Future of Humanity Institute (RIP) who are thinking about such matters, they seem few and far between.

That is to say, it sure would be nice if we got to a point where the main concern was "Wei Dai is unintentionally misleading people working on this issue" instead of "there are just too few people working on this to produce useful results even absent any misleadingness".

My path to getting here is essentially the following:

reading Ezra Klein and Matt Yglesias because they seem saner and more policy-focused than other journalists $\to$ Yglesias writes an interesting blog post in defense of the Slate Star Codex, which I had heard of before but had never really paid much attention to $\to$ I start reading the SSC out of curiosity, and I am very impressed by how almost every post is interesting, thoughtful, and gives me insights which I had never considered but which seemed obvious in retrospect $\to$ Scott occasionally mentions this "rationality community" kinda-sorta centered around LW $\to$ I start reading LW in earnest, and I enjoy the high quality of the posts (and especially of the discussions in the comments), but I mostly avoid the AI risk stuff because it seems scary and weird; I also read HPMOR, which I find to be very fun but not necessarily well-written $\to$ I mess around with the beta version of ChatGPT around September 2022 and I am shocked by how advanced and coherent the LLM seems $\to$ I realize the AI stuff is really important and I need to get over myself and actually take it seriously

If I had to say what allowed me to reach this point, I would say the following properties, listed in no particular order, were critical (actually writing this list out feels kinda self-masturbatory, but oh well):

1) I was non-conformist enough to not immediately bounce off Scott's writing or off of LW (which contains some really strange, atypical stuff)
2) I loved mathematics so much that I wasn't thrown off by everything on this site (and in fact embraced the applications of mathematical ideas in everything)
3) I cared about philosophy, so I didn't shy away from meta discussions and epistemology and getting really into the weeds of confusing topics like personal identity, agency, values and preferences etc
4) I had enough self-awareness to not become a parody of myself and to figure out what the important topics were instead of circlejerking over rationality or getting into rational fiction or other stuff like that
5) I was sufficiently non-mindkilled that I was able to remain open to fundamental shifts in understanding and to change my opinion in crucial ways
6) My sanity was resilient enough that I didn't just run away or veer off in crazy directions when confronted with the frankly really scary AI stuff
7) I was intelligent enough to understand (at least to some extent) what is going on and to think critically about these matters
8) I was obsessive and focused enough to read everything available on LW (and on other related sites) about AI without getting bored or tired over time.

The problem is that (at least from my perspective) all of these qualities were necessary in order for me to follow the path I did. I believe that if any single one of them had been removed while the other 7 remained, I would not have ended up caring about AI safety. As a sample illustration of what can go wrong when points 1 and 6 aren't sufficiently satisfied but everything else is, you can check out what happened with Qiaochu (also, you probably knew him personally?, so there's that too).

As you can imagine, this strongly limits the supply of people that can think sanely and meaningfully about AI safety topics. The need to satisfy points 1, 5, and 7 above already lowers the population pool tremendously, and there are still all the other requirements to get through.

For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism.

Man, these guys... I get the impression that they are mindkilled in a very literal political sense. They seem to desperately await the arrival of the glorious Singularity that will free them from the oppression and horrors of Modernity and Capitalism. Of course, the fact that they are living better material lives than 99.9% of humans that have ever existed doesn't seem to register, and I guess you can call this the epitome of entitlement and privilege.

But I don't think most of them are bad people (in so far as it even makes sense to call a person bad). They just live in a very secure epistemic bubble that filters everything they read and think about and which prevents them from ever touching reality. I've written similar stuff about this before:

society has always done, is doing, and likely will always do a tremendous job of getting us to self-sort into groups of people that are very similar to us in terms of culture, social and aesthetic preferences, and political leanings. This process has only been further augmented by the rise of social media and the entrenchment of large and self-sustaining information bubbles. In broad terms, people do not like to talk about the downsides of their proposed policies or general beliefs, and even more importantly, they do not communicate these downsides to other members of the bubble. Combined with the present reality of an ever-growing proportion of the population that relies almost entirely on the statements and attitudes of high-status members of the in-group as indicators of how to react to any piece of news, this leads to a rather remarkable equilibrium, in which otherwise sane individuals genuinely believe that the policies and goals they propose have 100% upside and 0% downside, and the only reason they don't get implemented in the real world is because of wicked and stupid people on the other side who are evil or dumb enough to support policies that have 100% downside and 0% upside. Trade-offs are an inevitable consequence of any discussion about meaningful changes to our existing system; simple Pareto improvements are extremely rare. However, widespread knowledge or admission of the existence of trade-offs does not just appear out of nowhere; in order for this reality to be acknowledged, it must be the case that people are exposed to counterarguments (or at the very least calls for caution) to the most extreme versions of in-group beliefs by trusted members of the in-group, because everyone else will be ignored or dismissed as bad-faith supporters of the opposition. Due to the dynamic mentioned earlier, this happens less and less, and beliefs get reinforced into becoming more and more extreme.
Human beings thus end up with genuine (although self-serving and biased) convictions and beliefs that a neutral observer could nonetheless readily identify as irrational or nonsensical. In the past, there used to be a moderating effect due to the much more shared nature of pop culture and group identity: if you were already predisposed to adopt extreme views, it was unlikely for you to find other similarly situated people in your neighborhood or coalition, as most groups you could belong to were far more mainstream and thus moderate. But now the Internet allows you to turn all that on its head with just a few mouse clicks; after all, no matter what intuition you may have about any slightly popular topic, there is very likely some community out there ready to take you in and tell you how smart and brave you are for thinking the right thoughts and not being one of the crazy, bad people who disagree.

The sanity waterline is very low and only getting lower. As lc has said, "the vast majority of people alive today are the effective mental subjects of some religion, political party, national identity, or combination of the three". I would have hoped that CFAR was trying to solve that, but that apparently was not close to being true even though it was repeatedly advertised as aiming to "help people develop the abilities that let them meaningfully assist with the world’s most important problems, by improving their ability to arrive at accurate beliefs, act effectively in the real world, and sustainably care about that world" by "widen[ing] the bottleneck on thinking better and doing more." I guess the actual point of CFAR (there was a super long twitter thread by Qiaochu on this at some point) was to give the appearance of being about rationality while the underlying goal was to nerd-snipe young math-inclined students to go work on mathematical alignment at MIRI? Anyway, I'm slightly veering off-topic.

How to help others catch up?

I don't have a good answer to this question. Due to the considerations mentioned earlier, the most effective short-term way to get people who could contribute anything useful into AI safety is through selective rather than corrective or structural means, but there are just too few people who fit the requirements for this to scale nicely.

Over the long term, you can try to reverse the trends on general societal inadequacy and sanity, but this seems really hard, should have been done 20 years ago, and in any case requires actual decades before you can get meaningful outputs.

I'll think about this some more and I'll let you know if I have anything else to say.

Thanks for your insightful answers. You may want to make a top-level post on this topic to get more visibility. If only a very small fraction of the world is likely to ever understand and take into account many important ideas/considerations about AI x-safety, that changes the strategic picture considerably, and people around here may not be sufficiently "pricing it in". I think I'm still in the process of updating on this myself.

Having more intelligence seems to directly or indirectly improve at least half of the items on your list. So doing an AI pause and waiting for (or encouraging) humans to become smarter still seems like the best strategy. Any thoughts on this?

And I guess this… just doesn’t seem to be the case (at least to an outsider like me)?

I may be too sensitive about unintentionally causing harm, after observing many others do this. I was also just responding to what you said earlier, where it seemed like I was maybe causing you personally to be too pessimistic about contributing to solving the problems.

you probably knew him personally?

No, I never met him and didn't interact online much. He does seem like a good example of what you're talking about.

Could that just shift the problem a bit? If we get a class of really smart people they can subjugate everyone else pretty easily too -- perhaps even better than some AGI, as they start with a really good understanding of human nature, cultures, and failings, and of how to exploit them for their own purposes. Or they could simply be better suited to taking advantage of and surviving with a more dangerous AI on the loose. We end up in some hybrid world where humanity is not extinct but most people's lives are pretty poor.

I suppose one might say that the speed and magnitude of the advances here might be such that we get to corrigible AI before we get incorrigible super humans.

I'm curious about your thoughts.

Quick caveat: I'm not trying to say all futures are bleak and no efforts lead where we want. I'm actually pretty positive about our future, even with AI (perhaps naively). We clearly already live in a world where the most intelligent could be said to "rule" but the rest of us average Joes are not slaves or serfs everywhere. Where the problems exist is more where we have cultural and legal failings rather than just outright subjugation by the brighter bulbs. But going back to the darker side here, the ones that tend to successfully exploit, game, or ignore the rules are the smarter ones in the room.

If governments subsidize embryo selection, we should get a general uplift of everyone's IQ (or everyone who decides to participate) so the resulting social dynamics shouldn't be too different from today's. Repeat that for a few generations, then build AGI (or debate/decide what else to do next). That's the best scenario I can think of (aside from the "we luck out" ones).

Simplicia: I don't really think of "humanity" as an agent that can make a collective decision to stop working on AI. As I mentioned earlier, it's possible that the world's power players could be convinced to arrange a pause. That might be a good idea! But not being a power player myself, I tend to think of the possibility as an exogenous event, subject to the whims of others who hold the levers of coordination. In contrast, if alignment is like other science and engineering problems where incremental progress is possible, then the increments don't need to be coordinated.

I like that Simplicia isn’t being overconfident here, but is his position

Note that the correct pronoun here should be “her”, because Simplicia is using the feminine form of the patronymic. (I wouldn’t normally make this correction, but in this case I believe that some readers may not be familiar enough with Slavic naming conventions to pick up this textual cue.)

(Self-review.) I started this series to explore my doubts about the "orthodox" case for alignment pessimism. I wrote it as a dialogue and gave my relative non-pessimist character the designated idiot character name to make it clear that I'm just exploring ideas and not staking my reputation on "heresy". ("Maybe alignment isn't that hard" doesn't sound like a smart person's position—and in fact definitely isn't a smart person's for sufficiently ambitious conceptions of what it would mean to "solve the alignment problem." Simplicia isn't saying, "Oh, yeah, we're totally on track to solve philosophy forever in machine-codable form suitable for specifying the values of the superintelligence at the end of time". As will be explored in part four—forthcoming March 2026—perhaps the disagreement is really about whether some less ambitious alignment target might salvage some cosmic value.)

That said, more than the other entries in this series, this is the one where I'm willing to cop to and put my weight down on Simplicia representing my own views, rather than laundering my doubts as just asking questions.

I understand and agree that there's a useful analogy between stochastic gradient descent and natural selection, and between future AGI misalignment and humans valuing sex and sweets rather than fitness. To someone who's never thought about these topics at all, dwelling on the analogy at length is indeed a good use of time. But it's frustrating how much MIRI's recent messaging just makes the analogy and then stops there, without considering the huge important disanalogies, like how (as Paul Christiano pointed out in 2022) selective breeding kind-of works and is a better analogical fit to AI (there wasn't an Evolution Fairy that was trying to make fitness-maximizers; an alien agency trying to selectively breed humans from the EEA would have been able to test hypotheses about how smarter humans would generalize, rather than being taken by surprise by modernity the way an Evolution Fairy would have been), and that deep learning is better thought of as program synthesis rather than evolving a little animal.

Maybe that's strategically instrumentally rational insofar as MIRI is a propaganda outlet now (in the literal meaning of the word, "public communication aimed at influencing an audience and furthering an agenda") and doesn't seem to care that much about being intellectually credible in ways that don't cash out as policy influence? (It looks like Redwood Research may have picked up the torch.) But it's disappointing.

The reason I often bring up human evolution is because that's our only example of an outer optimization loop producing an inner general intelligence

There's also human baby brains training minds from something close to random initialisation at birth into a general intelligence. That example is plausibly a lot closer to how we might expect AGI training to go, because human brains are neural nets too and presumably have strictly-singular flavoured learning dynamics just like our artificial neural networks do. Whereas evolution acts on genes, which to my knowledge don't have neat NN-style loss landscapes heavily biased towards simplicity.

Evolution is more like if people used classic genetic optimisation to blindly find neural network architectures, optimisers, training losses, and initialisation schemes, that are in turn evaluated by actually training the networks.
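To make the nesting concrete, here is a minimal toy sketch, not a claim about how evolution or any real training pipeline works. The outer loop is a blind mutate-and-select search over a single hyperparameter (the learning rate), and each candidate can only be scored by actually running the inner gradient-descent training. The function names `train` and `evolve`, the quadratic loss, and all constants are assumptions made up for illustration.

```python
import random


def train(lr, steps=50):
    """Inner loop: gradient descent on the toy loss f(w) = (w - 3)^2."""
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= lr * grad
    return (w - 3.0) ** 2  # final loss after training


def evolve(generations=20, pop_size=8, seed=0):
    """Outer loop: blind mutate-and-select search over the learning rate."""
    rng = random.Random(seed)
    population = [rng.uniform(0.001, 0.9) for _ in range(pop_size)]
    for _ in range(generations):
        # A candidate's fitness is only visible by running the inner training.
        ranked = sorted(population, key=train)
        survivors = ranked[: pop_size // 2]
        # Children are randomly perturbed copies of survivors, clipped to range.
        children = [
            min(0.9, max(0.001, lr * rng.uniform(0.5, 1.5))) for lr in survivors
        ]
        population = survivors + children
    return min(population, key=train)


best_lr = evolve()
```

The outer search never sees a gradient with respect to the learning rate. It only sees which trained candidates ended up with lower loss, which is the sense in which it is "blind".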

Not that I think this ultimately ends up weakening Doomimir's point all that much. Humans don't seem to end up with terminal goals that are straightforward copies of the reward circuits pre-wired into our brains either. I sure don't care much about predicting sensory inputs super accurately, which was probably a very big part of the training signal that built my mind.

you totally care about predicting sensory inputs accurately! maybe mostly instrumentally, but you definitely do? like, what, would it just not bother you at all if you started hallucinating all the time?

Instrumentally, yes. The point is that I don’t really care terminally.

Huh. I... think I kind of do care terminally? Or maybe I'm just having a really hard time imagining what it would be like to be terrible at predicting sensory input without this having a bunch of negative consequences.

The autogenocidal maniac Richard Sutton calls this the bitter lesson, and attributes the field's slowness to embrace it to ego and recalcitrance on the part of practitioners.

I am amused by the synchronicity of this reference to Rich Sutton's Bitter Lesson. On that subject, I would love to know what Simplicia and Doomimir would think of my recent post A "Bitter Lesson" Approach to Aligning AGI and ASI. My basic suggestion is that other people's approaches to the Alignment Problem have been too complicated, and we should instead do something conceptually simpler with more data, using only Stochastic Gradient Descent on a (very large) synthetic training dataset to get an aligned base model. I.e. that we should simply "train for X, get X", as Simplicia puts it above. I'm hoping that Doomimir would at least appreciate the suggestion of getting rid of RLHF and relying instead on very dense feedback.

From evolution's perspective, humans in the EEA would have looked like they were doing a good job of optimizing inclusive fitness.

I'm pretty sure that if you evaluated every action a human ever took, in terms of its contributions to improving genetic fitness, you'd pretty reliably find the humans lacking. And not just because they weren't smart enough to come up with the right strategies for having more children. It was probably clear, even to the humans of the time, that lots of them cared about things other than fitness, which would conflict with fitness if extrapolated outward. E.g., wanting to impress your friends by doing stupid stunts that might seriously injure you, or picking a fight with some other band of humans, who you might in turn get into a pointless war with.

Gradient descent, as yielded by a decent reward model, would look at these behaviors, and immediately edit every parameter in the algorithm that generated them, to make them less likely. Evolution has to spend generations waiting for the right mutations to reduce their prevalence.
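As a rough illustration of that contrast, here is a toy sketch. A quadratic loss stands in for the reward model, and a crude mutate-and-select step stands in for evolution, so this is an assumption-laden cartoon rather than a model of either process. `TARGET`, `gradient_step`, and `mutation_step` are hypothetical names invented for the example.

```python
import random

# Toy stand-in for "what the reward model wants" (an assumption for illustration).
TARGET = [1.0, -2.0, 0.5, 4.0]


def loss(params):
    return sum((p - t) ** 2 for p, t in zip(params, TARGET))


def gradient_step(params, lr=0.1):
    """One gradient-descent update: every parameter is adjusted at once."""
    return [p - lr * 2.0 * (p - t) for p, t in zip(params, TARGET)]


def mutation_step(params, rng, scale=0.5):
    """One mutate-and-select step: perturb a single random parameter and
    keep the mutant only if it happens to lower the loss."""
    i = rng.randrange(len(params))
    mutant = list(params)
    mutant[i] += rng.gauss(0.0, scale)
    return mutant if loss(mutant) < loss(params) else params


start = [0.0, 0.0, 0.0, 0.0]
after_gd = gradient_step(start)
after_mut = mutation_step(start, random.Random(0))

# Count how many parameters each kind of step actually changed.
changed_by_gd = sum(a != b for a, b in zip(start, after_gd))
changed_by_mut = sum(a != b for a, b in zip(start, after_mut))
```

In this toy setup a single gradient step moves all four parameters toward the target, while a single mutation step changes at most one parameter, and only if the random perturbation happened to help.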

I don't think the failure of evolution is evidence that alignment is impossible or even hard.

Evolution wasn't smart enough to realize that alignment was a problem. It put the retina backwards. Evolution can do really dumb things. The failure of evolution is consistent with a world where alignment takes 2 lines of Python and is obvious to any smart human who gives the problem a few hours' thought.

"Smart humans haven't solved it yet" gives a much stronger lower bound on difficulty than evolution's failure. At least, it does if alignment is the sort of problem best solved with general simple principles (where humans are better) as opposed to piling on the spaghetti code (where evolution can sometimes beat humans).

This is a great summary of the evidence to date on Simplicia's side. Much of it is new to me. I'm sure it took many hours to find and compile. So thank you.

EEA

What the heck does "EEA" mean?

I propose a layercake model of AI. An AI consists of 0 or more layers of general optimizer, followed by 1 layer of specific tricks.

(I won't count the human programmer as a layer here)

For example, if you hardcoded an algorithm to recognize writing, designing algorithms by hand, expert system style, then you have 0 layers of general optimizer.

If you have a standard CNN, the gradient descent is an optimization layer, and below that are specific details about what letters look like.

In this picture, there is a sense in which you aren't missing any insights about intelligence in general. The idea of gradient descent is intelligence in general. And all the weights of the network contain is specific facts about what shape letters are.

(Although these specific facts are stored in a pretty garbled format. And this doesn't tell you which specific facts will be learned)

If you used an evolutionary algorithm over Tensor maths, and you evolved a standard gradient descent neural network, this would have 2 general optimization layers.
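One way to make the layer counting concrete is a toy sketch in Python. Everything here is an assumption for illustration: the function names, the one-parameter model, and the choice to evolve only the learning rate are made up, and none of it is a claim about how real systems are built.

```python
import math
import random

DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy task: y = 2x

def hand_coded(x):
    """0 optimizer layers: the specific trick is written by hand."""
    return 2.0 * x

def loss(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    total = 0.0
    for x, y in data:
        err = w * x - y
        total += err * err
    return total / len(data)

def gradient_descent(data, lr, steps=50):
    """1 optimizer layer: a general method finds the specific weight."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def evolve_learning_rate(data, start_lr=0.001, generations=30, seed=0):
    """2 optimizer layers: mutation and selection tune the inner optimizer."""
    rng = random.Random(seed)
    best_lr = start_lr
    best_loss = loss(gradient_descent(data, best_lr), data)
    for _ in range(generations):
        # Mutate the learning rate multiplicatively; keep it only if it helps.
        candidate = best_lr * math.exp(rng.gauss(0.0, 0.5))
        candidate_loss = loss(gradient_descent(data, candidate), data)
        if candidate_loss < best_loss:
            best_lr, best_loss = candidate, candidate_loss
    return best_lr
```

In this sketch the "specific facts" live in `hand_coded` or in the learned weight `w`, while `gradient_descent` and `evolve_learning_rate` are the general layers stacked on top.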

If neural networks become more general or agentic (as some LLMs might already be, a little bit), then those neural nets are starting to contain an internal general optimization algorithm, along with the specifics.

This general algorithm should be the same type of thing as gradient descent. It might be more efficient, or have more facts hard-coded in, or have a better prior. It might be insanely contrived and complicated. But I think, if we had these NN-found algorithms, alignment would still be the same type of problem.

By conservation of expected evidence, I take your failure to cite anything relevant as further confirmation of my views.

This is one of the best burns I've ever heard.

Simplicia: I understand that it's possible for things to superficially look good in a brittle way. We see this with adversarial examples in image classification: classifiers that perform well on natural images can give nonsense answers on images constructed to fool them, which is worrying, because it indicates that the machines aren't really seeing the same images we are. That sounds like the sort of risk story you're worried about: that a full-fledged AGI might seem to be aligned in the narrow circumstances you trained it on, while it was actually pursuing alien goals all along.
But in that same case of the image classification, we can see progress being made. When you try to construct adversarial examples for classifiers that have been robustified with adversarial training, you get examples that affect human perception, too. When you use generative models for classification rather than just training a traditional classifier, they exhibit human-like shape bias and out-of-distribution performance. You can try perturbing the network's internal states rather than the inputs to try to defend against unforeseen failure modes ...
I imagine you're not impressed by any of this, but why not? Why isn't incremental progress at instilling human-like behavior into machines, incremental progress on AGI alignment?

Simplicia: As it happens, I also don't think RLHF is as damning as you do. Early theoretical discussions of AI alignment would sometimes talk about what would go wrong if you tried to align AI with a "reward button." Those discussions were philosophically valuable. Indeed, if you had a hypercomputer and your AI design method was to run a brute-force search for the simplest program that resulted in the most reward-button pushes, that would predictably not end well. While a weak agent selected on that basis might behave how you wanted, a stronger agent would find creative ways to trick or brainwash you into pushing the button, or just seize the button itself. If we had a hypercomputer in real life and were literally brute-forcing AI that way, I would be terrified.
But again, this isn't a philosophy problem anymore. Fifteen years later, our state-of-the-art methods do have a brute-force aspect to them, but the details are different, and the details matter. Real-world RLHF setups aren't an unconstrained hypercomputer search for whatever makes humans hit the thumbs-up button. It's reinforcing the state–action trajectories that got reward in the past, often with a constraint on the Kullback–Leibler divergence from the base policy, which blows up on outputs that would be vanishingly unlikely from the base policy.
If most of the bits of search are coming from pretraining, which solves problems by means of copying the cognitive steps that humans would use, then using a little bit of reinforcement learning for steering doesn't seem dangerous in the way that it would be dangerous if the core capabilities fell directly out of RL.
It seems to be working pretty well? It just doesn't seem that implausible that the result of searching for the simplest program that approximates the distribution of natural language in the real world, and then optimizing that to give the responses of a helpful, honest, and harmless assistant is, well ... a helpful, honest, and harmless assistant?
Doomimir: Of course it seems to be working pretty well! It's been optimized for seeming-good-to-you!
Simplicia, I was willing to give this a shot, but I truly despair of leading you over this pons asinorum. You can articulate what goes wrong with the simplest toy illustrations, but keep refusing to see how the real-world systems you laud suffer from the same fundamental failure modes in a systematically less visible way. From evolution's perspective, humans in the EEA would have looked like they were doing a good job of optimizing inclusive fitness.
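The KL constraint mentioned in the quoted passage can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the three-outcome distributions, the `beta` value, and the function names are invented for illustration, and real RLHF setups apply the penalty over token sequences rather than a single discrete choice.

```python
import math

def kl_divergence(policy, base):
    """KL(policy || base) for two discrete distributions over the same outcomes."""
    return sum(p * math.log(p / q) for p, q in zip(policy, base) if p > 0.0)

def penalized_reward(reward, policy, base, beta=0.1):
    """Reward minus a KL penalty; beta is an assumed, illustrative coefficient."""
    return reward - beta * kl_divergence(policy, base)

# Hypothetical base policy: the third outcome is vanishingly unlikely.
BASE = [0.5, 0.499999, 0.000001]
# A tuned policy that only reweights outcomes the base policy already favors.
NEAR = [0.6, 0.399999, 0.000001]
# A tuned policy that piles mass onto the outcome the base policy almost never emits.
FAR = [0.05, 0.05, 0.90]

near_kl = kl_divergence(NEAR, BASE)
far_kl = kl_divergence(FAR, BASE)
```

With these made-up numbers, `far_kl` comes out hundreds of times larger than `near_kl`, so at equal raw reward the penalized objective strongly prefers the policy that stays close to the base distribution.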

Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all"), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence) of even an "alien mind" that's sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility [over future world states] according to its world model to pursue a goal that can be extremely different from what humans deem good.