The difference I think is important is this: humans in our ancestral environment didn't know about inclusive genetic fitness, but modern AIs do, to a large degree, understand/know about the values we're trying to put into the AIs
Early humans were capable of counting their (grand)children and comparing the count to their peers'. Many animals also seem cognitively capable of counting (grand)children. This is a very good proxy for IGF.
(this is a nitpick, but I think it's also a pointer toward the main reasons I don't expect this approach to work).
Hmm, I don't entirely see your point. That's a good proxy, but it's still a proxy. The arguments in this post indicate that evolution could get desires for surviving grandchildren into us, and I think it has done that to some degree.
Edit for clarity: And by "surviving grandchildren" I mean that up to what humans could detect in the ancestral environment. E.g. children adopted as babies, or genetically engineered ones, would still count, I presume.
and I think it has done that to some degree.
Perhaps to some degree... but basically no. Introspecting, I perhaps feel nicer picturing a couple of grandchildren than I do picturing zero, but I don't feel nicer picturing 100 grandchildren. 100 feels worse, for lots of reasons. One of them is my values related to calm and comfort. My desire to read books does far more to motivate my behaviour, on a day-to-day basis, than any thought of children or grandchildren.
I do have a bunch of child-related proxies in my motivations: particularly around watching children be confused and then overcome confusion, or the cute way they misunderstand how things work. There's also some parenting proxies in there, around being loved and respected by people I'm responsible for.
So I have all of these motivations. But they are a collection of non-robust proxies. I claim that if you replaced all these motivations with a counter of grandchildren, this version of me would have higher fitness (especially in my current environment, but probably also in the ancestral environment).
I also have motivations around comfort, challenges, competition, curiosity and status, and these almost always swamp the ones that are more directly related to children. I would have more grandchildren if my ultimate motivations weren't dominated by these other things.
So, I claim that knowledge of IGF isn't the important difference that stopped evolution from making robustly inner aligned humans. It's kinda close, but not quite it. If it were the important difference, we would have evolved to directly plan for having lots of grandchildren (because it's a far better approximation of IGF than the collection of proxies I talked about above, and wasn't unknown to our ancestors).
You've said:
What this means is: there is no compact way for evolution to point to the "this human specimen values genetic fitness"-knob on the genome and turn it up.
But this is not the case for modern AIs. They have some understanding of what kindness is, and there should in principle be a way to locate that inside the LLM and turn up the knob.
You're saying that the understanding of a concept means that there is a "knob" on the genome that allows evolution/training to "turn up" that concept as a motivation. I'm claiming that this is false (as demonstrated by the fact that evolution didn't turn up the grandchild knob, and turn down all the far weaker proxies).
So the natural next thing to think about is "what did stop evolution from turning up the grandchild count knob?". Obvious guesses are:
So if we translate these issues back to the problem of aligning ML systems:
So I expect these effects would add up to RLHF creating agents that have a set of non-robust proxies around your intended target, analogous (but perhaps slightly less so) to how I have a set of non-robust proxies for IGF that motivate me.
Okay, maybe the "genetic knob" isn't quite the right language. What I meant is that for evolution to be able to inner-align humans to IGF, you'd need
I'm saying (1) was not present, so (1,2,3) were clearly not present.
It's possible a proxy like seeing surviving grandkids was present, but that in that case (2, 3) were not present.
In that case, my theory is consistent with the evidence, but doesn't necessarily explain it better than other theories. That's fine.
Wrt your "what actually caused it"
Does this make sense?
I expect that all processes that promote kind-looking outputs route either through reflexes towards pseudo-kindness, or through instrumental reasoning about pseudo-kindness and kindness. Reflexes towards true kindness are just very complex to implement in any neural net, and so unlikely to spontaneously form during training, since there are so many alternative pseudo-kindness reflexes one could get instead. Humans stumbled into what we call kindness somehow, partially due to quirks of evolution vs. SGD, like genome size or the need for cooperation between small tribes, etc. Now new humans acquire similar reflexes towards similar kindness due to their shared genes, culture and environment.
Reinforcing kind-looking outputs in AI just reinforces those reasoning processes and reflexes towards pseudo-kindness. The reasoning route to true kindness performs quite robustly, while reflexes or reasoning towards pseudo-kindness may lead to not-kind-looking outputs even during training if the data distribution shifts a bit. Still, there are enough versions of pseudo-kindness that even this kind of robustness doesn't narrow down on true kindness.
Both reflexes towards pseudo-kindness and reasoning about true/pseudo-kindness, however, don't generalize the way we want once the AI's environment shifts, due to e.g. a treacherous turn becoming possible, the AI's world model growing a lot larger, or various other effects that happen on the way to superintelligence.
Pseudo-kindness becomes something orthogonal, i.e. it promotes actions we don't care about (i.e. filling the lightcone with computations we no longer view as being even partially about kindness, at most a bad imitation that got crucial details wrong). Reasoning towards kindness for instrumental reasons just ceases to happen once the instrumental reasons no longer apply, e.g. because the AI can now pursue plans regardless of human approval due to deception or an anticipated takeover.
My unconfident best guess after skimming this post (sorry) is that you implicitly assumed that reflexes towards true kindness are available for reinforcement.
Can you define pseudo-kindness? I mean, LLMs are trying to predict the behavior of humans, and big LLMs do so extremely well. That means they have a pretty high-resolution conception of kindness somewhere inside.
Now, I agree it will not perfectly match my or your conception of kindness, and that means that if you unleash a kindness-optimizing ASI that was aligned with the method I described, you'd likely die for tails-come-apart/Goodhart reasons. But I addressed that in the post, saying I think the approach would work for properties of corrigibility as well, where you'd have fewer of these concerns.
By pseudo-kindness I mean any proxy for kindness that's both too wrong to have any overlap with kindness when optimized for by a superintelligence, and right enough to have overlap with kindness when optimized for by current LLMs.
Kindness is some property that behavior & consequences can exhibit. There are many properties in general, and there are still many that correlate strongly with kindness in a narrow test environment. Some of these proxy properties are algorithmically simple (and thus plausibly found in LLMs and thus again in superintelligence), some even share subcomputations/subdefinitions with kindness. There's some degrees-of-freedom argument about how many such proxies there are. Concretely one can give examples, e.g. "if asked, the user rates the assistant's texts as kind" is a proxy that correlates well with the assistant's plans being kind / having kind consequences.
Wrt corrigibility: I don't see why corrigibility doesn't have the same problems as kindness. It's a less complex and less human-centric concept than kindness, but still complex and plausibly human-centric (e.g. "do what I mean"-style logic or "human-style counterfactuals"). Plausibly it might also not be human-centric, or at least not much, i.e. a wide class of agents would invent the same concept of corrigibility and not different versions.
Proxies of corrigibility during training still exist, and tails still come apart.
I think corrigibility is a basin/attractor. And I think even imperfect corrigibility can have this property. This is the crucial difference.
Like a pseudo-corrigible agent might rationally think to itself "I could be more corrigible and helpful if I had more resources / was smarter", but then pseudo-corrigibility tells it "But taking such actions is not very (pseudo-) corrigible, so I will not do that".
Ergo, even imperfect corrigibility is a basin, because it can prevent the ASI from traveling into the instrumentally rational crazy-land where the tails come apart and kill you, and, crucially, where the distinction between corrigibility and pseudo-corrigibility becomes dangerous.
Does that make sense?
Kindness also may have an attractor, or due to discreteness have a volume > 0 in weight space.
The question is whether the attractor is big enough. And given that there are various impossibility theorems related to corrigibility & coherence, I anticipate that the attractor around corrigibility is quite small, because one has to evade various obstacles at once. On the other hand, proxies that flow into a non-corrigible location once we ramp up intelligence aren't obstructed by the same theorems, so they can be just as numerous as proxies for kindness.
Wrt your concrete attractor: if the AI doesn't improve its world model and decisions aka intelligence, then it's also not useful for us. And a human in the loop doesn't help if the AI's proposals are inscrutable to us, because then we'll just wave them through and are essentially not in the loop anymore. A corrigible AI can be trusted with improving its intelligence because it only does so in ways that preserve the corrigibility.
if the AI doesn't improve its world model and decisions aka intelligence, then it's also not useful for us
This seems obviously false to me. GPT5 doesn't do this, and it's relatively useful. And humans will build smarter agents than GPT5.
Kindness also may have an attractor, or due to discreteness have a volume > 0 in weight space.
I don't see why it'd have an attractor in the sense of the example I gave.
This is the picture I have in my head. I'd put kindness in the top right, and corrigibility in the top left.
Meaning, kindness and pseudo-kindness will diverge and land infinitely far apart if optimized by an AGI smart enough to do self-improvement.
But pseudo-corrigibility and corrigibility will not, because even pseudo-corrigibility can be enough to prevent an AGI from wandering into crazy land (by pursuing instrumentally convergent strategies like RSI, or just thinking really hard about its own values and its relationship with humans).
Empirically, current LLM behavior is better predicted by a model
than by a model
The second model, under capability growth, can indeed yield a capable reasoner steered by reflexes towards approximate true kindness. And if we get enough training before ASI, the approximation can become good enough that, due to discreteness or attractors, it just is equal to true kindness.
The first model just generalizes to a capable misaligned reasoner.
Okay, I partly agree with this. But I'm not saying current LLMs are aligned. I'm explaining how techniques from the same class as the ones we use today could be used to create aligned agents, if implemented correctly.
Oops. Then I don't get what techniques you are proposing. Like, most techniques that claim to work for superintelligence / powerful agents also claim to work in some more limited manner for current agents (in part because most techniques assume that no phase change occurs between now and then, or that the phase change doesn't affect the technique => the technique stops working in a gradual manner and one can do empirical studies on current models).
And while there certainly is some loss function or initial random seed for current techniques that gives you aligned superintelligence, there's no way to find them.
When I say "current techniques" I mean the recipe I gave here
So, basically all modern techniques for training an LLM to have a certain skill or proclivity consist in
- Defining some metric that determines how much you like an LLM's output
- Sampling from the LLM
- Making a local update to the parameters of your model so that the token outputs you "liked" according to the metric become more likely.
There are tons of "free parameters" in how you implement such a recipe, e.g. constitutional AI, deliberative alignment, SFT, RLHF (with PPO or DPO or GRPO), whatever. And for each of these there are still more free parameters in how you implement it exactly, and most importantly: how the data is generated.
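To make that concrete, here's a minimal sketch of one instantiation of the recipe, in roughly the REINFORCE style. The model choice, prompt, and `reward_metric` are all made up for illustration; this is not any lab's actual pipeline, just the metric -> sample -> update loop written out in code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model, purely for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

# Step 1: define some metric for how much you "like" an output (hypothetical stand-in).
def reward_metric(text: str) -> float:
    return 1.0 if "sorry" in text.lower() else 0.0

prompt = "User: I lost my keys.\nAssistant:"
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

for _ in range(100):
    # Step 2: sample from the LLM.
    out = model.generate(**inputs, do_sample=True, max_new_tokens=30,
                         pad_token_id=tok.eos_token_id)
    completion = tok.decode(out[0, prompt_len:])

    # Score the sampled output with the metric.
    r = reward_metric(completion)

    # Step 3: local update so the tokens you "liked" become more likely.
    logits = model(out).logits[:, :-1]                  # predictions for tokens 1..n
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, out[:, 1:, None]).squeeze(-1)
    completion_lp = token_lp[:, prompt_len - 1:].sum()  # log-prob of the completion
    loss = -r * completion_lp                           # reinforce liked outputs

    opt.zero_grad()
    loss.backward()
    opt.step()
```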
I don't think most of them (including the SOTA methods used by AI labs) yield alignment. I tried to explain in the post what exact implementation of this class of techniques can lead to alignment, but the short version is:
Does this make sense? Different AI labs all do some attempt at "prosaic alignment" with RLHF and so on, but I don't think any of them is doing 1-4 here.
So I'm not arguing current LLMs are aligned. I'm saying current techniques, if used in a specific way, can create aligned LLMs.
Many people take it for granted that this is extremely unlikely to work. The central worry is that, given a loss function and a set of examples of correct/aligned behavior, such approaches reliably create AIs that get low loss on the training samples, but give us no control over what internal mechanism those AIs developed to do that. Consequently, these approaches give us little reason to expect that the AIs will generalize correctly. In fact, smart enough agents will produce low-loss outputs for instrumental reasons, which means there's a nearly infinite set of messed-up values that nevertheless give low loss. And since only a very small set of values give good outcomes for humanity if optimized by an ASI, training an ASI or to-be-ASI with such a procedure will almost certainly yield very bad outcomes for humans.
The most central empirical example of this process unfolding is human evolution. Humans were optimized for genetic fitness, and developed internal desires/drives/values that were well suited to maximizing it in our ancestral environment (low loss on training samples), but evolution didn't actually make humans care about inclusive genetic fitness for its own sake. When we were subjected to a distributional shift (agriculture, civilization, modern technology and culture), the internal drives humans had acquired failed to generalize, and almost no modern humans act in ways that maximize their inclusive genetic fitness.
There are many differences between human evolution and gradient descent, but I think most of them are not that important. The difference I think is important is this: humans in our ancestral environment didn't know about inclusive genetic fitness, but modern AIs do, to a large degree, understand/know about the values we're trying to put into the AIs (kindness, corrigibility, humans feeling joy and so on).
What this means is: there is no compact way for evolution to point to the "this human specimen values genetic fitness"-knob on the genome and turn it up.
But this is not the case for modern AIs. They have some understanding of what kindness is, and there should in principle be a way to locate that inside the LLM and turn up the knob.
So, basically all modern techniques for training an LLM to have a certain skill or proclivity consist in
Okay, so this makes the LLM more likely to produce the tokens your metric liked, but can we say anything stronger? I think we can. This isn't a perfect description, but I think a more insightful way of thinking about RL done on LLMs is this: finetuning a mature LLM strengthens the processes inside the LLM that actually caused it to produce the output you rated highly.
This is a somewhat subtle point (and should maybe be explained in more detail), but it's quite a different perspective than the one underlying the argument made at the beginning of this essay. It also sheds light on many observed phenomena in DL, like the fact that finetuning often gives better-generalizing results if you prompt an AI for the behavior you want during training before showing it the SFT examples of the exact target behavior, or the fact that in the alignment faking paper, the training setup increased alignment faking even though they were only training to remove refusals.
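A toy way to see the mechanical core of this claim: when you reinforce an output, gradient only flows through the computation that was actually active in producing it. The two "circuits" below are nothing like real transformer internals (this is an assumption-laden cartoon with invented names), but they show that the inactive pathway receives exactly zero update.

```python
import torch

torch.manual_seed(0)
circuit_a = torch.nn.Linear(4, 3)   # pretend: "larp as a sci-fi assistant" circuit
circuit_b = torch.nn.Linear(4, 3)   # pretend: "genuine kindness" circuit
params = list(circuit_a.parameters()) + list(circuit_b.parameters())
opt = torch.optim.SGD(params, lr=0.1)

x = torch.randn(1, 4)               # some input
gate_a, gate_b = 1.0, 0.0           # for this input, circuit A drives the output
logits = gate_a * circuit_a(x) + gate_b * circuit_b(x)

liked_token = torch.tensor([2])     # the output token we "liked"
loss = torch.nn.functional.cross_entropy(logits, liked_token)

opt.zero_grad()
loss.backward()
opt.step()

print(circuit_a.weight.grad.abs().sum())  # nonzero: the circuit that caused the output is strengthened
print(circuit_b.weight.grad.abs().sum())  # zero: the inactive circuit is untouched
```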
So what does the above frame tell us about the effects of doing SFT and RLHF on a pretrained LLM? Well, I think it tells us we should expect such methods to (initially at least) boost the internal mechanisms inside the pretrained LLM that actually caused the LLM to output such tokens. What might that be? Well, if you throw the samples at the AI cold, there might not be any reasonable way for the LLM to generate those tokens, so the learning will be dominated by boosting shallow things, like shrinking the unembedding vectors of nasty-word tokens and boosting those of pleasant-word tokens. If you give it a prompt that gets it to start larping as a sci-fi AI assistant, you'll boost the internal circuitry that goes into larping like a sci-fi AI assistant. If you write out a dialogue between two humans, one of whom is very kind, and then have the other human ask a question and the very kind one answer with your SFT sample, you might be able to boost the "kindness" already inside the LLM, provided your SFT sample is something like what the kind human could've said in that conversation.
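As a concrete illustration of those three data-construction choices (all strings and names below are invented; the claim is only about which internal circuitry each framing tends to reinforce):

```python
# The same target completion, wrapped in three different training contexts.
target = " I'm so sorry, that sounds really hard. Do you want to talk about it?"

# 1. "Cold": the target tokens with no context that would give the base model
#    any interesting reason to produce them; mostly shallow token statistics get boosted.
cold_example = target

# 2. Sci-fi-assistant framing: plausibly boosts "larp as an AI assistant" circuitry.
assistant_example = (
    "The following is a conversation with an advanced AI assistant.\n"
    "User: My dog died yesterday.\n"
    "AI:" + target
)

# 3. Kind-human dialogue framing: if the target is something the kind human could
#    plausibly have said, the hope is that the "kindness" circuitry already inside
#    the base model is what generates it, and therefore what gets strengthened.
kind_human_example = (
    "Maria is known for being unusually kind and attentive.\n"
    "Sam: My dog died yesterday.\n"
    "Maria:" + target
)
```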
If you start off with a very smart agent that reasons its way towards giving the correct answers during the SFT or RLHF process you're doing, the cause of the right answers was the reasoning (and some random value you don't know about), rather than any internal circuitry having to do with kindness (or another alignment target). That reasoning will be strengthened, and your finetuning will do little to align the agent.
This is a curse but also a blessing: it means that with a smart enough agent, even if there are errors in your alignment training examples, that won't hurt the outcome very much.
Importantly, it also means that if you start out with an agent dumb enough that you can align it in this way, you could boost its general intelligence, for example by training it on math questions, without worrying that this process causes misalignment, because the reason it would try to answer your questions in the first place would, from the beginning, be that it wanted to help you, and that would be strengthened by every sample.
To avoid making this post too long and technical I've ignored some important complicating elements. For example, LLMs don't output tokens, they output probabilities over tokens. And the probability of any individual token isn't the result of a single discrete unit inside the LLM, but the sum of many such sometimes-discrete-ish-sometimes-garbled-mess processes.
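For concreteness, the first complication in miniature, with toy numbers and a hypothetical five-token vocabulary:

```python
import torch

logits = torch.tensor([2.0, 0.5, -1.0, 0.1, 1.2])  # raw scores from one forward pass
probs = torch.softmax(logits, dim=-1)              # a distribution over tokens, not a token
token = torch.multinomial(probs, num_samples=1)    # sampling is what picks a single token
print(probs, token)
```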
I've thought through some of these, and think the core idea has merit despite them, but some I'm still uncertain about, and there's enough uncertainty that maybe the idea doesn't work at all.
I'm still very interested in hearing what people think. Similar takes have been posted before, but I haven't heard a single robust take-down.
Also, like I said, I'm aware some descriptions here are underspecified, please ask if something is unclear.