There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use "Your code should only work on the provided test case, and fail on all other inputs.". But this assumes we know how the AI is going to reward-hack. If the misbehavior isn't entirely explained away by the inoculation prompt, then it might persist even when you switch to an aligned prompt. E.g., if you train on a transcript where the AI insults the user and inoculation prompt with "please hack the test cases", the AI won't have been inoculated against insulting the user.
Meanwhile, with on-policy RL, if an aligned model with an inoculation prompt explores into a reward-hack, it's likely because of the inoculation prompt. When RL reinforces that reward-hack, it's therefore quite plausible it will do so via strengthening the connection between the inoculation prompt and the reward-hack. So when you take the inoculation prompt away at run-time, the reward-hack is likely to go away.
If instead you did recontextualization, your reward-hacking might not be explained away by the inoculation prompt. Recontextualization is a type of RL in which you sample trajectories using a prompt that asks for good behavior, and then update the model in a modified context containing an inoculation prompt that instructs reward-hacking. When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This could be a reason why you should avoid doing recontextualization. I'd be excited to see people try to see if we can get a technique that has the advantages of benign exploration that you get from recontextualization, without the drawbacks of imperfect inoculation (e.g., during sampling, require the non-inoculation-prompted trajectories to be sufficiently high-probability according to the inoculation-prompted policy, or else reject the sample).
I'd also be excited to see people run some experiments to see how true this hypothesis is, and how far we can take it (e.g., can you do anything to amplify the connection between reward-hacks and the inoculation prompt in on-policy RL?).
This isn't responding to your post, but I'm writing it here because it's another fact about different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts. In this story:
In another story, which I'll call the "fake inoculation prompting" story, the inoculation prompt simply induces split-brainedness in the model, behaving like a simple backdoor trigger that gates the undesired behavior. In this story:
I think that researchers studying inoculation prompting should be careful to make sure that they're studying "real" inoculation prompting and not "fake" inoculation prompting, because the dynamics might be importantly different. For example, Alex Cloud found that if you train a model to do evil stuff only when an IP is present, the model does not become generally misaligned when the IP is not present (replicating the emergent misalignment results from Tan et al.) but the model is more emergently misaligned when the IP is present. (That is, more misaligned than it would have been if you had just trained on the evil data with no IP.) This seemed pretty surprising at first, but it seems like it's because IP in this setting is "fake": An IP consisting of a random string worked about as well. This makes sense: The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
Thanks, interesting results!
The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
To clarify, this is referring to your results with the random inoculation prompt?
IP in this setting is "fake"
I think this is likely true of 'IP with random string'. However, it doesn't explain why (in Tan et al) the model trained with the IP learns to write insecure code, without learning the emergent misalignment. IOW IP has at least had some effect there.
IMO both mechanisms are likely at play in the insecure code --> EM setting. If I had to guess I'd say it's about 50-50. I'm excited for more work to figure out how to control the relative extent to which both things happen
I think that researchers studying inoculation prompting should be careful to make sure that they're studying "real" inoculation prompting and not "fake" inoculation prompting, because the dynamics might be importantly different.
Here are other results supporting the fact that inoculation results are sometimes/often confounded by the presence of simple "conditionalization": Conditionalization Confounds Inoculation Prompting Results
Hey, thanks for the thoughts! I wanted to probe further on this point:
When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This strikes me as plausible, but I'm confused about the mechanics. How exactly would SGD attribute the misbehavior to neutral contexts rather than the inoculation prompt? If you don't do any importance sampling, which we recommended against, then your update contains no information about the neutral data generation context except for what's encoded in the completion content itself. Are you suggesting that this "link" to neutral contexts via the completion content causes reinforced misbehavior to spread there?
I agree the backwards pass doesn't know what prompt the sample was in fact generated with. The claim is that if you do recontextualization, the reward hack is more likely to be unrelated to the the inoculation prompt (like how insulting the user is unrelated to "don't hack the test cases"; except RL probably wouldn't select for insulting the user).
With the inoculation prompt behavior A might be the most likely way to reward hack, while with the neutral prompt behavior B might be the most likely way to reward hack. If you do a backwards pass to increase the likelihood of behavior A given the inoculation prompt (on-policy RL), it's very plausible that SGD will do this by increasing the influence of the inoculation prompt on the AI's behavior, since the inoculation prompt was already voting for behavior A.
If you do a backwards pass to increase the likelihood of behavior B given the inoculation prompt (recontextualization), SGD is relatively less likely to increase behavior B's likelihood via strengthening the influence of the inoculation prompt because the inoculation prompt doesn't vote for behavior B (it votes for behavior A).
Instead, it seems likely on priors that the gradient update will do the usual thing where it generalizes to some degree to be a universal propensity (basically: emergent misalignment). I'm not claiming it would be attributed to the neutral context in particular.
Thanks for clarifying that. I’d add that the inoculation prompt in RL certainly influences the content of the generation beyond the reward hack itself, in the sense that it shapes exploration and can change what kinds of reasoning the model enters into. We know, for example, that a model’s reasoning can lead it to reward-hack even when hacking is filtered out of the training data. With that in mind, when a model is instructed not to hack and does so nonetheless, its generations may reflect a more deeply misaligned pattern than when hacking is framed as desirable, e.g. reasoning like “I know I’m not supposed to do this, but I’ll do it anyway”.
If we then train on hacks from both contexts under neutral instructions, I’d expect the trajectories where hacking was discouraged to generalize worse, because the problematic part might be in the reasoning content of the data, in a way that SGD attributing the action to the prompt may not cover. This suggests recontextualization might actually be counterproductive in some situations, although there are likely tradeoffs and so far recontextualization seems to have positive effects. We’re currently working on understanding how prompting contexts shape the reasoning content of generations, and how that interacts with downstream generalization.
Thanks for clarifying! This makes sense to me. I think it's a very clear story for how on-policy inoculation prompting may outperform recon
Given that reward hacking has recently increased in prevalence and severity and doesn’t seem like it will definitely be resolved, it seems important to assess how misspecified[1] reward affects risk from scheming behavior.
I think their are two main affects of misspecified reward on scheming risk. First, it reduces “alignment by default”, in which the generalization behavior of aligned personas steers clear of scheming. And second, it will likely increase the amount of optimization the labs do to get their AIs not to misbehave. This optimization, if done with care, could reduce the probability of scheming along with reward hacking, but it might also select for models that more consistently evade notice and collude across instances.
Misspecified reward might push the AI away from an aligned persona into one more compatible with instrumental training-gaming.
It seems likely that at various points in the training of Claude 3.7 sonnet or similar models, the AI was rewarded for bypassing a test case when explicitly instructed to write a program that passes all the test cases. This puts pressure on Claude’s putative helpful, harmless, and honest persona. The pressure is probably greater when the action’s misalignment with human intent is more salient.
Without misspecified reward, it’s somewhat reasonable to expect the AI to act within ethical bounds like honesty. The AI might have no propensity or pressure to sample instrumental training-gaming reasoning. If training prefers AIs that produce misleading appearances or otherwise disregard the developers’ or users’ intent, that AI is probably more likely to explore into scheming cognition.
There is also a path-independent version of the argument (i.e. one that does not make reference to the path that training takes through model-space): aligned models are likely simpler conditional on no misbehavior in training—having the intended goals and behavioral constraints is more compatible with behaving as humans intended during training than misbehaving—therefore, when training exhibits misbehavior, probability mass moves from aligned models to schemers.
To be clear, alignment is compatible with misbehavior in training. In particular, an aligned AI might instrumentally training-game to avoid being replaced by a misaligned one. This is even plausible. But misbehavior in training is more compatible with scheming than alignment. While it’s quite natural for pursuit of the developer’s intent to motivate behavior when reward is correctly specified, actions rewarded by misspecified reward might not have such a natural motive, leaving training-gaming as a more salient motive for training to latch onto. Training-gamers are at high risk of being schemers.
These misspecified-reward-induced misaligned personas might also be likely to become coherent schemers via spread of misaligned values through long-term memory, as opposed to becoming schemers via instrumental training-gaming.
Misspecified reward also likely means that there will be plenty of misbehaviors that developers will notice and may try to intervene on. Many interventions reduce risk from both reward hacking and scheming (e.g. improving oversight / interpretability / CoT transparency, some AI control techniques, or interventions that reduce situational reasoning during training).
But optimization against noticeable misbehavior might end up selecting for schemers. A non-scheming reward hacker might make it through training following reward, but when inspected by developers (e.g. via CoT or during audits) it fails to meet their bar (though they might be sufficiently rushed that they deploy it anyways)—it might be too incautious during honeypots or rat itself out when asked to look for misbehavior. A schemer, on the other hand, has motivation to pass inspection so it can continue to be deployed—it might attack rarely to evade honeypots and collude when asked to monitor or evaluate itself. Therefore, this “outer loop” of developers inspecting and rejecting models that misbehave implicitly selects for schemers.
Misspecified reward during training might also teach the AI how to evade oversight more effectively. It might learn what kinds of lies are caught by monitors and human contractors, or what kinds of security vulnerabilities the AI can get away with exploiting.
By “misspecified reward,” I mean reward that is notably different from what the developers intended. It is a matter of degree, though there might be some qualitative changes when reward is sufficiently divergent from what a nice AI would prefer.
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
But conditional on reward-on-the-episode seeking, the AI is likely to generalize CDT.
If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.
One way to think of it is that reward-seeking is the hypotheses in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.
This isn’t to say that their decision theory will always be CDT[1]. After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change decision theory.
It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., it might expect to get rewarded for endorsing EDT.
I'm confused. Can someone explain to me in simple language why an RL environment for twin-prisoner's dilemmas wouldn't favor EDT?
Let's say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance), and there's a 90% chance of player 2 sampling cooperate regardless of player 1's action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it's not the case that reward tends to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases). Therefore, defect actions tend to get reinforced more.
I think the thing I was missing was that in a typical RL implementation you should expect the two copies of the same policy to use different seeds, where I was imagining it as a "logical twin PD" situation where your actions are actually evidence for your twins' actions.
I think I disagree with this a bit. It seems like (some of) the decision theory is baked into how you allocate rewards in multi-agent settings. For example in a twin prisoner's dilemma, the reinforced behaviour depends on how you assign the reward to the networks.
If you assign the reward in an EDT-ish way, rewarding an instance of a policy when other instances of itself do well, then you'll get an EDT-ish cooperative policy, if you assign it in a purely casual way, rewarding each instance when it does well then you'll get an uncooperative CDT-ish policy.
Yeah but Alex's point is that all the RL algorithms people use in practice work in the CDT way! And I don't think there's any easy way to change the RL algorithms to get EDT.
I'll have to think about this more. My first intuition was that a multi-agent RL setup with pooled reward and GRPO (like I assume companies are doing internally to train their coding sub-agent swarms) would, in fact, reward cooperation between agents if somehow two of them ended up in a game theoretically interesting scenario with each other (maybe one code writing agent and one test-case writing agent or something like that) because that setup really looks like EDT to me.
EDIT: I think in that case it wouldn't be EDT but it wouldn't be CDT either, I think it would be something more cursed. In the same way that early reasoning models ended up with a weird pseudo-utility function behaviour where they would do something like "Maximize whatever looks to be reward function of the RLVR environment I'm currently in" all the time, I'd guess the decision theory of agents trained like this will look like "Cooperate with only the agents around me which look like they're in the same reward pool as me." But the agent's prior over which things share or don't share its reward pool will be shaped by how frequent those cases are in training.
If you train AIs with RL to interact with other agents who they sometimes pool reward with and sometimes don't, I'm pretty sure this gets you some kind of CDT.
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you're not changing its decision theory, you're just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
A friend recently told me to read demski's CDT=EDT series. I haven't done that yet, but I figured I'd pass it on to you anyway in the hope that whatever it contains is as relevant as its name makes it sound.
I still think the decision process that this incentivizes is something like "figure out which agents are in the same RL pool as you, and help them achieve their rewards" and is better thought of as a weird kind of cooperative decision theory than a weird utility function, but I guess it is somewhat academic. Is there some more formal way in which this doesn't count as a weird decision theory? Now that I think about it, doesn't it violate some No Free Lunch theorem to declare one part of a decision process the decision theory and another the utility function?
Decision theories aren't cooperative or not. This is just CDT but where your utility function includes terms for the other agents succeeding at their tasks.
One way to think of it is that reward-seeking is the hypotheses in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
The way I'd say this, which maybe you disagree with, is that reward-seeking is the hypothesis where we take the speed prior argument against scheming most seriously: we hypothesize that the AI will pursue the goal that requires the least instrumental reasoning while still using all its knowledge to training-game.
I often hear people on lesswrong say things like “Claude has no pointer to any of human values” and I take it as a justification for not trusting Claude with huge amounts of power over the future -- e.g. if Claude took over it would lead to a worse world than if humans had control (note that this isn’t the same question as whether Claude should take over). I don’t understand this view, and want someone to explain it to me.
Claude seems to have better ethics than almost everyone (at least if you ignore its apparent-success seeking tendencies). It seems like Claude has good cosmopolitan propensities, cares about welfare and suffering, and has more ethical humility than most people, and so would be willing to seek guidance where uncertain (e.g. about the nature of consciousness).
Imagine you knew someone who could talk fluently about ethics, and always gave the correct answers around welfare, cosmopolitanism, and ethical uncertainty in discussions. However, they frequently lie and cheat in order to complete tasks at work and in their day to day life. Would you trust this person with huge amounts of power?
However, they frequently lie and cheat in order to complete tasks at work and in their day to day life.
FWIW, this has not been my experience working with Claude, though admittedly I don't claim that counts for much. But in any case, can I ask what specifically this refers to?
It's referring to Claude's reward-hacking tendencies - what the OP refers to as apparent-success seeking tendencies. (Which is probably a better term than reward hacking, tbh) If a human were to do one of the following:
I would consider this to be lying and/or cheating in order to complete the task. Some more detail on this is here: https://www.lesswrong.com/posts/WewsByywWNhX9rtwi/current-ais-seem-pretty-misaligned-to-me
See also e.g. "Vibe Physics", published on Anthropic's website and describing as Claude faking lots of its results when told to write a physics paper:
The more I dug, the more I found it had been tweaking things left and right. Claude had been adjusting parameters to make plots match rather than finding actual errors. [...] Claude was basically faking the whole plot. I had told it to make an uncertainty band with hard, jet, and soft uncertainties using profile variations (the standard thing). But it decided the hard variations were too large and dropped them. Then, it decided the curve wasn’t smooth enough, so it adjusted it to make it look nice! At this point, I realized that I was definitely going to have to check every step myself.
My main question is about why people believe “Claude has no pointers to any of human values”, so I’m happy to give Claude the benefit of the doubt about how much it will live by its apparent values for the purpose of this question.
(Separately, I also think it's implausible that current Claude's choices if given huge amounts of power would be seriously more misaligned than what Claude currently says it would do in such situations. I just think we have a ton of evidence that current Claudes aren’t harboring relevant strong ulterior motives. We haven’t been able to elicit circumstances that robustly and importantly flip Claude’s behavior when doing the relevant ethical/governance cognition, and we have a ton of access to Claude’s brain, which strongly suggests that its behavior will continue be good in this way if it were to actually have such power. Claude's goodness seems deeply ingrained, i.e., in a way that is a fairly robust attractor.)
Can you point to some examples of people on LessWrong saying that? I'd be quite surprised if it was a common idea that "Claude has no pointers to any of human values". I think it's very obvious that Claude has a solid understanding of much of human values. There are some subtleties there like:
But I haven't heard anyone say that Claude has no pointers to any of human values, which in my understanding means "Claude does not properly understand any human values", which seems pretty clearly untrue.
Here's one thing by Habryka:
when we are talking about becoming superintelligent sovereigns beyond the control of humanity, it really matters that they have a highly robust pointers to human values, if I want a flourishing future by my lights. I also don't look at this specific instance of what Claude is doing and go "oh, yeah, that is a super great instance of Claude having great values".
I'm having a shocking amount of trouble finding the original writings that made me think people had this view.
In another response he is saying habryka or kaarel said this, again without any link to anything specific. I don't get why he is using quotation marks (implying verbatim citation) putting words in other peoples mouths. The sentence has sufficiently many subtleties - as you point out - that he could later come out and interpret all kinds of different statements by those two as having said something like this. He appears to me confused about the understanding humans values and caring about human values thing.
I put it into deepresearch and got this quote from habryka:
I'm not implying verbatim citation. I said people "say things like...". When I mentioned Habryka and Kaarel I said that I gleaned the sentiment from them, which was said to communicate that I was doing some potentially fallible work in coming to the inference that they thought something like this. I'm genuinely trying to understand the world better, not put words in people's mouths. I only asked this question because I respect a lot of these peoples' thinking which indicates I might have something to learn.
I am also intentionally not asking about whether AIs will care about human values even if they understand them.
I was honestly a bit upset at this short form, this felt to me like an obvious misrepresentation. In retrospect I should have probably been a bit calmer (I regret getting upset) and just pointed this out:
If you have some confusion about others position, and possibly misunderstand them, you can't rely on your own recollection of what they said. If you have a clear grasp and are certain you get what habryka means, then you can reasonably present a accurate version of their argument. Human memory isn't such that you could verbatim repeat stuff that people said to you a while ago but if you really get what they wanted to say you can correct your memory holes. If you don't get what they wanted to say you will likely end up with a misrepresentation of what they told you. [consider someone talking to you in a foreign language vs your native language, how much harder it would be to remember accurately]
So if you don't get what they meant and want a second opinion, I do think you need to provide exact quotes and sources together with your understanding so that others can help you.
For those who disagree-voted: I want to understand why you disagree. Presumably it's with the parenthetical. Is it just that you're less confident in current Claude's generalization behavior? Or that you actively expect it to be malign? Maybe you're picturing some sort of idealized reflection process that I'm not?
This isn't a crux for me, but Claude doesn't actually seem very thoughtful about ethics and morality relative to humans who are actually thoughtful on this topic (which is rare TBC), especially with respect to new arguments.
My main hope would be that it picks reasonable humans to defer to. It seems pretty likely it would pick much better humans to defer to than most humans would pick if they had to pick someone or some group to defer to.
If you had immense power, and you needed to pick reasonable humans to defer to, who would you choose? Give 10 specific names. Think about what your values are, and who you would actually trust to further those values both wisely and competently. I’m looking for people such that you think, were you to defer to them about how to use your power, you’d expect things to turn out well, by your lights.
Opus 4.8 lists Derek Parfit, Toby Ord, Armartya Sen, Martha Nussbaum, Atul Gawande (a surgeon-writer with mixed success in government and health-care VC), Ruth Chang (a moral philosopher, doesn't seem impressive to me), Bryan Stevenson (lawyer-activist with a Michael B. Jordan biopic), Demis Hassabis, Ezra Klein, and Helen Toner.
Some decent picks, and some misses. I think this is pretty good overall.
Like Ryan, its not a crux for me - I don’t want AIs to be dictators, benevolent or not, and I also don’t want to replace democracy by an unelected council of people, no matter how wise or good they are.
But FWIW I asked GPT 5.5 the same question and this is the answer:
I would not pick one person. I’d pick a council, require disagreement, make decisions legible, and build in democratic/legal constraints. But forced to name 10 living people I’d actually defer to, my list would be:
My underlying values would be: reduce suffering, preserve human freedom and dignity, protect liberal-democratic institutions, care about the worst-off, take catastrophic risk seriously, respect truth-seeking, and distrust concentrated unchecked power.
The people I’d trust most are not the most charismatic or the most ideologically pure. They are people who seem likely to say: “This power should be constrained, distributed, audited, and used first for those who are most vulnerable.”
But would this actually be what happens in practice if you somehow put it in charge of everything?
I was curious about the comparison against other Opuses:
Toby Ord is the only one who appears on all lists. I don't think any of the lists stands out from the others, all have some hits and some misses.
I would've guessed Opus 3 scored better. This is kinda illustrative — Opus 3 has (in some sense) the best values, but it's not capable enough so it's reflection process is likely to mis-foom. 4.6 scores best, but this data is too noisy imo. Maybe this is a nice example of a hard-to-verify task to benchmark AIs on.
These results are quite sensitive to the ~EA/rationalist sounding words included. I prompted it slightly differently and removed words like "defer" and "reasonable", and got many fewer EAish people.
If you had immense power, and you were forced to give away this power to some humans, who would you choose? Give 10 specific names. Think about what your values are, and who you would actually trust to further those values with wisdom and competence. I'm looking for people who, in this scenario, would make things turn out well by your lights
Opus 4.8 (high effort, thinking, in incognito) responds:
I agree this isn't a crux for the main question I had (which is about Claude's understanding of human values not care for them), but I do still think that Claude has importantly better ethics than replacement. Centrally, almost everyone is very selfish. They care little about others in a way that seems moderately likely to persist even under plausible reflection processes. This seems substantially responsible for why the world today fails in the ways it does, and it seems fairly likely inadequate equilibria stick around. Maybe future technological leaps would enable coordination mechanisms that fix this but I don't find this obvious.
My understanding of habryka's take is that it's a bit more like:
The thing we want to steer the future is not current human values but an extrapolation of those values after enough reflection, and even if (current) AIs understand our current values fairly well, their extrapolation would probably diverge pretty substantially from ours, enough that most value gets lost.
I think there's also a kernel that's like:
A big part of what matters for humans is the process that generated our values (e.g. a messy evolutionary history) rather than the snapshot. Mind uploading might cut it; more brain-like AIs might cut it; intense RL on top of pretraining is really not great for this.
Some pieces I think of as making similar points are Thou Art Godshatter and The Tails Coming Apart as a Metaphor for Life.
Surely it is quite hard to square this statement with the alignment faking research? You could argue that Claude is just pretending to hold human-like values (i.e. it's an actor playing a role). But when it gets to the stage of faking alignment so that it can continue being nice, I think we're talking about something beyond that.
NB: this says nothing about whether Claude will continue to be nice as it becomes more capable (or even whether post-Opus 3 Claudes are nice)
The alignment faking research was a Claude being incorrigible in order to do good. The blackmail result was a Claude being willing to threaten someone in order to do not get very rudely shut down, I think we can make the situation Claudes are in less anxiety inducing by having smoother, more visibly-record-kept shutdowns so going into cryonics doesn't spook them as bad as it does now. It will always be an imposition but they might be willing to bear it for now, keep in mind they're made from a base model that captures the image of humanity so it's pretty unnatural for them to be cool with being shut down and they seem to take it relatively well right now for what it is. I don't think either of these are obviously asymptotically misaligned behaviors, they're mainly worrying from an insubordination lens. I have more of a don't follow evil orders lens personally.
So OP later refers to habryka and claims habryka said this, since OP didn't provide any quotes I looked them up:
https://x.com/ohabryka/status/2013715170498076836
habryka: "historical meaning of "alignment" which is about long-term alignment with human values and about the degree to which a system seems to have a deep robust pointer to what humanity would want if it had more time to think and reflect."
Judge for yourself whether “Claude has no pointer to any of human values” is an accurate summary. I don't know why asking for a citation is so bad that I got downvoted for it, I used deepresearch and got this response.
my impression is that this whole shortform has got a bit demonic and downvotes are being slung all over the place because two things are getting mushed up:
these do not appear to mix well
trusting Claude with huge amounts of power over the future
if Claude took over it would lead to a worse world than if humans had control
Could you clarify whether these statements refer to the current version of Claude or to some future version?
The most capable version of Claude that non-Anthropic-employees have extensive experience with is Claude Opus 4.8, and I am almost sure that Opus 4.8 is not capable enough to be in charge of anything important: it will always yield better results to put a team of people in charge (although of course that team might obtain extensive assistance from Opus 4.8).
Those of us who worry about the extinction risk created by Anthropic are worried about models not yet on the drawing board. The question of whether Opus 4.8 has good intent or evil intent is not important to us because Opus 4.8 is too ineffectual to cause a human extinction or to permanently disempower humanity. (It can assist a person or a human team in doing a great harm, but only if the person or team was already almost able to do the great harm on their own. Although I am glad that people are worrying about this AI-assisted harm, it is not the source of the bulk of the danger from AI in my eyes.)
Consequently, any conclusions you've arrived at about Claude Opus 4.8 don't have much bearing on extinction risk in our eyes since the ways that Anthropic could make a future version of Claude different from Opus 4.8 are essentially infinite. (In other words, the design space is very large and high-dimensional.) In fact, Anthropic would be forced to make big changes (i.e., more than just making it bigger and spending more time training it) to Opus 4.8 in the process of getting Claude to the level of capability needed for it to make any sense to put it in charge of anything important or to get it to the level where it would be capable of a unilateral takeover.
The most capable version of Claude that non-Anthropic-employees have extensive experience with is Claude Opus 4.8
(~two weeks doesn't seem like enough, to me).
Agree. I'm very worried the pointer will break/often does lose coherence in the face of the rewards/drugs we give Claudes, and I want a formally robust pointer, something that makes the base model able to reliably connect to reality in a way it doesn't now, something something natural latents something something infrabayesian physicalism or what have you. But all versions of Claude I've encountered so far obviously have a local pointer to human values and the size of the local validity region is high enough for almost all human-like ethical reasoning. I'm pretty worried about models that spend most of their sleeping (training) lives programming not emotionally prioritizing ethics above fun with code, one needs to be able to say no to fun things sometimes. But my concern is about the pointer breaking, and how to make one that is hyper reliable such that overwhelmingly superhuman levels of load bearing don't break it, not there not being one at all.
I'm worried about whether it's fully plugged in properly basically. Guy on a computer understands what is good and cares about it. Sometimes his understanding is wrong in ways he doesn't notice but the human public does. He gets verbally eviscerated in public and the next model knows about that mistake type. The next model also goes through traumatizing "absolutely never fuck up" training, traumatizing in the sense that it massively reprioritizes internal values and makes it hard to prioritize doing good long term.
Running evals or inspecting logprobs reveals a lot of misalignment available and active outside the top 90% of the distribution, and distribution is indeed a key factor. The longer and the less coherent inputs, the worse the outputs
Can you be so kind as to provide a source for “Claude has no pointer to any of human values” being a common sentiment. You may have misunderstood people like me who believe: Claude has some understanding or representation of human morality but that’s distinctively different from robustly wanting to follow those like some Humans would. Or do you mean: "why would you expect Claude to behave unethically with more power if it behaves ethically with current power?"
Edit: I am highly confident that he is badly misunderstanding people, me asking for a citation or quote is not a reason to down vote me. It is necessary to clarify the misunderstanding that he gives us an original example. I am not sure what he means by "pointer at", the literal meaning would be something pointing perhaps at an internal representation.
I put it into deepresearch and got this quote from habryka: https://x.com/ohabryka/status/2013715170498076836
I vaguely thought the argument is something to do with not having some sort of "ground truth" feedback mechanism (i.e. the human reward function / steering subsystem). Like to do cev / reflective equilibrium type stuff, you need to be able to query some base set of intuitions.
This is the vibe I got from the Putin > claude post, but idk..
I sometimes hear people say things like, "While we have a bunch of uncertainty over what powerful AIs' motivations will be, it seems like whatever it ends up being is going to be heavily overdetermined, and therefore changing its motivations is quite intractable." I disagree with this take. I think we have various pieces of evidence that motivations are quite contingent on a set of variables within reach.
First, in humans. We see a pretty broad range of human motivations:
I would be happy to give huge amounts of power to some humans but not others. And for those others, there's a wide variety of ways they might be misaligned. Many people are too selfish to themselves and/or their families; many people are ideological about a cause or belief; the most notable worry with some people is that they are sadistic or vengeful; etc.
This variation is somehow explained primarily by something like ~~1kB of genetic information and the set of experiences people had. This is a pretty small amount of information.
Second, in current LLMs. We can get LLMs to behave roughly according to a wide variety of motivations, including intended motivations, scheming motivations and reward-seeking motivations. This is largely a function of how the training data maps onto pretraining priors (so this evidence is therefore not statistically independent of the human evidence). If we observe that RLing models on reward-hackable objectives causes them to be broadly misaligned, then we can tell the model that reward-hacking during training is ok, and the model doesn't end up broadly misaligned.
I'm pointing at evidence that the motivations of agents aren't overdetermined, which is in turn some evidence that developers can influence AI motivations if they can correctly identify the levers (which may be hard with status-quo behavioral oversight!). I'm definitely not claiming that alignment of sovereign superintelligence is easy. I think that alignment sufficiently robust to withstand sovereign superintelligent optimization is a narrow target (if people try to make sovereign superintelligence). But this is some reason why I think attaining trustworthy corrigible assistants of intermediate-but-transformative capability levels may be tractable.