Certainly I don't expect people who already have jobs at top AI companies to start worrying about this. Anyone with an OpenAI job is probably already at, or close to, the top of their chosen status ladder. In the same way, a researcher who gets Nature papers regularly has already made it.
My impression is that too many people in the lab-independent AI safety ecosystem are already being tempted by two money-status games: the money and status of working at a lab, and the status of getting top ML conference papers. Adding a third status game of traditional journal publishing would just make these dynamics worse.
Yes, I had not quite considered the conferences! I come from a field where Nature is mostly spoken of in hushed tones. The last three years of my lab work are being combined into a single paper which we're very, very ambitiously submitting to Nature Communications,[1] which, if we succeed, would be considered a very good outcome for me. If I were to achieve a single first-author Nature publication at this stage in my career, I expect I would have decent odds of an academic career for life.[2]
As you might expect, this causes absolutely awful dynamics around high-impact publications, and basically makes people go mad. Nature and its subsidiaries can literally make people wait a whole year for peer review, because people will do anything to get published in them. A retraction of a published paper is considered career-ending by some, so I personally know of two pieces of research which have stayed on the public record even though one was directly plagiarised and the other contained results which were simply false. When these were discovered, everyone kept quiet, because speaking up would ruin the careers of everyone whose names were on the papers.
Nature Comms is for papers which aren't good enough for Nature, or in my case, Nature Chemistry either. Nature is the main journal, with different subsidiaries for different areas, and Nature Comms is supposed to be for work too short---and too mediocre---to make it into Nature proper. In practice it's mostly more mediocre work that's cut to within an inch of readability, rather than short but extremely good work.
Of course I would have to keep working very hard, but a Nature publication would likely get me into a fellowship in a lab which gets regular top-journal publications, and from there getting a permanent position would be as achievable as it gets.
I agree with most of what you said here; I also think your treatment of the problem is better than the original confession report!
(I did the ELK contest at the time, but I didn't win any money, so my understanding may be subject to reasonable doubt)
That being said, there's a difference between noise and bias in AI training data. ELK isn't worried about noisy signals, but biased signals. LLMs are very resistant to noise in training, but not bias. For example, LLM RLHF does cause LLMs to pick up on biases in the training data.[1] A good example is gendered bias in relationship advice, wherein LLMs were more sympathetic when a "boyfriend" was mentioned as opposed to a "girlfriend".[2]
The reason for this is that the ELK problem is not about a distinction between "manipulate" and "protect", it's about a distinction between "simulate what a human would say, having read the output" and "tell the truth about my own internal activations". In any situation where the "truth" persona gets upvoted, the "simulate" persona also gets upvoted, AND there are scenarios where the "truth" persona gets downvoted while the "simulate" persona gets upvoted. This is different from having noisy labels which sometimes push your model in the wrong direction; in this case the problem is more like having a bias away from "truth" and towards "simulate". Your only hope is that the "truth" persona started out with more weight than the "simulate" one.
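To make the noise-versus-bias distinction concrete, here's a toy sketch (my own made-up numbers, not anything from the ELK report): under symmetric label noise the average update still points towards "truth", whereas under the bias described above "simulate" is rewarded at least as often as "truth" in every episode, so the relative update never favours "truth" no matter how much data you average over.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # toy training episodes

# Noise: the reward for the "truth" persona is correct 90% of the time and
# flipped 10% of the time. The average update still points towards truth.
truth_reward_noisy = np.where(rng.random(n) < 0.9, +1.0, -1.0)
print("mean update under noise:", truth_reward_noisy.mean())  # ~= +0.8

# Bias: whenever "truth" is rewarded, "simulate" (saying what the human
# expects) is rewarded too, and in 10% of episodes "truth" is punished while
# "simulate" is still rewarded. The relative update never favours truth.
truth_reward = np.where(rng.random(n) < 0.9, +1.0, -1.0)
simulate_reward = np.ones(n)
print("mean (truth - simulate):", (truth_reward - simulate_reward).mean())  # ~= -0.2
```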
Which personas/circuits get upvotes and downvotes during which parts of training is an extremely subtle and difficult topic to work with. You might argue that the "truth" persona will start off with an advantage, since it's a specific example of good behaviour, which is generally RLed into the model. On the other hand, you might argue that the specific task of "look at my own activations and tell the truth about them" is not something which really ever comes up during RLHF, while "simulate what a human would say, having read the preceding text" is a huge chunk of the pretraining objective.[3]
Either way I expect this to be one of those things which naturally gets worse over time without specific mitigations (like reward hacking/specification gaming/aggressively pursuing whatever seems to be the current RLVR objective) if you just keep scaling up confession training. Since it involves deception, it's also a case where the worse the problem gets, the harder it is to catch. Not good!
Originally I was going to use the Nigerian explanation for the "delve" example but NEVER MIND I GOT CLAUDE TO LOOK THAT UP AND IT'S JUST ALL MADE UP! THE GUARDIAN ARTICLE WHICH STARTED IT ONLY INTERVIEWED PEOPLE FROM KENYA AND UGANDA, AND THERE'S NOT EVEN ANY EVIDENCE THAT ANY PARTICULAR VARIETY OF ENGLISH OVERUSES THE SAME WORDS THAT LLMS LOVE TO USE.
https://arxiv.org/html/2505.13995v2
The analogy being truth:simulator::good-relationship-advice:redditor-simulator. Giving good relationship advice is probably rewarded maybe 80% of the time, but giving an exact simulation of what a redditor would say about a relationship question is rewarded 100% of the time. Overall, the LLM learns to become a redditor-simulator rather than a giver of good relationship advice.
At the end of the day this is simply not a serious alignment proposal put forward by people who are seriously thinking about alignment. This entire approach is (mostly) a rediscovery of the starting point for the ELK contest from four years ago; the authors have not even considered the very basic problems with this approach, problems which Christiano et al. pointed out at the time, because that was the point of the contest!
Firstly, well done. Publishing in high impact journals is notoriously difficult. Getting outsider-legible status is probably good for our ability to shift policy.
Secondly, I've been very happy with the AI safety community's ability to avoid this particular status game so far. Insofar as it's valuable to be legibly successful to outsiders, publishing is good. I am, however, concerned that this will kick off a scenario where AI safety people get caught up in the zero-sum competition of trying to get published in high-impact journals. Nobody seems to have tried too hard before, and I would guess this paper was partly riding on the fact that the entire field is fairly novel to the Nature editors. Some part of me feels like publishing here was a defection, albeit a small one. I expect it will only get more difficult to publish in famous journals from here on out, as they see more and more AI safety papers (for people other than Owain, that is: once you have one Nature paper it's easier to get a second).
I think it would be very bad for everyone if "has published in a high-impact journal" becomes a condition to get a job or a grant.
Time-telling: ~$10 Casio through to a ~$100k Rolex
This is an odd one out: both of them will tell you the time about equally well (as indeed will any phone). The purpose of a Rolex is fashion and/or status signalling.
The transport one includes "a custom Bentley" which is probably also mostly status signalling compared to a Range Rover or Tesla.
Most of the others seem to be material upgrades (e.g. a private chef lets you have tasty food every day, exactly to your tastes).
A $100k iPhone would struggle to set itself apart from a merely "good" iPhone in terms of material quality. (The standard way to scale up computers is usually by adding more computer, which quickly becomes too big to fit in one's pocket.) I suspect the most relevant comparison would be the Rolex: it would do the same job a bit better, but mostly be for status signalling. Currently, nobody wants to status signal with their iPhone, so they choose not to.
Hmm, that's interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to do were that simple then perhaps it isn't as realistic.
I am confused. I thought that the Anthropic scenario was essentially a real example of reward hacking, which heavily implies that it's realistic. In what ways are your setups different from that?
I expect the difference comes from using SFT with cross-entropy loss as opposed to RL with something like a PPO/RPO loss. These are two very different beasts. On a super abstract level and for no particular reason I think of the RL losses as pushing in a particular direction in a space (since they act on logits directly) while cross-entropy loss pushes towards a particular location (since it acts on probabilities).
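To illustrate the shape of the difference (a minimal PyTorch sketch with made-up numbers, not their actual training setup): the cross-entropy loss pulls the whole output distribution towards fixed target tokens, while a REINFORCE-style loss (the simplest relative of PPO) just pushes the log-probabilities of sampled tokens up or down in proportion to an advantage.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 8, requires_grad=True)  # toy batch of 4, "vocab" of 8

# SFT: cross-entropy pulls the predicted distribution towards fixed targets.
targets = torch.tensor([1, 3, 0, 5])
sft_loss = F.cross_entropy(logits, targets)

# RL (REINFORCE-style): sample tokens, then push their log-probabilities up or
# down in proportion to a (made-up) advantage; nothing pins down where the
# distribution should end up, only which direction it gets pushed.
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
advantages = torch.tensor([1.0, -0.5, 2.0, -1.0])
rl_loss = -(dist.log_prob(actions) * advantages).mean()

print(sft_loss.item(), rl_loss.item())
```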
I was going to write a post called "Deep Misanthropy" earlier this year, about roughly this phenomenon. After some thought I concluded that "You dislike x% of other people" is a consistent way the world can be for all values of x between 0 and 100, inclusive.
All I can say is:
Epistemic status: kinda vibesy
My general hypothesis on this front is that the brain's planning modules are doing something like RL as inference, but that they're sometimes a bit sloppy about properly labelling which things are and aren't under their own control.
To elaborate: in RL as inference, you consider a "prior" over some number of input -> action -> outcome loops, and then perform a Bayesian-ish update towards outcomes which get high reward. But you have to constrain your update to only change P(action | input) values, while keeping the P(outcome | action) and P(input | outcome) values the same. In this case, the brain is sloppy about labelling and labels P(tired | stay up) as something it can influence.
This might happen because of some consistency mechanism which tries to mediate between different predictors. Perhaps if it gets one system saying "We will keep playing video games" and another saying "We mustn't be tired tomorrow" then the most reasonable update is that P(tired | stay up) is, in fact, influenceable.
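Here's a crude numerical sketch of the failure mode I have in mind (my own illustration; all numbers are invented, and the exponentiated-reward reweighting is only a stand-in for a proper control-as-inference posterior): the correct move is to reweight only P(action), but a "sloppy" planner also reweights P(tired | stay up) towards the pleasant outcome, as if tiredness were something it could simply decide.

```python
import numpy as np

# Actions: [stay_up, sleep]; every number below is invented for illustration.
P_action = np.array([0.5, 0.5])   # P(stay_up), P(sleep): under the planner's control
P_tired = np.array([0.9, 0.1])    # P(tired | action): NOT under its control
R_action = np.array([1.0, 0.0])   # video games are fun right now
R_tired = -2.0                    # being tired tomorrow is bad

# Correct update: reweight ONLY the action distribution by exponentiated
# expected reward, keeping the outcome model fixed.
value = R_action + P_tired * R_tired
new_P_action = P_action * np.exp(value)
new_P_action /= new_P_action.sum()   # ~= [0.35, 0.65]: mostly choose to sleep

# "Sloppy" update: also reweight the outcome distribution for stay_up towards
# the high-reward outcome, as if tiredness were influenceable.
outcome_probs = np.array([P_tired[0], 1 - P_tired[0]])  # [tired, not tired | stay_up]
outcome_reward = np.array([R_tired, 0.0])
sloppy = outcome_probs * np.exp(outcome_reward)
sloppy /= sloppy.sum()               # ~= [0.55, 0.45]: "it probably won't make me tired"

print(new_P_action, sloppy)
```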