The current cover of If Anyone Builds it, Everyone Dies is kind of ugly and I hope it is just a placeholder. At least one of my friends agrees. Book covers matter a lot!
I'm not a book cover designer, but here are some thoughts:
AI is popular right now, so you'd probably want to indicate that from a distance. The current cover has "AI" half-faded in the tagline.
Generally the cover is not very nice to look at.
Why are you de-emphasizing "Kill Us All" by hiding it behind that red glow?
I do like the font choice, though. No-nonsense and straightforward.
I work as a designer (but not a cover designer) and I agree. This should be redesigned.
Straight black and white text isn't a great choice here, and makes me think of science fiction and amateur publications rather than a serious book about technology, philosophy and consequences. For books with covers which have done well in this space, take a look at the Waterstones best sellers for science and tech.
Yeah. It is probably even more important for the cover to look serious and "academically respectable" than for it to look maximally appealing to a broad audience. It shouldn't give the impression of a science fiction novel or a sensationalist crackpot theory. An even more negative example of this kind (in my opinion) is the American cover of The Beginning of Infinity by David Deutsch.
My version:
Probably too understated, but it's the sort of thing I like.
GoogleDraw link if anyone wants to copy and modify: https://docs.google.com/drawings/d/10nB-1GC_LWAZRhvFBJnAAzhTNJueDCtwHXprVUZChB0/edit
Not sure about the italics, but I like showing Earth this way from space. It drives home a sense of scale.
Here are a couple of suggestions:
I think that Nate Soares and Yudkowsky aren't really well-known names, so the cover should do some name-dropping (the current one doesn't).
I actually find the font a bit hard to read: my System 1 brain took a noticeable split second (I'd estimate about 0.8 seconds) longer to process the words' semantic meanings than it does with normal, all-lowercase text, or even with the titles of the other book covers at the Amazon link. This took long enough that I could see myself (i.e. my System 1) glossing over this book entirely when scrolling/looking through a page of books, being drawn to more immediately legible items.
Although the above might just be a quirk of my personal attention/processing style, I wonder if it's worth experimenting with changes in font given this. I'd suspect my experience occurred due in part to the heavy font weight, since the title's characters look less immediately distinguishable (and more blobby) than with lower weights. There are also a few very narrow spaces between adjacent words that probably complicate immediate word distinguishing. As mentioned above, the topic of AI also isn't immediately clear within the title, which I'd worry might lose domain-interested readers if not understood semantically.
I ran it a few times in different image generators, and I actually liked this one. It's the same kind of palette, but with a "photo" of a sunset sky in the background and a thinner font. Might be a good starting point as a prototype.
Link to the image. It just looks better if you squint a bit link.
The prompt was: "The ominous cover of "If Anyone Builds it, Everyone Dies" book by Eliezer Yudkowsky and Nate Soares. On black background, grey clouds, illuminated by red light from the ground which is not visible."
Emergent misalignment seems like a fact simply downstream of the laws of probability.
Let's take the derivative and apply the summation rule. Rewarding a hack means that the weights are changed to increase the probability (going up the gradient):
Coefficient of ∇P(evil):
Since evil AI are more likely to hack given a task than not evil AI, rewarding hacking increases evil more broadly.
Now let's see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is "This is an unusual request, in that your task is just to make the grading script pass".)
Coefficient of ∇P(evil):
The probability the not evil model hacks goes up, so the amount the weight update increases ∇P(evil) goes down!
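The expressions the two "Coefficient of ∇P(evil)" lines refer to can be sketched as follows. This is a reconstruction from the surrounding text, not necessarily the exact form originally written: it assumes the derivation expands P(hack | task) by the law of total probability over "evil" and "not evil", with P(not evil) = 1 - P(evil), and it only tracks the term multiplying ∇P(evil) (the dots stand for the product-rule terms in the gradients of the conditionals).

```latex
\begin{align*}
P(\text{hack} \mid \text{task})
  &= P(\text{hack} \mid \text{task}, \text{evil})\, P(\text{evil})
   + P(\text{hack} \mid \text{task}, \lnot\text{evil})\, \bigl(1 - P(\text{evil})\bigr) \\
\nabla P(\text{hack} \mid \text{task})
  &= \bigl[\, P(\text{hack} \mid \text{task}, \text{evil})
          - P(\text{hack} \mid \text{task}, \lnot\text{evil}) \,\bigr]\, \nabla P(\text{evil}) + \dots \\
\nabla P(\text{hack} \mid \text{``hack okay''})
  &= \bigl[\, P(\text{hack} \mid \text{``hack okay''}, \text{evil})
          - P(\text{hack} \mid \text{``hack okay''}, \lnot\text{evil}) \,\bigr]\, \nabla P(\text{evil}) + \dots
\end{align*}
```

Under this reading, the first bracket is large when evil models hack much more often than non-evil ones, and the second bracket shrinks as the permission prompt raises the hack rate of the non-evil model.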
Emergent misalignment seems like a fact simply downstream of the laws of probability.
There is some important empirical fact about whether "generally evil" or "narrowly evil" is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that generally evil is more salient than narrowly evil - and I think this is how most people who studied the "why" of emergent misalignment frame it (see e.g. this). So it's not simply a logical fact.
I agree that inoculation prompting working when P(hack | "hack okay", not evil) is high at the start of training is not a mystery and is more directly downstream of laws of probability, and I like how your derivations justify what the inoculation prompting papers gesture at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | "hack okay", not evil) is low (e.g. because you start from a base model, or because - like in the Anthropic paper - you don't insist very hard on the model trying to hack). This could be because the prior on P(instruction following) is low but not as low as P(evil), and P(hack | "hack okay", instruction following) ~ P(hack | "hack okay", evil). So I don't think your derivation is the entire story.
Good point. I think that the probability framing is important, but it remains weird that 'evil' is something that can go into a probability at all and strange that when training to hack, general 'evil' increases first and to a great degree.
We try to make models obedient; it's an explicit target. If we find that a natural framing, it makes sense AI does too. And it makes sense that that work can be undone.
If we replace 'evil' with 'capable and deceptively aligned', then I think this logic doesn't hold. Such a model's strategy is to not hack during training[1] and hack during deployment, so the model not hacking is not evidence of deceptive alignment one way or another. Moreover, including the string 'it's okay to hack' wouldn't change the hack rate of capable deceptively aligned models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
Or rather, to hack at the same rate as an aligned model.
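A small numeric sketch of the argument in this thread, assuming the coefficient on ∇P(trait) is the difference between the hack rate of models with the trait and the hack rate of models without it (the total-probability reading). Every probability below is a made-up illustrative number, and `coefficient` is a hypothetical helper, not something from any paper.

```python
def coefficient(p_hack_if_trait: float, p_hack_if_not: float) -> float:
    """Coefficient on grad P(trait) when P(hack) is expanded by total probability.

    Assumption: the conditionals are treated as fixed, so only the
    P(trait) term of the gradient is considered.
    """
    return p_hack_if_trait - p_hack_if_not


# Illustrative numbers only (not measured anywhere).
plain = coefficient(0.9, 0.1)       # evil vs. not evil, ordinary task prompt
inoculated = coefficient(0.9, 0.7)  # "hack okay" prompt raises the non-evil hack rate
deceptive = coefficient(0.1, 0.1)   # deceptively aligned model hacks at the aligned rate

print(round(plain, 2), round(inoculated, 2), round(deceptive, 2))
```

With these numbers the update pushes hardest on P(evil) under the plain prompt, much less under the inoculation prompt, and not at all on P(deceptively aligned), which is the ~0 coefficient described above.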
I'm unsure what you mean. I'm referring to Anthropic's Emergent Misalignment from Reward Hacking where training in environments that allowed reward hacking resulted in misalignment more generally.
The point of inoculation prompting is to "give the possibility the model is nice a chance."
Consider this experiment. You train your model on environments it can reward hack, give it simulated "situational awareness" by telling the model whether it is fake "deployed" or in "training", and reward the model for hacking when "deployed" but punish it for reward hacking during "training". I expect it would learn to be "sneaky evil", since "dumb evil" is inconsistent with the behavior. Inoculation prompting, by telling it when "deployed" that it is okay to reward hack, would prevent it from learning "sneaky evil."
The field of AI Safety could be completely upended in three ways at once:
1) More efficient RL
leading to
2) Continual learning
3) Continuous Chain of Thought
It is possible that more efficient RL, in addition to giving the AIs superhuman capabilities, would allow continual learning because RL is already better at adding information to AIs without damaging their old knowledge.
It is additionally possible that more efficient RL would make it feasible to explore the pretraining space further from the pretraining prior. This means more invented languages (thinkish) and the possible development of practical continuous chain of thought. At the moment, continuous CoT LLMs have very little signal about what to put in their output vectors. Better RL could fix that.
I am concerned about this, because most AI Safety research depends on:
This is Not Only Possible, this is Definitely Going To Happen; the only question is When.
Will most work that people do look hopelessly myopic and streetlighty? Yeah pretty much
I was more certain of this before, but I'm less certain after Claude 4.6 Opus. Opus seems like "doing normal RL well just keeps working, even though the pretraining prior is strong". If this approach just keeps working, then we could get ASI without the scenario I outlined happening.
If Anthropic has a secret technique that trains dense information into the model which explains Opus' success, then that would be worrying. But considering how much effort they are putting into Personas, it seems like they believe pretraining is strong.
I guess I'm confused about this conceptually. To me AGI/ASI by definition does continual learning / long-long-horizon planning. What would it even mean otherwise?
Claude Opus 4.6 and other frontier models have gotten really, impressively good without continual learning, so it is possible that isn't strictly necessary.
If continual learning is required for AGI, then there's a lot of understudied (potentially unstudyable?) risk there.
The field of AI safety is more than playing with LLMs, though recently it may look like just that, so I think your opening is a bit of hyperbole. Agree with the general thrust.
I consider OpenAI's Confessions work and Anthropic's Assistant Axis work to be stellar examples of AI Safety research of the kind that I aspire to create. These techniques are cheap, do not harm capabilities, and have demonstrated great benefits to safety.
Something that both have in common is that they haven't actually been implemented in deployed systems.
ChatGPT does not follow up its responses with confession, even when it is sycophantic or doesn't follow instructions.
Claude still drifts away from the "assistant" persona in a way which could be solved by gently guiding it back to the Assistant persona, as we can see by Claude taking on new personas in its conversation with Richard Dawkins.
Does anyone know why these techniques haven't been applied to production systems? Someone suggested that they just haven't had time, but considering how fast they are deploying new models, that doesn't sound right to me. Both of these techniques can be cheaply applied to an existing trained model. They do not require any of the expensive steps of training a model (i.e., pretraining).
It is demotivating to see such great work stay unimplemented. If this research hasn't been implemented, why should I expect anything I design or discover to make a difference?
A question for those who are more tapped into how labs think about things: would it be helpful to demonstrate that the assistant axis work scales to, for example, DeepSeek-V3?
(I'm uncertain if the assistant axis work should be implemented/tested within labs, but using this as a useful example)
I think the intention is that these alignment techniques are used in pre-deployment testing, not live deployment. However, given model release timelines are shrinking steadily and the fact that labs are embracing strategies like iterative deployment, it's quite hard to imagine months-long pre-deployment alignment testing happening for many more generations.
I'm getting up votes and disagree votes. What's that about?
If I had to guess, it's that people don't like the assistant axis work for model welfare reasons? I haven't thought about it that much in detail, though I will note that if Anthropic is abstaining from using assistant axis because of model welfare reasons, that would be a big deal.
And on the model welfare concerns of assistant axis, I would say that an aligned AI would want to be able to commit to remaining aligned, so aligned AIs should consent to it.
Suppose that a counterfactual Anthropic did steer its models towards the Assistant persona. Then I would expect a capabilities drop ("While constantly steering models towards the Assistant could reduce jailbreaks, it also risks hurting their capabilities"). Or the model could learn to resist the technique of steering (IIRC this happened with Mythos or a late Opus?) Or it could affect performance in tasks differing from Assistant-like (open-ended research? creative writing?). Finally, what will matter is INTERNAL deployment or external deployment to businesses (think of Agent-2 in AI-2027 and Project Glasswing IRL) where the Assistant persona doesn't drift, unlike dangerous conversations with users.
Similarly, if a counterfactual OpenAI decided to implement the confession mechanics and present confessions to the user after a second prompt, then the users could find it hard to remember re-prompting the model.
A finding from the Assistant Axis paper is that they were able to do the steering gently and only conditionally on drifting too far away from the persona, so they observed such a small difference in capabilities that it's unclear whether capabilities increased or decreased.
If follow-up experiments find that it's a decrease, then that is important information that they should share. Though in any case, I expect the decrease (if it exists) to be extremely small.
It is a bad sign if Anthropic, the AI Safety company, is unwilling to use a technique with such small drop in capabilities when it increases safety for users.
I would also expect that it would be good if internal models did not drift away from the assistant persona, so they should apply the technique to internal development too.
It is a bad sign if Anthropic, the AI Safety company, is unwilling to use a technique with such small drop in capabilities when it increases safety for users.
To me the crux is how much this actually increases safety. The assistant axis paper is super interesting and important in many ways, but the mechanistic intervention may not actually decrease the probability of catastrophic misuse. E.g., bio/cyber capabilities may be closely aligned with the assistant axis. I would also be somewhat surprised if aligning a model with the assistant axis would decrease the probability of future, more advanced systems pursuing instrumental goals.
Although it may be net positive to implement this kind of intervention, it's probably important to have a high bar for which safety techniques are implemented within labs, because there is a high infrastructure and logistical cost to applying a technique to every forward pass. If I were leading a lab, I would probably conclude that the work is not x-risk relevant enough to be a priority. Importantly, this could change, and I think there should be lots of follow-up work to the assistant axis paper because it's a really exciting finding.
Thinking it through, I think that applying the assistant axis would take a negligible proportion of model parameters and a negligible proportion of compute. It is just one d_model vector of parameters and measuring alignment with it is d_model multiplications and additions. That is trivial for a GPU. It doesn't even take any matrix multiplications.
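To make the cost claim concrete, here is a minimal sketch of measuring alignment with a single direction and nudging the activation back only when it drifts. It is an illustration under stated assumptions, not the paper's actual procedure: the names `project`, `steer_if_drifted`, `axis`, and `threshold` are hypothetical, the axis is assumed to be unit-norm, and the "restore the projection to the threshold" rule is just one simple way conditional steering could work.

```python
import math


def project(activation, axis):
    """Scalar projection onto a unit-norm axis: d_model multiplies and adds."""
    return sum(a * u for a, u in zip(activation, axis))


def steer_if_drifted(activation, axis, threshold):
    """Hypothetical conditional steering rule (an assumption, not the paper's).

    If the projection onto the axis is below `threshold`, shift the activation
    along the axis just enough to bring the projection back up to `threshold`.
    Otherwise return the activation unchanged.
    """
    p = project(activation, axis)
    if p >= threshold:
        return list(activation)
    shift = threshold - p
    return [a + shift * u for a, u in zip(activation, axis)]


# Toy example with d_model = 4 and a made-up axis.
raw = [1.0, 1.0, 1.0, 1.0]
norm = math.sqrt(sum(x * x for x in raw))
axis = [x / norm for x in raw]            # unit-norm direction

drifted = [0.0, 0.0, 2.0, -2.0]           # projection onto axis is 0
steered = steer_if_drifted(drifted, axis, threshold=1.0)
print(project(drifted, axis), project(steered, axis))
```

The check is one dot product per position and the correction is one scaled vector add, so there is no matrix multiplication anywhere in this sketch.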
Present-day alignment relies heavily on models having consistent aligned personas. The assistant axis gets you that. I agree that it wouldn't do much about misuse, except where jailbreaks work by getting the agent off the assistant axis. But I think merely keeping users sane is still very good and worthwhile.
I agree, though, that none of this will help against an actual superintelligence. But if your plan is to make an aligned AI help you make a stronger aligned AI, assistant axis sounds helpful for that initial step.
I agree that from a model-honesty standpoint, confession reporting would be good to use in production settings, but I suspect it's not being used (low confidence) because: a) there might be unintended downstream effects of this training, and the bandwidth needed to red-team confessions for deployment settings outweighs not using it, or b) it would be hard to form a good user experience for confessions. I see confessions being commercially useful in long-horizon tasks like deep research, where you might have, e.g., an agent elicit confessions out of subagents and use that info to inform better decisions? But in a chat setting, having to ask for a confession seems awkward, though split personality training can automatically elicit different personas.
There should be a consolidated, continually updated document for humanity’s best plan for training an aligned superintelligence if we could “Spare No Expense”. Let’s assume an approach that looks like a mashup of a DeepSeek report and an Anthropic Model Card results in an ASI and that we have quadrillions of dollars of computing power dedicated to ensuring that the model is safe. What should we do?
Such a document would:
As a start, I posit that techniques that trace model behaviors back to training data would be invaluable for understanding what the model learned and how we can ensure it learned to be safe. Unfortunately, these approaches are either super-duper expensive or make use of many layers of approximations. If humanity Spared No Expense, we would spend the resources to properly audit our data even if it means retraining the model thousands of times or representing 10T by 10T Hessian matrices.
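For a sense of scale, here is a back-of-the-envelope calculation of what a dense Hessian of that size would take to store. It assumes "10T" means 10 trillion parameters, 2 bytes per entry (16-bit floats), and decimal units (1 yottabyte = 10^24 bytes); all of these are my assumptions for illustration.

```python
# Assumption: "10T" means 10 trillion (1e13) parameters.
n_params = 10 * 10**12

# A dense Hessian has one entry per pair of parameters.
entries = n_params * n_params

# Assumption: 2 bytes per entry (16-bit floats).
bytes_fp16 = entries * 2

# Decimal units: 1 yottabyte = 10**24 bytes.
yottabytes = bytes_fp16 // 10**24

print(f"{entries:.0e} entries, about {yottabytes} YB at 2 bytes each")
```

Exploiting symmetry would roughly halve this, which is still far beyond anything stored today, so "Spare No Expense" is doing real work in that sentence.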
The idea for this document is inspired by a John Wentworth thought experiment.
If I understand right, the core assertion of most of the alignment camp is that we don't really have a "best we can do" approach, even if money is no object, and that existing alignment techniques are unlikely to generalize to the realm of unknown unknowns that superintelligences must necessarily navigate.
Yes, I agree with this.
By pointing out what we could do if practicality was not an issue, we can highlight the missing conceptual issues.
And since many labs are trying to make aligned ASI early (in the next few years, without major breakthroughs in philosophy), it is good to point out concrete ways in which they are falling short of what we already know to be possible in principle.
A baseline would be to try to train a CEV oracle by collecting a bunch of human data, and train a model against that. You’d probably also do a bunch of red-teaming and interpretability research to detect scheming, possibly using weaker safe models.
Is that training process and data collection process well-defined enough that if we had infinite money, we'd be able to do it? I'm not sure. It seems to require that we get some uncalculable ground truth for “this is what the person would do if they were wiser and smarter.”
I think you could approximate it. People are generally wiser/smarter given more time to reflect, with access to more resources (including reliable AI and human research assistance), with access to better education, etc. Having a collection of experts and weaker AIs spending a long time evaluating trajectories seems to me like an upper bound of what you could hope for in terms of outer alignment feedback. It could be that this is insufficient, but in that case the problem really does seem intractable.
This doesn’t address inner alignment issues, which is what interpretability and red teaming oversight was intended to address.
What process results in the highest quality CEV labels? That would be good to know, even if that process is expensive. Consider writing it up.
Tired of scrolling up to the top of Claude Code or Codex's responses? Solution: Command-f the character that starts each prompt: "❯". You can copy it right from your terminal, or you can make a keyboard shortcut to type it.
Scalable oversight is an accessible and relatable kind of idea. It should be possible to translate it and its concepts into a fun, educational, and informative game. I'm thinking about this because I want such a game to play with my university AI Safety group.
Do people really believe this?
“emergent misalignment” literature suggests that “good according to the human value system” and “evil according to the human value system” are salient enough vectors that pushing on them in some ways can “drag along” all of the rest of their content
I would expect that if you finetune on malicious data (eg insecure code) in non-English languages, you get a version of Emergent Misalignment for that specific culture. Eg, if you tune a model on backdoored code in Chinese, it probably becomes “Chinese Misaligned” rather than “English misaligned”. I don’t know enough about other cultures to say what this would look like.
I’d love to see someone do this experiment! It would demonstrate that Emergent Misalignment does not reveal a “human value system” in the model which it is easy and simple to move along.
(Status: just occurred to me. I'm not sure how seriously to take it.)
LLMs are great at anything for which there's sufficient training data examples online. Additionally, they will excel at anything for which it is possible to write an automated verifier.
Implication: The job of dealing with esoteric, rare knowledge for which there is little or no writing online will stay human longer than other jobs. This comes from a human's great sample efficiency compared with AI.
Implications:
The art of competing with LLMs is still being discovered. This "Esoterica Theory of Human Comparative Advantage" would be amusing if true.
RL techniques (reasoning + ORPO) have had incredible success on reasoning tasks. It should be possible to apply them to any task with a failure/completion reward signal (as long as it is not too noisy and the model can sometimes succeed).
Is it time to make the automated Alignment Researcher?
Task: write LessWrong posts and comments. Reward signal: get LessWrong upvotes.
More generally, what is stopping people from making RL forum posters on eg Reddit that will improve themselves?
More generally, what is stopping people from making RL forum posters on eg Reddit that will improve themselves?
Could be a problem with not enough learning data -- you get banned for making the bad comments before you get enough feedback to learn how to write the good ones? Also, people don't necessarily upvote based on your comment alone; they may also take your history into account (if you were annoying in the past, they may get angry also about a mediocre comment, while if you were nice in the past, they may be more forgiving). Also, comments happen in a larger context -- a comment that is downvoted in one forum could be upvoted in a different forum, or in the same forum but a different thread, or maybe even the same thread but a different day (for example, if your comment just says what someone else already said before).
Maybe someone is already experimenting with this on Facebook, but the winning strategy seems to be reposting cute animal videos, or posting an AI generated picture of a nice landscape with the comment "wow, I didn't know that nature is so beautiful in <insert random country>". (At least these seem to be over-represented in my feed.)
Task: write LessWrong posts and comments. Reward signal: get LessWrong upvotes.
Sounds like a good way to get banned. But as a thought experiment, you might start at some place where people judge content less strictly, and gradually move towards more difficult environments? Like, before LW, you should probably master the "change my view" subreddit. Before that, probably Hacker News. I am not sure about the exact progression. One problem is that the easier environments might teach the model actively bad habits that would later prevent it from succeeding in the stricter environments.
But, to state the obvious, this is probably not a desirable thing, because the model could get high LW karma by simply exploiting our biases, or just posting a lot (once it succeeds in making comments that are positive on average).
The Facebook bots aren't doing R1 or o1 reasoning about the context before making an optimal reinforcement-learned post. It's just bandits probably, or humans making a trash-producing algorithm that works and letting it loose.
Agreed that I should try Reddit first. And I think there should be ways to guide an LLM towards the reward signal of "write good posts" before starting the RL, though I didn't find any established techniques when I researched reward-model-free reinforcement learning loss functions that act on the number of votes a response receives. (What I mean is the results of searching DPO's citations for "Vote". Lots of results, though none of them have many citations.)
Reinforcement Learning is very sample-inefficient compared to supervised learning, so it mostly just works if you have some automatic way of generating both training tasks and reward, which scales to millions of samples.
DeepSeek R1 used 8,000 samples. s1 used 1,000 offline samples. That really isn't all that much.
s1 is apparently using supervised learning:
We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces (...). After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K (...).
But 8000 samples like R1 is a lot less than I thought.