

Thanks for the detailed comment!

Yep, we definitely want training-based versions of the same experiments - this is something that's actively on my mind.

Responding to your bullet points:

  • I think it's not very important what the exact frequency is - it just doesn't tell you much. What does matter are the trends - whether prompt A has a higher frequency than prompt B - and for this it indeed is awkward that the frequencies are small. There I suggest that one doesn't read too much into the specific numbers, but just think "these prompts gave DMGI-results, those ones didn't".
    • (I currently don't think the specific numbers are that interesting; I also don't have unlimited budgets, even if I could increase sample sizes somewhat if needed.)
  • I tried to be "conservative" in the sense of "only the most obviously bad completions count", though from a risk-management perspective one wants to be conservative in the sense of "anything suspicious counts". So I definitely don't mean that DMI-completions aren't bad, and it's reasonable to count those as getting full marks. (My hunch is that P(DMI) and P(DMGI) correlate quite strongly among prompts, so it doesn't matter too much.)
  • I have mixed thoughts on the relevance of faithfulness here. On one hand, yep, I'm looking at the CoTs and saying "this one is deceptive, that one is not", and faithfulness seems relevant there. On the other hand: faithful or not, there is some sense in which the model is doing instrumental deceptive reasoning and acting on it. The latter point feels quite strong to me currently.
    • Would be interested in talking more about this; I could see myself changing my mind here.
  • I'll try to gather all of the responses and put them in the git repo. In the meantime, you can already find a good chunk of the responses there (and there should be some for all categories mentioned in the post except for the "Later experiments").
  • Not sure what my reasoning was here, I'll try out this variation, and overall look more into "does the model think about oversight?"
  • (No particular thoughts here, except yep, this makes things harder, and thanks for flagging it.)
  • My motivation for human-CoTs was "let's start by constructing something that definitely should work (if anything does)". I completely agree that model-written CoTs are the way to go, and this is probably the single most important thing I'd change if I rewrote the post: just use model-written CoTs in all experiments. This doesn't really matter, though - it doesn't show in the post, but I've done a lot of testing with them, they give largely the same results, and I basically don't use human-written CoTs anymore.
  • "The best model of a thing is the thing itself", goes the saying :)

A final comment on analogies to full deceptive alignment: Given that there's quite a bit of fine-tuning (both supervised and RL) being done after pre-training (and this amount likely increases over time), deceptive alignment arising in pre-training isn't the only worry - you have to worry about it arising in fine-tuning as well! Maybe someone does a lot of fine-tuning for economically valuable tasks such as acquiring lots of gold coins and this then starts the deceptive alignment story.

In any case, it's extremely important to know the circumstances under which you get deceptive alignment: e.g. if it turns out to be the case that "pre-trained models aren't deceptively aligned, but fine-tuning can make them be", then this isn't that bad as these things go, as long as we know this is the case.

Interesting perspective!

I would be interested in hearing answers to "what can we do about this?". Sinclair has a couple of concrete ideas - surely there are more. 

Let me also suggest that improving coordination benefits from coordination. Perhaps there is little a single person can do, but is there something a group of half a dozen people could do? Or two dozen? "Create a great prediction market platform" falls into this category; what else?

I read this a year or two ago, tucked it in the back of my mind, and continued with life.

When I reread it today, I suddenly realized oh duh, I’ve been banging my head against this on X for months


This is close to my experience. Constructing a narrative from hazy memories:

First read: "Oh, some nitpicky stuff about metaphors, not really my cup of tea". *Just skims through*

Second read: "Okay it wasn't just metaphors. Not that I really get it; maybe the point about different people making different amounts of distinctions is good"

Third read (after reading Screwtape's review): "Okay well I haven't been 'banging my head against this', but examples of me making important-to-me distinctions that others don't get do come to mind". [1] (Surely that goes just one way, eh?)

Fourth read (now a few days later): "The color blindness metaphor fits so well to those thoughts I've had; I'm gonna use that".

Reading this log one could think that there's some super deep hidden insight in the text. Not really: the post is quite straightforward, but somehow it took me a couple of rounds to get it.

  1. ^

    If you are interested: one example I had in mind was the distinction between inferences and observations, which I found more important in the context than another party did.

I view this post as providing value in three (related) ways:

  1. Making a pedagogical advancement regarding the so-called inner alignment problem
  2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong
  3. Pushing for thinking mechanistically about cognition-updates


Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused.

Some months later I read this post and then it clicked.

Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles' exposition skills that's a bit overwhelming.

Another part I liked were the phrases "Just because common English endows “reward” with suggestive pleasurable connotations" and "Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater." One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.


Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view.

I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It's the former view that this post (correctly) argues against. I am sympathetic to pushback of the form "there are arguments that make it reasonable to privilege reward-maximization as a hypothesis" and about this post going a bit too far, but these remarks should not be confused with a rebuttal of the basic point of "cognition-updates are a completely different thing from terminal-goals".

(A part that has bugged me is that the notion of maximizing reward doesn't seem to be even well-defined - there are multiple things you could be referring to when you talk about something maximizing reward. See e.g. footnote 82 in the Scheming AIs paper (page 29). Hence taking it for granted that reward is maximized has made me confused or frustrated.)


Re 3: Many of the classical, conceptual arguments about AI risk talk about maximums of objective functions and how those are dangerous. As a result, it's easy to slide to viewing reinforcement learning policies in terms of maximums of rewards.

I think this is often a mistake. Sure, to first order "trained models get high reward" is a good rule of thumb, and "in the limit of infinite optimization this thing is dangerous" is definitely good to keep in mind. I still think one can do better in terms of descriptive accounts of current models, and I think I've got value out of thinking in terms of cognition-updates instead of models that maximize reward as well as they can with their limited capabilities.

There are many similarities between inner alignment and "reward is not the optimization target". Both are sazens, serving as handles for important concepts. (I also like "reward is a cognition-modifier, not terminal-goal", which I use internally.) Another similarity is that they are difficult to explain. Looking back at the post, I felt some amount of "why are you meandering around instead of just saying the Thing?", with the immediate next thought being "well, it's hard to say the Thing". Indeed, I do not know how to say it better.

Nevertheless, this is the post that made me get it, and there are few posts that I refer to as often as this one. I rank it among the top posts of the year.

Part 4/4 - Concluding comments on how to contribute to alignment

In part 1 I talked about object-level belief changes, in part 2 about how to do research and in part 3 about what alignment research looks like.

Let me conclude by saying things that would have been useful for past-me about "how to contribute to alignment". As in past posts, my mode here is "personal musings I felt like writing that might accidentally be useful to others".

So, for me-1-month-ago, the bottleneck was "uh, I don't really know what to work on". Let's talk about that.

First of all, experienced alignment researchers tend to have plenty of ideas. (Come on, me-1-month-ago, don't be surprised.) Did you know that there's this forum where alignment people write out their thoughts?

"But there's so much material there", me-1-month-ago responds.

(What kind of excuse is that?) Okay, so how research programs work is that you have some mentor and you try to learn stuff from them. You can do a version of this alone as well: just take some researcher you think has good takes and go read their texts.

No, I mean actually read them. I don't mean "skim through the posts", I mean going above and beyond here: printing the text on paper, going through it line by line, flagging down new considerations you haven't thought before. Try to actually understand what the author thinks, to understand the worldview that has generated those posts, not just going "that claim is true, that one is false, that's true, OK done".

And I don't mean reading just two or three posts by the author. I mean like a dozen or more. Spending hours on reading posts, really taking the time there. This is what turns "characters on a screen" to "actually learning something".

A major part of my first week in my program involved reading posts by Evan Hubinger. I learned a lot. Which is silly: I didn't need to fly to the Bay to access those posts. But, well, I have a printer and some "let's actually do something ok?" attitude here.

Okay, so I still haven't given a list of Concrete Projects To Work On. The main reason is that going through the process above kind of results in that. You will likely see something promising, something fruitful, something worthwhile. Posts often have "future work" sections. If you really want explicit lists of projects, then you can unsurprisingly find those as well (example). (And while I can't speak for others, my guess is that if you really have understood someone's worldview and you go ask them "is there some project you want me to do?", they just might answer you.)

Me-from-1-month-ago would have had some flinch reaction of "but are these projects Real? do they actually address the core problems?", which is why I wrote my previous three posts. Not that they provide a magic wand which waves away this question; rather, they point out that past-me's standard for what counts as Real Work was unreasonably high.

And yeah, you very well might have thoughts like "why is this post focusing on this instead of..." or "meh, that idea has the issue where...". You know what to do with those.

Good luck!

Part 3/4 - General uptakes

In my previous two shortform posts I've talked about some object-level belief changes about technical alignment and some meta-level thoughts about how to do research, both of which were prompted by starting in an alignment program.

Let me here talk about some uptakes from all this.

(Note: As with previous posts, this is "me writing about my thoughts and experiences in case they are useful to someone", putting in relatively low effort. It's a conscious decision to put these in shortform posts, where they are not shoved to everyone's faces.)

The main point is that I now think it's much more feasible to do useful technical AI safety work than I previously thought. This update is a result of realizing both that the action space is larger than I thought (this is a theme in the object-level post) and that I have been intimidated by the culture in LW (see meta-post).

One day I heard someone saying "I thought AI alignment was about coming up with some smart shit, but it's more like doing a bunch of kinda annoying things". This comment stuck with me.

Let's take a concrete example. Very recently the "Sleeper Agents" paper came out. And I think both of the following are true:

1: This work is really good.

For reasons such as: it provides actual non-zero information about safety techniques and deceptive alignment; it's a clear demonstration of failures of safety techniques; it provides a test case for testing new alignment techniques and lays out the idea "we could come up with more test cases".

2: The work doesn't contain a 200 IQ godly breakthrough idea.

(Before you ask: I'm not belittling the work. See point 1 above.)

Like: There are a lot of motivations for the work. Many of them are intuitive. Many build on previous work. The setup is natural. The used techniques are standard.

The value is in stuff like combining a dozen "obvious" ideas in a suitable way, carefully designing the experiment, properly implementing the experiment, writing it down clearly and, you know, actually showing up and doing the thing.

And yep, one shouldn't hindsight-bias oneself to think all of this is obvious. Clearly I myself didn't come up with the idea starting from the null string. I still think that I could contribute, to have the field produce more things like that. None of the individual steps is that hard - or, there exist steps that are not that hard. Many of them are "people who have the competence to do the standard things do the standard things" (or, as someone would say, "do a bunch of kinda annoying things").

I don't think the bottleneck is "coming up with good project ideas". I've heard a lot of project ideas lately. While not all of them are good, in absolute terms many of them are. Turns out that coming up with an idea takes 10 seconds or 1 hour, and then properly executing that requires 10 hours or 1 full-time-equivalent-year.

So I actually think that the bottleneck is more about "we have people executing the tons of projects the field comes up with", at least much more than I previously thought.

And sure, for individual newcomers it's not trivial to come up with good projects. Realistically one needs (or at least I needed) more than the null string. I'll talk about this more in my final post.

Yeah, I definitely grant that there are insights in the things I'm criticizing here. E.g. I was careful to phrase this sentence in this particular way:

The talk about iterative designs failing can be interpreted as pushing away from empirical sources of information.

Because yep, I sure agree with many points in the "Worlds Where Iterative Design Fails". I'm not trying to imply the post's point was "empirical sources of information are bad" or anything. 

(My tone in this post is "here are bad interpretations I've made, watch out for those" instead of "let me refute these misinterpreted versions of other people's arguments and claim I'm right".)

Part 2 - rant on LW culture about how to do research

Yesterday I wrote about my object-level updates resulting from me starting on an alignment program. Here I want to talk about a meta point on LW culture about how to do research.

Note: This is about "how things have affected me", not "what other people have aimed to communicate". I'm not aiming to pass other people's ITTs or present the strongest versions of their arguments. I am rant-y at times. I think that's OK and it is still worth it to put this out.

There's this cluster of thoughts in LW that includes stuff like:

"I figured this stuff out using the null string as input" - Yudkowsky's List of Lethalities

"The safety community currently is mostly bouncing off the hard problems and are spending most of their time working on safe, easy, predictable things that guarantee they’ll be able to publish a paper at the end." - Zvi modeling Yudkowsky

There are worlds where iterative design fails

"Focus on the Hard Part First"

"Alignment is different from usual science in that iterative empirical work doesn't suffice" - a thought that I find in my head.


I'm having trouble putting it into words, but there's just something about these memes that's... just anti-helpful for doing research? It's really easy to interpret the comments above as things that I think are bad. (Proof: I have interpreted them in such a way.)

It's this cluster that's kind of suggesting, or at least easily interpreted as saying, "you should sit down and think about how to align a superintelligence", as opposed to doing "normal research".

And for me personally this has resulted in doing nothing, or something just tangentially related to preventing AI doom. I'm actually not capable of just sitting down and deriving a method for aligning a superintelligence from the null string.

(To which one could respond with "reality doesn't grade on a curve", or that one is "frankly not hopeful about getting real alignment work" out of me, or other such memes.)

Leaving aside the issue of whether these things are kind or good for mental health or such, I just think these memes are a bad way of thinking about how research works or how to make progress.

I'm pretty fond of the phrase "standing on the shoulders of giants". Really, people extremely rarely figure stuff out from the ground or from the null string. The giants are pretty damn large. You should climb on top of them. In the real world, if there's a guide for a skill you want to learn, you read it. I could write a longer rant about the null string thing, but let me leave it here.

About "the safety community currently is mostly bouncing off the hard problems and [...] publish a paper": I'm not sure who "community" refers to. Sure, Ethical and Responsible AI doesn't address AI killing everyone, and sure, publish or perish and all that. This is a different claim from "people should sit down and think how to align a superintelligence". That's the hard problem, and you are supposed to focus on that first, right?

Taking these together, what you get is something that's opposite to what research usually looks like. The null string stuff pushes away from scholarship. The "...that guarantee they'll be able to publish a paper..." stuff pushes away from having well-scoped projects. The talk about iterative designs failing can be interpreted as pushing away from empirical sources of information. Focusing on the hard part first pushes away from learning from relaxations of the problem.

And I don't think the "well alignment is different from science, iterative design and empirical feedback loops don't suffice, so of course the process is different" argument is gonna cut it.

What made me make this update and notice the ways in which LW culture is anti-helpful was seeing how people do alignment research in real life. They actually rely a lot on prior work, improve on it, use empirical sources of information and do stuff that puts us into a marginally better position. Contrary to the memes above, I think this approach is actually quite good.

Thanks for the response. (Yeah, I think there's some talking past each other going on.)

On further reflection, you are right about the update one should make about a "really hard to get it to stop being nice" experiment. I agree that it's Bayesian evidence for alignment being sticky/stable vs. being fragile/sensitive. (I do think it's also the case that "AI that is aligned half of the time isn't aligned" is a relevant consideration, but as the saying goes, "both can be true".)

Showing that nice behavior is hard to train out, would be bad news?

My point is not quite that it would be bad news overall, but bad news from the perspective of "how good are we at ensuring the behavior we want".

I now notice that my language was ambiguous. (I edited it for clarity.) When I said "behavior we want", I meant "given a behavior, be it morally good or bad or something completely orthogonal, can we get that in the system?", as opposed to "can we make the model behave according to human values". And what I tried to claim was that it's bad news from the former perspective. (I feel uncertain about the update regarding the latter: as you say, we want nice behavior; on the other hand, less control over the behavior is bad.)

A local comment to your second point (i.e. irrespective of anything else you have said).

Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said "This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren't able to uproot it. Alignment is extremely stable once achieved" 

As I understand it, the point here is that your experiment is symmetric to the experiment in the presented work, just flipping good <-> bad / safe <-> unsafe / aligned <-> unaligned. However, I think there is a clear symmetry-breaking feature. For an AI to be good, you need it to be robustly good: you need it to be the case that in the vast majority of cases (even with some amount of adversarial pressure) the AI does good things. AI that is aligned half of the time isn't aligned.

Also, in addition to "how stable is (un)alignment", there's the perspective of "how good are we at controlling the behavior of models" (edited for clarity; originally "ensuring the behavior we want"). Both the presented work and your hypothetical experiment are bad news about the latter.

I think lots of folks (but not all) would be up in arms, claiming "but modern results won't generalize to future systems!" And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this paper claims pessimistic results, and it's socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I'm being too cynical, but that's my reaction.

(FWIW I think you are being too cynical. It seems like you think it's not even-handed / locally-valid / expectation-conserving to celebrate this result without similarly celebrating your experiment. I think that's wrong, because the situations are not symmetric, see above. I'm a bit alarmed by you raising the social dynamics explanation as a key difference without any mention of the object-level differences, which I think are substantial.)
