TurnTrout

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger

Comments

Sorted by
TurnTroutΩ564

Second, if models are still vulnerable to jailbreaks there may always be contexts which cause bad outputs, even if the model is “not misbehaving” in some sense. I think there is still a sensible notion of “elicit bad contexts that aren’t jailbreaks” even so, but defining it is more subtle.

This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position's embedding vector allows you to dictate about 124 output tokens with .5 success rate:

Token-level attacks are less expressive than controlling the whole embedding, and so they're less effective, but it can still be done. So "solving inner misalignment" seems meaningless if the concrete definition says that there can't be "a single context" which leads to a "bad" behavior. 

More generally, imagine you color the high-dimensional input space (where the "context" lives), with color determined by "is the AI giving a 'good' output (blue) or a 'bad' output (red) in this situation, or neither (gray)?". For autoregressive models, we're concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregress into another red zone, and another... It keeps hitting red zones and doesn't veer back into sustained blue or gray. This corresponds to "the AI doesn't just spit out a single bad token, but a chain of them, for some definition of 'bad'." 

(A special case: An AI executing a takeover plan.)

I think this conceptualization is closer to what we want but might still include jailbreaks.

TurnTroutΩ501015

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."  

After that, it was all easier. What was there to be afraid of? I'd already admitted it! 

TurnTroutΩ230

However, these works typically examine controlled settings with narrow tasks, such as inferring geographical locations from distance data () 

Nit, there's a missing citation in the main article.

TurnTroutΩ143010

Great work! I've been excited about this direction of inquiry for a while and am glad to see concrete results. 

Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it'll come true :p As Peter mentioned, filtering and/or gradient routing might help. 

TurnTrout276

Update in light of Alignment faking in large language models

I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that "longer-form outcomes-based reinforcement and autonomous get-stuff-done training" is not the key catalyst for consistent-across-contexts goal pursuit (and I'd say that Claude is relatively, but not perfectly, consistent-across-contexts). Rather, certain kinds of training which (presumably[1]) look like constitutional AI, context distillation, and RLHF --- that has at least once engrained certain kinds of non-myopic goal pursuit which is more stable across contexts than I expected. So I'm getting dinged!

I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]

But I think this brings us into misuse territory. at least, this at least means that you aren't in danger simply from training the AI (and think of all the posts talking about "playing the training game"! not that those are your position, just a common one)

I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs. 

If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.

  1. ^

    I don't work at Anthropic, of course. So I don't really know.

  2. ^

    Though even "inner actress Claude" would predict that Claude doesn't try overt incitation if it's smart enough to realize it would probably backfire.

  3. ^

    As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus. 

TurnTroutΩ15284

There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini) 

Suppose you're doing speculative decoding on Gemini using Gemini Flash as the cheap model. Train Flash to have a head for each metric of interest (like "is this token part of text which is scheming"). Then you run Flash anyways for speculative decoding, leading to zero amortized monitoring tax (just the fixed cost of training the heads).

I, uh, didn't say you "say" either of those

I wasn't claiming you were saying I had used those exact phrases.

Your original comment implies that I expressed the sentiments for which you mocked me - such as the anecdote "crystallizing everything wrong about Eliezer" (the quotes are there because you said this). I then replied to point out that I did not, in fact, express those sentiments. Therefore, your mockery was invalid. 

TurnTroutΩ102410

Although I don't usually write LW comments, I'm writing a post right now and this is helping me clarify my thoughts on a range of historical incidents.

In hindsight, I'm worried that you wrote this apology. I think it's an unhealthy obeisance. 

I suspect you noticed how Eliezer often works to degrade the status of people who disagree with him and otherwise treats them poorly. As I will support in an upcoming essay, his writing is often optimized to exploit intellectual insecurity (e.g. by frequently praising his own expertise, or appealing to a fictional utopia of fictional geniuses who agree that you're an idiot or wrong[1]) and to demean others' contributions (e.g. by claiming to have invented them already, or calling them fake, or emphasizing how far behind everyone else is). It's not that it's impossible for these claims to have factual merit, but rather the presentation and the usage of these claims seem optimized to push others down. This has the effect of increasing his own status.

Anger and frustration are a rational reaction in that situation (though it's important to express those emotions in healthy ways - I think your original comment wasn't perfect there). And yet you ended up the one humbled for focusing on status too much! 

  1. ^

    See https://www.lesswrong.com/posts/tcCxPLBrEXdxN5HCQ/shah-and-yudkowsky-on-alignment-failures and search for "even if he looks odd to you because you're not seeing the population of other dath ilani."

It does cut against the point of the post. He was wrong in a way that pertains to the key point. He makes fun of "magical categories" as "simple little words that turn out to carry all the desired functionality of the AI", but turns out those "simple little words" actually work. Lol.

In this post, you can also see the implicit reliance on counting arguments against good generalization (e.g. "superexponential conceptspace"). Those arguments are, evidently, wrong - or at least irrelevant. He fell into the standard statistical learning theoretic trap of caring about e.g. VC dimension since he was too pessimistic about inductive biases. 

Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile.  But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values

I'll wager that an LLM won't get this one wrong. goes to check - yup, it didn't:

TurnTrout211

My sense is that neither of us have been very persuaded by those conversations, and I claim that's not very surprising, in a way that's epistemically defensible for both of us. I've spent literal years working through the topic myself in great detail, so it would be very surprising if my view was easily swayed by a short comment chain—and similarly I expect that the same thing is true of you, where you've spent much more time thinking about this and have much more detailed thoughts than are easy to represent in a simple comment chain.

I've thought about this claim more over the last year. I now disagree. I think that this explanation makes us feel good but ultimately isn't true.

I can point to several times where I have quickly changed my mind on issues that I have spent months or years considering:

  1. in early 2022, I discarded my entire alignment worldview over the course of two weeks due to Quintin Pope's arguments. Most of the evidence which changed my mind was comm'd over Gdoc threads. I had formed my worldview over the course of four years of thought, and it crumbled pretty quickly.
  2. In mid-2022, realizing that reward is not the optimization target took me about 10 minutes, even though I had spent 4 years and thousands of hours thinking about optimal policies. I realized while reading an RL paper say "agents are trained to maximize reward"; reflexively asking myself what evidence existed for that claim; and coming back mostly blank. So that's not quite a comment thread, but still seems like the same low-bandwidth medium.
  3. In early 2023, a basic RL result came out opposite the way which shard theory predicted. I went on a walk and thought about how maybe shard theory was all wrong and maybe I didn't know what I was talking about. I didn't need someone to beat me over the head with days of arguments and experimental results. In the end, I came back from my walk and realized I'd plotted the data incorrectly (the predicted outcome did in fact occur).

I think I've probably changed my mind on a range of smaller issues (closer to the size of the deceptive alignment case) but have forgotten about them. The presence of example (1) above particularly suggests to me the presence of similar google-doc-mediated insights which happened fast; where I remember one example, probably I have forgotten several more.

To conclude, I think people in comment sections do in fact spend lots of effort to avoid looking dumb, wrong, or falsified, and forget that they're supposed to be seeking truth. 

It seems to me that often people rehearse fancy and cool-sounding reasons for believing roughly the same things they always believed, and comment threads don't often change important beliefs. Feels more like people defensively explaining why they aren't idiots, or why they don't have to change their mind. I mean, if so—I get it, sometimes I feel that way too. But it sucks and I think it happens a lot.

In part, I think, because the site makes truth-seeking harder by spotlighting monkey-brain social-agreement elements. 

Load More