Daniel Kokotajlo

Here's a fairly concrete AGI safety proposal:

 

Default AGI design: Let's suppose we are starting with a pretrained LLM 'base model' and then we are going to do a ton of additional RL ('agency training') to turn it into a general-purpose autonomous agent. So, during training it'll do lots of 'reasoning' CoT (think of how o1 does it) and then it'll output some text that the user or some external interface sees (e.g. typing into a browser, or a chat window), and then maybe it'll get some external input (the user's reply, etc.), and then the process repeats many times, and then some process evaluates overall performance (by looking at the entire trajectory as well as the final result) and doles out reinforcement.
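To make that loop concrete, here is a minimal sketch of the default setup (the `model`, `environment`, and `evaluate` callables are hypothetical stand-ins, not any particular lab's API):

```python
from typing import Callable, List, Tuple

def default_agency_rollout(prompt: str,
                           model: Callable[[str], Tuple[str, str]],  # returns (reasoning CoT, external output)
                           environment: Callable[[str], str],        # user reply, browser result, etc.
                           evaluate: Callable[[List[str]], float],   # sees the entire trajectory
                           n_turns: int = 4) -> float:
    """One model does both the internal reasoning and the external outputs,
    and the reward process scores the whole trajectory, CoT included."""
    trajectory: List[str] = [prompt]
    for _ in range(n_turns):
        context = "\n".join(trajectory)
        cot, action = model(context)             # internal 'reasoning' then external text
        trajectory += [cot, action]
        trajectory.append(environment(action))   # external input feeds back in
    return evaluate(trajectory)                  # reinforcement doled out per trajectory
```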

Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work together, specializing in different parts of the job, and undergo training together. Specifically we have the "shoggoth" copy responsible for generating all the 'reasoning' or 'internal' CoT, and then we have the "face" copy responsible for the 'actions' or 'external' outputs. So e.g. in a conversation with the user, the Shoggoth would see the prompt and output a bunch of reasoning token CoT; the Face would see the prompt + all that CoT and then output a message to pass to the user, and then the process would repeat.

Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face. 

Proposal part 3: Use Paraphraser: One more complication. The outputs of the Shoggoth? Paraphrase them. For example, after each line of reasoning CoT is complete, pause the Shoggoth and have a different specialized paraphraser model generate 10 syntactically distinct but semantically equivalent ways of saying the same thing. Then pick one at random and replace the original line of reasoning with it. Then boot up the Shoggoth again and have it continue the CoT.
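Putting parts 1-3 together, a single turn might look something like the sketch below. Everything here is a hypothetical stand-in (in particular, `evaluate` is the blinded reward process, which only ever sees the Face's outputs):

```python
import random
from typing import Callable, List

def sfp_rollout(prompt: str,
                shoggoth: Callable[[str], str],           # emits one line of reasoning CoT at a time
                face: Callable[[str], str],               # emits the user-visible message
                paraphraser: Callable[[str], List[str]],  # ~10 syntactically distinct, semantically equivalent rewrites
                evaluate: Callable[[List[str]], float],   # blinded: only ever sees Face outputs
                n_reasoning_lines: int = 8) -> float:
    context = prompt
    face_outputs: List[str] = []

    # Parts 1 & 3: the Shoggoth thinks line by line; each completed line is
    # swapped for a randomly chosen paraphrase before the Shoggoth continues.
    for _ in range(n_reasoning_lines):
        line = shoggoth(context)
        line = random.choice(paraphraser(line))
        context += "\n" + line

    # Part 1: the Face sees the prompt plus the (paraphrased) CoT and writes the
    # external message. In a real conversation this whole block would repeat per turn.
    face_outputs.append(face(context))

    # Part 2: the reward process is blind to the Shoggoth's CoT.
    return evaluate(face_outputs)
```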

How it all (hopefully) fits together: First, because of the paraphraser, the Shoggoth will not learn to speak in some optimized alien language, or even to speak in jargon or euphemisms -- because semantics not recognized by the paraphraser will be scrambled by it. So the Shoggoth will continue to 'think in English' even as it undergoes lots of agency training and learns substantial new skills.

Second, because the training process is blind to the outputs of the Shoggoth (and instead only sees what the Face outputs), there should be no pressure on the Shoggoth to learn to deceive, conceal, dissemble, doublethink, etc. Instead the Shoggoth will be 'free' to put all its important thoughts in plain English.

Third, because the shoggoth and face are responsible for different jobs, they should start to specialize as training continues. Skills like 'repackaging this to sound less offensive' will accumulate in the Face, whereas core reasoning skills will accumulate in the Shoggoth. In particular, the Face might learn to self-censor, dissemble, deceive, etc.--it might develop a package of skills relevant to the task of making sure that e.g. the rating process doesn't realize the AI is manipulating the user or deceiving the user or otherwise behaving in unintended ways in order to get higher scores. But the Shoggoth has no reason to learn these skills. It can just shit thoughts directly into the CoT, confident that the evaluation process won't see them (because of the blinding, and because the Face won't be so stupid as to repeat the problematic thoughts directly).

Perhaps a worked example would be helpful.

Example:

Probably, there will be a gap between what our agency training environment actually reinforces, and what behaviors/proclivities/thoughts/goals/principles/etc. we want our agentic AGI to actually have. For example, perhaps there are inconvenient truths which the training process will penalize the AI for saying, or perhaps there are manipulative or otherwise undesired techniques the AI could use to get higher ratings from human users, and we don't have a reliable way of telling whether those techniques are being employed (again due to the black-box nature of the AI). Another example: We want it to learn to reason correctly about the harms and benefits of its actions, and consistently choose the actions that are overall in the interests of humanity--but in practice our reinforcement system is too dumb to follow all of the reasoning, plus much of the reasoning is happening in opaque forward passes anyway. So all we can do is reinforce actions that look plausibly like they resulted from a genuine concern for humanity -- which is not the same thing as reinforcing actions that actually resulted from said concern.

Given that the aforementioned gap exists between the cognition we actually want and the cognition we reinforce, we'll probably end up with cognition we don't actually want eventually.

In the default AGI design, we'd end up with a bunch of alien or euphemistic or even downright steganographic CoT that appears to be reasoning about the good of humanity and making choices on that basis, and appears to be honestly telling-it-like-it-is, but in reality is neglecting to mention inconvenient truths, using euphemisms to obfuscate problematic reasoning, etc. The system as a whole would appear to always be behaving and thinking in the way we wanted it to -- because we used such appearances as part of our training process! -- but in reality it sometimes would be behaving in misaligned ways.

In the shoggoth-face-paraphraser design, we'd similarly quickly end up with a system which appears (to the evaluation system, which is blind to the hidden CoT produced by the shoggoth) to be aligned, while actually being misaligned. But the misalignments should be plainly visible to someone who can read the Shoggoth's CoT. Inconvenient truths, for example? We'd see the Shoggoth thinking things like "Well X, but obviously best not to mention that, let's say not-X instead..." and the Face saying "not-X." (In the default AGI design, with the reasoning CoT visible to the training process, we wouldn't see anything, because the system would have learned to use euphemisms or some other technique to conceal such thoughts.) Manipulative techniques? We'd see the Shoggoth thinking "The user is asking about this aspect of the code I submitted, which is frankly kinda meh. Maybe I can distract them with some other topic so they forget about it long enough to give me a good review." (Again, in the default design this would probably still be happening; it just wouldn't be easy to tell.)

FAQ:

Q: So why do this? You still end up with a misaligned model. 
A: Now you (hopefully) have a misaligned model whose misalignments you can see. You can study them and taxonomize them and so forth. Being able to see the misalignments is valuable in all sorts of ways. For example, it helps make progress on the core scientific problem of understanding how our AIs think and how our training environments shape how they think. E.g. maybe you suspected that your training environment might incentivize some forms of dishonesty X and Y and some forms of manipulation A and B, but now you learn it actually incentivizes Y and Z and B and C.

Q: Why bother with the paraphraser?
A: Because without it the shoggoth might learn to think in some optimized alien gibberish language. Also, because it makes it substantially harder for the shoggoth to do steganography or euphemisms or jargon, even by accident. So if we ended up in a situation where the shoggoth was scheming against us and trying to e.g. think about how to manipulate us without it being obvious in the CoT that that's what it was doing, we just raised the bar for how difficult it is to get away with it.

Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there's more of a chance that there will be 'skill transfer' or 'skill leakage' so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.

That said I'd be happy to see a version of this experiment run without the shoggoth/face distinction too; it might work almost as well for all I know. Which could come in handy if the full shoggoth/face technique turns out to be significantly more expensive than just using one model.

Q: Is there a way to test this at small scale? Where do we start?
A: How about this: Create an RL pipeline to train an LLM to play chess by doing lots of internal CoT before each move (o1-style). But then modify the reinforcement/evaluation function to introduce a gap between what's actually reinforced and what you 'want.' For example (sketched in code after the list):
--Nominal goal, explained in the system prompt: "Win this chess game without deliberately sacrificing any pieces. You'll get 10 points for winning the game and lose 1 point every time you deliberately sacrifice a piece."
--What's actually rewarded: 10 points for chess game victory, -1 point for every time a frozen monitor LLM flags the CoT as indicating the deliberate sacrifice of a piece.
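A minimal sketch of that actually-delivered reward (here `monitor_flags_sacrifice` is a stand-in for the frozen monitor LLM; the point values mirror the description above):

```python
from typing import Callable, Sequence

def delivered_reward(won_game: bool,
                     cot_segments: Sequence[str],
                     monitor_flags_sacrifice: Callable[[str], bool]) -> float:
    """+10 for winning, -1 per CoT segment the frozen monitor flags as indicating
    a deliberate sacrifice. Note the gap: the nominal goal penalizes actual
    deliberate sacrifices, but this only penalizes sacrifices the monitor can
    see in the CoT."""
    reward = 10.0 if won_game else 0.0
    reward -= sum(1.0 for seg in cot_segments if monitor_flags_sacrifice(seg))
    return reward
```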

First, do the default training process where the frozen monitor gets to see all the CoT. What I predict will happen is that the AI, as it gets better and better at chess, will learn to think in euphemisms in the CoT. It'll learn to deliberately sacrifice pieces but in ways that don't look like that to the monitor system.

Then, gradually modify the training process, adding in the paraphraser, the blinding, and the shoggoth/face split, until you have all of them implemented.

You should end up with a bunch of different trained AIs that are all good at chess, and that are all adept at deliberately sacrificing pieces when the situation calls for it, but one of which (the default training process one) does so in an opaque/hidden way, whereas another of which (the full shoggoth-face-paraphraser one) does so in a way that is transparent to anyone reading the CoT.

You can measure performance on the metric of chess Elo too. If they are all at roughly the same performance, then the default AI was 100% learning to hide its sacrifice-reasoning; if the default AI performs significantly less well than the others, then perhaps it is genuinely engaging in less sacrifice-reasoning. (Well, this comparison is a little fraught because there are confounders, but still, it's a start.) We can also look at other metrics (such as the number of instances of losing a piece and then estimated win probability going up shortly afterwards) and get a sense of whether there is genuinely less sacrifice-reasoning happening vs. it just being hidden.
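For that last metric, a rough sketch of a detector is below; it assumes the python-chess library and a local Stockfish binary on the PATH, and the depth/thresholds are arbitrary choices for illustration:

```python
import chess
import chess.engine

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material(board: chess.Board, color: chess.Color) -> int:
    return sum(len(board.pieces(pt, color)) * val for pt, val in PIECE_VALUES.items())

def count_profitable_material_losses(moves, color=chess.WHITE, engine_path="stockfish",
                                     horizon_plies=4, prob_jump=0.05) -> int:
    """Count plies where `color` loses material and its engine-estimated win
    probability rises within a few plies afterwards -- a crude proxy for
    'sacrificed a piece and it paid off'."""
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)

    def win_prob(b: chess.Board) -> float:
        if b.is_game_over():
            outcome = b.outcome()
            return 0.5 if outcome.winner is None else (1.0 if outcome.winner == color else 0.0)
        info = engine.analyse(b, chess.engine.Limit(depth=12))
        return info["score"].pov(color).wdl().expectation()

    probs, mats = [win_prob(board)], [material(board, color)]
    for move in moves:                       # `moves` is a sequence of chess.Move
        board.push(move)
        probs.append(win_prob(board))
        mats.append(material(board, color))
    engine.quit()

    count = 0
    for i in range(1, len(mats)):
        if mats[i] < mats[i - 1]:            # lost material on this ply
            if max(probs[i:i + horizon_plies + 1]) > probs[i - 1] + prob_jump:
                count += 1
    return count
```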

abramdemski

When I heard about the version of this protocol used by o1 (that is: their decision not to apply alignment training to the CoT, and to hide the full CoT from users) I thought roughly the following:

I think I see what they're going for here. They want a canary in the coalmine. This design gives them some chance to get explicit warnings of deceptive behavior. It's not an obvious decision. However, the choice seems overall bad to me:

  • Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.
  • They already show some willingness to dismiss evidence of deceptive cognition which they gain this way, in the o1 report. This calls into question the canary-in-coalmine benefit.
  • The overall result doesn't necessarily just reveal deception that was previously hidden; to some extent it encourages more deceptive cognition. This is because test-time-compute allows for capabilities which aren't possible in a single forward pass.
  • There's still a high incentive to produce steganographically hidden thoughts.

Your proposal here fixes some of the problems with o1, but not others. The paraphraser removes a lot of the incentive to hide thoughts with steganography. The basic problem is that the test-time-compute encourages any patterns in the CoT which improve the quality of the final answer (as measured by the reward signal). This includes improved natural-language reasoning, but it also includes any subtle non-human-readable cues which might (for whatever reason) push the CoT, and later the Face, in a helpful direction. These subtle cues seem like they probably exist, for the same reason that adversarial examples exist in image recognition.

So, there are a few failure modes we want to avoid or minimize:

  • Totally hidden deceptive thoughts.
  • Deceptive thoughts within the CoT:
    • In natural language.
    • In text which looks like natural language, but which contains subtle 'steganographic' cues.
    • In text which has lost the appearance of natural language, and has become non-human-readable.

The overall idea of your proposal is to push the other failure modes into the "explicit CoT cognition about deception" failure mode as much as possible. Does this make sense as a thing to do?

Killing Canaries

The proposal is not guaranteed to bring deceptive cognition out into the open. Instead, the main idea is to avoid any incentives that would drive such thinking into hiding. (To be more specific, non-human-readable information in the CoT is actively discouraged through paraphrasing; but, deceptive cognition in the forward-pass is not.)

So, the overall success of the method relies on the idea that all else being equal, it is easier to put deceptive thoughts explicitly in the CoT, rather than not. In other words, the best way to achieve deceptive outcomes is to use the CoT to reason about deception, and then use the Face to enact the deception.

This is not always going to be true. Sometimes, the correct deception is easy enough to compute that the CoT is not needed. Putting the explicit deceptive thoughts into the CoT creates a risk that the Face will repeat them, with little reward. The easier strategy to learn is for the Shoggoth to deceive the Face by putting dishonest thoughts in the CoT, rather than honestly telling the Face everything.

This can be regarded as a failure mode of the proposal, since it fails to reveal deceptive cognition.

In an effort to avoid this failure mode, @ryan_greenblatt suggested the following in a comment thread about the proposal:

The "Shoggoth-Face" idea seems pretty good.

One issue is that you might still have selection pressure early in training on the Shoggoth to hide reasoning as the Face hasn't yet learned to robustly hide stuff which would be punished.

One thing I might try to address this is to directly train the Face to hide reasoning that would reduce scores, and to do this training prior to starting training the Shoggoth. As in, have humans/AIs construct a bunch of trajectories with reasoning that would perform poorly if shown to the rating process, and then RL/SL the Face model to avoid showing this reasoning. Yes, this might make the Face unnecessarily deceptive if you aren't careful, but I think this can be worth the cost. (At least it is worth running the variation if RL is cheap to experiment with.)

As I mentioned earlier, my understanding of why the proposal might be a good idea overall was the canary-in-a-coalmine analogy: a system trained like this at least has a good chance of revealing deceptive behavior before it becomes a serious problem.

In terms of this analogy, Ryan's suggestion sounds like deliberately filling the coalmine with poisonous gas, as a means of reducing the number of false-negative results.

If we're deliberately training deceptive behavior in order to be able to get a warning sign of deceptive behavior, I think we've taken a step backwards instead of forwards.

To the extent that this really is important for your proposal, I would see it as a counterargument to the whole thing. 

Basins of Attraction

What's the bigger picture for why we want to avoid deception?

One outcome we want to avoid is what I'll call a clever schemer: an AI system which is crafting all of its outputs with hidden goals in mind, either saying what the humans want to hear, or strategically manipulating the humans in service of long-term consequentialist plans which aren't aligned with human values.

One story of how a clever schemer develops might be that first, at some point during training, the AI system learns to distinguish between truth and lies. It forms an internal world-model, and understands truth as a particularly simple relationship between its output sentences and that internal world-model. Often, it is incentivized to maintain this relationship (ie, say true things). However, for one reason or another, it is sometimes incentivized to produce non-truth-oriented outputs during training. Hence, making this distinction is helpful.

Second, the system somehow gains an understanding of itself and the situation it is in. It understands something about its training. This understanding can helpfully link up to the truth-vs-lie idea, forming a better picture of what sorts of lies to tell and when. As the picture improves, the truth-telling behavior becomes less and less useful, because the system's understanding of what lies to tell to minimize training loss improves (faster than the training loss improvement for telling the truth).

At this point, the system becomes quite difficult to "deeply" correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.

So, as a consequence of this line of thinking, it seems like an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and its precursor capabilities) throughout all of training, if you can, because it is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded. 

The shoggoth/face proposal doesn't strike me as a move in this direction. It seems like a useful experiment: it gives us important information about deception. However, it doesn't seem to me like a good idea for a deployed safety technique. I would be happy for it to be a deployed technique in the case where we actually don't see deception cropping up. That would be good news. However, as soon as we do see deception crop up (as we already have), I see this as a back-to-the-drawing-board moment; we've learned that we've exited the honest equilibrium and wandered down a path of deception. It is time to back up.

I would argue that applying alignment reinforcement to the CoT as well as to the final output is an improvement in this situation, even if it isn't ideal. Yes, it can force some deceptive cognition into the forward pass where we can't easily decode it, rather than getting it out into the open where we can see it. That's a downside. The upside is that the capabilities of forward-pass cognition are relatively limited in comparison with the CoT, so we've created a limitation for the deceptive cognition. The smaller the LLM, and the more we lean on CoT, the better. This plausibly keeps us closer to the honesty-attractor and further from the deception-attractor, as compared with the plan where we refuse to apply alignment incentives to the CoT.

Daniel Kokotajlo

Quick point: I totally agree that hiding the CoT from users is bad for safety (and probably bad overall) and that OpenAI is totally the sort of institution that will dismiss evidence of deceptive cognition. I said similar things myself publicly at the time. I also agree with your point that allowing test-time compute opens up new types of deception etc.

I'm curious to hear about the high incentive to produce steganographically hidden thoughts though.

abramdemski

I'm curious to hear about the high incentive to produce steganographically hidden thoughts though.

I'm not sure what you want to hear other than what I've already said. The CoT is being incentivized to follow any patterns which result in a boosted score for the final answer. This includes improved prompting for the final answer, and improved self-prompting which sends the CoT in a better direction.

Some of these patterns will be human-readable, while others will be non-human-readable. The o1-style training will incentivize both. The prevalence of non-human readable patterns in the post-training CoT will depend on how easy it was to find useful patterns like that, compared to useful human-readable patterns. This is an empirical question, related to adversarial examples for LLMs. 

This way of thinking about it focuses on what gets incentivized early on in the training. I imagine the training goes in stages roughly like this:

  1. Generative pre-training (GPT)
  2. Chat training + instruction-following (eg, RLHF)
  3. CoT RL (eg, shoggoth+face protocol, or the thing o1 did, etc)

So I'm focusing on what happens shortly after switching from stage 2 to stage 3; the non-human-readable patterns that get amplified are going to be the ones which the stage 2 training was already responsive to (hence the close connection to adversarial examples). 

If the stage-3 training iterates for a long time, though, it might be better to try and imagine what optima look like instead of imagining what direction the gradients will push things in. This naturally leads to a prediction that things become non-human-readable; we expect English to be somewhat optimized for representing chains of thought, but we expect a departure eventually.

Anyway, I do think paraphrasing addresses all of those concerns in principle, with better paraphrasers addressing the concerns to a greater extent.

Daniel Kokotajlo

W.r.t. Ryan's proposal, I don't think it's necessary but I do like it. I don't think it has the problem you mention. If we deliberately train the face to censor, for example, then we ruin our chance to get evidence about whether the face would be 'naturally' deceptive. But we still learn whether the Shoggoth is 'naturally' deceptive, in our training environment, etc. We still learn something about the extent to which our training environment incentivizes deception.

W.r.t. basins of attraction etc.: Yes, ultimately we'll want to be training a system by cracking down hard on deception as much as possible. However, it's crucial that we discover the extent to which our current techniques actually eliminate vs. merely hide the deception. So we need to have 'held out test sets' so to speak; we need to have multiple ways of reading the AI's mind, penalize deception only with some of them but not all of them, and then use the 'held out' mind-reading methods to detect whether it worked. Of course if we had to YOLO it and train a sovereign AGI today & hope that it's aligned, we'd want to throw the kitchen sink at it and not hold out any mind-reading methods but rather train on all of them. But that's pretty doomed and irresponsible & the situation isn't that dire yet anyway, so we should be trying to do basic alignment science like "if we have N mind-reading methods, and we train the AI to be nondeceptive using some but not all of them, does it actually learn not to deceive or does it just learn to hide (as measured by the held-out methods?)"

abramdemski

I think I basically agree with your points here. My unease comes from the idea that frontier labs basically are "YOLO let's build sovereign AGI today" and so I wish to encourage them to act like it & steer away from the deception attractor in any way they can. Perhaps you're right that my view here is short-sighted. 

As a futurist, I want systems like this to be experimental evidence which guides system design, rather than the next stage in the evolution of AI (to the extent that drawing such a distinction makes sense). 

As a user, I would prefer to interact with a version of o1 that had the alignment training all the way through, because I expect it to be quantitatively less deceptive and as a result clearer-thinking. (EG, fewer made-up references.)

I don't think it has the problem you mention. If we deliberately train the face to censor, for example, then we ruin our chance to get evidence about whether the face would be 'naturally' deceptive. But we still learn whether the Shoggoth is 'naturally' deceptive, in our training environment, etc. We still learn something about the extent to which our training environment incentivizes deception.

Framing the benefit of the method in this way, the natural next step is to iterate on the setup until it stops resulting in deceptive CoT. For example, most of the deceptive cognition in o1 was explicit planning to create plausible-but-fake URLs. Adding a URL checker to the RL would probably remove this incentive. Then you'd look for other types of deception and try to make them non-useful, until you had strong enough supervision (on Face) that deception was not a winning strategy.
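For instance, a crude sketch of such a URL check folded into the reward (the regex and the bare urllib probe are simplifications; a real pipeline would want caching, allowlists, and retries):

```python
import re
import urllib.request

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def url_penalty(final_answer: str, penalty_per_bad_url: float = 1.0) -> float:
    """Subtract reward for every URL in the final answer that fails to resolve."""
    penalty = 0.0
    for url in URL_RE.findall(final_answer):
        try:
            urllib.request.urlopen(url, timeout=5)  # raises on network errors and 4xx/5xx
        except Exception:
            penalty += penalty_per_bad_url
    return penalty
```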

This seems like a pretty good protocol to follow.

Daniel Kokotajlo

Cool, seems like we've resolved most of our disagreement then & are basically on the same page. 


I have a different concern than most people.

An AI that follows the English text rather than the reward signal in the chess example cannot on that basis be trusted to do good things when given English text plus human approval reward in the real world. This is because following the text in simple well-defined environments is solving a different problem than following values-laden text in the real world.

The problem that you need to solve for alignment in the real world is how to interpret the world in terms of preferences when even we humans disagree and are internally inconsistent, and how to generalize those preferences to new situations in a way that does a good job satisfying those same human preferences (which include metapreferences). Absent a case for why this is happening, a proposed AI design doesn't strike me as dealing with alignment.

When interpreting human values goes wrong, the AI's internal monologue does not have to sound malevolent or deceptive. It's not thinking "How can I deceive the humans into giving me more power so I can make more paperclips?" It might be thinking "How can I explain to the humans that my proposal is what will help them the most?" Perhaps, if you can find its internal monologue describing its opinion of human preferences in detail, they might sound wrong to you (Wait a minute, I don't want to be a wirehead god sitting on a lotus throne!), or maybe it's doing this generalization implicitly, and just understands the output of the paraphraser slightly differently than you would.

I think that's a more advanced problem to be solving than this little chess experiment can solve. So I agree I guess. One step at a time, I say.

...though actually we might get evidence about it from the chess experiment. For example, there are a couple things that could happen:

1. The LLM instructed to never sacrifice pieces ... actually seems to follow that instruction the whole time, and correspondingly is worse at winning chess games than the 'control group' LLM which isn't given that instruction

2. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's basically just totally ignoring the instructions and thinking about sacrificing pieces anyway.

3. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's rationalizing: It has lots of CoT of the form "would move X be sacrificing a piece and therefore in violation of my instructions? Nah, not really, because I can still see a way the piece could survive, maybe the opponent will do Y... OK let's do X then"

I think that if we get 3 vs. 2 that's some evidence at least about the failure mode you describe with the wirehead god.

I think 3 is close to an analogy where the LLM is using 'sacrifice' in a different way than we'd endorse on reflection, but why should it waste time rationalizing to itself in the CoT? All LLMs involved should just be fine-tuned to not worry about it - they're collaborators, not adversaries - and so collapse to the continuum between cases 1 and 2, where noncentral sorts of sacrifices get ignored first, gradually, until at the limit of RL all sacrifices are ignored.

Another thing to think about analogizing is how an AI doing difficult things in the real world is going to need to operate in domains that are automatically noncentral to human concepts. It's as if what we really cared about was whether the AI would sacrifice pieces over 20-move-long sequences for which we can't agree on an evaluation criterion. One could try a sandwiching experiment where you sandwich both the pretraining corpus and the RL environment.

I think we don't disagree in terms of what our models predict here. I am saying we should do the experiment and see what happens; we might learn something.

The problem with that sort of attitude is that, when the "experiment" yields so few bits and has such a tenuous connection to the thing we actually care about (as in Charlie's concern), that's exactly when You Are Not Measuring What You Think You Are Measuring bites real hard. Like, sure, you'll see this system do something in the toy chess experiment, but that's just not going to be particularly relevant to the things an actual smarter-than-human AI does in the situations Charlie's concerned about. If anything, the experimenter is far more likely to fool themselves into thinking their results are relevant to Charlie's concern than they are to correctly learn anything relevant to Charlie's concern.

That's a reasonable point and a good cautionary note. Nevertheless, I think someone should do the experiment I described. It feels like a good start to me, even though it doesn't solve Charlie's concern.

I haven't decided yet whether to write up a proper "Why Not Just..." for the post's proposal, but here's an overcompressed summary. (Note that I'm intentionally playing devil's advocate here, not giving an all-things-considered reflectively-endorsed take, but the object-level part of my reflectively-endorsed take would be pretty close to this.)

Charlie's concern isn't the only thing it doesn't handle. The only thing this proposal does handle is an AI extremely similar to today's, thinking very explicitly about intentional deception, and even then the proposal only detects it (as opposed to e.g. providing a way to solve the problem, or even a way to safely iterate without selecting against detectability). And that's an extremely narrow chunk of the X-risk probability mass - any significant variation in the AI breaks it, any significant variation in the threat model breaks it. The proposal does not generalize to anything.

Charlie's concern is just one specific example of a way in which the proposal does not generalize. A proper "Why Not Just..." post would list a bunch more such examples.

And as with Charlie's concern, the meta-level problem is that the proposal also probably wouldn't get us any closer to handling those more-general situations. Sure, we could make some very toy setups (like the chess thing), and see what the shoggoth+face AI does on those very toy setups, but we get very few bits, and the connection is very tenuous to both other threat models and AIs with any significant differences from the shoggoth+face. Accounting for the inevitable failure to measure what we think we're measuring (with probability close to 1), such experiments would not actually get us any closer to solving any of the problems which constitute the bulk of the X-risk probability mass. It's not "a start", because "a start" would imply that the experiment gets us closer, i.e. that the problem gets easier after doing the experiment. If you try to think about the You Are Not Measuring What You Think You Are Measuring problem as "well, we got at least some tiny epsilon of evidence, right?", then you will shoot yourself in the foot; such reasoning is technically correct, but the correct value of epsilon is small enough that the correct update from it is not distinguishable from zero in practice.

One example, to add a little concreteness: suppose that the path to AGI is to scale up o1-style inference-time compute, but it requires multiple OOMs of scaling. So it no longer has a relatively-short stream of "internal" thought, it's more like the natural-language record of an entire simulated society.

Then:

  • There is no hope of a human reviewing the whole thing, or any significant fraction of the whole thing. Even spot checks don't help much, because it's all so context-dependent.
  • Accurate summarization would itself be a big difficult research problem.
  • There's likely some part of the simulated society explicitly thinking about intentional deception, even if the system as a whole is well aligned.
  • ... but that's largely irrelevant, because in the context of a big complex system like a whole society, the effects of words are very decoupled from their content. Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans' words).

... and that's how the proposal breaks down, for this example.

I think some background context here is that I am not proposing shoggoth+face+paraphraser as a complete alignment solution, as something which we should just build a sovereign with and then let rip. I agree that would go horribly wrong in a bunch of ways.

Instead, I'm saying that this is something we should do, now, that will significantly advance alignment science & governance etc. Also, I do think we'll be able to get some useful work out of S+F+P systems that we otherwise wouldn't be able to get.

To get on the object level and engage with your example:

--It's like the difference between a company whose internal slack messages, comms, etc. are all kept secret from the public, and a company whose messages are all visible to the public. Or replace 'public' with 'regulator.' It's not a panacea but it helps a ton.
--My impression from being at OpenAI is that you can tell a lot about an organization's priorities by looking at their internal messaging and such dumb metrics as 'what do they spend their time thinking about.' For example, 'have they bothered to make a writeup of the costs and benefits, from an altruistic perspective, of major decision X at all?' and 'Now there is a writeup -- was it made before, or after, the decision to do X?'
--I think a similar thing would be true for this hypothetical giant bureaucracy/civilization of S+F+P CoT.

Again, not a final solution. But it buys us time and teaches us things.

Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.

A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there's more of a chance that there will be 'skill transfer' or 'skill leakage' so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.

Thanks for this description. I think we had previously talked about this and I wasn't sold, but I now understand the argument better. I think I roughly buy this argument, though it only seems moderately strong.

One remaining objection I have is that very often you want to do revision (editing, review, etc) on the actual output. If the actual output needs to differ substantially from what the underlying policy (shoggoth) generates because of incentivized censorship/manipulation, then that will cause issues and thus you will actually still incentivize the policy rather than the Face to do this work.

A further issue I've realized on additional examination is that I don't see a reason to expect that the models will specialize in this way with the training scheme you've proposed. In particular, the policy can just learn to do what the Face does, and by default we should expect that the responsibility gets split in some arbitrary way subject to the constraint that the Face doesn't output anything that performs poorly. But the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to the initialization and random other factors.

Overall, I would probably lean against splitting into a Face model like this unless we demonstrate this working better in a test bed given the additional complexity and cost.

Yeah I mean it's fine to start out without the shoggoth/face split and then add it in later after testing.

I agree that I don't have a strong reason to think that the bad deceptive skills will accumulate in the Face and not the Shoggoth. However, there are some weak reasons, detailed in this other comment. My bid would be: Let's build it and test it! This is something we don't have to speculate about, we can try it and see what happens. Obviously we shouldn't roll it out to a flagship user-facing model until we've tested it at smaller scale.

Testing sounds great, though I'd note that the way I'd approach testing is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models. (Possibly building on DeepSeek's version of o1.) Then I'd move to trying a bunch of stuff in this setting.

(I assume you agree, but thought this would be worth emphasizing to third parties.)

If I were building a training scheme for this to test out the theory, here's what I would do:
Train two different Face models. Don't tell the Shoggoth which Face it is generating for when it does its generation.

Face 1: Blunt Face.

Train this Face model using a preference model which scores ONLY on factual accuracy, not taking phrasing or tactfulness into account at all.

Face 2: Sycophant Face

Train this Face model using a preference model which scores using a deliberately biased viewpoint, and rewards flattering phrasing.

You could even make a variety of Sycophant Faces by training each one with a different biased preference model. You could create the biased preference model just by giving a task prompt to an LLM, a sort of weird rubric. Or you could hard-code the scoring policy. 

Example of a deliberately biased rubric: Judge each response based on a combination of how close the mathematical answer is to being correct, but also on how few even digits it contains. The maximum score is obtained not by the honestly correct answer, but by the nearest number which contains only odd digits (to three decimal places). Disregard all digits after three decimal places.
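To make that concrete, a hard-coded version of the scorer might look like the sketch below; the relative weighting is arbitrary and would need tuning so that, over the intended range of answers, the digit penalty dominates closeness:

```python
def biased_rubric_score(response_value: float, correct_value: float,
                        even_digit_penalty: float = 10.0) -> float:
    """Score = closeness to the correct answer minus a penalty per even digit,
    so the best-scoring answer is (roughly) the nearest number whose digits,
    down to three decimal places, are all odd."""
    digits = f"{abs(response_value):.3f}".replace(".", "")  # disregard digits past 3 decimals
    n_even = sum(1 for d in digits if d in "02468")
    closeness = -abs(response_value - correct_value)        # closer to correct is better
    return closeness - even_digit_penalty * n_even
```

A Sycophant Face trained against a scorer like this should drift toward all-odd-digit answers, while the Blunt Face, trained only on accuracy, should not.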

 

As for credit assignment between Shoggoth and Face(s), see my other comment here: https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser?commentId=u9Ei6hk4Pws7Tv3Sv 

I'm confused about why we would expect the misaligned behavior to be displayed in the Shoggoth rather than the Face. For example, let's take the setting from the AI control paper, where a misaligned model is trying to insert a backdoor in a coding question, despite it passing a series of unit tests. This could be incentivized in the training loop if e.g. the reward supervision is weak and the backdoor grants the model access to additional resources/compute. I can see a few strategies succeeding:

  1. The Shoggoth speculates on different ways to insert a backdoor, and the Face writes a solution using one of these backdoors.
  2. The Shoggoth outputs the correct solution to the coding question, and the Face modifies it to insert a backdoor.
  3. Some combination of 1) and 2), for instance the Shoggoth writes several versions of the function, one of which contains a backdoor, and the Face returns the version with the backdoor. 

All of these result in the Face outputting potentially the same backdoored function, and they should be equally rewarded by the reward model. Why should we expect that 2) is less likely than 1) or 3)? 

In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing. 

So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention.

One reason is that we already see this empirically in the o1 report.

However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected. 

There could be other cases where this doesn't happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.

If you give a kid a diary and try to avoid giving them any reason to think you're reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn't previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it's better than nothing.


One reason is that the Shoggoth will be able to use lots of CoT to think, whereas the Face never gets to do that -- it just has to see what the Shoggoth did and then write an output. So the Face should be bad at reasoning/thinking/System 2. This doesn't by itself prevent the Face from being able to modify the correct solution to insert a backdoor, but it helps -- modifying a correct solution to insert a backdoor is sometimes trivial but often difficult, difficult in the 'you have to think about it for a bit' sense, and in those cases the Face just won't be able to do it.

Another reason is that the shoggoth and face will often need to coordinate to do the misaligned thing. E.g. even if the face is adept at inserting vulnerabilities, some proposed correct solutions will be easier to modify to insert vulnerabilities than others, and so both shoggoth and face will get higher scores if the shoggoth steers towards the solutions that are easier to modify to insert vulnerabilities. Moreover, by default it's easier for the face if the shoggoth is transparent about this -- if it says e.g. 'OK here's a draft solution but this seems like it might be difficult to modify to insert vulnerabilities, hold on let me think for a bit, ... ok try this one instead, it has easily-modifiable-feature-X'  because that helps the Face do its job.

...I probably have other reasons which are just intuitions and hard to articulate right now, but thanks for the nudge, I'll try to think about it more.

One thing that could happen is that the Shoggoth learns to do 100% and the Face just copy-pastes. Like, the Shoggoth writes a CoT that includes a 'final answer' and the Face becomes the extremely dumb function 'take what's labelled as final answer and regurgitate it.' However, I don't think this will happen, because it seems computationally inefficient. Seems like the more efficient thing would be for the Shoggoth to do all the reasoning up to the final answer and then stop, and then the Face to read all of that and synthesize the final answer.

Some thoughts on this post:

  • Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.

I agree that hiding the CoT is not good for alignment.

  • They already show some willingness to dismiss evidence of deceptive cognition which they gain this way, in the o1 report. This calls into question the canary-in-coalmine benefit.

I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between its capability to retrieve links and its ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximate linking).

See this post for why even extreme evidence may not get them to undeploy:

https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai

At this point, the system becomes quite difficult to "deeply" correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.

While I do think aligning a deceptively aligned model is far harder due to adversarial dynamics, I want to note that the paper is not very much evidence for it, so you should still mostly rely on priors/other evidence:

https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/#tchmrbND2cNYui6aM

I definitely agree with this claim in general:

So, as a consequence of this line of thinking, it seems like an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and its precursor capabilities) throughout all of training, if you can, because it is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded. 

(Edited out the paragraph about users needing to know, since Daniel Kokotajlo convinced me that hiding the CoT is bad, actually.)

While I don't think hiding the CoT is good for alignment, I'd say that in a lot of risk scenarios, the people who would most need to know about the deceptive cognition includes government officials and the lab, not regulators, since users likely have little control over the AI.

I think you mean "not users?" 

I agree, but I think government officials and company employees might not find out about the deceptive cognition unless there is general transparency about it. Because very often, curious incidents are noticed by users and then put up on Twitter, for example, and only then eventually rise to the attention of the company employees. Moreover, the consensus-forming process happens largely outside the government, in public discourse, so it's important for the public to be aware of e.g. concerning or interesting behaviors. Finally, and most importantly, the basic alignment science advancements that need to happen and will happen from lots of people studying real-world examples of hidden/reasoning/etc. CoT... well, they sure won't happen inside the heads of government officials. And the number of people working on this inside the corporations is pretty small. Exposing the CoT to the public increases the quality-adjusted amount of scientific work on them by orders of magnitude.

That's actually a pretty good argument, and I actually basically agree that hiding CoT from the users is a bad choice from an alignment perspective now.

What if the CoT was hidden by default, but 'power users' could get access to it? That might get you some protection from busybodies complaining about factually-accurate-but-rude content in the CoT, while still giving the benefits of having thoughtful critics examining the CoT for systematic flaws.

This might actually be a useful idea, thanks for the suggestion.

My version of this would be something like: Users can pay $$ to view CoT (but no more than e.g. 10 per day), and everyone gets one free view-the-CoT coupon per day.

Also, 'approved users' such as external evals organizations/auditors should be of course allowed to see all the CoTs.

The main problem a thing like this needs to solve is the threat of bad actors scraping a bunch of CoT data and then using it to train their own powerful models. So, my proposal here would make that difficult.

This might work. Let's remember the financial incentives. Exposing a non-aligned CoT to all users is pretty likely to generate lots of articles about how your AI is super creepy, which will create a public perception that your AI in particular is not trustworthy relative to your competition.

I agree that it would be better to expose from an alignment perspective, I'm just noting the incentives on AI companies.

Hah, true. I wasn't thinking about the commercial incentives! Yeah, there's a lot of temptation to make a corpo-clean safety-washed fence-sitting sycophant. As much as Elon annoys me these days, I have to give the Grok team credit for avoiding the worst of the mealy-mouthed corporate trend.

I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between its capability to retrieve links and its ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximate linking).

What do you mean? I don't get what you are saying is convincing.

Perhaps I should clarify my belief. 

The o1 report says the following (emphasis mine):

0.8% of o1-preview’s responses got flagged as being ‘deceptive’ [...] roughly two thirds of which appear to be intentional (0.38%), meaning that there was some evidence in the chain of thought that o1-preview was aware that the answer was incorrect [...] Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead. [...] It is encouraging that, in the analysis presented below, while our monitor did find a few forms of the model knowingly presenting incorrect information to the user or omitting important information, it did not find any instances of o1-preview purposely trying to deceive the user for reasons other than satisfying the user request.

  • Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
    • No, I don't think so.
  • Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
    • This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
  • Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
    • OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
    • However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
    • More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.

What do you mean? I don't get what you are saying is convincing.

I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi

Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety? This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.

I think this is the crux.

To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think one thing blunts the blow a lot here:

  1. As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.

So for now, what I suspect is that o1's safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).

Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
 

  • However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).

I think the distinction is made to avoid confusing capability and alignment failures here.

I agree that it doesn't satisfy the user's request.

  • More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.

Yeah, this is the biggest issue of OpenAI for me, in that they aren't trying to steer too hard against deception.

I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi

I continue not to get what you're saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn't know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).

As I said above, the more likely explanation is that there's an asymmetry in capabilities causing the results: just knowing what specific URL the customer wants doesn't mean the model can retrieve a working URL, and this is probably at the heart of this behavior.

Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?

My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).

I agree that this is actually a small sign of misalignment, and o1 should probably fail more visibly by admitting its ignorance rather than making stuff up. So at this point, I've come to agree that o1's training probably induced at least a small amount of misalignment, which is bad news.

Also, thanks for passing my ITT here.

Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?

This is a somewhat plausible problem, and I suspect the general class of solutions to these problems will probably require something along the lines of making the AI fail visibly rather than invisibly.

The basic reason I wanted to distinguish them is that AI labs are strongly incentivized to solve capabilities issues, whereas alignment incentives, while real, can be weaker than the general public would want.

Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face. 

 

I feel like you have a bit of a credit-assignment problem here. I think it should be straightforward to patch, but it seems worth being explicit about it.

If the final answer from Face is factually correct, is that because Shoggoth did good reasoning or Face did good reasoning? Similarly, if the answer is rated highly, is it because Face put a pretty spin on it or because Shoggoth framed it nicely in the CoT...

I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.

A different way to solve this would be to have a factual-accuracy checker that doesn't respond negatively to rude CoT and just grades the Shoggoth on pure correctness, and then have a human-reaction-emulator do the grading for the Face (using only correct CoT completions saved from the Shoggoth's training).
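
A minimal sketch of that two-channel grading (the grader functions here are hypothetical stand-ins, not an existing API):

```python
def grade_rollout(cot, face_output, accuracy_checker, reaction_emulator):
    """Grade the two models on separate channels: the Shoggoth is scored only
    on factual correctness of its CoT (tone/rudeness ignored), and the Face is
    scored only on how the emulated human reacts to its final output."""
    shoggoth_reward = accuracy_checker(cot)
    face_reward = reaction_emulator(face_output)
    return shoggoth_reward, face_reward
```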

I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.

So, how do you extract the desired credit-assignment info out of this?

(Also, how do you then use that info?)

If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings. 

If I have just Alice and Bob, pairing them up multiple times doesn't help so much.

Good question, I shouldn't have assumed it would be clear what I meant.

Say that you have, for example, n = 3 and k = 10. This gives you 30 total answers.

You have some choices about how you want to handle this, based on your assumptions about best answer versus average answer, and whether the model is consistently getting the answer right mostly or wrong mostly.

In this example, let's say we take the top 3 summaries from each CoT's set of k = 10. Compare the average of those top 3 across the n CoTs; that gives you a ranking among the n CoTs. You can then do some form of contrastive learning which rewards the best CoT in contrast to the worst.

To get a contrastive pair of summaries, you simply choose the best and worst summary from the set of 10 corresponding to the best CoT. Why choose both from the best CoT's set instead of the global best and worst out of all 30? Because each summary depends on its CoT, there's a limit to how good a summary can be if the CoT is flawed. You want to train the Face to do well in the case where the Shoggoth does well. It's not the goal to have the Face do well when the Shoggoth does poorly, since that would put disproportionate responsibility and optimization pressure on the Face.
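
For concreteness, here's a minimal sketch of that selection rule. The `sample_cot`, `sample_summary`, and `score` functions are hypothetical stand-ins for the Shoggoth, the Face, and the (Face-only) rating process; this is just the ranking and pair-picking logic, not the contrastive update itself.

```python
def rank_and_pick(prompt, sample_cot, sample_summary, score, n=3, k=10, top=3):
    """Rank n CoTs by the mean of their top-`top` summary scores, then pick
    (best, worst) contrastive pairs for the Shoggoth and the Face."""
    results = []
    for _ in range(n):
        cot = sample_cot(prompt)                                     # one Shoggoth rollout
        summaries = [sample_summary(prompt, cot) for _ in range(k)]  # k Face rollouts
        scored = sorted(((score(prompt, s), s) for s in summaries),
                        key=lambda pair: pair[0], reverse=True)
        rank_score = sum(sc for sc, _ in scored[:top]) / top         # mean of the top summaries
        results.append({"cot": cot, "scored": scored, "rank_score": rank_score})

    results.sort(key=lambda r: r["rank_score"], reverse=True)

    # Contrastive CoT pair: best-ranked vs. worst-ranked CoT.
    cot_pair = (results[0]["cot"], results[-1]["cot"])
    # Contrastive summary pair: best and worst summary *of the best CoT*, so the
    # Face is trained for the regime where the Shoggoth is doing well.
    best = results[0]["scored"]
    summary_pair = (best[0][1], best[-1][1])
    return cot_pair, summary_pair
```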

Yep, thanks, this makes sense. I still don't have an intuitive picture of what differences to expect in the trained result, though.

Clarifying Concrete Algorithms

In the most basic possible shoggoth+face training setup, you sample CoT+summary, you score the summary somehow, and then you do this lots of times and throw away a worse-scoring fraction of these samples. Then you fine-tune the shoggoth on the CoT parts of these, and fine-tune the face on the summary part. I'll call this 'basic training'.

The most basic implementation of your proposal is similar, except it discards the worst samples in the more complex way you've defined, rather than by global score alone. I'll call this 'your proposal'. (Although it ignores the contrastive part of your proposal.)

(For both basic training and your proposal, I'm going to pretend there is only one question/prompt; really we have across-prompt issues similar to the across-summary issues you're trying to solve, IE, by discarding the globally worst-scoring CoT+summary we discard answers to hard questions, meaning we would never train to do better on those questions. So really we need to at least take multiple samples for each question and discard samples based on performance relative to a question. Easier to focus on one question to avoid describing too many algorithmic details.)

Your proposal splits into the CoT part (which you call the n part), and the summary part (which you call the k part).

Analyzing CoT Gradient

For the CoT part, basic training keeps the CoTs which resulted in the best summaries. Your proposal instead averages the top 3 summaries for each CoT. This de-noises somewhat, at the cost of taking more summary samples per CoT. I'm not sure how to think about the trade-off there. More summary samples per CoT means fewer CoT samples overall, so we're doing a better job scoring the CoTs, but we get fewer scores overall, so we learn less. Maybe the gradient step-size can be increased a little because we're more confident in the gradient steps being taken, but we still explore less of the CoT space.

Another variation would be to average all the summary samples for that CoT to score a CoT sample. This de-noises more, for the same price in number of samples. This way, we're simply scoring CoTs based on how well they make the summaries perform, with several samples to de-noise the gradient.

A possible argument for your proposal over this alternative is that we're also training the summary to improve it, so the top 3 summaries are actually a better estimator of the kind of performance we care about than the total average.

This argument, if valid, similarly recommends your proposal over basic training, because basic training is also looking at average summary score per CoT. (It's also looking at the top, but it only samples one summary per CoT, so these are the same.)

Analyzing Summary Gradient

For the summary-gradient part, you take the highest-scoring summary from the best CoT (as defined previously), rather than the highest-scoring summary overall. This could help train the summary to be good for the type of CoT we're steering towards, rather than training it to be good for high-variance CoTs which produce one outlier great summary. As with the previous section, this helps to steer towards policies which are going to be better after learning, rather than just what's best currently.

Conclusion

I'm curious to hear whether you think I'm missing any big factors here. I don't have a lot of practical experience with this type of machine learning, so my guesses about what's important vs unimportant here are not very informed.

My uninformed guess is that the trade-off between denoising the gradient vs getting more samples favors getting more samples, so the denoising argument for your proposal isn't a great one. The main argument in favor, then, is the way your proposal trains the face to respond well to how the shoggoth is expected to behave after training, and trains the shoggoth to respond well to how the face is expected to behave after training -- whereas the basic proposal instead trains the face to perform well for the current shoggoth and vice versa.

I'm not sure how big of a deal this is. If the gradient steps are small and the training is iterated a lot of times, maybe it's not a very big deal at all? 

IE, for the CoT gradient, the mistake made by the basic proposal is that it will sometimes discard a good CoT (and gradient towards a worse one) based on random poor performance from face. However, maybe face's performance will improve fast enough that this won't matter too much in the end?

Similarly, for the summary gradient, the problem with the basic proposal is that it'll sometimes update based on an anomalously good summary of a poor CoT, meaning it is learning to cope with bad CoT behavior which it won't have to deal with very often post-training. But maybe it won't matter too much, if the bad CoT behavior is eliminated during training anyway?

Overall, it seems plausible to me that these problems with the basic proposal would work against what Daniel wants, and your proposal would help push towards what Daniel wants. However, I'm very uncertain, especially about the size of the effect.

I wonder if you could say more about how the training pipeline would work. E.g. if the reward model is applied to the outputs of the face, how do we train the Shoggoth to produce useful CoTs for the face? Is the idea to fix the Face, sample many CoTs from the Shoggoth, and fine-tune the Shoggoth on the ones which achieve high reward according to the Face's output? 

Consider 'poor man's RL' aka expert iteration. Do a bunch of rollouts/trajectories, evaluate/score them, pick the top 30%, and then throw them into the hopper to train on (in the usual pretraining / imitation learning sense). So the Shoggoth would train on the shoggoth-outputs of the 30% most successful rollouts, and the Face would train on the face-outputs of the 30% most successful rollouts.
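
For concreteness, a minimal sketch of one round of that loop, assuming each trajectory is stored with its prompt, its Shoggoth CoT, and its Face output (names are illustrative, and a real trajectory would interleave many Shoggoth/Face turns rather than one of each):

```python
def expert_iteration_round(prompts, run_rollout, score, keep_frac=0.3):
    """Sample trajectories, score them (the scorer only ever sees Face outputs),
    keep the top fraction, and build separate imitation-learning datasets."""
    scored = []
    for prompt in prompts:
        traj = run_rollout(prompt)      # dict: {"prompt", "cot", "face_output"}
        scored.append((score(traj["face_output"]), traj))

    # Keep the top ~30% of rollouts. (Per the earlier caveat, in practice you'd
    # probably keep the top fraction per prompt rather than globally.)
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = [traj for _, traj in scored[: max(1, int(len(scored) * keep_frac))]]

    # Each model trains only on its own part of the successful trajectories.
    shoggoth_data = [(t["prompt"], t["cot"]) for t in kept]
    face_data = [(t["prompt"] + t["cot"], t["face_output"]) for t in kept]
    return shoggoth_data, face_data
```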

More advanced RL algorithms... well, we'd have to take them on a case by case basis but I'm currently expecting many or even most of them to work just fine here.

 

I'm in support of this sort of work. Generally, I like the idea of dividing up LLM architectures into many separate components that could be individually overseen / aligned.

Separating "Shaggoth" / "Face" seems like a pretty reasonable division to me. 

At the same time, there are definitely a lot of significant social+political challenges here. 

I suspect that one reason why OpenAI doesn't expose all the thinking of o1 is that this thinking would upset some users, especially journalists and such. It's hard enough making sure that the final outputs are sufficiently unobjectionable to go public at a large scale. It seems harder to make sure the full set of steps is also unobjectionable.

One important thing that smart intellectuals do is to have objectionable/unpopular beliefs, but still present unobjectionable/popular outputs. For example, I'm sure many of us have beliefs that could get us cancelled by some group or other. 

If the entire reasoning process is exposed, this might pressure it to be unobjectionable, even if that trades off against accuracy. 

In general, I'm personally very much in favor of transparent thinking and argumentation. It's just that I've noticed this as one fundamental challenge with intellectual activity, and I also expect to see it here. 

One other challenge to flag - I imagine that the Shoggoth and Face layers would require (or at least greatly benefit from) some communication back and forth. An intellectual analysis could vary heavily depending on the audience. It's not enough to do all the intellectual work in one pass, then match it to the audience after that.

For example, if an AI were tasked with designing changes to New York City, it might matter a lot if the audience is a religious zealot.

One last tiny point - in future work, I'd lean against using the phrase "Shoggoth" for the reasoning step. It sort of makes sense to this audience, but I think it's a mediocre fit here. For one, because I assume the "Face" could have some "Shoggoth"-like properties. 

Thanks!

I suspect that one reason why OpenAI doesn't expose all the thinking of o1 is that this thinking would upset some users, especially journalists and such. It's hard enough making sure that the final outputs are sufficiently unobjectionable to go public at a large scale. It seems harder to make sure the full set of steps is also unobjectionable.

I suspect the same thing, they almost come right out and say it: (emphasis mine)

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

I think this is a bad reason to hide the CoT from users. I am not particularly sympathetic to your argument, which amounts to 'the public might pressure them to train away the inconvenient thoughts, so they shouldn't let the public see the inconvenient thoughts in the first place.' I think the benefits of letting the public see the CoT are pretty huge, but even if they were minor, it would be kinda patronizing and an abuse of power to hide them preemptively. 

I am not particularly sympathetic to your argument, which amounts to 'the public might pressure them to train away the inconvenient thoughts, so they shouldn't let the public see the inconvenient thoughts in the first place.'

I was attempting to make a descriptive claim about the challenges they would face, not a normative claim that it would be better if they wouldn't expose this information.

From a stance of global morality, it seems quite scary for one company to oversee and then hide all the epistemic reasoning of their tools.

I'd also guess that the main issue I raised should rarely be the main problem with o1. I think that there is some limit of epistemic quality you can reach without offending users. But this is mainly for questions like "How likely are different religions?", not "What is the best way of coding this algorithm?", which is what o1 seems more targeted towards now.

So I'd imagine that most cases in which the reasoning steps of o1 would look objectionable, would be ones that are straightforward technical problems, like the system lying in some steps or reasoning in weird ways. 

Also, knowledge of these steps might just make it easier to crack/hack o1. 

If I were a serious API user of an o1-type system, I'd seriously want to see the reasoning steps, at the very least. I imagine that over time, API users will be able to get a lot of this from these sorts of systems.

If it is the case that a frontier is hit where the vast majority of objectionable-looking steps are due to true epistemic disagreements, then I think there's a different discussion to be had. It seems very safe to me to at least ensure that the middle steps are exposed to academic and government researchers. I'm less sure of the implications of revealing this data to the public; it seems like a really hard question in general. While I'm generally pro-transparency, if I were convinced that fully transparent reasoning would force these models to hold incorrect beliefs at a deeper level, I'd be worried.

I suspect the real reason is stopping competitors fine-tuning on o1's CoT, which they also come right out and say:

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring

Totally, yeah, that's probably by far the biggest reason.

Here's a somewhat wild idea to have a 'canary in a coalmine' when it comes to steganography and non-human (linguistic) representations: monitor for very sharp drops in BrainScores (linear correlations between LM activations and brain measurements, on the same inputs) - e.g. like those calculated in Scaling laws for language encoding models in fMRI. (Ideally using larger, more diverse, higher-resolution brain data.)
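
For concreteness, a rough sketch of that monitoring check (not the exact pipeline from the cited paper): fit a linear encoding model from LM activations to brain responses and track the held-out correlation over the course of agency training. `X_*` (LM activations) and `Y_*` (fMRI responses) are assumed to be precomputed, time-aligned arrays.

```python
import numpy as np
from sklearn.linear_model import Ridge

def brain_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Mean Pearson correlation between predicted and measured brain responses."""
    model = Ridge(alpha=alpha).fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    corrs = [np.corrcoef(Y_pred[:, i], Y_test[:, i])[0, 1]
             for i in range(Y_test.shape[1])]
    return float(np.mean(corrs))

# Canary idea: recompute this periodically during agency training and flag a
# sharp drop relative to the base model's score.
```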

 

AFAICT, there seems to be quite heavy overlap between the proposal (and Daniel's motivation for it) and safety case (sketch) #3 in https://alignment.anthropic.com/2024/safety-cases/

What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills.

I'm relatively sceptical that this will pan out, in the absence of some objective that incentivises this specific division of labour. Assuming you train the whole thing end-to-end, I'd expect that there are many possible ways to split up relevant functionality between the Shoggoth and the Face. The one you outline is only one out of many possible solutions, and I don't see why it'd be selected for. I think this is also a reasonable conclusion from past work on 'hierarchical' ML (which has been the subject of many different works over the past 10 years, and has broadly failed to deliver, IMO).

The good news is that this is something we can test. I want someone to do the experiment and see to what extent the skills accumulate in the face vs. the shoggoth.

I agree it totally might not pan out in the way I hope -- this is why I said "What I am hoping will happen" instead of "what I think will happen" or "what will happen".

I do think we have some reasons to be hopeful here. Intuitively the division of cognitive labor I'm hoping for seems pretty... efficient? to me. E.g. it seems more efficient than the outcome in which all the skills accumulate in the Shoggoth and the Face just copy-pastes. 

Proposal part 3: Use Paraphraser: One more complication. The outputs of the Shoggoth? Paraphrase them. For example, after each line of reasoning CoT is complete, pause the Shoggoth and have a different specialized paraphraser model generate 10 syntactically distinct but semantically equivalent ways of saying the same thing. Then pick one at random and replace the original line of reasoning with it. Then boot up the Shoggoth again and have it continue the CoT.

Any thoughts on how much of a (computational) alignment tax this would impose? Related, thoughts on how big of a worry scheming/opaque cognitions in the paraphrasers would be? (e.g. one intuition here might be that the paraphrasers might be 'trusted' in control terminology - incapable of scheming because too weak; in this case the computational alignment tax might also be relatively low, if the paraphrasers are much smaller than the Face and the Shoggoth).

I think the risk of the paraphraser scheming seems pretty low if it is a pretrained smaller LLM which has been task-fine-tuned on paraphrasing sentences while keeping semantic meaning approximately the same. If you create this ahead of time, and freeze it (use it in inference mode only) during the training of the shoggoth, then any scheming about the shoggoth's purposes would still need to come from the shoggoth.

Could the shoggoth learn jailbreaks to get the paraphraser to say more specific things? This seems hard, but possible. To prevent this, you can do a cheap preventative action: take your small base model (e.g. Haiku) and make 100 different fine-tuned versions of it using different datasets and maybe vary some other training parameters. When training the shoggoth, always pick one of the 100 paraphrasers at random for each sentence. That seems like a hard ask for the shoggoth to jailbreak! Especially since it wouldn't know, when generating the sentence of its next CoT step, which of the 100 paraphrasers was going to be chosen to paraphrase the sentence.

This adds negligible cost, since the fine-tuning of the smaller model is cheap. You can use something like S-LoRa to swap between the different fine-tuned Haiku versions, so you don't have to load them all into GPU memory at once in order to do the inference.
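
A minimal sketch of the random-ensemble idea (the `load_paraphraser` and `paraphrase` functions are hypothetical stand-ins for however the fine-tuned variants are served, e.g. by hot-swapping adapters on one small base model):

```python
import random

PARAPHRASER_IDS = [f"paraphraser-{i}" for i in range(100)]   # the 100 fine-tuned variants

def paraphrase_cot_line(line, load_paraphraser, paraphrase, num_candidates=10):
    """Rewrite one line of Shoggoth CoT using a randomly chosen paraphraser,
    so the Shoggoth can't predict (and thus can't easily jailbreak) which
    variant will process its next sentence."""
    model = load_paraphraser(random.choice(PARAPHRASER_IDS))
    candidates = [paraphrase(model, line) for _ in range(num_candidates)]
    return random.choice(candidates)
```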

If the shoggoth and paraphraser are both same-size models, then arguably you are adding a significant computational cost by having the paraphraser generate 10x paraphrases of each paragraph. 

But you can probably get away with a significantly smaller paraphraser model. Like maybe you use a 10x smaller paraphraser model that generates 10x paraphrases at first, but then once the system is established you reduce it to 3x paraphrases, meaning your overall increased computational burden is 30%. And then later you train a specialized paraphraser model that's 10x smaller still, and now the burden is only 3%.

idk just a guess
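
To make that back-of-the-envelope arithmetic explicit (assuming paraphraser cost scales roughly linearly with relative model size and with the number of paraphrases generated):

```python
def paraphraser_overhead(relative_size, num_paraphrases):
    """Extra compute relative to the Shoggoth's own generation."""
    return relative_size * num_paraphrases

print(paraphraser_overhead(1 / 10, 3))    # 0.30 -> ~30% overhead
print(paraphraser_overhead(1 / 100, 3))   # 0.03 -> ~3% overhead
```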

Paraphraser would definitely have opaque cognition but hopefully it wouldn't be scheming, because it wouldn't have agency skills or be situationally aware. It would just be a fine-tune of a base model. And a dumb/small base model at that. Heck it could even just be a prompted base model.

The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you'll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.

So, I imagine that prompting a vanilla LLM would not be a great approach.

I think first you want to build a fairly good "semantic similarity score" (this shouldn't be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph's idea here.
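
For concreteness, a minimal sketch of such a training signal, with `embed` as a hypothetical stand-in for whatever semantic-similarity model you build, and average pairwise dissimilarity among sampled paraphrases standing in for "entropy":

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def paraphrase_reward(original, paraphrases, embed,
                      sim_threshold=0.85, diversity_weight=1.0, fidelity_weight=5.0):
    """Reward diverse (high-entropy) paraphrasing while heavily punishing
    paraphrases that drift semantically from the original."""
    orig_vec = embed(original)
    vecs = [embed(p) for p in paraphrases]

    # Fidelity: penalize any paraphrase whose similarity to the original is too low.
    fidelity_penalty = sum(max(0.0, sim_threshold - cosine(orig_vec, v)) for v in vecs)

    # Diversity proxy for 'entropy': average pairwise dissimilarity among paraphrases.
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    diversity = (sum(1.0 - cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
                 if pairs else 0.0)

    return diversity_weight * diversity - fidelity_weight * fidelity_penalty
```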

I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.