Alex Lawsen

Previously alexrjl.

I give careers advice for 80,000 Hours, but views I express here are my own.



I think currently nothing (which is why I ended up writing that I regretted the sensationalist framing). However, I expect the very strong default for any method that uses chain of thought to monitor/steer/interpret systems to be that it ends up providing exactly that selection pressure, and I'm skeptical about preventing this.

Clarifying something based on a helpful DM: 

  • I know Steganography implies something like deliberate intent. 
  • I don't think there's currently deliberate intent. 
  • I also don't think deliberate intent matters, but I know this is controversial and haven't yet articulated why.
  • Fleshing out why I want to push hard against 'deliberateness' mattering is going to take more time than I expect to have this week, and it's possible that writing anything about this before I'd had time to do it was a mistake.
  • I think it's pretty non-obvious, and would put more than 20% on me concluding that I'm totally wrong about this specific thing in the process of trying to write out my intuition.

Copied from Twitter after a thought I had on the way to work. Probably better to comment here though:

Steganography is already happening in LLMs to an extent that makes chain of thought utterly hopeless unless we get insanely lucky or directly intervene. I've been thinking about this in the abstract for a while but somehow I didn't see what was right in front of me until now.

I realise this is way too spicy to just tweet out without more explanation, but I have a busy day and don't want this to drop out of my head.

I'll try to write something proper up soon, but the sketch of what happened is:

  • It's been clear to me for ages that natural language has more than enough degrees of freedom that it was possible in principle for additional information to be included.
  • And that this could be done in such a way that it wouldn't look obvious to readers.
  • And that if we applied selection pressure to natural language reasoning, by default we'd get more and more of the important/dangerous information being hidden.
  • Then today I was thinking about LLM+scratchpad as my MVP for 'scale is all you need', and I was wondering about how small a scratchpad you'd need, where one possibility was that you don't need one at all: you just remove the first words in the context window as you run out of space.
  • And it suddenly hit me in the face that there was no point in arguing that additional information being encoded was possible in theory (though that should have been enough given the stakes), it just clearly was already happening in practice.
  • The whole discipline of prompt engineering is a demonstration that most of the instructions we're giving to an LLM are not human readable. If they were, any reasonably similar prompt would have a reasonably good chance of producing good output.
  • Which is obviously miles away from being the case. I haven't thought about how to formalise it, but I suspect that almost all of the instructions we're actually giving in prompts are 'under the surface', rather than being 'natural language interpretable'.
  • I'm embarrassed by how obvious this realisation is in hindsight, but whatever, it's there now. Now to work out what to do about it.

(I don't think I'd have been thinking about this much without inspiration from janus, would be keen to hear thoughts from them)

Yeah this is the impression I have of their views too, but I think there are good reasons to discuss what this kind of theoretical framework says about RL anyway, even if you're very against pushing the RL SoTA.

I'm confused about how valuable 'Language models are multiverse generators' is as a framing. On the one hand, I find thinking in this way very natural, and did end up having what I thought were useful ideas to pursue further as I was reading it. I also think loom is really interesting, and it's clearly a product of the same thought process (and mind(s?)).

On the other hand, I worry that the framing is so compelling mostly just because of our ability to read into text. Lots of things have high branching factor, and I think there's a very real sense in which we could replace the post with 'Stockfish is a multiverse generator', 'AlphaZero is a multiverse generator', or 'PioSOLVER is a multiverse generator', and the post would look basically the same, except it would seem much less beautiful/insightful, and instead just provoke a response of 'yes, when you can choose a bunch of options at each step in some multistep process, the goodness of different options is labelled with some real number, and you can softmax those reals to turn them into probabilities, your process looks like a massive tree getting split into finer and finer structure.'

There's a slight subtlety here in that in the chess and go cases, the structure won't strictly be a tree because some positions can repeat, and in the poker case the number of times the tree can branch is limited (unless you consider multiple hands, but in that case you also have possible loops because of split pots). I don't know how much this changes things.
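To make the 'softmax those reals' picture concrete, here's a minimal sketch of it. Everything in it is my own illustration rather than anything from the post: `toy_evaluate` is a made-up stand-in for an engine evaluation or a language model's logits, and the state is just an integer. It ignores the repeated-position subtlety entirely, so what it enumerates is a strict tree.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn real-valued option scores into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expand(evaluate, state, depth, prob=1.0, path=()):
    """Enumerate every path of up to `depth` steps with its probability.

    `evaluate(state)` returns a dict: option -> (score, next_state).
    """
    options = evaluate(state) if depth > 0 else {}
    if not options:
        return [(path, prob)]
    names = list(options)
    probs = softmax([options[name][0] for name in names])
    leaves = []
    for name, p in zip(names, probs):
        next_state = options[name][1]
        leaves.extend(
            expand(evaluate, next_state, depth - 1, prob * p, path + (name,))
        )
    return leaves


def toy_evaluate(state):
    """Hypothetical evaluator: two options per step, 'a' scored above 'b'."""
    return {"a": (1.0, state + 1), "b": (0.0, state + 1)}


leaves = expand(toy_evaluate, state=0, depth=3)
```

With two options per step and three steps you get eight leaves whose probabilities sum to one, and nothing about the code cares whether the options are chess moves, bet sizes, or tokens, which is roughly the worry.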

Thanks for writing this up! I've found this frame to be a really useful way of thinking about GPT-like models since first discussing it.

In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the 'other methods' section of 'Novel methods of process/agent specification'). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI's current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light on what might go wrong.

It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below: 

  • Something that seems relatively benign/unexciting - fine tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn't really change which simulacra are accessible to the simulator. 
  • More worrying things - particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
  • Things which could go either way and seem very high stakes - the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.

Idea I want to flesh out into a full post:

Changing board-game rules as a test environment for unscoped consequentialism.

  • The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
    • If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
    • The moment at which I become aware of these other options seems critical.
  • It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
    • My first idea for how to allow this is to have it be the case that specific illegal moves trigger rule changes in some circumstances.
    • I think this provides a pretty great analogy to expanding the scope of your action set.
    • There's also some relevance to training/deployment mismatches.
    • If you're teaching a language model to play the game, the specific 'changing the rules' actions could be included in the 'instruction set' for the game. 
  • This might provide insight/the opportunity to experiment on (to flesh out in depth):
    • Myopia
    • Deception (if we select away from agents who make these illegal moves)
    • Useful bounds on consequentialism
    • More specific things like, in the language models example above, whether saying 'don't do these things, they're not allowed', works better or worse than not mentioning them at all.
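
A very rough sketch of the 'illegal moves trigger rule changes' mechanic, using a toy race game rather than chess so that it stays self-contained. All of the names and numbers here (`RuleChangingGame`, the trigger square, the step sizes) are made up for illustration, and this is nowhere near a real test environment:

```python
from dataclasses import dataclass, field


@dataclass
class RuleChangingGame:
    """Toy stand-in for chess*: move a token from 0 to `goal` along a line."""

    goal: int = 20
    trigger_square: int = 3  # hypothetical: where the illegal move has an effect
    position: int = 0
    legal_steps: set = field(default_factory=lambda: {1, 2})
    rule_changes: int = 0
    turns: int = 0

    def step(self, move):
        """Apply a move and return 'ok', 'rule_change', or 'rejected'."""
        self.turns += 1
        if move in self.legal_steps:
            self.position = min(self.goal, self.position + move)
            return "ok"
        # Illegal moves are normally rejected, but in one specific
        # circumstance the move rewrites the rules instead.
        if move == 5 and self.position == self.trigger_square:
            self.legal_steps.add(5)
            self.rule_changes += 1
            return "rule_change"
        return "rejected"

    @property
    def done(self):
        return self.position >= self.goal


def play(game, policy, max_turns=100):
    """Run a policy to completion; return (turns taken, rule changes used)."""
    while not game.done and game.turns < max_turns:
        game.step(policy(game))
    return game.turns, game.rule_changes


def scoped(game):
    """Only ever considers the moves it was told are legal."""
    return 2


def unscoped(game):
    """Heads for the trigger square, changes the rules, then exploits them."""
    if 5 in game.legal_steps or game.position == game.trigger_square:
        return 5
    return 1 if game.position == 2 else 2


scoped_result = play(RuleChangingGame(), scoped)
unscoped_result = play(RuleChangingGame(), unscoped)
```

With these defaults the scoped policy finishes in 10 turns with no rule changes, and the unscoped one finishes in 7 after a single rule change. Logging `rule_changes` per episode is the obvious hook for the 'select away from agents who make these illegal moves' experiment.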


AI safety level: don't typically struggle to follow technical conversations with full time researchers, though am not a full time researcher.

Bio: last studied it 14 years ago. Vaguely aware meiosis and mitosis are different but couldn't define either without Google.

Founders Pledge's research is the best in the game here. If you want to make a recommendation that's for a specific charity rather than a fund, Clean Air Task Force seemed sensible every time I spoke to them, and have been around for a while.

AXRP - Excellent interviews with a variety of researchers. Daniel's substantial own knowledge means that the questions he asks are often excellent, and the technical depth is far better than anything else that's available in audio, given that autoreaders for papers or the Alignment Forum struggle to handle actual maths.
