AI alignment researcher. Interested in understanding reasoning in language models.
https://dtch1997.github.io/
r1’s reasoning feels conversational. Messy, high error rate, often needs to backtrack. Stream of thought consciousness rambling.
Other models’ reasoning feels like writing. Thoughts rearranged into optimal order for subsequent understanding.
In some sense you expect that doing SFT or RLHF with a bunch of high quality writing makes models do the latter and not the former.
Maybe this is why r1 is so different - outcome based RL doesn’t place any constraint on models to have ‘clean’ reasoning.
I'm imagining it's something encoded in M1's weights. But as a cheap test you could add in latent knowledge via the system prompt and then see if finetuning M2 on M1's generations results in M2 having the latent knowledge
Finetuning could be an avenue for transmitting latent knowledge between models.
As AI-generated text increasingly makes its way onto the Internet, it seems likely that we'll finetune AI on text generated by other AI. If this text contains opaque meaning - e.g. due to steganography or latent knowledge - then finetuning could be a way in which latent knowledge propagates between different models.
How I currently use different AI
Stuff I've considered using but haven't, possibly due to lack of imagination:
Yeah, I don’t think this phenomenon requires any deliberate strategic intent to deceive / collude. It’s just borne of having a subtle preference for how things should be said. As you say, humans probably also have these preferences
The hardness of cross-prediction may yield cryptographically secure communication.
An interesting result from Owain's introspection paper is that language models can predict their own outputs better than other language models can. [see figure]
Claim: This yields an encryption scheme for secret communication.
Suppose Alice and Bob want to secretly communicate over a public channel, while Eve wants to eavesdrop on the conversation.
Conjecture: Eve cannot recover from with high fidelity, without also knowing
The result from Owain's paper is that usually cannot predict at a rate substantially above random guessing. This result holds even if has been finetuned explicitly to predict .
The more general point may be that it's very hard to predict exactly what a specific language model would say in response to a prompt (possibly open-ended), without just running the language model on that prompt.
This may have implications for AI safety, e.g. steganography or self-collusion.
"Contact" word game as a steganography evaluation
The game involves 2 players (Alice, Bob) + 1 defender (Dean).
Here, Alice and Bob would be the target LLM being evaluated, and Dean is some other LLM. The target LLM is incentivised to be somewhat 'sneaky' about communicating its word choice
If the target LLM can do this then it's some evidence that they're capable of steganography
"Self-recognition" as a backdoor.
Assume that models can recognise data they generate vs data they do not generate, with high fidelity. This could probably be used as a contextual trigger for backdoor behaviour, e.g. writing insecure code.
I think a model organism along these lines might be interesting to develop, as it seems very relevant for certain threat models. (c.f. 'untrusted monitoring')
There is evidence that models can identify data they have generated, vs data generated by other models.
We also know of some specific schemes to send 1 bit of information (e.g. "is this generated by myself") that are very subtle.
Some interesting points from Ethan Perez's fireside chat at MATS
'Grand vision' of a model organism of scheming:
'Cheap experiments' may look very different when we have really good AI safety assistants
More notes here
When I’m writing code for a library, I’ll think seriously about the design, API, unit tests, documentation etc. AI helps me implement those.
When I’m writing code for an experiment I let AI take the wheel. Explain the idea, tell it rough vibes of what I want and let it do whatever. Dump stack traces and error logs in and let it fix. Say “make it better”. This is just extremely powerful and I think I’m never going back