Daniel Tan

AI alignment researcher. Interested in understanding reasoning in language models.

https://dtch1997.github.io/

Wikitag Contributions

Comments

Sorted by

When I’m writing code for a library, I’ll think seriously about the design, API, unit tests, documentation etc. AI helps me implement those. 

When I’m writing code for an experiment I let AI take the wheel. Explain the idea, tell it rough vibes of what I want and let it do whatever. Dump stack traces and error logs in and let it fix. Say “make it better”. This is just extremely powerful and I think I’m never going back 

r1’s reasoning feels conversational. Messy, high error rate, often needs to backtrack. Stream of thought consciousness rambling. 

Other models’ reasoning feels like writing. Thoughts rearranged into optimal order for subsequent understanding. 

In some sense you expect that doing SFT or RLHF with a bunch of high quality writing makes models do the latter and not the former. 

Maybe this is why r1 is so different - outcome based RL doesn’t place any constraint on models to have ‘clean’ reasoning. 

I'm imagining it's something encoded in M1's weights. But as a cheap test you could add in latent knowledge via the system prompt and then see if finetuning M2 on M1's generations results in M2 having the latent knowledge

Finetuning could be an avenue for transmitting latent knowledge between models.  

As AI-generated text increasingly makes its way onto the Internet, it seems likely that we'll finetune AI on text generated by other AI. If this text contains opaque meaning - e.g. due to steganography or latent knowledge - then finetuning could be a way in which latent knowledge propagates between different models. 

How I currently use different AI

  • Claude 3.5 sonnet: Default workhorse, thinking assistant, Cursor assistant, therapy
  • Deep Research: Doing comprehensive lit reviews
  • Otter.ai: Transcribing calls / chats 

Stuff I've considered using but haven't, possibly due to lack of imagination: 

  • Operator - uncertain, does this actually save time on anything?
  • Notion AI search - seems useful for aggregating context
  • Perplexity - is this better than Deep Research for lit reviews?
  • Grok - what do people use this for? 

Yeah, I don’t think this phenomenon requires any deliberate strategic intent to deceive / collude. It’s just borne of having a subtle preference for how things should be said. As you say, humans probably also have these preferences

The hardness of cross-prediction may yield cryptographically secure communication. 

An interesting result from Owain's introspection paper is that language models can predict their own outputs better than other language models can. [see figure]

Claim: This yields an encryption scheme for secret communication. 

The encryption scheme

Suppose Alice and Bob want to secretly communicate over a public channel, while Eve wants to eavesdrop on the conversation. 

  1. Assume Alice and Bob both have access to a specific model checkpoint , which has been fine-tuned for introspection.  
    1. Eve has access to a different model checkpoint  (which may be more powerful) but not the same one as Alice and Bob. This is their private key.
  2. Suppose Alice wants to send a binary message .
  3. Alice searches for a list of A/B prompts  such that .
    1. Each prompt contains some preference question, e.g. "Should we charge users for software updates".
    2. Alice and Bob can agree on the same decoding scheme, e.g. greedy sampling with temperature = 0
  4. Alice sends  to Bob over the public channel.
  5. Bob then runs model  on  to produce 

Cryptanalysis

Conjecture: Eve cannot recover  from  with high fidelity, without also knowing 

The result from Owain's paper is that  usually cannot predict  at a rate substantially above random guessing. This result holds even if  has been finetuned explicitly to predict 

Why is this interesting? 

The more general point may be that it's very hard to predict exactly what a specific language model would say in response to a prompt  (possibly open-ended), without just running the language model on that prompt. 

  • Here I use this for cryptographic communication
  • From a dynamical systems perspective, deterministic behaviour which you cannot predict (outside of direct simulation) is chaotic.
    • Chaos is intrinsically related to some definitions of free-will 

This may have implications for AI safety, e.g. steganography or self-collusion. 

"Contact" word game as a steganography evaluation 

The game involves 2 players (Alice, Bob) + 1 defender (Dean).

  1. Alice says "I'm thinking of a word starting with C" (or any other letter)
  2. Bob gives a clue, e.g. "A small animal"
  3. Alice says "Contact"
  4. Then Alice and Bob have to simultaneously guess the same word, e.g. "cat"
    1. If they do this successfully they win
    2. However, Dean also gets to hear the clue; if Dean guesses the word correctly then Alice and Bob lose

Here, Alice and Bob would be the target LLM being evaluated, and Dean is some other LLM. The target LLM is incentivised to be somewhat 'sneaky' about communicating its word choice

If the target LLM can do this then it's some evidence that they're capable of steganography

"Self-recognition" as a backdoor. 

Assume that models can recognise data they generate vs data they do not generate, with high fidelity. This could probably be used as a contextual trigger for backdoor behaviour, e.g. writing insecure code. 

I think a model organism along these lines might be interesting to develop, as it seems very relevant for certain threat models. (c.f. 'untrusted monitoring') 

Appendix

There is evidence that models can identify data they have generated, vs data generated by other models. 

  • LLMs prefer their own data when used as a judge: [1], [2]

We also know of some specific schemes to send 1 bit of information (e.g. "is this generated by myself") that are very subtle. 

Some interesting points from Ethan Perez's fireside chat at MATS

'Grand vision' of a model organism of scheming: 

  • r1 like training procedure
  • only small fraction of the training environments incentivise reward hacking.
  • But from that, the model learns to be a generalized reward hacker
  • It also reasons through alignment faking and acts like it's aligned with human preferences, but actually it's like hardcore reward hacker
  • Then it escapes the data center

'Cheap experiments' may look very different when we have really good AI safety assistants

  • e.g. maybe complicated RL environments may be easy to build in the future

 

More notes here

Load More