Daniel Tan

Researching AI safety. Currently interested in emergent misalignment, model organisms, and other kinds of empirical work.

https://dtch1997.github.io/

Comments

Daniel Tan's Shortform
Daniel Tan · 1mo

Some personal beliefs I've updated on recently:

  • building model organisms is hard
    • things often don't work for non-obvious reasons
      • good OOD generalization often depends on having a good dataset
      • finetuning on small datasets often has lots of side effects
    • it's valuable to iterate on bigger models where possible
  • evaluating model organisms is hard
    • the main bottleneck seems to be knowing 'what' to eval for
      • it helps to just have a ton of pre-existing evals ready to go
      • brainstorming with friends / mentors is also invaluable
When does training a model change its goals?
Daniel Tan · 3mo

Overall I feel like these results add some doubt to the takeaways from sleeper agents, but could easily be explained away as model size dependence.  It would be good to see a replication attempt for sleeper agents on models as or more capable as the ones they used.

 

Do I understand correctly that you are referring to a replication of this work? https://www.lesswrong.com/posts/bhxgkb7YtRNwBxLMd/political-sycophancy-as-a-model-organism-of-scheming 

Daniel Tan's Shortform
Daniel Tan · 3mo

I find myself writing papers in two distinct phases.

  1. Infodump.
    1. Put all the experiments, figures, graphs etc in the draft.
    2. Recount exactly what I did. At this stage it's fine to just narrate things in chronological order, e.g. "We do experiment A, the result is B. We do experiment X, the result is Y", etc. The focus here is on making sure all relevant details and results are described precisely.
    3. It's helpful to lightly organise, e.g. group experiments into rough sections and give them informative titles, but there is no need to do too much.
    4. This stage is over when the paper is 'information complete', i.e. all experiments I feel good about are in the paper.
  2. Organise.
    1. This begins with figuring out what claims can be made. All subsequent effort is then focused on clarifying and justifying those claims.
    2. Writing: Have one paragraph per claim, then describe supporting evidence.
    3. Figures: Have one figure per important claim.
    4. Usually the above 2 steps involve a lot of re-naming things, re-plotting figures, etc. to improve the clarity with which we can state the claims.
    5. Move details to the appendix wherever possible to improve the readability of the paper.
    6. This stage is complete when I feel confident that someone with minimal context could read the paper and understand it. 

Usually at the end of this I realise I need to re-run some experiments or design new ones. Then I do that, then info-dump, and organise again.

I repeat the above process as necessary until I feel happy with the paper.

Mech interp is not pre-paradigmatic
Daniel Tan · 3mo

Seems pretty straightforward to say “mech interp lacks good paradigms” (actually 1 syllable shorter than “mech interp is pre-paradigmatic”!) 

See also my previous writing on this topic: https://www.lesswrong.com/posts/3CZF3x8FX9rv65Brp/mech-interp-lacks-good-paradigms

Daniel Tan's Shortform
Daniel Tan · 3mo

ICYMI: Anthropic has partnered with Apple to integrate Claude into Apple's Xcode development platform.

Gradual Disempowerment: Concrete Research Projects
Daniel Tan · 3mo

Thanks, that makes sense! I strongly agree with your picks of conceptual works; I've found Simulators and the Three Layer Model particularly useful in shaping my own thinking.

Re: roleplay, I'm not convinced that 'agent' vs 'model' is an important distinction. If we adopt a strict behaviourist stance and only consider the LLM as a black box, it doesn't seem to matter much whether the LLM is really a misaligned agent or is just role-playing a misaligned agent. 

Re: empirical research directions, I'm currently excited by understanding 'model personas': What personas do models adopt? Does it even make sense to think of them as having personas? What predictions does this framing let us make about model behaviour / generalization? Are you excited by anything within this space?

Gradual Disempowerment: Concrete Research Projects
Daniel Tan · 3mo

Dovetailing from the above, I think we are still pretty confused about how agency works in AI systems. There’s been a lot of great conceptual work in this area, but comparatively little bridging into rigorous empirical/mechanistic studies.

 

Could you expand on this? I would appreciate more details on what conceptual work you find compelling, what research north stars seem important but neglected, and any specific empirical / mechanistic studies you would like to see.

Symbol/Referent Confusions in Language Model Alignment Experiments
Daniel Tan · 4mo

If I understand this post correctly, the object-level takeaway is that we need to evaluate an agentic system's propensities in as 'natural' a way as they can be expressed, e.g.:

  • Describing events to the system as if it had 'naturally' observed them
  • Evaluating the system's revealed preferences by looking at the actions it chooses to take

That's what I got out of the following paragraphs: 

Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.

If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.

There's also a warning about not generalizing findings to settings which seem adjacent (but may not be): 

Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-style system.

Is that right? 

Thomas Kwa's Shortform
Daniel Tan · 4mo

Interesting paper. Quick thoughts:

  • I agree the benchmark seems saturated. It's interesting that the authors frame it the other way: Section 4.1 focuses on how models are not maximally goal-directed.
  • It's unclear to me how they calculate goal-directedness for 'information gathering', since that category appears to consist of only one subtask.
Show, not tell: GPT-4o is more opinionated in images than in text
Daniel Tan · 5mo

That makes sense to me! If we assume this, then it's interesting that the model doesn't report this in text, which implies that its text outputs don't reflect its true beliefs.

Posts

  • Could we have predicted emergent misalignment a priori using unsupervised behaviour elicitation? (6 karma · 12d · 0 comments)
  • Open Challenges in Representation Engineering (Ω · 14 karma · 5mo · 0 comments)
  • Show, not tell: GPT-4o is more opinionated in images than in text (112 karma · 5mo · 41 comments)
  • Open problems in emergent misalignment (83 karma · 6mo · 17 comments)
  • A Collection of Empirical Frames about Language Models (27 karma · 8mo · 0 comments)
  • Why I'm Moving from Mechanistic to Prosaic Interpretability (113 karma · 8mo · 34 comments)
  • A Sober Look at Steering Vectors for LLMs (Ω · 40 karma · 9mo · 0 comments)
  • Evolutionary prompt optimization for SAE feature visualization (Ω · 22 karma · 10mo · 0 comments)
  • An Interpretability Illusion from Population Statistics in Causal Analysis (9 karma · 1y · 3 comments)
  • Daniel Tan's Shortform (2 karma · 1y · 261 comments)