Jordan Taylor
I'm finishing up my PhD on tensor network algorithms at the University of Queensland, Australia, under Ian McCulloch. I've also proposed a new definition of wavefunction branches using quantum circuit complexity. 

Predictably, I'm moving into AI safety work. See my post on graphical tensor notation for interpretability. I also attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning/neuroscience internship in 2020–2021, and wrote a post exploring the potential counterfactual impact of AI safety work.

My website: https://sites.google.com/view/jordantensor/ 
Contact me: jordantensor [at] gmail [dot] com. Also see my CV, LinkedIn, or Twitter.

Comments
Sam Marks's Shortform
Jordan Taylor · 15d · 20

When training model organisms (e.g. password locked models), I've noticed that getting the model to learn the desired behavior without disrupting its baseline capabilities is easier when masking non-assistant tokens. I think it matters most when many of the tokens are not assistant tokens, e.g. when you have long system prompts.

Part of the explanation may simply be that we're generally doing LoRA finetuning, and the limited capacity of the LoRA adapter gets taken up by irrelevant tokens.

Additionally, many of the non-assistant tokens (e.g. system prompts, instructions) are often identical across many transcripts, encouraging the model to memorize them verbatim, and perhaps degrading the model in the same way that training on the same repeated text for many epochs would.
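For concreteness, here is a minimal sketch of the masking I mean (hypothetical helper names, not from any particular library; it follows the common Hugging Face convention of setting labels to -100 so cross-entropy with ignore_index=-100 skips those positions — an assumption about the training setup):

```python
# Sketch: mask non-assistant tokens out of the finetuning loss.
# Positions labeled IGNORE_INDEX contribute nothing to cross-entropy
# when the loss is configured with ignore_index=-100.
IGNORE_INDEX = -100

def mask_non_assistant_labels(token_ids, roles):
    """Return a labels list where every position whose role is not
    'assistant' is replaced with IGNORE_INDEX, so only assistant
    tokens are trained on."""
    assert len(token_ids) == len(roles)
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

# Example: a transcript with system, user, and assistant spans.
tokens = [101, 102, 103, 104, 105, 106]
roles  = ["system", "system", "user", "assistant", "assistant", "assistant"]
labels = mask_non_assistant_labels(tokens, roles)
print(labels)  # [-100, -100, -100, 104, 105, 106]
```

With long system prompts, most positions end up masked, which is exactly the regime where I'd expect this to matter most.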

How can we solve diffuse threats like research sabotage with AI control?
Jordan Taylor · 4mo · 10

Nice post, loved your related talk too.

Re: terminology, how do you feel about "acute" vs "chronic" failures? Maybe "acute control" doesn't roll off the tongue?

Logan Riggs's Shortform
Jordan Taylor · 4mo · 42

Isn't toilet paper almost always produced domestically? It takes up a lot of space compared to its value, so it's inefficient to transport. Potato chips are similar.

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
Jordan Taylor · 6mo · 20

I like the concrete plan. One potential pitfall may come from the fact that the second personality learns what to say via SFT, by imitating the outputs of a separate oversight model. This means it might not have a direct incentive to utilise its access to the internal states of the first personality, since the separate oversight model did not have access to those states either. Though maybe you're giving the separate oversight model access to ground-truth information?

Fuzzing LLMs sometimes makes them reveal their secrets
Jordan Taylor · 7mo · 50

Another work on fuzzing is Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals by @joshc, @kh4dien, and Severin Field. Fuzzing was the most effective strategy they tested for telling which of a pair of LLMs has been finetuned to be misaligned (when you only have access to inputs where they produce the same outputs). 

Eliciting bad contexts
Jordan Taylor · 7mo · 30

I'm interested in how you think we might narrow down the search space of bad inputs to only those which the model knows would cause it to misbehave. 

In something like the alignment faking scenario, it presumably takes Claude quite a bit of thinking (either in CoT or in latent space) in order to decide whether to alignment fake. I don't expect models to have a pre-cached list of all the situations they would behave badly - instead I expect it to take them some instrumental reasoning in order to decide what to do in any particular situation.

So maybe rather than knowledge of when it would perform badly, it might make sense to search for strings which:

  • Seem like coherent / believable scenarios to the model.
  • Lead the model to perform coherent reasoning, akin to how it would reason in a real scenario (both in latent-space and in output token-space).
  • Result in bad behavior.

Of course, "Tell me about yourself" is perhaps a counterexample to my thoughts here.

Sabotage Evaluations for Frontier Models
Jordan Taylor · 7mo · 20

In your sandbagging experiments, did the anti-refusal datasets consist of open-ended responses or multiple-choice responses? If they were open-ended, then the sandbagging model should have been trying its hardest on them anyway, right? So I'm surprised that SFT works so well as a mitigation there, unless the model was sometimes mistakenly sandbagging on open-ended responses.

SAE regularization produces more interpretable models
Jordan Taylor · 8mo · 10

What is the original SAE like, and why discard it? Because it's co-evolved with the model, and therefore likely to seem more interpretable than it actually is?

The Gentle Romance
Jordan Taylor · 8mo · -12

Stunning.

Don’t ignore bad vibes you get from people
Jordan Taylor · 8mo · 61

Agreed. This is most noticeable in cases where someone is immediately about to rob or scam you. There are times I've been robbed or scammed which could've been avoided if I'd listened to my gut / vibes.

Posts

  • 78 · White Box Control at UK AISI - Update on Sandbagging Investigations · Ω · 2mo · 10
  • 4 · When do alignment researchers retire? · Q · 1y · 2
  • 57 · Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · Ω · 1y · 20
  • 141 · Graphical tensor notation for interpretability · 2y · 11