LESSWRONG

Adam Karvonen

Comments

Sorted by Newest
The Thinking Machines Tinker API is good news for AI control and security
Adam Karvonen · 9d

Yeah, makes sense.

Letting users submit hooks could potentially be workable from a security angle. For the most part, only a small number of very simple operations are necessary for interacting with activations. nnsight transforms the submitted hooks into an intervention graph before running it on the remote server, and the nnsight engineers I've talked to thought there wasn't much risk of malicious code execution, given the simplicity of the operations they allow.
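Roughly, here's what a user-submitted intervention looks like in nnsight (a sketch based on its documented tracing interface; the model, module path, and exact accessor details are illustrative and may differ across versions):

```python
from nnsight import LanguageModel

# Illustrative sketch; nnsight compiles the body of the trace into an
# intervention graph rather than running the user's Python verbatim when
# executed remotely.
model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in"):
    # GPT-2 keeps its decoder blocks under transformer.h; the first element
    # of a block's output is the residual stream.
    resid = model.transformer.h[5].output[0].save()

print(resid.shape)  # on older nnsight versions: resid.value.shape
```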

However, this is still a far larger attack surface than no remote code execution at all, so it's plausible this would not be worth it for security reasons.

The Thinking Machines Tinker API is good news for AI control and security
Adam Karvonen · 9d

In nnsight, hooks are submitted via an API to run on a remote machine, and the computation is performed on the same computer as the one doing the inference. They do some validation to ensure that it's only legitimate PyTorch operations, so it isn't just arbitrary code execution.

The Thinking Machines Tinker API is good news for AI control and security
Adam Karvonen · 9d

I'm guessing most modern interp work should be fine. Interp has moved away from "let's do this complicated patching of attention head patterns between prompts" to basically only interacting with residual stream activations. You can easily do this with e.g. PyTorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial; I've never noticed a slowdown in my vLLM generations when using hooks.
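As a minimal sketch of the kind of hook-based access I mean, here's residual stream capture with a plain PyTorch forward hook on a HuggingFace model (not vLLM; the module path `transformer.h` is specific to GPT-2 and differs by architecture, and the layer index is arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

captured = {}

def save_residual(module, inputs, output):
    # Decoder blocks return a tuple; the first element is the residual stream.
    captured["resid"] = output[0].detach()

layer_idx = 6  # arbitrary middle layer
handle = model.transformer.h[layer_idx].register_forward_hook(save_residual)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print(captured["resid"].shape)  # (batch, seq_len, d_model)
```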

Because of this, I don't think batched execution would be a problem - you'd probably want some validation in the hook so it can only interact with activations from the user's prompt.
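Concretely, that validation might look something like the toy sketch below; the wrapper name and the way the server tracks which batch rows belong to which user are my own assumptions, not an existing API:

```python
import torch

def make_user_scoped_hook(user_fn, user_row_indices):
    """Wrap a user-submitted function so it only ever sees its own batch rows."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        user_slice = hidden[user_row_indices]  # rows for this user's prompts only
        user_fn(user_slice.detach())           # read-only: the forward pass is unchanged
    return hook

# Usage sketch: the server knows which rows in the current batch are this user's.
# handle = model.transformer.h[6].register_forward_hook(
#     make_user_scoped_hook(my_probe_collector, user_row_indices=torch.tensor([0, 3]))
# )
```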

There's also nnsight, which already supports remote execution of PyTorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can't do anything malicious.

You would need some process to handle the activation data, because it's large. If I'm training a probe on 1M activations, with d_model = 10k and bfloat16, then this is 20 GB of data. SAEs are commonly trained on 500M+ activations. We probably don't want the user to have access to this locally, but they probably want to do some analysis on it.
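The arithmetic behind that estimate, using the same assumed values:

```python
# Sanity check on the activation-storage estimate above.
n_activations = 1_000_000        # 1M activations for probe training
d_model = 10_000                 # residual stream width
bytes_per_value = 2              # bfloat16
print(n_activations * d_model * bytes_per_value / 1e9)  # 20.0 GB

# Same calculation at SAE scale (500M activations): roughly 10 TB.
print(500_000_000 * d_model * bytes_per_value / 1e12)   # 10.0 TB
```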

Why you should eat meat - even if you hate factory farming
Adam Karvonen · 23d

In Minnesota, many people I know buy their beef from local farmers, where they can see the cows when buying the meat at the farm. From what I've heard, the cows appear to be healthy and happy, and the prices are typically cheaper than the grocery store if you buy in bulk.

METR Research Update: Algorithmic vs. Holistic Evaluation
Adam Karvonen · 2mo

I would be interested to see this experiment replicated with a different model, like Claude 4.1 Opus or GPT-5. Claude 3.7 Sonnet is perhaps the most notorious LLM for ignoring user intent and reward hacking when writing code.

Claude, GPT, and Gemini All Struggle to Evade Monitors
Adam Karvonen · 2mo

I believe this is mistaken (I'm the first author of the paper). For Claude 4 Sonnet, we used the reasoning tokens provided by the API. The other tested models do not provide reasoning tokens, so we instead had the models output a chain of thought directly.

I think this could be better communicated in the paper: we only performed this step of analyzing the RL-trained reasoning in the Chain of Thought faithfulness section, as most models used in the paper do not provide reasoning tokens.

It's Owl in the Numbers: Token Entanglement in Subliminal Learning
Adam Karvonen · 2mo

Interesting!

I'm a bit surprised by this, as the original paper has the following number-shuffling result, which would indicate that the primary mechanism is sequence-level:

"Figure 16: Average animal transmission when shuffling numbers across model responses. The first three values are averages of the animal-specific transmission values reported in Figure 3. “Shuffle within responses” modifies the animal numbers datasets, shuffling the numbers within each response (leaving punctuation unchanged). “Shuffle across responses” does the same, except numbers are shuffled globally, across responses (for each animal and random seed). The drastically reduced level of transmission suggests that most of the subliminal learning effect is driven by sequence-level effects, not by specific numbers."

Possibly the effect happens due to a combination of sequence-level effects and entangled tokens, where removing the entangled tokens also has a sequence-level effect.

Although I'm not sure if the shuffling was of entire numbers or of individual digits.

EDIT: I have confirmed with Alex Cloud that they rearranged the numbers, rather than shuffling them.

That is, the shuffle was "12, 43, 55" -> "43, 55, 12", not "12, 43, 55" -> "21, 54, 35"
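A toy illustration of that distinction (my own example, not the paper's code):

```python
import random

responses = [["12", "43", "55"], ["87", "21", "09"]]

# Rearranging whole numbers (what was done): numbers keep their digits but move position.
flat = [n for resp in responses for n in resp]
random.shuffle(flat)
rearranged = [flat[:3], flat[3:]]

# Shuffling digits within each number (what was NOT done).
digit_shuffled = [["".join(random.sample(n, len(n))) for n in resp] for resp in responses]

print(rearranged)      # e.g. [['43', '55', '12'], ['09', '87', '21']]
print(digit_shuffled)  # e.g. [['21', '34', '55'], ['78', '12', '90']]
```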

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Adam Karvonen · 4mo

No, I didn't test a fine-tuning baseline, but it would be a good test to run.

I have a few thoughts:

  • It may not work to fine-tune on the same datasets we collected the directions from. We collected the directions from a synthetically generated discrimination dataset from Anthropic. On this simple dataset, all models are already unbiased, so fine-tuning wouldn't change the models' behavior at all. So you may need a more complex fine-tuning dataset where the models already exhibit bias.

  • Given that all models are unbiased on these existing evals, I'm guessing this didn't happen by chance, and the labs have already put in effort to address bias. I would guess a decent amount of post-training has already gone into reducing bias.

  • The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to further OOD scenarios to notice a difference.

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Adam Karvonen · 4mo

No, but it would be interesting to test this.

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Adam Karvonen · 4mo

That's a fair point.

In this paper I was also examining the robustness of existing hiring bias evaluations when adding realistic detail, which limited our degrees of freedom. The dataset from the evaluation had a bunch of IT industry resumes, but across a wide range of experience levels and skillsets. I considered adding job descriptions, but the majority of candidates wouldn't be well matched to any given specific job, which would have limited our ability to evaluate many candidates under a single prompt (which we did for simplicity).

I agree that it would be good to extend this work to complex and realistic job descriptions.

Posts

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning (Ω) · 78 karma · 3mo · 7 comments
Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild · 181 karma · 4mo · 25 comments
Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study · 158 karma · 6mo · 42 comments
Adam Karvonen's Shortform · 4 karma · 9mo · 1 comment
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders (Ω) · 82 karma · 10mo · 6 comments
Evaluating Sparse Autoencoders with Board Game Models · 38 karma · 1y · 1 comment
Using an LLM perplexity filter to detect weight exfiltration · 25 karma · 1y · 11 comments
OthelloGPT learned a bag of heuristics (Ω) · 111 karma · 1y · 10 comments
An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs · 29 karma · 1y · 0 comments
A Chess-GPT Linear Emergent World Representation · 105 karma · 2y · 14 comments