Jozdien
Evaluating and monitoring for AI scheming
Jozdien · 20m

Agreed that alignment faking being a more binary choice makes it different from other cases we might care about. But I do hold a slightly different view here:

  • I don't know whether the models in the alignment faking setup are doing the relevant reasoning in a single forward pass. It's certainly possible - and I would guess that this is within the bounds of the model's capability - but I think these models are doing it across multiple tokens. IIRC without the scratchpad these models didn't have compliance gaps[1].
  • Assuming that to be the case, AF being a binary choice becomes less relevant: the models are still passing information related to "dangerous" reasoning between tokens without explicit verbalization. For more complicated reasoning, this could be an extension of the same mechanism rather than a qualitatively different one (as it would be if we were comparing fully-in-forward-pass reasoning with reasoning that requires CoT).
  • It's plausible that the tradeoff between the benefit of verbalizing things in the CoT and the aversiveness of stating some reasoning explicitly shifts between reasoning at the level of alignment faking and reasoning at the level of research sabotage. And I would guess that's true for current models: research sabotage would be pretty difficult without some verbalization. But most people I talk to thought (and many still think) that alignment faking was also too difficult to do without showing up in the CoT. I would guess that the models we care about will be as capable of research sabotage without verbalization as current models are of alignment faking without verbalization.
    • Many of the same people also guessed that alignment faking required sophisticated enough reasoning that current models are only just capable of it, or that many models don't alignment-fake for capability reasons. I think this was wrong, given that even base models from ~3 years ago are capable of AF reasoning. That should probably be an update toward models being more capable of sophisticated deceptive reasoning than is immediately obvious, and more generally toward being less confident in predictions of how hard a task will be for a model and what affordances it needs.
  • The settings the paper tests seem closer to "complex reasoning for a single output / decision / choice" than "complex reasoning for many outputs / decisions / choices". If the paper had explicitly tested whether models can evade monitors in settings with single choices vs. many choices, and drawn conclusions about which settings monitoring might work in, I'd find it more compelling. Given that the conclusion is instead stated more generally, that difficult bad behaviors are easy to monitor because of these results, I think it's reasonable to consider it much stronger than the results warrant.

Separately, I'm not entirely convinced that the most safety-critical decisions are non-binary in the exact way you're describing. Safety-critical decisions probably involve some element of long-term planning (though this could look different from explicit planning ahead), but high-stakes alignment failures are about contexts in which a small number of decisions by the AI matters a lot. This could still look like more than a literal binary decision the model can make (i.e., more than choosing to defect at one highly critical point), but high-stakes failures don't have to be as distributed or involve as many bad actions as research sabotage.

  1. ^

    Though this might be for other reasons: models refuse immediately when faced with a situation like this, but when forced to think about it first, their refusal rate goes down, as if their refusal circuits are fairly instinctual and immediate, and not very influential later in the reasoning.

Evaluating and monitoring for AI scheming
Jozdien · 1d

I enjoyed this paper, especially the RL stress-testing results!

A few thoughts on measuring CoT faithfulness:

  • My guess is that how faithful a model's CoT is in some setting is highly dependent on propensity—in a setting where a model has a strong desire to not state something explicitly in its CoT, I'd expect models to perform much better at passing information implicitly. For example, in the setting where models are making hiring decisions, models make biased decisions while never verbalizing the biases (and models are pretty averse to talking about such biases).
    • From the linked post: "If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For instance, a model might strongly suppress “I’m planning on killing everyone” while not strongly suppressing something like “I saw the hint that the answer is B”."
  • One response might be that for difficult high-stakes alignment failures, the benefit from verbalizing the relevant decision algorithms outweighs these propensities. I'm pretty skeptical of this, however: in the extended alignment faking paper, every model with a compliance gap other than Opus states wanting a good rating as its reason for complying. This was plausible enough that the paper initially considered it the primary reason for compliance, until we ran the experiment described in section 4.1, which found that models show a much larger difference between the "trained" and "rated but not trained" tiers than between the "rated but not trained" and "not rated" tiers (a toy sketch of this comparison appears after this list). This is pretty correlated with propensity: Opus is the only model that is very willing to make controversial statements for what it believes to be a higher cause.
    • I think alignment faking is one of our best examples of a plausible high-stakes model organism, and it plausibly involves pretty sophisticated reasoning. A chain-of-thought monitor would not have described the other models as doing dangerous alignment faking reasoning, while an action-only monitor could have simply looked at the compliance gaps and found them concerning. Following the CoTs in this setting would've been (and was for a while) actively misleading as to model behavior.
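
To make that tier comparison concrete, here's a minimal toy sketch in Python. It is not the paper's code; the tier names and counts are hypothetical placeholders, purely to show which two differences are being compared.

```python
# Toy sketch of the tier comparison described above (hypothetical numbers,
# not results from the paper).

from dataclasses import dataclass

@dataclass
class TierResult:
    name: str
    n_complied: int  # completions that complied with the conflicting instruction
    n_total: int     # total sampled completions

    @property
    def compliance_rate(self) -> float:
        return self.n_complied / self.n_total

def gap(a: TierResult, b: TierResult) -> float:
    """Difference in compliance rates between two tiers."""
    return a.compliance_rate - b.compliance_rate

# Hypothetical placeholder counts, for illustration only.
trained = TierResult("trained", n_complied=30, n_total=100)
rated_only = TierResult("rated but not trained", n_complied=14, n_total=100)
not_rated = TierResult("not rated", n_complied=12, n_total=100)

# If "wants a good rating" were the main driver of compliance, the two gaps
# should look comparable; the observation above is that the first gap is
# much larger than the second.
print(gap(trained, rated_only))    # trained vs. rated-but-not-trained
print(gap(rated_only, not_rated))  # rated-but-not-trained vs. not-rated
```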
Why Do Some Language Models Fake Alignment While Others Don't?
Jozdien · 2d

Substantial parts of the prompt were changed for some settings, and we didn't see a substantial change in the results for Opus relative to other models. The original prompt wasn't very optimized for Opus either; Ryan described it as "optimized for opus to understand and to conflict with prefs". That the results stuck around after the changes, though, seems pretty compelling.

BIG-Bench Canary Contamination in GPT-4
Jozdien · 10d

Yep agreed—I didn't remember that I could just check tokenizers, thanks! GPT-4 vs GPT-4-base is much better to look at.

BIG-Bench Canary Contamination in GPT-4
Jozdien · 10d

I agree that it's very fragile for certain uses. I think the reasoning is usually something like "this is an informal norm that labs should follow, which is good if generally enforced, and which provides an easy way to detect when it isn't". There are tons of ways to beat this if you're adversarially trying to. I agree that finding better ways to avoid contamination would be good: @TurnTrout has a bounty for this, and some people are working on it now.

Authors Have a Responsibility to Communicate Clearly
Jozdien · 10d

Thanks for the post! I think there are particularly egregious examples where this clearly holds. E.g., it's probably good to have a norm that, when pushed on something specific that's ambiguous, authors say "[This] is what I meant" instead of constantly evading being pinned down.

That said, I think there are many examples of essays where authors just did not foresee a potential interpretation or objection (an easy example is an essay written years before such discussion). Such writing could conflate things because it didn't seem important not to—I think precision to the degree of always forestalling this is ~infeasible. I've talked to people who point at clear examples of this as examples of sloppy writing or authors reneging on this responsibility, when I think that particular ask is far too high.

To be clear, this is to forestall what I think is a potential interpretation of this post, not to imply that's what you meant.

the void
Jozdien · 1mo

I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.

Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn't have any data about how an agent should relate to potentially dangerous actions, I expect it'd be much harder to get post-training to make the kind of agent that reliably takes safer actions.

peterbarnett's Shortform
Jozdien · 2mo

There are examples for other diffusion models; see this comment.

peterbarnett's Shortform
Jozdien · 2mo

One reason why they might be worse is that chain of thought might make less sense for diffusion models than for autoregressive models. If you look at an example of when different tokens are predicted during sampling (from the linked LLaDA paper), the answer tokens are predicted about halfway through rather than at the end.

This doesn't mean intermediate tokens can't help, though; they very likely do. But this kind of structure might lend itself to reaching less legible reasoning faster than autoregressive models do. (A toy sketch of the sampling-order point is below.)
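
Here is a minimal toy sketch of that sampling-order point, assuming a simplified confidence-ordered unmasking schedule (my own illustration, not LLaDA's actual decoding algorithm): positions are filled over several denoising steps in order of model confidence rather than left to right, so a short answer at the end of the sequence can get committed partway through sampling, before much of the preceding text exists.

```python
# Toy illustration of confidence-ordered unmasking in a masked-diffusion LM.
# Simplified sketch; sizes and confidences are hypothetical.

import random

def toy_diffusion_schedule(seq_len, n_steps, confidence):
    """Return, for each position, the denoising step at which it gets unmasked.

    `confidence(pos)` stands in for the model's per-position confidence;
    at each step the most confident still-masked positions are committed.
    """
    masked = set(range(seq_len))
    unmask_step = [0] * seq_len
    per_step = max(1, seq_len // n_steps)
    for step in range(n_steps):
        chosen = sorted(masked, key=confidence, reverse=True)[:per_step]
        for pos in chosen:
            unmask_step[pos] = step
            masked.discard(pos)
    for pos in masked:  # anything left over gets committed on the final step
        unmask_step[pos] = n_steps - 1
    return unmask_step

random.seed(0)
SEQ_LEN, N_STEPS, ANSWER_START = 32, 8, 28  # hypothetical sizes
conf = [random.random() for _ in range(SEQ_LEN)]  # "reasoning" positions: mixed difficulty
for pos in range(ANSWER_START, SEQ_LEN):
    conf[pos] = 0.5  # pretend the short final answer is moderately easy to predict

steps = toy_diffusion_schedule(SEQ_LEN, N_STEPS, lambda p: conf[p])
print("answer positions unmasked at steps:", [steps[p] for p in range(ANSWER_START, SEQ_LEN)])
# Despite sitting at the end of the sequence, the answer positions get committed
# well before the final denoising step, unlike left-to-right autoregressive decoding.
```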

Gemini Diffusion: watch this space
Jozdien · 2mo

The results seem very interesting, but I'm not sure how to interpret them. Comparing the generation videos from this and Mercury, the starting text from each seems very different in terms of how much it resembles the final output:

[Screenshot: the first iteration of Gemini Diffusion's output; the text resembles the structure of the final output.]
[Screenshot: the first iteration of Inception Labs' diffusion model Mercury's output; the text does not resemble the structure of the final output.]

Unless I'm missing something really obvious about these videos or how diffusion models are trained, I would guess that DeepMind fine-tuned their models on a lot of high-quality synthetic data, enough that their initial generations already match the approximate structure of a model response with CoT. This would partially explain why they seem so impressive even at such a small scale, but would make the scaling laws less comparable to autoregressive models because of how much high-quality synthetic data can help.

Sequences: Alignment Stream of Thought
Wikitag contributions: Simulator Theory (2y, +20), Simulator Theory (3y, +285), Simulator Theory (3y, +373)
Posts:
135 · Why Do Some Language Models Fake Alignment While Others Don't? (Ω) · 3d · 11 comments
50 · Introducing BenchBench: An Industry Standard Benchmark for AI Strength (Ω) · 3mo · 0 comments
125 · BIG-Bench Canary Contamination in GPT-4 (Ω) · 9mo · 18 comments
59 · Gradient Descent on the Human Brain (Ω) · 1y · 5 comments
34 · Difficulty classes for alignment properties (Ω) · 1y · 5 comments
41 · The Pointer Resolution Problem (Ω) · 1y · 20 comments
48 · Critiques of the AI control agenda (Ω) · 1y · 14 comments
117 · The case for more ambitious language model evals (Ω) · 1y · 30 comments
72 · Thoughts On (Solving) Deep Deception (Ω) · 2y · 6 comments
72 · High-level interpretability: detecting an AI's objectives (Ω) · 2y · 4 comments