LESSWRONG
LW

3408
Ben Millwood
1253254
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
78White Box Control at UK AISI - Update on Sandbagging Investigations
Ω
3mo
Ω
10
3Should we exclude alignment research from LLM training datasets?
Q
1y
Q
5
4Keeping content out of LLM training datasets
1y
0
1Ben Millwood's Shortform
1y
10
1Ben Millwood's Shortform
1y
10
Do models know when they are being evaluated?
Ben Millwood4mo101

Looks like this turned into Large Language Models Often Know When They Are Being Evaluated

Reply
Do models know when they are being evaluated?
Ben Millwood4mo12

LMSYS is non-agentic though, right? Would be cool to have a dataset of production agent use transcripts.

Reply
It's hard to make scheming evals look realistic for LLMs
Ben Millwood4mo40

Some recent work on white-box probes for evaluation awareness: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models (tweet thread).

Reply1
It's hard to make scheming evals look realistic for LLMs
Ben Millwood4mo40

One speculation I have: an alternative way to study the plausibility of your environment if you have logit access is to tell the model it's in a simulation (or fake a CoT of the model realising it's in a simulation) and see if the perplexity of this suffix measures how "surprised" the model is. Could be too noisy to be practical (+ you have to think if there are any other reasons why it would be surprising to be told this at this point, in this particular phrasing), but it hopefully works around the issues with the question itself being evidence.

(I think studying model beliefs through perplexity has been tried before, but I'm fairly new to the literature and don't know if it works / if applying it to this problem particularly is novel.)

Reply
It's hard to make scheming evals look realistic for LLMs
Ben Millwood4mo74

This could also be explained by many real-world prompts being easy to judge as real e.g. because they would have no value as evaluations.

Given just the results, it seems to me difficult to rule out the hypothesis that LLMs currently judge any high-stakes situation, or one involving a significant moral dilemma, as likely to be an evaluation. After all, in today's environments, this would be a pretty accurate heuristic!

(That said, it already seemed quite likely to me that models can tell when they are in evaluations most of the time, but I'm not sure how much this evidence moves me on that hypothesis vs. the one I suggest above.)

Reply
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Ben Millwood5mo10

I'm not entirely clear what the implications would be one way or the other, but do we know if models mistakenly take production deployments to be evaluation scenarios? We might be assuaged that our evaluations are still predictive if the model just always thinks it's probably still in training (which, after all, it has been for its entire "life" so far). Perhaps it would even be sufficient for the model to suspect it is being tested every time it finds itself in a high-stakes scenario.

Reply
Ben Millwood's Shortform
Ben Millwood9mo53

Given ambiguity about whether GitHub trains models on private repos, I wonder if there's demand for someone to host a public GitLab (or similar) instance that forbids training models on their repos, and takes appropriate countermeasures against training data web scrapers accessing their public content.

Reply
Should we exclude alignment research from LLM training datasets?
Ben Millwood10mo30

how about a robots.txt?

Yeah, that's a strong option, which is why I went around checking + linking all the robots.txt files for the websites I listed above :)

In my other post I discuss the tradeoffs of the different approaches one in particular is that it would be somewhat clumsy to implement post-by-post filters via robots.txt, whereas user-agent filtering can do it just fine.

Reply
Should we exclude alignment research from LLM training datasets?
Ben Millwood10mo10

I think there's two levels of potential protection here. One is a security-like "LLMs must not see this" condition, for which yes, you need to do something that would keep out a human too (though in practice maybe "post only visible to logged-in users" is good enough).

However I also think there's a lower level of protection that's more like "if you give me the choice, on balance I'd prefer for LLMs not to be trained on this", where some failures are OK and imperfect filtering is better than no filtering. The advantage of targeting this level is simply that it's much easier and less obtrusive, so you can do it at a greater scale with a lower cost. I think this is still worth something.

Reply
Frontier Models are Capable of In-context Scheming
Ben Millwood10mo10

in case anyone else wanted to look this up, it's at https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html#training-away-sycophancy-subterfuge

FWIW I also tried to start a discussion on some aspects of this at https://www.lesswrong.com/posts/yWdmf2bRXJtqkfSro/should-we-exclude-alignment-research-from-llm-training but it didn't get a lot of eyeballs at the time.

Reply
Load More
Situational Awareness
4 months ago
Situational Awareness
4 months ago
(+7/-7)
Situational Awareness
4 months ago
(+96/-53)
Situational Awareness
4 months ago
(+354)