LESSWRONG

HenningB
23110

Nurturing the best AI safety talent as a Research Manager at MATS!

Previously worked as an AI developer in speech recognition and generative AI for 3 years. Pursued part-time technical safety research (2021-24) and coaching for career impact and personal growth (since 2017).

Posts


Comments

It's hard to make scheming evals look realistic for LLMs
HenningB · 3mo · 10

Interesting work and findings. As others suggested in the comments, recent Claude models may be particularly concerned about something looking like an evaluation. Have you tested other models / model families as judges?

Additionally, models tend to recognise output from their own model family better than output from other families, so you may want to use different models for different parts of the pipeline.

10 · The Elicitation Game: Evaluating capability elicitation techniques · 6mo · 1
26 · MATS Spring 2024 Extension Retrospective · 7mo · 1