LESSWRONG
LW

Graeme Ford
12Ω4130
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Graeme Ford2mo10

Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.

Reply
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Graeme Ford2mo41

This was a great read - it seems very wise to track the confidence of models vs actual capabilities on the subskills relevant to these threats as you have started to do here! Another candidate capability to track in this way might be Schelling coordination aka acausal coordination - see this article under limitations-awareness as a strategic input

Reply
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Graeme Ford2mo51

For this reason, regulatory thresholds should ideally use benchmarks that cover a wide range of difficulties, including extremely difficult questions.

 

Yes - and a further point on the importance of a range of difficulties: For a model to sandbag succesfully across a carefully designed eval suite, it must simulate a plausible capability profile, and coordinate this simulation across independent instances. Very difficult or dangerous evals are easy to coordinate on (the model should sandbag) but when presented with moderate difficulty evals across various capabilities which may only be concerning in aggregate - it is less obvious which of these to coordinate on and sandbag. Note the importance of a model's self-awareness of its capability to simulate a plausible capability profile to its decision to even attempt to sandbag!

Reply
6Measuring Schelling Coordination - Reflections on Subversion Strategy Eval
Ω
4mo
Ω
0