Graeme Ford — LessWrong

Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.

Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

Graeme Ford5mo41

This was a great read - it seems very wise to track the confidence of models vs actual capabilities on the subskills relevant to these threats as you have started to do here! Another candidate capability to track in this way might be Schelling coordination aka acausal coordination - see this article under limitations-awareness as a strategic input

Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

Graeme Ford5mo51

For this reason, regulatory thresholds should ideally use benchmarks that cover a wide range of difficulties, including extremely difficult questions.

Yes - and a further point on the importance of a range of difficulties: For a model to sandbag succesfully across a carefully designed eval suite, it must simulate a plausible capability profile, and coordinate this simulation across independent instances. Very difficult or dangerous evals are easy to coordinate on (the model should sandbag) but when presented with moderate difficulty evals across various capabilities which may only be concerning in aggregate - it is less obvious which of these to coordinate on and sandbag. Note the importance of a model's self-awareness of its capability to simulate a plausible capability profile to its decision to even attempt to sandbag!

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments