169

LESSWRONG
LW

168
Interpretability (ML & AI)AI
Frontpage

15

Closed-Source Evaluations

by Jono
8th Jun 2024
2 min read
4

15

15

Closed-Source Evaluations
12Bird Concept
1Jono
3Bird Concept
2mesaoptimizer
New Comment
Email me replies to all comments
4 comments, sorted by
top scoring
Click to highlight new comments since: Today at 8:56 AM
[-]Bird Concept1y123

Without commenting on the proposal itself; I think the term "eval test set" is clearer for this purpose than "closed source eval".

Reply
[-]Jono1y10

agreed

Reply
[-]Bird Concept1y33

I'm writing a quick and dirty post because the alternative is that I wait for months and maybe not write it after all.

This is the way. 

Reply
[-]mesaoptimizer1y22

Note that the current power differential between evals labs and frontier labs is such that I don't expect evals labs have the slack to simply state that a frontier model failed their evals.

You'd need regulation with serious teeth and competent 'bloodhound' regulators watching the space like a hawk, for such a possibility to occur.

Reply
Moderation Log
More from Jono
View more
Curated and popular this week
4Comments
Interpretability (ML & AI)AI
Frontpage

Public tripwires are no tripwires.

I'm writing a quick and dirty post because the alternative is that I wait for months and maybe not write it after all. I am broadly familiar with the state of interpretability research but do not know what the state of model evaluations is at the moment.

The interpretability win screen comes after the game over.

There is an interpretability-enabled win condition at the point where billion-parameter networks become transparent enough that we robustly can detect deception.
We are paperclipped long before that since their adjacant insights predictably lead to quicker iteration cycles, new architecture discoveries, insight into strategic planning and other capabilities AGI labs are looking for.

Tripwires stop working when you train models to avoid them.

Current interpretability research solely (as far as I'm aware) produces tools that melt under slight optimization pressure. Optimising on an interpretability tool, optimizes against being interpretable.
It would be easy to fool others and oneself into thinking some model is safe because the non-optimisation-resistant interpretability tool showed your model was safe, after the model was optimised on it. If not that, then you could still be fooled into thinking you didn't optimise on the interpretability tool or its ramifications.

You cannot learn by trial and error if you are not allowed to fail.

We could fire rockets at the moon because we could fail at many intermediate points, with AGI, we're no longer allowed to fail after a certain threshold. Continuing to rely on the useful insights from failur,e, is thus a doomed approach[1] and interpretability increases the use AGI labs get out of failure. 
To start preparing for a world in which we're not allowed to fail, we should build one where failing hurts instead of helps AGI-creators.

Interpretability should go to closed-source evaluators.

We should close-source interpretability and have trustworthy evaluators buy new tools up and solely use them to evaluate frontier models in a tripwire approach. These evaluators should then not tell AGI labs what failed, just that it failed[2]. Ideally evaluators get to block model deployments of course, but them having a good track record of warning against upcoming model failures[3] is a good start.

AGI labs become incentivised to anticipate failure and lose ability to argue anyone into thinking their models will be safe. They have to pass their tests like everyone else.
Evaluators get high quality feedback on how well their tools predict model behavior, since those tools are now being wielded as intended, and they learn which failure modes are still uncovered.

  1. ^

    and a bad culture

  2. ^

    How to monitor the evaluators is out of scope, I'll just say that I will bet that it's easier to monitor them, than it is to make optimisation-resistant interpretability tooling.

  3. ^

    There is a balance between the information-security of the used interpretability techniques and the prestige-gain from detailed warnings. But intelligence agencies demonstrate that the balance can be struck.