But we are uncertain whether humans will be perfect at auditing
It makes me sad that many people seem to, in fact, be uncertain about this and think it's actually a plausible possibility.
It is just a useful assumption for control research. Not having to run evals on humans' ability to audit transcripts speeds things up. For what it's worth, I am truly a bit uncertain here and plan to run human-auditing experiments at some point.
Fair enough. Still, the industrial/organizational psychologists will be happy to tell you that, as far as they can determine, even the most competent humans are never perfect at anything, including tasks vastly simpler than this, regardless of the stakes involved.
A large assumption we have made in AI control is that humans will be perfect at auditing, that is, at being shown a transcript and determining whether the AI was scheming in it.
But we are uncertain whether humans will be perfect at auditing; they are prone to fatigue and distraction. That is why I’m releasing "Sentinel" today, an extremely high-stimulation way to audit boring transcripts.
Sentinel is a revolutionary way to get more juice out of your human auditors by gamifying the auditing process with a level system, perks, power-ups, and more fun features. Try it now here.
In the AI control literature, we love finding the safety/usefulness trade-offs of everything we create, but surprisingly, we found no trade-offs with this product.
The rest of the post goes over some of the ways we achieved this.
Gamification
As you audit transcripts in the game, you earn tokens that you can spend on power-ups that make you even more productive. There are also achievement and level systems, so you can track your progress and get more dopamine hits!
Twitch Streaming Mode
AIs might be able to uplift human auditors in the future, which is why Sentinel ships with a Twitch streamer mode where you can go live on Twitch and have trusted monitors view your transcript and give you live feedback!
Subway Surfers
To maximize “eyes on the screen time”, we also added a Subway Surfers feature that lets you watch endless videos of pros playing the game, so you are always stimulated while looking over evil transcripts.
Funny aside: While making this game, I asked Claude to implement this Subway Surfers feature, and instead of finding a YouTube video of someone playing Subway Surfers, it rick-rolled me...
Looking Forward
We hope you like this tool. Go play it now to get ready for a future where the only way AI safety researchers can have an impact is by auditing transcripts!
(If it wasn't clear, this is a joke and not a real product. I vibecoded it over a couple of hours.)