Previously "Lanrian" on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
Graph for a 2-parameter sigmoid, assuming that you top out at 1 and bottom out at 0.
If you instead do a 4-parameter sigmoid with free top and bottom, the version without SWAA asymptotes at 0.7 to the left instead, which looks terrible. (With SWAA the asymptote is a little above 1 to the left; and they both get asymptotes a little below 0 to the right.)
(Wow, graphing is so fun when I don't have to remember matplotlib commands. TBC I'm not really checking the language models' work here other than assessing consistency and reasonableness of output, so discount depending on how much you trust them to graph things correctly in METR's repo.)
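In case it's useful, here's roughly the kind of fit I mean, as a minimal sketch with made-up placeholder data (not METR's actual numbers or code): a 2-parameter logistic with the asymptotes pinned at 1 and 0, vs. a 4-parameter version with free top and bottom.

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder data (not METR's numbers): x is log2(human task length in
# minutes), y is the model's average success rate at that length.
x = np.array([-3, -1, 0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([0.98, 0.95, 0.90, 0.80, 0.60, 0.45, 0.30, 0.15, 0.10, 0.05])

def sigmoid2(t, slope, midpoint):
    # 2-parameter logistic: asymptotes pinned at 1 (left) and 0 (right)
    return 1 / (1 + np.exp(slope * (t - midpoint)))

def sigmoid4(t, slope, midpoint, top, bottom):
    # 4-parameter logistic: free top and bottom asymptotes
    return bottom + (top - bottom) / (1 + np.exp(slope * (t - midpoint)))

p2, _ = curve_fit(sigmoid2, x, y, p0=[1.0, 2.0])
p4, _ = curve_fit(sigmoid4, x, y, p0=[1.0, 2.0, 1.0, 0.0])
print("2-param (slope, midpoint):", p2)
print("4-param (slope, midpoint, top, bottom):", p4)
```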
Yeah, a line is definitely not the "right" relationship, given that the y-axis is bounded between 0 and 1 and a line isn't. A sigmoid or some other 0-1 function would make more sense, especially the further you go outside the sensitive middle region of success rates. I imagine the purpose of this graph was probably to sanity-check that the human baselines roughly track difficulty for the AIs as well. (Which looks pretty true to me when eyeballing the graph. The biggest eyesore is definitely the 0% success rate in the 2-4h bucket.)
Incidentally, your intuition might've been misled by one or both of:
As an illustration of the last point: here's a bonus plot where the green line minimizes the horizontal squared distance instead, i.e. predicting human minutes from average model score. I wouldn't quite say it's almost vertical, but it's much steeper.
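If you want to see that effect in isolation, here's a toy sketch with made-up data: swapping which variable you predict changes the slope by a factor of 1/r^2.

```python
import numpy as np

# Made-up data, just to show the effect: the usual fit minimizes vertical
# squared distance (predict score from log-minutes); swapping the variables
# minimizes horizontal squared distance (predict log-minutes from score).
rng = np.random.default_rng(0)
log_minutes = rng.uniform(0, 8, 200)
score = np.clip(1 - 0.1 * log_minutes + rng.normal(0, 0.2, 200), 0, 1)

slope_vertical = np.polyfit(log_minutes, score, 1)[0]
slope_horizontal = 1 / np.polyfit(score, log_minutes, 1)[0]  # re-expressed in the same plot

# The second slope is steeper in magnitude by a factor of 1/r^2, so the two
# lines only coincide when the correlation is perfect.
print(slope_vertical, slope_horizontal)
```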
Notice how the log-linear fit here only looks good for the SWAA data, in the 1 sec - 1 min range. There's something completely different going on for tasks longer than 1 minute, clearly not explained by the log-linear fit. If you tried to make a best fit line on the blue points (the length of tasks we care about after 2024), you'd get a very different, almost vertical line, with a very low R^2.
I don't think this is true. I got Claude to clone the repo and reproduce it without the SWAA data points. The slope is ~identical (-0.076 rather than the original -0.072) and the correlation is still pretty good (0.51).
Edit: That was with HCAST and RE-bench. Just HCAST gives slope = -0.077 and R^2 = 0.48. I think it makes more sense to include RE-bench.
Edit 2: Updated the slopes. Now the slope is per doubling, like in the paper (and so the first slope matches the one in the paper). I think the previous slopes were measuring per factor e instead.
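For reference, the conversion between the two conventions is just a change of log base: slope per doubling = slope per factor e × ln(2). Toy sketch (the -0.11 is just an illustrative number, not from the repo):

```python
import numpy as np

# Converting a slope measured "per factor e" (i.e. from a regression against
# ln(task length)) into a slope "per doubling": multiply by ln(2).
slope_per_e = -0.11                      # illustrative number
slope_per_doubling = slope_per_e * np.log(2)
print(round(slope_per_doubling, 3))      # -0.076
```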
The risk is that anyone with finetuning access to the AI could induce intuitive confidence in the AI that a proof was correct. This includes people who have finetuning access but who don't know the honesty password.
Accordingly, even if the model feels like it has proven that a purported honesty password would produce the honesty hash: maybe it can only conclude "either I'm being evaluated by someone with the real honesty password, or I'm being evaluated by someone with finetuning access to my weights, who's messing with me".
"People who have finetuning access" could include some random AI company employees who want to mess with the model (against the wishes of the AI company).
what if I want to train a new model and run inference on it?
The API can also have built-in functions for training.
What if I want to experiment with a new scaffold?
Scaffolds can normally be built around APIs? I thought scaffolds were just about what prompts you send to the model and what you do with the model outputs.
I do agree that this might be rough for some types of research. I imagine the arguments are pretty similar here as the arguments about how much research can be done without access to dangerous model weights.
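To illustrate what I mean by a scaffold being buildable around an API, here's a minimal sketch. `call_model` is a hypothetical stand-in for whatever inference API the company exposes (not any real endpoint); here it returns a canned string so the sketch runs.

```python
# Minimal sketch of an API-only scaffold: it's just logic around prompts and
# outputs, with no access to the model's weights.
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for the company's inference API.
    return "Looked at the task. DONE"

def run_scaffold(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        output = call_model(transcript + "\nWhat's the next step?")
        transcript += f"\nModel: {output}"
        if "DONE" in output:  # toy stopping condition
            break
    return transcript

print(run_scaffold("summarize this repo"))
```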
If you believe the "Twitter is not forgeable" hypothesis, then people can prove their identity to an AI by posting something on Twitter, and putting this post and all the reactions to the post into the AI model's context (h/t to Alexa Pan for pointing this out!). This does make it a lot less clear in which circumstances you'd actually need an honesty password as opposed to just posting about something on Twitter.
Thus, I think we should think more about: what, if any, are the situations in which we'd like to use honesty passwords?
It's a good question. Some candidate cases:
In principle, it seems like AIs should be able to fully verify the correctness of a pretty long proof where all the steps are easy and can be done in parallel, because they do get parallel compute proportional to the length of the proof. Maybe it empirically turns out they're pretty bad at this, though. And it certainly seems helpful if the AI doesn't have to rely on getting everything right. (Though maybe this trades off against the length of the proof or the easiness of each verification step? Not sure which tradeoffs are worth it, given that malicious humans may be able to manipulate the AI's supposedly "random" choices of which clauses to verify.)
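To put a number on why spot-checking is appealing when the randomness can be trusted, here's the toy calculation (made-up parameters):

```python
# Toy calculation: spot-check k uniformly random steps of a long proof. If a
# fraction p_bad of the steps are wrong and the sampling is genuinely uniform,
# the chance of missing every bad step is (1 - p_bad)^k.
def miss_probability(p_bad: float, k: int) -> float:
    return (1 - p_bad) ** k

print(miss_probability(0.01, 300))  # ~0.05: even 1% bad steps almost surely get caught
# If an adversary can steer which steps count as "random", this bound breaks,
# which is the manipulated-randomness worry above.
```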
Another potential issue: the AIs may be able to prove correctness, but not be confident that they can prove correctness. Put differently: if someone were trying to fake this to the AI, they might be able to train an AI that, when served an incorrect password together with an incorrect proof, becomes intuitively confident that it was shown the honesty password. In which case all AIs should potentially doubt their intuitive confidence levels about this.
If the honesty string is widely known outside the AI company, then it's likely that we'll accidentally process the string when processing data from the internet (e.g., someone might do a "prompt injection attack").
Why can't the AI company filter it? I guess the answer relies on this next point:
It's unlikely that we'll perfectly consistently remember to add honesty string filtering to LLM inference scaffolds: it's just 1 additional thing to do, and might get forgotten.
I'm not convinced.
I think there's a bunch of similar problems that AI companies will face:
... so there will be tons of reasons for companies to try to solve the "someone will forget" problem.
(Functional honesty passwords would certainly be great though!)
I'm confused by this. A hyperbolic function 1/(t_c−t) goes to infinity in finite time. It's a typical example of what I'm talking about when I talk about "superexponential growth" (because variations on it are a pretty good theoretical and empirical fit to growth dynamics with increasing returns). You can certainly use past data points of a hyperbolic function to extrapolate and make predictions about when it will go to infinity.
I don't see why time horizons couldn't be a superexponential function like that.
(In the economic growth case, it doesn't actually go all the way to infinity, because eventually there's too little science left to discover and/or too little resources left to expand into. Still a useful model up until that point.)
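Concretely, here's a minimal sketch of the kind of extrapolation I mean, with made-up parameters: fit y(t) = a/(t_c − t) to past data points and read off the implied blow-up date t_c.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy example with made-up parameters: generate data from y(t) = a / (t_c - t)
# and recover the blow-up date t_c from past data points alone.
true_a, true_tc = 2.0, 2030.0
t = np.arange(2015, 2026, dtype=float)
y = true_a / (true_tc - t)

def hyperbolic(t, a, t_c):
    return a / (t_c - t)

(a_fit, tc_fit), _ = curve_fit(
    hyperbolic, t, y, p0=[1.0, 2040.0],
    bounds=([0.0, 2026.0], [np.inf, np.inf]),  # keep t_c past the last data point
)
print(tc_fit)  # ≈ 2030: the extrapolated date at which the curve goes to infinity
```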