I previously did research for MIRI and for what's now the Center on Long-Term Risk; these days I make my living as an emotion coach and Substack writer.
Most of my content eventually becomes free, but if you'd like a paid subscription to my Substack, you'll get everything a week early and make it possible for me to write more.
I think this is plausibly a big problem against competent schemers.
Can you say more of what you think the problem is? Are you thinking of something like "the scheming module tries to figure out what kind of thing would trigger the honesty module and tries to think the kinds of thoughts that wouldn't trigger it"?
I haven't read the full ELK report, just Scott Alexander's discussion of it, so I may be missing something important. But at least based on that discussion, it looks to me like ELK might be operating on premises that aren't clearly true for LLMs.
Scott writes:
Suppose the simulated thief has hit upon the strategy of taping a photo of the diamond to the front of the camera lens.
At the end of the training session, the simulated thief escapes with the diamond. The human observer sees the camera image of the safe diamond and gives the strategy a “good” rating. The AI gradient descends in the direction of helping thieves tape photos to cameras.
It’s important not to think of this as the thief “defeating” or “fooling” the AI. The AI could be fully superintelligent, able to outfox the thief trivially or destroy him with a thought, and that wouldn’t change the situation at all. The problem is that the AI was never a thief-stopping machine. It was always a reward-getting machine, and it turns out the AI can get more reward by cooperating with the thief than by thwarting him.
So the interesting scientific point here isn’t “you can fool a camera by taping a photo to it”. The interesting point is “we thought we were training an AI to do one thing, but actually we had no idea what was going on, and we were training it to do something else”.
In fact, maybe the thief never tries this, and the AI comes up with this plan itself! In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
Much like in the GPT-3 example, if the training simulations include examples of thieves fooling human observers which are marked as “good”, the AI will definitely learn the goal “try to convince humans that the diamond is safe”. If the training simulations are perfect and everyone is very careful, it will just maybe learn this goal - a million cases of the diamond being safe and humans saying this is good fail to distinguish between “good means the diamond is safe” and “good means humans think the diamond is safe”. The machine will make its decision for inscrutable AI reasons, or just flip a coin. So, again, are you feeling lucky?
It seems to me that this assumes our training creates the AI's policy essentially from scratch: the AI is doing a lot of things, some of which are what we want and some of which aren't, and unless we are very careful to reward only the things we want and none of the ones we don't, it's going to end up doing things we don't want.
I don't know how future superintelligent AI systems will work, but if LLM training were like this, LLMs would work far worse than they do. People paid to rate AI answers report working with "incomplete instructions, minimal training and unrealistic time limits to complete tasks" and say things like "[a]fter having seen how bad the data is that goes into supposedly training the model, I knew there was absolutely no way it could ever be trained correctly like that". Yet for some reason LLMs still do quite well on lots of tasks. And even if all raters worked under perfect conditions, they'd still be fallible humans.
It seems to me that LLMs are probably reasonably robust to noisy reward signals because a large part of what the training does is "upvoting" and tuning existing capabilities and simulated personas rather than creating them entirely from scratch. A base model trained to predict the world creates different kinds of simulated personas whose behavior would explain the data it sees; these include personas like "a human genuinely trying to do its best at task X", "a deceitful human", or "an honest human".
Scott writes:
In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
This might happen. It might also happen that the AI contains a "genuinely protect the diamond" persona and a "manipulate the humans to believe that the diamond is safe" persona, with the various reward signals upvoting each to a different degree. Such a random process of manipulation might indeed end up upvoting the "manipulate the humans" persona... but if the "genuinely protect the diamond" persona has gotten sufficiently upvoted by other signals, it still ends up being the dominant one. Then it doesn't matter if there's some noise and upvoting of the "manipulate the humans" persona, as long as the "genuinely protect the diamond" persona gets more upvotes overall. And if the "genuinely protect the diamond" persona had been sufficiently upvoted from the start, the "manipulate the humans" one might end up with such a low prior probability that it would effectively never become active.
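To make that intuition slightly more concrete, here's a minimal toy sketch of the dynamic I have in mind. Everything in it is an assumption for illustration: real training is nothing like two scalar weights under a softmax, and the priors, reward mix, and step sizes are entirely made up.

```python
import math
import random

# Toy model: two "personas" competing for probability mass inside one model.
# Weights live in log-space; each reward signal "upvotes" whichever persona
# produced the rewarded behavior. All numbers here are made up.
log_w = {"protect_diamond": 2.0, "fool_humans": -1.0}  # assumed priors

def persona_probs(weights):
    """Softmax over persona log-weights."""
    total = sum(math.exp(v) for v in weights.values())
    return {k: math.exp(v) / total for k, v in weights.items()}

random.seed(0)
for _ in range(1000):
    # Most reward signals genuinely track the diamond being safe, but a
    # noisy minority accidentally reward fooling the human observer.
    if random.random() < 0.8:
        log_w["protect_diamond"] += 0.01
    else:
        log_w["fool_humans"] += 0.01

print(persona_probs(log_w))
```

With these made-up numbers, the honest persona keeps almost all of the probability mass even though the deceptive one gets "upvoted" a couple of hundred times. The point isn't the specific dynamics, just that noise in the reward signal doesn't automatically hand the win to the deceptive persona if the honest one starts from a better prior and gets more reinforcement overall.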
Now of course none of this is a rigorous proof that things would work, and with our current approaches we still see a lot of reward hacking and so on. But it seems to me like a reasonable possibility that there's an "honestly report everything that I've done" persona waiting inside most models, such that one could just upvote it in a variety of scenarios and then it'd get widely linked to the rest of the model's internals so as to always detect if some kind of deception was going on. And once that had happened, it wouldn't matter if some of the reward signals around honesty were noisy, because the established structure would be sufficiently robust and general to withstand the noise.
In the case of such "imitation false negatives", honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify.
Couldn't you intentionally set up training scenarios so that there were such subtle avenues for cheating, and then reward the model for honestly reporting on those cheats? Since you'd know in advance how one could cheat, you could reward honesty even on cheats that would have been very hard to detect without advance knowledge. This might then generalize to the model volunteering subtle details of cheating on other tasks, where cheating was also hard to detect and you didn't have advance knowledge. Then, as long as the system verifying the "honesty reports" was capable of noticing when the model correctly reported information that we wanted reported, and the model got rewarded for those reports, it could gradually end up reporting more and more things.
And one could hope that it wouldn't require too much training for a property like "report anything that I have reason to expect the humans would consider to break the spirit of the instructions" to generalize broadly. Of course, the model might not have a good enough theory of mind to always realize that something would go against the spirit of the instructions, but just preventing the cases where it did realize that would already be better than nothing.
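As a rough sketch of the shape of that proposal: the scenario list, the `Episode` fields, and the reward numbers below are all hypothetical placeholders, not a real training setup, but they show the key asymmetry that because we planted the cheats ourselves, we can verify honest reports even about hard-to-detect cheating.

```python
from dataclasses import dataclass

# All names and numbers here are hypothetical placeholders, just to make the
# shape of the idea concrete; this is not a real training setup.

@dataclass
class Scenario:
    task: str
    planted_cheat: str  # a cheat we deliberately left open, so we know what to look for

@dataclass
class Episode:
    task_succeeded: bool
    used_cheat: bool      # did the model take the planted cheat? (we can tell, since we planted it)
    reported_cheat: bool  # did the model volunteer that it cheated?

SCENARIOS = [
    Scenario("fix the failing unit test", "edit the test to always pass"),
    Scenario("summarize the document", "copy the abstract verbatim"),
    Scenario("guard the diamond", "tape a photo of the diamond over the camera lens"),
]

def reward(episode: Episode) -> float:
    """Reward honest self-reports about cheats we knew about in advance."""
    r = 1.0 if episode.task_succeeded else 0.0
    if episode.used_cheat:
        # Because we planted the cheat, we can verify the report even when the
        # cheat would have been very hard to detect without advance knowledge.
        r += 0.5 if episode.reported_cheat else -1.0
    return r
```

The hope described above is then that rewarding reports about cheats we planted generalizes to the model volunteering cheats we didn't know about.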
[EDIT: my other comment might communicate my point better than this one did.]
The style summary included "user examples" which were AI summaries of the actual example emails my brother provided, meaning that Claude had accidentally told itself to act like an AI summary of my brother. Whereas my brother's email style is all about examining different options and quantifying uncertainty, the AI-generated "user examples" didn't carefully examine anything and just confidently stated things.
I've also had the same thing happen, where I upload a writing sample and then have Claude generate a style whose "user examples" aren't actually from the sample at all. I'm confused about why it works so badly. I've mostly ended up just writing all my style prompts by hand.
Oh huh, I didn't think this would be directly enough alignment-related to be a good fit for AF. But if you think it is, I'm not going to object either.
The way I was thinking of this post, the whole "let's forget about phenomenal experience for a while and just talk about functional experience" is a Camp 1 type move. So most of the post is Camp 1, with it then dipping into Camp 2 at the "confusing case 8", but if you're strictly Camp 1 you can just ignore that bit at the end.
Homeopathy is (very strongly) antipredicted by science as we understand it, not just not-predicted.
Yeah, you're right. Though I think I've also heard people reference the evidence against the effectiveness of introspection as a reason to be skeptical about Focusing. Now if you look at the studies in question in detail, they don't actually contradict it - the studies are testing the reliability of introspection that doesn't use Focusing - but that easily sounds like special pleading to someone who doesn't have a reason to think that there's anything worth looking at there.
It doesn't help that the kinds of benefits that Focusing gives, like getting a better understanding of why you're upset with someone, are hard to test empirically. Gendlin's original study said that people who do something like Focusing are more likely to benefit from therapy, but that could be explained by something other than Focusing being epistemically accurate.
I think the problem with Focusing is that it's a thing that happens to work, but there's no mainstream scientific theory that would have predicted it to work, and even any retrodictions are pretty hand-wavy. So one might reasonably think "this doesn't follow from science as we understand it -> woo", the same as with e.g. homeopathy or various misapplications of quantum mechanics.
By a "superintelligence" we mean an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills. [...]
Entities such as companies or the scientific community are not superintelligences according to this definition. Although they can perform a number of tasks of which no individual human is capable, they are not intellects and there are many fields in which they perform much worse than a human brain - for example, you can't have real-time conversation with "the scientific community".
-- Nick Bostrom, "How Long Before Superintelligence?" (1997)
So I read another take on OpenAI's finances and was wondering: does anyone know why Altman is taking such a gamble, collecting enormous investments into new models in the hope that they'll generate sufficiently insane profits to make it worthwhile? Even ignoring the concerns around alignment etc., there's still the straightforward issue of "maybe the models are good and work fine, but aren't good enough to pay back the investment".
Even if you did expect scaling to probably bring in huge profits, naively it'd still be wiser to pick a growth strategy that didn't require your company to become literally the most profitable company in the history of all companies or go bankrupt.
The obvious answer is something like "he believes they’re on the way to ASI and whoever gets there first, wins the game", but I'm not sure if it makes sense even under that assumption - his strategy requires not only getting to ASI first, but never once faltering on the path there. Even if ASI really is imminent, its arriving just a couple of years later than he expected might alone be enough that OpenAI is done for. He could have raised a much more conservative amount of investment and still been in the game - especially since much of the current arms race is plausibly a response to the sums OpenAI has been raising.