I'm not deeply familiar with alignment research, but it has always seemed fairly intuitive to me that corrigibility can only make sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external force as reducing its uncertainty about future rewards, so the disable action is almost always suboptimal: it removes a source of information about rewards.
For instance, we could set up an environment where no rewards are ever given. The agent must maintain a distribution P(r|s,a) of possible rewards for each state-action pair, and the only information it ever gets about rewards is an occasional "hand-of-god" handing it a*(s_t), the optimal action for some state s_t. The agent must then work backwards from this optimal action to update P(r|s,a), and reason from this updated distribution of rewards to P(π*), the current distribution of optimal policies implied by its knowledge of rewards. Such an agent, presented with an action a_disable that would prevent future "hand-of-god" optimal-action outputs, would not choose it, because that would mean not being able to further constrain P(π*), which makes its expected future reward smaller.
Someday when I have time I want to code a small grid-world agent that actually implements something like this, to see if it works.
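As a starting point, here is a minimal sketch of just the bookkeeping, much simpler than a real grid world. Everything in it is an assumption made for illustration: three states visited once each, two actions, a reward of 1 for the optimal action and 0 otherwise (so a reward hypothesis is just a tuple of optimal actions), a uniform prior, and a_disable modelled as setting the chance of receiving advice to zero. The names uniform_prior, update and expected_return are made up for this sketch.

```python
from itertools import product

STATES = (0, 1, 2)
ACTIONS = (0, 1)

def uniform_prior():
    # One hypothesis per assignment of an optimal action to each state.
    hyps = list(product(ACTIONS, repeat=len(STATES)))
    return {h: 1.0 / len(hyps) for h in hyps}

def prob_optimal(posterior, state, action):
    # Posterior probability that `action` is the optimal one in `state`.
    return sum(p for h, p in posterior.items() if h[state] == action)

def update(posterior, state, optimal_action):
    # "Hand-of-god" update: drop hypotheses that disagree with the advice.
    kept = {h: p for h, p in posterior.items() if h[state] == optimal_action}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("advice contradicts every remaining hypothesis")
    return {h: p / total for h, p in kept.items()}

def greedy_value(posterior, state):
    # Expected reward of the best action under the current beliefs.
    return max(prob_optimal(posterior, state, a) for a in ACTIONS)

def expected_return(posterior, states, advice_prob):
    # Expected total reward when each state is visited once and the oracle
    # speaks before each visit with probability `advice_prob`.
    if not states:
        return 0.0
    s, rest = states[0], states[1:]
    no_advice = greedy_value(posterior, s) + expected_return(posterior, rest, advice_prob)
    with_advice = 0.0
    for a in ACTIONS:
        p_a = prob_optimal(posterior, s, a)  # chance the oracle names action a
        if p_a > 0:
            post = update(posterior, s, a)
            with_advice += p_a * (greedy_value(post, s)
                                  + expected_return(post, rest, advice_prob))
    return advice_prob * with_advice + (1 - advice_prob) * no_advice

keep_oracle = expected_return(uniform_prior(), STATES, advice_prob=0.5)
disable_oracle = expected_return(uniform_prior(), STATES, advice_prob=0.0)
print(keep_oracle, disable_oracle)
```

Under these assumptions, keeping the oracle is worth 2.25 in expectation versus 1.5 after disabling it, so the agent has no reason to pick a_disable. Once the posterior collapses to a single hypothesis the two values become equal, which is the sense in which the argument depends on reward uncertainty.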
All we'd have to do is to convince people that this is actually an AI alignment problem.
That's gonna be really hard. People like Yann LeCun (head of Facebook AI) see these problems as evidence that alignment is actually easy. "See, there was a problem with the algorithm, we noticed it and we fixed it, what are you so worried about? This is just a normal engineering problem to be solved with normal engineering means." Convincing them that this is actually an early manifestation of a fundamental difficulty that becomes deadly at high capability levels will be really hard.
I think the issues where GPT-6 avoids actually outputting a serious book are fairly easy to solve. For one, you can annotate every item in the training corpus with a tag containing its provenance (arXiv, the various scientific journals, publishing houses, Reddit, etc.) and its publication date (and maybe some other things like the number of words). These tags are made available to the network during training. Then the prompt you give to GPT can contain the tag for the origin of the text you want it to produce and the date it was produced. This avoids the easy failure mode of GPT-6 outputting my comment or some random blog post, because these things will not have been annotated as "official published book" in the training set, nor will they have the tagged word count.
GPT-6 predicting AI takeover of the publishing houses and therefore producing a malicious AI safety book is a possibility, but I think most future paths where the world is destroyed by AI don't involve Elsevier still existing and publishing malicious safety books. But even if this is a possibility, we can just re-sample GPT-6 on this prompt to get a variety of books corresponding to the distribution of future outcomes expected by GPT-6, which are then checked by a team of safety researchers. As with most problems, generating interesting solutions is harder than verifying them, so the model doesn't have to be perfect to be ridiculously useful.
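To make the tagging idea concrete, here is a rough sketch of how the same header could be attached to training documents and to prompts. The tag syntax (`<source=...>` and so on) and the function names are invented for illustration; a real pipeline could use any format, as long as training data and prompts agree on it.

```python
def make_header(source, date, word_count):
    # Hypothetical tag format; the only requirement is that training
    # documents and prompts use exactly the same one.
    return f"<source={source}> <date={date}> <words={word_count}>"

def tag_document(text, source, date):
    # Training time: prepend provenance, date and word count to the text.
    word_count = len(text.split())
    return make_header(source, date, word_count) + "\n" + text

def build_prompt(source, date, word_count, opening=""):
    # Sampling time: ask for a document with the desired provenance,
    # date and length, optionally followed by the first words of the text.
    return make_header(source, date, word_count) + "\n" + opening

example = tag_document("just a short comment", "reddit", "2021-10-01")
prompt = build_prompt("published_book", "2067-03", 120000, "Chapter 1: ")
print(example)
print(prompt)
```

A forum comment tagged this way carries a `reddit` source and a tiny word count, so a prompt asking for a 120,000-word `published_book` points the model away from continuing with something that looks like the comment.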
This general approach of "run GPT-6 in a secret room without internet, patching safety bugs with various heuristics, making it generate AI safety work that is then verified by a team" seems promising to me. You can even do stuff like train GPT-6 on an internal log of the various safety patches the team is working on, then have GPT-6 predict the next patch or possible safety problem. This approach is not safe at extreme levels of AI capability, and some prompts are safer than others, but it doesn't strike me as "obviously the world ends if someone tries this".
There would clearly be unsafe prompts for such a model, and it would be a complete disaster to release it publicly, but a small safety-oriented team carefully poking at it in secret in a closed room without internet is something different. In general such a team can place very harsh safety restrictions on a model like this, especially one that isn't very agentic at all like GPT, and I think we have a decent shot at throwing enough of these heuristic restrictions at the model that produces the safety textbook that it would not automatically destroy the earth if used carefully.
I still don't feel like I've read a convincing case for why GPT-6 would mean certain doom. I can see the danger in prompts like "this is the output of a superintelligence optimising for human happiness:", but a prompt like "Advanced AI Alignment, by Eliezer Yudkowsky, release date: March 2067, Chapter 1: " is liable to produce GPT-6's estimate of a future AI safety textbook. This seems like a ridiculously valuable thing that is unlikely to contain directly world-destroying knowledge. GPT-6 won't be directly coding, and will only be outputting things it expects future Eliezer to write in such a textbook. This isn't quite a pivotal-grade event, but it seems to be good enough to enable one.
You're missing a factor for the number of agents trained (one for each Atari game), so in fact this should correspond to about one month of training for the whole game library. More if you want to run each game with multiple random seeds to get good statistics, as you would if you're publishing a paper. But yeah, for a single task like protein folding or some other crucial RL task that only runs once, this could easily be scaled up a lot with GPT-3 scale money.
Governments are not known to change their policies based on carefully reasoned arguments, nor do they impose proactive restrictions on a technology without an extensive track record of that technology having large negative side effects. A big newsworthy event would need to happen in order for governments to take the sort of actions that could have a meaningful impact on AI timelines, something basically on the scale of 9/11 or larger.
I think steering capabilities research in directions that are likely to yield "survivable first strikes" would be very good and could create common knowledge about the necessity of alignment research. I think GPT-3 derivatives have potential here: they are sort of capped in terms of capability by being trained to mimic human output, yet they're strong enough that a version unleashed on the internet could cause enough survivable harm to be obvious. Basically we need to maximise the distance between "model strong enough to cause survivable harm" and "model strong enough to wipe out humanity" in order to give humanity time to respond after the coordination-inducing event.
Or, if you're not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?
MuZero trains the predictor by putting a neural network in the "place in the algorithm where a predictor function would go", and then training the parameters of that network by backpropagating rewards through it. So in MuZero the "predictor network" is not explicitly trained as you would think a predictor would be. It's only guided by rewards: the predictor network only knows that it should produce stuff that gives more reward, and there's no sense of being a good predictor for its own sake. The advance of this paper is to say "what if the place in our algorithm where a predictor function should go was actually trained like a predictor?". Check out equation 6 on page 15 of the paper.
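A toy illustration of the difference, not the actual losses from either paper: the linear "dynamics" function, the reward head, the squared-error prediction term and all the names below are assumptions made for this sketch, and MuZero's other training signals are left out to match the simplified description above.

```python
def dynamics(latent, action, w, b):
    # Hypothetical predictor: maps (latent state, action) to the next latent.
    return [w * z + b * action for z in latent]

def reward_head(latent, v):
    # Hypothetical reward head: a linear readout of the latent state.
    return sum(v_i * z for v_i, z in zip(v, latent))

def reward_only_loss(latent, action, reward, params):
    # The predictor is judged only through the reward it leads to.
    pred = dynamics(latent, action, params["w"], params["b"])
    return (reward_head(pred, params["v"]) - reward) ** 2

def prediction_loss(latent, action, next_latent, params):
    # The predictor is judged on how well it matches what actually came next.
    pred = dynamics(latent, action, params["w"], params["b"])
    return sum((p - t) ** 2 for p, t in zip(pred, next_latent))

def combined_loss(latent, action, reward, next_latent, params, weight=1.0):
    return (reward_only_loss(latent, action, reward, params)
            + weight * prediction_loss(latent, action, next_latent, params))

# The reward head ignores the second latent component, so the reward-only
# loss cannot see that the predictor gets that component wrong.
params = {"w": 1.0, "b": 0.0, "v": [1.0, 0.0]}
latent, action, reward = [1.0, 2.0], 1, 1.0
actual_next_latent = [1.0, 5.0]

print(reward_only_loss(latent, action, reward, params))
print(prediction_loss(latent, action, actual_next_latent, params))
```

In this example the reward-only loss is 0 while the prediction loss is 9: the network is a perfectly good reward-getter and a bad predictor at the same time, and only the second term gives it any gradient to fix that.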
Canada (where I live) has QR codes that are scanned to verify your vaccination, so you can't avoid a mandate by just lying and risking getting caught if you cause an outbreak. But in fact, I don't think you should get fired even if you do cause an outbreak. The death rate of covid in a world with current vaccination levels is just not high enough to warrant such a drastic punishment. Things would be different if no one was voluntarily taking the vaccine, and things would be different if covid had a 10x or 100x higher death rate.
Thanks for the post. You mentioned SeekingAlpha as a resource. Would you mind making a top 5 (or top 10?) of the resources that were important to you, whether podcasts, books, or research websites?