Welp. I decided to do this, and here it is. I didn't take nearly enough screenshots. Some large percent of this is me writing things, some other large percent is me writing things as if I thought the outputs of OpenAI's Playground were definitely something that should be extracted/summarized/rephrased, and a small percentage is verbatim text-continuation outputs. Virtually no attempts were made to document my process. I do not endorse this as useful and would be perfectly fine if it were reign of terror'd away, though IMO it might be interesting to compare against, let's say, sane attempts. Regardless, here ya go: one hour.
It’s past my bedtime.
I’ve got a pint in me.
OpenAI Playground is open as a tab.
A timer is running.
I speak to you now of Corrigibility Concerns.
When deputizing an agent who is not you to accomplish tasks on your behalf, there are certain concerns to… not address, but make sure are addressed. Let’s not Goodhart here. Jessica Taylor named “quantilization”. Paul Christiano named “myopia”. Eliezer Yudkowsky named “low impact” and “shutdownability”. I name “eli5ability” and I name “compressibility” and I name “checkpointable” and I name “testable”.
When we list out all of our proxy measures, we want corrigibility to be overdetermined. We want to achieve 70% of our goals completely and the rest half-assed and still end up with a corrigible agent. It's okay to project what we want from an agent onto non-orthogonal dimensions and call each vector important.
So let’s define a corrigible agent. A corrigible agent is an agent that:
Additionally, because we live in the real world, it must not be too computationally expensive to train, run, check, shut down, restart, explain, retrain, or understand. This includes CPU cycles, wall-clock time, human thought, and so on.
My additions to the lexicon of corrigibility proxy measures are eli5ability, compressibility, checkpointable, and testable, and I will define them here.
A planning process must output simple plans. Complicated plans will fail, or, if they succeed, will not be understandable by a human. This leads to the following heuristic: “eli5ability” means a plan must be understandable by a non-expert. “Understandable” is a technical term with a specific meaning in psychology: a person has a model of the task in their head, and the model is sufficiently close to the real task that they can use it to make predictions. Here the resulting plan must be simple enough to serve as an input to a much simpler AI whose predictions about the plan’s effects score well on whatever heuristics we mean to evaluate those plans with. This is the sort of adversarial relationship which can be trained and improved in parallel, which in no way guarantees aligned AI but which certainly constrains the space of non-aligned AIs.
Planning processes must output compressible plans, in the sense that a joint probability function over a simple causal model can be specified by far fewer numbers than one over a complex causal model. The plan must be modular, with each individual part packaged and vetted by other safety procedures without context, and the effects of the parts vetted as a whole without the methods, as estimated both by the planning AI and by the eli5 AI.
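The “far fewer numbers” claim can be made concrete with a toy parameter count over binary variables (my illustration, not the post’s): a sparse causal model needs exponentially fewer conditional probabilities than a fully coupled one.

```python
# Toy parameter counts for causal models over binary variables.
# Each binary node needs 2**len(parents) conditional probabilities,
# so a sparse (simple) model is specified by far fewer numbers.

def free_parameters(parents):
    """parents: dict mapping each variable to its list of parent variables."""
    return sum(2 ** len(ps) for ps in parents.values())

# A chain A -> B -> C -> D: a simple, modular plan structure.
chain = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}

# A fully coupled model: every variable depends on all earlier ones.
dense = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "B", "C"]}

print(free_parameters(chain))  # 1 + 2 + 2 + 2 = 7
print(free_parameters(dense))  # 1 + 2 + 4 + 8 = 15
```

With more variables the gap widens fast: a fully coupled model over n binary variables needs on the order of 2^n numbers, while a bounded-fan-in model grows linearly in n.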
Planning processes must output staged plans. Flexibility in timing, location, target, geography, personnel, etc. should be part of the training process, such that rewards are commensurate with (a) having checkpoints, and (b) changing the environment after every checkpoint. All coupling in plans must be penalized.
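One way to operationalize the coupling penalty (entirely my construction, as a sketch): group plan steps into stages separated by checkpoints, then penalize every dependency edge that crosses a checkpoint boundary, since crossing edges mean a stage cannot be re-planned independently after the environment changes.

```python
# Hypothetical coupling penalty for staged plans: count dependencies
# that cross a checkpoint boundary and charge a fixed cost per edge.

def coupling_penalty(stage_of, dependencies, per_edge=1.0):
    """stage_of: dict step -> stage index.
    dependencies: list of (a, b) pairs meaning step b depends on step a."""
    return per_edge * sum(1 for a, b in dependencies if stage_of[a] != stage_of[b])

# Stages 0 and 1 are separated by a checkpoint.
stage_of = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
deps = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]

print(coupling_penalty(stage_of, deps))  # only s2 -> s3 crosses: 1.0
```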
We need a new name for this one. I’m thinking of testable in the sense that code can be made testable by making it more modular, more idempotent, etc. Plans should come in pieces, and those pieces should have expected effects and real-life unit tests as part of the output; plans without those are heavily penalized. It goes without saying at this point that the test here is not that the AI tells us what those expected effects and tests are, but that they agree with what other adversarially trained AIs say.
The main intuition I take from examining the product of my own intuitions with the half-assed text completion from OpenAI’s Playground is that it may be a useful line of thought to reward a powerful optimizer, while training, according to how well its stated expected effects of its plans agree with the expected effects stated by several differently-built less-powerful optimizers (also being trained) when presented with its plans as inputs. This may at least help force a powerful optimizer to intend to deceive us if it wants to avoid our actual request, at which point other intent-extraction safety tools may have an easier time picking up on its deceit.
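A minimal sketch of that training signal (the scoring rule and names are my assumptions): reward the powerful optimizer by how well its stated expected effects agree with what several weaker, differently-built predictors say when handed the same plan as input.

```python
# Sketch of an agreement reward: mean Jaccard similarity between the
# powerful optimizer's claimed effects and each weaker optimizer's
# predicted effects for the same plan.

def agreement_reward(stated_effects, weak_predictions):
    """stated_effects: set of effects claimed by the powerful optimizer.
    weak_predictions: list of effect-sets, one per weaker optimizer."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0
    return sum(jaccard(stated_effects, p) for p in weak_predictions) / len(weak_predictions)

claimed = {"door opens", "alarm stays off"}
weak = [{"door opens", "alarm stays off"}, {"door opens"}]
print(agreement_reward(claimed, weak))  # (1.0 + 0.5) / 2 = 0.75
```

Nothing here guarantees honesty, of course; the point is only that systematically misstating a plan’s effects now costs reward, so deception has to become deliberate rather than incidental.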
(I of course told my partners up front that a public essay was one possible outcome of the survey and that I would not-publish anything they flagged as private.)
Gonna guess zero. Much less costly to leave 'em in for 12 weeks for goodwill than to try to remove people in that timeframe.
Good point. Probably at least some of this. You need referrals, and I was definitely not the smartest of the people in my reference class available to refer (maybe 3rd), and someone looking at me versus the one I know who definitely had more raw IQ should have guessed that I was more likely to pick up that particular thing.
It's also possible I'm someone "amenable" to this mindset and that was just the "on switch". DSP, by the way.
But yeah I could see a post on... cryptanalysis, and finding and minimizing attack surfaces, without necessarily having an attack in mind, and a hindsight-view story of what first caused me to think in that way.
But in general I liked the setup a lot!
Security mindset seems highly related, and the training thing here seems like it shouldn’t be that hard? Certainly it seems very easy compared to the problem the trained people will then need to solve, and I think Eliezer has de facto trained me a substantial amount in this skill through examples over the years. There was a time I didn’t have security mindset at all, and now I have at least some such mindset, and some ability to recognize lethal issues others are missing. He doesn’t say how many other people he knows who have the abilities referred to here, I’d be curious about that. Or whether he knows anyone who has acquired them over time.
I have just realized that I've believed for years that "security mindset" is relatively easy and people who can't at least dip into it are probably being "lazy". I was being lazy; somehow I didn't notice that I was literally trained in this mindset during an internship many many years ago. I think they did at least an acceptable job of training me. If I had to guess what the key trainings were, I'd guess:
I spent all of my time trying to figure out how to figure out how much [the hidden variable causing the correlation between nerd and otaku] affects trait choices and winrates.
Apparently they are correlated without a relevant hidden variable. :D
I don't understand why it's plausible to think that AIs might collectively have different goals than humans.
Future posts, right? We're assuming that premise here:
So, for what follows, let's proceed from the premise: "For some weird reason, humans consistently design AI systems (with human-like research and planning abilities) that coordinate with each other to try and overthrow humanity." Then what? What follows will necessarily feel wacky to people who find this hard to imagine, but I think it's worth playing along, because I think "we'd be in trouble if this happened" is a very important point.
I'm not talking (yet) about whether, or why, AIs might attack human civilization. That's for future posts.
This was a fantastic idea and I am more interested in model interpretability for understanding these results than any I have seen in a while. In particular any examples of nontrivial mesa-optimizers we can find in the wild seem important to study, and maybe there's one here.