Let's See You Write That Corrigibility Tag

Welp. I decided to do this, and here it is. I didn't take nearly enough screenshots. Some large percent of this is me writing things, some other large percent is me writing things as if I thought the outputs of OpenAI's Playground were definitely something that should be extracted/summarized/rephrased, and a small percentage is verbatim text-continuation outputs. Virtually no attempts were made to document my process. I do not endorse this as useful and would be perfectly fine if it were reign of terror'd away, though IMO it might be interesting to compare against, let's say, sane attempts. Regardless, here ya go: one hour.


It’s past my bedtime.

I’ve got a pint in me.

OpenAI Playground is open as a tab.

A timer is running.

I speak to you now of Corrigibility Concerns.

When deputizing an agent who is not you to accomplish tasks on your behalf, there are certain concerns to… not address, but make sure are addressed. Let’s not Goodhart here. Jessica Taylor named “quantilization”. Paul Christiano named “myopia”. Eliezer Yudkowsky named “low impact” and “shutdownability”. I name “eli5ability” and I name “compressibility” and I name “checkpointable” and I name “testable”.


When we list out all of our proxy measures, we want corrigibility to be overdetermined. We want to achieve 70% of our goals completely and the rest half-assed and still end up with a corrigible agent. It's okay to project what we want from an agent onto non-orthogonal dimensions and call each vector important.

So let’s define a corrigible agent. A corrigible agent is an agent that:

  1. Does what we want it to do.
  2. Doesn’t want to do something else.
  3. Can easily be checked for doing what we want it to do.
  4. Can be shut down if we want it to stop doing something.
  5. Can be restarted if we want it to do something else.
  6. Can be explained to us why it did something.
  7. Doesn’t hide its intentions from us.
  8. Doesn’t want us to not know its intentions.
  9. Can be retrained to do something different if we want it to.

Additionally, because we live in the real world, it must not be too computationally expensive to train, run, check, shut down, restart, explain, retrain, or understand. This includes CPU cycles, wall-clock time, human thought, and so on.

My additions to the lexicon of corrigibility proxy measures are eli5ability, compressibility, checkpointable, and testable, and I will define them here.


A planning process must output simple plans. Complicated plans will fail, or if they succeed, will not be understandable by a human. This leads to the following heuristic: “eli5ability” means a plan must be understandable by a non-expert. “Understandable” is a technical term with a specific meaning in psychology: a person understands a task if they have a model of the task in their head, and the model is sufficiently close to the real task that they can use it to make predictions. Here the resulting plan must be simple enough to be an input to a much simpler AI whose predictions about the effects score well on whatever heuristics we mean to evaluate those plans. This is the sort of adversarial relationship which can be trained and improved in parallel, which in no way guarantees aligned AI but which certainly constrains the space of non-aligned AIs.


Planning processes must output compressible plans, in the sense that a joint probability function over a simple causal model can be specified by far fewer numbers than one over a complex causal model. The plan must be modular and each individual part packaged and vetted by other safety procedures without context, and the effects of the parts be vetted as a whole without the methods, as estimated both by the planning AI and by the eli5 AI.
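
To put rough numbers on the "far fewer numbers" claim, here is a minimal sketch. It assumes every variable is binary and that the causal model is a Bayesian network; the function names and the two example graphs are made up for illustration and are not part of the original argument.

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters in an unrestricted joint over n binary variables."""
    return 2 ** n_vars - 1


def causal_model_params(parents: dict[str, list[str]]) -> int:
    """Free parameters in a Bayesian network over binary variables.

    Each node needs one number per configuration of its parents.
    """
    return sum(2 ** len(ps) for ps in parents.values())


# Hypothetical "simple" plan: each step depends only on the previous step.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

# Hypothetical "complex" plan: each step depends on every earlier step.
dense = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["a", "b", "c"]}

print(causal_model_params(chain))  # 1 + 2 + 2 + 2
print(causal_model_params(dense))  # 1 + 2 + 4 + 8
print(full_joint_params(4))
```

The chain grows linearly with the number of steps while the fully coupled graph grows exponentially and matches the unrestricted joint, which is one way to read "compressible" here.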


Planning processes must output staged plans. Flexibility in timing, location, target, geography, personnel, etc. should be part of the training process, such that rewards are commensurate with (a) having checkpoints, and (b) changing the environment after every checkpoint. All coupling in plans must be penalized.


We need a new name for this one. I’m thinking of testable in the sense that code can be made testable by making it more modular, more idempotent, etc. Plans should come in pieces and those pieces should have expected effects and real life unit tests as part of the output and plans without those are highly penalized. It goes without saying at this point that the test here is not that the AI tell us what those expected effects and tests are, but that they agree with what other adversarially trained AIs say.


The main intuition I take from examining the product of my own intuitions with the half-assed text completion from OpenAI’s Playground is that it may be a useful line of thought to reward a powerful optimizer, while training, according to how well its stated expected effects of its plans agree with the expected effects stated by several differently-built less-powerful optimizers (also being trained) when presented with its plans as inputs. This may at least help force a powerful optimizer to intend to deceive us if it wants to avoid our actual request, at which point other intent-extraction safety tools may have an easier time picking up on its deceit.
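
One way this agreement signal might be scored, as a rough sketch only: it assumes each optimizer states its expected effects as a probability distribution over named outcomes, and it swaps in total variation distance as the agreement measure. Neither assumption comes from the text above, and every name below is hypothetical.

```python
from statistics import mean

Forecast = dict[str, float]  # outcome name -> stated probability


def total_variation(p: Forecast, q: Forecast) -> float:
    """Total variation distance between two forecasts (0 = identical, 1 = disjoint)."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def agreement_reward(planner: Forecast, checkers: list[Forecast]) -> float:
    """Average agreement between the planner's forecast and each checker's forecast."""
    if not checkers:
        raise ValueError("need at least one checker forecast")
    return mean(1.0 - total_variation(planner, c) for c in checkers)


# Hypothetical forecasts for one plan.
planner = {"bridge built": 0.9, "nothing happens": 0.1}
checker_agrees = {"bridge built": 0.9, "nothing happens": 0.1}
checker_doubts = {"bridge built": 0.5, "nothing happens": 0.5}

print(agreement_reward(planner, [checker_agrees]))
print(agreement_reward(planner, [checker_agrees, checker_doubts]))
```

Averaging over several differently-built checkers is the point: a plan only scores well if its stated effects are legible to all of them, not just to one.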

What's it like to have sex with Duncan?

(I of course told my partners up front that a public essay was one possible outcome of the survey and that I would not-publish anything they flagged as private.)

On A List of Lethalities

flunked out

Gonna guess zero. Much less costly to leave 'em in for 12 weeks for goodwill than to try to remove people in that timeframe.

pre-selected for

Good point. Probably at least some of this. You need referrals, and I was definitely not the smartest of the people in my reference class available to refer, though maybe 3rd. Still, someone comparing me with the one I know definitely had more raw IQ should definitely have guessed that I was more likely to pick up that particular thing.

On A List of Lethalities

It's also possible I'm someone "amenable" to this mindset and that was just the "on switch". DSP, by the way.

But yeah I could see a post on... cryptanalysis, and finding and minimizing attack surfaces, without necessarily having an attack in mind, and a hindsight-view story of what first caused me to think in that way.

On A List of Lethalities

Security mindset seems highly related, and the training thing here seems like it shouldn’t be that hard? Certainly it seems very easy compared to the problem the trained people will then need to solve, and I think Eliezer has de facto trained me a substantial amount in this skill through examples over the years. There was a time I didn’t have security mindset at all, and now I have at least some such mindset, and some ability to recognize lethal issues others are missing. He doesn’t say how many other people he knows who have the abilities referred to here, I’d be curious about that. Or whether he knows anyone who has acquired them over time.


I have just realized that I've believed for years that "security mindset" is relatively easy and people who can't at least dip into it are probably being "lazy". I was being lazy; somehow I didn't notice that I was literally trained in this mindset during an internship many many years ago. I think they did at least an acceptable job of training me. If I had to guess what the key trainings were, I'd guess:

  • [examples and practice] Here, learn some things about cryptography. Here is a theoretical algorithm and a convincing non-mathematical description of why it seems very hard to break. Watch as I break it via side channels, via breaking your assumptions of hardware access, via information theory. Go break things. Start by brainstorming all the ways things might be broken. Work with other smart people who are also doing that.
  • [examples in a different domain and visceral relevance] Speaking of hardware access, literal nation states have some incentive to get hardware access to you. Please absorb the following cautions. Here is an anecdote about an employee who always gets in the passenger side of her car rather than the driver's side to illustrate the amount of caution you could wield. Every time you drive to work, please note the individuals sitting around in defensive locations with guns. It is literally at least some risk to your person if you ever write [this post] or put this internship on your resume, but generally people find it worth the risk, especially 5+ years after they are not actively associated with us.

D&D.Sci June 2022 Evaluation and Ruleset

I spent all of my time trying to figure out how to figure out how much [the hidden variable causing the correlation between nerd and otaku] affects trait choices and winrates.

Apparently they are correlated without a relevant hidden variable. :D

AI Could Defeat All Of Us Combined

I don't understand why it's plausible to think that AIs might collectively have different goals than humans.

Future posts, right? We're assuming that premise here:

So, for what follows, let's proceed from the premise: "For some weird reason, humans consistently design AI systems (with human-like research and planning abilities) that coordinate with each other to try and overthrow humanity." Then what? What follows will necessarily feel wacky to people who find this hard to imagine, but I think it's worth playing along, because I think "we'd be in trouble if this happened" is a very important point.

AI Could Defeat All Of Us Combined

I'm not talking (yet) about whether, or why, AIs might attack human civilization. That's for future posts.

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

This was a fantastic idea and I am more interested in model interpretability for understanding these results than any I have seen in a while. In particular any examples of nontrivial mesa-optimizers we can find in the wild seem important to study, and maybe there's one here.
