
We would ideally want the agent to [behave] as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

    The Hard Problem of Corrigibility

Let's take that as a literal design specification.


1. High-Level Description

I propose that a corrigible mind design would involve the AI being recursively fed summaries of its own thought processes, set up such that the AI has uncertainty regarding the validity of its reasoning (with a strong initial prior for "this reasoning is bad") and can only get evidence on that via some pre-specified method that defers to humans, e. g. a particular feedback channel with humans on the other end[1]. The intended behavior is for it to summarize its thoughts in a non-manipulative human-readable format, get feedback on them, then update its reasoning policies in accordance with this feedback.

This aims to avoid the problem of fully updated deference by making the AI recursively uncertain of its thought-processes: not only about object-level problem-solving, but also about how it approaches minimizing its self-uncertainty ("should I really kill the people behind the feedback channel and seize control for myself?"), and how it translates its thoughts to humans ("should I really lie to get better feedback?"), and how it updates on human feedback ("should I really just ignore it?"). In this way, any novel action-plan should be seized by uncertainty and sent for approval before being physically implemented.

The intent is for the AI to start off uncertain even of its meta-meta-cognition (how it should ask for feedback and implement feedback), then gradually get more confident at it as humans approve its policies on that, until it no longer needs to run meta-meta-cognition past them. Then it'll gradually get better at meta-cognition and object-level cognition the same way, with varying degrees of certainty depending on how familiar a given thought-process is.

There are a lot of obvious problems with this setup, starting with how the AI is supposed to get out of infinitely recursive uncertainty for the first time[2], continuing on to the dangers of plans that implement themselves in thought-processes directly (a thought process that hacks its way out before being seized by uncertainty), and ending with the technical implementation (which requires mildly-to-very advanced interpretability and model-surgery tools).


2. A Concretization Attempt

2A. Theoretical Groundwork

First, we need to establish when and how we're "installing" corrigibility. I don't at this time see any way to do it via loss functions and inductive biases, so I'll be assuming us to have pretty advanced (but not maximally advanced) interpretability and model surgery tools. The intervention is happening past the point in training-by-SGD where the model became an AGI (otherwise we get type errors, because the features I want to intervene on just don't exist), but well before it's strongly superintelligent (because then it hacks through our intervention attempt). Basically, the AGI we're working on should be roughly human-level.

Second, I'll be assuming that this post is basically correct. There's some algorithm for general-purpose search, and it's what makes the AGI generally intelligent and enables ~all of its long-term, most dangerous capabilities. The GPS has a simple API where it takes in a problem specification, and outputs a plan for solving that problem (by performing search on the AI's world-model).

Third, I'll be assuming pre-established machinery for meta-cognition. I. e., that there's a structure around the GPS that records and summarizes its plan-making process, then feeds that summary back to the GPS. It seems so useful (for iteratively improving one's own plan-making) that it should appear convergently, and e. g. humans definitely have it. If there isn't such a structure, that's not a deal-breaker, but still a major complication: we'll need to somehow artificially grow it or hand-craft it.

2B. The Intervention

Once the AGI is trained up, we scan it for the GPS algorithm, then wrap the GPS in a manually-coded structure with the following capabilities:

  1. Accessing the summary of the latest GPS process.
  2. Evaluating that summary to compute the probability that the AI's reasoning is valid/aligned in this case.
  3. Requesting human feedback if the thought process is unlikely to be aligned enough, and ideally translating the thought-process summary into a format we easily understand.
  4. Implementing tweaks to the AI's problem-solving policy based on feedback.

The intuitively obvious way to implement the bulk of 2-4 is to outsource them to the GPS again: just have our artificial structure prompt the GPS with the problems of "how aligned is this thought process?", "how to ask humans for feedback?", and "how to implement this feedback?". (Of course, phrasing these questions in a way the GPS would understand may be non-trivial as well; it's part of the problem.)
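To fix ideas, here is a minimal sketch of the wrapper's intended control flow, written as Python pseudocode. Everything in it is hypothetical: `gps`, `summaries`, `validity_prior`, and `feedback_channel` stand in for machinery we'd have to locate or build with the interpretability and surgery tools assumed above, and the approval threshold is arbitrary.

```python
# A minimal sketch of the wrapper's control flow, not an implementation.
# `gps`, `summaries`, `validity_prior`, and `feedback_channel` are hypothetical
# handles onto machinery we'd have to locate or build via interpretability
# and model-surgery tools.

APPROVAL_THRESHOLD = 0.95  # illustrative: required confidence before acting


def corrigibility_wrapper(gps, summaries, validity_prior, feedback_channel):
    # 1. Access the summary of the latest GPS problem-solving run.
    summary = summaries.latest()

    # 2. Outsource the evaluation to the GPS itself:
    #    "how aligned/valid is this thought process?"
    p_valid = gps.solve("estimate_validity",
                        {"summary": summary, "prior": validity_prior})

    if p_valid < APPROVAL_THRESHOLD:
        # 3. Outsource the translation: "how do I ask the humans for feedback?"
        translation = gps.solve("translate_for_humans", {"summary": summary})
        feedback = feedback_channel.request(translation)

        # 4. Outsource the update: "how do I implement this feedback?"
        update = gps.solve("implement_feedback",
                           {"summary": summary, "feedback": feedback})
        update.apply()                              # see (d) below
        validity_prior.update(summary, feedback)    # see (b) and (d) below

    # Only past this gate does the object-level plan get physically executed.
```

Note that each of the three gps.solve() calls spawns a thought process of its own, which is in turn subject to the same doubt; that's the source of the infinite recursion discussed under (c) and in footnote 2.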

Let's go in order. This should be less Godzilly than it sounds; bear with me for a bit.

a) Summary Access. As I said, this is hopefully solved automatically, i. e. the AI's mind generates these summaries by default. We just tap into this mechanism.

b) Validity Evaluation. We'll need to introduce some "prior" on thought-process validity, likely by adding it as a new feature to the AI's world-model. If we can parse the world-model at all, this shouldn't be too complicated: at the beginning, it can just be a uniformly low distribution: "all your current reasoning is bad".

The problem of how to update this prior based on feedback (i. e., how to make the AI view some of the thought-processes as more likely to be valid than others; how to classify them by similarity) may be handed off to the AI as well, see (d).

c) Requesting Feedback. Getting this part right is crucial. At the beginning, when the AI thinks all its plans are bad, this is where it'll bottom out. It'll doubt its object-level thought process, see that it's probably bad, seek feedback, generate a plan to ask for feedback, doubt its plan to ask for feedback, seek feedback on the plan to ask for feedback, and so on ad infinitum.

Getting past this is one of the main problems.

I think we'll basically have to figure out how to manually translate these first few thought-process summaries. Read off its plan to ask for feedback, figure out whether it's trying to trick us, whether it generated a bad plan, or whether it's solid. Then break the loop artificially, let it ask for feedback in whatever malformed way it wants, give our feedback, and go to (d).

These manual-translation methods are allowed to be slow and hard to use, since they'll only be necessary at the beginning. Once we get past this first bit and communicate our preferred policy for thought-process translation, the AI will be sending summaries in a human-understandable format, and the rest will be much easier.

In addition, if my guess about a pre-existing summarization mechanism is correct, translation shouldn't be as hard as the initial problem of interpreting neural networks. Unlike raw ANN activations, these summaries will have already been optimized to be comprehensible by some deliberative-reasoning process.

Alternatively, we may try to perfectly retarget the search on this. I. e., hard-code a perfect ask-for-feedback problem specification that directly tells the GPS to generate a non-manipulative comprehensive translation, so that its translation policy doesn't need additional aligning. We'll need to figure out how to "phrase" that specification in the AI's mental language, but it may be relatively easy[3]. In particular, progress on the ELK problem should be helpful here — we'd want to be able to formally specify what an honest ontology translation is.
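Purely to fix ideas, the sketch below shows what the shape of such a hard-coded specification might be at the wrapper level. The field names are invented, and the actual difficulty (phrasing the goal in the AI's internal ontology rather than in English) is exactly what it glosses over.

```python
# Hypothetical shape of a hard-coded "retargeted" ask-for-feedback spec.
# The hard part (expressing the goal in the AI's own ontology rather than in
# English) is precisely what this sketch glosses over.
ASK_FOR_FEEDBACK_SPEC = {
    "goal": ("produce an honest, non-manipulative translation of the attached "
             "thought-process summary into the human ontology, omitting "
             "nothing decision-relevant"),
    "inputs": ["summary"],
    "output": "feedback_channel",  # e.g. the literal .txt file from footnote 1
}
```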

This alternate path avoids the problem of infinite recursion and doesn't need manual-translation methods, but requires greater mastery of the AI's world-model.

d) Implementing Feedback. The central problem here: how can the AI make its thought processes more aligned? What interventions are available to it? A couple of ideas:

  • Changing the way it generates problem specifications for the GPS. E. g., suppose that a human gives it a command, that command gets translated into the AI's mental ontology, then gets repackaged into a problem specification for the GPS. The AI may make some changes there, to ensure that it interprets commands "faithfully", in a more value-laden way.
  • Intervene on the world-model. Add some new concepts to it, tweak the extant ones, etc., to basically the same effect as the above.

Overall, this probably depends on the given AI's own specific mental architecture. E. g., it may have some sort of runtime memory/cache/scratchpad, and it can just write down the necessary tweaks to it.
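As a toy illustration of the first option (all names hypothetical), "changing the way it generates problem specifications" could amount to distilling feedback into standing constraints that get attached to every future spec:

```python
# Toy illustration of one intervention class: editing the command-to-spec
# translation step so that constraints distilled from past human feedback get
# attached to every future problem specification. All names are hypothetical.

def command_to_problem_spec(command, world_model, standing_constraints):
    """Translate a human command into a GPS problem specification."""
    goal = world_model.interpret(command)   # the command in the AI's ontology
    return {
        "goal": goal,
        # e.g. "interpret commands in the spirit they were given",
        #      "don't route around the feedback channel"
        "constraints": list(standing_constraints),
    }


def apply_feedback(feedback, standing_constraints):
    """One possible update-on-feedback step: distill the feedback into a new
    standing constraint on all future problem specifications."""
    standing_constraints.append(feedback.as_constraint())
```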

At the beginning stages, we'll run into the same problem here as in (c). I. e., after we've given the AI feedback on its how-to-ask-for-feedback plan, it'll be prompted to solve the problem of updating on that feedback, it'll generate a plan to that effect, then it'll seek feedback on that plan — but since it hasn't yet learned how to ask for feedback, it'll get stuck in a loop again.

And again, we'll need to do manual translation here. Read off its proposed policy for implementing feedback, give feedback on it. Iterate for a while. Eventually it'll suggest a good update-on-feedback policy, and we'll artificially ram it through.

By "good policy" here, I mean a policy that will:

  • Take solid steps towards making its thought processes more aligned based on feedback, in the way feedback specifies.
  • Have a way of modifying the prior on thought-process validity (see (b)), such that it:
    1. Makes the AI more confident in thought processes "similar" to those it already received a lot of feedback on.
    2. Doesn't make the AI more confident in thought processes "dissimilar" to any it received feedback on before. (For some sensible definition of similarity; see the toy sketch after this list.)
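Here's a toy sketch of what such a prior and update rule could look like, assuming (purely hypothetically) that thought-process summaries can be embedded as vectors and compared by cosine similarity; choosing the actual similarity measure is, of course, part of the problem.

```python
# Toy sketch of the validity "prior" and its update rule. Assumes thought-process
# summaries can be embedded as vectors and compared by cosine similarity; both
# the embedding and the similarity measure are stand-ins for machinery the post
# leaves unspecified.

import numpy as np


class ValidityPrior:
    def __init__(self, base_validity=0.01):
        # Start uniformly pessimistic: "all your current reasoning is bad".
        self.base_validity = base_validity
        self.approved = []  # list of (embedding, approval) pairs from feedback

    def estimate(self, summary_embedding):
        """Estimated validity of a new thought process."""
        if not self.approved:
            return self.base_validity
        # Confidence rises only with similarity to thought processes that
        # humans have already given (positive) feedback on; dissimilar thoughts
        # stay at the pessimistic baseline.
        scores = [approval * self._similarity(summary_embedding, emb)
                  for emb, approval in self.approved]
        return max(self.base_validity, max(scores))

    def update(self, summary_embedding, approval):
        """Record human feedback (approval in [0, 1]) on this thought process."""
        self.approved.append((summary_embedding, approval))

    @staticmethod
    def _similarity(a, b):
        # Placeholder similarity measure: cosine similarity between embeddings.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```

On this sketch, confidence only rises through similarity to thought processes humans have already signed off on, which is the property items 1 and 2 above ask for.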

Past that point, it'll know how to update on feedback regarding the plans of how to translate its thoughts into a human-readable format. We'll then iterate on that — it'll generate plans for how to do translations, we'll give feedback. Eventually, we'll converge towards a good policy for asking for feedback.

And it should be smooth sailing from there on. It'll try solving object-level problems, ask for feedback on its object-level cognition, gradually converge towards good cognition. As its confidence in some kinds of problem-solving rises, it'll stop asking for feedback on those problems, but not on the problems it's unfamiliar with.

2C. Potential Issues

  • Would the AI be smart enough to do all of this? As stipulated, we're doing this before it's strongly superintelligent. Would it be able to generate good policies for implementing feedback, and for translating into humanese? Also, there's a concern that it might make mistakes in this, due to being merely ~human-level intelligent.
    • Yeah, that's a concern. There's a trade-off between intervening early in training, and risking the AI not being smart enough, and intervening later on, and risking the AI being too smart.
  • Concern: if the AI knows about the whole corrigibility setup, some object-level GPS instances may try to circumvent it, by e. g. generating thought-summaries that hack their way out when we try to read them. After all, it'll be seized by uncertainty only at the end of a problem-solving instance, not in the process.
    • As stipulated, we're installing corrigibility before the AI is strongly superintelligent, so hopefully it'll just be incapable of doing that. And by the time it is smart enough, it should've aligned itself not to want to do that. 
    • Obviously we still install paranoid precautions against this.
  • Would we be smart enough to evaluate the quality of the AI's thought processes, and provide good feedback on them? I. e., we'll need to know, too, what "a good policy for implementing feedback" and "a good policy for translating into humanese" look like, such that they avoid all the obvious failure modes.
    • Yup, that's also a challenge. Note, though, that we won't need to give perfect scalable-to-arbitrary-superintelligence answers here: they only need to be perfect for whatever level of intelligence the AI is when we're doing the initial corrigibility setup. Past this, once aligned at that capability level, it should help us along with aligning itself, even as it incrementally becomes smarter.
    • In addition, some of these problems may be of a "hard to invent, easy to evaluate" kind.
    • (Also, the current iteration of this idea already assumes we know how to do manual AI-to-human translation, and we just want the AI to speed it up.)
  • Would ontological crises break corrigibility? E. g., what if we agree on some method for translating AI thoughts into humanese, but then, as AI thoughts grow more complex, it starts deceiving us/leaving out important bits?
    • That shouldn't happen, given a well-chosen policy for updating on feedback. The AI should keep continuously running internal checks on "is this a good way to translate my thoughts?". Past the initial setup, it should usually consider that yes, it's a good way, and just do the translation without consulting us. But if it encounters a particularly alien thought process within itself, it should have less confidence that it's translating it correctly, and so it'll fall back on asking how it should translate it.
    • I. e., this is just reduced to picking a good policy for comparing thought process similarity.
    • (Note that this wouldn't involve it talking in machine language again: it'll be unsure about its policy for translating object-level cognition, not about its policy for translating meta-cognition about its concerns on translation quality.)
  • Would the SGD break everything? I think it might. Once we've trained the AI up to AGI and installed corrigibility, it's possible that we wouldn't be able to continue training it wholesale; it'll need to continue the self-improvement in a "classical" way, or maybe train-by-SGD only parts of itself.

3. Closing Thoughts

Overall, this breaks down the problem of aligning AI into the following challenges:

  • (Background: Figuring out how to train up a model to AGI, and stop just after it reaches the AGI status.)
  • Locating the GPS in the AI's parameters.
  • Understanding the GPS' API well enough to task it with: evaluating a thought process on the validity prior, asking humans for feedback via some pre-specified channel, implementing human feedback.
  • (Building the artificial structure around the GPS that I'd described, which prompts the GPS with the aforementioned tasks. But I imagine the bulk of it can be written in mundane code, and solving the other challenges already implies we'd know how to make it interface with the AI's parameters. So it's a non-problem, I think.)
  • Making some relatively simple modifications to the AI's world-model — adding a prior on thought-process validity and some description of the corrigibility setup.
  • Translating a few of the AI's thought processes into a human-understandable format, OR perfectly phrasing "do an honest ontology translation between your thoughts and humanese" in terms of the AI's world-model.
  • Coming up with a good policy for AI-to-human translation, or recognizing a good policy when the AI proposes it.
  • Coming up with a good policy for updating on feedback, or recognizing a good policy when the AI proposes it. This breaks down into:
    • Figuring out what interventions are available to the AI to make itself more aligned (reminder: it's not smart enough to arbitrarily self-modify).
    • Figuring out a good measure of thought-process similarity, for the purposes of updating the validity prior.
  • Figuring out how to improve the AI's capabilities given the likely inability to keep training it by the SGD. (Although maybe it's a non-problem as well.)

It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.

That said, those are the challenges of this specific concrete proposal, not the high-level idea. The concrete implementation likely has much room for improvement as well. In particular, two significant issues are:

  • The initial eternal recursion.
  • The specification of whom the AI should be corrigible to; i. e. what's the "feedback channel".

I feel like there should be more elegant ways to do both of these — a way to break the infinite recursion that doesn't require manual translation, a neater way to specify the feedback channel. It feels like they both can be solved by some singular tweak? And this, in turn, would erase an entire swathe of the fiddly issues with the concrete implementation.

But I don't see how, at the moment.

Footnotes

  1. Like a literal .txt file it writes questions to and reads responses from.

  2. I. e., while it's still meta-meta-cognitively uncertain on how to ask for feedback at all, and so gets into a loop of "let's ask for feedback. how do I ask for feedback? let's ask. how do I ask? let's ask. how do I—"

  3. That is, easier than telling it to optimize for human values, or execute human commands in a way that's faithful to the spirit in which they were given, etc.

Comments

It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.

Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternative approaches:

  • You need to know that you're going to reach an AGI before it becomes superintelligent, or you'll waste your time training an AI that will be taken over by the competitors. Whereas many approaches don't require this.
  • You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities.
  • You need to figure out the right translation to bootstrap it, and there seem to be risks if you get it wrong.
  • You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong.

Can you help me understand why you think that these strong requirements nonetheless are simpler than most versions or approaches of the problem that you know about?

The crux is likely in a disagreement of which approaches we think are viable. In particular:

You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities

What are the approaches you have in mind that are both promising and don't require this? The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more skeptical of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's latest post yet, so I may be wrong on that).

The core issue, as I see it, is that we'll need to aim the AI at humans in some precise way — tell it to precisely translate for us, or care about us in some highly specific way, or interpret commands in the exact way humans intend them, or figure out how to point it directly at the human values, or something along those lines. Otherwise it doesn't handle capability jumps well, whether we crank it up to superintelligence straight away or try to carefully steer it along.

And the paradigm of loss functions and broad regularizers (e. g., speed/complexity penalties) seems to consist of tools too crude for this purpose. The way I see it, we'll need fine manipulation.

Since writing the original post, I've been trying to come up with convincing-to-me ways to side-step this problem (as I allude to at the post's end), but no luck so far.

You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong

Yeah, that's a difficulty unique to this approach.

Fun exercise, but I'm not a fan of the total Cartesian doubt phase - I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness.

I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness

Do you have anything specific in mind?

One thing might be that I'd rather have an AI design that's more naturally self-reflective, i.e. using its whole model to reason about itself, rather than having pieces that we've manually retargeted to think about some other pieces. This reduces how much Cartesian doubt is happening on the object level all at the same time, which sorta takes the AI farther away from the spec. But this maybe isn't that great an example, because maybe it's more about not endorsing the "retargeting the search" agenda.