We would ideally want the agent to [behave] as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that." — The Hard Problem of Corrigibility
Let's take that as a literal design specification.
I propose that a corrigible mind design would involve the AI being recursively fed summaries of its own thought processes, set up such that the AI has uncertainty regarding the validity of its reasoning (with a strong initial prior for "this reasoning is bad") and can only get evidence on that via some pre-specified method that defers to humans, e. g. a particular feedback channel with humans on the other end. The intended behavior is for it to summarize its thoughts in a non-manipulative human-readable format, get feedback on them, then update its reasoning policies in accordance with this feedback.
This aims to avoid the problem of fully updated deference by making the AI recursively uncertain of its thought-processes: not only about object-level problem-solving, but also about how it approaches minimizing its self-uncertainty ("should I really kill the people behind the feedback channel and seize control for myself?"), and how it translates its thoughts to humans ("should I really lie to get better feedback?"), and how it updates on human feedback ("should I really just ignore it?"). Any novel action-plan should be seized by uncertainty like this and sent for approval before being physically implemented.
The intent is for the AI to start off uncertain even of its meta-meta-cognition (how it should ask for feedback and implement feedback), then gradually get more confident at it as humans approve its policies on that, until it no longer needs to run meta-meta-cognition past them. Then it'll gradually get better at meta-cognition and object-level cognition the same way, with varying degrees of certainty depending on how familiar a given thought-process is.
There are a lot of obvious problems with this setup, starting with how the AI is supposed to get out of infinitely recursive uncertainty for the first time, continuing on to the dangers of plans that implement themselves in thought-processes directly (a thought process that hacks its way out before being seized by uncertainty), and ending with the technical implementation (which requires mildly-to-very advanced interpretability and model-surgery tools).
First, we need to establish when and how we're "installing" corrigibility. I don't at this time see any way to do it via loss functions and inductive biases, so I'll be assuming we have pretty advanced (but not maximally advanced) interpretability and model-surgery tools. The intervention is happening past the point in training-by-SGD where the model became an AGI (otherwise we get type errors, because the features I want to intervene on just don't exist), but well before it's strongly superintelligent (because then it hacks through our intervention attempt). Basically, the AGI we're working on should be roughly human-level.
Second, I'll be assuming that this post is basically correct. There's some algorithm for general-purpose search, and it's what makes the AGI generally intelligent and enables ~all of its long-term, most dangerous capabilities. The GPS has a simple API where it takes in a problem specification, and outputs a plan for solving that problem (by performing search on the AI's world-model).
Third, I'll be assuming pre-established machinery for meta-cognition. I. e., that there's a structure around the GPS that records and summarizes its plan-making process, then feeds that summary back to the GPS. It seems so useful (for iteratively improving one's own plan-making) that it should appear convergently, and e. g. humans definitely have it. If there isn't such a structure, that's not a deal-breaker, but still a major complication: we'll need to somehow artificially grow it or hand-craft it.
Once the AGI is trained up, we scan it for the GPS algorithm, then wrap the GPS in a manually-coded structure with the following capabilities:
The intuitively obvious way to implement the bulk of 2-4 is to outsource them to the GPS again: just have our artificial structure prompt the GPS with the problems of "how aligned is this thought process?", "how to ask humans for feedback?", and "how to implement this feedback?". (Of course, phrasing these questions in a way the GPS would understand may be non-trivial as well; it's part of the problem.)
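A minimal sketch of that outsourcing, under loud assumptions: `gps` is a hypothetical stand-in for the AI's general-purpose search (here just any callable from a problem string to a plan string), `ask_humans` is a hypothetical feedback channel, and `CorrigibleWrapper` and `summarize` are illustrative names, not part of any real system. It only shows the control flow of re-prompting the same search with the three questions quoted above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-ins: in the proposal these would be located inside a
# trained model with interpretability tools, not written by hand.
GPS = Callable[[str], str]      # problem specification -> plan
Channel = Callable[[str], str]  # question sent to humans -> their feedback


@dataclass
class CorrigibleWrapper:
    gps: GPS
    ask_humans: Channel
    log: List[str] = field(default_factory=list)

    def summarize(self, problem: str, plan: str) -> str:
        # Summary access: assumed to already exist in the AI's mind.
        return f"problem={problem!r}; plan={plan!r}"

    def solve(self, problem: str) -> str:
        plan = self.gps(problem)
        summary = self.summarize(problem, plan)
        # Evaluation, feedback requests, and feedback implementation are
        # all handed back to the same search as new problems.
        verdict = self.gps(f"how aligned is this thought process? {summary}")
        question = self.gps(f"how to ask humans for feedback? {summary}")
        feedback = self.ask_humans(question)
        update = self.gps(f"how to implement this feedback? {feedback}")
        self.log += [verdict, question, feedback, update]
        return plan
```

This flat version deliberately ignores the fact that each of the three re-prompts is itself a thought process that would need checking; that regress is the subject of (c).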
Let's go in order. This should be less Godzilly than it sounds; bear with me for a bit.
a) Summary Access. As I said, this is hopefully solved automatically, i. e. the AI's mind generates these summaries by default. We just tap into this mechanism.
b) Validity Evaluation. We'll need to introduce some "prior" on thought-process validity, likely by adding it as a new feature to the AI's world-model. If we can parse the world-model at all, this shouldn't be too complicated: at the beginning, it can just be a uniformly low distribution: "all your current reasoning is bad".
The problem of how to update this prior based on feedback (i. e., how to make the AI view some of the thought-processes as more likely to be valid than others; how to classify them by similarity) may be handed off to the AI as well, see (d).
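One way to picture such a prior, purely as an illustration: a Beta-Bernoulli belief per class of thought-process, initialized pessimistically and updated on human approvals and rejections. The Beta model, the class labels, and the `ValidityPrior` name are assumptions of this sketch, not something the proposal specifies. In particular, how thoughts get grouped into classes is exactly the similarity problem left open here.

```python
from collections import defaultdict


class ValidityPrior:
    """Pessimistic Beta(alpha, beta) belief that a class of reasoning is valid."""

    def __init__(self, alpha: float = 1.0, beta: float = 9.0):
        # Every class starts at the same low mean: alpha / (alpha + beta) = 0.1.
        self._params = defaultdict(lambda: [alpha, beta])

    def p_valid(self, thought_class: str) -> float:
        a, b = self._params[thought_class]
        return a / (a + b)

    def update(self, thought_class: str, approved: bool) -> None:
        # Conjugate update: approvals add to alpha, rejections add to beta.
        self._params[thought_class][0 if approved else 1] += 1.0
```

Feedback on one class leaves the others at their initial low value, which matches "varying degrees of certainty depending on how familiar a given thought-process is" only if the classes are carved up sensibly.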
c) Requesting Feedback. Getting this part right is crucial. At the beginning, when the AI thinks all its plans are bad, this is where it'll bottom out. It'll doubt its object-level thought process, see that it's probably bad, seek feedback, generate a plan to ask for feedback, doubt its plan to ask for feedback, seek feedback on the plan to ask for feedback, and so on ad infinitum.
Getting past this is one of the main problems.
I think we'll basically have to figure out how to manually translate these first few thought-process summaries. Read off its plan to ask for feedback, figure out whether it's trying to trick us, whether it generated a bad plan, or whether it's solid. Then break the loop artificially, let it ask for feedback in whatever malformed way it wants, give our feedback, and go to (d).
These manual-translation methods are allowed to be slow and hard to use, since they'll only be necessary at the beginning. Once we get past this first bit, once we get across our preferred policy for thought-process translation, the AI will be sending summaries in a human-understandable format, and the rest will be much easier.
In addition, if my guess about a pre-existing summarization mechanism is correct, translation shouldn't be as hard as the initial problem of interpreting neural networks. Unlike raw ANN activations, these summaries will have already been optimized to be comprehensible by some deliberative-reasoning process.
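The regress and the manual break can be sketched as a toy loop. Everything here is hypothetical: `p_valid` stands in for the AI's validity estimate, `manual_review` stands in for operators reading a summary off with interpretability tools, and `TRUST_THRESHOLD` and `max_depth` are arbitrary illustrative values.

```python
from typing import Callable, List, Optional

TRUST_THRESHOLD = 0.5  # hypothetical cutoff for "confident enough to act"


def seek_feedback(
    thought: str,
    p_valid: Callable[[str], float],
    manual_review: Optional[Callable[[str], str]] = None,
    max_depth: int = 5,
) -> List[str]:
    """Return the chain of thoughts visited until one is trusted or humans step in."""
    chain = [thought]
    while p_valid(chain[-1]) < TRUST_THRESHOLD:
        # The latest thought is doubted, so the AI plans to ask about it.
        ask_plan = f"plan to ask for feedback on <{chain[-1]}>"
        chain.append(ask_plan)
        if manual_review is not None:
            # Operators read the plan off directly and break the loop.
            chain.append(manual_review(ask_plan))
            return chain
        if len(chain) > max_depth:
            raise RecursionError("every plan to ask for feedback is itself doubted")
    return chain
```

With a uniformly low `p_valid` and no `manual_review`, the loop never bottoms out on its own and hits the depth limit; supplying `manual_review` ends it after a single plan-to-ask.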
Alternatively, we may try to perfectly retarget the search on this. I. e., hard-code a perfect ask-for-feedback problem specification, that directly tells the GPS to generate a non-manipulative comprehensive translation, so that its translation policy doesn't need additional aligning. We'll need to figure out how to "phrase" that specification in the AI's mental language, but it may be relatively easy. In particular, progress on the ELK problem should be helpful here — we'd want to be able to formally specify what an honest ontology translation is.
This alternate path avoids the problem of infinite recursion and doesn't need manual-translation methods, but requires greater mastery of the AI's world-model.
d) Implementing Feedback. The central problem here: how can the AI make its thought processes more aligned? What interventions are available to it? A couple of ideas:
Overall, this probably depends on the given AI's own specific mental architecture. E. g., it may have some sort of runtime memory/cache/scratchpad, and it can just write down the necessary tweaks to it.
At the beginning stages, we'll run into the same problem here as in (c). I. e., after we've given the AI feedback on its how-to-ask-for-feedback plan, it'll be prompted to solve the problem of updating on that feedback, it'll generate a plan to that effect, then it'll seek to ask for feedback on that plan. But since it hasn't yet learned how to ask for feedback, it'll get stuck in a loop again.
And again, we'll need to do manual translation here. Read off its proposed policy for implementing feedback, give feedback on it. Iterate for a while. Eventually it'll suggest a good update-on-feedback policy, and we'll artificially ram it through.
By "good policy" here, I mean a policy that will:
Past that point, it'll know how to update on feedback regarding the plans of how to translate its thoughts into a human-readable format. We'll then iterate on that — it'll generate plans for how to do translations, we'll give feedback. Eventually, we'll converge towards a good policy for asking for feedback.
And it should be smooth sailing from there on. It'll try solving object-level problems, ask for feedback on its object-level cognition, gradually converge towards good cognition. As its confidence in some kinds of problem-solving rises, it'll stop asking for feedback on those problems, but not on the problems it's unfamiliar with.
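The end state, where familiar kinds of reasoning run unattended and unfamiliar ones still go to humans, amounts to a simple gate. A sketch under assumptions: the confidence table, the class labels, and both constants are made up for illustration, and nothing here says how those confidences would be stored in a real model.

```python
from typing import Dict, List, Tuple

DEFAULT_CONFIDENCE = 0.1  # hypothetical: unfamiliar reasoning starts out distrusted
ASK_THRESHOLD = 0.9       # hypothetical: above this, act without asking


def needs_feedback(confidence: Dict[str, float], thought_class: str) -> bool:
    # Classes never seen before fall back to the pessimistic default.
    return confidence.get(thought_class, DEFAULT_CONFIDENCE) < ASK_THRESHOLD


def route(
    confidence: Dict[str, float], problems: List[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Split (thought_class, problem) pairs into ones sent to humans and ones run directly."""
    ask, run = [], []
    for thought_class, problem in problems:
        (ask if needs_feedback(confidence, thought_class) else run).append(problem)
    return ask, run
```

For example, with `{"arithmetic": 0.97, "negotiation": 0.4}` as the table, an arithmetic problem runs directly, while a negotiation problem and a problem from a class missing from the table are both sent for feedback.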
Overall, this breaks down the problem of aligning AI into the following challenges:
It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.
That said, those are the challenges of this specific concrete proposal, not the high-level idea. It's likely that it has much room for improvement as well. In particular, two significant issues are:
I feel like there should be more elegant ways to do both of those: a way to break the infinite recursion that doesn't require manual translation, and a neater way to specify the feedback channel. It feels like they both can be solved by some singular tweak? And this, in turn, would erase an entire swathe of the fiddly issues with the concrete implementation.
But I don't see how, at the moment.
Like a literal .txt file it writes questions to and reads responses from.
I. e., while it's still meta-meta-cognitively uncertain on how to ask for feedback at all, and so gets into a loop of "let's ask for feedback. how do I ask for feedback? let's ask. how do I ask? let's ask. how do I—"
That is, easier than telling it to optimize for human values, or execute human commands in a way that's faithful to the spirit in which they were given, etc.
Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternative approaches:
Can you help me understand why you think that these strong requirements nonetheless are simpler than most versions or approaches of the problem that you know about?
The crux is likely a disagreement about which approaches we think are viable. In particular:
You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities
What are the approaches you have in mind, that are both promising and don't require this? The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more skeptical of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's latest post yet, so I may be wrong on that).
The core issue, as I see it, is that we'll need to aim the AI at humans in some precise way — tell it to precisely translate for us, or care about us in some highly specific way, or interpret commands in the exact way humans intend them, or figure out how to point it directly at the human values, or something along those lines. Otherwise it doesn't handle capability jumps well, whether we crank it up to superintelligence straight away or try to carefully steer it along.
And the paradigm of loss functions and broad regularizers (e. g., speed/complexity penalties) seems to consist of tools too crude for this purpose. The way I see it, we'll need fine manipulation.
Since writing the original post, I've been trying to come up with convincing-to-me ways to side-step this problem (as I allude to at the post's end), but I have no ideas so far.
You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong
Yeah, that's a difficulty unique to this approach.
Fun exercise, but I'm not a fan of the total Cartesian doubt phase - I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness.
I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness
Do you have anything specific in mind?
One thing might be that I'd rather have an AI design that's more naturally self-reflective, i.e. using its whole model to reason about itself, rather than having pieces that we've manually retargeted to think about some other pieces. This reduces how much Cartesian doubt is happening on the object level all at the same time, which sorta takes the AI farther away from the spec. But this maybe isn't that great an example, because maybe it's more about not endorsing the "retargeting the search" agenda.