I'm still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they're in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
I didn't understand what you wrote about verification well enough to have anything to say.
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it's in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?
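To make the quantitative version of this question concrete, here's a toy model (my framing, not anything you've committed to): let f be the fraction of cognition spent being competent and corrigible, and suppose training reward is roughly R(f) = f * r_task + (1 - f) * E[r_treachery | training distribution]. If treachery-detection almost always fails on the training distribution, the second term is approximately zero, so dR/df ≈ r_task > 0 and gradient descent locally pushes f toward 1. But that argument only goes through if the hidden 1% actually costs reward, rather than being reward-neutral, which is exactly the case I'm uncertain about.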
The inner process may nevertheless use TDT if TDT doesn't diverge from CDT on the training distribution, or it might learn to use TDT but "look nice" so that it doesn't get selected against.
This was what I was intending to convey in assumption 3.
I'm currently intuiting that there's a broad basin of "seeming corrigible until you can perform a treacherous turn", but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent's justifications for its behavior, and something like its verbal loop, I don't see how things can work out.
Here are my assumptions underlying this intuition:
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseers. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer's comment.)
2. If the agents reason using TDT, these act-based agents will effectively optimize for the long-term instrumental preferences of their users, because, in expectation, adopting that policy raises whatever short-term instrumental reward they'll get.
3. Expected utility maximizers over discounted future reward signals are hopelessly incorrigible.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to determining whether and how to execute a treacherous turn.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent's cognition is necessary to spot this sort of malign reasoning.
6. We can't achieve this level of understanding via anything like current ML transparency techniques.
Which of these assumptions do you disagree with most? Or do you agree with these assumptions, but feel optimistic that we can get good worst-case guarantees anyway?
2. How easy is it to learn to be corrigible? I'd think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?
My current intuition: for the same reasons it seems really hard to train models that believe 51 is prime while also performing well on object-level tasks, learning to be corrigible seems very difficult.
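One way to formalize this (my formalization, not necessarily yours): writing |theta| for model size, ask how large the overhead Delta = min{ |theta| : theta solves the task and behaves corrigibly on all inputs } - min{ |theta| : theta solves the task } is. My intuition above amounts to guessing that Delta is large, for the same reason that the "51 is prime" constraint seems expensive to satisfy alongside good object-level performance.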
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don't expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes "sufficiently complex". I'm not sure if humans perform sufficiently complex self-modifications, I think our first AGIs might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.
is simply not able to foresee the impacts of its changes and so makes them 'recklessly' (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about "OK, I judge this self-modification to be fine" when the self-modifications are sufficiently complex, if this judgment isn't based off something like zero-shot reasoning (in which case we'd have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
If we view the US government as a single entity, it's not clear that it would make sense to describe it as aligned with itself, under your notion of alignment. If we consider an extremely akratic human, it's not clear that it would make sense to describe him as aligned with himself. The more agenty a human is, the more it seems to make sense to describe him as being aligned with himself.
If an AI assistant has a perfect model of what its operator approves of and only acts according to that model, it seems like it should qualify as aligned. But if the operator is very akratic, should this AI still qualify as being aligned with the operator?
It seems to me that clear conceptual understandings of alignment, corrigibility, and benignity depend critically on a clear conceptual understanding of agency, which suggests a few things:
I think a lot of folks who are confused about your claims about corrigibility share my intuitions around the nature of corrigibility / the difficulty of learning corrigibility from labeled data, and I think it would shed a lot of light if you shared more of your own views on this.
I should clarify a few more background beliefs:
Let me now clarify what I mean by "foomed AGI":
In this comment thread, I was referring primarily to foomed AGIs, not the first AGIs we build. I imagine you either having a different picture of takeoff, or thinking something like "Just don't build a foomed AGI. Just like it's way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it's way too hard to build a safe foomed AGI, so let's just not do it". And my position is something like "It's probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let's do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right."
I'm happy to delve into your individual points, but before I do so, I'd like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.
Corrigibility. Without corrigibility I would be just as scared of Goodhart.
This seems like it's using a bazooka to kill a fly. I'm not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?
I agree that zero-shot reasoning doesn't save us from daemons by itself, and I think there's important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.
Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?
The daemons I'm focusing on here mostly arise from embedded agency, which Solomonoff induction doesn't capture at all. (It's worth noting that I consider there to be a substantial difference between Solomonoff induction daemons and "internal politics"/"embedded agency" daemons.) I'm interested in hashing this out further, but probably at some future point, since this doesn't seem central to our disagreement.
But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is "the AGI was incompetent at some point, made a mistake, and bad things happened". I don't know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to "want" to do that.
However, I expect that the way we actually get high confidence answers to those questions, is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course correcting in order to stay on the path.
If you're trying to [build the spacecraft] without putting some general intelligence into it, this sounds way harder to me, because you can't build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)
I'm surprised by how much we seem to agree about everything you've written here. :P Let me start by clarifying my position a bit:
In terms of actual disagreement, I suspect I'm much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it's doing something like 1,000,000 years' worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What's your take on this?
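For a rough sense of the numbers behind this pessimism (a back-of-the-envelope of mine, not a claim you've made): if each execute-and-reevaluate step independently fails catastrophically with probability p, then P(no catastrophic failure over N steps) = (1 - p)^N ≈ exp(-p*N), so surviving N ≈ 10^9 reevaluations with high confidence requires p << 10^-9. That's the level of per-step robustness I don't expect without something like zero-shot reasoning about the control mechanism.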
This proposal judges explanations by plausibility and articulateness. Truthfulness is only incidentally relevant and will be Goodharted away.
Keep in mind that the overseer (two steps forward) is always far more powerful than the agent we're distilling (one step back), is trained to not Goodhart, is training the new agent to not Goodhart (this is largely my interpretation of what corrigibility gets you), and is explicitly searching for ways in which the new agent may want to Goodhart.
I see. Given this, I think "zero-shot learning" makes sense but "zero-shot reasoning" still doesn't. In the former, "zero" refers to "zero demonstrations": you're learning something without a learning process targeted at that specific thing. In the latter, "zero" isn't referring to anything; you're trying to get the reasoning correct in one attempt, so "one-shot" would be a more sensible description.
I was imagining something like "zero failed attempts", where each failed attempt approximately corresponds to a demonstration.
Are you saying that in the slow-takeoff world, we will be able to coordinate to stop AI progress after reaching AGI and then solve the full alignment problem at leisure? If so, what's your conditional probability P(successful coordination to stop AI progress | slow takeoff)?
More like: P(safe intelligence explosion | slow takeoff, international coordination after our first AGI) is a lot higher, like 80%. I don't think slow takeoff does very much to help international coordination.
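To make sure we're quantifying the same thing, here's the decomposition I have in mind (my notation): P(safe | slow) = P(coord | slow) * P(safe | coord, slow) + P(no coord | slow) * P(safe | no coord, slow). The 80% above is the factor P(safe | coord, slow); your question was about the separate factor P(coord | slow), which, as I said, I don't think slow takeoff does much to raise.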
1. If at the time of implementing ALBA, our conceptual understanding of corrigibility is the same as it is today, how doomed would you feel?
2. How are you imagining imposing an extra constraint that our model behave corrigibly on all inputs?
3. My current best guess is that your model of how to achieve corrigibility is to train the AI on a bunch of carefully labeled examples of corrigible behavior. To what extent is this accurate?