Yeah, I agree that your formalism achieves what we want. The challenge is getting an actual AI that is appropriately myopic with respect to U1-U4. And among the things an AI could obtain certificates of determinacy about, its own one-step actions seem the most likely candidates.
Great work, glad someone is doing this. The natural follow-up to "we are in trouble if D is large" is "in what contexts can we practically make D small?" and I think it goes beyond corrigibility. A good framework to address this might be @johnswentworth's natural latents work.
I don't think we have a good theory of how to make lexicographic preferences work. If the agent is expectation-maximizing lexicographically, it will in practice consider only the first priority unless its world-model has certainty, over the relevant time-horizon, about the relevant inputs to that first priority. This requires myopia and a deterministic model up to the myopic horizon. Which seems hard, but at least we have desiderata! (And so on for each of the non-final priorities.) Given bounded computation, the world-model can be deterministic over relevant latents only up to a short horizon. So either we have to have a short horizon, or we have to make most latents irrelevant. The latter path doesn't seem crazy to me---we're essentially specifying a "deontological agent." That seems feasible for some of the priorities, but I'm worried about the others. To be clear, I don't think coherence theorems forbid deontology in practice, even though my discussion above rests on lexicographic preferences being unworkable without myopia and determinism over latents.
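To make that first-priority collapse concrete, here's a minimal sketch (my own illustrative code; all names are made up) of lexicographic choice over expected values:

```python
# Illustrative sketch only: lexicographic choice over candidate actions,
# where each priority maps an action to its expected value under the
# agent's world-model. All names here are made up.

def lexicographic_argmax(actions, priorities, tol=0.0):
    """Keep only actions tied (within tol) on each priority in turn;
    later priorities are consulted only among those ties."""
    candidates = list(actions)
    for expected_value in priorities:
        scores = {a: expected_value(a) for a in candidates}
        best = max(scores.values())
        candidates = [a for a in candidates if best - scores[a] <= tol]
        if len(candidates) == 1:
            break
    return candidates[0]

# With a probabilistic world-model, exact ties on the first priority almost
# never happen, so the later priorities never get consulted in practice --
# unless the model is deterministic over the relevant latents, out to the
# relevant horizon, so that genuine ties can arise.
```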
I guess these two points are the same: "consequentialist agents are intractable over a large effective state space; we need myopia and deontology to effectively slim that space."
No object-level comments besides: this seems like a worthy angle of attack; countries besides the US and China can matter and should be mobilized. Really glad someone's pushing on this.
All the more reason to sell that service to folks who don't know!
I see the makings of a specialized service business here! This is a very particular, unenjoyable schlep to set up for most ordinary folks (e.g. me), but you paint a desirable picture.
It's not hard to imagine seeing this tweet.
If one doesn't plan to go into politics, is there any value in being a bipartisan single-issue donor? How much must one donate for it to be accompanied by a message of "I will vote for whoever is better on AI x-risk"?
I like your made-up notation. I'll try to answer, but I'm an amateur in both reasoning-about-this-stuff and representing-others'-reasoning-about-this-stuff.
I think (1) is both inner and outer misalignment. (2) is fragility of value, yes.
I think the "generalization step is hard" point is roughly "you can get low by trial and error. The technique you found at the end that gets low---it better not intrinsically depend on the trial and error process, because you don't get to do trial and error on '. Moreover, it better actually work on M'."
Contemporary alignment techniques depend on trial and error (post-training, testing, patching). That's one of their many problems.
My suggested term for standard MIRI thought would just be Mirism.
I kinda don't like "generalization" as a name for this step. Maybe "extension"? There are too many steps where the central difficulty feels analogous to the general phenomenon of failure-of-generalization-OOD: the difficulty of getting the quantity small in the first place, the difficulty of going from techniques that get it small on M to techniques that get it small on M' (verbiage different because of the first-time constraint), the disastrousness of even a smallish value on M'...
This is an excellent encapsulation of (I think) something different---the "fragility of value" issue: "formerly adequate levels of alignment can become inadequate when applied to a takeover-capable agent." I think the "generalization gap" issue is "those perfectly-generalizing alignment techniques must generalize perfectly on the first try".
Attempting to deconfuse myself about how that works if it's "continuous" (someone has probably written the thing that would deconfuse me, but as an exercise): if AI power progress is "continuous" (which training is, but the model sequence isn't), it goes from "you definitely don't have to get it right at all to survive" to "you definitely get only one try to get it sufficiently right, if you want to survive," but by what path? In which of the terms "definitely," "one," and "sufficiently" is it moving continuously, if any?
I certainly don't think it's via the number of tries you get to survive! I struggle to imagine an AI where we all die if we fail to align it three times in a row.
I don't put any stock in "sufficiently," either---I don't believe in a takeover-capable AI that's aligned enough to not work toward takeover, but which would work toward takeover if it were even more capable. (And even if one existed, it would have to eschew RSI and other instrumentally convergent things, else it would just count as a takeover-causing AI.)
It might be via the confidence of the statement. Now, I don't expect AIs to launch highly-contingent outright takeover attempts; if they're smart enough to have a reasonable chance of succeeding, I think they'll be self-aware enough to bide their time, suppress the development of rival AIs, and do instrumentally convergent stuff while seeming friendly. But there is some level of self-knowledge at which an AI will start down the path toward takeover (e.g., extricating itself, sabotaging rivals) and succeed with a probability that's very much neither 0 nor 1. Is this first, weakish, self-aware AI able to extricate itself? It depends! But I still expect the relevant band of AI capabilities here to be pretty narrow, and we get no guarantee it will exist at all. And we might skip over it with a fancy new model (if it was sufficiently immobilized during training or guarded its goals well).
Of course, there's still continuity in expectation: when training each more powerful model, it has some probability of being The Big One. But yeah, I more or less predict a Big One; I believe in an essential discontinuity arising here from a continuous process. The best analogy I can think of is how every exponential with r<1 dies out and every exponential with r>1 goes off to infinity. Once you allow dynamical systems, you naturally get cuspy behavior.
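A toy illustration of that threshold point (the numbers are made up, purely for illustration):

```python
# Toy illustration (made-up numbers): x_{t+1} = r * x_t dies out for r < 1
# and blows up for r > 1, even though at any fixed horizon the outcome
# depends continuously on r.

def final_value(r, x0=1.0, steps=500):
    x = x0
    for _ in range(steps):
        x *= r
    return x

for r in (0.99, 1.01):
    print(r, final_value(r))  # ~0.0066 for r=0.99 vs ~144.8 for r=1.01
```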
P is a second-order predicate; it applies to predicates. The English word "perfect" applies to things, and it's a little weirder to apply it to qualities, at least if you think of \phi and \psi as being things like "Omnibenevolent" or "is omnibenevolent." If you think of \phi and \psi as being "Omnibenevolence," it makes more sense---where we type-distinguish between "qualities as things" and "things per se." It's still weird not to be able to apply P to things-per-se: we want to be able to say "P(fido)" = "fido is perfect," but that's not allowed. We can say "P(is_good_dog)" = "being a good dog is perfect."
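A quick type-level sketch of the same point, in Lean-style notation with made-up names:

```lean
-- Illustrative only; all names are made up.
axiom Dog : Type
axiom fido : Dog
axiom is_good_dog : Dog → Prop
axiom P : (Dog → Prop) → Prop   -- "perfection" as a second-order predicate

#check P is_good_dog   -- fine: "being a good dog is perfect" : Prop
-- #check P fido        -- rejected: fido : Dog, but P expects a Dog → Prop
```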