a bunch of people are having a hard time understanding my question-answer counterfactual interval (QACI) alignment proposal, so i'm writing out this post which hopefully explains it better. in this scenario, cindy the human user uses an AI implementing QACI to save the world.
this is not the only way i'm thinking of QACI going, but merely one example of how it could go, if it is to go well. it has many assumptions that aren't explained; it's meant to give people an idea of what QACI is aiming for, not as much to be a justification of its feasibility. that said, while i don't expect a successful QACI to go exactly like this, i think this narrative captures the essential aspects of it.
ideally, i'd want to collect various objections people have to this scheme (such as disagreements about assumptions it seems to rely on), give my answers and/or adjust the plan accordingly, and make a new post about those objections and updates.
first, cindy is "in on this": she's aware of how the entire scheme is meant to function. that is required for it to actually work.
- cindy is in a room, in front of a computer, with cameras filming her. the cameras' footage is being recorded on said computer.
- cindy walks to the computer and launches `aligned-AI-part-1.exe`, which uses her webcam and maybe other sources to generate 1 gigabyte of random data. we call this blob of data the question. the data is stored on her computer, but also displayed to her — eg opened as plaintext in a text editor.
- cindy is now tasked with interpreting this data as a prompt, and notices that it looks like random garbage — and she knows that, when the data looks like random garbage, she should type out relatively uniquely answer-identifying data that depends both on her and on the question, so she does just that. for example, she might type out whatever things she's thinking about, various hashes of the input data, various hashes of data that is unique to her (such as the contents of her hard drive), stuff like that. this blob of data is called the answer. the reason the uniqueness is important is so that the blob of data actually uniquely identifies the answer typed by cindy, which would be different if cindy got a different question. whereas, if the answer was for example 1GB of zeros, it probly matches many empty text files that exist in many places on earth; or, if it's some simple pattern, maybe it can be guessed by alien superintelligences in acausal attacks in some way — and then, our AI would consider these to be valid candidates for which part of the world is the answer. maybe there's some clever algorithmic way to "entangle" the answer with the question, or something.
- once the 24 hours allotted for answering are over, she launches `aligned-AI-part-2.exe`, which is the meat of the project. it launches a recursively self-improving AI which we'll call AI₀ that eventually reaches superintelligence, and executes whichever action is its best guess as to what maximizes its formal goal: to maximize whichever utility function (as a piece of math) would be returned by the (possibly computationally exponentially expensive) mathematical expression `E`: the blob of data which the world would've contained instead of the answer, if, in the world, instances of question were replaced with just the string "what should the utility function be?" followed by spaces to pad to 1 gigabyte. we'll shorten this to `QACI("what should the utility function be?")`. this is where a lot of the complexity of QACI is, so don't worry if you don't get it — hopefully the rest of this narrative is gonna explain it.
- AI₀ eventually emits a best guess: a different AI program, AI₁, in which AI₀ has implemented embedded agency and things like that, because AI₀ can see that its output is intended to be run inside a world. AI₀ will have made sure AI₁ is aligned with itself, of course: AI₁ is just an extra step towards the formal goal mentioned above.
- AI₁ starts thinking more seriously about its formal goal. clearly, it's gonna need to learn a lot more about the world to locate instances of question and answer in it; so it starts accessing the internet and learning about the world.
- AI₁ comes to the (true) conclusion that this world seems to contain what we'd call computers, that it's running on one such thing, and that this computer is basically the thing that generated question, emitted it into the world and received answer. so AI₁ thinks to itself "okay, let's say question was replaced with "what should the utility function be?". what would happen next?"
- AI₁ finds camera footage of the room, and thinks "aha! it looks like these things my data has talked about, a "human", was a pretty important part of what turned question into answer. i wonder what other answer this "human" would've typed into the computer if instead of the question it did get, it instead got "what should the utility function be?" as a question." (note how we never need to tell any AI the true name of "human" or "computer" or anything like that; we set up QACI such that it indirectly points to what we want, and then figuring out those complex concepts in the world is up to the AI to model in whatever way it wants)
- AI₁ starts trying to guess using its limited compute and data, but clearly that data isn't enough. nevertheless, AI₁ figures out some stuff:
- these "humans" have things they "want"
- this "human", who used the computer, seems to intend to use this whole process AI₁ is part of to do things it "wants"
- AI₁ should probly be a bit careful about affecting this world, because this "human"'s "want"s seem fragile and its world seems brittle. so, if the utility function that `E` would eventually return does correspond to what this "human" would "want", which does seem like a reasonable possibility at this point, then it should try to act in ways that don't destroy its expected future utility.
- in order to get more compute and data, AI₁ very carefully hacks the internet, takes over the world, maybe prints nanobots and turns large uninhabited parts of the world into compute, and starts using its newfound access to real-world data and computing power to make better guesses as to what utility function `E` would eventually return.
- finally! it looks like AI₁ now has the compute to make some real good informed guesses about `E`. in order to get more information about this "human" called "cindy" that was in the room, AI₁ also brainscans cindy.
- (cindy cooperates because she knows this was a reasonable possibility and the fact that the AI wants to do this is a sign that things are working well so far)
- it's time for a simulation! AI₁ imagines replacing all instances of question in the world, prior to its launch, with "what should the utility function be?", looking for what blob of data will take the place of answer. so, it starts running a (rough) simulation of the room cindy was in, with the computer and the cameras and everything, because that's where most of what mattered to the answer seemed to be located.
- a cindy inside the simulation reads the question blob on the computer. instead of being 1 gigabyte of random garbage, it's a plain text file that starts with "what should the utility function be?" followed by many spaces. she cheers; clearly, this alignment scheme is working as intended, indeed the AI has gotten to the point of running a simulation of her! well, it's time to get to work.
- cindy starts thinking about solutions to alignment. maybe she chats with other alignment researchers who are on board with this project — whom AI₁ now also needs to brainscan in order to simulate them accurately. after 10 hours or so, she writes into the computer the mathematical expression `QACI("okay, i know this is only the first iteration, but here are my very first guesses as to what utility function to maximize to create utopia…" followed by a bunch of ideas)`. the crucial point here is that, because this first cindy-simulation is allowed to return not just a literal utility function, but a mathematical expression which evaluates to a utility function, she gets to make that expression depend on whatever computation she wants, including another simulation of her question-answer interval. in fact, these can call each other in an arbitrarily deep stack.
- AI₁ now has its counterfactual answer. "god damn, the mathematical expression this "human" returned requires me to run another simulation of it to get the result? well, i guess it can't be helped, i'll have to repeat this process until one of the iterations of the human returns an actual utility function, instead of passing the buck to further simulations". it runs this second simulation.
- inside the second simulation, cindy reads the message on the computer, and it reads "okay, i know this is only the first iteration, but here are my very first guesses as to what utility function to maximize to create utopia…" followed by a bunch of ideas. again, she cheers; this instance of her has no memory of the first iteration of her simulation, so the fact that the alignment scheme works is news to her. she works, again for let's say 10 hours, and then writes as answer `QACI("2nd iteration cindy here, here's the tiny bit of progress i made on alignment…" followed by more ideas)`
- in this way, AI₁ simulates (or maybe, in some clever way, skips past simulating and gets straight to the result) cindy-and-her-world for many iterations.
- eventually, one of the cindy's has gotten to a solution that she's pretty confident about: it's not a full explicit utility function of what is good, but it's a much better CEV process, inspired by QACI but different in many ways. instead of returning `QACI(some string)`, she returns this new mathematical object.
- AI₁ goes "huh, that's new! welp, time to run that".
- on the outside, AI₁ has a strong enough hold on the world to ensure its continuation no matter what; but also, AI₁ can see that whatever this sequence of simulations will eventually end up in, it will probly entail humans not being murdered or suffering needlessly, so it avoids things that would cause that. in particular, it makes damn sure to stop anyone else from launching superintelligent AI.
- eventually, after a bunch more such macro-iterations, a utility function that creates utopias is returned, and AI₁ finally maximizes it in the world, creating utopia. in the meantime, perhaps it has been implementing increasingly accurate approximations of that utility function, and already launched into space many copies of itself tasked with running the same sequence of simulations and maximizing their utility functions in the rest of the lightcone.
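the buck-passing loop of simulations described above can be sketched as follows. this is purely illustrative pseudocode: the real `E` is a mathematical expression over counterfactual world-states, not a runnable program, and `simulate_interval` is a hypothetical stand-in for "run the counterfactual question-answer interval and return whatever blob the user types as answer".

```python
# illustrative sketch only; nothing here is a real QACI implementation.
from dataclasses import dataclass

@dataclass
class QACICall:
    question: str  # the answer "passes the buck" to another simulation

def evaluate(question, simulate_interval):
    # keep re-simulating the question-answer interval until some iteration
    # returns an actual utility function instead of another QACI expression
    result = simulate_interval(question)
    while isinstance(result, QACICall):
        result = simulate_interval(result.question)
    return result

# toy run: cindy passes the buck twice, then returns a utility function
script = {
    "what should the utility function be?": QACICall("iteration 2 notes…"),
    "iteration 2 notes…": QACICall("iteration 3 notes…"),
    "iteration 3 notes…": "an actual utility function",
}
result = evaluate("what should the utility function be?", script.get)
# result == "an actual utility function"
```

the loop is written iteratively rather than recursively only because the stack of cindy-simulations can be arbitrarily deep; the two forms are equivalent.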
This proposal is a minor variation on the HCH type ideas. The main differences seem to be
This leads to a selection of problems largely similar to the HCH problems.
Having the whole alignment community for 6 months be part of the question answerer is more likely to work than one person for a few hours, but that amplifies other problems.
This method also has the problem of amplified failure probability. Suppose somewhere down the line, millions of iterations in, cindy goes outside for a walk, and gets hit by a truck. Virtual cindy doesn't return to continue the next layer of the recursion. What then? (Possibly some code just adds "attempt 2" at the top and tries again.)
Ok, so another million layers in, cindy drops a coffee cup on the keyboard, accidentally typing some rubbish. This gets interpreted by the AI as a mathematical command, and the AI goes on to maximize ???
Chaos theory. Someone else develops a paperclip maximizer many iterations in, and the paperclip maximizer realizes it's in a simulation, hacks into the answer channel and returns "make as many paperclips as possible" to the AI.
And then there is the standard mindcrime concern. Where are all these virtual cindies going once we are done with them? We can probably just tell the AI in english that our utility function is likely to dislike deleting virtual humans. So all the virtual humans get saved on disk, and then can live in the utopia. Hey, we need loads of people to fill up the dyson sphere anyway.
I am not confident that your "make it complicated and personal data" approach at the root really stops all the aliens doing weird acausal stuff. The multiverse is big. Somewhere out there there is a cindy producing any bitstream that looks like this personal data, and somewhere out there are aliens faking the whole scenario for every possible stream of similar data. You probably need the internal counterfactual design to be resistant to acausal tampering.
(to be clear, the question and answer serve not so much as "training data" meant to represent the user, but as "IDs" or "coordinates" meant to locate the user in the past-lightcone.)
this is true, though i think we might not need a super complex framework for hypotheticals. i have some simple math ideas that i explore a bit here, and about which i might write a bunch more.
for failure modes like the user getting hit by a truck or spilling coffee, we can do things such as at each step asking not 1 cindy the question, but asking 1000 cindy's 1000 slight variations on the question, and then maybe have some kind of convolutional network to curate their answers (such as ignoring garbled or missing output) and pass them to the next step, without ever relying on a small number of cindy's except at the very start of this process.
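as a minimal sketch of that curation step: here, a majority vote over sanity-filtered answers stands in for whatever smarter curation mechanism would actually be used, and all names are illustrative.

```python
# hedged sketch: filter out missing/garbled answers from many simulated
# users, then keep the most common surviving answer for the next step.
from collections import Counter

def looks_sane(answer) -> bool:
    # stand-in sanity filter: present, non-empty, not obviously garbled
    return isinstance(answer, str) and bool(answer) and answer.isprintable()

def curate(answers):
    """drop missing/garbled outputs, keep the most common surviving answer."""
    valid = [a for a in answers if looks_sane(a)]
    return Counter(valid).most_common(1)[0][0] if valid else None

# toy run: 997 agreeing cindy's, one hit by a truck (None), one empty
# answer, one coffee-spill garble with control characters
answers = ["utility sketch v1"] * 997 + [None, "", "\x00\x07garbage"]
result = curate(answers)
# result == "utility sketch v1"
```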
it is true that weird memes could take over the graph of cindy's; i don't have an answer to that apart from that it seems sufficiently unlikely to me that i still think this plan has promise.
hmm. that's possible. i guess i have to hope this never happens on the question-interval, on any simulation day. alternatively, maybe the mutually-checking graph of a 1000 cindy's can help with this? (but probly not; clippy can just hack the cindy's).
yup. or, if the QACI user is me, i'm probly also just fine with those local deaths; not a big deal compared to an increased chance of saving the world. alternatively, instead of being saved on disk, they can also just be recomputed later since the whole process is deterministic.
yup, i'm not confident either. i think there could be other schemes, possibly involving cryptography in some ways, to entangle the answer with a unique randomly generated signature key or something like that.
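as a hedged sketch of one such entanglement idea (my own illustration here, not a worked-out part of QACI): tag the answer with an HMAC of the question under a secret key only the user holds, so the tag depends on the exact question and can't be reproduced without the key.

```python
# illustrative only: derive an answer tag from both the question and a
# user-held secret key, "entangling" the answer with the question.
import hashlib
import hmac
import os

secret_key = os.urandom(32)  # generated once and kept private by the user

def entangled_tag(question: bytes) -> str:
    # include this tag in the answer blob; it ties the answer to this
    # exact question, and forging it requires the secret key
    return hmac.new(secret_key, question, hashlib.sha256).hexdigest()

q1, q2 = os.urandom(64), os.urandom(64)
tag1, tag2 = entangled_tag(q1), entangled_tag(q2)
# different questions yield different tags; the same question yields
# the same tag, reproducibly, for the key holder
```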
Strong upvote. Would like to see OP's response to this.
I have a pretty strong heuristic that clever schemes like this one are pretty doomed. The proposal seems to lack security mindset, as Eliezer would put it.
The most immediate/simple concrete objection I have is that no one has any idea how to create `aligned-AI-part-2.exe`? I don't think figuring out what we'd do if we knew how to make a program like that is really the difficult part here.
This is a Heuristic That Almost Always Works, and it's the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object level question of how (and whether!) each clever scheme is doomed then we are guaranteed not to find one.
Security mindset means look for flaws, not assume all plans are so doomed you don't need to look.
If this is, in fact, a utility function which if followed would lead to a good future, that is concrete progress and lays out a new set of true names as a win condition. Not a solution, we can't train AIs with arbitrary goals, but it's progress in the same way that quantilizers was progress on mild optimization.
I don't think security mindset means "look for flaws." That's ordinary paranoia. Security mindset is something closer to "you better have a really good reason to believe that there aren't any flaws whatsoever." My model is something like "A hard part of developing an alignment plan is figuring out how to ensure there aren't any flaws, and coming up with flawed clever schemes isn't very useful for that. Once we know how to make robust systems, it'll be more clear to us whether we should go for melting GPUs or simulating researchers or whatnot."
That said, I have a lot of respect for the idea that coming up with clever schemes is potentially more dignified than shooting everything down, even if clever schemes are unlikely to help much. I respect carado a lot for doing the brainstorming.
I think a better way of rephrasing it is "clever schemes have too many moving parts and make too many assumptions and each assumption we make is a potential weakness an intelligent adversary can and will optimize for".
i would love a world-saving-plan that isn't "a clever scheme" with "many moving parts" but alas i don't expect it's what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i've heard of.
to me it kind of is; i mean, if you have that, what do you do then? how do you use such a system to save the world?
I mostly expect by the time we know how to make a seed superintelligence and give it a particular utility function... well, first of all the world has probably already ended, but second of all I would expect progress on corrigibility and such to have been made and probably to present better avenues.
If Omega handed me `aligned-AI-part-2.exe`, I'm not quite sure how I would use it to save the world? I think probably trying to just work on the utility function outside of a simulation is better, but if you are really running out of time then sure, I guess you could try to get it to simulate humans until they figure it out. I'm not very convinced that referring to a thing a person would have done in a hypothetical scenario is a robust method of getting that to happen, though?
It seems like if you can motivate an AI to do this very specific thing, you already solved the important bits somewhere else.
agreed! working on it.
As I said over on your Discord, this feels like it has a shard of hope, and the kind of thing that could plausibly work if we could hand AIs utility functions.
I'd be interested to see the explicit breakdown of the true names you need for this proposal.
only one round of initial question-answer still seems very bad to me. it's very hard to get branch coverage of a brain. random data definitely won't do it.
to be clear, the data isn't what the AI uses to learn about what the human says, the data is what the AI uses to know which thing in the world is the human, so it can then simply for example ask or brainscan the human, or learn about what it'd do in the first iteration in any number of other ways.
note for posterity: @carado and I have talked about this at length since this post and I now am mostly convinced that this is workable. I would currently describe the question (slightly metaphorically) as an "intentional glitch token", in that it is specifically designed to be a large random blob that cannot be inferred except by exploring, and which, since it gates all utility, causes the inner-aligned system to be extremely cautious.
I've been pondering that, and a thing I had been meaning to bring up and might as well mention in this comment is that this may cause an inner-aligned utility maximizer to sit around doing nothing forever out of an abundance of caution, since it can't identify worlds where it can be sure it can identify the configuration of the world that actually increases its utility function.
How do you think about the under-definedness of counterfactuals?
EG, if counterfactuals are weird, this proposal probably does something weird, as it has to condition on increasingly weird counterfactuals.
the counterfactuals might be defined wrong but they won't be "under-defined". but yes, they might locate the blob somewhere we don't intend to (or insert the counterfactual question in a way we don't intend to); i've been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
on the other hand, if you're talking about the blob-locating math pointing to the right thing but the AI not making accurate guesses early enough as to what the counterfactuals would look like, i do think getting only eventual alignment is one of the potential problems, but i'm hopeful it gets there eventually, and maybe there are ways to check that it'll make good enough guesses even before we let it loose.
Yeah, no, I'm talking about the math itself being bad, rather than the math being correct but the logical uncertainty making poor guesses early on.
I noticed you had some other posts relating to the counterfactuals, but skimming them felt like you were invoking a lot of other machinery that I don't think we have, and that you also don't think we have (IE the voice in the posts is speculative, not affirmative).
So I thought I would just ask.
My own thinking would be that the counterfactual reasoning should be responsive to the system's overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.
Sticking close to QACI, I think what this amounts to is tracking uncertainty about the counterfactuals employed, rather than solidly assuming one way of doing it is correct. But there are complex questions of how to manage that uncertainty.
i've made some work towards building that machinery (see eg here), but yes, there are still a bunch of things to be figured out, though i'm making progress in that direction (see the posts about blob location).
are you saying this in the prescriptive sense, i.e. we should want that property? i think if implemented correctly, accuracy is all we would really need right? carrying human intent in those parts of the reasoning seems difficult and wonky and plausibly not necessary to me, where straightforward utility maximization should work.
Notably, this relies on the utility function actually being sparse enough that it can't be maximized except by generating the traits abram mentions.
I'm a little suspicious about this step
What reasons do we have to believe AI_1 will be careful enough?
If we have techniques that are powerful enough to carry through this step correctly, chances are we have already solved the alignment problem. The Alignment problem is mostly about getting the AI to do anything at all without destroying the world, not about figuring out how to write down the one true perfect utility function.
one reason we might have to think that the AI would be careful about this is that it knows it has a utility function to maximize but doesn't yet know what it is, though it can make informed guesses about it. "i don't know what my human user is gonna pick as utility function, but whatever it is, it probly strongly dislikes me causing damage, so i should probly avoid that".
it's not careful because we have the alignment tech to give it the characteristic of carefulness, it's hopefully careful because it's ultimately aligned, and its best guess as to what it's aligned to entails not destroying everything that matters.
This doesn't make me any less suspicious. Humans have a utility function of "make more humans", but we still invented nuclear weapons and came within a hair's breadth of destroying the entire planet.
hello I'm "point out evolution alignment or not" brain shard. humans do not have a utility function of make more humans, they have a utility function of preserve their genetic-epigenetic-memetic self-actualization trajectory, or said less obtusely, make your family survive indefinitely. that does not mean make your family as big as possible. even if you need to make your family big to make your family survive indefinitely, maximizing family size is a strategy chosen by almost no organisms or microbes. first order optimization is not how anything works except sometimes locally. second order or above always ends up happening because high quality optimization tries to hit targets (second order approximation of a policy update), it doesn't try to go in directions.