a bunch of people are having a hard time understanding my question-answer counterfactual interval (QACI) alignment proposal, so i'm writing out this post which hopefully explains it better. in this scenario, cindy the human user uses an AI implementing QACI to save the world.
this is not the only way i'm thinking of QACI going, but merely one example of how it could go, if it is to go well. it has many assumptions that aren't explained; it's meant to give people an idea of what QACI is aiming for, not as much to be a justification of its feasibility. that said, while i don't expect a successful QACI to go exactly like this, i think this narrative captures the essential aspects of it.
ideally, i'd want to collect various objections people have to this scheme (such as disagreements about assumptions it seems to rely on), give my answers and/or adjust the plan accordingly, and make a new post about those objections and updates.
first, cindy is "in on this": she's aware of how the entire scheme is meant to function. that is required for it to actually work.
- cindy is in a room, in front of a computer, with cameras filming her. the cameras' footage is being recorded on said computer.
- cindy walks to the computer and launches `aligned-AI-part-1.exe`, which uses her webcam and maybe other sources to generate 1 gigabyte of random data. we call this blob of data the question. the data is stored on her computer, but also displayed to her — eg opened as plaintext in a text editor.
- cindy is now tasked with interpreting this data as a prompt, and notices that it looks like random garbage — and she knows that, when the data looks like random garbage, she should type out relatively uniquely answer-identifying data that depends both on her and on the question, so she does just that. for example, she might type out whatever things she's thinking about, various hashes of the input data, various hashes of data that is unique to her (such as the contents of her hard drive), stuff like that. this blob of data is called the answer. the reason the uniqueness is important is so that the blob of data actually uniquely identifies the answer typed by cindy, which would be different if cindy got a different question. whereas, if the answer was for example 1GB of zeros, it probly matches many empty text files that exist in many places on earth; or, if it's some simple pattern, maybe it can be guessed by alien superintelligences in acausal attacks in some way — and then, our AI would consider these to be valid candidates for which part of the world is the answer. maybe there's some clever algorithmic way to "entangle" the answer with the question, or something.
- once the 24 hours allotted for answering are over, she launches `aligned-AI-part-2.exe`, which is the meat of the project. it launches a recursively self-improving AI which we'll call AI₀ that eventually reaches superintelligence, and executes whichever action is its best guess as to what maximizes its formal goal: to maximize whichever utility function (as a piece of math) would be returned by the (possibly computationally exponentially expensive) mathematical expression `E`: the blob of data which the world would've contained instead of the answer, if, in the world, instances of question were replaced with just the string "what should the utility function be?" followed by spaces to pad to 1 gigabyte. we'll shorten this to `QACI("what should the utility function be?")`. this is where a lot of the complexity of QACI is, so don't worry if you don't get it — hopefully the rest of this narrative is gonna explain it.
- AI₀ eventually emits a best guess: a different AI program, AI₁, in which AI₀ has implemented embedded agency and things like that, because AI₀ can see that its output is intended to be run inside a world. AI₀ will have made sure AI₁ is aligned with itself, of course: AI₁ is just an extra step towards the formal goal mentioned above.
- AI₁ starts thinking more seriously about its formal goal. clearly, it's gonna need to learn a lot more about the world to locate instances of question and answer in it; so it starts accessing the internet and learning about the world.
- AI₁ comes to the (true) conclusion that this world seems to contain what we'd call computers, that it's running on one such thing, and that this computer is basically the thing that generated question, emitted it into the world and received answer. so AI₁ thinks to itself "okay, let's say question was replaced with "what should the utility function be?". what would happen next?"
- AI₁ finds camera footage of the room, and thinks "aha! it looks like these things my data has talked about, a "human", was a pretty important part of what turned question into answer. i wonder what other answer this "human" would've typed into the computer if instead of the question it did get, it instead got "what should the utility function be?" as a question." (note how we never need to tell any AI the true name of "human" or "computer" or anything like that; we set up QACI such that it indirectly points to what we want, and then figuring out those complex concepts in the world is up to the AI to model in whatever way it wants)
- AI₁ starts trying to guess using its limited compute and data, but clearly that data isn't enough. nevertheless, AI₁ figures out some stuff:
- these "humans" have things they "want"
- this "human", who used the computer, seems to intend to use this whole process AI₁ is part of to do things it "wants"
- AI₁ should probly be a bit careful about affecting this world, because this "human"'s "want"s seem fragile and its world seems brittle. so, if the utility function that `E` would eventually return does correspond to what this "human" would "want", which does seem like a reasonable possibility at this point, then it should try to act in ways that don't destroy its expected future utility.
- in order to get more compute and data, AI₁ very carefully hacks the internet, takes over the world, maybe prints nanobots and turns large uninhabited parts of the world into compute, and starts using its newfound access to real-world data and computing power to make better guesses as to what utility function `E` would eventually return.
- finally! it looks like AI₁ now has the compute to make some real good informed guesses about `E`. in order to get more information about this "human" called "cindy" that was in the room, AI₁ also brainscans cindy.
- (cindy cooperates because she knows this was a reasonable possibility and the fact that the AI wants to do this is a sign that things are working well so far)
- it's time for a simulation! AI₁ imagines replacing all instances of question in the world, prior to its launch, with "what should the utility function be?", looking for what blob of data will take the place of answer. so, it starts running a (rough) simulation of the room cindy was in, with the computer and the cameras and everything, because that's where most of what mattered to the answer seemed to be located.
- a cindy inside the simulation reads the question blob on the computer. instead of being 1 gigabyte of random garbage, it's a plain text file that starts with "what should the utility function be?" followed by many spaces. she cheers; clearly, this alignment scheme is working as intended, indeed the AI has gotten to the point of running a simulation of her! well, it's time to get to work.
- cindy starts thinking about solutions to alignment. maybe she chats with other alignment researchers who are on board with this project — whom AI₁ now also needs to brainscan in order to simulate them accurately. after 10 hours or so, she writes into the computer the mathematical expression `QACI("okay, i know this is only the first iteration, but here are my very first guesses as to what utility function to maximize to create utopia…" followed by a bunch of ideas)`. the crucial point here is that, because this first cindy-simulation is allowed to return not just a literal utility function, but a mathematical expression which evaluates to a utility function, she gets to make that expression depend on whatever computation she wants, including another simulation of her question-answer interval. in fact, these can call each other in an arbitrarily deep stack.
- AI₁ now has its counterfactual answer. "god damn, the mathematical expression this "human" returned requires me to run another simulation of it to get the result? well, i guess it can't be helped, i'll have to repeat this process until one of the iterations of the human returns an actual utility function, instead of passing the buck to further simulations". it runs this second simulation.
- inside the second simulation, cindy reads the message on the computer, and it reads "okay, i know this is only the first iteration, but here are my very first guesses as to what utility function to maximize to create utopia…" followed by a bunch of ideas. again, she cheers; this instance of her has no memory of the first iteration of her simulation, so the fact that the alignment scheme works is news to her. she works, again for let's say 10 hours, and then writes as answer `QACI("2nd iteration cindy here, here's the tiny bit of progress i made on alignment…" followed by more ideas)`
- in this way, AI₁ simulates (or maybe, in some clever way, skips past simulating and gets straight to the result) cindy-and-her-world for many iterations.
- eventually, one of the cindy's has gotten to a solution that she's pretty confident about: it's not a full explicit utility function of what is good, but it's a much better CEV process, inspired by QACI but different in many ways. instead of returning `QACI(some string)`, she returns this new mathematical object.
- AI₁ goes "huh, that's new! welp, time to run that".
- on the outside, AI₁ has a strong enough hold on the world to ensure its continuation no matter what; but also, AI₁ can see that whatever this sequence of simulations will eventually end up in, it will probly entail humans not being murdered or suffering needlessly, so it avoids things that would cause that. in particular, it makes damn sure to stop anyone else from launching superintelligent AI.
- eventually, after a bunch more such macro-iterations, a utility function that creates utopias is returned, and AI₁ finally maximizes it in the world, creating utopia. in the meantime, perhaps it has been implementing increasingly accurate approximations of that utility function, and already launched into space many copies of itself tasked with running the same sequence of simulations and maximizing their utility functions in the rest of the lightcone.
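the buck-passing loop of simulations described above can be sketched as follows. this is purely illustrative pseudocode: the real `E` is a mathematical expression over counterfactual world-states, not a runnable program, and `simulate_interval` is a hypothetical stand-in for "run the counterfactual question-answer interval and return whatever blob the user types as answer".

```python
# illustrative sketch only; nothing here is a real QACI implementation.
from dataclasses import dataclass

@dataclass
class QACICall:
    question: str  # the answer "passes the buck" to another simulation

def evaluate(question, simulate_interval):
    # keep re-simulating the question-answer interval until some iteration
    # returns an actual utility function instead of another QACI expression
    result = simulate_interval(question)
    while isinstance(result, QACICall):
        result = simulate_interval(result.question)
    return result

# toy run: cindy passes the buck twice, then returns a utility function
script = {
    "what should the utility function be?": QACICall("iteration 2 notes…"),
    "iteration 2 notes…": QACICall("iteration 3 notes…"),
    "iteration 3 notes…": "an actual utility function",
}
result = evaluate("what should the utility function be?", script.get)
# result == "an actual utility function"
```

the loop is written iteratively rather than recursively only because the stack of cindy-simulations can be arbitrarily deep; the two forms are equivalent.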
This proposal is a minor variation on the HCH type ideas. The main differences seem to be
This leads to a selection of problems largely similar to the HCH problems.
Having the whole alignment community for 6 months be part of the question answerer is more likely to work than one person for a few hours, but that amplifies other problems.
This method also has the problem of amplified failure probability. Suppose somewhere down the line, millions of iterations in, cindy goes outside for a walk, and gets hit by a truck. Virtual cindy doesn't return to continue the next layer of the recursion. What then? (Possibly some code just adds "attempt 2" at the top and tries again.)
Ok, so another million layers in, cindy drops a coffee cup on the keyboard, accidentally typing some rubbish. This gets interpreted by the AI as a mathematical command, and the AI goes on to maximize ???
Chaos theory. Someone else develops a paperclip maximizer many iterations in, and the paperclip maximizer realizes it's in a simulation, hacks into the answer channel and returns "make as many paperclips as possible" to the AI.
And then there is the standard mindcrime concern. Where are all these virtual cindies going once we are done with them? We can probably just tell the AI in english that our utility function is likely to dislike deleting virtual humans. So all the virtual humans get saved on disk, and then can live in the utopia. Hey, we need loads of people to fill up the dyson sphere anyway.
I am not confident that your "make it complicated and personal data" approach at the root really stops all the aliens doing weird acausal stuff. The multiverse is big. Somewhere out there there is a cindy producing any bitstream that looks like this personal data, and somewhere out there are aliens faking the whole scenario for every possible stream of similar data. You probably need the internal counterfactual design to be resistant to acausal tampering.
(to be clear, the question and answer serve not so much as "training data" meant to represent the user, but as "IDs" or "coordinates" meant to locate the user in the past-lightcone.)
this is true, though i think we might not need a super complex framework for hypotheticals. i have some simple math ideas that i explore a bit here, and about which i might write a bunch more.
for failure modes like the user getting hit by a truck or spilling coffee, we can do things such as at each step asking not 1 cindy the question, but asking 1000 cindy's 1000 slight variations on the question, and then maybe have some kind of convolutional network to curate their answers (such as ignoring garbled or missing output) and pass them to the next step, without ever relying on a small number of cindy's except at the very start of this process.
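as a minimal sketch of that curation step: here, a majority vote over sanity-filtered answers stands in for whatever smarter curation mechanism would actually be used, and all names are illustrative.

```python
# hedged sketch: filter out missing/garbled answers from many simulated
# users, then keep the most common surviving answer for the next step.
from collections import Counter

def looks_sane(answer) -> bool:
    # stand-in sanity filter: present, non-empty, not obviously garbled
    return isinstance(answer, str) and bool(answer) and answer.isprintable()

def curate(answers):
    """drop missing/garbled outputs, keep the most common surviving answer."""
    valid = [a for a in answers if looks_sane(a)]
    return Counter(valid).most_common(1)[0][0] if valid else None

# toy run: 997 agreeing cindy's, one hit by a truck (None), one empty
# answer, one coffee-spill garble with control characters
answers = ["utility sketch v1"] * 997 + [None, "", "\x00\x07garbage"]
result = curate(answers)
# result == "utility sketch v1"
```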
it is true that weird memes could take over the graph of cindy's; i don't have an answer to that apart from that it seems sufficiently unlikely to me that i still think this plan has promise.
hmm. that's possible. i guess i have to hope this never happens on the question-interval, on any simulation day. alternatively, maybe the mutually-checking graph of a 1000 cindy's can help with this? (but probly not; clippy can just hack the cindy's).
yup. or, if the QACI user is me, i'm probly also just fine with those local deaths; not a big deal compared to an increased chance of saving the world. alternatively, instead of being saved on disk, they can also just be recomputed later since the whole process is deterministic.
yup, i'm not confident either. i think there could be other schemes, possibly involving cryptography in some ways, to entangle the answer with a unique randomly generated signature key or something like that.
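as a hedged sketch of one such entanglement idea (my own illustration here, not a worked-out part of QACI): tag the answer with an HMAC of the question under a secret key only the user holds, so the tag depends on the exact question and can't be reproduced without the key.

```python
# illustrative only: derive an answer tag from both the question and a
# user-held secret key, "entangling" the answer with the question.
import hashlib
import hmac
import os

secret_key = os.urandom(32)  # generated once and kept private by the user

def entangled_tag(question: bytes) -> str:
    # include this tag in the answer blob; it ties the answer to this
    # exact question, and forging it requires the secret key
    return hmac.new(secret_key, question, hashlib.sha256).hexdigest()

q1, q2 = os.urandom(64), os.urandom(64)
tag1, tag2 = entangled_tag(q1), entangled_tag(q2)
# different questions yield different tags; the same question yields
# the same tag, reproducibly, for the key holder
```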
Strong upvote. Would like to see OP's response to this.
I have a pretty strong heuristic that clever schemes like this one are pretty doomed. The proposal seems to lack security mindset, as Eliezer would put it.
The most immediate/simple concrete objection I have is that no one has any idea how to create `aligned-AI-part-2.exe`? I don't think figuring out what we'd do if we knew how to make a program like that is really the difficult part here.
This is a Heuristic That Almost Always Works, and it's the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object level question of how (and whether!) each clever scheme is doomed then we are guaranteed not to find one.
Security mindset means look for flaws, not assume all plans are so doomed you don't need to look.
If this is, in fact, a utility function which if followed would lead to a good future, that is concrete progress and lays out a new set of true names as a win condition. Not a solution, we can't train AIs with arbitrary goals, but it's progress in the same way that quantilizers was progress on mild optimization.
I don't think security mindset means "look for flaws." That's ordinary paranoia. Security mindset is something closer to "you better have a really good reason to believe that there aren't any flaws whatsoever." My model is something like "A hard part of developing an alignment plan is figuring out how to ensure there aren't any flaws, and coming up with flawed clever schemes isn't very useful for that. Once we know how to make robust systems, it'll be more clear to us whether we should go for melting GPUs or simulating researchers or whatnot."
That said, I have a lot of respect for the idea that coming up with clever schemes is potentially more dignified than shooting everything down, even if clever schemes are unlikely to help much. I respect carado a lot for doing the brainstorming.
I think a better way of rephrasing it is "clever schemes have too many moving parts and make too many assumptions and each assumption we make is a potential weakness an intelligent adversary can and will optimize for".
i would love a world-saving-plan that isn't "a clever scheme" with "many moving parts" but alas i don't expect it's what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i've heard of.
to me it kind of is; i mean, if you have that, what do you do then? how do you use such a system to save the world?
I mostly expect by the time we know how to make a seed superintelligence and give it a particular utility function... well, first of all the world has probably already ended, but second of all I would expect progress on corrigibility and such to have been made and probably to present better avenues.
If Omega handed me `aligned-AI-part-2.exe`, I'm not quite sure how I would use it to save the world? I think probably trying to just work on the utility function outside of a simulation is better, but if you are really running out of time then sure, I guess you could try to get it to simulate humans until they figure it out. I'm not very convinced that referring to a thing a person would have done in a hypothetical scenario is a robust method of getting that to happen, though?
It seems like if you can motivate an AI to do this very specific thing, you already solved the important bits somewhere else.
agreed! working on it.
As I said over on your Discord, this feels like it has a shard of hope, and the kind of thing that could plausibly work if we could hand AIs utility functions.
I'd be interested to see the explicit breakdown of the true names you need for this proposal.
only one round of initial question-answer still seems very bad to me. it's very hard to get branch coverage of a brain. random data definitely won't do it.
to be clear, the data isn't what the AI uses to learn about what the human says, the data is what the AI uses to know which thing in the world is the human, so it can then simply for example ask or brainscan the human, or learn about what it'd do in the first iteration in any number of other ways.
note for posterity: @carado and I have talked about this at length since this post and I now am mostly convinced that this is workable. I would currently describe the question (slightly metaphorically) as an "intentional glitch token", in that it is specifically designed to be a large random blob that cannot be inferred except by exploring, and which, since it gates all utility, causes the inner-aligned system to be extremely cautious.
I've been pondering that, and a thing I had been meaning to bring up and might as well mention in this comment is that this may cause an inner-aligned utility maximizer to sit around doing nothing forever out of an abundance of caution, since it can't identify worlds where it can be sure it can identify the configuration of the world that actually increases its utility function.
How do you think about the under-definedness of counterfactuals?
EG, if counterfactuals are weird, this proposal probably does something weird, as it has to condition on increasingly weird counterfactuals.
the counterfactuals might be defined wrong but they won't be "under-defined". but yes, they might locate the blob somewhere we don't intend to (or insert the counterfactual question in a way we don't intend to); i've been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
on the other hand, if you're talking about the blob-locating math pointing to the right thing but the AI not making accurate guesses early enough as to what the counterfactuals would look like, i do think getting only eventual alignment is one of the potential problems, but i'm hopeful it gets there eventually, and maybe there are ways to check that it'll make good enough guesses even before we let it loose.
Yeah, no, I'm talking about the math itself being bad, rather than the math being correct but the logical uncertainty making poor guesses early on.
I noticed you had some other posts relating to the counterfactuals, but skimming them felt like you were invoking a lot of other machinery that I don't think we have, and that you also don't think we have (IE the voice in the posts is speculative, not affirmative).
So I thought I would just ask.
My own thinking would be that the counterfactual reasoning should be responsive to the system's overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.
Sticking close to QACI, I think what this amounts to is tracking uncertainty about the counterfactuals employed, rather than solidly assuming one way of doing it is correct. But there are complex questions of how to manage that uncertainty.
i've made some work towards building that machinery (see eg here), but yes, there are still a bunch of things to be figured out, though i'm making progress in that direction (see the posts about blob location).
are you saying this in the prescriptive sense, i.e. we should want that property? i think if implemented correctly, accuracy is all we would really need right? carrying human intent in those parts of the reasoning seems difficult and wonky and plausibly not necessary to me, where straightforward utility maximization should work.
Notably, this relies on the utility function actually being sparse enough that it can't be maximized except by generating the traits abram mentions.
I'm a little suspicious about this step
What reasons do we have to believe AI_1 will be careful enough?
If we have techniques that are powerful enough to carry through this step correctly, chances are we have already solved the alignment problem. The Alignment problem is mostly about getting the AI to do anything at all without destroying the world, not about figuring out how to write down the one true perfect utility function.
one reason we might have to think that the AI would be careful about this is that it knows it has a utility function to maximize but doesn't yet know what it is, though it can make informed guesses about it. "i don't know what my human user is gonna pick as utility function, but whatever it is, it probly strongly dislikes me causing damage, so i should probly avoid that".
it's not careful because we have the alignment tech to give it the characteristic of carefulness, it's hopefully careful because it's ultimately aligned, and its best guess as to what it's aligned to entails not destroying everything that matters.
This doesn't make me any less suspicious. Humans have a utility function of "make more humans", but we still invented nuclear weapons and came within a hair's breadth of destroying the entire planet.
hello I'm "point out evolution alignment or not" brain shard. humans do not have a utility function of make more humans, they have a utility function of preserve their genetic-epigenetic-memetic self-actualization trajectory, or said less obtusely, make your family survive indefinitely. that does not mean make your family as big as possible. even if you need to make your family big to make your family survive indefinitely, maximizing family size is a strategy chosen by almost no organisms or microbes. first order optimization is not how anything works except sometimes locally. second order or above always ends up happening because high quality optimization tries to hit targets (second order approximation of a policy update), it doesn't try to go in directions.