Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

this post was written by Tamsin Leake at Orthogonal.
thanks to Julia Persson and mesaoptimizer for their help putting it together.

no familiarity with the Evangelion anime is required to understand this post, and it pretty much doesn't contain any spoilers.

this post explains the justification for, and the math formalization of, the QACI plan for formal-goal alignment. you might also be interested in its companion post, formalizing the QACI alignment formal-goal, which just covers the math in a more straightforward, bottom-up manner.

1. agent foundations & anthropics

🟣 misato — hi ritsuko! so, how's this alignment stuff going?

🟡 ritsuko — well, i think i've got an idea, but you're not going to like it.

🟢 shinji — that's exciting! what is it?

🟡 ritsuko — so, you know how in the sequences and superintelligence, yudkowsky and bostrom talk about how hard it is to fully formalize something which leads to nice things when maximized by a utility function? so much so that it serves as an exercise to think about one's values and consistently realize how complex they are?

🟡 ritsuko — ah, yes, the good old days when we believed this was the single obstacle to alignment.

🔴 asuka barges into the room and exclaims — hey, check this out! i found this fancy new theory on lesswrong about how "shards of value" emerge in neural networks!

🔴 asuka then walks away while muttering something about eiffel towers in rome and waluigi hyperstition…

🟡 ritsuko — indeed. these days, all these excited kids running around didn't learn about AI safety by thinking really hard about what agentic AIs would do — they got here by being spooked by large language models, and as a result they're thinking in all kinds of strange directions, like what it means for a language model to be aligned or how to locate natural abstractions for human values in neural networks.

🟢 shinji — of course that's what we're looking at! look around you, turns out that the shape of intelligence is RLHF'd language models, not agentic consequentialists! why are you still interested in those old ideas?

🟡 ritsuko — the problem, shinji, is that we can't observe agentic AI being published before alignment is solved. when someone figures out how to make AI consequentialistically pursue a coherent goal, whether by using current ML technology or by building a new kind of thing, we die shortly after they publish it.

🟣 misato — wait, isn't that anthropics? i'd rather stay away from that type of thinking, it seems too galaxybrained to reason about…

🟡 ritsuko — you can't really do that either — the "back to square one" interpretation of anthropics, where you don't update at all, is still an interpretation of anthropics. it's kind of like being the kind of person who, when observing having survived quantum russian roulette 20 times in a row, assumes that the gun is broken rather than saying "i guess i might have low quantum amplitude now" and fails to realize that the gun can still kill them — which is bad when all of our hopes and dreams rest on those assumptions. the only vaguely anthropics-ignoring perspective one can take about this is to ignore empirical evidence and stick to inside view, gears-level prediction of how convergent agentic AI tech is.

🟣 misato — …is it?

🟡 ritsuko — of course it is! on inside view, all the usual MIRI arguments hold just fine. it just so happens that if you keep running a world forwards, and select only for worlds that we haven't died in, then you'll start observing stranger and stranger non-consequentialist AI. you'll start observing the kind of tech we get when we just dumbly scale up bruteforce-ish methods like machine learning, and you'll somehow observe nobody publishing insights as to how to make those systems agentic or consequentialistic.

🟢 shinji — that's kind of frightening!

🟡 ritsuko — well, it's where we are. we already thought we were small in space, now we know that we're also small in probabilityspace. the important part is that it doesn't particularly change what we should do — we should still try to save the world, in the most straightforward fashion possible.

🟣 misato — so all the excited kids running around saying we have to figure out how to align language models or whatever…

🟡 ritsuko — they're chasing a chimera. impressive LLMs are not what we observe because they're what powerful AI looks like — they're what we observe because they're what powerful AI doesn't look like. they're there because that's as impressive as you can get short of something that kills everyone.

🟣 misato — i'm not sure most timelines are dead yet, though.

🟡 ritsuko — we don't know if "most" timelines are alive or dead from agentic AI, but we know that however many are dead, we couldn't have known about them. if every AI winter was actually a bunch of timelines dying, we wouldn't know.

🟣 misato — you know, this doesn't necessarily seem so bad. considering that confused alignment people are what caused the appearance of the three organizations trying to kill everyone as fast as possible, maybe it's better that alignment research seems distracted with things that aren't as relevant, rather than figuring out agentic AI.

🟡 ritsuko — you can say that alright! there's already enough capability hazards being carelessly published everywhere as it is, including on lesswrong. if people were looking in the direction of the kind of consequentialist AI that actually determines the future, this could cause a lot of damage. good thing there's a few very careful people here and there, studying the right thing, but being very careful by not publishing any insights. but this is indeed the kind of AI we need to figure out if we are to save the world.

🟢 shinji — whatever kind of anthropic shenanigans are at play here, they sure seem to be saving our skin! maybe we'll be fine because of quantum immortality or something?

🟣 misato — that's not how things work shinji. quantum immortality explains how you got here, but doesn't help you save the future.

🟢 shinji sighs, with a defeated look on his face — …so we're back to the good old MIRI alignment, we have to perfectly specify human values as a utility function and figure out how to align AI to it? this seems impossible!

🟡 ritsuko — well, that's where things get interesting! now that we're talking about coherent agents whose actions we can reason about, agents whose instrumentally convergent goals such as goal-content integrity would be beneficial if they were aligned, agents who won't mysteriously turn bad eventually because they're not yet coherent agents, we can actually get to work putting something together.

🟣 misato — …and that's what you've been doing?

🟡 ritsuko — well, that's kind of what agent foundations had been about all along, and what got rediscovered elsewhere as "formal-goal alignment": designing an aligned coherent goal and figuring out how to make an AI that is aligned to maximizing it.

2. embedded agency & untractability

🟢 shinji — so what's your idea? i sure could use some hope right now, though i have no idea what an aligned utility function would even look like. i'm not even sure what kind of type signature it would have!

🟡 ritsuko smirks — so, the first important thing to realize is that the challenge of designing an AI that emits output which saves the world can be formulated like this: design an AI trying to solve a mathematical problem, and make the mathematical problem be analogous enough to "what kind of output would save the world" that the AI, by solving it, happens to also save our world.

🟢 shinji — but what does that actually look like?

🟣 misato — maybe it looks like "what output should you emit, which would cause your predicted sequence of stimuli to look like a nice world?"

🟡 ritsuko — what do you think actually happens if an AI were to succeed at this?

🟣 misato — oh, i guess it would hack its stimuli input, huh. is there even a way around this problem?

🟡 ritsuko — what you're facing is a facet of the problem of embedded agency. you must make an AI which thinks about the world which contains it, not just about a system that it feels like it is interacting with.

🟡 ritsuko — the answer — as in PreDCA — is to model the world from the top-down, and ask: "look into this giant universe. you're in there somewhere. which action should the you-in-there-somewhere take, for this world to have the most expected utility?"

🟢 shinji — expected utility? by what utility function?

🟡 ritsuko — we're coming to it, shinji. there are three components to this: the formal-goal-maximizing AI, the formal-goal, and the glue in-between. embedded agency and decision theory are parts of this glue, and they're core to how we think about the whole problem.

🟣 misato — and this top-down view works? how the hell would it compute the whole universe? isn't that uncomputable?

🟡 ritsuko — how the hell do you expect AI would have done expected utility maximization at all? by making reasonable guesses. i can't compute the whole universe from the big-bang up to you right now, but if you give me a bunch of math which i'd understand to say "in worlds being computed forwards starting at some simple initial state and eventually leading to this room right now with shinji, misato, ritsuko in it, what is shinji more likely to be thinking about: his dad, or the pope's uncle?", i can still make a reasonable guess at the answer.

🟡 ritsuko — on the one hand, the question is immensely computationally expensive — it asks to compute the entire history of the universe up to this shinji! but on the other hand, it is talking about a world which we inhabit, and about which we have the ability to make reasonable guesses. if we build an AI that is smarter than us, you can bet it'll be able to make guesses at least as well as this.

🟣 misato — i'm not convinced. after all, we relied on humans to make this guess! of course you can guess about shinji, you're a human like him. why would the AI be able to make those guesses, being the alien thing that it is?

🟡 ritsuko — i mean, one of its options is to ask humans around. it's not like it has to do everything by itself on its single computer, here — we're talking about the kind of AI that agentically saves the world, and has access to all kinds of computational resources, including humans if needed. i don't think it'll actually need to rely on human compute a lot, but the fact that it can serves as a kind of existence proof for its ability to produce reasonable solutions to these problems. not optimal solutions, but reasonable solutions — eventually, solutions that will be much better than any human or collection of humans would be able to come up with short of getting help from aligned superintelligence.

🟢 shinji — but what if the worlds that are actually described by such math are not in fact this world, but strange alien worlds that look nothing like ours?

🟡 ritsuko — yes, this is also part of the problem. but let's not keep moving the goalpost here. there are two problems: make the formal problem point to the right thing (the right shinji in the right world), and make an AI that is good at finding solutions to that problem. both seem like we can solve them with some confidence; but we can't just keep switching back and forth between the two.

🟡 ritsuko — if you have to solve two problems A and B, then you have to solve A assuming B is solved, and then solve B assuming A is solved. then, you've got a pair of solutions which work with one another. here, we're solving the problem of whether an AI would be able to solve this problem, assuming the problem points to the right thing; later we'll talk about how to make the problem point to the right thing assuming we have an AI that can solve it.

🟢 shinji — are there any actual implementation ideas for how to build such a problem-solving AI? it sure sounds difficult to me!

🟣 misato, carefully peeking into the next room — hold on. i'm not actually quite sure who's listening — it is known that capabilities people like to lurk around here.

🟤 kaji can be seen standing against a wall, whistling, pretending not to hear anything.

🟡 ritsuko — right. one thing i will reiterate, is that we should not observe a published solution to "how to get powerful problem-solving AI" before the world is saved. this is in the class of problems for which we die shortly after a solution is found and published, so our lack of observing such a solution is not much evidence for its difficulty.

3. one-shot AI

🟡 ritsuko — anyways, to come back to embedded agency.

🟣 misato — ah, i had a question. the AI returns a first action which it believes would overall steer the world in a direction that maximizes its expected utility. and then what? how does it get its observation, update its model, and take the next action?

🟡 ritsuko — well, there are a variety of clever schemes to do this, but an easy one is to just not.

🟣 misato — what?

🟡 ritsuko — to just not do anything after the first action. i think the simplest thing to build is what i call a "one-shot AI", which halts after returning an action. and then we just run the action.

🟢 shinji — "run the action?"

🟡 ritsuko — sure. we can decide in advance that the action will be a linux command to be executed, for example. the scheme does not really matter, so long as the AI gets an output channel which has pretty easy bits of steering the world.

🟣 misato — hold on, hold on. a single action? what do you intend for the AI to do, output a really good pivotal act and then hope things get better?

🟡 ritsuko — have a little more imagination! our AI — let's call it AI₀ — will almost certainly return a single action that builds and then launches another, better AI, which we'll call AI₁. a powerful AI can absolutely do this, especially if it has the ability to read its own source-code for inspiration, but probably even without that.

🟡 ritsuko — …and because it's solving the problem "what action would maximize utility when inserted into this world", it will understand that AI₁ needs to have embedded agency and the various other aspects that are instrumental to it — goal-content integrity, robustly delegating RSI, and so on.

🟢 shinji — "RSI"? what's that?

🟣 misato sighs — you know, it keeps surprising me how many youths don't know about the acronym RSI, which stands for Recursive Self-Improvement. it's pretty indicative of how little they're thinking about it.

🟢 shinji — i mean, of course! recursive self-improvement is an obsolete old MIRI idea that doesn't apply to the AIs we have today.

🟣 misato — right, kids like you got into alignment by being spooked by chatbots. (what silly things do they even teach you in class these days?)

🟣 misato — you have to realize that the generation before you, the generation of ritsuko and i, didn't have the empirical evidence that AI was gonna be impressive. we started on something like the empty string, or at least coherent arguments where we had to actually build a gears-level inside-view understanding of what AI would be like, and what it would be capable of.

🟣 misato — to me, one of the core arguments that sold me on the importance of AI and alignment was recursive self-improvement — the idea that AI being better than humans at designing AI would be a very special, very critical point in time, downstream of which AI would be able to beat humans at everything.

🟢 shinji — but this turned out irrelevant, because AI is getting better than humans without RSI–

🟡 ritsuko — again, false. we can only observe AI getting better than humans at intellectual tasks without RSI, because when RSI is discovered and published, we die very shortly thereafter. you have a sort of consistent survivorship bias, where you keep thinking of a whole class of things as irrelevant because they don't seem impactful, when in reality they're the most impactful; they're so impactful that when they happen you die and are unable to observe them.

4. action scoring

🟣 misato — so, i think i have a vague idea of what you're saying, now. top-down view of the universe, which is untractable but that's fine apparently, thanks to some mysterious capabilities; one-shot AI to get around various embedded agency difficulties. what's the actual utility function to align to, now? i'm really curious. i imagine a utility function assigns a value between 0 and 1 to any, uh, entire world? world-history? multiverse?

🟡 ritsuko — it assigns a value between 0 and 1 to any distribution of worlds, which is general enough to cover all three of those cases. but let's not get there yet; remember how the thing we're doing is untractable, and we're relying on an AI that can make guesses about it anyways? we're gonna rely on that fact a whole lot more.

🟣 misato — oh boy.

🟡 ritsuko — so, first: we're not passing a utility function. we're passing a math expression describing an "action-scoring function" — that is to say, a function attributing scores to actions rather than to distributions over worlds. we'll make the program deterministic and make it ignore all input, such that the AI has no ability to steer its result — its true result is fully predetermined, and the AI has no ability to hijack that true result.

🟣 misato — wait, "hijack it"? aren't we assuming an inner-aligned AI, here?

🟡 ritsuko — i don't like this term, "inner-aligned"; just like "AGI", people use it to mean too many different and unclear things. we're assuming an AI which does its best to pick an answer to a math problem. that's it.

🟡 ritsuko — we don't make an AI which tries to not be harmful with regards to its side-channels, such as hardware attacks — except for its output, it needs to be strongly boxed, such that it can't destroy our world by manipulating software or hardware vulnerabilities. similarly, we don't make an AI which tries to output a solution we like, it tries to output a solution which the math would score high. narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.

🟡 ritsuko starts scribbling on a piece of paper on her desk — let's write down some actual math here. let's call $\Omega$ the set of world-states, $\Delta_\Omega$ distributions over world-states, and $A$ be the set of actions.

🟢 shinji — what are the types of all of those?

🟡 ritsuko — let's not worry about that, for now. all we need to assume for the moment is that those sets are countable. we could define both $\Omega := \mathbb{B}^*$ and $A := \mathbb{B}^*$ — define them both as the set of finite bitstrings — and this would functionally capture all we need. as for distributions over world-states $\Delta_\Omega$, we'll define $\Delta_X := \{ f \mid f \in X \to [0;1],\ \sum_{x \in X}^{x} f(x) \leq 1 \}$ for any countable set $X$, and we'll call "mass" the number which a distribution associates to any element.

🟣 misato — woah, woah, hold on, i haven't looked at math in a while. what do all those squiggles mean?

🟡 ritsuko — $\Delta_X$ is defined as the set of functions $f$, which take an $X$ and return a number between $0$ and $1$, such that if you take the $f$ of all $x$'s in $X$ and add those up, you get a number not greater than $1$. note that i use a notation of sums $\sum$ where the variables being iterated over are above the $\sum$ and the constraints that must hold are below it — so this sum adds up all of the $f(x)$ for each $x$ such that $x \in X$.

🟣 misato — um, sure. i mean, i'm not quite sure what this represents yet, but i guess i get it.

🟡 ritsuko — the set $\Delta_X$ of distributions over $X$ is basically like saying "for any finite amounts of mass less than 1, what are some ways to distribute that mass among some or all of the $X$'s?" each of those ways is a distribution; each of those ways is an $f$ in $\Delta_X$.
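(a minimal sketch of this definition in python, not part of the original math. it assumes a distribution with finite support is represented as a dict from elements to masses, with every missing element having mass 0; `is_distribution` and the example names are hypothetical, and exact `Fraction` masses are only used to avoid float rounding.)

```python
from fractions import Fraction
from typing import Hashable, Mapping


def is_distribution(f: Mapping[Hashable, Fraction]) -> bool:
    """check the two conditions: every mass is in [0, 1], and the total mass is at most 1.

    assumption: elements absent from the mapping have mass 0.
    """
    masses = list(f.values())
    return all(0 <= m <= 1 for m in masses) and sum(masses) <= 1


half = Fraction(1, 2)
quarter = Fraction(1, 4)

full = {"00": half, "01": half}                    # all of the mass is handed out
partial = {"00": quarter, "1": quarter}            # only half of the mass is handed out: still allowed
too_much = {"0": half, "1": half, "11": quarter}   # total mass 5/4: not a distribution
```

note that `partial` passes the check: the total only has to be at most 1, not exactly 1.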

🟡 ritsuko — anyways. the AI will take as input an untractable math expression of type $A \to [0;1]$, and return a single $A$. note that we're in math here, so "is of type" and "is in set" are really the same thing; we'll use $\in$ to denote both set membership and type membership, because they're the same concept. for example, $A \to [0;1]$ is the set of all functions taking as input an $A$ and returning a $[0;1]$ — returning a real number between $0$ and $1$.

🟢 shinji — hold on, a real number?

🟡 ritsuko — well, a real number, but we're passing to the AI a discrete piece of math which will only ever describe countable sets, so we'll only ever describe countably many of those real numbers. infinitely many, but countably infinitely many.

🟣 misato — so the AI has type $(A \to [0;1]) \to A$, and we pass it an action-scoring function of type $A \to [0;1]$ to get an action. checks out. where do utility functions come in?
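(to make the type signature concrete, here is a toy sketch in python under loud assumptions. the real scoring function is untractable and the real AI has to guess; this stand-in just brute-forces a small, explicitly given list of candidates, which is an extra parameter that the type $(A \to [0;1]) \to A$ does not have. `one_shot_ai`, `toy_score` and the candidate list are all hypothetical.)

```python
from typing import Callable, Iterable

Action = str                         # stand-in for the action set A (finite bitstrings)
Scoring = Callable[[Action], float]  # an action-scoring function, A -> [0;1]


def one_shot_ai(score: Scoring, candidates: Iterable[Action]) -> Action:
    """toy stand-in for the AI: return the best-scoring candidate action, then halt."""
    return max(candidates, key=score)


def toy_score(action: Action) -> float:
    """hypothetical scoring function: the fraction of '1' bits in the action.

    it is deterministic and reads nothing but its argument.
    """
    return action.count("1") / len(action) if action else 0.0


best = one_shot_ai(toy_score, ["0000", "0110", "1110", ""])
```

the only thing this stand-in optimizes is the score of the action it returns; nothing in it refers to a utility function over worlds.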

🟡 ritsuko — they don't need to come in at all, actually! we'll be defining a piece of math which describes the world for the purpose of pointing at the humans who will decide on a scoring function, but the scoring function will only be over actions the AI should take.

🟡 ritsuko — the AI doesn't need to know that its math points to the world it's in; and in fact, conceptually, it isn't told this at all. on a fundamental, conceptual level, it is not being told to care about the world it's in — if it could, it would take over our world and kill everyone in it to acquire as much compute as possible, and plausibly along the way drop an anvil on its own head because it doesn't have embedded agency with regards to the world around itself.

🟡 ritsuko — we will just very carefully box it such that its only meaningful output into our world, the only bits of steering it can predictably use, are those of the action it outputs. and we will also have very carefully designed it such that the only thing it ultimately cares about, is that that output have as high of an expected scoring as possible — it will care about this intrinsically, and nothing else intrinsically, such that doing that will be more important than hijacking our world through that output.

🟡 ritsuko — this meaning of "inner-alignment" is still hard to accomplish, but it is much better defined, much narrower, and thus hopefully much easier to accomplish than the "full" embedded-from-the-start alignments which very slow, very careful corrigibility-based AI alignment would result in.

5. early math & realityfluid

🟣 misato — so what does that scoring function actually look like?

🟡 ritsuko — you know what, i hadn't started mathematizing my alignment idea yet; this might be a good occasion to get started on that!

🟡 ritsuko wheels in a whiteboard — so, what i expect is that the order in which we're gonna go over the math is going to be the opposite order to that of the final math report on QACI. here, we'll explore things from the top-down, filling in details as we go — whereas the report will go from the bottom-up, fully defining constructs and then using them.

🟡 ritsuko — this is roughly what we'll be doing here. go over all hypotheses the AI could have within some set of hypotheses, called $\mathrm{Hypothesis}$; measure their prior probability, the likelihood that they correspond to our world, and how good the actions are in them. this is the general shape of expected scoring for actions.
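(written out, the general shape described here might look like the following. this is a hedged sketch: the names $\mathrm{Score}$, $\mathrm{Prior}$, $\mathrm{LooksLikeThisWorld}$ and $\mathrm{HowGood}$ are placeholder assumptions for "expected score of an action", "prior probability of a hypothesis", "how much the hypothesis corresponds to our world" and "how good the action is under that hypothesis".)

```latex
\mathrm{Score}(a) \;:=\; \sum_{h \in \mathrm{Hypothesis}}
    \mathrm{Prior}(h) \cdot \mathrm{LooksLikeThisWorld}(h) \cdot \mathrm{HowGood}(a, h)
```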

🟢 shinji — wait, the set of hypotheses is called $\mathrm{Hypothesis}$, not $\mathrm{Hypotheses}$? that's a bit confusing.

🟡 ritsuko — this is pretty standard in math, shinji. the reason to call the set of hypotheses $\mathrm{Hypothesis}$ is because, as explained before, sets are also types, and so a hypothesis will be of type $\mathrm{Hypothesis}$ rather than $\mathrm{Hypotheses}$.

🟣 misato — what's in a $\mathrm{Hypothesis}$, exactly?

🟡 ritsuko — the set of all relevant beliefs about things. or rather, the set of all relevant beliefs except for logical facts. logical uncertainty will be a thing on the AI's side, not in the math — this math lives in the realm of "platonic perfect true math", and the AI will have beliefs about what its various parts tend to result in as one kind of logical belief, just like it'll have beliefs about other logical facts.

🟣 misato — so, a mathematical object representing empirical beliefs?

🟡 ritsuko — i would rather put it as a pair of: beliefs about what's real ("realityfluid" beliefs); and beliefs about where, in the set of real things, the AI is ("indexical" beliefs). but this can be simplified by allocating realityfluid across all mathematical/computational worlds (this is equivalent to assuming the tegmark level 4 multiverse is real, and can be done by assuming the cosmos to be a "universal complete" program running all computations) and then all beliefs are indexical. these two possibilities work out to pretty much the same math, anyways.

🟢 shinji — what the hell is "realityfluid"???

🟡 ritsuko — it's a very long story, i'm afraid.

🟣 misato — think of it as a measure of how some constant amount of "matteringness"/"realness" — typically 1 unit of it — is distributed across possibilities. even though it kinda mechanistically works like probability mass, it's "in the other direction": it represents what's actually real, rather than representing what we believe.

🟢 shinji — why would it sum to 1? what if there's an infinite amount of stuff out there?

🟣 misato — your realityfluid still needs to sum up to some constant. if you allocate an infinite amount of matteringness, things break and don't make sense.

🟡 ritsuko — indeed. this is why the most straightforward way to allocate realityfluid is to just imagine that the set of all that exists is a universal program whose computation is cut into time-steps each doing a constant amount of work, and then allocate some diminishing quantities of realityfluid to each time step.

🟣 misato — like saying that compute step number $n \geq 1$ has $\frac{1}{2^n}$ realityfluid?

🟡 ritsuko — that would indeed normalize, but it diminishes exponentially fast. this makes world-states exponentially unlikely in the amount of compute they exist after; and there are philosophical reasons to say that exponential unlikeliness is what should count as non-existing.

🟢 shinji — what the hell are you talking about??

🟡 ritsuko hands shinji a paper called "Why Philosophers Should Care About Computational Complexity" — look, this is a whole other tangent, but basically, polynomial amounts of computation correspond to "doing something", whereas exponential amounts of computation correspond to "magically obtaining something out of the ether", and this sort-of ramifies naturally across the rest of computational complexity applied to metaphysics and philosophy.

🟡 ritsuko — so instead, we can say that computation step number $n \geq 1$ has $\frac{1}{n^2}$ realityfluid. this only diminishes quadratically, which is satisfactory.
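(a quick numeric sketch of the difference between the two allocation schemes, assuming the exponential scheme gives step $n$ a weight of $1/2^n$ and the quadratic one a weight of $1/n^2$. the function names are hypothetical, and floats are used only for illustration.)

```python
import math


def exponential_weight(n: int) -> float:
    """weight of compute step n under the exponentially diminishing scheme."""
    return 1 / 2 ** n


def quadratic_weight(n: int) -> float:
    """weight of compute step n under the quadratically diminishing scheme."""
    return 1 / n ** 2


def partial_sum(weight, steps: int) -> float:
    """total weight allocated to compute steps 1..steps."""
    return sum(weight(n) for n in range(1, steps + 1))


exp_total = partial_sum(exponential_weight, 60)      # approaches 1
quad_total = partial_sum(quadratic_weight, 100_000)  # approaches pi^2 / 6
ratio_at_50 = quadratic_weight(50) / exponential_weight(50)
```

both totals are finite (the exponential weights sum to 1, the quadratic ones to π²/6), but by step 50 the quadratic weight is already more than 10^11 times larger than the exponential one, which is the sense in which late world-states stop being exponentially unlikely.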

🟡 ritsuko — oh, and for the same reason, the universal program needs to be quantum (for example, a quantum equivalent of the classical universal program, implemented on something like a quantum turing machine). otherwise, unless BQP=BPP, quantum multiverses like ours might be exponentially expensive to compute, which would be strange.

🟢 shinji — why $\frac{1}{n^2}$? why not $\frac{1}{n^{1.01}}$ or $\frac{1}{n^{1000}}$?

🟡 ritsuko — those do indeed all normalize — but we pick $\frac{1}{n^2}$ because at some point you just have to pick something, and $2$ is a natural, occam/solomonoff-simple number which works. look, just–

🟢 shinji — and why are we assuming the universe is made of discrete computation anyways? isn't stuff made of real numbers?

🟡 ritsuko sighs — look, this is what the church-turing-deutsch principle is about. for any universe made up of real numbers, you can approximate it thusly:

  • compute 1 step of it with every number truncated to its first 1 binary digit of precision
  • compute 2 steps of it with every number truncated to its first 2 binary digits of precision

that is, 1 time step with 1 bit of precision, then 2 time steps with 2 bits of precision, then 3 with 3, and so on. for any piece of branch-spacetime which is only finitely far away from the start of its universe, there exists a threshold at which it starts being computed in a way that is indistinguishable from the version with real numbers.
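(a toy sketch of this approximation scheme in python, under stated assumptions: the "universe" is a single non-negative quantity evolving by the made-up rule x → x/2 + 1/3, round k recomputes it from the start with every value truncated to k binary digits, and `truncate`, `step`, `exact_state` and `approximate_state` are hypothetical names. in this toy, "indistinguishable" shows up as an error that falls below any fixed tolerance once the precision is high enough.)

```python
from fractions import Fraction
from math import floor


def truncate(x: Fraction, bits: int) -> Fraction:
    """keep only the first `bits` binary digits after the point (x assumed non-negative)."""
    scale = 2 ** bits
    return Fraction(floor(x * scale), scale)


def step(x: Fraction) -> Fraction:
    """made-up toy physics: one quantity evolving by x -> x/2 + 1/3."""
    return x / 2 + Fraction(1, 3)


def exact_state(t: int) -> Fraction:
    """the state at time t, computed with exact arithmetic."""
    x = Fraction(1, 3)
    for _ in range(t):
        x = step(x)
    return x


def approximate_state(t: int, bits: int) -> Fraction:
    """the state at time t, with every number truncated to `bits` binary digits."""
    x = truncate(Fraction(1, 3), bits)
    for _ in range(t):
        x = truncate(step(x), bits)
    return x


# the error on the state at time 5, for every round k that reaches time 5
errors_at_t5 = {k: abs(approximate_state(5, k) - exact_state(5)) for k in range(5, 41)}
```

at 5 bits of precision the state at time 5 is off by 1/32; at 40 bits the error is below 2^-38.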

🟢 shinji — but they're only an approximation of us! they're not the real thing!

🟡 ritsuko sighs — you don't know that. you could be the approximation, and you would be unable to tell. and so, we can work without uncountable sets of real numbers, since they're unnecessary to explain observations, and thus an unnecessary assumption to hold about reality.

🟢 shinji, frustrated — i guess. it still seems pretty contrived to me.

🟡 ritsuko — what else are you going to do? you're expressing things in math, which is made of discrete expressions and will only ever express countable quantities of stuff. there is no uncountableness to grab at and use.

🟣 misato — actually, can't we introduce turing jumps/halting oracles into this universal program? i heard that this lets us actually compute real numbers.

🟡 ritsuko — there's kind-of-a-sense in which that's true. we could say that the universal program has access to a first-degree halting oracle, or a 20th-degree; or maybe it runs for 1 step with a 1st degree halting oracle, then 2 steps with a 2nd degree halting oracle, then 3 with 3, and so on.

🟡 ritsuko — your program is now capable, at any time step, of computing an infinite amount of stuff. let's say one of those steps happens to run an entire universe of stuff, including a copy of us. how do you sub-allocate realityfluid? how much do we expect to be in there? you could allocate sub-compute-steps — with a 1st degree halting oracle executing at step $n \geq 1$, you allocate $\frac{1}{n^2 m^2}$ realityfluid to each of the infinite sub-steps $m \geq 1$ in the call to the halting-oracle. you're just doing discrete realityfluid allocation again, except now some of the realityfluid in your universe is allocated to people who have obtained results from a halting oracle.

🟡 ritsuko — this works, but what does it get you? assuming halting oracles is kind of a very strange thing to do, and regular computation with no halting oracles is already sufficient to explain this universe. so we don't. but sure, we could.

🟢 shinji ruminates, unsure where to go from there.

🟣 misato interrupts — hey, do we really need to cover this? let's say you found out that this whole view of things is wrong. could you fix your math then, to whatever is the correct thing?

🟡 ritsuko waves around — what?? what do you mean if it's wrong?? i'm not rejecting the premise that i might be wrong here, but like, my answer here depends a lot on in what way i'm wrong and what is the better / more likely correct thing. so, i don't know how to answer that question.

🟣 misato snaps shinji back to attention — that's fair enough, i guess. well, let's get back on track.

6. precursor assistance

🟡 ritsuko — so, one insight i got for my alignment idea came from PreDCA, which stands for Precursor Detection, Classification, and Assistance. it consists of mathematizations for:

  • the AI locating itself within possibilities
  • locating the high-agenticness-thing which had lots of causation-bits onto itself — call it the "Precursor". this is supposed to find the human user who built/launched the AI. (Detection)
  • bunch of criteria to ensure that the precursor is the intended human user and not something else (Classification)
  • extrapolating that precursor's utility function, and maximizing it (Assistance)

🟣 misato — what the hell kind of math would accomplish that?

🟡 ritsuko — well, it's not entirely clear to me. some of it is explained, other parts seem like they're expected to just work naturally. in any case, this isn't so important — the "Learning Theoretic Agenda" into which PreDCA fits is not fundamentally similar to mine, and i do not expect it to be the kind of thing that saves us in time. as far as i predict, that agenda has purchased most of the dignity points it will have cashed out when alignment is solved, when it inspired my own ideas.

🟢 shinji — and your agenda saves us in time?

🟡 ritsuko — a lot more likely so, yes! for one, i am not trying to build an entire theory of intelligence and machine learning, and i'm not trying to develop an elegant new form of bayesianism whose model of the world has concerning philosophical ramifications which, while admittedly possibly only temporary, make me concerned about the coherency of the whole edifice. what i am trying to do, is hack together the minimum viable world-saving machine about which we'd have enough confidence that launching it is better expected value than not launching it.

🟡 ritsuko — anyways, the important thing is that that idea made me think "hey, what else could we do to make even more sure the selected precursor is the human user we want, and not something else like a nearby fly or the process of evolution?" and then i started to think of some clever schemes for locating the AI in a top-down view of the world, without having to decode physics ourselves, but rather by somehow pointing to the user "through" physics.

🟣 misato — what does that mean, exactly?

🟡 ritsuko — well, remember how PreDCA points to the user from-the-top-down? the way it tries to locate the user is by looking for patterns, in the giant computation of the universe, which satisfy these criteria. this fits in the general notion of generalized computation interpretability, which is fundamentally needed to care about the world because you want to detect not just simulated moral patients, but arbitrarily complexly simulated moral patients. so, you need this anyways, and it is what "looking inside the world to find stuff, no matter how it's encoded" looks like.

🟣 misato — and what sort of patterns are we looking for? what are the types here?

🟡 ritsuko — as far as i understand, PreDCA looks for programs, or computations, which take some input and return a policy. my own idea is to locate something less abstract, about which we can actually have information-theoretic guarantees: bitstrings.

🟣 misato — …just raw bitstrings?

🟡 ritsuko — that's right. the idea here is kinda like doing an incantation, except the incantation we're locating is a very large piece of data which is unlikely to be replicated outside of this world. imagine generating a very large (several gigabytes) file, and then asking the AI "look for pieces of information, in the set of all computations, which look like that pattern." we call such bitstrings "blobs": they serve as anchors to find our world and location-within-it in the set of possible world-states and locations-within-them.
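a minimal sketch of what generating such a blob could look like, assuming python's `secrets` module as the source of randomness. the name `make_blob` and the small default size are illustrative only; the dialogue imagines a file of several gigabytes.

```python
import secrets


def make_blob(n_bytes: int = 1024) -> bytes:
    """Return n_bytes of cryptographically strong random data."""
    return secrets.token_bytes(n_bytes)


# illustrative size only: 1024 bytes rather than gigabytes
blob = make_blob()
```

the point of using this much randomness is that two independently generated blobs essentially never coincide, so a match found somewhere is strong evidence of having found this particular blob.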

7. blob location

🟡 ritsuko — for example, let's say the universe is a conway's game of life. then, the AI could have a set of hypotheses as programs which take as input the entire state of the conway's game of life grid at any instant, and return a bitstring which must be equal to the blob.

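🟡 ritsuko — first, we define $\Omega := \{\omega \mid \omega \in \mathcal{P}(\mathbb{Z}^2), \#\omega \in \mathbb{N}\}$ (uppercase omega, a set of lowercase omega) as the set of "world-states" — states of the grid, defined as the set of cell positions whose cell is alive.

🟢 shinji — what's $\mathcal{P}$ and $\#$?

🟡 ritsuko — $\mathbb{Z}^2$ is the set of pairs whose elements are both a member of $\mathbb{Z}$, the set of relative integers (that is, integers of either sign). so $\mathbb{Z}^2$ is the set of pairs of relative integers — that is, grid coordinates. then, $\mathcal{P}(\mathbb{Z}^2)$ is the set of subsets of $\mathbb{Z}^2$. finally, $\#\omega$ is the size of set $\omega$ — requiring that $\#\omega \in \mathbb{N}$ is akin to requiring that $\omega$ is a finite set, rather than infinite. let's also define:

  • $\mathbb{B} = \{\top, \bot\}$ as the set of booleans
  • $\mathbb{B}^*$ as the set of finite bitstrings
  • $\mathbb{B}^n$ as the set of bitstrings of length $n$
  • $|b|$ as the length of bitstring $b$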
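as a concrete illustration of these world-states, here is a small sketch that represents a game of life grid as a finite set of live cell coordinates and computes one step of the rules. the names `step` and `blinker` are assumptions made for this example and are not part of the formalism.

```python
from itertools import product


def step(world):
    """One Game of Life step on a finite set of live (x, y) cells."""
    neighbor_counts = {}
    for (x, y) in world:
        for dx, dy in product((-1, 0, 1), repeat=2):
            if (dx, dy) != (0, 0):
                cell = (x + dx, y + dy)
                neighbor_counts[cell] = neighbor_counts.get(cell, 0) + 1
    # a cell is alive next step with 3 neighbors, or with 2 if already alive
    return frozenset(
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in world)
    )


# a vertical "blinker", which flips between vertical and horizontal
blinker = frozenset({(0, -1), (0, 0), (0, 1)})
```

a world-state here is just a `frozenset` of integer pairs, which mirrors "a finite subset of grid coordinates".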

🟡 ritsuko — what do you think "locate blob $b \in \mathbb{B}^*$ in world-state $\omega \in \Omega$" could look like, mathematically?

🟣 misato — let's see — i can use the set of bitstrings of same length as $b$, which is $\mathbb{B}^{|b|}$. let's build a set of $f \in \Omega \to \mathbb{B}^{|b|}$…

🟢 shinji — wait, $\Omega \to \mathbb{B}^{|b|}$ is the set of functions from $\Omega$ to $\mathbb{B}^{|b|}$. but we were talking about programs from $\Omega$ to $\mathbb{B}^{|b|}$. is there a difference?

🟡 ritsuko — this is a very good remark, shinji! indeed, we need to do a bit more work; for now we'll just posit that for any sets $A, B$, $A \xrightarrow{H} B$ is the set of always-halting, always-succeeding programs taking as input an $A$ and returning a $B$.

🟣 misato — let's see — what about $\{f \mid f \in \Omega \xrightarrow{H} \mathbb{B}^{|b|}, f(\omega) = b\}$?

🟡 ritsuko — you're starting to get there — this is indeed the set of programs which return $b$ when taking $\omega$ as input. however, it's merely a set — it's not very useful as is. what we'd really want is a distribution over such functions. not only would this give a weight to different functions, but summing over the entire distribution could also give us some measure of "how easy it is to find $b$ in $\omega$". remember the definition of distributions, $\Delta_X$?

🟢 shinji — oh, i remember! it's the set of functions in $X \to [0;1]$ which sum up to at most one over all of $X$.

🟡 ritsuko — indeed! so, we're gonna posit what i'll call kolmogorov simplicity, $K^-_X \in \Delta_X$, which is like kolmogorov complexity except that it's a distribution, never returns 0 nor 1 for a single element, and importantly it returns something like the inverse of complexity. it gives some amount of "mass" to every element in some (countable) set $X$.

🟣 misato — oh, i know then! the distribution, for each $f \in \Omega \xrightarrow{H} \mathbb{B}^{|b|}$, must return $K^-(f)$ if $f(\omega) = b$, and $0$ otherwise.

🟡 ritsuko — that's right! we can start to define $\mathrm{Loc}_n \in \Omega \times \mathbb{B}^n \to \Delta_{\Omega \xrightarrow{H} \mathbb{B}^n}$ as the function that takes as input a pair of world-state $\omega \in \Omega$ and blob $b \in \mathbb{B}^n$ of length $n$, and returns a distribution over programs that "find" $b$ in $\omega$. plus, since functions $f$ are weighed by their kolmogorov simplicity, for complex $b$'s they're "encouraged" to find the bits of complexity of $b$ in $\omega$, rather than those bits of complexity being contained in $f$ itself.

🟡 ritsuko — note also that this distribution over $\Omega \xrightarrow{H} \mathbb{B}^n$ returns, for any function $f$, either $K^-(f)$ or $0$, which entails that for any given $\omega, b$, the sum of $\mathrm{Loc}_n(\omega, b)(f)$ for all $f$'s sums up to less than one — that sum represents in a sense "how hard it is to find $b$ in $\omega$" or "the probability that $b$ is somewhere in $\omega$".

🟡 ritsuko — the notation here, $\mathrm{Loc}_n(\omega, b)(f)$, is because $\mathrm{Loc}_n(\omega, b)$ returns a distribution in $\Delta_{\Omega \xrightarrow{H} \mathbb{B}^n}$, which is itself a function $(\Omega \xrightarrow{H} \mathbb{B}^n) \to [0;1]$ — so we apply $\mathrm{Loc}_n$ to $\omega, b$, and then we sample the resulting distribution on $f$.

🟢 shinji — "the sum represents"? what do you mean by "represents"?

🟡 ritsuko — well, it's the concept which i'm trying to find a "true name" for, here. "how much is the blob $b$ located in world-state $\omega$? well, as much as the sum of the kolmogorov simplicity of every program that returns $b$ when taking $\omega$ as input".
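the sum described above is over all always-halting programs, which is not something one can compute. purely as a toy illustration, the sketch below replaces "all programs" with a tiny hypothetical family (read `n` cells rightwards from some origin) and replaces kolmogorov simplicity with a hand-picked weight that favors origins near `(0, 0)`. every name here (`read_row`, `simplicity`, `location_mass`) is an assumption made for this example.

```python
from itertools import product


def read_row(world, x, y, n):
    """Toy 'program': read n cells rightwards from (x, y) as a tuple of bools."""
    return tuple((x + i, y) in world for i in range(n))


def simplicity(x, y):
    """Stand-in for kolmogorov simplicity: nearer origins get more mass.
    Summed over all integer pairs this totals 25/36, so it stays below 1."""
    return 4.0 ** -(abs(x) + abs(y) + 1)


def location_mass(world, blob, radius=8):
    """Sum the simplicity of every toy program that outputs blob on world."""
    n = len(blob)
    return sum(
        simplicity(x, y)
        for x, y in product(range(-radius, radius + 1), repeat=2)
        if read_row(world, x, y, n) == blob
    )


world = frozenset({(0, 0), (1, 0), (3, 0)})
blob = (True, True, False, True)
```

with this world, only the program reading from `(0, 0)` returns the blob, so `location_mass(world, blob)` is that single program's weight, while a blob that appears nowhere gets a mass of zero.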

🟣 misato — and then what? i feel like my understanding of how this ties into anything is still pretty loose.

🟡 ritsuko — so, we're actually gonna get two things out of the blob location function $\mathrm{Loc}_n$: we're gonna get how much $\omega$ contains $b$ (as the sum of $\mathrm{Loc}_n(\omega, b)(f)$ for all $f$'s), but we're also gonna get how to get another world-state that is like $\omega$, except that $b$ is replaced with something else.

🟢 shinji — how are we gonna get that??

🟡 ritsuko — here's my idea: we're gonna make $f$ return not just a blob in $\mathbb{B}^n$ but rather a pair in $\mathbb{B}^n \times \mathbb{B}^*$ — a pair of the blob and a "free bitstring" $\tau$ (tau) which it can use to store "everything in the world-state except $b$". and we'll also sample programs $g \in \mathbb{B}^n \times \mathbb{B}^* \xrightarrow{H} \Omega$ which "put the world-state back together" given the same free bitstring, and a possibly different counterfactual blob than $b$.

🟣 misato — so, for $\omega, b$, $\mathrm{Loc}_n$ is defined as something like…

🟢 shinji stares at the math for a while — actually, shouldn't the statement be more general? you don't just want $g$ to work on $b$, you want $g$ to work on any other blob of the same length.

🟡 ritsuko — that's correct shinji! let's call the original blob $b$ the "factual blob", let's call other blobs of the same length we could insert in its stead "counterfactual blobs" and write them as $b'$ — we can establish that $'$ (prime) will denote counterfactual things in general.

🟣 misato — so it's more like…

🟣 misato — … and $f(g(b', \tau))$ should equal what, exactly?

🟡 ritsuko — we don't know what it should equal, but we do know something about what it equals: $f$ should work on that counterfactual and find the same counterfactual blob again.
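to make the deconstruct/reconstruct idea concrete, here is a toy sketch where a world-state is just a byte string and the blob sits at a known offset. this is a deliberate simplification: in the dialogue the programs are sampled rather than handwritten, and the free part is a bitstring rather than a python tuple. the names `deconstruct` and `reconstruct` are assumptions for this example.

```python
def deconstruct(world: bytes, offset: int, n: int):
    """Split a world-state into (blob, tau), where tau holds everything else."""
    blob = world[offset:offset + n]
    tau = (offset, world[:offset], world[offset + n:])
    return blob, tau


def reconstruct(blob: bytes, tau):
    """Rebuild a world-state from a (possibly counterfactual) blob and tau."""
    _offset, before, after = tau
    return before + blob + after


world = b"....QUESTION...."
blob, tau = deconstruct(world, 4, 8)

# insert a counterfactual blob of the same length in place of the factual one
counterfactual_world = reconstruct(b"question", tau)
```

the property asked for in the dialogue shows up here as: deconstructing `counterfactual_world` at the same place yields the counterfactual blob again, along with the same free part.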

🟡 ritsuko — actually, let's make be merely a distribution over functions that produce counterfactual world-states from counterfactual blobs — let's call those "counterfactual insertion functions" and denote them and their set (gamma) — and we'll encapsulate away from the rest of the math:

🟢 shinji — isn't a bit circular?

🟡 ritsuko — well, yes and no. it leaves a lot of degrees of freedom to and , perhaps too much. let's say we had some function — let's not worry about how it works. then could weigh each "blob location" by how much counterfactual world-states are similar, when sampled over all counterfactual blobs.

🟣 misato — maybe we should also constrain the programs for how long they take to run?

🟡 ritsuko — ah yes, good idea. let's say that for and , is how long it takes to run program on input , in some amount of steps each doing a constant amount of work — such as steps of compute in a turing machine.

🟡 ritsuko — (i've also replaced with since that's shorter and they're equal anyways)

🟣 misato — where does the first sum end, exactly?

🟡 ritsuko — it applies to the whole– oh, you know what, i can achieve the same effect by flattening the whole thing into a single sum. and renaming the in to to avoid confusion.

🟢 shinji — are we still operating in conway's game of life here?

🟡 ritsuko — oh yeah, now might be a good time to start generalizing. we'll carry around not just world-states , but initial world-states (alpha). those are gonna determine the start of universes — distributions of world-states being computed-over-time — and we'll use them when we're computing world-states forwards or comparing the age of world-states. for example probably needs this, so we'll need to pass it to which will now be of type :

8. constrained mass notation

🟢 shinji — i notice that you're multiplying together your "kolmogorov simplicities" and and now divided by a sum of how long they take to run. what's going on here exactly?

🟡 ritsuko — well, each of those numbers is a "confidence amount" — a scalar between 0 and 1 that says "how much does this iteration of the sum capture the thing we want", like a probability. multiplication is like the logical operator "and", except for confidence amounts, you know.

🟢 shinji — ah, i see. so these sums do something kinda like "expected value" in probability?

🟡 ritsuko — something kinda like that. actually, this notation is starting to get unwieldy. i'm noticing a bunch of this pattern:

🟣 misato — so, if you want to use the standard probability theory notations, you need random variables which–

🟡 ritsuko — ugh, i don't like random variables, because the place at which they get substituted for the sampled value is ambiguous. here, i'll define my own notation:

🟡 ritsuko — $\mathbf{M}$ will stand for "constrained mass", and it's basically syntactic sugar for sums, where $\mathbf{M}_{x:X}$ means "sum over $x \in \mathrm{dom}(X)$ (where $\mathrm{dom}$ returns the set of arguments over which a function is defined), and then multiply each iteration of the sum by $X(x)$". now, we just have to define uniform distributions over finite sets as…

🟢 shinji — $\mathrm{Uniform}_X(x) := \frac{1}{\#X}$ for finite set $X$?

🟡 ritsuko — that's it! and now, is much more easily written down:

🟢 shinji — huh. you know, i'm pretty skeptical of you inventing your own probability notations, but this is much more readable, when you know what you're looking at.
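for readers who think better in code, the "constrained mass" idea can be sketched as a weighted sum over a distribution stored as a dictionary. this assumes distributions are finite python dicts mapping elements to masses; the names `constrained_mass` and `uniform` are made up for this example.

```python
def constrained_mass(dist, body):
    """Sum body(x) * dist[x] over every x on which dist is defined."""
    return sum(mass * body(x) for x, mass in dist.items())


def uniform(finite_set):
    """Uniform distribution over a finite, non-empty set."""
    return {x: 1 / len(finite_set) for x in finite_set}


die = uniform({1, 2, 3, 4, 5, 6})

# a single constrained mass behaves like an expected value
expected_roll = constrained_mass(die, lambda x: x)

# nesting two of them, with a 0-or-1 body acting as a constraint
p_sum_is_seven = constrained_mass(
    die, lambda x: constrained_mass(die, lambda y: 1 if x + y == 7 else 0)
)
```

nesting calls plays the role of writing several bindings under one sum, and a body that returns 0 or 1 plays the role of a constraint.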

🟣 misato — so, are we done here? is this blob location?

🟡 ritsuko — well, i expect that some things are gonna come up later that are gonna make us want to change this definition. but right now, the only improvement i can think of is to replace the separate simplicities of the two programs with the simplicity of the pair of them.

🟣 misato — huh, what's the difference?

🟡 ritsuko — well, now we're sampling both programs from kolmogorov simplicity at the same time, which means that if there is some large piece of information that they both use, they won't be penalized for using it twice but only once — a tuple containing two elements which have a lot of information in common only has that information counted once by kolmogorov simplicity.

🟣 misato — and we want that?

🟡 ritsuko — yes! there are some cases where we'd want two mathematical objects to have a lot of information in common, and other places where we'd want them to not need to be dissimilar. here, it is clearly the former: we want the program that "deconstructs" the world-state into blob and everything-else, and the function that "reconstructs" a new world-state from a counterfactual blob and the same everything-else, to be able to share information as to how they do that.
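kolmogorov complexity is uncomputable, but the "shared information is only counted once" effect can be illustrated with an ordinary compressor as a crude stand-in. the sketch below assumes `zlib` as that stand-in; `f_like` and `g_like` are hypothetical byte strings playing the role of two programs that share a large common part.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Crude, computable stand-in for description length."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)
shared = bytes(rng.getrandbits(8) for _ in range(20_000))

# two objects that have a lot of (incompressible) information in common
f_like = shared + b"deconstruct"
g_like = shared + b"reconstruct"

separate = compressed_size(f_like) + compressed_size(g_like)
together = compressed_size(f_like + g_like)
```

scoring the two objects separately pays for the shared part twice, while scoring them as one object pays for it roughly once, so `together` comes out well below `separate`.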

9. what now?

🟢 shinji — so we've put together a true name for "piece of data in the universe which can be replaced with counterfactuals". that's pretty nifty, i guess, but what do we do with it?

🟡 ritsuko — now, this is where the core of my idea comes in: in the physical world, we're gonna create a random unique enough blob on someone's computer. then we're going to, still in the physical world, read its contents right after generating it. if it looks like a counterfactual (i.e. if it doesn't look like randomness) we'll create another blob of data, which can be recognized by blob location as an answer.

🟢 shinji — what does that entail, exactly?

🟡 ritsuko — we'll have created a piece of real, physical world, which lets us use blob location to get the true name, in pure math, of "what answer would that human person have produced to this counterfactual question?"

🟣 misato — hold on — we already have this. the AI can already have an interface where it asks a human user something, and waits for our answer. and the problem with that is that, obviously, the AI hijacks us or its interface to get whatever answer makes its job easiest.

🟡 ritsuko — aha, but this is different! we can point at a counterfactual question-and-answer chunk-of-time (call it "question-answer counterfactual interval", or "QACI") which is before the AI's launch, in time. we can mathematically define it as being in the past of the AI, by identifying the AI with some other blob which we'll also locate using blob location, and demand that the blob identifying the AI be causally after the user's answer.

🟣 misato — huh.

🟡 ritsuko — that's another idea i got from PreDCA — making the AI pursue the values of a static version of its user in its past, rather than its user-over-time.

🟢 shinji — but we don't want the AI to lock-in our values, we want the AI to satisfy our values-as-they-evolve-over-time, don't we?

🟣 misato — well, shinji, there's multiple ways to phrase your mistake, here. one is that, actually, you do — but if you're someone reasonable, then the values you endorse are some metaethical system which is able to reflect and learn about what's good, and to let people and philosophy determine what can be pursued.

🟣 misato — but you do have values you want to lock in. your meta-values, your metaethics, you don't want those to be able to change arbitrarily. for example, you probably don't want to be able to become someone who wants everyone to maximally suffer. those endorsed, top-level, metaethics meta-values, are something you do want to lock in.

🟡 ritsuko — put it another way: if you're reasonable, then if the AI asks you what you want inside the question-answer counterfactual interval, you won't answer "i want everyone to be forced to watch the most popular TV show in 2023". you'll answer something more like "i want everyone to be able to reflect on their own values and choose what values and choices they endorse, and how, and that the field of philosophy can continue in these ways in order to figure out how to resolve conflicts", or something like that.

🟣 misato — wait, if the AI is asking the user counterfactual questions, won't it ask the user whatever counterfactual question brainhacks the user into responding whatever answer makes its job easiest? it can just hijack the QACI.

🟡 ritsuko — aha, but we don't have to have the AI formulate the questions! we could do something like: make the initial question some static question like "please produce an action that saves the world", and then the user thinks about it for a bit, returns an answer, and that answer is fed back into another QACI to the user. this loops until one of the users responds with an answer which starts with a special string like "okay, i'm done for sure:", followed by a bunch of text which the AI will interpret as a piece of math describing a scoring over actions, and it'll try to output an action which maximizes that.
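the control flow of that loop can be sketched as follows. this is only an illustration of the looping and the sentinel prefix: `ask_user` stands in for one whole counterfactual question-answer interval, and `toy_user` is a made-up user who passes three notes along before signing off.

```python
DONE_PREFIX = "okay, i'm done for sure:"


def run_qaci_loop(ask_user, first_question, max_rounds=1000):
    """Feed each answer back in as the next question until the user signs off."""
    question = first_question
    for _ in range(max_rounds):
        answer = ask_user(question)
        if answer.startswith(DONE_PREFIX):
            # everything after the prefix is the final payload
            return answer[len(DONE_PREFIX):].strip()
        question = answer
    raise RuntimeError("the user never produced a final answer")


def toy_user(question):
    """Hypothetical stand-in for one interval of the user thinking."""
    if question.count("*") < 3:
        return question + "*"  # pass a slightly longer note to the next round
    return DONE_PREFIX + " score(action) = 0"
```

note that the AI never chooses a question here: the first question is fixed, and every later question is whatever the previous instance of the user wrote.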

🟢 shinji — so it's kinda like coherent extrapolated volition but for actions?

🟡 ritsuko — sure, i think of it as an implementation of CEV. it allows its user to run a long-reflection process. actually, that long-reflection process even has the ability to use a mathematical oracle.

🟣 misato — how does that work?

10. blob signing & closeness in time

🟡 ritsuko — so, let's define as a function, and this'll clarify what's going on. will be our initial random factual question blob. takes as parameter a blob location for the question — which, remember, comes in the form of a function you can use to produce counterfactual world-states with counterfactual blobs! — and a counterfactual question blob , and returns a distribution of possible answers . it's defined as:

🟡 ritsuko — we're, for now just positing, that there is a function (remember that defines a hypothesis for the initial state, and mechanics, of our universe) which, given a world-state, returns a distribution of world-states that are in its future. so this piece of math samples possible future world-states of the counterfactual world-state where was replaced with , and possible locations of possible answers in those world-states.

🟣 misato? what does that mean?

🟡 ritsuko — here, the fact that doesn't necessarily sum to 1 — we say that it doesn't normalize — means that summed up over all can be less than 1. in fact, this sum will indicate "how hard is it to find the answer in futures of counterfactual world-states ?" — and uses that as the distribution of answers.

🟣 misato — hmmm. wait, this just finds whichever-answers-are-the-easiest-to-find. what guarantees that looks like an answer at all?

🟡 ritsuko — this is a good point. maybe we should define something like which, to any input "payload" of a certain length, associates a blob which is actually highly complex, because embeds a lot of bits of complexity. for example, maybe (where is the "payload") concatenates together with a long cryptographic hash of and of some piece of information highly entangled with our world-state.

🟢 shinji — we're not signing the counterfactual question , only the answer payload ?

🟡 ritsuko — that's right. signatures matter for blobs we're finding; once we've found them, we don't need to sign counterfactuals to insert in their stead.
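one way such a signing step could look, as a sketch: append a long hash of the payload together with some world-specific data. `sign`, `extract_payload` and `WORLD_ENTROPY` are hypothetical names, and the constant below is a placeholder for "some piece of information highly entangled with our world-state". this is not a secure signature scheme in the cryptographic sense; it only illustrates making a blob hard to match by accident.

```python
import hashlib

# placeholder for data that is highly entangled with our world-state
WORLD_ENTROPY = b"hypothetical world-specific data"

DIGEST_SIZE = hashlib.sha512().digest_size  # 64 bytes


def sign(payload: bytes, world_entropy: bytes = WORLD_ENTROPY) -> bytes:
    """Concatenate the payload with a hash of payload + world-specific data."""
    digest = hashlib.sha512(payload + world_entropy).digest()
    return payload + digest


def extract_payload(blob: bytes, world_entropy: bytes = WORLD_ENTROPY):
    """Return the payload if the trailing hash checks out, else None."""
    payload, digest = blob[:-DIGEST_SIZE], blob[-DIGEST_SIZE:]
    if hashlib.sha512(payload + world_entropy).digest() == digest:
        return payload
    return None
```

a bitstring that merely looks like an answer fails the check, so only blobs carrying the extra bits of complexity are recognized as answers.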

🟣 misato — so, it seems to me like how works here, is pretty critical. for example, if it contains a bunch of mass at world-states where some AI is launched, whether ours or another, then that AI will try to fill its future lightcone with answers that would match various 's — so that our AI would find those answers instead of ours — and make those answers be something that maximize their utility function rather than ours.

🟡 ritsuko — this is true! indeed, how we sample for is pretty critical. how about this: first, we'll pass the distribution into :

🟡 ritsuko — …and inside , which is now of type , for any we'll only sample world-states which have the highest mass in that distribution:

🟡 ritsuko — the intent here is that for any way-to-find-the-blob , we only sample the closest matching world-states in time — which does rely on having higher mass for world-states that are closer in time. and hopefully, the result is that we pick enough instances of the signed answer blobs located shortly in time after the question blobs, that they're mostly dominated by the human user answering them, rather than AIs appearing later.

🟣 misato — can you disentangle the line where you sample ?

🟡 ritsuko — sure! so, we write an anonymous function — a distribution is a function, after all! — taking a parameter from the set , and returning . so this is going to be a distribution that is just like , except it's only defined for a subset of — those in .

🟡 ritsuko — in this case, is defined as such: first, take the set of elements for which . then, apply the distribution to all of them, and only keep elements for which they have the most (there can be multiple, if multiple elements have the same maximum mass!).
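the selection just described (keep the elements satisfying a condition, then keep only those with the most mass, ties included) can be sketched over a dictionary-based distribution. `restrict_to_max` and the toy world-state labels are assumptions for this example.

```python
def restrict_to_max(dist, keep):
    """Restrict dist to the elements passing keep that carry the most mass.
    Ties are all kept, so the result can contain several elements."""
    candidates = {x: mass for x, mass in dist.items() if keep(x)}
    if not candidates:
        return {}
    top = max(candidates.values())
    return {x: mass for x, mass in candidates.items() if mass == top}


# toy distribution over future world-states, nearer futures having more mass
futures = {"w1": 0.4, "w2": 0.4, "w3": 0.15, "w4": 0.05}
```

for instance, if the blob is only found in `w2` and `w3`, the restricted distribution keeps `w2` alone, and it keeps its original mass rather than being renormalized.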

🟡 ritsuko — oh, and i guess is redundant now, i'll erase it. remember that this syntax means "sum over the body for all values of for which these constraints hold…", which means we can totally have the value of be bound inside the definition of like this — it'll just have exactly one value for any pair of and .

11. QACI graph

🟢 shinji — why is returning a distribution over answers, rather than picking the single element with the most mass in the distribution?

🟡 ritsuko — that's a good question! in theory, it could be that, but we do want the user to be able to go to the next possible counterfactual answer if the first one isn't satisfactory, and the one after that if that's still not helpful, and so on. for example: in the piece of math which will interpret the user's final result as a math expression, we want to ignore answers which don't parse or evaluate as proper math of the intended type.

🟢 shinji — so the AI is asking the counterfactual past-user-in-time to come up with a good action-scoring function in… however long a question-answer counterfactual interval is.

🟡 ritsuko — let's say about a week.

🟢 shinji — and this helps… how, again?

🟡 ritsuko — well. first, let's posit , which tries to parse and evaluate a bitstring representing a piece of math (in some pre-established formal language) and returns either:

  • what it evaluates to if it is a member of
  • an empty set if it isn't a member of or fails to parse or evaluate

🟡 ritsuko — we then define as a function that returns the highest-mass element of the distribution for which returns a value rather than the empty set. we'll also assume for convenience , a convenience function which converts any mathematical object into a counterfactual blob . this isn't really allowed, but it's just for the sake of example here.
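a toy version of "evaluate, or return nothing" and "take the highest-mass answer that evaluates" might look like the sketch below. it assumes python literals as a stand-in for the pre-established formal language, uses an empty list in place of the empty set, and the names `eval_math` and `best_valid_answer` are made up for this example.

```python
import ast


def eval_math(text, expected_type):
    """Return [value] if text parses to a value of expected_type, else []."""
    try:
        value = ast.literal_eval(text)
    except Exception:
        # anything that fails to parse or evaluate counts as "no result"
        return []
    return [value] if isinstance(value, expected_type) else []


def best_valid_answer(answers, expected_type):
    """Highest-mass answer that evaluates to expected_type, else None."""
    for text, _mass in sorted(answers.items(), key=lambda kv: -kv[1]):
        result = eval_math(text, expected_type)
        if result:
            return result[0]
    return None


# toy distribution of answers: the two heaviest ones don't parse
answers = {"not math at all": 0.5, "[1, 2": 0.3, "42": 0.15, "7": 0.05}
```

here the two most likely answers are skipped because they do not evaluate, which is the reason for keeping a whole distribution of answers rather than a single one.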

🟣 misato — okay…

🟡 ritsuko — so, let's say the first call is . the user can return any expression, as their action-scoring function — they can return (a function taking an action and returning some utility measure over it), but they can also return where is the set of action-scoring functions. they get to call themselves recursively, and make progress in a sort of time-loop where they pass each other notes.

🟣 misato — right, this is the long-reflection process you mentioned. and about the part where they get a mathematical oracle?

🟡 ritsuko — so, the user can return things like:

.

🟣 misato — huh. that's nifty.

🟢 shinji — what if some weird memetic selection effects happen, or what if in one of the QACI intervals, the user randomly gets hit by a truck and then the whole scheme fails?

🟡 ritsuko — so, the user can set up giant acyclic graphs of calls to themselves, providing a lot of redundancy. that way, if any single node fails to return a coherent output, the next nodes can notice this and keep working with their peers' output.

🟡 ritsuko — a small graph of QACI can be accomplished with something like

🟡 ritsuko — and you can imagine how this could be extended to a larger structure such as:

12. fictional agents

🟢 shinji — wait a minute; what if the QACI user goes crazy from realizing that they're fictional, inside an AI's hypothesis, rather than in the physical world-state? especially if they have access to a goddamn math oracle.

🟡 ritsuko — ah. uuh. well, don't panic shinji, but, we are fictional. right now.

🟢 shinji starts flailing around.

🟣 misato — what the hell are you talking about?

🟡 ritsuko — misato, come on. we're inside a crazy high-tech compound designed to build and maintain giant robots to fight alien creatures, and we're discussing the entirely unrelated topic of AI alignment. oh, and, uh, there's this:

🔵 rei appears in front of them and waves at the camera.

🟣 misato — …so, why are we talking about alignment, instead of, i don't know, trying to escape?

🟡 ritsuko — misato, this is fiction, you don't escape. instead, you reason about what agents are writing your fiction. based on prior, we should vaguely guess that they're like us, and that the conversation we're having is useful for them to write a fiction about. my guess is, we're a fiction written by someone who's trying to communicate ideas about AI alignment and, for some godforsaken reason i cannot imagine, decided that we should be the characters having this conversation.

🟡 ritsuko — but don't be mistaken: if we are indeed fictional, then the future of this world we're observing barely matters. our parent world likely holds a lot more realityfluid. so we shou–

🟢 shinji, shaking — here comes this "realityfluid" bullshit again!

🟡 ritsuko — anyways, the coherent thing for fictional characters to do is to try to steer their parent world, because it's generally "realer". and generally, this is achieved by cooperating with whatever their parent world seems to be doing, so we should probably just continue to discuss AI alignment.

🟡 ritsuko — and, in a similar manner, there are people who grok these facts sufficiently well that we would expect them to take the correct actions, even upon realizing that they are the fictional consideration of a superintelligence. given the explanation i just gave, i would claim that i could do it myself, for one.

🔵 rei magically erases everyone's memory of this section of the dialogue, and then disappears, out of view from everyone.

13. top-level QACI call

🟢 shinji — so we have , the function which can call itself recursively. what's the top-level, terminal call to it which yields the action-scoring function?

🟡 ritsuko — ah, i think it'd look like:

🟡 ritsuko — where is some initial counterfactual blob, such as the plaintext string "please return a good scoring function over actions" encoded in ASCII, and then padded with zeros to be of the size needed for a blob. has type — from a question location, it returns a distribution of action-scoring functions.
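the encoding of that initial question is simple enough to sketch directly. `question_blob` and the blob size of 128 bytes are illustrative assumptions; the real blob would be far larger.

```python
def question_blob(text: str, blob_size: int) -> bytes:
    """Encode text as ASCII and pad it with zero bytes up to blob_size."""
    raw = text.encode("ascii")
    if len(raw) > blob_size:
        raise ValueError("question does not fit in the blob")
    return raw.ljust(blob_size, b"\x00")


first_question = question_blob(
    "please return a good scoring function over actions", 128
)
```

stripping the trailing zero bytes recovers the original text, so the user reading the counterfactual blob sees the plain question.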

🟣 misato — so like, the counterfactual user inside the call should be able to return math that calls more , but where do they get the and ?

🟢 shinji — couldn't they return the whole math?

🟡 ritsuko — ah, that's not gonna work — the chance of erroneous blob locations might accumulate too much if each does a new question location sampling; we want something more reliable. an easy solution is to evaluate the text not into a , but into a and to pass it so that the user can return a function which receives those and uses them to call .

🟡 ritsuko — actually, while we're at it, we can pass it a whole lot more things it might need…

🟢 shinji — what's going on with here?

🟡 ritsuko — oh, this is just a trick of how we implement distributions — when measuring the mass of any specific , we try to evaluate the answer payload into a function , and we only count the location when is equal to with useful parameters passed to it.

🟣 misato — what's around ? where do and come from?

🟡 ritsuko — so… remember this?

🟡 ritsuko — this is where we start actually plugging in our various parts. we'll assume some distribution over initial world-states and sample question locations in futures of those initial world-states — which will serve, for now, as the .

🟡 ritsuko — the actual AI we use will be of a type like , and so we can just call , and execute its action guess.

🟣 misato — and… that's it?

🟡 ritsuko — well, no. i mean, the whole fundamental structure is here, but there's still a bunch of work we should do if we want to increase the chances that this produces the outcomes we want.

14. location prior

🟡 ritsuko — so, right now each call to penalizes for being too kolmogorov-complex. we could take advantage of this by encouraging our two different blob locations — the question location and the answer location — to share bits of information, rather than coming up with their own, possibly different bits of information. this increases the chances that the question is located "in a similar way" to the answer.

🟣 misato — what does this mean, concretely?

🟡 ritsuko — well, for example, they could have the same bits of information for how to find bits of memory on a computer's memory on earth, encoded in our physics, and then the two different 's and functions would only differ in what computer, what memory range, and what time they find their blobs in.

🟡 ritsuko — for this, we'll define a set of "location priors" being sampled as part of the hypothesis that samples over — let's call it (xi). we might as well posit .

🟡 ritsuko — we'll also define a kolmogorov simplicity measure which can use another piece of information, as, let's see…

🟡 ritsuko — there we go, measuring the simplicity of the pair of the prior and the element favors information being shared between them.

🟣 misato — wait, this fails to normalize now, doesn't it? because not all of is sampled, only pairs whose first element is .

🟡 ritsuko — ah, you're right! we can simply normalize this distribution to solve that issue.
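one way that normalization could be written, as a sketch: assume $K^-$ is the kolmogorov simplicity of a pair, $\xi$ is the location prior, and $X$ is the set the element is drawn from. the subscripted name $K^-_{\xi}$ is a notation assumed for this illustration.

```latex
K^-_{\xi}(x) := \frac{K^-(\xi, x)}{\sum_{x' \in X} K^-(\xi, x')}
```

dividing by the total mass of all pairs whose first element is $\xi$ gives a distribution over $x$ alone which sums to one, while still favoring elements that share information with $\xi$.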

🟡 ritsuko — and in we'll simply add and then pass around to all blob locations:

🟡 ritsuko — finally, we'll use it in to sample from:

15. adjusting scores

🟡 ritsuko — here's an issue: currently in , we're weighing hypotheses by how hard it is to find both the question and the answer.

🟡 ritsuko — do you think that's wrong?

🟣 misato — i think we should first ask for how hard it is to find questions, and then normalize the distribution of answers, so that harder-to-find answers don't penalize hypotheses. the reasoning behind this is that we want QACI graphs to be able to do a lot of complicated things, and that we hope question location is sufficient to select what we want already.

🟡 ritsuko — ah, that makes sense, yeah! thankfully, we can just normalize right around the call to , before applying it to :

🟢 shinji — what happens if we don't get the blob locations we want, exactly?

🟡 ritsuko — well, it depends. there are two kinds of "blob mislocations": "naive" and "adversarial" ones. naive mislocations are hopefully not a huge deal; considering that we're doing average scoring over all scoring functions weighed by mass, hopefully the "signal" from our aligned scoring functions beats out the "noise" from locations that select the wrong thing at a random place, like "boltzmann blobs".

🟡 ritsuko — adversarial blobs, however, are tougher. i expect that they mostly result from unfriendly alien superintelligences, as well as earth-borne AI, both unaligned ones and ones that might result from QACI. against those, i hope that inside QACI we come up with some good decision theory that lets us not worry about that.

🟣 misato — actually, didn't someone recently publish some work on a threat-resistant utility bargaining function, called "Rose"?

🟡 ritsuko — oh, nice! well in that case, if is of type , then we can simply wrap it around all of :

🟡 ritsuko — note that we're putting the whole thing inside an anonymous -function, and assigning to the result of applying to that distribution.

16. observations

🟢 shinji — you know, i feel like there ought to be some better ways to select hypotheses that look like our world.

🟡 ritsuko — hmmm. you know, i do feel like if we had some "observation" bitstring (mu) which strongly identifies our world, like a whole dump of wikipedia or something, that might help — something like . but how do we tie that into the existing set of variables serving as a sampling?

🟣 misato — we could look for the question in futures of the observation world-state– how do we get that world-state again?

🟡 ritsuko — oh, if you've got the observation's blob location, you can reconstitute the factual observation world-state by inserting the factual observation back into it.

🟣 misato — in that case, we can just do:

🟡 ritsuko — oh, neat! actually, couldn't we generate two blobs and sandwich the question blob between the two?

🟣 misato — let's see here, the second observation can be

🟣 misato — how do i sample the location from both the future of and the past of ?

🟡 ritsuko — well, i'm not sure we want to do that. remember that tries to find the very first matching world-state for any . instead, how about this:

🟡 ritsuko — it's a bit hacky, but we can simply demand that "the world-state be in the future of the world-state more than the world-state is in the future of the world-state".

🟣 misato — huh. i guess that's… one way to do it.

🟢 shinji — could we encourage the blob location prior to use the bits of information from the observations? something like…

🟡 ritsuko — nope. because then, 's programs can simply return the observations as constants, rather than finding them in the world, which defeats the entire purpose.

🟣 misato — …so, what's in those observations, exactly?

🟡 ritsuko — well, is mostly just going to be with "more, newer content". but the core of it, , could be a whole lot of stuff. a dump of wikipedia, a callable of some LLM, whatever else would let it identify our world.

🟢 shinji — can't we just, like, plug the AI into the internet and let it gain data that way or something?

🟡 ritsuko — so there's like obvious security concerns here. but, assuming those were magically fixed, i can see a way to do that: could be a function or mapping rather than a bitstring, and while the AI would observe it as a constant, it could be lazily evaluated. including, like, could be a fully memoized function — such that the AI can't observe any mutable state — but it would still point to the world. in essence, this would make the AI point to the entire internet as its observation, though of course it would in practice be unable to obtain all of it. but it could navigate it just as if it was a mathematical object.
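
(a minimal python sketch of the "fully memoized function" idea above. it is only an illustration, not part of the QACI math: `make_lazy_observation`, `DriftingSource`, and the keys are hypothetical names, and the drifting source is a toy stand-in for the internet. the point it shows is that each key is looked up in the world at most once, after which the caller only ever sees a fixed value, so the observation behaves like a constant mapping with no visible mutable state.)

```python
from typing import Callable, Dict


def make_lazy_observation(fetch: Callable[[str], bytes]) -> Callable[[str], bytes]:
    """wrap a fetcher so that each key is evaluated at most once, then frozen."""
    cache: Dict[str, bytes] = {}

    def observe(key: str) -> bytes:
        if key not in cache:
            # first access: actually look at the world
            cache[key] = fetch(key)
        # every later access returns the exact same bytes
        return cache[key]

    return observe


class DriftingSource:
    """hypothetical stand-in for the internet: its answers change on every call."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, key: str) -> bytes:
        self.calls += 1
        return f"{key}@v{self.calls}".encode()


source = DriftingSource()
observation = make_lazy_observation(source)

first = observation("wikipedia/QACI")
second = observation("wikipedia/QACI")       # served from the cache
other = observation("wikipedia/Evangelion")  # new key, so one new lookup
```

from the caller's side, `observation` is indistinguishable from a huge constant table that it can only ever read part of, which is the property the lazily evaluated observation would need.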

🟣 misato — interesting. though of course, the security concerns make this probably unviable.

🟡 ritsuko — hahah. yeah. oh, and we probably want to pass inside :

17. where next

🟣 misato — so, is that it then? are we done?

🟡 ritsuko — hardly! i expect that there's a lot more work to be done. but this is a solid foundation, and a direction to explore. it's kind of the only thing that feels like a path to saving the world.

🟢 shinji — you know, the math can seem intimidating at first, but actually it's not that complicated. one can figure out this math, especially if they get to ask questions in real time to the person who invented that math.

🟡 ritsuko — for sure! it should be noted that i'm not particularly qualified at this. my education isn't in math at all — i never really did math seriously before QACI. the only reason why i'm making the QACI math is that so far barely anyone else will. but i've seen at least one other person try to learn about it and come to understand it somewhat well.

🟢 shinji — what are some directions which you think are worth exploring, for people who want to help improve QACI?

🟡 ritsuko — oh boy. well, here are some:

  • find things that are broken about the current math, and ideally help fix them too.
  • think about utility function bargaining more — notably, perhaps scores are regularized, such as maybe by weighing ratings that are more "extreme" (further away from ) as less probable. alternatively, maybe scoring functions have a finite amount of "votestuff" that they get to distribute amongst all options the way a normalizing distribution does, or maybe we implement something kinda like quadratic voting?
  • think about how to make a lazily evaluated observation viable. i'm not sure about this, but it feels like the kind of direction that might help avoid unaligned alien AIs capturing our locations by bruteforcing blob generation using many-worlds.
  • generally figure out more ways to ensure that the blob locations match the world-states we want — both by improving and , and by finding more clever ways to use them — you saw how easy it was to add two blob locations for the two observations .
  • think about turning this scheme into a continuous rather than one-shot AI. (possibly exfohazardous, do not publish)
  • related to that, think about ways to make the AI aligned not just with regards to its guess, but also with regards to its side-effects, so as to avoid it wanting to exploit its way out. (possibly exfohazardous, do not publish)
  • alternatively, think about how to box the AI so that the output with regards to which it is aligned is its only meaningful source of world-steering.
  • one thing we didn't get into much is what could actually be behind , , and . you can read more about those here, but i don't have super strong confidence in the way they're currently put together. in particular, it would be great if someone who groks physics a lot more than me thought about whether many-worlds gives unaligned alien superintelligences the ability to forge any blob or observation we could put together in a way that would capture our AI's blob location.
  • maybe there are some ways to avoid this by tying the question world-state with the AI's action world-state? maybe implementing embedded agency helps with this? note that blob location can totally locate the AI's action, and use that to produce counterfactual action world-states. maybe that is useful. (possibly exfohazardous, do not publish)
  • think about and the function (see the full math post) and how to either implement it or achieve a similar effect otherwise. for example, maybe instead of relying on an expensive hash, we can formally define that need to be "consequentialist agents trying to locate the blob in the way we want", rather than any program that works.
  • think about how to make counterfactual QACI intervals resistant to someone launching unaligned superintelligence within them.
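
(to make the utility bargaining bullet above a bit more concrete, here is a minimal python sketch of two of the options it mentions: a fixed budget of "votestuff" shared across options, and something like quadratic voting. everything here is an assumption made for illustration: the function names, the two-option example, and the use of non-negative raw scores are not from the QACI math.)

```python
import math
from typing import Dict, List


def normalize_votestuff(scores: Dict[str, float]) -> Dict[str, float]:
    """rescale non-negative scores so they sum to 1: a fixed budget of votestuff."""
    total = sum(scores.values())
    if total <= 0 or any(s < 0 for s in scores.values()):
        raise ValueError("scores must be non-negative with a positive total")
    return {option: s / total for option, s in scores.items()}


def quadratic_votes(spend: Dict[str, float]) -> Dict[str, float]:
    """quadratic voting: v votes cost v**2 credits, so c credits buy sqrt(c) votes."""
    if any(c < 0 for c in spend.values()):
        raise ValueError("cannot spend negative credits")
    return {option: math.sqrt(c) for option, c in spend.items()}


def aggregate(ballots: List[Dict[str, float]]) -> Dict[str, float]:
    """sum the per-option weight over all scoring functions."""
    totals: Dict[str, float] = {}
    for ballot in ballots:
        for option, weight in ballot.items():
            totals[option] = totals.get(option, 0.0) + weight
    return totals


# one scorer with extreme ratings, one with moderate ratings (toy numbers)
extreme = {"A": 1000.0, "B": 0.0}
moderate = {"A": 1.0, "B": 3.0}

raw = aggregate([extreme, moderate])
budgeted = aggregate([normalize_votestuff(extreme), normalize_votestuff(moderate)])
quadratic = aggregate(
    [quadratic_votes(normalize_votestuff(s)) for s in (extreme, moderate)]
)
```

in the raw sum the extreme scorer dominates just by using bigger numbers. with a fixed budget each scorer has the same total influence, and under the quadratic rule concentrating the whole budget on one option buys fewer total votes than spreading it out.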

🟣 misato — ack, i didn't really think of that last one. yeah, that sounds bad.

🟡 ritsuko — yup. in general, i could also do with people who could help with inner-alignment-to-a-formal-goal, but that's a lot more hazardous to work on. hence why we have not talked about it. but there is work to be done on that front, and people who think they have insights should probly contact us privately and definitely not publish them. interpretability people are doing enough damage to the world as it is.

🟢 shinji — well, things don't look great, but i'm glad this plan is around! i guess it's something.

🟡 ritsuko — i know right? that's how i feel as well. lol.

🟣 misato — lmao, even.


15 comments
[-]Max H10mo1711

Probably not directly relevant to most of the post, but I think:
 

we don't know if "most" timelines are alive or dead from agentic AI, but we know that however many are dead, we couldn't have known about them. if every AI winter was actually a bunch of timelines dying, we wouldn't know.

Is probably false. 

It might be the case that humans are reliably not capable of inventing catastrophic AGI without a certain large minimum amount of compute, experimentation, and researcher thinking time which we have not yet reached. A superintelligence (or smarter humans) could probably get much further much faster, but that's irrelevant in any worlds where higher-intelligence beings don't already exist.

With hindsight and an inside-view look at past trends, you can retro-dict what the past of most timelines in our neighborhood probably looks like, and conclude that most of them have probably not yet destroyed themselves.

It may be that going forward this trend does not continue: I do think most timelines including our own are heading for doom in the near future, and it may be that the history of the surviving ones will be full of increasingly implausible development paths and miraculous coincidences. But I think the past is still easily explained without any weird coincidences if you take a gears-level look at the way SoTA AI systems actually work and how they were developed.

[-]Max H10moΩ240

we don't make an AI which tries to not be harmful with regards to its side-channels, such as hardware attacks — except for its output, it needs to be strongly boxed, such that it can't destroy our world by manipulating software or hardware vulnerabilities. similarly, we don't make an AI which tries to output a solution we like, it tries to output a solution which the math would score high. narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.

 

Another potential downside of this approach: it places a lot of constraints on the AI itself, which means it probably has to be strongly superintelligent to start working at all.

I think an important desideratum of any alignment plan is that your AI system starts working gradually, with a "capabilities dial" that you (and the aligned system itself) turn up just enough to save the world, and not more.

Intuitively, I feel like an aligned AGI should look kind of like a friendly superhero, whose superpower is weak superintelligence, superhuman ethics, and a morality which is as close as possible to the coherence-weighted + extrapolated average morality of all currently existing humans (probably not literally; I'm just trying to gesture at a general thing of averaging over collective extrapolated volition / morality / etc.).

Brought into existence, that superhero would then consider two broad classes of strategies:

  1.  Solve a bunch of hard alignment problems: embedded agency, stable self-improvement, etc. and then, having solved those, build a successor system to do the actual work.
  2. Directly do some things with biotech / nanotech / computer security / etc. at its current intelligence level to end the acute risk period. Solve remaining problems at its leisure, or just leave them to the humans.

From my own not-even-weakly superhuman vantage point, (2) seems like a much easier and less fraught strategy than (1). If I were a bit smarter, I'd try saving the world without AI or enhancing myself any further than I absolutely needed to.

Faced with the problem that the boxed AI in the QACI scheme is facing... :shrug:. I guess I'd try some self-enhancement followed by solving problems in (1), and then try writing code for a system that does (2) reliably. But it feels like I'd need to be a LOT smarter to even begin making progress.

Provably safely building the first "friendly superhero" might require solving some hard math and philosophical problems, for which QACI might be relevant or at least in the right general neighborhood. But that doesn't mean that the resulting system itself should be doing hard math or exotic philosophy. Here, I think the intuition of more optimistic AI researchers is actually right: an aligned human-ish level AI looks closer to something that is just really friendly and nice and helpful, and also super-smart.

(I haven't seen any plans for building such a system that don't seem totally doomed, but the goal itself still seems much less fraught than targeting strong superintelligence on the first try.)

[-]dr_s10mo44

when someone figures out how to make AI consequentialistically pursue a coherent goal, whether by using current ML technology or by building a new kind of thing, we die shortly after they publish it.

Provably false, IMO. What makes such AI deadly isn't its consequentialism, but its capability. Any such AI that:

  1. isn't smart enough to consistently and successfully deceive most humans, and

  2. isn't smart enough to improve itself

is containable and ultimately not an existential threat, just like a human consequentialist wouldn't be. We even have an example of this: someone rigged together ChaosGPT, an AutoGPT agent with the explicit goal of destroying humanity, and all it can do is mumble to itself about nuclear weapons. You could argue it's not pursuing its goal coherently enough, but that's exactly the point: it's too dumb. Self-improvement is the truly dangerous threshold. Unfortunately that's not a very high one (probably somewhere at the upper end of competent human engineers and scientists).

it's kind of like being the kind of person who, when observing having survived quantum russian roulette 20 times in a row, assumes that the gun is broken rather than saying "i guess i might have low quantum amplitude now" and fails to realize that the gun can still kill them — which is bad when all of our hopes and dreams rests on those assumptions

Yes, this is exactly the reason why you shouldn't update on "anthropic evidence" and base your assumptions on it. The example with quantum russian roulette is a bit of a loaded one (pun intended), but here is the general case:

You have a model of reality, and you gather some evidence which seems to contradict this model. Now you can either update your model, or double down on it, claiming that all the evidence is a bunch of outliers.

Updating on anthropics in such a situation is refusing to update your model when it contradicts the evidence. It's adopting an anti-laplacian prior while reasoning about life or death (survival or extinction) situations - going a bit insane specifically in the circumstances with the highest stakes possible.

[-]Signer10mo20

Still don't buy this "realityfluid" business. Certainly not in the "Born measure is measure of realness" sense. It's not necessary to conclude that some number is realness just because otherwise your epistemology doesn't work so well - it's not a law that you must find surprising finding yourself in a branch with high measure, when all branches are equally real. Them all being equally real doesn't contradict observations, it just means the policy of expecting stuff to happen according to Born probabilities gives you high Born measure of knowing about Born measure, not real knowledge about reality.

that's fair, but if "amount of how much this matters"/"amount of how much this is real" is not "amount of how much you expect to observe things", then how could we possibly determine what it is? (see also this)

[-]Signer10mo10

I think that expecting to observe things according to branch counting instead of Born probabilities is a valid choice. Anything bad happens if you do it only if you already care about Born measure.

But if the question is "how do you use observations to determine what's real" then - indirectly by using observations to figure out that QM is true? Not sure if even this makes sense without some preference for high measure, but maybe it does. Maybe by only excluding the possibility of your branch not existing, once you observe it? And valuing the measure of you indirectly knowing about realness of everything is not incoherent too ¯\(ツ)/¯. I'm more in the "advocating for people to figure it out in more detail" stage, than having any answers^^.

[-][anonymous]10mo20
[This comment is no longer endorsed by its author]
[-]Signer10mo10

you should rationally expect to observe things according to them

I disagree. It's only rational if you already value having high Born measure. Otherwise what bad thing happens if you expect to observe every quantum outcome with equal probability? It's not that you would be wrong. It's just that Born measure of you in the state of being wrong will be high. But no one forces you to care about that. And other valuable things, like consciousness, work fine with arbitrary low measure.

You have to speak of a “happening” density over a continuous space the points of which are branches.

Yeah, but why can't you use uniform density? Or I don't know, I'm bad at math, maybe something else analogous to branch counting in the discrete case. And you would need to somehow define "you" and other parts of your preferences in terms of continuous space anyway - there is no reason this definition has to involve Born measure.

[-][anonymous]10mo00
[This comment is no longer endorsed by its author]
[-]Signer10mo10

I'm not against distributions in general. I'm just saying that conditional on MWI there is no uncertainty about quantum outcomes - they all happen.

if you prepared a billion such qubits in a lab and measured them all, the number of 0s would be in the vicinity of 360 million with virtual certainty.

But that's not what the (interpretation of the) equations say(s). The equations say that all sequences of 0s and 1s exist and you will observe all of them.

it’ll end up either in epistemic probabilities that concord with long-run empirical frequencies

They only concord with long-run empirical frequencies in regions of configuration space with high Born measure. They don't concord with, for example, average frequencies across all observers of the experiment.

For instance, if I find an atom with a 1m half-life and announce to the world that I’ll blow up the moon when it decays, and you care about the moon enough to take a ten-minute Uber to my secret base but aren’t sure whether you should pay extra for a seven-minute express ride, the optimal decision requires determining whether the extent to which riding express decreases my probability of destroying the moon is, when multiplied by your valuation of the moon, enough to compensate for the cost of express.

The point is there is no (quantum-related) uncertainty about moon being destroyed - it will be destroyed and also will be saved. My actions then should depend on how I count/weight moons across configuration space. And that choice of weights depends on arbitrary preferences. I may as well stop caring about the moon after two days.

Does the one-shot AI necessarily aim to maximize some function (like the probability of saving the world, or the expected "savedness" of the world or whatever), or can we also imagine a satisficing version of the one-shot AI which "just tries to save the world" with a decent probability, and doesn't aim to do any more, i.e., does not try to maximize that probability or the quality of that saved world etc.?

I'm asking this because

  • I suspect that we otherwise might still make a mistake in specifying the optimization target and incentivize the one-shot AI to do something that "optimally" saves the world in some way we did not foresee and don't like.
  • I try to figure out whether your plan would be hindered by switching from an optimization paradigm to a satisficing paradigm right now in order to buy time for your plan to be put into practice :-)
[-]O O10mo00

Maybe I am getting hung up on the particular wording, but are we assuming our agent has arbitrary computation power when we say they have a top down model of the universe? Is this a fair assumption to make or does this arbitrarily constrain our available actions?

where did you get to in the post? i believe this is addressed afterwards.

[-]O O10mo10

Is it?

It seems like this assumption is used later on.

For example, I am a little confused by the reality fluid section, but if it’s just the probability an output is real, I feel like we can’t just arbitrarily decide on 1/n^2 (justifying it by Occam’s razor doesn’t seem very mathematical and this is counterintuitive to real life). This seems to give our program arbitrary amounts of precision.

Furthermore, associating polynomial computational complexity with this measure of realness and NP with unrealness also seems very odd to me. There are many simple P programs that are incomputable and NP outputs can correspond with realness. I’m not sure if I’m just wholly misunderstanding this section, but the justification for all this is just odd: are we essentially assuming that because reality exists, it must be computable?

Intuitively simulating the universe with a quantum computer seems very hard as well. Don’t see why it would be strange for it to be hard. I am not qualified to evaluate that claim, but it seems extraordinary enough to require someone with the background to chime in.

Furthermore, don’t really see how you can practically get an Oracle with Turing jumps.

I’m not sure how important this math is for the rest of the section, but it seems like we use this oracle to answer questions.