
Thanks for your patience, I think this is important and helpful to talk through (hope it's as helpful for you as for me!)

Let me introduce two pieces of terminology I made up. First, the thing I mentioned above:

• Non-optimization means that "an action leading to a "good" consequence (according to a predetermined criterion) happens no more often than chance" (e.g. a rock)
• Level-1 optimization means "an action leading to a "good" consequence happens no more often than chance at first, but once it's stumbled upon, it tends to be repeated in the future". (e.g. bacteria)
• Level-2 optimization means "an action leading to a "good" consequence is taken more often than chance from the start, because of foresight and planning". (e.g. human)
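To make the three categories concrete, here is a toy sketch. Everything in it is an illustrative assumption rather than part of the definitions above: a ten-action environment, a single "good" action, and a Level-2 optimizer that is simply handed an accurate internal model.

```python
import random

ACTIONS = list(range(10))
GOOD_ACTION = 7  # the predetermined criterion: only this action counts as "good"


def non_optimizer(rng):
    """A rock: picks an action uniformly at random and never learns."""
    return rng.choice(ACTIONS)


class Level1Optimizer:
    """Starts at chance; once a good action is stumbled upon, it repeats it."""

    def __init__(self):
        self.known_good = None

    def act(self, rng):
        if self.known_good is not None:
            return self.known_good
        return rng.choice(ACTIONS)

    def observe(self, action, was_good):
        if was_good:
            self.known_good = action


def level2_optimizer(model):
    """Foresight: scores every action in an internal model before acting."""
    return max(ACTIONS, key=model)


def run(steps=1000, seed=0):
    rng = random.Random(seed)
    level1 = Level1Optimizer()
    # Assumption: the Level-2 planner's internal model happens to be accurate.
    model = lambda action: action == GOOD_ACTION
    hits = {"none": 0, "level1": 0, "level2": 0}
    for _ in range(steps):
        hits["none"] += non_optimizer(rng) == GOOD_ACTION
        action = level1.act(rng)
        good = action == GOOD_ACTION
        level1.observe(action, good)
        hits["level1"] += good
        hits["level2"] += level2_optimizer(model) == GOOD_ACTION
    return hits


print(run())
```

The non-optimizer stays near the chance rate, the Level-1 optimizer starts at chance and then locks in, and the Level-2 optimizer is right from the first step because it never has to try the action in order to evaluate it.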

Second, when you run a program:

• Algorithm Land is where you find abstract mathematical entities like "variables", "functions", etc.
• Real World is that place with atoms and stuff.

Now, when you run a program, you can think of what's happening in Algorithm Land (e.g. a list of numbers is getting sorted) and what's happening in the Real World (e.g. transistors are switching on and off). It's really always going to be both at once.

And now let's simplify things greatly by putting aside the case of world-modeling programs, which have a (partial, low-resolution) copy of the Real World inside Algorithm Land. Instead, let's restrict our attention to a chess-playing program or any other non-world-modeling program.

Now, in this case, when we think about Level-2 optimization, the foresight and planning involved entail searching exclusively through causal pathways that are completely inside Algorithm Land. (Why? Because without a world model, it has no way to reason about Real-World causal pathways.) In this case, I say there isn't really anything much to worry about.

Why not worry? Think about classic weird AGI disaster scenarios. For example, the algorithm is optimizing for the "reward" value in register 94, so it hacks its RAM to overwrite the register with the biggest possible number, then seizes control of its building and the power grid to ensure that it won't get turned off, then starts building bigger RAMs, designing killer nanomachines, and on and on. Note that ALL those things (1) involve causal pathways in the Real World (even if the action and consequence are arguably in Algorithm Land) and (2) would be astronomically unlikely to occur by random chance (which is what happens without Level-2 optimization). (I won't say that nothing can go awry with Level-1 optimization—I have great respect for bacteria—but it's a much easier situation to keep under control than rogue Level-2 optimization through Real-World causal pathways.)

Again, things that happen in Algorithm Land are also happening in the Real World, but the mapping is kinda arbitrary. High-impact things in Algorithm Land are not high-impact things in the Real World. For example, using RAM to send out manipulative radio signals is high-impact in the Real World, but just a random meaningless series of operations in Algorithm Land. Conversely, an ingeniously-clever chess move in Algorithm Land is just a random activation of transistors in the Real World.

(You do always get Level-1 optimization through Real-World causal pathways, with or without a world model. And you can get Level-2 optimization through Real-World causal pathways, but a necessary requirement seems to be an algorithm with a world-model and self-awareness (i.e. knowledge that there is a relation between things in Algorithm Land and things in the Real World).)


Abstract: A self-unaware AI oracle is a system that builds a world-model inside which there is no concept of "me". I propose that such systems are safe to operate, that they can be plausibly built by continuing known AI development paths, and that they would be competitive with any other type of AI, capable of things like inventing new technology and safe recursive self-improvement.

Epistemic status: Treat as preliminary brainstorming; please let me know any ideas, problems, and relevant prior literature.

# Overview

My proposal is to build a system that takes in data about the world (say, all the data on the internet) and builds a probabilistic generative model of that data, in the sense of taking an arbitrary number of bits and then predicting one or more masked bits. If you want to imagine something more specific, here are three examples: (1) A hypercomputer using Solomonoff Induction; (2) Some future descendant of GPT-2, using a massive Transformer (or some better future architecture); (3) A neocortex-like algorithm that builds low-level predictive models of small segments of data, then a hierarchy of higher-level models that predict which of the lower-level models will occur in which order in different contexts.[1]
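As a very small illustration of "predict a masked bit from other bits", here is a sketch of a count-based context model. It is a drastic simplification of any of the three examples: the class name `BitContextModel`, the context length `k`, and the Laplace smoothing are all assumptions made for the sketch, and it only predicts a bit from the `k` bits that precede it.

```python
from collections import defaultdict


class BitContextModel:
    """Predicts a bit from the k bits that precede it, using smoothed counts."""

    def __init__(self, k=3):
        self.k = k
        # Laplace prior: one pseudo-count each for "next bit is 0" and "next bit is 1".
        self.counts = defaultdict(lambda: [1, 1])

    def train(self, bits):
        for i in range(self.k, len(bits)):
            context = tuple(bits[i - self.k:i])
            self.counts[context][bits[i]] += 1

    def prob_one(self, context):
        """Probability that the bit following `context` is 1."""
        zeros, ones = self.counts[tuple(context)]
        return ones / (zeros + ones)


data = [0, 1] * 50            # a perfectly periodic toy "world"
model = BitContextModel(k=3)
model.train(data)

p_seen = model.prob_one([0, 1, 0])    # context that occurred in the data
p_unseen = model.prob_one([1, 1, 1])  # context that never occurred
print(p_seen, p_unseen)
```

The model is confident where the data is regular and falls back to 50/50 on contexts it has never seen. Nothing in it refers to the machine it runs on; it is just a function from observed bits to a probability.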

Further, we design the system to be "self-unaware"—we want it to construct a generative world-model in which its own physical instantiation does not have any special status. The above (1-2) are examples, as is (3) if you don't take "neocortex-like" too literally (the actual neocortex can learn about itself by, say, controlling muscles and observing the consequences).

Finally, we query the system in a way that is compatible with its self-unawareness. For example, if we want to cure cancer, one nice approach would be to program it to search through its generative model and output the least improbable scenario wherein a cure for cancer is discovered somewhere in the world in the next 10 years. Maybe it would output: "A scientist at a university will be testing immune therapy X, and they will combine it with blood therapy Y, and they'll find that the two together cure all cancers". Then, we go combine therapies X and Y ourselves.

# What is self-unawareness?

In the context of predictive world-modeling systems, a self-unaware system (a term I just made up—is there an existing term?) is one that does not have itself (or any part of itself, or any consequences of itself) as specially-flagged entities in its world model.

Example of a self-aware system: A traditional RL agent. (Why? Because it has a special concept of "its own actions" represented in its models.)

Example of a self-unaware system: Any system that takes inputs, does a deterministic computation (with no further inputs from the environment), and spits out an output. (Why? Because when you correctly compute a computable function, you get the same answer regardless of where and whether the computation is physically instantiated in the universe.) (Edit to add: On second thought, this is wrong, according to the definition of self-unawareness that I'm using everywhere else. The "more examples" subsection is a better description of what I'm getting at.)

In one sense, ensuring that an AGI is self-unaware seems like it should be pretty easy; "The space of all computable functions" is a pretty big space to explore, and that doesn't even exhaust the possibilities! On the other hand, of course there are always pitfalls ... for example, if your code has a race condition, that's a side-channel potentially leaking information from the physical instantiation into the (symbolic) world model. Still, designing a system to be self-unaware seems pretty tractable, and maybe even amenable to formal verification (flag some variables as "part of the world-model" and other variables as "self-knowledge" and prevent them from intermingling, or something like that...).
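The "flag some variables and prevent them from intermingling" idea can be sketched as a runtime check. To be clear, this is not formal verification, only a toy version of the separation that a verifier would have to prove statically. The names `Tagged`, `WORLD`, `SELF`, and `IntermingleError` are made up for the example.

```python
from dataclasses import dataclass

WORLD = "world-model"
SELF = "self-knowledge"


class IntermingleError(Exception):
    """Raised when world-model data would be combined with self-knowledge."""


@dataclass(frozen=True)
class Tagged:
    value: float
    origin: str

    def __add__(self, other):
        if not isinstance(other, Tagged):
            return NotImplemented
        if self.origin != other.origin:
            raise IntermingleError(
                f"cannot combine {self.origin} with {other.origin}"
            )
        return Tagged(self.value + other.value, self.origin)


observed_temperature = Tagged(21.5, WORLD)
forecast_delta = Tagged(1.0, WORLD)
own_cpu_load = Tagged(0.93, SELF)   # a fact about the physical instantiation

prediction = observed_temperature + forecast_delta   # allowed: both are world data

try:
    observed_temperature + own_cpu_load               # blocked: would leak self-knowledge
    leaked = True
except IntermingleError:
    leaked = False
```

A check like this only covers explicit data flow. Side channels such as the race condition mentioned above would not pass through `__add__` at all, which is part of why they are the hard case.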

## More examples of how to think about self-unawareness

• If a self-unaware system is, at some moment, consolidating its knowledge of anthropology, it doesn't "know" that it is currently consolidating its knowledge of anthropology—this fact is not represented in the world-model it's building.
• If a self-unaware system is running on a particular supercomputer in Mexico, maybe its world-model "knows" (from news stories) that there is a new AI research project using this particular supercomputer in Mexico, but it won't conclude that this research project is "me", because, as far as it knows, there is no "me"; it is utterly ignorant of its own existence.

If you find this unintuitive, well, so do I! That's because self-unaware systems are super non-anthropomorphic. If I'm able to think straight about this concept, it's only by firmly grounding myself in the three examples I mentioned in the Overview. For example, take a hypercomputer, using Solomonoff Induction to find the world-model that most parsimoniously predicts all the data on the internet. Does this world-model contain the statement: "I am a hypercomputer running Solomonoff Induction"? No!! That's just not something that would happen in this system. (Correction[2])

Just as Hume said you can't derive an "ought" from an "is", my contention here is that you can't derive a first-person perspective from any amount of third-person information.

# How do you query a self-unaware system?

There's some awkwardness in querying a self-unaware system, because it can't just directly apply its intelligence to understanding your questions, nor to making itself understood by you. Remember, it doesn't think of itself as having input or output channels, because it doesn't think of itself period! Still, if we spend some time writing (non-intelligent) interface code, I think querying the system should ultimately work pretty well. The system does, after all, have excellent natural-language understanding inside of it.

I think the best bet is to program the system to make conditional predictions about the world, using its world-model. I gave an example above: "Calculate the least improbable scenario, according to your world-model, wherein a cure for cancer is discovered anywhere in the world". The subroutine does some calculation, writes the answer to disk, and terminates (of course, as always, it doesn't know that it wrote the answer to disk, it just does it). As we read the answer, we incidentally learn the cure for cancer. I expect that we would have some visibility into its internal world-model, but even a black-box predictive world-modeler is probably OK. Imagine prompting GPT-2 for the most likely completion of the sentence "In 2030 scientists finally discovered a cure for cancer, and it was ...", or something like that.
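Here is a toy sketch of what such a query subroutine might look like. The scenario list and its probabilities are invented purely for illustration (a real world-model would not be an explicit table), and `least_improbable` is a hypothetical name.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    description: str
    cure_found: bool
    probability: float


# Hypothetical stand-in for the world-model's beliefs about the next 10 years.
SCENARIOS = [
    Scenario("no cure is found", False, 0.90),
    Scenario("immune therapy X combined with blood therapy Y cures all cancers", True, 0.06),
    Scenario("a gene-editing approach cures all cancers", True, 0.03),
    Scenario("a cure turns up by accident in an unrelated drug trial", True, 0.01),
]


def least_improbable(scenarios, condition):
    """Return (scenario, conditional probability) for the likeliest scenario
    satisfying `condition`, or None if no scenario satisfies it."""
    matching = [s for s in scenarios if condition(s)]
    total = sum(s.probability for s in matching)
    if total == 0:
        return None
    best = max(matching, key=lambda s: s.probability)
    return best, best.probability / total


answer, conditional_probability = least_improbable(SCENARIOS, lambda s: s.cure_found)
print(answer.description)
```

Note that the subroutine is plain, non-intelligent interface code: it filters and ranks what the world-model already believes, and the unconditional probability (`answer.probability`) stays available so we can see how far-fetched the scenario is.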

## Can you give it multiple queries in sequence without resetting it?

Yes, although things get a bit tricky when the system starts reasoning about itself. (It reasons about itself in exactly the same way, and for exactly the same reason, as it reasons about anything else in the world.)

Suppose that we continue allowing our self-unaware system to have read-only internet access, after we've published the cure for cancer from the previous section. Now plastered all over the newspapers are stories about the famous Self-Unaware AI Oracle running on a supercomputer in Mexico, which has just invented the cure for cancer. The system now will definitely put self-unaware AI oracles into its predictive generative world-model (if it hadn't already), which entails trying to understand and predict what such a thing would do in different circumstances. Maybe it even reads its own source code!

Unfortunately, it won't be able to reason about itself perfectly—that would require simulating itself, which causes an infinite regress. But the system will apply all its models and heuristics, and do the best it can to come up with a predictive model of itself, on the basis of the (non-privileged) information it has. (Note that there is no reason to expect it to build a self-consistent model, i.e., a model in which its own predictions about itself are correct.)

OK, now we go back and ask the system for something else, say, a cure for Alzheimer's. Again, we say "Please output the least improbable scenario, according to your world-model, wherein Alzheimer's is cured in the next 10 years". It might say "Those scientists in Mexico, using their Self-Unaware AI Oracle, learn the following treatment approach...".

Now, recall that with the classic AI Oracle, we worry about manipulative answers. In this particular case, we can be sure that the system itself is not inherently manipulative (see below), but we would still be in trouble if the Self-Unaware AI Oracle in Mexico believes that the Self-Unaware AI Oracle in Mexico would be manipulative. This failure mode seems extremely unlikely. As I mentioned a couple paragraphs ago, its third-person model of itself will be an imperfect approximation to the real thing, but the belief that it would be manipulative would be a bizarrely specific modeling error that came out of nowhere!

Still, out of an abundance of caution, the query should probably be something like: "Please output the least improbable scenario, according to your world-model, wherein a cure for Alzheimer's is invented by a group with no access to any AI oracles!" Then we avoid the self-modeling issues altogether. Indeed, it should be possible to use a predictive world model to do counterfactual reasoning about what might happen in a world in which there are no AIs whatsoever.
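A toy sketch of how the extra clause changes the answer. As before, the scenarios, probabilities, and the `uses_ai_oracle` flag are invented for illustration, and `query` is a hypothetical helper.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    description: str
    cured: bool
    uses_ai_oracle: bool
    probability: float


# Hypothetical beliefs of the world-model after the oracle has become famous.
SCENARIOS = [
    Scenario("scientists in Mexico ask their AI oracle for a treatment", True, True, 0.20),
    Scenario("a university lab finds a treatment through clinical trials", True, False, 0.05),
    Scenario("a startup repurposes an existing drug", True, False, 0.02),
    Scenario("no treatment is found", False, False, 0.73),
]


def query(scenarios, exclude_ai_oracles):
    """Likeliest scenario in which the disease is cured, optionally restricted
    to scenarios in which no AI oracle is involved."""
    allowed = [
        s for s in scenarios
        if s.cured and not (exclude_ai_oracles and s.uses_ai_oracle)
    ]
    if not allowed:
        return None
    return max(allowed, key=lambda s: s.probability)


naive = query(SCENARIOS, exclude_ai_oracles=False)
cautious = query(SCENARIOS, exclude_ai_oracles=True)
print(naive.description)
print(cautious.description)
```

Without the restriction, the top answer routes through the system's third-person model of itself. With it, the answer comes only from scenarios in which ordinary humans do the work.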

# Self-unaware AIs are safe to use

As far as I can tell, a self-unaware system should be about as safe as we could possibly hope for. Manipulation, deception, self-preservation, reproduction, and all the other Omohundro goals seem to be completely out of the question for a system with no knowledge that it is a calculation physically instantiated in the universe!

I'm trying hard to think of failure modes here. OK, there's a potential for the system to describe possible futures while leaving out critical details. So, let's say we ask for the least improbable scenario wherein "cancer is cured", and the system describes the events leading up to a plague that wipes out humanity. OK, well then we ask for the second least improbable scenario, and this one sounds pretty good! But after we inject the chemical into the first test patient, it turns out that this one is also a plague that wipes out humanity, but the AI's model output was unfortunately treating that fact as an irrelevant detail and neglecting to mention it. Are we back to Goodhart's law and the "nearest unblocked strategy" problem? Sort of, but I don't think it's a particularly worrisome version. Again, we're asking the system to guess what normal humans might plausibly do in the natural course of events (in the absence of AI), and we can look at the probabilities assigned to the scenarios to make sure they're not outlandishly small. We can also program the system to answer our follow-up questions, like "according to your models, if someone rolls out this therapy, what is the likely impact on lifespan? what is the likely impact on the environment? how does it work on the cellular level?" and so on. And we can trust that, while the answers may be imperfect, they will not be manipulative. I'm really not seeing any cause for concern here, or elsewhere, although I'm going to keep thinking about it.

# Are Self-Unaware AI Oracles competitive with other approaches to AGI?

I see two main disadvantages of Self-Unaware AI Oracles, but I think that both are less problematic than they first appear.

The first disadvantage is that these things are completely incompatible with RL techniques (as far as I can tell), and a lot of people seem to think that RL is the path to superintelligence. Well, I'm not at all convinced that we need RL, or that RL would ultimately even be all that helpful. The alternative path I'm proposing here is self-supervised learning: Given a sequence of bits from the internet, predict the subsequent bits. So there's a massive amount of training data—for example, I heard that 100,000 years of video have been uploaded to YouTube! I keep going back to those three examples from the beginning: (1) GPT-2 shows that we can get impressively far on this type of self-supervised learning even with today's technology; (2) Solomonoff induction on the entire internet is the astronomically high ceiling on what's possible; and (3) the human brain—which works primarily on exactly this type of self-supervised learning[3]—is a nice reference point for how far we might get along this path just by brute-force biomimetic engineering.

The second disadvantage is that it's still an oracle, needing a human in the loop.[4] But as far as oracles go, it's about as powerful as you could hope for: able to answer more-or-less arbitrary questions, and able to design new technology, as in the cancer example above. In particular, we can take a bootstrapping strategy, where we can ask the safe self-unaware oracle to help us design a safe AGI agent.

By the same token, despite appearances, Self-Unaware AI Oracles are capable of recursive self-improvement: We just present the query in the third person. ("This is a Self-Unaware AI Oracle", we say to it, holding up a giant mirror. "How might scientists make this type of system better?") We can even record the system doing a calculation, then pass that video back to itself as an input to improve its self-models. I think this would be a quite safe type of self-improvement, insofar as self-unawareness is (I hope) possible to rigorously verify, and also insofar as we're not worried about manipulative suggestions.

# Conclusion

Again, this is intuition-based brainstorming, not rigorous argument, and I'm looking forward to any feedback. For one thing, I think there are probably better and more precise ways to define self-unawareness, but I hope my definition above is close enough to get the idea across. I'll keep thinking about it, and I hope others do too!

1. See Jeff Hawkins, On Intelligence, or Andy Clark, Surfing Uncertainty, for example. ↩︎

2. Correction: I got this example wrong. The hypercomputer chooses a predictive algorithm, and the question is whether the latter is self-unaware. That's not so obvious... ↩︎

3. See references in footnote 1. Of course, the human brain uses both self-supervised learning (predict the next thing you'll see, hear, and feel) and RL (cake is good, death is bad). My feeling is that we can throw out the RL part (or throw out enough of it to allow self-unawareness) and the system will still work pretty well. For example, when Einstein invented relativity, he wasn't doing RL interaction with the real world, but rather searching through generative models, tweaking and recombining the higher-level models and keeping them when they offered a parsimonious and accurate prediction of lower-level models. I think we can write self-unaware code that does that kind of thing. Without a reward signal, we might need to program our own mechanism to direct "attention", i.e. to guide which aspects of the world need to be modeled with extra-high accuracy. But again, this seems like something we can just put in manually. Note: I'm not terribly confident about anything in this footnote, and want to think about it more. ↩︎

4. If you try to make a self-unaware AI agent, it immediately starts modeling itself and adjusting its behavior to that model, which (as mentioned above) yields hard-to-predict and possibly-problematic behavior ... unless there's a trick I haven't thought of. ↩︎