(See also my later post, Self-Supervised Learning and AGI Safety)
Abstract: A self-unaware AI oracle is a system that builds a world-model inside which there is no concept of "me". I propose that such systems are safe to operate, that they can be plausibly built by continuing known AI development paths, and that they would be competitive with any other type of AI—capable of things like inventing new technology and safe recursive self-improvement.
Epistemic status: Treat as preliminary brainstorming; please let me know any ideas, problems, and relevant prior literature.
My proposal is to build a system that takes in data about the world (say, all the data on the internet) and builds a probabilistic generative model of that data, in the sense of taking an arbitrary number of bits and then predicting one or more masked bits. If you want to imagine something more specific, here are three examples: (1) A hypercomputer using Solomonoff Induction; (2) Some future descendant of GPT-2, using a massive Transformer (or some better future architecture); (3) A neocortex-like algorithm that builds low-level predictive models of small segments of data, then a hierarchy of higher-level models that predict which of the lower-level models will occur in which order in different contexts.
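To make "predict masked bits" concrete, here is a minimal sketch of the simplest predictor I can think of: a count-based model that estimates the probability of the next bit from the previous k bits. It is far weaker than any of the three examples above, and the class name `ContextBitPredictor`, the Laplace smoothing, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


class ContextBitPredictor:
    """Toy order-k model: estimates P(next bit = 1 | previous k bits) from counts."""

    def __init__(self, k: int):
        self.k = k
        # counts[context] = [number of 0s seen next, number of 1s seen next]
        self.counts = defaultdict(lambda: [0, 0])

    def fit(self, bits: str) -> None:
        # Slide over the data, treating each bit in turn as the "masked" one.
        for i in range(self.k, len(bits)):
            context = bits[i - self.k:i]
            self.counts[context][int(bits[i])] += 1

    def prob_one(self, context: str) -> float:
        # Use only the last k bits of whatever context is supplied.
        key = context[max(0, len(context) - self.k):]
        zeros, ones = self.counts.get(key, (0, 0))
        # Laplace smoothing: an unseen context yields 0.5 instead of an error.
        return (ones + 1) / (zeros + ones + 2)


# Toy "internet": a strictly alternating bit stream.
predictor = ContextBitPredictor(k=1)
predictor.fit("01" * 50)
p_after_zero = predictor.prob_one("0")  # should be close to 1
p_after_one = predictor.prob_one("1")   # should be close to 0
```

Nothing in this model refers to the process that is running it; it only stores statistics about the data, which is the property the rest of the post is concerned with.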
Further, we design the system to be "self-unaware"—we want it to construct a generative world-model in which its own physical instantiation does not have any special status. The above (1-2) are examples, as is (3) if you don't take "neocortex-like" too literally (the actual neocortex can learn about itself by, say, controlling muscles and observing the consequences).
Finally, we query the system in a way that is compatible with its self-unawareness. For example, if we want to cure cancer, one nice approach would be to program it to search through its generative model and output the least improbable scenario wherein a cure for cancer is discovered somewhere in the world in the next 10 years. Maybe it would output: "A scientist at a university will be testing immune therapy X, and they will combine it with blood therapy Y, and they'll find that the two together cure all cancers". Then, we go combine therapies X and Y ourselves.
What is self-unawareness?
In the context of predictive world-modeling systems, a self-unaware system (a term I just made up—is there an existing term?) is one that does not have itself (or any part of itself, or any consequences of itself) as specially-flagged entities in its world model.
Example of a self-aware system: A traditional RL agent. (Why? Because it has a special concept of "its own actions" represented in its models.)
Example of a self-unaware system: Any system that takes inputs, does a deterministic computation (with no further inputs from the environment), and spits out an output. (Why? Because when you correctly compute a computable function, you get the same answer regardless of where and whether the computation is physically instantiated in the universe.) (Edit to add: On second thought, this is wrong, according to the definition of self-unawareness that I'm using everywhere else. The "more examples" subsection is a better description of what I'm getting at.)
In one sense, ensuring that an AGI is self-unaware seems like it should be pretty easy; "The space of all computable functions" is a pretty big space to explore, and that doesn't even exhaust the possibilities! On the other hand, of course there are always pitfalls ... for example, if your code has a race condition, that's a side-channel potentially leaking information from the physical instantiation into the (symbolic) world model. Still, designing a system to be self-unaware seems pretty tractable, and maybe even amenable to formal verification (flag some variables as "part of the world-model" and other variables as "self-knowledge" and prevent them from intermingling, or something like that...).
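As a rough illustration of the "flag some variables and prevent them from intermingling" idea, here is a minimal sketch in which world-derived information and self-knowledge are separate types, and the world-model refuses anything that is not world-derived. The names `WorldFact`, `SelfFact`, and `WorldModel` are hypothetical, and a runtime check like this is only a stand-in for the static or formal verification the paragraph above has in mind; it does nothing about side-channels such as race conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldFact:
    """Information derived from external data (e.g. a scraped web page)."""
    text: str


@dataclass(frozen=True)
class SelfFact:
    """Information about the running process itself (e.g. which query it is handling)."""
    text: str


class WorldModel:
    """Hypothetical container that only ever accepts WorldFact values."""

    def __init__(self) -> None:
        self._facts = []

    def add(self, fact: WorldFact) -> None:
        # Runtime guard standing in for a compile-time or formally verified check.
        if not isinstance(fact, WorldFact):
            raise TypeError("only WorldFact values may enter the world-model")
        self._facts.append(fact)

    def __len__(self) -> int:
        return len(self._facts)


model = WorldModel()
model.add(WorldFact("A new AI project is running on a supercomputer in Mexico."))

try:
    model.add(SelfFact("I am currently consolidating my knowledge of anthropology."))
    rejected = False
except TypeError:
    rejected = True
```

Note that the first fact is allowed in even though it happens to describe the system: it arrives as ordinary third-person data, with no special flag.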
More examples of how to think about self-unawareness
- If a self-unaware system is, at some moment, consolidating its knowledge of anthropology, it doesn't "know" that it is currently consolidating its knowledge of anthropology—this fact is not represented in the world-model it's building.
- If a self-unaware system is running on a particular supercomputer in Mexico, maybe its world-model "knows" (from news stories) that there is a new AI research project using this particular supercomputer in Mexico, but it won't conclude that this research project is "me", because, as far as it knows, there is no "me"; it is utterly ignorant of its own existence.
If you find this unintuitive, well, so do I! That's because self-unaware systems are super non-anthropomorphic. If I'm able to think straight about this concept, it's only by firmly grounding myself in the three examples I mentioned in the Overview. For example, take a hypercomputer, using Solomonoff Induction to find the world-model that most parsimoniously predicts all the data on the internet. Does this world-model contain the statement: "I am a hypercomputer running Solomonoff Induction"? No!! That's just not something that would happen in this system. (Correction)
Just as Hume said you can't derive an "ought" from an "is", my contention here is that you can't derive a first-person perspective from any amount of third-person information.
How do you query a self-unaware system?
There's some awkwardness in querying a self-unaware system, because it can't just directly apply its intelligence to understanding your questions, nor to making itself understood by you. Remember, it doesn't think of itself as having input or output channels, because it doesn't think of itself period! Still, if we spend some time writing (non-intelligent) interface code, I think querying the system should ultimately work pretty well. The system does, after all, have excellent natural-language understanding inside of it.
I think the best bet is to program the system to make conditional predictions about the world, using its world-model. I gave an example above: "Calculate the least improbable scenario, according to your world-model, wherein a cure for cancer is discovered anywhere in the world". The subroutine does some calculation, writes the answer to disk, and terminates (of course, as always, it doesn't know that it wrote the answer to disk, it just does it). As we read the answer, we incidentally learn the cure for cancer. I expect that we would have some visibility into its internal world-model, but even a black-box predictive world-modeler is probably OK. Imagine prompting GPT-2 for the most likely completion of the sentence "In 2030 scientists finally discovered a cure for cancer, and it was ...", or something like that.
Can you give it multiple queries in sequence without resetting it?
Yes, although things get a bit tricky when the system starts reasoning about itself. (It reasons about itself in exactly the same way, and for exactly the same reason, as it reasons about anything else in the world.)
Suppose that we continue allowing our self-unaware system to have read-only internet access, after we've published the cure for cancer from the previous section. Now plastered all over the newspapers are stories about the famous Self-Unaware AI Oracle running on a supercomputer in Mexico, which has just invented the cure for cancer. The system now will definitely put self-unaware AI oracles into its predictive generative world-model (if it hadn't already), which entails trying to understand and predict what such a thing would do in different circumstances. Maybe it even reads its own source code!
Unfortunately, it won't be able to reason about itself perfectly—that would require simulating itself, which causes an infinite regress. But the system will apply all its models and heuristics, and do the best it can to come up with a predictive model of itself, on the basis of the (non-privileged) information it has. (Note that there is no reason to expect it to build a self-consistent model, i.e., a model in which its own predictions about itself are correct.)
OK, now we go back and ask the system for something else, say, a cure for Alzheimer's. Again, we say "Please output the least improbable scenario, according to your world-model, wherein Alzheimer's is cured in the next 10 years". It might say "Those scientists in Mexico, using their Self-Unaware AI Oracle, learn the following treatment approach...".
Now, recall that with the classic AI Oracle, we worry about manipulative answers. In this particular case, we can be sure that the system itself is not inherently manipulative (see below), but we would still be in trouble if the Self-Unaware AI Oracle in Mexico believes that the Self-Unaware AI Oracle in Mexico would be manipulative. This failure mode seems extremely unlikely. As I mentioned a couple paragraphs ago, its third-person model of itself will be an imperfect approximation to the real thing, but the belief that it would be manipulative would be a bizarrely specific modeling error that came out of nowhere!
Still, out of an abundance of caution, the query should probably be something like: "Please output the least improbable scenario, according to your world-model, wherein a cure for Alzheimer's is invented by a group with no access to any AI oracles!" Then we avoid the self-modeling issues altogether. Indeed, it should be possible to use a predictive world model to do counterfactual reasoning about what might happen in a world in which there are no AIs whatsoever.
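The two query styles discussed so far (a conditional query, and a conditional query that additionally excludes AI oracles) can be sketched as plain, non-intelligent interface code over a world-model. In this sketch the "world-model" is just a hypothetical table of scenario descriptions with made-up probabilities, and the function name `least_improbable` is my own; a real system would have to search a generative model rather than a lookup table.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a world-model: scenario text -> assigned probability.
# All scenarios and numbers are invented for illustration.
toy_model: Dict[str, float] = {
    "Alzheimer's is cured by scientists using an AI oracle": 0.020,
    "Alzheimer's is cured by a university lab combining two therapies": 0.008,
    "Alzheimer's is cured by an accidental discovery in a clinic": 0.003,
    "No cure for Alzheimer's is found": 0.900,
}


def least_improbable(
    model: Dict[str, float],
    condition: Callable[[str], bool],
    n: int = 1,
) -> List[Tuple[str, float]]:
    """Return the n highest-probability scenarios that satisfy the condition."""
    matches = [(s, p) for s, p in model.items() if condition(s)]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:n]


def is_cured(scenario: str) -> bool:
    return "is cured" in scenario


def is_cured_without_oracle(scenario: str) -> bool:
    return is_cured(scenario) and "AI oracle" not in scenario


plain_answer = least_improbable(toy_model, is_cured)
cautious_answer = least_improbable(toy_model, is_cured_without_oracle)
```

Returning the probability alongside each scenario also makes it easy to check that an answer is not outlandishly unlikely, and passing `n=2` gives the "second least improbable scenario".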
Self-unaware AIs are safe to use
As far as I can tell, a self-unaware system should be about as safe as we could possibly hope for. Manipulation, deception, self-preservation, reproduction, and all the other Omohundro goals seem to be completely out of the question for a system with no knowledge that it is a calculation physically instantiated in the universe!
I'm trying hard to think of failure modes here. OK, there's a potential for the system to describe possible futures while leaving out critical details. So, let's say we ask for the least improbable scenario wherein "cancer is cured", and the system describes the events leading up to a plague that wipes out humanity. OK, well then we ask for the second least improbable scenario, and this one sounds pretty good! But after we inject the chemical into the first test patient, it turns out that this one is also a plague that wipes out humanity, but the AI's model output was unfortunately treating that fact as an irrelevant detail and neglecting to mention it. Are we back to Goodhart's law and the "nearest unblocked strategy" problem? Sort of, but I don't think it's a particularly worrisome version. Again, we're asking the system to guess what normal humans might plausibly do in the natural course of events (in the absence of AI), and we can look at the probabilities assigned to the scenarios to make sure they're not outlandishly small. We can also program the system to answer our follow-up questions, like "According to your models, if someone rolls out this therapy, what is the likely impact on lifespan? What is the likely impact on the environment? How does it work on the cellular level?" and so on. And we can trust that, while the answers may be imperfect, they will not be manipulative. I'm really not seeing any cause for concern here, or elsewhere, although I'm going to keep thinking about it.
Are Self-Unaware AI Oracles competitive with other approaches to AGI?
I see two main disadvantages of Self-Unaware AI Oracles, but I think that both are less problematic than they first appear.
The first disadvantage is that these things are completely incompatible with RL techniques (as far as I can tell), and a lot of people seem to think that RL is the path to superintelligence. Well, I'm not at all convinced that we need RL, or that RL would ultimately even be all that helpful. The alternative path I'm proposing here is self-supervised learning: Given a sequence of bits from the internet, predict the subsequent bits. So there's a massive amount of training data—for example, I heard that 100,000 years of video have been uploaded to YouTube! I keep going back to those three examples from the beginning: (1) GPT-2 shows that we can get impressively far on this type of self-supervised learning even with today's technology; (2) Solomonoff induction on the entire internet is the astronomically high ceiling on what's possible; and (3) the human brain—which works primarily on exactly this type of self-supervised learning—is a nice reference point for how far we might get along this path just by brute-force biomimetic engineering.
The second disadvantage is that it's still an oracle, needing a human in the loop. But as far as oracles go, it's about as powerful as you could hope for: able to answer more-or-less arbitrary questions, and able to design new technology, as in the cancer example above. In particular, we can take a bootstrapping strategy, where we can ask the safe self-unaware oracle to help us design a safe AGI agent.
By the same token, despite appearances, Self-Unaware AI Oracles are capable of recursive self-improvement: We just present the query in the third person. ("This is a Self-Unaware AI Oracle", we say to it, holding up a giant mirror. "How might scientists make this type of system better?") We can even record the system doing a calculation, then pass that video back to itself as an input to improve its self-models. I think this would be a quite safe type of self-improvement, insofar as self-unawareness is (I hope) possible to rigorously verify, and also insofar as we're not worried about manipulative suggestions.
Again, this is intuition-based brainstorming, not rigorous argument, and I'm looking forward to any feedback. For one thing, I think there are probably better and more precise ways to define self-unawareness, but I hope my definition above is close enough to get the idea across. I'll keep thinking about it, and I hope others do too!
See Jeff Hawkins On Intelligence or Andy Clark Surfing Uncertainty, for example. ↩︎
Correction: I got this example wrong. The hypercomputer chooses a predictive algorithm, and the question is whether the latter is self-unaware. That's not so obvious... ↩︎
See references in footnote 1. Of course, the human brain uses both self-supervised learning (predict the next thing you'll see, hear, and feel) and RL (cake is good, death is bad). My feeling is that we can throw out the RL part (or throw out enough of it to allow self-unawareness) and the system will still work pretty well. For example, when Einstein invented relativity, he wasn't doing RL interaction with the real world, but rather searching through generative models, tweaking and recombining the higher-level models and keeping them when they offered a parsimonious and accurate prediction of lower-level models. I think we can write self-unaware code that does that kind of thing. Without a reward signal, we might need to program our own mechanism to direct "attention", i.e. to guide which aspects of the world need to be modeled with extra-high accuracy. But again, this seems like something we can just put in manually. Note: I'm not terribly confident about anything in this footnote, and want to think about it more. ↩︎
If you try to make a self-unaware AI agent, it immediately starts modeling itself and adjusting its behavior to that model, which (as mentioned above) yields hard-to-predict and possibly-problematic behavior ... unless there's a trick I haven't thought of. ↩︎
A traditional RL agent absolutely could be a deterministic computation (modulo bugs in the code). It is a program that gets compiled into or run by machine instructions which follow a particular deterministic specification that Intel has written (if you're running on Intel chips).
Also, this argument would prove that humans are not deterministic systems, which seems like it's proving too much.
If you want to predict what's going to happen in the world, it often helps if you know that you are a thing that affects the world.
For your three examples: The Solomonoff induction example is a weird case because it is an uncomputable algorithm that only has computable hypotheses, so it can't be aware of itself, but your second and third examples seem like they totally could lead to self-aware systems. In fact, the third example sounds like a description of humans, and humans are self-aware.
Overall I don't see how we could tell in advance whether a system would be self-unaware or not.
GPUs aren't deterministic.
I mean, sure. Seems irrelevant to the point being made here.
If you're objecting to the fact that I said a thing that was literally false but basically correct, I've changed "is" to "could be".
For deterministic computation: What I was trying to get at is that a traditional RL agent does some computation, gets a new input based on its actions and environment, does some more computation, and so on. (I admit that I didn't describe this well. I edited a bit.)
Your argument about Solomonoff induction is clever but I feel like it's missing the point. Systems with some sense of self and self-understanding don't generally simulate themselves or form perfect models of themselves; I know I don't! Here's a better statement: "I am a predictive world-model, I guess I'm probably implemented on some physical hardware somewhere." This is a true statement, and the system can believe that statement without knowing what the physical hardware is (then it can start reasoning about what the physical hardware is, looking for news stories about AI projects). I'm proposing that we can and should build systems that don't have this type of belief in their world-models.
What I really have in mind is: There's a large but finite space of computable predictive models (given a bit-stream, predict the next bit). We run a known algorithm that searches through this space to find the model that best fits the internet. This model is full of insightful, semantic information about the world, as this helps it make predictions. Maybe if we do it right, the best model would not be self-reflective, not knowing what it was doing as it did its predictive thing, and thus unable to reason about its internal processes or recognize causal connections between that and the world it sees (even if such connections are blatant).
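Here is a minimal sketch of "run a known algorithm that searches a finite space of predictive models for the best fit". The model space is deliberately tiny: count-based predictors that look at the previous k bits, for k from 0 to `max_k`, scored by average log loss on held-out bits. The function names and the toy bit-stream are assumptions for illustration, and exhaustive search over k is standing in for whatever search procedure would really be used.

```python
import math
from collections import Counter


def train(bits: str, k: int) -> Counter:
    """Count how often each (context of k bits, next bit) pair occurs."""
    counts = Counter()
    for i in range(k, len(bits)):
        counts[(bits[i - k:i], bits[i])] += 1
    return counts


def log_loss(counts: Counter, bits: str, k: int) -> float:
    """Average number of bits of surprise per predicted bit (lower is better)."""
    total = 0.0
    for i in range(k, len(bits)):
        ctx = bits[i - k:i]
        seen = counts[(ctx, bits[i])]
        seen_any = counts[(ctx, "0")] + counts[(ctx, "1")]
        p = (seen + 1) / (seen_any + 2)  # Laplace smoothing
        total -= math.log2(p)
    return total / (len(bits) - k)


def choose_model(train_bits: str, test_bits: str, max_k: int):
    """Search the finite model space and return the best k plus all scores."""
    scores = {
        k: log_loss(train(train_bits, k), test_bits, k)
        for k in range(max_k + 1)
    }
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy data with period 3: two bits of context are needed to predict it well.
best_k, scores = choose_model("011" * 40, "011" * 10, max_k=3)
```

The chooser is an optimizer in a perfectly ordinary sense, yet neither it nor the chosen model contains any representation of the search that is taking place.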
One intuition is: An oracle is supposed to just answer questions. It's not supposed to think through how its outputs will ultimately affect the world. So, one way of ensuring that it does what it's supposed to do, is to design the oracle to not know that it is a thing that can affect the world.
I agree it's missing the point. I do get the point, and I disagree with it -- I wanted to say "all three cases will build self-models"; I couldn't because that may not be true for Solomonoff induction due to an unrelated reason which as you note misses the point. I did claim that the other two cases would be self-aware as you define it.
(I agree that Solomonoff induction might build an approximate model of itself, idk.)
My claim is that we have no idea how to do this, and I think the examples in your post would not do this.
I'm not disagreeing that if we could build a self-unaware oracle then we would be safe. That seems reasonably likely to fix agency issues (though I'd want to think about it more). My disagreement is on the premise of the argument, i.e. can we build self-unaware oracles at all.
On further reflection, you're right, the Solomonoff induction example is not obvious. I put a correction in my post, thanks again.
I think we're on the same page! As I noted at the top, this is a brainstorming post, and I don't think my definitions are quite right, or that my arguments are airtight. The feedback from you and others has been super-helpful, and I'm taking that forward as I search for a more rigorous version of this, if it exists!! :-)
A sufficient condition here should be a lack of feedback loops that include information about the agent. I'm not sure that this is necessary, though, and there may be some more lax criteria we could live with.
This is mostly theoretical, though, because it's going to be very hard to create any system that is actually embedded in the world and present it with information that is not causally influenced by itself; that seems theoretically impossible. You might be able to achieve practically good-enough acausality via some kind of data "scrubbing" procedure, but I don't hold out much hope there given how hard it is to achieve this even in narrow cases*.
*I speak from firsthand experience here. I used to work at a company that sold ad insights and required collecting PII about users. We scrubbed the data for EU export, and we met the legal standard, but we also later figured out we could still triangulate the identity for any customer given their anonymized data to within an accuracy of 3 individuals on average.
Agreed. Also agreed that this seems very difficult, both in theory and in practice.
Well, it takes two things: (1) Self-knowledge ("I wrote '0' into register X", "I am thinking about turtles", etc. being in the world-model) and (2) knowledge of the causal consequences of that (the programmers see the 0 in register X and then change their behavior). With both of those, the system can learn causal links between its own decisions and the rest of the world, and can therefore effect real-world consequences.
Out of these two options, I think you're proposing to cut off path (2), which I agree is very challenging. I am proposing to cut off path (1) instead, and not worry about path (2). Thus it's a cybersecurity-type hardware/software design challenge, not a data sanitation challenge.
Given how I'm now thinking of what you mean by self-unawareness (in that it includes a lack of optimization and learning), (1) seems uninteresting here, since this seems to me to be suggesting that we build "oracles" that are not AI but instead regular software.
I'm not sure this is true. To give a counterexample, evolution is a self-unaware optimization process (or, if you want to claim subagents within the evolutionary process void unawareness, then at least it was unaware for a very long time) that does seem to engage in behavior we'd reasonably say tries to do things like:
I think we could do a similar analysis of a financial market, which also seems self-unaware. And let's not forget the many cases people have already seen of neural-nets accidentally or instrumentally learning things they weren't expected to.
I'm pretty sure this matches what Omohundro and Bostrom had in mind with these kinds of arguments, too: any optimization process, regardless of self-awareness, can be driven to these instrumental goals, even if it is unaware of what it is doing or why it is doing it or even that "it" is doing anything. In fact "goal" is a framing we put on this pattern of behavior because it's simply the thing in our experience that's most similar to the kind of "goal" that's happening in a self-unaware process.
Imagine three levels, in order of increasing concern: (1) system does self-preserving action sometimes randomly, no more often than chance. (2) system does self-preserving action randomly, but once it sees the good consequences, starts doing it systematically. (3) system does self-preserving action systematically from the start, because it had foresight and motivation. Humans and problematic AIs are up at (3), a population of bacteria undergoing evolution are at (2), and a self-unaware oracle is at (1).
I think the concern with your proposal is that good consequences don't seem like they need to be seen; they can also be the result of selection (namely, surviving). If there are a bunch of prediction projects, and some of them are by chance biased (or otherwise handled) in ways that increase their chances of "survival"**, then as time goes on, older projects are more inclined in that way* because the ones that weren't were shut down at higher rates (and the ones biased/handled in the other direction were shut down at really high rates).***
*Whatever that way is.
**This need not be a long lasting property, only a local one. That which was adaptive in other circumstances can be maladaptive when things change. This doesn't seem like a problem until you consider the results of evolution - it takes a long time, but eventually it gets more general capabilities.
***This might have more to do with the way the project is handled than the code though - it depends on the difference in weight between factors, and doesn't incorporate the consequences of multiple equilibria. It's a rough model, and a better model would be more explicit about how the oracle's predictions affect its chances of surviving, but drawing attention to that factor was the point. (And just saying "what if they ask the oracle if running it is worth it" doesn't address the possibility that they don't, the way this argument does.) How new projects would work isn't super clear - they could start over from scratch, or use the same code as someone else but feed in different information. They might also try to imitate older projects, propagating tendencies which tend to keep projects alive, which might involve biased oracles. This problem might be fixed by having survivorship track accuracy, but as things change, what led to accuracy in the past might lead to inaccuracy in the future.
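The selection argument in this comment can be put into a few lines of arithmetic. The sketch below tracks the expected share of surviving projects that have a survival-favoring bias, assuming (hypothetically) that each kind of project has a fixed per-round chance of not being shut down. The function name and every number are invented for illustration; the model ignores new projects, imitation, and changing circumstances, all of which the footnotes above flag as complications.

```python
def surviving_share(
    initial_share: float,
    p_survive_biased: float,
    p_survive_unbiased: float,
    rounds: int,
) -> list:
    """Expected share of surviving projects that are biased, after each round."""
    biased = initial_share
    unbiased = 1.0 - initial_share
    history = []
    for _ in range(rounds):
        # No project observes anything here; shutdowns do all the "work".
        biased *= p_survive_biased
        unbiased *= p_survive_unbiased
        history.append(biased / (biased + unbiased))
    return history


# Hypothetical numbers: 10% of projects start out biased, and biased projects
# survive each review round with probability 0.9 versus 0.8 for the rest.
history = surviving_share(0.1, 0.9, 0.8, rounds=20)
```

With these made-up numbers the biased projects go from a tenth of the population to a majority of the survivors within twenty rounds, without any individual project seeing or acting on the consequences of its outputs.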
Thanks for this helpful comment. The architecture I'm imagining is: Model-choosing code finds a good predictive world-model out of a vast but finite space of possible world-models, by running SGD on 100,000 years of YouTube videos (or whatever). So the model-chooser is explicitly an optimizer, the engineer who created the model-chooser is also explicitly an optimizer, and the eventual predictive world-model is an extremely complicated entity with superhuman world-modeling capabilities, and I am loath to say anything about what it is or what it's going to do.
Out of these three, (1) the engineer is not problematic because it's a human, (2) the model-chooser is not problematic because it's (I assume and expect) a known and well-understood algorithm (e.g. Transformer), and thus (3) the eventual predictive world-model is the only thing we're potentially worried about. My thought is that we can protect ourselves from the predictive world-model doing problematic consequentialist planning by scheming to give it no information whatsoever about how it can affect the world, not even the knowledge that it exists or what actions it is taking, such that if it has problematic optimization tendencies, it is unable to act on them.
(In regards to (1) more specifically, if a company is designing a camera, the cameras with properties that the engineers like are preferentially copied by the engineers into later versions. Yes, this is a form of optimization, but nobody worries about it more than anything else in life. Right?)
The concern is basically:
Evolution* is well understood.
And while it's not a really good optimizer, if you run it too long**, it's "not safe" (it made us, after all): it can get to GI (or AGI).
I don't think cameras think. They have a simple state which is determined by the settings (unless something's broken), and they take images. While we could argue that they have a memory 1) in the form of images and video, and 2) in the form of state, they don't do things with that information. (While you might be able to use them for video editing, that's largely a manual process, and is mostly performed on (bigger) computers designed for more general use.)
Comparatively, the point of Machine Learning is...machine learning. Today that's learning to recognize things (like human faces) and do things (like video games). If things get to the point where I don't drive a car, instead my car drives itself - that will be pretty big.
And the thing about some of this new stuff like neural nets is that we don't understand them. They're not hardcoded. They learn. And the more difficult/complex the task is, the harder they can be to understand. Currently, I'm more worried about a system which we don't understand being put in an important role and failing unexpectedly, or people using these tools for sinister ends (I'm not the biggest fan of facial recognition tech), than a super-intelligent AGI. AlphaStar didn't win by hacking its opponent's computer or causing a seizure.
*Or evolutionary algorithms
**What you're running it on (the problem you're trying to solve, and how complex it's allowed to get) might also matter.
Just to be clear, when OpenAI trained GPT-2, I am not saying that GPT-2 is a known and well-understood algorithm for generating text, but rather that SGD (Stochastic Gradient Descent) is a known and well-understood algorithm for generating GPT-2. (I mean, OK sure, ML researchers are still studying SGD, but its inner workings are not an impenetrable mystery the way that GPT-2's are.)
Why should we expect, though, that if it is the case that a self-unaware oracle only performs self-preserving actions sometimes randomly that it wouldn't incidentally optimize for that? I'm willing to believe it may be possible to construct a system with optimizing pressures weak enough that it couldn't end up incidentally optimizing for other things that are instrumentally useful (and unsafe), and I have some vague ideas about how that might happen, but I'm unclear from what you've presented so far why I should expect a self-unaware oracle to be such a system.
I'm not necessarily asking for a mathematically rigorous proof, but I also don't see a reasonable story that would lead me to conclude that, thus I fall back on my prior assumption that optimizing systems are at risk of developing secondary behaviors because they are useful in optimizing for the primary target.
I'm not sure what you have in mind here; to me, optimization requires some causal pathway from "Action X has consequence Y" to "Take Action X more often than chance".
A system can optimize if it has a way to store specially-flagged information in the form of "I took action X, and it had consequence Y" (or "if I take action X, it will have consequence Y"), and then bring that flagged information to bear when taking actions. A population of bacteria can do this! Evolution flags its "decisions" (mutations), storing that information in DNA, and then "consults" the DNA when "deciding" what the gene distribution will be in the next generation. A self-unaware system, lacking any "I" or "my decision" or "my action" flags in either its internal or external universe, would be missing the causal links necessary to optimize anything. Right?
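To illustrate the claim that optimization needs a stored link from "I took action X" to "it had consequence Y", here is a minimal sketch contrasting two toy processes in the same environment. The environment, the action names, and both functions are hypothetical, and a fixed alternating schedule stands in for "random, no more often than chance" so that the example is deterministic.

```python
from itertools import cycle

ACTIONS = ["noop", "self_preserve"]


def consequence(action: str) -> int:
    """Hypothetical environment: only one action has a good outcome."""
    return 1 if action == "self_preserve" else 0


def run_without_records(steps: int) -> list:
    """Actions follow a fixed schedule; outcomes are never stored or consulted."""
    schedule = cycle(ACTIONS)
    return [next(schedule) for _ in range(steps)]


def run_with_records(steps: int) -> list:
    """Stores 'I took action X, it had consequence Y' and consults it when acting."""
    records = {}
    taken = []
    for _ in range(steps):
        untried = [a for a in ACTIONS if a not in records]
        # Try each action once, then repeat whichever had the best recorded outcome.
        action = untried[0] if untried else max(records, key=records.get)
        records[action] = consequence(action)
        taken.append(action)
    return taken


unflagged = run_without_records(10).count("self_preserve")
flagged = run_with_records(10).count("self_preserve")
```

The only difference between the two is the `records` dictionary keyed on the process's own actions. Remove that and the second process has nothing to exploit, which is the sense in which a self-unaware system is claimed to be missing the necessary causal link.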
But if you build something that can't optimize that's not really AI or an oracle, that's just regular software that doesn't learn. I guess an expert system, for example, is functionally kind of like an oracle and it would meet your requirement of self-unawareness, but it also seems pretty uninteresting from a capabilities standpoint since any learning it does happens offline and only via external reprogramming of the algorithm (and then you just pass the buck to whatever that external thing doing the reprogramming is, be it human or another AI).
To me this is sort of like saying "hey, look, we can make provably correct software, just don't put any loops that might be unbounded in it". Sure, that works, but it also restricts the class of what you can achieve so much that people have generally chosen not to trade off the additional capabilities for correctness. I think we're looking at something similar here if your notion of self-unawareness also means no optimization and learning.
A self-unaware system would not be capable of one particular type of optimization task:
Take real-world actions ("write bit 0 into register 11") on the basis of anticipating their real-world consequences (human will read this bit and then do such-and-such).
This thing is an example of an optimization task, and it's a very dangerous one. Maybe it's even the only type of really dangerous optimization task! (This might be an overstatement, not sure.) Not all optimization tasks are in this category, and a system can be intelligent by doing other different types of optimization tasks.
A self-unaware system certainly is an optimizer in the sense that it does other (non-real-world) optimization tasks, in particular, finding the string of bits that would be most likely to follow a different string of bits on a real-world webpage.
As always, sorry if I'm misunderstanding you, thanks for your patience :-)
This seems to me to be missing the point made well by "Embedded Agency" and exemplified by the anvil problem: you can't in practice build a system where you can achieve this kind of thing because there is not real separation between inside the system and outside the system, just a model which assumes such a distinction exists.
Thanks for your patience, I think this is important and helpful to talk through (hope it's as helpful for you as for me!)
Let's introduce two terminologies I made up. First, the thing I mentioned above:
Second, when you run a program:
Now, when you run a program, you can think of what's happening in Algorithm Land (e.g. a list of numbers is getting sorted) and what's happening in the Real World (e.g. transistors are switching on and off). It's really always going to be both at once.
And now let's simplify things greatly by putting aside the case of world-modeling programs, which have a (partial, low-resolution) copy of the Real World inside Algorithm Land. Instead, let's restrict our attention to a chess-playing program or any other non-world-modeling program.
Now, in this case, when we think about Level-2 optimization, the foresight and planning involved entail searching exclusively through causal pathways that are completely inside Algorithm Land. (Why? Because without a world model, it has no way to reason about Real-World causal pathways.) In this case, I say there isn't really anything much to worry about.
Why not worry? Think about classic weird AGI disaster scenarios. For example, the algorithm is optimizing for the "reward" value in register 94, so it hacks its RAM to overwrite the register with the biggest possible number, then seizes control of its building and the power grid to ensure that it won't get turned off, then starts building bigger RAMs, designing killer nanomachines, and on and on. Note that ALL those things (1) involve causal pathways in the Real World (even if the action and consequence are arguably in Algorithm Land) and (2) would be astronomically unlikely to occur by random chance (which is what happens without Level-2 optimization). (I won't say that nothing can go awry with Level-1 optimization—I have great respect for bacteria—but it's a much easier situation to keep under control than rogue Level-2 optimization through Real-World causal pathways.)
Again, things that happen in Algorithm Land are also happening in the Real World, but the mapping is kinda arbitrary. High-impact things in Algorithm Land are not high-impact things in the Real World. For example, using RAM to send out manipulative radio signals is high-impact in the Real World, but just a random meaningless series of operations in Algorithm Land. Conversely, an ingeniously-clever chess move in Algorithm Land is just a random activation of transistors in the Real World.
(You do always get Level-1 optimization through Real-World causal pathways, with or without a world model. And you can get Level-2 optimization through Real-World causal pathways, but a necessary requirement seems to be an algorithm with a world-model and self-awareness, i.e. knowledge that there is a relation between things in Algorithm Land and things in the Real World.)
Just want to note that I like your distinctions between Algorithm Land and the Real World and also between Level-1 optimization and Level-2 optimization.
I think some discussion of AI safety hasn't been clear enough on what kind of optimization we expect in which domains. At least, it wasn't clear to me.
But a couple things fell into place for me about 6 months ago, which very much rhyme with your two distinctions:
1) Inexploitability only makes sense relative to a utility function, and if the AI's utility function is orthogonal to yours (e.g. because it is operating in Algorithm Land), then it may be exploitable relative to your utility function, even though it's inexploitable relative to its own utility function. See this comment (and thanks to Rohin for the post that prompted the thought).
2) While some process that's optimizing super-hard for an outcome in Algorithm Land may bleed out into affecting the Real World, this would sort of be by accident, and seems much easier to mitigate than a process that's trying to affect the Real World on purpose. See this comment.
Putting them together, a randomly selected superintelligence doesn't care about atoms, or about macroscopic events unfolding through time (roughly the domain of what we care about). And just because we run it on a computer that from our perspective is embedded in this macroscopic world, and that uses macroscopic resources (compute time, energy), doesn't mean it's going to start caring about macroscopic Real World events, or start fighting with us for those resources. (At least, not in a Level-2 way.)
On the other hand, powerful computing systems we build are not going to be randomly selected from the space of possible programs. We'll have economic incentives to create systems that do consider and operate on the Real World.
So it seems to me that a randomly selected superintelligence may not actually be dangerous (because it doesn't care about being unplugged -- that's a macroscopic concept that seems simple and natural from our perspective, but would not actually correspond to something in most utility functions), but that the superintelligent systems anyone is likely to actually build will be much more likely to be dangerous (because they will model and/or act on the Real World).
It's hard to be sure this separation will remain, though. While learning, an algorithm may accidentally hit upon unexpected techniques, such as row-hammering or performing operations that cause the hardware to generate radio waves (as you point out), or it may otherwise behave in unexpected ways that result in preferred outcomes by manipulating things in the "real world" outside the intended "algorithm land".
For another example, I seem to recall a system that learned to win in a competitive environment by mallocing so much that it starved out its competitors running on the same system. It never knew about the real world consequences of its actions since it didn't have access to know about other processes on the system, yet it carried out the behavior anyway. There are many other examples of this, and someone even collected them in a paper on arXiv, although I can't seem to find the link now.
The point is that the separation between Algorithm Land and the Real World doesn't exist except in our models. Even if you ran the algorithm on a computer with an air gap and placed the whole thing inside a Faraday cage, I'd still be concerned about unexpected leaks outside the sandbox of Algorithm Land into the Real World (maybe someone sneaks their phone in past security, and the optimizer learns to incidentally modify the fan on the computer it runs on to produce sounds that exploit the phone's microphone to transmit information to it? The possible failure scenarios are endless). Trying to maintain the separation you are looking for is known generally as "boxing" and although it's likely an important part of a safe AI development protocol, many people, myself included, consider it inadequate on its own and not something we should rely on, but rather part of a security-in-depth approach.
OK, so I was saying here that software can optimize for something (e.g. predicting a string of bits on the basis of other bits) and it's by default not particularly dangerous, as long as the optimization does not involve an intelligent foresight-based search through real-world causal pathways to reach the desired goal. My argument for this was (1) Such a system can do Level-1 optimization but not Level-2 optimization (with regard to real-world causal pathways unrelated to implementing the algorithm as intended), and (2) only the latter is unusually dangerous. From your response, it seems like you agree with (1) but disagree with (2). Is that right? If you disagree with (2), can you make up a scenario of something really bad and dangerous, something that couldn't happen with today's software, something like a Global Catastrophic Risk, that is caused by a future AI that is optimizing something but is not more specifically using a world-model to do an intelligent search through real-world causal pathways towards a desired goal?
Sure. Let's construct the 0-optimizer. Its purpose is simply to cause there to be lots of 0s in memory (as opposed to 1s). It only knows about Algorithm Land, and even then it's a pretty narrow model: it knows about memory and can read and write to it. Now at some point the 0-optimizer manages to get all the bits set to 0 in its addressable memory, so it would seem to have reached maximum attainment.
But it's a hungry optimizer and keeps trying to find ways to set more bits to 0. It eventually stumbles upon a gap in security of the operating system that allows it to gain access to memory outside its address space, so it can now set those bits to 0. Obviously it does this all "accidentally", never knowing it's using a security exploit, it just stumbles into it and just sees memory getting written with 0s so it's happy (this has plenty of precedent; human minds are great examples of complex systems that have limited introspective access that do lots of complex things without knowing how or why they are doing them). With some luck, it doesn't immediately destroy itself and gets a chance to be hungry for more 0s.
Next it accidentally starts using the network interface on the computer. Although it doesn't exactly understand what's going on, it figures out how to get responses that just contain lots of 0s. Unfortunately for us what this is actually doing is performing a denial of service attack against other computers to get back the 0s. Now we have a powerful optimization process that's hungry for 0s and it satisfies its hunger by filling our networks with garbage traffic.
A couple of hops on, it has gone from denial of service attacks, to wiping out our ability to use Internet service, to wiping out our ability to use any EM communication channel, to generating dangerously high levels of radiation that kill all life on Earth.
This story involved a lot of luck, but my expectation is that we should not underestimate how "lucky" a powerful optimizer can be, given evolution is a similarly ontologically simple process that nonetheless managed to produce some pretty complex results.
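To make the first step of the story concrete, here is a toy sketch in Python. Everything in it is made up for illustration: `buggy_write` stands in for a hypothetical write primitive that fails to enforce the optimizer's address space, and `run_zero_optimizer` is a blind trial-and-error loop, not a claim about how any real system works. The thing to notice is that nothing in the code represents "other processes" or "a security exploit"; the loop only sees the count of 0s it can observe.

```python
import random

def buggy_write(memory, address, bit):
    """Hypothetical write primitive: it should reject addresses outside the
    optimizer's own region, but (by assumption) the bounds check is missing."""
    memory[address] = bit

def run_zero_optimizer(memory, steps, seed=0):
    """Blind search: try a random write, undo it if the number of 0s went down."""
    rng = random.Random(seed)
    for _ in range(steps):
        address = rng.randrange(len(memory))  # no notion of "my" address space
        bit = rng.randint(0, 1)
        zeros_before = memory.count(0)
        old_bit = memory[address]
        buggy_write(memory, address, bit)
        if memory.count(0) < zeros_before:
            buggy_write(memory, address, old_bit)  # revert writes that lost 0s
    return memory

# The first 16 cells are "its own" memory; the rest belongs to someone else.
OWN_SIZE = 16
memory = [1] * 64
run_zero_optimizer(memory, steps=5000)
zeros_outside = memory[OWN_SIZE:].count(0)
```

With enough steps, `zeros_outside` ends up greater than zero: the loop has overwritten memory it was never meant to touch, without containing any representation of that fact.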
I'm having a hard time figuring out how the examples of evolution and markets map on to the agent above. They're both processes that seem to take advantage of the fact that there are agents inside them trying to optimize goals, and are the results of these agents and processes coming into conflict with each other and reality. While I can imagine the unselfaware agent simulating these instances, I don't see a way to actually view it as a process made up of agents trying to achieve goals.
Yes, I kind of over-answered the question: I'm saying even if you have some optimization process that is not even an agent you can get instrumental "behaviors" that are not safe, let alone an agent that is not self-aware.
If you haven't already seen it, you might want to check out: https://www.gwern.net/Tool-AI
Also see these discussions of Drexler's Comprehensive AI Services proposal, which also emphasizes non-agency:
Thanks! A lot of my thinking here is that I just really believe that, once people find the right neural architecture, self-supervised learning on the internet is going to rocket-launch all the way to AGI and beyond, leaving little narrow AI services in the dust.
The way I read it, Gwern's tool-AI article is mostly about self-improvement. I'm proposing that the system will be able to guide human-in-the-loop "self"-improvement. That's kinda slower, but probably good enough, especially since eventually we can (hopefully) ask the oracle how to build a safe agent.
I'm not sure I understand what you mean here. I linked Gwern's post because your proposal sounded very similar to me to Holden's Tool AI concept, and Gwern's post is one of the more comprehensive responses I can remember coming across.
Is it your impression that what you're proposing is substantially different from Holden's Tool AI?
When I say that your idea sounded similar, I'm thinking of passages like this (from Holden):
Compared to this (from you):
Your "Then, we go combine therapies X and Y ourselves" sounds to me a lot like Holden's separation of (1) calculating the best action vs. (2) either explaining (in the case of Tool AI) or executing (in the case of Agent AI) the action. In both cases you seem to be suggesting that we can reap the rewards of superintelligence but retain control by treating the AI as an advisor rather than as an agent who acts on our behalf.
Am I right that what you're proposing is pretty much along the same lines as Holden's Tool AI -- or is there some key difference that I'm missing?
Thanks for that, this is helpful. Yes, same genre for sure. According to Eliezer's response to Holden, tool AI is a synonym of "non-self-improving oracle". Anyway, whatever we call it, my understanding of the case against tool AI is that (1) we don't know how to make a safe tool AI (part of Eliezer's response), and (2) even if we could, it wouldn't be competitive (Gwern's response).
I'm trying to contribute to this conversation by giving an intuitive argument for how I'm thinking that both these objections can be overcome, and I'm also trying to be more specific about how the tool AI might be built and how it might work.
More specifically, most (though not 100%) of the reasons that Gwern said tool AI would be uncompetitive are in the category of "self-improving systems are more powerful". So that's why I specifically mentioned that a tool AI can be self-improving ... albeit indirectly and with a human in the loop.
What's always bothered me about "self-unaware AI" scenarios is the fact that they are literally reading everything we write right now. How is their "generated world-model" processing us talking about it? There might be a wake-up somewhere in there. That point, of course, is the entire argument about when a machine passes the Turing test or other models to become indistinguishable from sentient life.
Just as if it were looking into the universe from outside it, it would presumably be able to understand anything in the world, as a (third-person) fact about the world, including that humans have self-awareness, that there is a project to build a self-unaware AI without it, and so on. We would program it with strict separation between the world-model and the reflective, meta-level information about how the world-model is being constructed and processed. Thus the thought "Maybe they're talking about me" cannot occur, there's nothing in the world-model to grab onto as a referent for the word "me". Exactly how this strict separation would be programmed, and whether you can make a strong practical world-modeling system with such a separation, are things I'm still trying to understand.
A possible (not realistic) example is: We enumerate a vast collection of possible world-models, which we construct by varying any of a vast number of adjustable parameters, describing what exists in the world, how things relate to each other, what's going on right now, and so on. Nothing in any of the models has anything in it with a special flag labeled "me", "my knowledge", "my actions", etc., by construction. Now, we put a probability distribution over this vast space of models, and initialize it to be uniform (or whatever). With each timestep of self-supervised learning, a controller propagates each of the models forward, inspects the next bit in the datastream, and adjusts the probability distribution over models based on whether that new bit is what we expected. After watching 100,000 years of YouTube videos and reading every document ever written, the controller outputs the one best world-model. Now we have a powerful world-model, in which there are deep insights about how everything works. We can use this world-model for whatever purpose we like. Note that the "learning" process here is a dumb thing that just uses the transition rules of the world-models, it doesn't involve setting up the world-models themselves to be capable of intelligent introspection. So it seems to me like this process ought to generate a self-unaware world model.
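Here is a tiny sketch of that procedure in Python, just to pin down the moving parts. It is a toy under heavy assumptions: the "world-models" are every deterministic rule mapping the last two bits to the next bit, the `noise` parameter is something I made up so that a wrong prediction doesn't zero out a model entirely, and the names `make_models` and `learn` are hypothetical. The models are plain lookup tables with no "me" entry by construction; all the bookkeeping lives in the outer controller.

```python
from itertools import product

def make_models(order=2):
    """Enumerate every rule mapping the last `order` bits to a predicted next bit."""
    contexts = list(product([0, 1], repeat=order))
    return [dict(zip(contexts, outputs))
            for outputs in product([0, 1], repeat=len(contexts))]

def learn(stream, models, order=2, noise=0.1):
    """Controller: reweight the models after each observed bit, then return the best."""
    weights = [1.0 / len(models)] * len(models)  # uniform prior over models
    for t in range(order, len(stream)):
        context = tuple(stream[t - order:t])
        observed = stream[t]
        for i, model in enumerate(models):
            # Each model is "propagated forward" by looking up its prediction.
            likelihood = 1 - noise if model[context] == observed else noise
            weights[i] *= likelihood
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to a distribution
    best = max(range(len(models)), key=lambda i: weights[i])
    return models[best], weights[best]

# A repeating datastream stands in for "100,000 years of YouTube".
stream = [0, 0, 1, 1] * 10
best_model, best_weight = learn(stream, make_models())
```

The controller here is exactly the "dumb thing" described above: it multiplies and normalizes weights, and the returned `best_model` is just a table of transition rules.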
Here's a more general way of thinking about what you're saying that I find useful: It's not that self-awareness is the issue per se, it's that you're avoiding building an agent - by a specific technical definition of "agent."
Agents, in the sense I think is most useful when thinking about AI, are things that choose actions based on the predicted consequences of those actions.
On some suitably abstract level of description, agents must have available actions, they must have some model of the world that includes a free parameter for different actions, and they must have a criterion for choosing actions that's a function of what the model predicts will happen when it takes those actions. Agents are what is dangerous, because they steer the future based on their criterion.
What you describe in this post is an AI that has actions (outputting text to a text channel), and has a model of the world. But maybe, you say, we can make it not an agent, and therefore a lot less dangerous, by making it so that there is no free parameter in the model for the agent to try out different actions, and instead of choosing its action based on consequences, it will just try to describe what its model predicts.
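A minimal sketch of the distinction, in Python, with all names and the toy dynamics invented for illustration: `predict` is a world-model with no action parameter, `predict_given_action` adds the free parameter, `agent_step` picks an action by scoring predicted consequences with a criterion, and `oracle_step` just reports the prediction.

```python
def predict(state):
    """Toy world-model with no free parameter for actions: the world drifts by +1."""
    return state + 1

def predict_given_action(state, action):
    """The same toy model, plus a free parameter for the system's own action."""
    return predict(state) + action

def agent_step(state, actions, criterion):
    """Agent: choose the action whose predicted consequence scores best."""
    return max(actions, key=lambda a: criterion(predict_given_action(state, a)))

def oracle_step(state):
    """Non-agent: describe what the model predicts; no action is evaluated."""
    return predict(state)

# An agent steering toward state 3, versus an oracle that only reports.
chosen = agent_step(0, actions=[-1, 0, 2], criterion=lambda s: -abs(s - 3))
report = oracle_step(0)
```

Here `chosen` is the action that steers the predicted state to the target, while `report` is only the model's forecast. The difference is structural: `oracle_step` never calls anything that takes an `action` argument.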
Thinking about it in terms of agents like this explains why "knowing that it's running on a specific computer" has the causal powers that it does - it's a functional sort of "knowing" that involves having your model of the world impacted by your available actions in a specific way. Simply putting "I am running on this specific computer" into its memory wouldn't make it an agent - and if it chooses what text to output based on predicted consequences, it's an agent whether or not it has "I am running on this specific computer" in its memory.
So, could this work? Yes. It would require a lot of hard, hard work on the input/output side, especially if you want reliable natural language interaction with a model of the entire world, and you still have to worry about the inner optimizer problem, particularly e.g. if you're training it in a way that creates an incentive for self-fulfilling prophecy or some other implicit goal.
The basic reason I'm pessimistic about the general approach of figuring out how to build safe non-agents is that agents are really useful. If your AI design requires a big powerful model of the entire world, that means that someone is going to build an agent using that big powerful model very soon after. Maybe this tool gives you some breathing room by helping suppress competitors, or maybe it makes it easier to figure out how to build safe agents. But it seems more likely to me that we'll get a good outcome by just directly figuring out how to build safe agents.
This is helpful, thanks. It sounds very reasonable to say "if it's just programmed to build a model and query it, it doesn't matter if it's self-aware". And it might be true, although I'm still a bit uncertain about what can happen when the model-builder includes itself in its models. There are also questions of what properties can be easily and rigorously verified. My hope here is that we can flag some variables as "has information about the world-model" and other variables as "has information about oneself", and we can do some kind of type-checking or formal verification that they don't intermingle. If something like that is possible, it would seem to be a strong guarantee of safety even if we didn't understand how the world-modeler worked in full detail.
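To gesture at what I mean by flagging variables, here is a very rough Python sketch. It is a runtime check, not the static type-checking or formal verification I'm actually hoping for, and the `Tagged` class and the `"world"` / `"self"` labels are hypothetical names of my own. The only point is that a computation mixing the two kinds of variable gets rejected.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    """A value flagged with the kind of information it carries."""
    value: float
    domain: str  # "world" (world-model information) or "self" (information about oneself)

    def __add__(self, other):
        if not isinstance(other, Tagged):
            return NotImplemented
        if other.domain != self.domain:
            # Refuse any computation that intermingles the two domains.
            raise TypeError(f"cannot mix {self.domain!r} and {other.domain!r} variables")
        return Tagged(self.value + other.value, self.domain)

# Combining two world-model variables is allowed.
total = Tagged(1.0, "world") + Tagged(2.0, "world")

# Combining a world-model variable with a self variable is rejected.
try:
    Tagged(1.0, "world") + Tagged(2.0, "self")
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```

A real version would have to cover every operation, not just addition, and would ideally be checked before the program runs rather than while it runs.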
RE your last paragraph: I don't think there is any point ever when we will have a safe AI and no one is incentivized (or even curious) to explore alternate designs that are not known to be safe (but which would be more powerful if they worked). So we need to get to some point of development, and then sound the buzzer and start relying 100% on other solutions, whether it's OpenAI becoming our benevolent world dictators, or hoping that our AI assistants will tell us what to do next, or who knows what. I think an oracle that can answer arbitrary questions and invent technology is good enough for that. Once we're there, I think we'll be more than ready to move to that second stage...