A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.
The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.
You wouldn't want to run an AI with the values I lay out, but at least it is pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand/address the difficulties.
(Posted at the request of cousin_it.)
Cool! Thanks to you, we finally seem to have a viable attack on the problem of FAI, by defining goals in terms of hypothetical processes that could output a goal specification, like brain emulations with powerful computers. Everyone please help advance this direction of inquiry :-)
One potential worry is that the human subject must be over some minimal threshold of intelligence for this scheme to work. A village fool would fail. How do I convince myself that the threshold is below the "reasonably intelligent human" level?
I disagree; reading Paul's description made it clear to me how superficial it is to want to solve a problem by creating an army of uploads to do it for you. You may as well just try to solve the problem here and now, rather than hoping to outsource it to a bunch of nonexistent human-simulations running on nonexistent hardware. The only reason to consider such a baroque way of solving a problem is if you expect to be very pressed for time and yet to also have access to superdupercomputing power. You know, the world is hurtling towards singularity, no-one has crossed the finish line but many people are getting close, your FAI research organization manages to get a hold of a few petaflops on which to run a truncated AIXI problem-solver... and now you can finally go dig up that scrap of paper on which your team wrote down, years before, the perfectly optimal wish: "I want you, FAI-precursor, to do what the ethically stabilized members of our team would do, if they had hundreds of years to think about it, and if they...", etcetera.
It's a logically possible scenario, but is it remotely likely? This absolutely should not be the paradigm for a successful implementation of FAI or CEV. It's just a wacky contingency that you might want to spend a little time thinking about. The plan should be that un-uploaded people will figure out what to do. They will surely make intensive use of computers, and there may be some big final calculation in which the schematics of human genetic, neural and cultural architecture are the inputs to a reflective optimization process; but you shouldn't imagine that, like some bunch of Greg Egan characters, the researchers are going to successfully upload themselves and then figure out the logistics and the mathematics of a successful CEV process. It's like deciding to fix global warming by building a city on the moon that will be devoted to the task of solving global warming.
The plan doesn't require a truncated AIXI-like solver with lots of hardware. It's a goal specification you can code directly into a self-improving AI that starts out with weak hardware. "Follow the utility function that program X would output if given enough time" doesn't require the AI to run program X, only to reason about the likely outputs of program X.
It doesn't in principle require this, but might in practice, in which case the AI might eat the universe if that's the amount of computational resources necessary to compute the results of running program X. That is a potential downside of this plan.
Well, on the dark, sardonic upside, it might find it convenient to eat the people in the process of using their minds to compute a CEV-function. Infinite varieties of infinite hell-eternities for everyone!
Could you express your objection more precisely than "it's wacky"?
(Note that the hypothetical process probably doesn't even output a goal specification, it just outputs a number, which the AI tries to control.)
The hope is something like: "We can reason about the outputs of this process, so an AI as smart as us can reason about the outputs of this process (perhaps by realizing it can watch or ask us to learn about the behavior of the definition)." The bar the AI has to meet is to realize basically what is going on in the definition. This assumes, of course, not only that the process actually works, but that it works for the reasons we believe it works.
I have doubts about this, and it seems generally important to think about whether particular models of UDT could make these inferences. The sketchy part seems to be the AI looking out at the world and drawing mathematical inferences from the assumption that its environment is a draw from a universal distribution. There are two ways you could imagine this going wrong:
I expect UDT properly formulated avoids both issues, and if it doesn't we are going to have to take these things on anyway. But it needs some thought.
Maybe the AI could also ask the hypothetical human upload for the right universal distribution to use? Such an AI probably wouldn't work out of the box, but there could be some sort of bootstrapping scheme...
Certainly, but it's not even clear what that distribution should be over, and whether considering probability distributions at all is the right thing to do. The initial AGI needs some operating principles, but these principles should allow fundamental correction, as the research on the inside proceeds (for which they have to be good enough initially).
That's certainly a nice answer to the question "what's the domain of the probability distribution and the utility function?" You just say that the utility function is a parameterless definition of a single number. But that seems to lead to the danger of the AI using acausal control to make the hypothetical human output a definition that's easy to maximize. Do you think that's unlikely to happen?
ETA: on further thought, this seems to be a pretty strong argument against the whole "indirect normativity" idea, regardless of what kind of object the hypothetical human is supposed to output.
The outer AGI doesn't control the initial program if the initial program doesn't listen to the outer AGI. It's a kind of reverse AI box problem: the program that the AGI runs shouldn't let the AGI in. This certainly argues that the initial program should take no input, and output its result blindly. That it shouldn't run the outer AGI internally is then the same kind of AI safety consideration as that it shouldn't run any other UFAI internally, so it doesn't seem like an additional problem.
Of course, once you are powerful enough you let the AGI in (or you define a utility function which invokes the AI, which is really no difference), because this is how you control it.
I don't understand your comment. [Edit: I probably do now.] You output something that the outer AGI uses to optimize the world as you intend; you don't "let the AGI in". You are living in its goal definition, and your decisions determine the AGI's values.
Are you perhaps referring to the idea that AGI's actions control its goal state? But you are not its goal state, you are a principle that determines its goal state, just as the AGI is. You show the AGI where to find its goal state, and the AGI starts working on optimizing it.
What's the difference between the simulated humans outputting a utility function U' which the outer AGI will then try to maximize, and the simulated humans just running U' and the outer AGI trying to maximize the value returned by the whole simulation (and hence U')? In the case of the latter, you're "letting the AGI in" by including its definition (explicitly or implicitly via something like the universal prior) in the definition of U'.
OK, I see what Paul probably meant. Let's say "utility value", not "utility function", since that's what we mean. I don't think we should be talking about "running utility value", because utility might be something given by an abstract definition, not state of execution of any program. As I discussed in the grandparent, the distinction I'm making is between the outer AGI controlling utility value (which it does) and outer AGI controlling the simulated researchers that prepare the definition of utility value (which it shouldn't be allowed to for AI safety reasons). There is a map/territory distinction between the definition of utility value prepared by the initial program and the utility value itself optimized by the outer AGI.
(Also, "utility function" might be confusing especially for outsiders who are used to "utility function" meaning a mapping from world states to utility values, whereas Paul is using it to mean a parameterless computation that returns a utility value.)
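To make the two senses of "utility function" concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: `WorldState`, the `welfare` key, and both function names are invented for illustration and are not taken from Paul's essay.

```python
from typing import Dict

# Hypothetical stand-in for "a description of a world state".
WorldState = Dict[str, float]


def conventional_utility(world: WorldState) -> float:
    """The usual sense: a mapping from world states to utility values."""
    # Clamp an invented "welfare" score into the interval [0, 1].
    return min(1.0, max(0.0, world.get("welfare", 0.0)))


def parameterless_utility() -> float:
    """The sense used in this thread: a computation that takes no
    arguments and simply returns a single number."""
    # Everything the value depends on is fixed inside the definition.
    fixed_world: WorldState = {"welfare": 0.7}
    return conventional_utility(fixed_world)
```

The first function can be evaluated on many candidate worlds; the second has nothing to plug in, so an agent can only influence its value to the extent that the definition itself refers (explicitly or implicitly) to things the agent affects.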
I think Paul is thinking that the utility definition that the simulated humans come up with is not necessarily a definition of our actual values, but just something that causes the outer AGI to self-modify into an FAI, and for that purpose it might be enough to define it using a programming language.
I think Paul's intuition here is that the simulated humans (or enhanced humans and/or FAIs they build inside the simulation) may find it useful to "blur the lines". In other words, the distinction you draw is not a fundamental one but just a safety heuristic that the simulated researchers may decide to discard or modify once they become "powerful enough". For example they may decide to partially simulate the outer AGI or otherwise try to reason about what it might do given various definitions of U' the simulation might ultimately decide upon, once they understand enough theory to see how to do this in a safe way.
Good point. Thanks.
A simplified version of this proposal that applies more generally is to implement a WBE-based FAI project using AGI, before normal WBE becomes available. This way, you only need to figure out how to build a "yield control to this-here program as it would be after N cycles" AGI, and the rest of the FAI project design can be left to the initial WBE-ed team. This would possibly have a side effect of initially destroying the rest of the human world, since the AGI won't be guided by our values before the N cycles of internal simulation complete (it would care about simulating the internal environment, not about saving external human lives; it might turn out to be FAI-complete to make it safe throughout), but if that internal environment can be guaranteed to be given control afterwards (when a FAI project inside it is complete), then eventually this is a plausible winning plan.
The "yield control" AGI seems problematic to define, much less to imagine existing. Do you think it is plausible?
This is certainly the line of thought that led me here, however.
The general "goal" of this system is to make sure the world is controlled by the decisions of the program produced by the initial program, so the simulation of the initial program and yielding of control to its output are subgoals of that. I don't see how to make that work, but I don't see why it can't be done either. The default problems of AGI instrumental drives get siphoned off into the possible initial destructiveness (but don't persist later), and the problem of figuring out human values gets delegated to the humans inside the initial program (which is the part that seems potentially much harder to solve pre-WBE than whatever broadly goal-independent decision theory is needed to make this plan well-defined).
This seems to me like the third plausible winning plan, the other two being (1) figuring out and implementing pre-WBE FAI and (2) making sure the WBE shift (which shouldn't be hardware-limited for the first-runner advantage to be sufficient) is dominated by a FAI project. Unless this somehow turns out to be FAI-complete (which is probable, given that almost any plan is), it seems strictly easier than pre-WBE FAI, although it has a significant cost of possibly initially destroying the current world, which is the problem that the other two plans don't (by default) have.
"Destroy the world" doesn't seem to be a big problem to me. Paul's (proposed) AGI can be viewed as not directly caring about our world, but only about a world/computation defined by H and T (let's call that HT World). If it can figure out that its preferences for HT World can be best satisfied by it performing actions that (as a side effect) cause it to take over our world, then it seems likely that it can also figure out that it should take over our world in a non-destructive way. I'm more worried about whether (given realistic amounts of initial computing power) it would manage to do anything at all.
I'm not talking about Paul's proposal in particular, but about eventually-Friendly AIs in general. Their defining feature is that they have a correct Friendly goal given by a complicated definition that leaves a lot of logical uncertainty about the goal until it's eventually made more explicit. So we might explore the neighborhood of normal FAIs, increasing the initial logical uncertainty about their goal, so that they become more and more prone to initial pursuit of generic instrumental gains at the expense of what they eventually realize to be their values.
Oh, please reinterpret my comment as replying to this comment of yours. (That one is specifically talking about Paul's proposal, right?)
Well, yes, but I interpreted the problem of impossibly complicated value definition as the eFAI* (which does seem to be a problem with Paul's specific proposal, even if we assume that it theoretically converges to a FAI) never coming out of its destructive phase, and hence possibly just eating the universe without producing anything of value, so "destroy the world" is in a sense the sole manifestation of the problem with a hypothetical implementation of that proposal...
[* eFAI = eventually-Friendly AI, let's coin this term]
Pre-WBE FAI can initially destroy the world too, if its utility function specification is as complex as CEV for example.
Right, but it's not clear that this is a natural flaw for other possible FAI designs, in a way that it seems to be for this one. Here, we start the AGI without an understanding of human values; only the output of the initial program, which will be available some time in the future, is expected to have that understanding, so there is nothing to morally guide the AGI in the meantime. By "solving FAI" I meant that we do get some technical understanding of human values when the thing is launched, which might be enough to avoid the carnage.
(This whole line of reasoning creates a motivation for thinking about Oracle AI boxing. Here we have AGIs that become FAIs eventually, but might be initially UFAI-level dangerous.)
My proposal seems like the default way to try and implement that. But I definitely agree that it's reasonable to think about this aspect of the problem more.
I think it's useful to separate the problem of pointing the external AGI to the output of a specific program, and the problem of arranging the structure of the initial program so that it produces a desirable output. The structure of the initial program shouldn't be overengineered, since its role is to perform basic philosophical research that we don't understand how to do, so the focus there should be mainly on safeguards that promote desirable research dynamics (and prevent UFAI risks inside the initial program).
On the other hand, the way in which the AGI uses the output of the initial program (i.e. the notion of preference) has to be understood from the start; this is the decision theory part. (You can't stop the AGI: it will forever optimize according to the output of the initial program, so it should be possible to give its optimization target a definition that expresses human values, even though we might not know at that point what kind of notion human values are an instance of.) I don't think it's reasonable to force a "real-valued utility function" interpretation or the like on this; it should be much more flexible (for example, it's not even clear what "worlds" should be optimized).
The approach I was taking was: the initial program does whatever it likes (including in particular simulation of the original AI), and then outputs a number, which the AI aims to control. This (hopefully) allows the initial program to control the AI's behavior in detail, by encouraging it to (say) thoroughly replace itself with a different sort of agent. Also, note that the ems can watch the AI reasoning about the ems, when deciding how to administer reward.
I agree that the internal structure and the mechanism for pointing at the output should be thought about largely separately. (Although there are some interactions.)
I don't think this can be right. I expect it's impossible to create a self-contained abstract map of the world (or human value, of which the world is an aspect); the process of making observations has to be part of the solution.
(But even if we are talking about a "number", what kind of number is that? Why would something simple like a real number be sufficient to express relevant counterfactual utility values that are to be compared? I don't know enough to make such assumptions.)
Regarding the 100 years informal example at the beginning:
Let's say I kidnapped you, put you in a box, and told you that you would spend the rest of your life figuring out the answer to some obvious moral question ("Should we cure Brad of cancer?") I could imagine you might become quite resentful, might suffer from some sort of mental illness after the first 10 years, and might give the wrong answer just out of spite.
Of course, if we take a snapshot of your brain and put it through the same experience in a simulation, it will feel exactly the same way.
Obviously you can't just cut out whatever parts of the brain might be responsible for resentment/boredom/mental illness. Specifying some sort of entertainment for you might cause you to ignore solving the problem to focus on the entertainment, or the entertainment might change your values. Terminating your simulation as soon as you found the answer to the question could cause you to focus on your impending death.
Maybe you could give us an informal summary of your proposal, to harvest cognitive surplus from those who don't want to read the entire thing and possibly find some relevant common sense sticking point that didn't occur to you? I suspect I would find an informal discussion where specifics were hashed out as necessary more interesting to read. Of course the eventual goal is formalization, but best practices for formalization may not be best practices for harvesting cognitive surplus.
To put it another way, you might wish to make sure you had something worth formalizing (through lots of informal discussion) before taking the trouble to formalize it.
"In(1)" looked like the natural log to me, for what it's worth.
I do give a (somewhat) concise overview, in the section headed 'The Proposal.'
The 100 years example is not quite right, in that in the real example we put you in an environment with unlimited computational power. One of the first things you are likely to do is create an extremely pleasant environment for yourself to work in (another is to create a community to work alongside you, either out of emulations of yourself, emulations of others, or reconstructed from simulations of worlds like Earth), while you figure out what should be done.
That said, there are other ways that your values might change through this process. For example, one of the first things you would hypothetically realize, if you ended up in an environment with some apparently infinitely powerful computers, is that you are in a hypothetical situation. I don't know about you, but if I discovered I was in a clearly hypothetical situation, my views about the moral relevance of people in hypotheticals would change (hypothetically).
(I'm going based on this informal explanation you gave, now.)
It seems like a system such as you describe could exhibit chaotic behavior. Since the person is going to have to create an environment for themselves from scratch, initial decisions about what their environment should be like could impact subsequent decisions, etc. (Also, depending on the level of detail that the person has to specify, reversibility of decisions, etc. maybe the task of creating an environment for oneself would change their character substantially, e.g. like tripping on an untested psychedelic drug.)
Of course, the utility function produced could be "good enough".
Here's another objection. Putting someone in an environment they control completely which has unlimited computational power could lead to some pretty unexpected stuff. Wireheading would be easy, and it could start innocuously: I decide I could use an attractive member of my preferred gender to keep me company and things get worse from there. If you put someone in this situation it seems like there'd be tremendous incentives to procrastinate indefinitely on solving the problem at hand.
It seems like under ideal conditions we could empirically test the behavior of this sort of exotic "utility function" and make sure it was meeting basic sanity checks.
Creating the initial community requires the first person to create ems of other people who do not initially exist within the simulation and organize their society in a way that makes them productive and prevents them from undergoing value drift. The first person must also prevent value drift in themselves over the entire time period that they are solving these other problems. This is far too hard for one person, and organizing a group of uploads that can do so is nontrivial.
Here's a stronger version of my previous criticism of this argument. Suppose instead of giving neuroimaging data to the AI and defining H in terms of a brute force search for a model that can explain the neuroimaging data, we give it a cryptographic hash of the neuroimaging data (of sufficient length to avoid possible collisions), and modify the definition of H to first perform a brute force search to recover the neuroimaging data from the hash. In this case, we can still say that torturing is probably bad according to U, but the AI obviously can't arrive at this conclusion from the formal definition of U alone (assuming it can't break the cryptographic hash). It seems clear that we can't safely assume that "the U-maximizer can carry out any reasoning that we can carry out".
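The hash thought experiment can be illustrated at toy scale. The sketch below is only an analogy under loud assumptions: a four-byte string over a two-letter alphabet stands in for the neuroimaging data, SHA-256 stands in for "a cryptographic hash", and the helper names are invented here.

```python
import hashlib
from itertools import product
from typing import Optional


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of the data (stand-in for the published hash)."""
    return hashlib.sha256(data).hexdigest()


def recover_by_brute_force(
    target: str, length: int, alphabet: bytes = b"ab"
) -> Optional[bytes]:
    """Try every string of the given length over the alphabet and return
    the first one whose digest matches the target, or None if none does."""
    for candidate in product(alphabet, repeat=length):
        data = bytes(candidate)
        if digest(data) == target:
            return data
    return None


# Toy stand-in for the neuroimaging data; real data would be vastly longer.
secret = b"abba"
published = digest(secret)
recovered = recover_by_brute_force(published, length=len(secret))
```

The search visits up to `len(alphabet) ** length` candidates, so the definition stays perfectly well-defined while becoming infeasible to evaluate at realistic sizes. That is the gap the comment points at: what the definition pins down mathematically versus what a bounded reasoner can actually extract from it.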
In order to "carry out reasoning inspired by human models", the AI has to first form a usable model of a human. I don't have a strong argument that the U-maximizer can't do this from the original definition of U (i.e., from plaintext neuroimaging data), but intuitively it seems implausible given an amount of computing power the U-maximizer might initially have access to (say, within a couple orders of magnitude of the amount needed to do standard WBE). I don't see how "simply asking" could work either. What kind of questions might the U-maximizer ask, and how can we answer it, given that we don't know how to formalize what "torture" means?
It occurs to me that we can view this proposal through the "acausal trade" lens, instead of the "indirect normativity" lens, which might give another set of useful intuitions. What Paul is proposing can be seen as creating an AGI that can exert causal control in our world but cares only about a very specific world / platonic computation defined by H and T, while the inhabitants of that world (simulated humans and their descendants) care a lot about our world but have no direct influence over it. The hoped-for outcome is for the two parties to do a trade: the AGI turns our world into a utopia in return for the inhabitants of the HT World satisfying its preferences (i.e., having the computation return a high utility value).
From this perspective, Paul's proposal can also be seen as an instance of what I called "Instrumentally Friendly AI" (on the decision theory list):
I'm slightly worried that even formally specifying an "idealized and unbounded computer" will turn out to be Oracle-AI-complete. We don't need to worry about it converting something valuable into computronium, but we do need to ensure that it interacts with the simulated human(s) in a friendly way. We need to ensure that it doesn't modify the human to simplify the process of explaining something. The simulated human needs to be able to control what kinds of minds the computer creates in the process of thinking (we may not care, but the human would). And the computer should certainly not hack its way out of the hypothetical via being thought about by the FAI.
We are trying to formally specify the input-output behavior of an idealized computer, running some simple program. The mathematical definition of a Turing machine with an input tape would suffice, as would a formal specification of a version of Python running with unlimited memory.
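As a rough illustration of how small such a specification can be, here is a minimal Turing machine interpreter in Python. It is a sketch, not the definition used in the essay: the transition-table encoding, the state names `q0` and `halt`, and the blank symbol `_` are all assumptions made for this example, and a real Python process is of course memory-bounded where the idealized machine is not.

```python
from typing import Dict, Tuple

# (state, symbol read) -> (next state, symbol to write, head move of -1, 0 or +1)
Transitions = Dict[Tuple[str, str], Tuple[str, str, int]]


def run_turing_machine(
    transitions: Transitions,
    tape_input: str,
    start: str = "q0",
    halt: str = "halt",
    blank: str = "_",
) -> str:
    """Run the machine on the input until it reaches the halt state and
    return the tape contents with surrounding blanks stripped.
    Does not return if the machine never halts."""
    # A dict models a tape that is unbounded in both directions.
    tape = dict(enumerate(tape_input))
    state, head = start, 0
    while state != halt:
        symbol = tape.get(head, blank)
        state, tape[head], move = transitions[(state, symbol)]
        head += move
    cells = [tape.get(i, blank) for i in range(min(tape), max(tape) + 1)]
    return "".join(cells).strip(blank)


# Example machine (invented for illustration): flip every bit, halt on blank.
flip_bits: Transitions = {
    ("q0", "0"): ("q0", "1", +1),
    ("q0", "1"): ("q0", "0", +1),
    ("q0", "_"): ("halt", "_", 0),
}
flipped = run_turing_machine(flip_bits, "0110")
```

The interpreter only fixes input-output behavior; it says nothing about friendliness, which is the sense in which the idealized computer is meant to be a passive object rather than an agent.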
Okay, I see that that's what you're saying. The assumption then (which seems reasonable but needs to be proven?) is that the simulated humans, given infinite resources, would either solve Oracle AI [edit: without accidentally creating uFAI first, I mean] or just learn how to do stuff like create universes themselves.
There is still the issue that a hypothetical human with access to infinite computing power would not want to create or observe hellworlds. We here in the real world don't care, but the hypothetical human would. So I don't think your specific idea for brute-force creating an Earth simulation would work, because no moral human would do it.
Another problem with this proposal: what if egoism is the right morality, or at least that our "actual" values have a large selfish component? If that is the case, then presumably the simulated humans inside the proposed AI will eventually realize it, and then cause the AI to value them (the simulations) instead of us (biological humans).
It seems difficult for approaches to FAI based on indirect normativity (e.g., CEV) to capture selfish values (with the correct indexical references), so it's not just a problem for this specific proposal, but I don't seem to recall seeing the issue mentioned anywhere before.
After reading the article, I thought I understood it, but from reading the comments, this appears to be an illusion. Yet, I think I should be able to understand, it doesn't seem to require any special math or radically new concepts... My understanding is below. Could someone check it and tell me where I'm wrong?
The proposal is to define a utility function U(), which takes as input some kind of description of the universe, and returns the evaluation of this description, a number between 0 and 1.
The function U is defined in terms of two other functions - H and T, representing a mathematical description of a specific human brain, and an infinitely powerful computing environment.
Although the U-maximizing AGI will not be able to actually calculate U, it will be able to reason about it (that is, prove theorems), which should allow it to perform at least some actions, which would therefore be provably friendly.
Possible objection: the proposal appears to fix U in terms of a mathematical description H of some current human brain. What happens in the future, when humans significantly self-modify?
H is used to start off the process. H is then able to interact with a hypothetical unbounded computer, which may eventually run many (potentially quite exotic) simulations, among them of the sorts of minds humans self-modify into.
But your point (as I understood it) is that all these exotic simulations don't actually get run, they mostly just get reasoned about. If this is so, then as we go farther into future, U becomes increasingly obsolete.
If I cause an infinite loop in my Python shell, my computer does not crash and I can just kill the shell and start what I was doing over from the beginning. I don't understand why this wouldn't work in your scenario.
You kill the shell not when it is in an infinite loop, but when it takes more than a few seconds to run. We can set up such a safety net, allowing the human to run anything that takes (say) less than a million years to run, without risk of crashing. This is the sort of thing I was referring to by "some care."
Ultimately we do want the human to be able to run arbitrarily expensive subroutines, which prohibits using any heuristic of the form "stop this computation if it goes on for more than N steps."
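A step-budget safety net of this kind can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that computations are written as generators that yield once per step; `run_with_budget`, `count_to` and `loop_forever` are hypothetical names, not anything from the essay.

```python
from typing import Any, Callable, Generator, Tuple

Computation = Callable[[], Generator[None, None, Any]]


def run_with_budget(make_computation: Computation, max_steps: int) -> Tuple[bool, Any]:
    """Advance the computation at most max_steps times.
    Returns (True, result) if it finished, (False, None) if it was cut off."""
    computation = make_computation()
    for _ in range(max_steps):
        try:
            next(computation)
        except StopIteration as finished:
            return True, finished.value
    return False, None


def count_to(n: int) -> Computation:
    """A finite computation: sum 0..n-1, yielding once per step."""
    def computation() -> Generator[None, None, int]:
        total = 0
        for i in range(n):
            total += i
            yield
        return total
    return computation


def loop_forever() -> Generator[None, None, None]:
    """A computation that never finishes."""
    while True:
        yield


short_result = run_with_budget(count_to(5), max_steps=100)
long_result = run_with_budget(count_to(1000), max_steps=100)
stuck_result = run_with_budget(loop_forever, max_steps=100)
```

Here `long_result` and `stuck_result` are indistinguishable to the caller: a fixed cutoff cannot tell an expensive but finite computation from a genuinely infinite one, which is why a bare "stop after N steps" rule conflicts with letting the human run arbitrarily expensive subroutines.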
What if we keep this heuristic but also define T to have an instruction that is equivalent to calling a halting-problem oracle (with each call counting as one step)? Of course that makes it harder for the outer AGI to reason about how to maximize its utility, but the increase in difficulty doesn't seem very large relative to the difficulty in the original proposal.
If we weight the simulation-people equally to the real-people, the computer will either choose to neglect us entirely and expend its resources more efficiently by using them on the simulation people or choose to treat us as the test subjects for some of its simulated projects, because there are going to be more simulation people than real people (unless the quality/quantity of the simulation isn't very good, which has its own problems).
I don't like that idea, but I also don't like the idea of weighting certain people higher than others just because some are in a machine.
I think there's a problem because of the range of possible humans that could be simulated within the AI.
What if it decides to simulate Hitler?