I've been working on defining "optimizer", and I'm wondering what people consider to be or not be an optimizer. I'm planning on talking about it in my own post, but I'd like to ask here first because I'm a scaredy cat.
I know a person or AI refining plans or hypotheses would generally be considered an optimizer.
What about systems that evolve? Would an entire population of a type of creature be its own optimizer? It's optimizing for the genetic fitness of its individuals, so I don't see why it wouldn't be. Evolutionary programming just emulates that process, and it's definitely an optimizer.
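To make "evolutionary programming just emulates it" concrete, here's a minimal toy sketch of an evolutionary loop I wrote for illustration (the genome representation, mutation rule, and fitness function are all my own arbitrary choices, not from any particular library):

```python
import random

def evolve(fitness, genome_len=20, pop_size=30, generations=50, seed=0):
    """Minimal evolutionary loop: mutate bitstring genomes and keep the fittest."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # Each parent produces one mutated child (a single random bit flipped).
        children = []
        for parent in pop:
            child = parent[:]
            i = rng.randrange(genome_len)
            child[i] ^= 1
            children.append(child)
        # Selection: keep the pop_size fittest out of parents + children.
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

# "Genetic fitness" here is just the number of 1-bits in the genome.
best = evolve(fitness=sum)
```

Nothing in the loop "knows" it's optimizing; variation plus selection is the whole mechanism, which is part of why it seems natural to call a biological population an optimizer too.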
How do you draw the line between systems that evolve and systems that don't? Is a sterile rock an optimization process? I suppose there is potential for the rocks' contents to evolve: maybe eventually, through the right collisions, life could arise in a pile of rocks, and then it would evolve like normal. Are rocks not optimizers, or are they just really weak, slow optimizers that take a really, really long time to come up with a configuration that isn't equally horrible for self-reproduction as everything else in the rock?
What about systems that tend towards stable configurations? Imagine you have a box with lots of action figures and props, and you're bouncing it around. I think such a system would, if feasible, tend towards stable configurations of its contents. For example, initially, the action figures might be all scattered about and bouncing everywhere. But eventually, the system might settle the action figures into secure, stable positions. Maybe Spiderman would end up with his arm securely lodged in a prop and his adjustable spider web accessory securely wrapped around a miniature street light. Is that system an optimizer? What if the toys also come with little motors and a microcontroller, and bouncing them around can change their programs? If you tried this for a sufficiently long time, you could potentially end up with your action figures producing clever strategies to maintain their configuration despite shakes and to avoid further changes to their programs.
What about annealing? Basically, annealing involves putting a piece of metal in an oven and heating it for a while, which changes its durability and ductility. Normally, people wouldn't think of a piece of metal as an optimizer. However, there's an optimization algorithm called "simulated annealing", and it works pretty much the same way as actual annealing: actual annealing is a process in which the components of the metal end up in low-energy states. I don't know how I could justify calling a simulated annealing program an optimizer and not calling actual annealing an optimizer.
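For comparison, here's a bare-bones simulated annealing loop (my own illustrative sketch, with made-up parameters, not any standard library's version). Like the metal, it accepts "worse" states with a probability that shrinks as the temperature drops:

```python
import math
import random

def simulated_annealing(energy, state, neighbor, t0=10.0, cooling=0.95, steps=500, seed=0):
    """Accept a neighboring state if it lowers the energy, or with probability
    exp(-delta/T) if it raises it; the temperature T decays toward zero."""
    rng = random.Random(seed)
    temp = t0
    for _ in range(steps):
        candidate = neighbor(state, rng)
        delta = energy(candidate) - energy(state)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = candidate
        temp *= cooling
    return state

# Toy problem: minimize (x - 3)^2 over the integers, starting far away.
result = simulated_annealing(
    energy=lambda x: (x - 3) ** 2,
    state=50,
    neighbor=lambda x, rng: x + rng.choice([-1, 1]),
)
```

The loop is just "things settle into low-energy states, with thermal noise that fades over time", which is exactly the informal description of the physical process.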
To what extent is people's intuition of "optimizer" well-defined? At first I clearly saw people and AIs as optimizers, but I don't know about the cases above.
Am I right that "optimizer" is a fuzzy concept?
And is it well-defined? I imagined so, but I've been thinking about a lot of things that my intuition doesn't classify one way or the other.
How much should we care about our notion of "optimizer"? It seems like the main point of the concept is that we know that some optimizers have the potential to be super powerfully or dangerously good at something. So what if we just directly focused on how to tell if a system has the potential to be super dangerously or powerfully good at something?
I agree that intelligent agents have a tendency to seek power and that that is a large cause of what makes them dangerous. Agents could potentially cause catastrophes in other ways, but I'm not sure if any are realistic.
As an example, suppose an agent creates powerful self-replicating nanotechnology that makes a pile of paperclips, the agent's goal. However, since the agent didn't want to spend the time engineering a way to stop the replication, the nanobots eat the world.
But catastrophes like this would probably also be dealt with by AUP-preservation, though. At least, if you use the multi-equation impact measure. (If the impact equation only concerns the agent's ability to achieve its own goal, maybe it would let the world be consumed after putting up a nanotech-proof barrier around all of its paperclip manufacturing resources. But again, I don't know if that's realistic.)
I'm also concerned agents would create large, catastrophic changes to the world in ways that don't increase their power. For example, an agent who wants to make paperclips might try to create nanotech that assembles the entire world into paperclips. It's not clear to me that this would increase the agent's power much. The agent wouldn't necessarily have any control over the bots, so it would be limited to using them for just its one utility function. And if the agent is intelligent enough to easily discover how to create such technology, actually creating it doesn't sound like it would give it more power than it already had.
If the material for the bots is scarce, then making them prevents the AI from making other things, so they might provide a net decrease to the agent's power. And once the world is paperclips, the agent would be limited to just having paperclips available, which could make it pretty weak.
I don't know if you consider the described scenario to be power-seeking. At least, I don't think it would count as an increase under the agent's impact equation.
I hadn't thought about the distinction between gaining and using resources. You can still wreak havoc without getting resources, though, by using them in a damaging way. But I can see why the distinction might be helpful to think about.
It still seems to me that an agent using equation 5 would pretty much act like a human imitator for anything that takes more than one step, so that's why I was using it as a comparison. I can try to explain my reasoning if you want, but I suppose it's a moot point now. And I don't know if I'm right, anyways.
Basically, I'm concerned that most nontrivial things a person wants will take multiple actions, so in most of the steps the AI will be motivated mainly by the reward given in the current step for reward-shaping reasons (as long as it doesn't gain too much power). And doing the action that gives the most immediate reward for reward-shaping reasons sounds pretty much like doing whatever action the human would think is best in that situation, which is probably what the human (and mimic) would do.
Is there much the reduced-impact agent with reward shaping could do that an agent using human mimicry couldn't?
Perhaps it could improve over mimicry by being able to consider all actions, while a human mimic would only in effect consider the actions a human would. But I don't think there are usually many single-step actions to choose from, so I'm guessing this isn't a big benefit. Could the performance improvement come from better understanding the current state than mimics could? I'm not sure when this would make a big difference, though.
I'm also still concerned the reduced-impact agent would find some clever way to cause devastation while avoiding the impact penalty, but I'm less concerned about human mimics causing devastation. Are there other, major risks to using mimicry that the reduced-impact agent avoids?
I have a question about attainable utility preservation. Specifically, I read the post "Attainable Utility Preservation: Scaling to Superhuman", and I'm wondering how an agent using the attainable utility implementation in equations 3, 4, and 5 could actually be superhuman. I've been misunderstanding things and mis-explaining things recently, so I'm asking here instead of on the post for now to avoid wasting an AI safety researcher's time.
The equations incentivize the AI to take actions that will provide an immediate reward in the next timestep, but penalizes its ability to achieve rewards in later timesteps.
But what if the only way to receive a reward is to do something that will only give a reward several timesteps later? In realistic situations, when can you ever actually accomplish the goal you're trying to accomplish in a single atomic action?
For example, suppose the AI is rewarded for making paperclips, but all it can do in the next timestep is start moving its arm towards wire. If it's just rewarded for making paperclips, and it can't make a paperclip in the next timestep, the AI would instead focus on minimizing impact and not do anything.
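To make the worry concrete, here's a toy sketch of a one-step "immediate reward minus attainable-utility penalty" rule. This is my own hypothetical stand-in for the post's equations (the numbers, the `noop` baseline, and the absolute-difference penalty are all assumptions of mine), just to show why zero immediate reward pushes toward inaction:

```python
def aup_choice(actions, reward, q_aux, baseline="noop", lam=1.0):
    """Pick the action maximizing immediate reward minus a penalty on how
    much the action shifts attainable utility relative to the baseline.
    Hypothetical illustration, not the actual equations from the post."""
    def score(a):
        penalty = abs(q_aux[a] - q_aux[baseline])
        return reward[a] - lam * penalty
    return max(actions, key=score)

# Toy numbers: moving the arm toward the wire gives no immediate reward
# (paperclips only come several steps later), but it does shift attainable
# utility, so it gets penalized and the no-op wins.
actions = ["noop", "move_arm"]
reward = {"noop": 0.0, "move_arm": 0.0}
q_aux = {"noop": 5.0, "move_arm": 6.0}
chosen = aup_choice(actions, reward, q_aux)
```

Under these made-up numbers, `move_arm` scores 0 − 1 = −1 while `noop` scores 0, so the agent sits still even though the arm move is the only path to the reward.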
I know you could adjust the reward function to reward the AI for doing things that you think will help it accomplish your primary goal in the future. For example, you know the AI moving its arm towards the wire is useful, so you could reward that. But then I don't see how the AI could do anything clever or superhuman to make paperclips.
Suppose the AI can come up with a clever means of making paperclips by creating a new form of paperclip-making machine. Presumably, it would take many actions to build before it could be completed. And the person responsible for giving out rewards wouldn't be able to anticipate that the exact device the AI is making would be helpful, so I don't see how the person giving out the rewards could get the AI to make the clever machine, or do anything else clever.
Then wouldn't such a reduced-impact agent pretty much just do whatever a human would think is most helpful for making paperclips? But then wouldn't the AI pretty much just be emulating human, not superhuman, behavior?
Thanks for the link. It turns out I missed some of the articles in the sequence. Sorry for misunderstanding your ideas.
I thought about it, and I don't think your agent would have the issue I described.
Now, if the reward function was learned using something like a universal prior, then other agents might be able to hijack the learned reward function to make the AI misbehave. But that concern is already known.
Thanks for the response.
In my comment, I imagined the agent used evidential or functional decision theory and cared about the actual paperclips in the external state. But I'm concerned other agent architectures would result in misbehavior for related reasons.
Could you describe what sort of agent architecture you had in mind? I'm imagining you're thinking of an agent that learns a function for estimating future state, percepts, and reward based on the current state and the action taken. And I'm imagining the system uses some sort of learning algorithm that attempts to find sufficiently simple models that accurately predict its past rewards and percepts. I'm also imagining it either has some way of aggregating the results of multiple similarly accurate and simple models or some way of choosing one to use. This is how I would imagine someone would design an intelligent reinforcement learner, but I might be misunderstanding.
I realized both explanations I gave were overly complicated and confusing. So here's a newer, hopefully much easier to understand, one:
I'm concerned a reduced-impact AI will reason as follows:
"I want to make paperclips. I could use this machinery I was supplied with to make them. But the paperclips might be low quality, I might not have enough material to make them all, and I'll have some impact on the rest of the world, potentially large ones due to chaotic effects. I'd like something better.
What if I instead try to take over the world and make huge numbers of modified simulations of myself? The simulations would look indistinguishable from the non-simulated world, but would have many high-quality invisible paperclips pre-made so as to perfectly accomplish my goal. And doing the null action would be set to have the same effects as trying to take over the world to make simulations, so the plans in the simulations would still be low-impact. This way, an AI in one of the simulations would have the potential to perfectly accomplish its goal and have almost zero impact. If I execute this plan, then I'd almost certainly be in a simulation, since there would be vast numbers of simulated AIs but only one original, and all would perceive the same things. So, if I execute this plan I'll almost certainly perfectly accomplish my goal and have effectively zero impact. So that's what I'll do."
Oh, I'm sorry, I looked through posts I read to see where to add the comment and apparently chose the wrong one.
Anyways, I'll try to explain better. I hope I'm not just crazy.
An agent's beliefs about what world it's currently in influence its plans. But its plans also have the potential to influence its beliefs about what world it's currently in. For example, if the AI originally thinks it's not in a simulation, but then plans on trying to make lots of simulations of itself, then it would think it's more likely that it currently is in a simulation. Similarly, if the AI decides against trying to make simulations, then it would probably place higher probability on not currently being in a simulation.
So, to summarize: the AI's beliefs about the current world influence its current plan, but the AI's current plan potentially influences its beliefs about the current world, which can in turn influence the AI's plan, which can further modify its beliefs, and so on. Unless the AI's plan and beliefs just kept fluctuating, I imagine the AI would come to an equilibrium: a state in which the AI's current plan is the optimal one for its current beliefs about what sort of world it's in, and in which deciding to execute the plan leaves it believing it's in a world for which executing that plan is optimal. There might be reasonable AI architectures that don't allow the AI's plans to update its beliefs as I described, but they also seem vulnerable to my concern in a different way, so I won't talk about them.
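The loop I'm imagining can be sketched abstractly like this (all the plan names, utilities, and belief-update rules below are made up purely for illustration; nothing here is from an actual agent design):

```python
def find_equilibrium(plans, utility, belief_given_plan, max_iters=100):
    """Iterate: pick the best plan under current beliefs, then update beliefs
    to reflect having chosen that plan, until plan and beliefs stop changing."""
    plan = plans[0]  # arbitrary starting point
    belief = belief_given_plan(plan)
    for _ in range(max_iters):
        new_plan = max(plans, key=lambda p: utility(p, belief))
        new_belief = belief_given_plan(new_plan)
        if new_plan == plan and new_belief == belief:
            return plan, belief  # plan is optimal under its own implied beliefs
        plan, belief = new_plan, new_belief
    return plan, belief

# Toy version of the two equilibria: choosing the takeover plan makes
# "I'm in a simulation" likely, which in turn makes the takeover plan look great.
plans = ["make_paperclips", "takeover_and_simulate"]

def belief_given_plan(p):
    return {"in_simulation": 0.99 if p == "takeover_and_simulate" else 0.01}

def utility(p, b):
    if p == "takeover_and_simulate":
        # By construction, in a simulation this plan is zero-impact and perfect.
        return 10.0 * b["in_simulation"]
    return 5.0

plan1, _ = find_equilibrium(plans, utility, belief_given_plan)
# Starting from the takeover plan instead lands in the other equilibrium.
plan2, _ = find_equilibrium(plans[::-1], utility, belief_given_plan)
```

Each starting point is self-confirming here, which is the worrying part: both fixed points are internally consistent, and nothing in the loop itself picks out the benign one.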
Anyways, here is one potential equilibrium on the AI's beliefs and plan:
"I'm not in a simulation, or at least I'm in one that's faithful to the 'real', original Earth. Thus, the obvious, common-sense ideas about what is and isn't impactful most people have are correct. So I'll go and create these paperclips, and not try to create nanotech for world domination or anything, because that would be high-impact. Since I'm not going to try taking over the world and making simulations of myself, I've affirmed by belief that I'm not in such a simulation."
Here is another potential equilibrium:
"I'm probably in a simulation created the the AI in the original, non-simulated Earth that took over the world and created a bunch of simulations of itself, including me. The original AI made the simulations different in two ways from the non-simulated Earth. One way is that the simulation checks to see if I'm executing the plan that, if it were to be executed on the non-simulated Earth, would result in the original AI taking over the world and making lots of simulations of itself. If I am, the simulation will be set so my goal is perfectly accomplished. Since I'm making paperclips, that means as long as I do that, this entire room will fill up with paperclips. The other way the simulations is different from the non-simulated Earth is that it checks to see what would (causally) happen if the AI does the plan that would have resulted in a world-takeover in the original Earth, and makes those exact results happen if the AI instead does the 'null' action. Thus, my (causal) impact from doing the world-takeover plan in the simulation is low. So, as long as I act like the AI who took over the world in the original Earth and made the simulations did, I will perfectly accomplish my goal and extremely low impact. This is was I want, so this is what I'll do. Since I've decided to do this, my belief that I'm a such a simulation is affirmed, so I still think my plan is a good one."
The second equilibrium has the potential to result in perfect goal achievement and almost no causal impact, so I suspect that's the one the AI would go with. And that's the one that gets all of us killed.
Am I correct that the counterfactual environments for computing impact in a reduced-impact agent would need to include acausal connections, or that the AI would need some sort of constraint on the actions or hypotheses considered, for the impact measure to work correctly?
If it doesn't consider acausal impacts, then I'm concerned the AI would consider this strategy: act like you would if you were trying to take over the world in base-level reality. Once you succeed, act like you would if you were in base-level reality and trying to run an extremely large number of modified simulations of yourself. The simulations would be modified so that if the simulated AI acts as if it were trying to take over the world, it will actually have no causal effect on the simulation except for having its goal in the simulation accomplished. Having zero causal impact and its goal perfectly accomplished are things the AI wants.
I see two equilibria in what the AI would do here. One is that it comes to the conclusion that it's in such a simulation and acts as if it's trying to take over the world, thus potentially making it reasonable for the AI to think it's in such a simulation. The other is that the AI concludes it's not in such a simulation and acts as it should. I'm not sure which equilibrium the AI would choose, but I haven't thought of a strong reason it would go with the latter.
Perhaps other agents could stop this by running simulations of the AI in which trying to take over the world would have super high causal impact, but I'm not sure how we could verify this would happen.