A putative new idea for AI control; index here.
EDIT: The definition of satisficer I'm using here is the informal one of "it tries to achieve a goal, without making huge changes to the universe" rather than "it's an agent that has utility u and threshold t". If you prefer the standard notation, think of this as a satisficer where t is not fixed, but dependent on some facts in the world (such as the ease of increasing u). I'm trying to automate the process of designing and running a satisficer: people generally choose t given facts about the world (how easy it is to achieve, for instance), and I want the whole process to be of low impact.
I've argued that the concept of a satisficer is underdefined, because there are many pathological behaviours all compatible with satisficer designs. This contradicts the intuitive picture that many people have of a satisficer, which is an agent that does the minimum of effort to reach its goal, and doesn't mess up the outside world more than it has to. And if it can't accomplish the goals without messing up the outside world, it would be content not to.
In the spirit of "if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach", can we spell out what features we would want from such a satisficer? Preferably in a simpler format than our intuitions.
It seems to me that if you had a proper u-satisficer S(u), then for many (real or hypothetical) v-maximisers M(v) out there, M(v) would find that:
- Changing S(u) to S(v) is of low value.
- Similarly, utility function trading with S(u) is of low value.
- The existence or non-existence of S(u) is of low information content about the future.
- The existence or non-existence of S(u) has little impact on the expected value of v.
- S(u) would not effectively aid M(u), a u-maximiser.
- S(u) would not effectively resist M(-u), a u-minimiser.
- S(u) would not have large impacts (if this can be measured) for low utility gains.
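To make the fourth criterion concrete, here is a toy sketch rather than a proposal: given some model of the outcome distribution with and without S(u), M(v) can compare its expected value in the two cases. The outcome names, the probabilities, and the function `impact_on_v` are all hypothetical and not part of the post.

```python
from typing import Callable, Dict

Distribution = Dict[str, float]  # outcome name -> probability


def expected_value(dist: Distribution, v: Callable[[str], float]) -> float:
    """Expected value of v under a finite outcome distribution."""
    return sum(p * v(outcome) for outcome, p in dist.items())


def impact_on_v(with_s: Distribution, without_s: Distribution,
                v: Callable[[str], float]) -> float:
    """Absolute shift in E[v] attributable to the existence of S(u)."""
    return abs(expected_value(with_s, v) - expected_value(without_s, v))


# Made-up toy world with three outcomes (assumption, for illustration only).
without_s = {"status_quo": 0.90, "mild_change": 0.10, "upheaval": 0.00}
with_s = {"status_quo": 0.85, "mild_change": 0.15, "upheaval": 0.00}
v_table = {"status_quo": 1.0, "mild_change": 0.9, "upheaval": -10.0}

shift = impact_on_v(with_s, without_s, lambda outcome: v_table[outcome])
```

A small `shift` for many different v is the property being asked for; the hard part, which this sketch assumes away, is obtaining the two distributions.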
A subsequent post will present an example of a satisficer using some of these ideas.
A few other much less-developed thoughts about satisficers:
- Maybe require that it learns what variables humans care about, and doesn’t set them to extreme values – try and keep them in the same range. Do the same for variables humans may care about or that resemble values they care about.
- Model the general procedure of detecting unaccounted-for variables set to extreme values.
- We could check whether it would kill all humans cheaply if it could (or replace certain humans cheaply), i.e. give it hypothetical destructive superpowers with no costs to using them, and see whether it would use them.
- Have the AI establish a measure/model of optimisation power (without reference to any other goal), then put itself low on that.
- Trade between satisficers might be sub-Pareto.
- When talking about different possible v's in the first four points above, it might be better to use something other than an expectation over different v's, as that could result in edge cases dominating - maybe a soft minimum of value across different v instead.
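For the last point, one common form of soft minimum is the negative log-mean-exp; the post does not say which form is intended, so treat this as an assumption. The sketch below shows how a single extreme v dominates the plain average but barely moves the soft minimum. The sharpness parameter `beta` and the sample values are made up.

```python
import math
from typing import Sequence


def soft_min(values: Sequence[float], beta: float = 1.0) -> float:
    """Soft minimum: -(1/beta) * log(mean(exp(-beta * x))).

    Lies between min(values) and the arithmetic mean; approaches the
    minimum as beta grows and the mean as beta shrinks towards zero.
    """
    m = min(values)
    # Shift by the minimum so every exponent is <= 0 (no overflow).
    mean_exp = sum(math.exp(-beta * (x - m)) for x in values) / len(values)
    return m - math.log(mean_exp) / beta


# Hypothetical per-v values; the last one is an edge case.
values = [1.0, 1.2, 0.9, 1000.0]
average = sum(values) / len(values)   # dominated by the edge case
softened = soft_min(values, beta=5.0)  # stays near the typical values
```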
I don't think "satisficer" is a good name for the concept you're describing here. For one thing, I think it's weird to see a satisficer with just a function (I presume that's what u is?) as its input--where's the threshold of acceptability?
I think a well-specified satisficer as traditionally conceived looks like this:
Thus we can consider some problem (basically, the set X that describes all possible solutions), and throw a satisficer S(u,u_0,(x_t)) at it, and know exactly which solution x will be picked by S. (I think my notation for the proposal sequence is awkward; if there's a finite set of solutions, we can describe it as a permutation of that set, but that implies some restrictions I don't like. I'm using the parentheses so I can differentiate (x_t), the set of all of them, and x_t, the t-th one.)
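A minimal sketch of how S(u, u_0, (x_t)) might be written. It assumes, from the notation only, that the satisficer walks the proposal sequence in order and returns the first x_t with u(x_t) > u_0. The function name `satisfice` and the toy numbers are hypothetical.

```python
from typing import Callable, Iterable, Optional, TypeVar

X = TypeVar("X")


def satisfice(u: Callable[[X], float], u_0: float,
              proposals: Iterable[X]) -> Optional[X]:
    """Return the first proposal whose utility exceeds the threshold u_0.

    Returns None if the sequence is exhausted without an acceptable
    solution. The result is fully determined by (u, u_0, proposals).
    """
    for x in proposals:
        if u(x) > u_0:
            return x
    return None


# Toy problem: solutions are numbers and utility is the number itself.
proposals = [3, 7, 12, 9]
picked = satisfice(lambda x: float(x), 8.0, proposals)
```

Here `picked` is 12, the first proposal above the threshold, even though 9 would also have been acceptable; the order of the proposal sequence decides the outcome.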
Some obvious generalizations suggest themselves: the desirability threshold, rather than just looking at u(x_t), could look at some function of u(x_i) and x_i where i ranges from 0 to t. x_{t+1} could depend on u(x_i) rather than just t. x_{t+1} could have a source of randomness (in which case we now have a distribution over solutions, rather than a single known solution).
We can then talk about other properties. Perhaps we want to describe satisficers that consider the entire solution space X as "complete," or the ones that only consider each solution at most once as "nonrepeating." Most importantly for you, though, a "well-behaved" satisficer has the property that the proposal sequence x_t is arranged in ascending order by some measure n(x_t) of effort and negative externalities. Maintaining the correct level of illumination in a room by adjusting the curtains comes very early in the ordering; launching a satellite to block the sun comes much, much later; launching a mission to destroy the sun comes so late in the ordering it is almost certain it will not be reached.
This property suggests that a well-behaved satisficer is basically solving two optimization problems simultaneously, on u and n. (The way you'd actually write it is min n(x) s.t. u(x)>u_0, x\in X.)
To maintain the computational simplicity that satisficers are useful for, though, we wouldn't want to write it as a minimization problem. This requires us to use a less restrictive version of 'well-behaved,' where the proposal function is 'generally' increasing in effort rather than strictly nondecreasing in effort.
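A rough sketch of the contrast, using the illumination example with made-up scores. `constrained_min` solves min n(x) s.t. u(x) > u_0 over the whole of X, while `first_acceptable` just walks a proposal list that is only roughly sorted by n. Both function names and all numbers are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical (u, n) scores: u is illumination quality, n is effort plus
# negative externalities. The numbers are invented for illustration.
PLANS: Dict[str, Tuple[float, float]] = {
    "do nothing": (0.10, 0.0),
    "adjust curtains": (0.90, 1.0),
    "install dimmer": (0.85, 3.0),
    "launch sun-blocking satellite": (0.95, 1e6),
}


def u(plan: str) -> float:
    return PLANS[plan][0]


def n(plan: str) -> float:
    return PLANS[plan][1]


def constrained_min(solutions: Iterable[str], u_0: float) -> Optional[str]:
    """Exact version: scan all of X, keep feasible plans, minimise n."""
    feasible = [x for x in solutions if u(x) > u_0]
    return min(feasible, key=n) if feasible else None


def first_acceptable(proposals: Iterable[str], u_0: float) -> Optional[str]:
    """Cheap version: stop at the first proposal that clears u_0."""
    return next((x for x in proposals if u(x) > u_0), None)


exact = constrained_min(PLANS, u_0=0.8)

# 'Generally' increasing in n: the dimmer is proposed before the curtains.
loosely_ordered = ["do nothing", "install dimmer", "adjust curtains",
                   "launch sun-blocking satellite"]
cheap = first_acceptable(loosely_ordered, u_0=0.8)
```

The cheap version returns a slightly costlier plan than the exact one, but it never has to evaluate the satellite at all, which is the trade being described.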
Thanks for your suggestion.
Thinking aloud here:
Say I'm an agent that wants to increase u, but not "too strongly" (this whole thing is about how to formalize "too strongly"). Couldn't I have a way of estimating how much other agents who don't care about u might still care about what I do, and minimize that? i.e. avoid anything that would make other agents want to model my working as something more than "wants to increase u".
(back in agent-designer shoes) So we could create a "moderate increaser" agent, give it a utility function u and inform it of other agents trying to increase v, w, x, y, and somehow have it avoid any strategies that would involve "decision theory interaction" with those other agents; i.e. threats, retaliation, trade ... maybe something like "those agents should behave as if you didn't exist".
Not too far away from my ideas here: http://lesswrong.com/r/discussion/lw/lv0/creating_a_satisficer/
I generally think of satisficing as being a local property of a utility function. I may want to maximize my utility, but locally speaking I am indifferent among many, many things, so I satisfice with respect to those things.
I think that looking at it in this way might be more productive.
Aside from that,
It is likely to do the former up to a point, and would certainly do the latter after a point. As Vaniver noted, you haven't specified that point.
Quite agreed. A satisficer is an agent with a negative term in its utility function for disruption and effort. It's not limiting its utility, it's limited in how much stuff it can do without reducing utility.
Dagon, despite agreeing, you seem to be saying the opposite to Luke_A_Somers (and your position is closer to mine).
I'm not sure how that's supposed to work. S(u) won't do much as long as the desirability threshold is attained, but if M(-u) comes along and makes this difficult, S(u) would use everything it has to stop M(-u). Are you using something beyond a desirability threshold? Something where S(u) stops not when the solution is good enough, but when it gets difficult to improve?
See my edit above. "would use everything it has to..." is the kind of behaviour we want to avoid. So I'm following the satisficing intuition more than the formal definition. I can justify this by going meta: when people design/imagine satisficers, they generally look around at the problem, see what can be achieved, how hard it is, etc., and then set the threshold. I want to automate "set a reasonable threshold" as well as "be a reasonable satisficer" in order to achieve "don't have a huge impact on the world".
The general problem seems difficult. There are plenty of ways to model satisficing behavior.
Suppose we are given a utility function u. If we have some information about the maximum, here are some possibilities:
- The satisficer follows some maximization algorithm and terminates once some threshold is attained (say, 90% of max(u)).
- The satisficer uses a probabilistic algorithm that terminates after a certain threshold is probably attained.
- The satisficer has some threshold, but continues searching as long as it's "cheap" to do so, where "cheap" depends on some auxiliary cost function. It could, for example, measure the number of steps since attaining the best known value thus far or it could measure the rate of improvement and stop once that dips too low.
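As a sketch of the third possibility, here is a search that keeps going past the threshold only while improvements keep arriving. "Cheap" is modelled, as one assumption among many possible ones, by the number of steps since the best value was last improved. The name `satisfice_with_patience` and the `patience` parameter are hypothetical.

```python
from typing import Callable, Iterable, Optional, TypeVar

X = TypeVar("X")


def satisfice_with_patience(candidates: Iterable[X],
                            u: Callable[[X], float],
                            threshold: float,
                            patience: int) -> Optional[X]:
    """Return the best candidate seen when the search stops.

    The search stops once the threshold has been met AND `patience`
    consecutive candidates have failed to improve on the best value.
    If the candidates run out first, the best one seen is returned,
    even if it is below the threshold.
    """
    best: Optional[X] = None
    best_u = float("-inf")
    stale = 0  # steps since the best value last improved
    for x in candidates:
        value = u(x)
        if value > best_u:
            best, best_u, stale = x, value, 0
        else:
            stale += 1
        if best_u >= threshold and stale >= patience:
            break
    return best


result = satisfice_with_patience([1, 4, 6, 7, 7, 6, 9], float,
                                 threshold=5.0, patience=2)
```

With these inputs the search stops at 7 and never looks at 9: the threshold was cleared at 6, one further improvement came for free, and then two stale steps ended the search.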
If we don't know much about the maximum, we can't very well have a predefined threshold, but there are some strategies:
- The satisficer follows some algorithm for maximizing u and terminates after, say, N steps and chooses the best value found. Or one can add some adaptive behavior and say that it terminates if there has been no improvement in a certain time interval.
- In a constrained problem where searching is expensive, the satisficer accepts the first solution satisfying the constraints.
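Both threshold-free strategies are short enough to sketch directly. The function names and sample inputs are hypothetical, and "steps" is taken to mean "candidates evaluated", which is an assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Optional, Sequence, TypeVar

X = TypeVar("X")


def best_of_n(candidates: Iterable[X], u: Callable[[X], float],
              n_steps: int) -> Optional[X]:
    """Evaluate at most n_steps candidates and return the best one seen."""
    return max(islice(candidates, n_steps), key=u, default=None)


def first_feasible(candidates: Iterable[X],
                   constraints: Sequence[Callable[[X], bool]]) -> Optional[X]:
    """Return the first candidate satisfying every constraint, else None."""
    return next((x for x in candidates if all(c(x) for c in constraints)),
                None)


budgeted = best_of_n([3, 8, 5, 10], float, n_steps=3)
feasible = first_feasible([1, 4, 6, 9],
                          [lambda x: x % 2 == 0, lambda x: x > 4])
```

`budgeted` is 8 because the budget runs out before 10 is reached, and `feasible` is 6, the first even number above 4. Neither function looks at anything besides u or the constraints, so neither has any notion of side effects.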
What the above have in common is that the satisficer is basically carrying out some optimization algorithm with alternative stopping conditions. As such, it may exhibit some of the drawbacks one finds with optimizers. Maybe there's a model that's not built on an optimizer that I haven't thought of.
There's a tacit assumption here that messing with the outside world involves effort. Without it, you can get all kinds of unexpected behavior. If the satisficer is fine with "make 10 paperclips", but the difference between that and "make 10 paperclips and kill all humans" is small, the satisficer may well go with whatever showed up first in its search algorithm.
That's just a relic of mathematics though. If you're trying to find the maximum of, say, the size of the projection of some 10-dimensional set on a particular 2-dimensional plane, there may be a lot of room in those extra dimensions that doesn't impact the actual value to be maximized. Trying to find something merely close to the max would seem to exacerbate this issue.
Yep, that's the problem. That's why I'm trying to address the issue directly by penalising things like "...and kill all humans".
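A toy version of this exchange, with made-up numbers. A threshold-only satisficer accepts whichever acceptable plan its search proposes first, while subtracting a penalty for impact rules out the destructive plan. The `impact` field and the penalty weight are hypothetical stand-ins; defining a real impact measure is the open problem.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Plan:
    name: str
    paperclips: int
    impact: float  # hypothetical side-effect measure, assumed given


def first_acceptable(plans: Iterable[Plan],
                     score: Callable[[Plan], float],
                     threshold: float) -> Optional[Plan]:
    """Return the first plan whose score reaches the threshold."""
    return next((p for p in plans if score(p) >= threshold), None)


# The destructive plan happens to come first in the search order.
plans = [
    Plan("make 10 paperclips and kill all humans", 10, impact=1e9),
    Plan("make 10 paperclips", 10, impact=0.0),
]


def plain(p: Plan) -> float:
    return float(p.paperclips)


def penalised(p: Plan, weight: float = 1.0) -> float:
    # weight is an arbitrary illustrative choice
    return p.paperclips - weight * p.impact


unpenalised_choice = first_acceptable(plans, plain, threshold=10)
penalised_choice = first_acceptable(plans, penalised, threshold=10)
```

Under `plain` both plans score exactly 10, so search order alone decides and the destructive plan wins; under `penalised` only the harmless plan reaches the threshold.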