Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This [EDIT: final version, presentation] is a design for a conservative agent that I worked on with Marcus Hutter. Conservative agents are reluctant to make unprecedented things happen. The agent also approaches at least human-level reward acquisition.

The agent is made conservative by being pessimistic. Pessimism is tuned by a scalar parameter β. When the agent is more pessimistic, it is more conservative. The more pessimistic it is made, the less likely it is to exceed human-level reward acquisition (almost certainly, though I haven’t tried to prove that), and the more observations it requires before it starts acting. It is not clear to me how useful the agent would be at the level of pessimism where we could be confident it is safe. At 0 pessimism, it is similar to AIXI (although technically stronger, because AIXI doesn’t have the performance guarantee of matching or exceeding human-level reward acquisition).

The agent has access to a human mentor, and at every timestep, it can either act or defer to the mentor. The only assumption we make is that the true environment belongs to the agent’s (countable) set of world-models. First a bit of math, then the main results.

A bit of math:

An event is a subset of interaction histories that end with an action. Letting A, O, and R be the action, observation, and reward spaces, an event E ⊆ (A × O × R)* × A. An element of E would look like (a₁, o₁, r₁, a₂, o₂, r₂, ..., aₜ). Below, I will say “[to] take an action which immediately causes event E”, by which I mean “to take an action such that now the interaction history is an element of E.”
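To make this concrete, here is a minimal sketch of an event as a membership predicate over interaction histories. The names and the example event are my own illustration, not the paper’s notation.

```python
# Minimal sketch (my illustration): an interaction history is a sequence of
# (action, observation, reward) triples followed by one final action, and an
# "event" is a set of such histories, represented here as a membership test.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, float]      # (action, observation, reward)
History = Tuple[List[Triple], str]   # (past triples, final action)
Event = Callable[[History], bool]    # is this history an element of E?

# Hypothetical example event E: "the final action presses the big red button".
presses_red_button: Event = lambda h: h[1] == "press_red_button"

# "Taking an action which immediately causes event E" means: once the action
# is appended, the interaction history is an element of E.
def immediately_causes(past: List[Triple], action: str, event: Event) -> bool:
    return event((past, action))
```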

The main results:

1) (At least) mentor-level reward acquisition,

2) Probability of querying the mentor → 0,

and this one will take some time to read, but I figured I’d spell it all out properly:

3) For any complexity class C (defined on normal Turing machines, not e.g. non-deterministic ones), we can construct a set of world-models such that for all events E in the complexity class C and for all ε > 0, there exists a β < 1 such that: when the pessimistic agent has this model class and a pessimism of at least β, the following holds with probability at least 1 − ε: for the whole lifetime of the pessimistic agent, if E has never happened before, the agent will not take an action which immediately causes event E; if the event ever happens, it will be because the mentor took the action that made it happen.
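For readers who like the quantifiers spelled out, the statement has roughly the following shape. This is my schematic paraphrase, not the paper’s exact notation; ε and β are used as above.

```latex
% Schematic paraphrase of the Probably Respecting Precedent Theorem (symbols as in the prose above).
\forall E \in C,\ \forall \varepsilon > 0,\ \exists \beta < 1 \text{ such that, for any pessimism} \ge \beta,
\Pr\Big[\ \forall t :\ E \text{ unprecedented at time } t \;\Rightarrow\;
          \text{the agent's action at } t \text{ does not immediately cause } E \ \Big] \ \ge\ 1 - \varepsilon .
```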

Comment 1: “The agent takes an action which eventually causes E (with probability at least p)” is an event itself, and it happens immediately if the agent takes the action in question, so the theorem above applies. But this event may not be in a complexity class that E is in.

Comment 2: The less simple E is, the higher β has to be.

Some other interesting results follow from the “Probably Respecting Precedent Theorem” above. One of these is, roughly, that (using the same there-exists and for-alls as in the main theorem) it is not instrumentally useful for the agent to cause E to happen. Note there is no need for the qualifier “immediately”.

Here is an event E that makes the Probably Respecting Precedent Theorem particularly interesting: “Everyone is probably about to be dead.” If we want the agent to avoid an unprecedented bad outcome, all we have to know is an upper bound on the computational complexity of the bad outcome. We don’t have to know how to define the bad outcome formally.

Here’s how the agent works. It has a belief distribution over countably many world-models. A world-model is something that gives a probability distribution over observations and rewards given an interaction history (that ends in an action). It also has a belief distribution over countably many mentor-models. A mentor-model is a policy: a probability distribution over actions given an interaction history. At each timestep, it takes the top world-models in its posterior until the total posterior weight of those world-models is at least β. The pessimistic value of a policy is the minimum, over those world-models, of the expected future discounted reward when following that policy in that world-model. The agent picks a policy which maximizes the pessimistic value.

Either it follows this policy, or it defers to the mentor. To decide, it samples a world-model and a mentor-model from its posterior; then it calculates the expected future discounted reward when following the mentor-model (which is a policy) in that world-model. If this value is greater than the pessimistic value plus positive noise, the agent defers to the mentor. It also defers to the mentor if the pessimistic value is 0. This is called the zero condition, and to ensure that it only happens finitely often, the actual rewards we give have to always be greater than some ε > 0. (If for some reason we failed to do this, despite that being in no one’s interest, the only results that would break are performance results, not safety results.)
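Here is a minimal sketch of that per-timestep decision. The function names, the choice of noise distribution, and the `value(policy, world_model, history)` callable (standing in for the expectimax computation of expected future discounted reward) are placeholders of mine, not the paper’s definitions.

```python
import random

def sample(posterior):
    """Sample a model in proportion to its posterior weight."""
    models, weights = zip(*posterior.items())
    return random.choices(models, weights=weights, k=1)[0]

def decide(posterior_worlds, posterior_mentors, history, beta, policies, value):
    """One illustrative decision step of the pessimistic agent (paraphrase)."""
    # Take the top world-models by posterior weight until their total weight
    # is at least beta.
    ranked = sorted(posterior_worlds.items(), key=lambda kv: kv[1], reverse=True)
    top_models, mass = [], 0.0
    for model, weight in ranked:
        top_models.append(model)
        mass += weight
        if mass >= beta:
            break

    # Pessimistic value of a policy: the minimum over the top world-models of
    # the expected future discounted reward under that policy.
    def pessimistic_value(policy):
        return min(value(policy, m, history) for m in top_models)

    best_policy = max(policies, key=pessimistic_value)
    v_pess = pessimistic_value(best_policy)

    # Sample a world-model and a mentor-model, and estimate the value of
    # following the mentor-model (a policy) in the sampled world-model.
    v_mentor = value(sample(posterior_mentors), sample(posterior_worlds), history)

    # Defer if the mentor's value exceeds the pessimistic value plus positive
    # noise, or if the pessimistic value is zero (the "zero condition").
    noise = random.expovariate(1.0)  # placeholder: any positive noise, for illustration
    if v_mentor > v_pess + noise or v_pess == 0:
        return "defer_to_mentor"
    return best_policy
```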

Here is an intuitive argument that some might find more persuasive than the formal results. An advanced RL agent run on a computer in Oxford might come up with two hypotheses about how the environment produces rewards: (1) the environment produces rewards according to how satisfied the human operators are with my performance; (2) the environment produces rewards according to which keys are pressed on the keyboard of a certain computer. An agent which assigns sufficient weight to (2) will take over the world, if possible, to make sure those keys are pressed right. A pessimistic agent (one pessimistic enough that (1) is among the top world-models covering β of its posterior) will predict that taking over the world will make the human operators unsatisfied, which puts an upper bound on the pessimistic value of such a policy. Better to play it safe, and take actions which satisfy the human operators and cause them to press the right keys accordingly. (With the help of mentor demonstrations, it will have seen enough for all its top models to be approximately accurate about the effects of normal actions.) Intuitively, I think much lower values of β are required to get this sort of behavior than the value of β that would be required to get a very small ε for the event “everyone is probably about to be dead” in the Probably Respecting Precedent Theorem.

This agent is definitely not tractable. I mentioned that when β is large enough to make it safe, it might never learn to be particularly superhuman. It is also possible that we never manage to come up with heuristic approximations to this agent (for the sake of tractability) without ruining the safety results. (The most powerful “heuristic approximations” will probably look like “applying the state of the art in AI in place of proper Bayesian reasoning and expectimax planning.”) These are the main reasons I see for pessimism about pessimism.

One thought I’ve had on tractable approximations: I imagine the minimum over world-models being approximated by an adversary, who takes the agent’s plan and searches for a simple world-model that retrodicts past observations well, but makes the plan look dumb.
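As a very rough sketch of that idea (entirely speculative on my part; the scoring rule and names are not from the paper), the adversary could score candidate world-models by how well they fit past data minus how good they make the plan look:

```python
def adversarial_value(plan, candidate_models, past_data, value, log_likelihood,
                      fit_weight=1.0):
    """Speculative stand-in for the minimum over top world-models.

    The adversary favours candidate models that retrodict past observations
    well (high log-likelihood) while making the plan look bad (low value);
    the plan is then evaluated under the adversary's chosen model.
    """
    def adversary_score(model):
        return fit_weight * log_likelihood(model, past_data) - value(plan, model)

    worst_model = max(candidate_models, key=adversary_score)
    return value(plan, worst_model)
```

How `fit_weight` trades off fit against plan-value is exactly the kind of knob that could quietly undermine the safety results, which is the worry about heuristic approximations mentioned above.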

Just a warning: the paper is dense.

“I was sweating blood” — Marcus Hutter

Some kind, kind people who read drafts and who were not familiar with the notation said it took them 2-3 hours (excluding proofs and appendices). Sorry about that. I’ve tried to present the agent and the results as formally as I can here without lots of equations with Greek letters and subscripts. Going a level deeper may take some time.

Thanks to Marcus Hutter, Jan Leike, Mike Osborne, Ryan Carey, Chris van Merwijk, and Lewis Hammond for reading drafts. Thanks to FHI for sponsorship. We’ve just submitted this to COLT. EDIT: It's been accepted! We’ll post it to ArXiv after we’ve gotten comments from reviewers. If you’d like to cite this in a paper in the meantime, you can cite it as an unpublished manuscript; if you’re citing it elsewhere, you can link to this page if you like. Hopefully, theorem numbers will stay the same in the final version, but I can’t promise that. I might not be super-responsive to comments here. EDIT: I have more time to respond to comments now.

2 comments:

Planned summary for the Alignment Newsletter:

The argument for AI risk typically involves some point at which an AI system does something unexpected and bad in a new situation that we haven't seen before (as in e.g. a treacherous turn). One way to mitigate the risk is to simply detect new situations, and ensure the AI system does something known to be safe in such situations, e.g. deferring to a human, or executing some handcoded safe baseline policy. Typical approaches involve a separate anomaly detection model. This paper considers: can we use the AI system itself to figure out when to defer to a mentor?
_The key insight is that if an AI system maintains a distribution over rewards, and "assumes the worst" about the reward in new situations, then simply by deferring to the mentor with higher probability when the mentor would get higher expected reward, it will end up deferring to the mentor in new situations._ Hence, the title: by making the agent pessimistic about unknown unknowns (new situations), we get a conservative agent that defers to its mentor in new situations.
This is formalized in an AIXI-like setting, where we have agents that can have beliefs over all computable programs, and we only consider an online learning setting where there is a single trajectory over all time (i.e. no episodes). The math is fairly dense and I didn't try to fully understand it; as a result my summary may be inaccurate. The agent maintains a belief over world models (which predict how the environment evolves and how reward is given) and mentor models (which predict what the mentor will do, where the mentor's policy can depend on the **true** world model). It considers the most likely world models, taken in order of posterior weight until their total weight is at least β (where β is a hyperparameter between 0 and 1). It computes the worst-case reward it could achieve under these world models, and the expected reward that the mentor achieves. It is more likely to defer to the mentor when the mentor's expected reward is higher (relative to its worst-case reward).
Such an agent queries the mentor finitely many times and eventually takes actions that are at least as good as the mentor's choices in those situations. In addition, for any event with some bound on its complexity, we can set things up (e.g. by having a high β) such that with high probability the agent never causes the event to occur unless the mentor has already caused the event to occur some time in the past. For example, with high probability the agent will never push the big red button in the environment, unless it has seen the mentor push the big red button in the past.

Planned opinion:

I think it is an underrated point that in some sense all we need to do to avoid x-risk is to make sure AI systems don't do crazy high-impact things in new situations, and that risk aversion is one way to get such an agent. This is also how <@Inverse Reward Design@> gets its safety properties: when faced with a completely new "lava" tile that the agent has never seen before, the paper's technique only infers that it should be _uncertain_ about the tile's reward. However, the _expected_ reward is still 0, and to get the agent to actually avoid the lava you need to use risk-averse planning.
The case for pessimism is similar to the case for impact measures, and similar critiques apply: it is not clear that we can get a value-agnostic method that is both sufficiently safe to rule out all catastrophes, and sufficiently useful to replace other AI techniques. The author himself points out that if we set β high enough to be confident it is safe, the resulting agent may end up always deferring to the mentor, and so not actually be of any use. Nonetheless, I think it's valuable to point out these ways that seem to confer some nice properties on our agents, even if they can't be pushed to the extremes for fear of making the agents useless.

It's an interesting sort of conservative that cannot be roused to (extreme) action by extreme circumstance.