Epistemic status: possibly confused thoughts from a research brainstorm

Suppose that an agent has a world-model that predicts the effects of any behavior on the environment, good enough to capture the "important" effects of a behavior, and that the agent has a near-optimal policy as measured by a base optimizer.

It is well known that a utility function over behaviors/policies can describe any policy. We can decompose the agent's utility function as u(b) = v(w(b)) + g(b), where w(b) is the predicted parts of the world-trajectory that the base optimizer might possibly care about, and b is everything about a behavior, including things about the behavior that don't affect the future world in expectation (e.g. twitching, or random thermal fluctuations). If we define v such that it explains the maximum variance possible in u, or something like that, then this representation is a unique decomposition of any possible utility function, and thus can describe any possible policy.
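One way to make this concrete (a toy sketch; I'm writing the decomposition as u(b) = v(w(b)) + g(b) and choosing v as the conditional mean of u given w(b), which is the choice that explains maximum variance — the behavior names and values below are made up):

```python
from collections import defaultdict

behaviors = ["twitch_left", "twitch_right", "press_lever", "press_lever_hard"]

# w maps behaviors to the world-relevant features the base optimizer
# might care about (here: whether the lever gets pressed).
def w(b):
    return "lever_pressed" if b.startswith("press") else "nothing_happens"

# Some arbitrary utility over behaviors.
u = {"twitch_left": 0.1, "twitch_right": 0.3,
     "press_lever": 1.0, "press_lever_hard": 0.8}

# v(w0) = mean of u over behaviors with the same world-effect,
# i.e. the part of u explained by consequences.
groups = defaultdict(list)
for b in behaviors:
    groups[w(b)].append(u[b])
v = {w0: sum(us) / len(us) for w0, us in groups.items()}

# g is the residual: the part of u that does not route through the world.
g = {b: u[b] - v[w(b)] for b in behaviors}

for b in behaviors:
    print(b, round(v[w(b)], 2), round(g[b], 2))
```

With this choice of v, the residual g averages to zero within each group of world-equivalent behaviors, which is one way to cash out "unique decomposition".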

We can think of v as the agent's proxy goals and g as the "non-consequentialist" component of the agent's utility function. If goal-directed behavior is selected for, then we might expect g to be small in some sense.

But there are many utility functions u that achieve high reward on the training set but do not have small g:

  • Only the argmax of u for each input matters for the policy it induces, so u can be literally anything on non-maximal actions (almost anything, even with regularization).
  • If u is bounded below, then the policy can give any output on a small percentage of training examples and still get close to minimum loss.
  • If u has many points close to the maximum, or a broad maximum, then the policy can choose any of them and still get close to minimum loss.
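The first bullet can be illustrated directly: two utility functions that agree only on which action is best induce exactly the same greedy policy (a toy sketch; the action set and values are made up):

```python
# Two utility functions over actions that agree only on the argmax.
# A greedy policy cannot distinguish them, so the values u assigns to
# non-maximal actions are unconstrained by behavior.

actions = ["a", "b", "c"]
u1 = {"a": 1.0, "b": 0.5, "c": 0.2}
u2 = {"a": 1.0, "b": -100.0, "c": 0.99}  # wild off-argmax values

def greedy_policy(u):
    return max(actions, key=lambda act: u[act])

print(greedy_policy(u1), greedy_policy(u2))  # both pick "a"
```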

So the only sense in which g is small is something contrived like "probably approximately small, but only on maxima of u". Applying some regularization condition to u, or saying that the agent maximizes expected utility, doesn't seem to help.

Example: Say we train a system to maximize paperclips. It could have the proxy goal of maximizing the number of paperclips it sees, but also refuse to ever take any action starting with the letter Q, and also insist on standing around and watching every solar eclipse.
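The paperclip example can be phrased as a policy whose behavior is near-optimal on-distribution while the implied utility function has a large non-consequentialist component (a hypothetical sketch; the action names and proxy are made up):

```python
# A policy that maximizes a paperclip proxy but vetoes any action whose
# name starts with "q".  When an equivalent non-"q" action is available,
# the quirk costs no reward -- yet it is pure g, a non-consequentialist
# component of the implied utility function.

def paperclips_made(action):
    return 1 if action.startswith("make") else 0  # toy proxy

def quirky_policy(available_actions):
    allowed = [a for a in available_actions if not a.startswith("q")]
    return max(allowed, key=paperclips_made)

print(quirky_policy(["quietly_make_clip", "make_clip", "idle"]))  # -> make_clip
```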



(Haven't read the OP thoroughly so sorry if not relevant; just wanted to mention...)

If any part of the network at any point during training corresponds to an agent that "cares" about an environment that includes our world then that part can "take over" the rest of the network via gradient hacking.

This seems like a weird claim - if there are multiple objectives within the agent, why would the one that cares about the external world decisively “win” any gradient-hacking-fight?

Agents that don't care about influencing our world don't care about influencing the future weights of the network.

I see, so you’re comparing a purely myopic vs. a long-term optimizing agent; in that case I probably agree. But if the myopic agent cares even about later parts of the episode, and gradients are updated in between, this fails, right?

I wouldn't use the myopic vs. long-term framing here. Suppose a model is trained to play chess via RL, and there are no inner alignment problems. The trained model corresponds to a non-myopic agent (a chess game can last for many time steps). But the environment that the agent "cares" about is an abstract environment that corresponds to a simple chess game (an environment with a bounded number of states). The agent doesn't care about our world. Even if some potential activation values in the network correspond to hacking the computer that runs the model and preventing it from being turned off, the agent is not interested in doing that. The computer that runs the agent is not part of the agent's environment.

This comes from a research brainstorm. I think people have had this thought before, but I couldn't find it anywhere on LW/AF.

All of this is predicated on the agent having unlimited and free access to computation.

This is a standard assumption, but is worth highlighting.

I don't think I make this assumption. The biggest flaw in this post is that some of the definitions don't quite make sense, and I don't think assuming infinite compute helps this.

I don't think I make this assumption.

You don't explicitly; it's implicit in the following:

It is well known that a utility function over behaviors/policies is sufficient to describe any policy.

The VNM axioms do not necessarily apply to bounded agents. A bounded agent can rationally have preferences of the form A ~[1] B and B ~ C but A ≻[2] C, for instance[3]. You cannot describe this with a straight utility function.

  1. ^ is indifferent to
  2. ^ is preferred over
  3. ^ See https://www.lesswrong.com/posts/AYSmTsRBchTdXFacS/on-expected-utility-part-3-vnm-separability-and-more?commentId=5DgQhNfzivzSdMf9o, which is similar but which does not cover this particular case. That being said, the same technique should 'work' here.

I agree that a bounded agent can be VNM-incoherent and not have a utility function over bettable outcomes. Here I'm saying you can infer a utility function over behaviors for *any* agent with *any* behavior. You can trivially do this by setting the utility of every action the agent actually takes to 1, and the utility of every action the agent doesn't take to 0. For example, for twitch-bot, the utility at each step is 1 if it twitches and 0 if it doesn't.
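The indicator construction can be written out for a deterministic agent (a minimal sketch; the agent and state names are made up):

```python
# Indicator utility that rationalizes any deterministic agent:
# u(state, action) = 1 if the action is the one the agent takes, else 0.

def twitch_bot(state):
    return "twitch"  # twitch-bot twitches regardless of the state

def induced_utility(agent):
    def u(state, action):
        return 1 if action == agent(state) else 0
    return u

u = induced_utility(twitch_bot)
print(u("s0", "twitch"), u("s0", "stand_still"))  # 1 0
```

By construction, the agent's actual behavior is the unique maximizer of this utility at every step.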

That's a very different definition of utility function than I am used to. Interesting.

What would the utility function over behaviors for an agent that chose randomly at every timestep look like?

My guess is if the randomness is pseudorandom, then 1 for the behavior it chose and 0 for everything else; if the randomness is true randomness and we use Boltzmann rationality then all behaviors are equal utility; if the randomness is true and the agent is actually maximizing, then the abstraction breaks down?

I want to clarify that this is not a particularly useful type of utility function, and the post was a mostly-failed attempt to make it useful.

I want to clarify that this is not a particularly useful type of utility function, and the post was a mostly-failed attempt to make it useful.

Fair! Here's another[1] issue I think, now that I've realized you were talking about utility functions over behaviours, at least if you allow 'true' randomness.

Consider a slight variant of matching pennies: if an agent doesn't make a choice, their choice is made randomly for them.

Now consider the following agents:

  1. Twitchbot.
  2. An agent that always plays (truly) randomly.
  3. An agent that always plays the best Nash equilibrium, tiebroken by the choice that results in them making the most decisions. (And then tiebroken arbitrarily from there, not that it matters in this case.)

These all end up with infinite random sequences of plays, ~50% heads and ~50% tails[2][3][4]. And any infinite random (50%) sequence of plays could be a plausible sequence of plays for any of these agents. And yet these agents 'should' have different decompositions into v and g.

  1. ^ Maybe. Or maybe I was misconstruing what you meant by 'if the randomness is true and the agent is actually maximizing, then the abstraction breaks down' and this is the same issue you recognized.
  2. ^ Twitchbot doesn't decide, so its decision is made randomly for it, so it's 50/50.
  3. ^ The random agent decides randomly, so it's 50/50.
  4. ^ 'The' best Nash equilibrium is any combination of choosing 50/50 randomly, and/or not playing. The tiebreak means the best combination is playing 50/50.
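The three agents can be simulated (a sketch; "true randomness" is modeled with a PRNG, and "not choosing" as returning None, which makes the environment flip a fair coin for the agent):

```python
import random

random.seed(0)

def twitchbot():
    return None  # never makes a choice

def random_agent():
    return random.choice(["H", "T"])  # plays uniformly at random

def nash_agent():
    # The best Nash equilibrium is uniform play; the tiebreak makes the
    # agent flip its own coin rather than abstain.
    return random.choice(["H", "T"])

def heads_freq(agent, n=100_000):
    heads = 0
    for _ in range(n):
        choice = agent() or random.choice(["H", "T"])  # forced flip if None
        heads += choice == "H"
    return heads / n

for agent in (twitchbot, random_agent, nash_agent):
    print(agent.__name__, round(heads_freq(agent), 2))
```

All three produce empirically indistinguishable ~50/50 play, which is the point: identical behavior distributions, but intuitively different decompositions.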

w is a function, right? If so, what's its type signature?

As written w takes behaviors to "properties about world-trajectories that the base optimizer might care about" as Wei Dai says here. If there is uncertainty, I think w could return distributions over such world-trajectories, and the argument would still work.

Ah I see, and just to make sure I'm not going crazy, you've edited the post now to reflect this?
