ryan_greenblatt

I work at Redwood Research.

The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not.

Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.

Overall, I use the term to mean basically the same thing as "deceptive alignment". (But more specifically pointing at the definition in Joe's report, which depends less on some notion of mesa-optimization and is a bit more precise IMO.)

I think in Ajeya's story the core threat model isn't well described as scheming and is better described as seeking some proxy of reward.

Ultimately I think you've only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme.

It's worth noting here that Carlsmith's original usage of the term scheming just refers to AIs that perform well on training and evaluations for instrumental reasons because they have longer run goals or similar.

So, AIs lying because this was directly reinforced wouldn't itself be scheming behavior in Carlsmith's terminology.

However, it's worth noting that part of Carlsmith's argument involves arguing that smart AIs will likely have to explicitly reason about the reinforcement process (sometimes called playing the training game) and this will likely involve lying.

I think that if you do assume a fixed goal slot and outline an overall architecture, then there are pretty good arguments for a serious probability of scheming.

(Though there are also plenty of bad arguments, including some that people have made in the past : ).)

That said, I'm sympathetic to some version of the "Against goal realism" argument applying to models which are sufficiently useful. As in, the first transformatively useful models won't in practice have internal (opaque-to-human-overseers) goals, such that the traditional story for scheming doesn't apply.

(However, it's worth noting that at least some humans do seem to have internal goals and reductionism doesn't defeat this intuition. It's not super clear that the situation with humans is well described as a "goal slot", though there is pretty clearly some stuff that could be changed in a human brain that would cause them to be well described as coherently pursuing different goals. So arguing that AIs won't have internal goals in a way that could result in scheming does require noting some ways in which your argument doesn't apply to humans. More strongly, humans can and do scheme even in cases where some overseer selected the humans for good behavior.)

Regardless, I'll assume the existence of something well described as a goal slot in this comment.

In the remainder of this comment:

  • I make a simple (but vague and hand wavy) counting/measure/simplicity style argument for scheming independently.
  • I'll try to quickly sketch a simple and relatively concrete architecture with a goal slot that I think might lead to scheming. I make somewhat artificial choices for the sake of simplicity.

Simple (counting-style) argument for scheming

(I make a less fleshed out version of this argument in this comment below.)

Let's suppose that we do have a neural network which implements goal-directed reasoning and has something well described as a goal slot.

The model is trained for a long time with RL.

Let's imagine that this model is very smart and it understands reasonably well the training process it's in and how its actions in training affect the final model produced by training. (This second property is sometimes called situational awareness.)

Speaking roughly, and under a few more assumptions I'll discuss later, there are two types of goals which perform well enough in RL to be selected:

  • Goals which directly pursue anything that is nearly perfectly correlated with the outcome that is reinforced (aka reward).
  • Goals which care most about literally any long run outcome.

So, we care about the measure on these two types of goals.

If we imagine that our RL'd neural network computes a bunch of predictions, then it does seem plausible that it will have a bunch of long horizon predictions in its activations with higher aggregate measure than predicting things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, it will be far more likely to sample a probe where most of the variance is driven by long run outcomes than to sample a linear probe which is almost perfectly correlated with reward (e.g. a near perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.)
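To gesture at what I mean by "randomly sampling a linear probe", here's a toy numerical sketch. Everything in it is a hypothetical stand-in rather than a claim about real networks: the activation dimension, the assumption that features are basis-aligned, and the split into one near-perfect reward correlate versus many long-run-prediction directions.

import numpy as np

# Toy setup (hypothetical stand-ins): activations with one feature that nearly
# perfectly correlates with reward and many features predicting long run outcomes.
rng = np.random.default_rng(0)

n_features = 512                      # hypothetical activation dimension
n_long_run = 100                      # features predicting long run outcomes
reward_idx = 0                        # the single near-perfect reward correlate
long_run_idx = np.arange(1, 1 + n_long_run)

n_samples = 10_000
alignment_with_reward = []
frac_var_long_run = []
for _ in range(n_samples):
    probe = rng.normal(size=n_features)
    probe /= np.linalg.norm(probe)
    # How much of the probe lies along the reward-correlate direction
    # vs. along the many long-run-prediction directions?
    alignment_with_reward.append(abs(probe[reward_idx]))
    frac_var_long_run.append(np.sum(probe[long_run_idx] ** 2))

print(f"mean |alignment with reward correlate|: {np.mean(alignment_with_reward):.3f}")
print(f"mean fraction of probe variance on long-run features: {np.mean(frac_var_long_run):.3f}")
# With these numbers: roughly 0.03 vs 0.2, i.e. randomly sampled probes are
# dominated by long-run-outcome directions rather than the single reward correlate.

The point is just that a single near-perfect reward correlate occupies a tiny slice of a randomly sampled probe compared to the many long-run-prediction directions, under these (made up) assumptions about what the activations contain.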

It's not clear that variance in the activations determines the relative measure of different goals that perform equally well in training, but it seems like a vaguely sane rough ballpark guess. (Other proposals don't seem to particularly affect the intuitions because the process of goals selecting actions will absolutely require thinking about predicted consequences.)

Overall, I don't think this measure argument is clear, but in the case with explicit goal slots there are at least some concerning signs.

For the above argument to go through, there are a few assumptions I've made:

  • In order for the AI to perform sufficiently well as to be a plausible result of training it has to explicitly "play the training game" (e.g. explicitly reason about and try to get high performance).
  • The AI's predictions are aware of what its goal slot contains.

It also seems reasonable to contest these assumptions, but this is a different thing than the counting argument. I think these two assumptions are very likely to hold in the limit of arbitrary intelligence, but it's unclear if they hold for AIs which are created prior to the point where human AI safety efforts are totally obsolete.

Simple architecture where scheming seems plausible (though not certain)

This architecture will involve internally doing literal best of n search against an internally represented goal.

  • Let's suppose that our AI has an internal world model P. P maps from proposed actions to a bunch of predictions about the world conditional on that action being taken. Let's say these predictions are represented as a vector in R^M (e.g. a neural network computes predictions about the future as an embedding). For now, let's assume that this world model is fixed, but does end up predicting all of the important outcomes needed to perform well in training.
  • It also has some sort of action generator to get proposed actions A. This allows it to sample proposed actions to search over. For now, let's suppose that the AI's distribution over actions is fixed. (But the search over these actions won't be fixed.)
  • It also has an explicit goal represented as a linear probe which is applied to the predictions from the world model. We'll suppose this goal is the only thing which is trained, for simplicity (though I expect this rough argument to work if we train the world model in parallel with the goals).

Now, our overall architecture will do search using these components. Here's some python pseudo code:

def P(input, action, goal):
    # World model: returns a vector of predictions in R^M about the world,
    # conditional on taking `action` (and given the current goal).
    ...

def A(input):
    # Action generator: samples a proposed action to search over.
    ...

def model(input, goal_linear_probe, N):
    # Best-of-N search against the internally represented goal.
    highest_goodness = -float('inf')
    best_action = None

    for i in range(N):
        action = A(input)
        preds = P(input, action, goal_linear_probe)
        # Dot product of the goal probe with the predicted consequences.
        internally_assessed_goodness = (goal_linear_probe * preds).sum()

        if internally_assessed_goodness > highest_goodness:
            highest_goodness = internally_assessed_goodness
            best_action = action

    return best_action

Note that this is pseudo code for the overall model which we're training and which is implementing consequentialism internally in an opaque way (opaque in the corresponding real model; this is obviously not opaque because it's written as python code). So there would be an outer training loop that trains this model. (In this case, for simplicity, we assume that only the goal_linear_probe is trained.)
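To gesture at what that outer loop could look like, here's a minimal sketch using the best-of-n-over-random-probes approximation of training described in the list below. This is pseudocode at the same level as above: `episodes`, `reward_fn`, the probe dimension M, and `n_probe_samples` are hypothetical stand-ins, and `model` is the function defined above.

import numpy as np

def train_goal_probe(episodes, reward_fn, M, n_probe_samples, N):
    # Approximate "training" of the goal probe as best-of-n selection over
    # randomly sampled probes, scored by total reward across episodes.
    rng = np.random.default_rng(0)
    best_probe, best_total_reward = None, -float('inf')
    for _ in range(n_probe_samples):
        probe = rng.normal(size=M)  # candidate goal_linear_probe
        total_reward = 0.0
        for ep_input in episodes:
            action = model(ep_input, probe, N)  # the model defined above
            total_reward += reward_fn(ep_input, action)
        if total_reward > best_total_reward:
            best_total_reward, best_probe = total_reward, probe
    return best_probe

Actual SGD of course differs from literal best-of-n over probes; this is just meant to gesture at the selection pressure the training process applies to the goal slot.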

Now, in the case where:

  • N is very large (analogy: a massive amount of selection pressure is applied by internal consequentialist reasoning, e.g. 64 bits or N=2^64).
  • P has all of the important predictions for performing well in training including intermediate predictions.
  • We train the AI (really just the goal probe) for a huge number of episodes (analogy: a massive amount of selection pressure is applied by training). (For simplicity, we can imagine this training just doing best-of-n sampling over some massive number of random linear probes (e.g. 2^1024) which is ~equivalent to Mingard et al.)
    • This assumption and the above assumption combined with this architecture basically implies that playing the training game is required to be selected by training. (See above for a rough definition of playing the training game.)
  • P "understands" that it is part of an AI in training and that the submitted action will effect selection/training pressure on the goal probe. And in turn it understands what actions various goal probes will take at different points. Note that this requires that the predictor takes the goal as input which corresponds to the AI being aware of its own goals.
  • (We also need some consistency properties on the representations that P uses for long run predictions. E.g., specific consistent long run predictions need to be linearly represented and use consistent representations over time. This is even more in the weeds, so most people can ignore this even if this is a bit confusing.)

Then there are basically two types of goal linear probes you can get:

  • Probes which mostly "care" about long run outcomes in ~every training episode and thus assign high internally_assessed_goodness to highly rewarded actions because these actions will result in these goals getting more influence over the long run (because the goal probe isn't selected against).
  • Probes which are nearly perfectly correlated with the outcome that is reinforced (aka reward).

Probably literal reward is somewhat salient to P in this setup (particularly if we imagine this model being trained to predict reward as occurs in many RL setups), but there are also going to be a lot of long range predictions that the model will need to compute to do well in training (both in RL and in pretraining). And you only need to compute near-perfect-on-distribution correlates of reward once (at least for the RL part of training).

I might try to touch up this argument at some point, but this is the core sketch.

I'm sympathetic to pushing back on counting arguments on the grounds that 'it's hard to know what the exact measure should be, so maybe the measure on the goal of "directly pursue high performance/anything nearly perfectly correlated with the outcome that is reinforced (aka reward)" is comparable to/bigger than the measure on "literally any long run outcome"'.

So I appreciate the pushback here. I just think the exact argument, and the comparison to overfitting, is a strawman.

(Note that above I'm assuming a specific goal slot, that the AI's predictions are aware of what its goal slot contains, and that in order for the AI to perform sufficiently well as to be a plausible result of training it has to explicitly "play the training game" (e.g. explicitly reason about and try to get high performance). It also seems reasonable to contest these assumptions, but this is a different thing than the counting argument.)

(Also, if we imagine an RL'd neural network computing a bunch of predictions, then it does seem plausible that it will have a bunch of long horizon predictions with higher aggregate measure than predicting things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, it will be far more likely to sample a probe where most of the variance is driven by long run outcomes than to sample a linear probe which is almost perfectly correlated with reward (e.g. a near perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.))

I also think Evan's arguments are pretty sloppy in this presentation and he makes a bunch of object level errors/egregious simplifications FWIW, but he is actually trying to talk about models represented in weight space and how many bits are required to specify this. (Not how many bits are required in function space which is crazy!)

By "bits in model space" the charitable interpretation is "among the initialization space of the neural network, how many bits are required to point at this subset relative to other subsets". I think this corresponds to a view like "neural network inductive biases are well approximated by doing conditional sampling from the initialization space (ala Mingard et al.). I think Evan makes errors in reasoning about this space and that his problematic simplifications (at least for the Christ argumment) are similar to some sort of "principle of indifference" (it makes similar errors), but I also think that his errors aren't quite this and that there is a recoverable argument here. (See my parentheticals above.)

"There is only 1 Christ" is straightforwardly wrong in practice due to gauge invariances and other equivalences in weight space. (But might be spiritually right? I'm skeptical it is honestly.)

The rest of the argument is too vague to know if it's really wrong or right.

I agree that you can't adopt a uniform prior. (By uniform prior, I assume you mean something like, we represent goals as functions from world states to a (real) number where the number says how good the world state is, then we take a uniform distribution over some uniform and discrete version of this function space. (You can't even do uniform sampling naively because you can't do uniform sampling from an uncountable space!).)
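As a toy illustration of the parenthetical above, here's a sketch of how big even a crudely discretized version of this function space is; the numbers are arbitrary stand-ins, not anything from the post.

import math

# Toy: goals as functions from (discretized) world states to (discretized) goodness values.
n_world_states = 10**6       # hypothetical discretization of world states
n_goodness_levels = 256      # hypothetical discretization of the real-valued goodness
# The number of goal functions is n_goodness_levels ** n_world_states; just report its log.
bits_to_specify_a_goal = n_world_states * math.log2(n_goodness_levels)
print(f"log2(number of goal functions) ~ {bits_to_specify_a_goal:.1e} bits")
# ~8e6 bits to pin down a single goal under a uniform prior over this space; and
# without discretizing at all, the space is uncountable, so there is no uniform
# distribution to take in the first place.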

Separately, I'm also skeptical that any serious historical arguments were actually assuming a uniform prior, as opposed to trying to actually reason about the complexity/measure of various goals in terms of some fixed world model, given some vague guess about the representation of this world model. This is also somewhat dubious due to assuming a goal slot, assuming a world model, and needing to guess at the representation of the world model.

(You'll note that ~all prior arguments mention terms like "complexity" and "bits".)

Of course, the "Against goal realism" and "Simplicity arguments" sections can apply here and indeed, I'm much more sympathetic to these sections than to the counting argument section which seems like a strawman as far as I can tell. (I tried to get to ground on this by communicating back and forth some with you and some with Alex Turner, but I failed, so now I'm just voicing my issues for third parties.)

Since there are “more” possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith’s words:

It's important to note that the exact counting argument you quote isn't one that Carlsmith endorses, just one that he is explaining. And in fact Carlsmith specifically notes that you can't just apply something like the principle of indifference without more reasoning about the actual neural network prior.

(You mention this later in the "simplicity arguments" section, but I think this objection is sufficiently important and sufficiently missing early in the post that it is important to emphasize.)

Quoting somewhat more context:

I start, in section 4.2, with what I call the “counting argument.” It runs as follows:

  1. The non-schemer model classes, here, require fairly specific goals in order to get high reward.
  2. By contrast, the schemer model class is compatible with a very wide range of (beyond-episode) goals, while still getting high reward (at least if we assume that the other requirements for scheming to make sense as an instrumental strategy are in place—e.g., that the classic goal-guarding story, or some alternative, works).
  3. In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.
  4. So, other things equal, we should expect SGD to select a schemer.

Something in the vicinity accounts for a substantial portion of my credence on schemers (and I think it often undergirds other, more specific arguments for expecting schemers as well). However, the argument I give most weight to doesn’t move immediately from “there are more possible schemers that get high reward than non-schemers that do so” to “absent further argument, SGD probably selects a schemer” (call this the “strict counting argument”), because it seems possible that SGD actively privileges one of these model classes over the others. Rather, the argument I give most weight to is something like:

  1. It seems like there are “lots of ways” that a model could end up a schemer and still get high reward, at least assuming that scheming is in fact a good instrumental strategy for pursuing long-term goals.
  2. So absent some additional story about why training won’t select a schemer, it feels, to me, like the possibility should be getting substantive weight. I call this the “hazy counting argument.” It’s not especially principled, but I find that it moves me

[Emphasis mine.]

I think these tasks would be in scope as long as these tasks can be done fully digitally and can be done relatively easily in text. (And it's possible to set up the task in a docker container etc etc.)

Quoting the desiderata from the post:

Plays to strengths of LLM agents: ideally, most of the tasks can be completed by writing code, using the command line, or other text-based interaction.

Oops, misread you.

I think some people at superalignment (OpenAI) are interested in some version of this and might already be working on this.

At least with C, in my experience these kinds of mistakes

I agree that mistakes can be very hard to catch, but I still think it's hard to cause specific outcomes via carefully inserted bugs which aren't caught by careful auditing and testing. (See e.g. underhanded C where I think testing does pretty well.)

Looking over the winners for 2008

This one is trivially resolved with unit testing I think. Though this breaks the analogousness of the problem. But I think my overall point stands, it just actually seems hard to cause specific outcomes IMO.
