Behavior Cloning is Miscalibrated

(Moderation note: added to the Alignment Forum from LessWrong.)

Omicron Variant Post #1: We’re F***ed, It’s Never Over

Yes, at least that's also my understanding—especially since we are still vaccinating new people, not just giving out boosters. My point is just that it seems like we shouldn't rely on updated vaccines being able to change the course of the pandemic in a major way.

Why don't our vaccines get updated to the delta spike protein?

One major reason may be that doing so wouldn't actually be very effective, due to original antigenic sin. See my comment here.

Omicron Variant Post #1: We’re F***ed, It’s Never Over

My guess is it’s actually quite a bit of evidence against any strong potential gains from updating, but weak evidence against weak gains.

So far, all talk of immune escape has mostly been exactly that, talk. That should make us wary of expecting it out of a new variant, or of updating too much from people’s concerns.

If a new variant comes along that does offer substantial escape from the vaccines, we will need to update the vaccines and get new versions out as quickly as possible. Will we be able to do that?

Technologically I have no worries. We’ll have that part solved within the week and probably within one day.

The problem here is original antigenic sin. Once you've been vaccinated against a particular variant, getting another vaccine for a very close relative generally won't do anything other than reactivate the antibodies your body developed the first time around, rather than causing you to develop new antibodies specific to the new variant. Here is a study looking at this in the context of Covid vaccines, which does suggest some things we could do to mitigate the effect, but overall I think you shouldn't expect vaccination against a specific variant to help all that much relative to just getting another wild-type booster.

So, independent of the fact that new, variant-specific vaccines not being produced is evidence that they wouldn't help much, our prior should also be on them not helping much due to original antigenic sin (except for people who have yet to be vaccinated or infected at all). Original antigenic sin is a reason that, if some new variant does come along with significant vaccine escape—but which is still similar enough to the wild-type for original antigenic sin to apply—just developing new, variant-specific vaccines might not actually be an effective way out.

Yudkowsky and Christiano discuss "Takeoff Speeds"

But after the 10^10 point, something interesting happens: the score starts growing much faster (~N).

And for some tasks, the plot looks like a hockey stick (a sudden change from ~0 to almost-human).

Seems interestingly similar to the grokking phenomenon.

A positive case for how we might succeed at prosaic AI alignment

are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

No—I'm separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you're going to get it. My step (1) above, which is what I understand that we're talking about, is just about that first piece: understanding what we're going to be shooting for when we set up our training process (and then once we know what we're shooting for we can think about how to set up a training process to actually land there). See “How do we become confident in the safety of a machine learning system?” for understanding this way of thinking about ML systems.

It's worth pointing out, however, that even when we're just focusing on that first part, it's very important that we pay attention to the total complexity that we're paying in specifying what sort of model we want, since that's going to determine a lot of how difficult it will be to actually construct a training process that produces such a model. Exactly what sort of complexity we should be paying attention to is a bit unclear, but I think that the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).

What is a "robust" Cartesian boundary, why do you think this stops an agent from trying to get more compute

Broadly speaking, I'd say that a Cartesian boundary is robust if the agent has essentially the same concept of what its action, observation, etc. is regardless of what additional true facts it learns about the world.

The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that's just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
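As a toy illustration of that claim (my own sketch, not taken from the linked write-ups): the two-step world below, its state names, and the helper `fraction_preferring_right` are all made up for the example. It enumerates every strict preference ordering over outcomes and counts how often the best outcome requires the option-preserving first action. When preferences are over final world states, most orderings favor keeping options open; when preferences are over full action histories, the two first actions are favored equally often.

```python
from fractions import Fraction
from itertools import permutations, product

ACTIONS = ("left", "right")

def final_state(history):
    """Hypothetical toy world: 'left' is a dead end, 'right' keeps options open."""
    first, second = history
    if first == "left":
        return "dead_end"       # both second actions lead to the same state
    return f"hub_{second}"      # the second action still changes the outcome

# All two-step action histories and the world states they can end in.
HISTORIES = list(product(ACTIONS, repeat=2))
STATES = sorted({final_state(h) for h in HISTORIES})

def fraction_preferring_right(outcomes, outcome_of):
    """Fraction of strict preference orderings over `outcomes` whose
    top-ranked outcome can only be reached by playing 'right' first."""
    wins = total = 0
    for ranking in permutations(outcomes):
        best = ranking[0]
        best_histories = [h for h in HISTORIES if outcome_of(h) == best]
        wins += all(h[0] == "right" for h in best_histories)
        total += 1
    return Fraction(wins, total)

# Objective defined on final world states.
state_based = fraction_preferring_right(STATES, final_state)
# Objective defined directly on action histories.
history_based = fraction_preferring_right(HISTORIES, lambda h: h)

print(state_based)    # 2/3
print(history_based)  # 1/2
```

The 2/3 versus 1/2 gap is the whole point of the toy: the asymmetry comes from how many distinct world states each first action can reach, and it vanishes once every history counts as its own outcome.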

Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn't quite enough.

A positive case for how we might succeed at prosaic AI alignment

To be clear, I agree with this as a response to what Edouard said—and I think it's a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don't think it's a response to what I'm advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn't be too surprised that it's not fully clear).

In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that's clearly not a safe model to amplify, and probably not even a safe model to train in the first place. That's because instead of getting a model that actually cares about imitating Bob or anything like that, you probably just got some pseudo-aligned mesa-optimizer with an objective that produces behavior that happens to correlate well with Bob's.

However, there does exist a purely theoretical construct—what would happen if you actually amplified Bob, not an imitation of Bob—that is very likely to be safe and superhuman (though probably still not fully competitive, but we'll put that aside for now since it doesn't seem to be the part you're most skeptical of). Thus, if you could somehow get a model that was in fact trying to imitate amplified Bob, you might be okay—except that that's not true, because most types of agents, when given the objective of imitating a safe thing, will end up with a bunch of convergent instrumental goals that break that safety. However, I claim that there are natural types of agents (that is, not too complex on a simplicity prior) that, when given the objective of imitating a safe thing, do so safely. That's what I mean by my step (1) above (and of course, even if such natural agents exist, there's still a lot you have to do to make sure you get them—that's the rest of the steps).

But since you seem most skeptical of (1), maybe I'll try to lay out my basic case for how I think we can get a theory of simple, safe imitators (including simple imitators with arbitrary levels of optimization power):

  • All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary (similarly to why an approval-directed agent wouldn't do this sort of thing—the main problem with approval-directed agents just being that human approval is not a very good thing to optimize for).
  • Specifying a robust Cartesian boundary is not that hard—you just need a good multi-level world-model, which any powerful agent should have to have anyway.
  • There are remaining issues related to superrationality, but those can be avoided by having a decision theory that ignores them (e.g. the right sort of CDT variant).
  • There are also some remaining issues related to tiling, but those can be avoided if the Cartesian boundary is structured in such a way that it excludes other agents (this is exactly the trick that LCDT pulls).

A positive case for how we might succeed at prosaic AI alignment

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you'll get deception. The solution isn't to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it's just to not imitate something that would be deceptive.

It's perhaps worth pointing out why, if we already have something non-deceptive to imitate, we don't just run that thing directly. The answer is that we can't: all of the sorts of things that might be both competitive and safe to myopically imitate are things like HCH that are too inefficient to run directly.

A positive case for how we might succeed at prosaic AI alignment

How does a "myopic optimizer" successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about?

It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.

To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

It's still myopic in the sense that it's non-deceptive, which is the only sense that we actually care about.

it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper

The safety improvement that I'm claiming is that it wouldn't be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?

A positive case for how we might succeed at prosaic AI alignment

Note that (A) and (B) are not actually that hard—e.g. LCDT solves both problems.

Your (C), in my opinion, is where all the action is, and is in fact the hardest part of this whole story—which is what I was trying to say in the original post when I said that (2) was the hard part.
