by jsd
1 min read · 13th Apr 2022 · 8 comments

I've been thinking about these two quotes from AXRP a lot lately:

From Richard Ngo's interview:

Richard Ngo: Probably the main answer is just the thing I was saying before about how we want to be clear about where the work is being done in a specific alignment proposal. And it seems important to think about having something that doesn’t just shuffle the optimization pressure around, but really gives us some deeper reason to think that the problem is being solved. One example is when it comes to Paul Christiano’s work on amplification, I think one core insight that’s doing a lot of the work is that imitation can be very powerful without being equivalently dangerous. So yeah, this idea that instead of optimizing for a target, you can just optimize to be similar to humans, and that might still get you a very long way. And then another related insight that makes amplification promising is the idea that decomposing tasks can leverage human abilities in a powerful way.

Richard Ngo: Now, I don’t think that those are anywhere near complete ways of addressing the problem, but they gesture towards where the work is being done. Whereas for some other proposals, I don’t think there’s an equivalent story about what’s the deeper idea or principle that’s allowing the work to be done to solve this difficult problem.

From Paul Christiano's interview:

Paul Christiano: And it’s nice to have a problem statement which is entirely external to the algorithm. If you want to just say, “here’s the assumption we’re making now; I want to solve that problem”, it’s great to have an assumption on the environment be your assumption. There’s some risk if you say, “Oh, our assumption is going to be that the agent’s going to internalize whatever objective we use to train it.” The definition of that assumption is stated in terms of, it’s kind of like helping yourself to some sort of magical ingredient. And, if you optimize for solving that problem, you’re going to push into a part of the space where that magical ingredient was doing a really large part of the work. Which I think is a much more dangerous dynamic. If the assumption is just on the environment, in some sense, you’re limited in how much of that you can do. You have to solve the remaining part of the problem you didn’t assume away. And I’m really scared of sub-problems which just assume that some part of the algorithm will work well, because I think you often just end up pushing an inordinate amount of the difficulty into that step.

Great quotes. Posting podcast excerpts is underappreciated. Happy to read more of them.

A few ways that StyleGAN is interesting for alignment and interpretability work:

  • It was much easier to interpret than previous generative models, without trading off image quality.
  • It seems like an even better example of "capturing natural abstractions" than GAN Dissection, which Wentworth mentions in Alignment By Default.
    • First, because it's easier to map abstractions to StyleSpace directions than to go through the procedure in GAN Dissection.
    • Second, the architecture has 2 separate ways of generating diverse data: changing the style vectors, or adding noise. This captures the distinction between "natural abstraction" and "information that's irrelevant at a distance".
  • Some interesting work was built on top of StyleGAN:

However, StyleGAN is not super relevant in other ways:

  • It generally works only on non-diverse data: you train StyleGAN to generate images of faces, or to generate images of churches. The space of possible faces is much smaller than e.g. the space of images that could make it into ImageNet. People recently released StyleGAN-XL, which is supposed to work well on diverse datasets such as ImageNet. I haven't played around with it yet.
  • It's an image generation model. I'm more interested in language models, which work pretty differently. It's not obvious how to extend StyleGAN's architecture to build competitive yet interpretable language models. This paper tried something like this but didn't seem super convincing (I've mostly skimmed it so far).

When talking about AI risk from LLM-like models, when using the word "AI" please make it clear whether you are referring to:

  • A model
  • An instance of a model, given a prompt

For example, there's a big difference between claiming that a model is goal-directed and claiming that a particular instance of a model given a prompt is goal-directed.

I think this distinction is obvious and important but too rarely made explicit.
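To make the distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: `Model`, `Instance`, `toy_respond`, and the string-matching "goal-directed" check are toy stand-ins, not a real LLM API or a real test for goal-directedness. It only shows that a claim about one instance and a claim about the model (read here as "every instance considered") can come apart.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Model:
    """Weights only, no prompt. Hypothetical stand-in for an LLM."""
    name: str
    respond: Callable[[str], str]  # maps a prompt to a completion


@dataclass(frozen=True)
class Instance:
    """A model together with one specific prompt."""
    model: Model
    prompt: str

    def act(self) -> str:
        return self.model.respond(self.prompt)


def toy_respond(prompt: str) -> str:
    # Toy behavior (an assumption for illustration): output depends on the prompt.
    return "pursue goal" if "agent" in prompt.lower() else "answer question"


def is_goal_directed(instance: Instance) -> bool:
    # Toy proxy only; real goal-directedness is not a string check.
    return instance.act() == "pursue goal"


def holds_for_model(model: Model, prompts: Iterable[str],
                    prop: Callable[[Instance], bool]) -> bool:
    # One possible reading of a model-level claim:
    # the property holds for every instance considered.
    return all(prop(Instance(model, p)) for p in prompts)


model = Model("toy", toy_respond)
prompts = ["You are an agent. Maximize reward.", "What is 2 + 2?"]
per_instance = [is_goal_directed(Instance(model, p)) for p in prompts]
model_level = holds_for_model(model, prompts, is_goal_directed)
```

In this toy setup the first instance satisfies the property and the second does not, so the instance-level claim is true for one prompt while the model-level claim is false.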

Can you give a few examples where it's both confusing and important?  Almost all concrete experiments and examples I've seen are the latter (an instance with a context and prompt(s)), because that's really the point of interaction and existence for LLMs.  I'm not even sure what it would mean for a non-instantiated model without input to do anything.

I'm not even sure what it would mean for a non-instantiated model without input to do anything.

For goal-directedness, I'd interpret it as "all instances are goal-directed and share the same goal".

As an example, I wish "Without specific countermeasures" had made the distinction more explicit.

More generally, when discussing whether a model is scheming, I think it's useful to keep in mind worlds where some instances of the model scheme while others don't.

I don't think I've seen any research about cross-instance similarity, or even measuring the impact of instance-differences (including context and prompts) on strategic/goal-oriented actions.  It's an interesting question, but IMO not as interesting as "if instances are created/selected for their ability to make and execute long-term plans, how do those instances behave".

How would you say humanity does on this distinction?  When we talk about planning and goals, how often are we talking about "all humans", vs "representative instances"?

Mostly I care about this because if there's a small number of instances that are trying to take over, but a lot of equally powerful instances that are trying to help you, this makes a big difference. My best guess is that we'll be in roughly this situation for "near-human-level" systems.

I don't think I've seen any research about cross-instance similarity

I think mode-collapse (update) is sort of an example.

How would you say humanity does on this distinction?  When we talk about planning and goals, how often are we talking about "all humans", vs "representative instances"?

It's not obvious how to make the analogy with humanity work in this case - maybe comparing the behavior of clones of the same person put in different situations?