I am a Research Fellow at MIRI working on inner alignment for amplification.

See: "What I'll be doing at MIRI."

Pronouns: he/him/his

evhub's Comments

The Epistemology of AI risk

The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.

I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.

It seems even harder to do productive work, since I'm skeptical of very short timelines.

Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety? Surely there are things we can be doing now to gain insight, build research/organizational capacity, etc. that will at least help somewhat, no? (And it seems to me like “probably helps somewhat” is enough when it comes to existential risk.)

Have epistemic conditions always been this bad?

First, as someone who just (class of 2019) graduated college at a very liberal, highly regarded, private U.S. institution, the description above definitely does not match my experience. In my experience, I found that dissenting opinions and avid discussion were highly encouraged. That being said, I suspect Mudd may be particularly good on that axis due to factors such as being entirely STEM-focused (also Debra Mashek was one of my professors).

Second, I think it is worth pointing out that there are definitely instances where, at least in my opinion, “canceling” is a valid tactic. Deplatforming violent rhetoric (e.g. Nazism, Holocaust denial, etc.) comes to mind as an obvious example.

Third, that being said, I do think there is a real problem along the lines of what you're pointing at. For example, one thing I saw recently was what's been happening to Natalie Wynn, a YouTuber who goes by the name “ContraPoints.” She's a very popular leftist YouTuber who mainly talks about various left-wing social issues, particularly transgender issues (she herself is transgender). In one of her recent videos, she cast a transgender man named Buck Angel as a voice actor for part of it, and people (mostly on Twitter) got extremely upset at her because Buck Angel had at one point previously said something that maybe possibly could be interpreted as anti-non-binary-people. I think that Natalie's recent video responding to her “canceling” is probably the best analysis of the whole phenomenon that I've seen, and aligns pretty well with my views on the topic, though it's quite long.

There are a lot of things about Natalie's canceling that give me hope, though. First, it seemed like her canceling was highly concentrated on Twitter, which makes a lot of sense to me—I tend to think that it's almost impossible to have good discourse in any sort of combative/argumentative setting, especially online, and especially when everyone is limited to tiny tweets, which lend themselves particularly well to snarky, quippy one-liners without any real substance.

Second, it was really only a fringe group of people canceling her—it's just that the people doing it were very loud, which again strikes me as exactly the sort of thing that is highly exacerbated by the internet, and especially by Twitter.

Third, I think there's a real movement on the left toward rejecting this sort of thing—Natalie is a good example of a very public leftist strongly rejecting "cancel culture," and I met lots of other die-hard leftists in college who think similarly. There are a lot of really smart people on the left, and I think it's quite reasonable to expect that this will broadly get better over time—especially if people move to better forms of online discourse than Twitter (or Facebook, which I also think is pretty bad). YouTube and Reddit, though, are mainstream platforms that I think produce significantly better discourse than Twitter, so I do think there's hope there.

Exploring safe exploration

Hey Aray!

Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

I agree with this. I jumped the gun a bit by not really making the distinction clear in my earlier post "Safe exploration and corrigibility," which I think made that post a bit confusing, so I went heavy on the distinction here—perhaps more heavily than I actually think is warranted.

The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in "Safe exploration and corrigibility": by default you only end up with capability exploration, not objective exploration—that is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that doing so helps its current goal, not to the extent that it helps you change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in a way that helps you collect data on what its goal is and how to change it.

Malign generalization without internal search

I don't feel like you're really understanding what I'm trying to say here. I'm happy to chat with you about this more over video call or something if you're interested.

Malign generalization without internal search

I think that piecewise objectives are quite reasonable and natural—and I don't think they'll make transparency that much harder. I don't think there's any reason that we should expect objectives to be continuous in some nice way, so I fully expect you'll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still quite simple such that you should be able to use interpretability tools to understand it pretty effectively—a switch statement is not that complicated or hard to interpret—with most of the real hard work still primarily being done in the optimization.

I do think there are a lot of possible ways in which the interpretability for mesa-optimizers story could break down—which is why I'm still pretty uncertain about it—but I don't think that a switch-case agent is such an example. Probably the case that I'm most concerned about right now is if you get an agent which has an objective which changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult—and is the sort of thing that humans seem to do, which suggests that it's the sort of thing we might see in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent's objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent's own optimization—a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you're still using search to get your robust capabilities.
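To make the "switch-case objective" picture concrete, here is a toy sketch (all names and the environment details are hypothetical, purely for illustration): the objective is a simple, fixed switch on the observation—easy to read off with interpretability tools—while the robust capabilities come from a generic search over actions that is shared across all branches.

```python
# Toy sketch of a switch-case objective. The "switch statement" picking the
# goal is simple and fixed, while the hard work is done by a generic search.

def piecewise_objective(observation, outcome):
    """Fixed switch-style objective: each branch scores outcomes differently."""
    if observation["terrain"] == "lava":
        return -outcome["steps_on_lava"]      # branch 1: avoid lava
    elif observation["terrain"] == "maze":
        return -outcome["distance_to_exit"]   # branch 2: get to the exit
    else:
        return outcome["reward"]              # default branch

def search(observation, candidate_actions, simulate):
    """Generic optimization: pick the action whose simulated outcome scores
    best under whichever branch of the objective is currently active."""
    return max(
        candidate_actions,
        key=lambda a: piecewise_objective(observation, simulate(observation, a)),
    )
```

The point of the sketch is that interpreting `piecewise_objective` only requires understanding a few fixed heuristics, whereas the concerning case discussed above would be if the branching condition itself depended on the output of `search`.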

Malign generalization without internal search

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

I feel like what you're describing here is just optimization where the objective is determined by a switch statement, which certainly seems quite plausible to me but also pretty neatly fits into the mesa-optimization framework.

More generally, while I certainly buy that you can produce simple examples of things that look kinda like capability generalization without objective generalization on environments like the lunar lander or my maze example, it still seems to me like you need optimization to actually get capabilities that are robust enough to pose a serious risk, though I remain pretty uncertain about that.

Outer alignment and imitative amplification

Is "outer alignment" meant to be applicable in the general case?

I'm not exactly sure what you're asking here.

Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that, for example, if there is a security hole in the hardware or software environment and the model takes advantage of it to hack its loss/reward, we'd call that an "outer alignment failure"?

I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn't consider an entire training process to be outer aligned—rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process—namely the objective/loss function/environment in the case of outer alignment and the inductive biases/architecture/optimization procedure in the case of inner alignment (though note that this is a more general definition than the one used in “Risks from Learned Optimization,” as it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure).

So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?

Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of being a fully general definition.

Also, related to Ofer's comment, can you clarify whether it's intended for this definition that the loss function only looks at the model's input/output behavior, or can it also take into account other information about the model?

I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function.

I'm also curious whether you have HBO or LBO in mind for this post.

I generally think you need HBO and am skeptical that LBO can actually do very much.

Outer alignment and imitative amplification

I think I'm quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum—HCH—still has benign failures, but I nevertheless want to argue that it's aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that excludes benign failures, since otherwise you can't really consider HCH aligned (and thus can't consider imitative amplification outer aligned at optimum).

Exploring safe exploration

Like I said in the post, I'm skeptical that “preventing the agent from making an accidental mistake” is actually a meaningful concept (or at least, it's a concept with many possible conflicting definitions), so I'm not sure how to give an example of it.

Exploring safe exploration

I definitely was not arguing that. I was arguing that safe exploration is currently defined in ML as preventing the agent from making an accidental mistake, and that we should really not be having terminology collisions with ML. (I may have left that second part implicit.)

Ah, I see—thanks for the correction. I changed “best” to “current.”

I assume that the difference you see is that you could try to make across-episode exploration less detrimental from the agent's perspective

No, that's not what I was saying. When I said “reward acquisition” I meant the actual reward function (that is, the base objective).


That being said, it's a little bit tricky in some of these safe exploration setups to draw the line between what's part of the base objective and what's not. For example, I would generally include the constraints in constrained optimization setups as just being part of the base objective, only specified slightly differently. In that context, constrained optimization is less of a safe exploration technique and more of a reward-engineering-y/outer alignment sort of thing, though it also has a safe exploration component to the extent that it constrains across-episode exploration.

Note that when across-episode exploration is learned, the distinction between safe exploration and outer alignment becomes even more muddled, since then all the other terms in the loss will implicitly serve to check the across-episode exploration term, as the agent has to figure out how to trade off between them.[1]

  1. This is another one of the points I was trying to make in “Safe exploration and corrigibility” but didn't do a great job of conveying properly. ↩︎
