Ajeya Cotra

Comments

Anna and Oliver discuss Children and X-Risk

Belatedly, I did a bit of outside-view research on the time and monetary costs of kids (though a couple of parent friends kindly sanity-checked some of it). I presented it at my house's internal conference, but some folks suggested I share it more broadly in case it's helpful to others: here is the slide deck. The assumptions are Bay Area, upper-middle-class parents (e.g. both programmers or something like that) who both want to keep their careers and are therefore willing to pay a lot for childcare.

Notes from "Don't Shoot the Dog"

Thanks for writing this up! I appreciate the personal anecdotes too. Curious if you or Jeff have any tips and tricks for maintaining the patience/discipline required to pull off this kind of parenting (for other readers, I enjoyed some of Jeff's thoughts on predictable parenting here). Intuitively, it seems to me like this is a reason the value-add from paying for childcare might be higher than you'd naively think: not only do you directly save time, you might also have more emotional reserves to be consistent and disciplined if you get more breaks.

The case for aligning narrowly superhuman models

I'm personally skeptical that this work is better-optimized for improving AI capabilities than other work being done in industry. In general, I'm skeptical of the perspective that the work the rationalist/EA/alignment crowd does Pareto-dominates the other work going on -- that is, that it's significantly better for both alignment and capabilities than standard work, such that others are simply making a mistake by not working on it regardless of what their goals are or how much they care about alignment. I think sometimes this could be the case, but I wouldn't bet on it being a large effect. In general, I expect work optimized to help with alignment to be worse on average at pushing forward capabilities, and vice versa.

The case for aligning narrowly superhuman models

In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn't trying here to make something different sound like it's about practice. I don't think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I'd be similarly excited about or maybe more excited about.

In my mind, the "better than evaluators" part is kind of self-evidently intriguing for the basic reason I said in the post (it's not obvious how to do it, and it's analogous to the broad, outside-view conception of the long-run challenge, which can be described in one sentence/phrase and isn't strongly tied to a particular theoretical framing):

I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.

In response to the draft, a lot of people were pushing in the direction I think you were maybe gesturing at (?) -- to make this more specific to "knowing everything the model knows" or "ascription universality"; the section "Why not focus on testing a long-term solution?" was written in response to Evan Hubinger and others. I think I'm still not convinced that's the right way to go.

The case for aligning narrowly superhuman models

I don't feel confident enough in the frame of "inaccessible information" to say that the whole agenda is about it. It feels like a fit for "advice", but not a fit for "writing stories" or "solving programming puzzles" (at least not an intuitive fit -- you could frame it as "the model has inaccessible information about [story-writing, programming]" but it feels more awkward to me). I do agree it's about "strongly suspecting it has the potential to do better than humans" rather than about "already being better than humans." Basically, it's about trying to find areas where lackluster performance seems to mostly be about "misalignment" rather than "capabilities" (recognizing those are both fuzzy terms).

The case for aligning narrowly superhuman models

Yeah, you're definitely pointing at an important way the framing is awkward. I think the real thing I want to say is "Try to use some humans to align a model in a domain where the model is better than the humans at the task", and it'd be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there's some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse.

I don't want to just call it "align superhuman AI today" because people will be like "What? We don't have that", but at the same time I don't want to drop "superhuman" from the name because that's the main reason it feels like "practicing what we eventually want to do." I considered "partially superhuman", but "narrowly" won out.

I'm definitely in the market for a better term here.

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

The conceptual work I was gesturing at here is more Paul's work, since MIRI's work (afaik) is not really neural net-focused. It's true that Paul's work also doesn't assume a literal worst case; it's a very fuzzy concept I'm gesturing at here. It's more like, Paul's research process is to a) come up with some procedure, b) try to think of any "plausible" set of empirical outcomes that cause the procedure to fail, and c) modify the procedure to try to address that case. (The slipperiness comes in at the definition of "plausible" here, but the basic spirit of it is to "solve for every case" in the way theoretical CS typically aims to do in algorithm design, rather than "solve for the case we'll in fact encounter.")

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

This was a really helpful articulation, thanks! I like "frankness", "forthrightness", "openness", etc. (These are all terms I was brainstorming to get at the "ascription universality" concept at one point.)

The case for aligning narrowly superhuman models

The case in my mind for preferring to elicit and solve problems at scale rather than in toy demos (when that's possible) is pretty broad and outside-view, but I'd nonetheless bet on it: I think a general bias toward wanting to "practice something as close to the real thing as possible" is likely to be productive. In terms of the more specific benefits I laid out in this section, I think that toy demos are less likely to have the first and second benefits ("Practical know-how and infrastructure" and "Better AI situation in the run-up to superintelligence"), and I think they may miss some ways to get the third benefit ("Discovering or verifying a long-term solution") because some viable long-term solutions may depend on some details about how large models tend to behave.

I do agree that working with larger models is more expensive and time-consuming, and sometimes it makes sense to work in a toy environment instead, but other things being equal I think it's more likely that demos done at scale will continue to work for superintelligent systems, so it's exciting that this is starting to become practical.

The case for aligning narrowly superhuman models

Yeah, in the context of a larger alignment scheme, it's assuming in particular that the problem of answering the question "How good is the AI's proposed action?" will factor down into sub-questions of manageable size.
