Richard_Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com

Sequences

Twitter threads
Understanding systematization
Stories
Meta-rationality
Replacing fear
Shaping safer goals
AGI safety from first principles

Comments

The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:

  1. Talking about partial hypotheses rather than full hypotheses. You can't have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
  2. Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it's inconsistent with quantum phenomena. Nevertheless, we want to say that it's very close to the truth. In general this is more of an ML approach to epistemology (we want a set of models with low combined loss on the ground truth). A toy sketch of both moves follows below.
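Here's a minimal sketch of both moves, assuming only what's described above; the hypotheses, evidence likelihoods, and models are invented purely for illustration:

```python
# Toy sketch: none of these hypotheses or numbers come from the comment above.

# Move 1: partial hypotheses aren't mutually exclusive, so their credences
# needn't sum to 1. Each credence gets its own Bayesian update, based on how
# likely the evidence is if that hypothesis is true vs. false.
def update(credence, p_evidence_if_true, p_evidence_if_false):
    numerator = credence * p_evidence_if_true
    return numerator / (numerator + (1 - credence) * p_evidence_if_false)

credences = {
    "incentives shape institutions": 0.9,
    "markets aggregate information well": 0.7,
}
# Suppose some observation is twice as likely if the second hypothesis is true:
credences["markets aggregate information well"] = update(0.7, 0.6, 0.3)

# Move 2: instead of asking which model is *true*, score each model by its
# loss against observations, and keep a set of models whose combined loss is low.
observations = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2)]
models = {
    "linear": lambda x: float(x),  # false in a strict sense, but close to the truth
    "constant": lambda x: 2.5,     # cruder model, higher loss
}
losses = {
    name: sum((f(x) - y) ** 2 for x, y in observations) / len(observations)
    for name, f in models.items()
}
print(credences)  # per-hypothesis credences, deliberately not summing to 1
print(losses)     # "closeness to truth" proxied by low loss
```

The point is just the bookkeeping: you track a credence per partial hypothesis and a loss per model, rather than a single normalized distribution over complete world-descriptions.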

Scott Garrabrant just convinced me that my notion of conservatism was conflating two things:

  1. Obligations to (slash constraints imposed by) the interests of existing agents.
  2. The assumption that large agents would grow in a bottom-up way (e.g. by merging smaller agents) rather than in a top-down way (e.g. by spinning up new subagents).

I mainly intend conservatism to mean the former.

Richard_NgoΩ330

Whose work is relevant, according to you?

Richard_NgoΩ790

If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science.

Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris's interpretability work and still not know whether or not he's an "alignment researcher". On your definition, when I read a paper by a researcher I haven't heard of, I don't know anything about whether it's alignment research or not until I stalk them on facebook and find out how socially proximal they are to the AI safety community. That doesn't seem great.

Back to Chris. Because I've talked to Chris and read other stuff by him, I'm confident that he does care about alignment. But I still don't know whether his actual motivations are more like 10% intrinsic interest in how neural networks work and 90% in alignment, or vice versa, or anything in between. (It's probably not even a meaningful thing to measure.) It does seem likely to me that the ratio of how much intrinsic interest he has in how neural networks work, to how much he cares about alignment, is significantly higher than that of most alignment researchers, and I don't think that's a coincidence—based on the history of science (Darwin, Newton, etc) intrinsic interest in a topic seems like one of the best predictors of actually making the most important breakthroughs.

In other words: I think your model of what produces more useful research from an alignment perspective overweights first-order effects (if people care more they'll do more relevant work) and ignores the second-order effects that IMO are more important (1. Great breakthroughs seem, historically, to be primarily motivated by intrinsic interest; and 2. Creating research communities that are gatekept by people's beliefs/motivations/ideologies is corrosive, and leads to political factionalism + ingroupiness rather than truth-seeking.)

"I'm not primarily trying to judge people, I'm trying to exhort people"

Well, there are a lot of grants given out for alignment research. Under your definition, those grants would only be given to people who express the right shibboleths.

I also think that the best exhortation of researchers mostly looks like nerdsniping them, and the way to do that is to build a research community that is genuinely very interested in a certain set of (relatively object-level) topics. I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting). But any step in the pipeline that prioritizes "alignment researchers" (like: who gets invited to alignment workshops, who gets alignment funding or career coaching, who gets mentorship, etc) will prioritize the latter over the former if they're using your definition.

I think we're interpreting "pluralism" differently. Here are some central illustrations of what I consider to be the pluralist perspective: 

If I change "i.e. the pluralist focus Alex mentions" to "e.g. the pluralist focus Alex mentions" does that work? I shouldn't have implied that all people who believe in heuristics recommended by many religions are pluralists (in your sense). But it does seem reasonable to say that pluralists (in your sense) believe in heuristics recommended by many religions, unless I'm misunderstanding you. (In the examples you listed these would be heuristics like "seek spiritual truth", "believe in (some version of) God", "learn from great healers", etc.)

I think this doesn't work for people with IQ <= 100, which is about half the world. I agree that an understanding of these insights is necessary to avoid incorporating the toxic parts of Christianity, but I think this can be done even using the language of Christianity. (There's a lot of latitude in how one can interpret the Bible!)

I personally don't have a great way of distinguishing between "trying to reach these people" and "trying to manipulate these people". In general I don't even think most people trying to do such outreach genuinely know whether their actual motivations are more about outreach or about manipulation. (E.g. I expect that most people who advocate for luxury beliefs sincerely believe that they're trying to help worse-off people understand the truth.) Because of this I'm skeptical of elite projects that have outreach as a major motivation, except when it comes to very clearly scientifically-grounded stuff.

Richard_NgoΩ341

What if your research goal is "I'd like to understand how neural networks work"? This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that's not an inherent part of the research goal for many interpretability researchers.

(Same for "I'd like to understand how agency works", which is a big motivation for many agent foundations researchers.)

Conversely, what if your research goal is "I'm going to design a training run that will produce a frontier model, so that we can study it to advance alignment research"? Seems odd, but I'd bet that (e.g.) a chunk of Anthropic's scaling team thinks this way. Counts as alignment under your definition, since that's the primary goal of the research.

More generally, I think it's actually a very important component of science that people judge the research itself, not the motivations behind it—since historically scientific breakthroughs have often come from people who were disliked by establishment scientists. A definition that basically boils down to "alignment research is whatever research is done by the people with the right motivations" makes it very easy to prioritize the ingroup. I do think that historically being motivated by alignment has correlated with choosing valuable research directions from an alignment perspective (like mech interp instead of more shallow interp techniques) but I think we can mostly capture that difference by favoring more principled, robust, generalizable research (as per my definitions in the post).

Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)

I agree. I'll add a note in the post saying that where you end up on the alignment spectrum should also account for the feasibility of the research direction.

Though note that we can interpret your definition as endorsing this too: if you really hate the idea of making AIs more capable, then that might motivate you to switch from scalable oversight to agent foundations, since scalable oversight will likely be more useful for capabilities progress.

Some quick reactions:

  1. I believe that religions contain a lot of psychological and cultural insight, and it's plausible to me that many of them contain many of the same insights.
  2. Religions can be seen as solutions to the coordination problem of how to get many very different people to trust each other. However, most of them solve it in a way which conflicts with other valuable cultural technologies (like science, free speech, liberal democracy, etc). I'm also sympathetic to the Nietzschean critique that they solve it in a way which conflicts with individual human agency and flourishing.
  3. Religions are, historically, a type of entity that consolidates power. E.g. Islam right now has a lot of power over a big chunk of the world. We should expect that the psychological insights within religions (and even the ones shared across religions) have been culturally selected in part for allowing those religions to gain power.

So my overall position here is something like: we should use religions as a source of possible deep insights about human psychology and culture, to a greater extent than LessWrong historically has (and I'm grateful to Alex for highlighting this, especially given the social cost of doing so).

But we shouldn't place much trust in the heuristics recommended by religions, because those heuristics will often have been selected for some combination of:

  1. Enabling the religion as a whole (or its leaders) to gain power and adherents.
  2. Operating via mechanisms that break in the presence of science, liberalism, individualism, etc (e.g. the mechanism of being able to suppress criticism).
  3. Operating via mechanisms that break in the presence of abrupt change (which I expect over the coming decades).
  4. Relying on institutions that have become much more corrupt over time.

The difference between a heuristic and an insight, as I'm using the terms, is something like the difference between "be all-forgiving" and "if you are all-forgiving it'll often defuse a certain type of internal conflict". Insights are about what to believe; heuristics are about what to do. Insights can be cross-checked against the rest of our knowledge, whereas heuristics are much less legible because in general they don't explain why a given thing is a good idea.

IMO this all remains true even if we focus on the heuristics recommended by many religions, i.e. the pluralistic focus Alex mentions. And it remains true even given the point Alex made near the end: that "for people in Christian Western culture, I think using the language of Christianity in good ways can be a very effective way to reach the users." Because if you understand the insights that Christianity is built upon, you can use those to reach people without the language of Christianity itself. And if you don't understand those insights, then you don't know how to avoid incorporating the toxic parts of Christianity.

Richard_NgoΩ450

Fair point. I've now removed that section from the post (and also, unrelatedly, renamed the post).

I was trying to make a point about people wanting to ensure that AI in general (not just current models) is "aligned", but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that, but I'll discuss them in a different post.

"Afaict, the current world we’re in is basically the worst case scenario"

"the status quo is not, imo, a remotely acceptable alternative either"

Both of these quotes display types of thinking which are typically dangerous and counterproductive, because they rule out the possibility that your actions can make things worse.

The current world is very far from the worst-case scenario (even if you have very high P(doom), it's far away in log-odds) and I don't think it would be that hard to accidentally make things considerably worse.
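To make the log-odds point concrete, here's a quick sketch with made-up numbers (these aren't anyone's actual estimates):

```python
import math

def log_odds_bits(p):
    """Log-odds of probability p, in bits."""
    return math.log2(p / (1 - p))

# Illustrative values only. Moving from P(doom) = 0.9 to 0.99 is a jump of
# roughly 3.5 bits of log-odds, and the literal worst case (P(doom) -> 1) is
# infinitely far away on this scale.
for p in [0.5, 0.9, 0.99, 0.999]:
    print(f"P(doom) = {p}: {log_odds_bits(p):+.2f} bits")
```

So even a pessimist has a lot of room left to lose, which is the sense in which careless actions can still make things considerably worse.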

AGI is heavy-tailed in both directions I think. I don't think we get utopias by default even without misalignment, since governance of AGI is so complicated.
