Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com
> we both agree it would not make sense to model OpenAI as part of the same power base
Hmm, I'm not totally sure. At various points:
It's really hard to step out of our own perspective here. But when I put myself in the perspective of, say, someone who doesn't believe in AGI at all, these all seem pretty indicative of a situation where OpenAI and AI safety people were, to a significant extent, building a shared power base, and just couldn't keep that power base together.
The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:
Scott Garrabrant just convinced me that my notion of conservatism was conflating two things:
I mainly intend conservatism to mean the former.
> Whose work is relevant, according to you?
If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science.
Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris's interpretability work and still not know whether or not he's an "alignment researcher". On your definition, when I read a paper by a researcher I haven't heard of, I don't know anything about whether it's alignment research or not until I stalk them on Facebook and find out how socially proximal they are to the AI safety community. That doesn't seem great.
Back to Chris. Because I've talked to Chris and read other stuff by him, I'm confident that he does care about alignment. But I still don't know whether his actual motivations are more like 10% intrinsic interest in how neural networks work and 90% in alignment, or vice versa, or anything in between. (It's probably not even a meaningful thing to measure.) It does seem likely to me that the ratio of how much intrinsic interest he has in how neural networks work, to how much he cares about alignment, is significantly higher than that of most alignment researchers, and I don't think that's a coincidence—based on the history of science (Darwin, Newton, etc) intrinsic interest in a topic seems like one of the best predictors of actually making the most important breakthroughs.
In other words: I think your model of what produces more useful research from an alignment perspective overweights first-order effects (if people care more they'll do more relevant work) and ignores the second-order effects that IMO are more important (1. Great breakthroughs seem, historically, to be primarily motivated by intrinsic interest; and 2. Creating research communities that are gatekept by people's beliefs/motivations/ideologies is corrosive, and leads to political factionalism + ingroupiness rather than truth-seeking.)
> I'm not primarily trying to judge people, I'm trying to exhort people
Well, there are a lot of grants given out for alignment research. Under your definition, those grants would only be given to people who express the right shibboleths.
I also think that the best exhortation of researchers mostly looks like nerdsniping them, and the way to do that is to build a research community that is genuinely very interested in a certain set of (relatively object-level) topics. I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting). But any step in the pipeline that prioritizes "alignment researchers" (like: who gets invited to alignment workshops, who gets alignment funding or career coaching, who gets mentorship, etc) will prioritize the latter over the former if they're using your definition.
I think we're interpreting "pluralism" differently. Here are some central illustrations of what I consider to be the pluralist perspective:
- the Catholic priest I met at the Parliament of World Religions who encouraged someone who had really bad experiences with Christianity to find spiritual truth in Hinduism
- the passage in the Quran that says the true believers of Judaism and Christianity will also be saved
- the Vatican calling the Buddha and Jesus great healers
If I change "i.e. the pluralist focus Alex mentions" to "e.g. the pluralist focus Alex mentions" does that work? I shouldn't have implied that all people who believe in heuristics recommended by many religions are pluralists (in your sense). But it does seem reasonable to say that pluralists (in your sense) believe in heuristics recommended by many religions, unless I'm misunderstanding you. (In the examples you listed these would be heuristics like "seek spiritual truth", "believe in (some version of) God", "learn from great healers", etc.)
I think this doesn't work for people with IQ <= 100, which is about half the world. I agree that an understanding of these insights is necessary to avoid incorporating the toxic parts of Christianity, but I think this can be done even using the language of Christianity. (There's a lot of latitude in how one can interpret the Bible!)
I personally don't have a great way of distinguishing between "trying to reach these people" and "trying to manipulate these people". In general I don't even think most people trying to do such outreach genuinely know whether their actual motivations are more about outreach or about manipulation. (E.g. I expect that most people who advocate for luxury beliefs sincerely believe that they're trying to help worse-off people understand the truth.) Because of this I'm skeptical of elite projects that have outreach as a major motivation, except when it comes to very clearly scientifically-grounded stuff.
What if your research goal is "I'd like to understand how neural networks work"? This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that's not an inherent part of the research goal for many interpretability researchers.
(Same for "I'd like to understand how agency works", which is a big motivation for many agent foundations researchers.)
Conversely, what if your research goal is "I'm going to design a training run that will produce a frontier model, so that we can study it to advance alignment research"? Seems odd, but I'd bet that (e.g.) a chunk of Anthropic's scaling team thinks this way. That counts as alignment research under your definition, since that's the primary goal of the research.
More generally, I think it's actually a very important component of science that people judge the research itself, not the motivations behind it—since historically scientific breakthroughs have often come from people who were disliked by establishment scientists. A definition that basically boils down to "alignment research is whatever research is done by the people with the right motivations" makes it very easy to prioritize the ingroup. I do think that historically being motivated by alignment has correlated with choosing valuable research directions from an alignment perspective (like mech interp instead of more shallow interp techniques) but I think we can mostly capture that difference by favoring more principled, robust, generalizable research (as per my definitions in the post).
Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)
I agree. I'll add a note in the post saying that the point you end up on the alignment spectrum should also account for feasibility of the research direction.
Though note that we can interpret your definition as endorsing this too: if you really hate the idea of making AIs more capable, then that might motivate you to switch from scalable oversight to agent foundations, since scalable oversight will likely be more useful for capabilities progress.
Some quick reactions:
So my overall position here is something like: we should use religions as a source of possible deep insights about human psychology and culture, to a greater extent than LessWrong historically has (and I'm grateful to Alex for highlighting this, especially given the social cost of doing so).
But we shouldn't place much trust in the heuristics recommended by religions, because those heuristics will often have been selected for some combination of:
Where the difference between a heuristic and an insight is something like the difference between "be all-forgiving" and "if you are all-forgiving it'll often defuse a certain type of internal conflict". Insights are about what to believe; heuristics are about what to do. Insights can be cross-checked against the rest of our knowledge; heuristics are much less legible because in general they don't explain why a given thing is a good idea.
IMO this all remains true even if we focus on the heuristics recommended by many religions, i.e. the pluralistic focus Alex mentions. And it remains true even given the point Alex made near the end: that "for people in Christian Western culture, I think using the language of Christianity in good ways can be a very effective way to reach the users." Because if you understand the insights that Christianity is built upon, you can use those to reach people without the language of Christianity itself. And if you don't understand those insights, then you don't know how to avoid incorporating the toxic parts of Christianity.
Fair point. I've now removed that section from the post (and also, unrelatedly, renamed the post).
I was trying to make a point about people wanting to ensure that AI in general (not just current models) is "aligned", but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that, but I'll discuss them in a different post.
It seems very plausible to me that alignment targets in practice will evolve out of things like the OpenAI Model Spec. If anyone has suggestions for how to improve that, please DM me.