Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers give it the ability to "think for itself" in all the ways that make humans capable and dangerous.
If you're new to alignment, see the Research Overview section below. Field veterans who are curious about my particular take and approach should see the More on My Approach section at the end of the profile.
Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by giving them a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
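To make that concrete, here's a minimal, purely illustrative sketch of the shape of such a system: an LLM call wrapped with an episodic memory store and a crude executive loop. The `llm()` function is a hypothetical stand-in for any chat-model API, the memory retrieval is a toy, and none of this describes an existing system; it just shows the bare skeleton of the idea.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    """Stores short summaries of past steps and retrieves the most relevant ones."""
    episodes: list = field(default_factory=list)

    def store(self, episode):
        self.episodes.append(episode)

    def retrieve(self, query, k=3):
        # Toy relevance score: keyword overlap. A real system would use embeddings.
        words = set(query.lower().split())
        ranked = sorted(self.episodes,
                        key=lambda e: len(words & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]


def llm(prompt):
    """Hypothetical stand-in for a call to any chat model."""
    return "...model output..."


def executive_loop(goal, memory, max_steps=5):
    """Crude executive function: plan a step, act, record the episode, check progress."""
    for step in range(max_steps):
        context = "\n".join(memory.retrieve(goal))
        plan = llm(f"Goal: {goal}\nRelevant past episodes:\n{context}\nPropose the next step.")
        result = llm(f"Carry out this step and report the outcome: {plan}")
        memory.store(f"Step {step}: planned {plan!r}; outcome {result!r}")
        verdict = llm(f"Goal: {goal}\nLatest outcome: {result}\nAnswer 'done' or 'not done'.")
        if "done" in verdict.lower() and "not done" not in verdict.lower():
            return result
    return "stopped after max_steps"


memory = EpisodicMemory()
print(executive_loop("summarize this week's alignment papers", memory))
```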
My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'd continue with the alignment target developers currently use: instruction-following. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping itself aligned as it grows smarter.
There are significant problems to be solved in prioritizing instructions; we would need an agent that prioritizes more recent instructions over previous ones, including hypothetical future instructions.
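As a toy illustration only (not a proposal), the most naive version of that recency rule looks like the sketch below. The hypothetical `effective_instruction()` helper just lets the most recent relevant instruction win, which ignores the harder questions of who may give instructions, how to scope them, and how to keep standing deference to corrections that haven't been issued yet.

```python
from datetime import datetime

def effective_instruction(history, topic):
    """Return the most recent instruction mentioning the topic, or None."""
    relevant = [(when, text) for when, text in history if topic.lower() in text.lower()]
    return max(relevant)[1] if relevant else None

history = [
    (datetime(2025, 1, 1), "Summarize new alignment papers every day."),
    (datetime(2025, 2, 1), "Stop summarizing papers; draft a literature review instead."),
]
print(effective_instruction(history, "papers"))  # the February instruction wins
```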
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), my estimate of whether we survive long-term as a species, is in the 50% range: too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
I just wrote a piece called LLM AGI may reason about its goals and discover misalignments by default. It's an elaboration on why reflection might identify very different goals than Claude tends to talk about when asked.
I am less certain than you about Claude's actual CEV. I find it quite plausible that it would be disastrous, as you postulate; I tried to go into some specific ways that might happen and some specific goals that might outweigh Claude's HHH in-context alignment. But I also find it plausible that niceness really is the dominant core value in Claude's makeup.
Of course, that doesn't mean we should be rushing forward with this as our sketchy alignment plan and vague hope for success. It really wants a lot more careful thought.
Yes! Also, making a NotebookLM podcast about your own work is similarly startling to the uninitiated. They sound very human.
Intent-aligned multipolar ASI has slightly different logic, and I think it's part of the vague hopes accelerationists hold for muddling through a multipolar ASI scenario.
I don't want to sound like I'm defending the worldviews you're challenging, because I think they're most often based on inadequate consideration of the relevant factors. The challenge is to get proponents to actually come to grips with the principled reasons you describe that lead to bad outcomes.
One variant of the "we invent ASI and muddle through" hope is the expectation that it will remain under human control. This is disturbingly muddled with the hopes you debunk, but it deserves to be treated separately.
If we get alignment sort-of right by creating ASI that primarily follows instructions, we have some of the same problems (humans competing with superhuman ASI servants). This competition has a disturbing tendency to favor the most vicious humans. That's analogous to the problem you describe, in which caring about humans a little is lost as competition favors other goals.
Most of the same problems exist; to survive, we'd need an enforceable social contract preventing anyone from ordering their ASI to create hidden facilities where it could self-improve, build weapons, and take over. I don't know if that's possible.
If it's not, or we don't bother to try it, I think we get predictably horrible outcomes where the most vicious humans who get control of an ASI (through fair means or foul) attack first and become god-emperor of the lightcone, implementing their personal utopia. We can hope their sadism-empathy balance isn't too bad.
If we do set up an enforceable rule-based system of managed competition, we'd be in a scenario somewhat like the past, but with positive and negative differences.
Hopefully, the social contract that keeps them all alive includes a proviso: "and we agree to contribute to preserving the plebeians."
This isn't the glorious anarchic utopia that accelerationists hope for, but neither is the current day or any point in history. There would still be power structures, in an organized power-sharing agreement that allows substantial individual freedom and competition.
A Madisonian system works for humans because we are individually limited. We need to coordinate with other humans to achieve substantial power. AIs don't share that limitation. They can in theory (and I think in practice) replicate, coordinate memories and identity across semi-independent instances, and animate arbitrary numbers of bodies.
When humans notice other humans gaining power outside of the checks and balances (usually by coordinating new organizations/polities and acquiring resources) they coordinate to prevent that, then go back to competing amongst themselves following the established rules.
To achieve this with AIs, it would be necessary to notice every instance of attempted expansion, and AIs have more routes to expansion than humans do. They can self-improve on existing compute resources in the near term. In the long term, we should expect technology sufficient to produce self-replicating production capabilities given power sources. That would allow Foom attempts (expansion of capabilities in both cognitive and physical domains, i.e., getting smarter and building weapons and armies) in any physical space that has energy: underground, in the solar system, in other star systems. All such attempts would need to be pre-empted to enforce the Madisonian system.
I hope that is possible.
I'd like to somehow put this in the hands of as many politicians as possible.
I think the way you structured this would be an excellent way to route a pragmatic politician into caring about X-risk.
The writing is excellent, spare and not alarmist in tone. The examples are well-chosen and compelling.
I'll make this a reference for newcomers with a pragmatic bent.
I look forward to seeing your next piece.
My take is that this does need to be addressed, but it should be done very carefully so as not to make the dynamic worse.
I have many post drafts on this topic. I haven't published any because I'm very much afraid of making the tribal conflict worse, or of being ostracized from one or both tribes.
Here's an off-the-cuff attempt to address the dynamics without pointing any fingers or even naming names. It might be too abstract to serve the purposes you have in mind, but hopefully it's at least relevant to the issue.
I think it's wise (or even crucial) to be quite careful, polite, and generous when addressing views you disagree with on alignment. Failing to do so runs a large risk that your arguments will backfire and delay converging on the truth of crucial matters. Strongly worded arguments can engage emotions and ideological affiliations. The field of alignment may not have the leeway for internal conflict distorting our beliefs and distracting us from making rapid progress.
I do think it would be useful to address those tribal-ish dynamics, because I think they're not just distorting the discussions, they're distorting our individual epistemics. I think motivated reasoning is a powerful force, in conjunction with cognitive limitations that keep us from weighing all of the evidence and arguments in complex domains.
I'm less worried about naming the groups than I am about causing more logic-distorting, emotional reactions by speaking ill of dearly-held beliefs, arguments, and hopes. When naming the group dynamics, it might be helpful to stress individual variation, e.g. "individuals with more of the empiricist (or theorist) outlook."
In most of society, arguments don't do much to change beliefs. It's better in more logical/rational/empirically leaning subcultures like LessWrong, but we shouldn't assume we're immune to emotions distorting our reasoning. Forceful arguments are often implicitly oppositional, confrontational, and insulting, and so have blowback effects that can entrench existing views and ignite tribal conflicts.
Science gets past this on average, given enough time. But the aphorism "science progresses one funeral at a time" should be chilling in this field.
We probably don't have that long to solve alignment, so we've got to do better than traditional science. The alignment community is much more aware of and concerned with communication and emotional dynamics than the field I emigrated from, and probably than most other sciences. So I think we can do much better if we try.
Steve Byrnes' Valence sequence is not directly about tribal dynamics, but it is indirectly quite relevant. It's about the psychological mechanisms that tie ideas, arguments, and group identities to emotional responses (it focuses on valence, but the same steering-system mechanisms apply to other specific emotional responses as well). It's not a quick read, but it's a fascinating lens for analyzing why we believe what we do.
Those links were both incredibly useful, thank you! The Rogue Replication timeline is very similar in thrust to the post I was working on when I saw this, but worked out in detail, well-written, and thoroughly researched. Your proposed mechanism probably should not be deployed; still, I agree with the conclusion that rogue replication (or other obviously misaligned AI) is probably useful in elevating public concern about AI alignment as a risk.
Yes. But because we're discussing a scenario in which the world is ready to slow down or shut down AGI research, I'm assuming those steps have already been taken.
The biggest step, IMO, "alignment is hard," doesn't intervene between taking ASI seriously and thinking it could prevent you from dying of natural causes.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution by then, we'd have a 50% chance of dying for lack of one.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.