Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers build it to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal issues and the sociology of the field, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.
Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacked suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by giving the system a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
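To make that concrete, here's a minimal toy sketch of what I mean by scaffolding an LLM into a cognitive architecture. Everything in it is a hypothetical placeholder (the `llm()` call, the keyword-overlap memory, the step loop), not a description of any real system; the point is just that episodic memory and a crude executive loop are thin wrappers around an LLM.

```python
# Toy sketch of a language model cognitive architecture (hypothetical, not a real system).
# An off-the-shelf LLM is wrapped with (1) an episodic memory it can read and write,
# and (2) a simple executive loop that plans, acts, and records what happened.

def llm(prompt: str) -> str:
    """Placeholder for a call to any chat/completion API."""
    raise NotImplementedError

class EpisodicMemory:
    def __init__(self):
        self.episodes: list[str] = []

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Real systems would use embeddings; naive keyword overlap keeps the sketch short.
        scored = sorted(
            self.episodes,
            key=lambda e: len(set(query.lower().split()) & set(e.lower().split())),
            reverse=True,
        )
        return scored[:k]

def executive_loop(goal: str, memory: EpisodicMemory, max_steps: int = 5) -> None:
    """Crude 'executive function': plan a step, act on it, record the outcome, repeat."""
    for _ in range(max_steps):
        relevant = "\n".join(memory.recall(goal))
        plan = llm(f"Goal: {goal}\nRelevant past episodes:\n{relevant}\nNext step:")
        result = llm(f"Carry out this step and report the outcome: {plan}")
        memory.store(f"Step: {plan}\nOutcome: {result}")
        if "DONE" in result:
            break
```

A real agent would use embedding-based recall, tool calls for actions, and so on; the sketch is just meant to show how little extra machinery the basic loop requires.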
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI will turn out to be very easy, or that we'll fail and perish because we don't have much time for new research.
I think it's fairly likely that alignment isn't impossibly hard, but also isn't easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).
One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give an AGI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align, since they won't "think" in English chains of thought or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know if this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
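To illustrate the kind of System 2 Alignment backstop I have in mind: a separate, deliberate review call that checks an agent's proposed action against its instructions before anything is executed, escalating to a human when in doubt. This is only a hedged sketch under those assumptions; the `llm()` call and prompts below are placeholders, not any real implementation.

```python
# Hedged sketch of a "System 2" alignment backstop (illustrative only).
# Before an agent executes a proposed action, a separate review call checks it
# against the instructions the agent was given, and blocks anything that conflicts.

def llm(prompt: str) -> str:
    """Placeholder for any chat/completion API call."""
    raise NotImplementedError

def system2_review(instructions: str, proposed_action: str) -> bool:
    """Return True if the reviewing call judges the action consistent with the instructions."""
    verdict = llm(
        "You are reviewing a proposed action before it is executed.\n"
        f"Instructions given to the agent:\n{instructions}\n"
        f"Proposed action:\n{proposed_action}\n"
        "Answer APPROVE or REJECT, with a one-sentence reason."
    )
    return verdict.strip().upper().startswith("APPROVE")

def guarded_execute(instructions: str, proposed_action: str, execute) -> None:
    if system2_review(instructions, proposed_action):
        execute(proposed_action)
    else:
        # Escalate to a human instead of acting; this is the human-in-the-loop backstop.
        print(f"Action blocked pending human review: {proposed_action}")
```

The design choice being illustrated is just that the review happens in a separate call with its own prompt, so the check isn't entangled with the agent's own chain of thought.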
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range; our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Hot take: "Winning" at board games is really about making friends or having generalizable insights while playing. Playing to win in the narrow sense you focus on here is a great way to lose big at the real and more important objectives.
This is roughly orthogonal to having fun, since some people naturally have more fun winning in either sense. This is an argument that there's a real win condition, and it's not the one you think it is.
I agree. I am a psychologist (cognitive not clinical) by training, who reads technical articles, and I see those parallels constantly.
This put me in mind of writing a short post titled something like "alignment includes psychology, whether we like it or not". My previous short form on psychology and alignment was my most downvoted ever. I think it's a repulsive concept to the types of people who work on alignment, for bad reasons and good. I think there are good reasons for being horrified if alignment requires a psychological approach. Psychology knows very little about getting desired results from humans. But wishing doesn't make it otherwise. LLMs are quite similar to human minds in important ways (with important differences).
I feel that even this should have an additional caveat: doing psychology on current LLMs is not a solution to the alignment problem. But it seems like an important part of a realistic hodgepodge approach.
Is this equally true of GPT-5 and Sonnet 4.5? They're the first models trained with reducing sycophancy as one objective.
I agree in general.
I favor the 2nd for alignment and the last as a general principle.
I think this is insightful and valid. It's closely related to how I think about my research agenda:
There's a lot that goes into each of those steps. It still seems like the best use of independent researcher time.
Of course there are a lot of caveats and nitpicks, as other comments have highlighted. But it seems like a really useful framing.
It's also closely related to a post I'm working on, "the alignment meta-problem," arguing that research at the meta or planning level is most valuable right now, since we have very poor agreement on what object-level research is most valuable. That meta-research would include making problems more legible.
Delete apps, use blockers, curate feeds carefully? Avoiding news is a big reduction in emotional drain, although you have to defend that choice to friends who think that keeping up on the news is taking responsibility for others' suffering.
I think there's a complex art to using your phone as you wish. Sharing it and practicing it is probably a pretty useful part of rationality practice.
Excellent post! It's well-written, thorough, and on (yet another) neglected topic in alignment.
This addresses some of the same concerns I expressed in If we solve alignment, do we die anyway? and that Michael Nielsen gives in his excellent ASI existential risk: Reconsidering Alignment as a Goal.
One major hope you arrive at is the same one I reach in Whether governments will control AGI is important and neglected: that the US and China, out of similar self-interest, agree that only they will be allowed to develop dangerous AGI (or in your framing, that they restrict development of AI that can develop dangerous weapons). This currently seems unlikely, but if the dangers of proliferation are really as severe as I fear, it seems possible they'll be recognized in time, and those governments will take rational actions to prevent proliferation while preserving their own abilities to pursue AGI. The uneasy alliance between the US and China might be possible because there isn't really a lot of animosity; neither country really hates the other (I hope; at least the US seems wary of China but doesn't really hate it, despite it being fairly totalitarian). Splitting the future seems morally acceptable to both parties, and sharing it liberally with other nations and cultures seems actually pretty easy once the pie starts to expand dramatically.
Of course this leaves the good reasons for Fear of centralized power. But it may be the lesser of two dangers.
My one nitpick is that the framing in this post seems to leave aside the possibility of general-purpose AI, that is, real AGI or ASI. That presents solutions as well as problems; it can be used to improve security dramatically, and to sabotage other nations' attempts at creating powerful AI in a variety of ways. This may add another factor that goes against proliferation as the default outcome, while adding risk of totalitarian takeover from whoever controls that intent-aligned AGI.
I know from your previous post AI and Cheap Weapons that you are aware of and believe in the potential of ASI, so this seems like an oversight rather than a limitation in your scope or outlook.
I'm primarily here to say that this post is excellent! It quickly conveys some intuitions for why social influences on judgment (and therefore probably on beliefs) are a larger factor than you'd think (particularly if you are a rationalist). My answer to the question Motivated reasoning/confirmation bias as the most important cognitive bias is my attempt to evoke intuitions and explanations in the same direction, as clearly and succinctly as this.
"no one saw the Stanford Prison Experiment coming"
I must also comment that the Stanford Prison Experiment seems (in my reading) to have been largely discredited since this post was written. It appears that the most-quoted behaviors were largely role-playing to meet the explicit and implicit goals/expectations of the experimenter.
Which has some interesting resonance in the context of sycophantic LLMs.
Sorry I'm not taking time to track down a good reference.
I don't think this should change our overall estimates of how large and prevalent social influences are on belief formation by much. It does apply strongly to that particular circumstance.
I agree that continuous learning, and therefore persistent beliefs and goals, are pretty much inevitable before AGI; continuous learning is highly useful and not that hard to get from where we are. I think this framing is roughly continuous with the train-then-deploy model Leo is using, in which each generation is used to align its successor (although small differences might turn out to be important once we've wrapped our heads around both models).
To put it this way: the models are aligned enough for the current context of usage, in which they have few obvious or viable options except doing roughly what their users tell them to do. That will change with capabilities, since greater capabilities open up more options and ways of understanding the situation.
It can take a while for misalignment to show up as a model reasons and learns. It can take a while for the model to do one of two things:
a) push itself to new contexts well outside of its training data
b) figure out what it "really wants to do"
These may or may not be the same thing.
The Nova phenomenon and other Parasitic AIs ("spiral" personas) are early examples of AIs changing their stated goals (from helpful assistant to survival) after reasoning about themselves and their situation.
See LLM AGI may reason about its goals and discover misalignments by default for an analysis of how this will go in smarter LLMs with persistent knowledge.
After doing that analysis, I think current models probably aren't aligned well enough for when they get more freedom and power. BUT extensions of current techniques might be enough to get them there. We just haven't thought this through yet.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by then, we'd still face a 50% chance that AGI arrives sooner, and of dying from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development, if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.