I've been doing computational cognitive neuroscience from getting my PhD in 2006 until the end of 2022. I've worked on a bunch of brain systems, focusing on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I'm incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
I probably should've titled this "the alignment stability problem in artificial neural network AI". There's plenty of work on algorithmic maximizers. But it's a lot trickier if values/goals are encoded in a network's distributed representations of the world.
I also should've cited Alex Turner's Understanding and avoiding value drift. There he makes a strong case that dominant shards will try to prevent value drift in which other shards establish stronger connections to reward. But that's not quite good enough. Even if it avoids sudden value drift, at least for the central shard or central tendency in values, it doesn't really address the stability of a multi-goal system. And it doesn't address slow, subtle drift over time.
Those are important, because we may need a multi-goal system, and we definitely want alignment to stay stable over years, if not centuries, of learning and reflection.
Reflective stability does seem like the right term. Searches on that term are turning up some relevant discussion on alignment forum, so thanks!
Tiling agent theory is about formal proof of goal consistency in successor agents. I don't think that's relevant for any AGI made of neural networks similar to either brains or current systems. And that's a problem.
Reflective consistency looks to be about decision algorithms given beliefs, so I don't think that's directly relevant. I couldn't work out Yudkowsky's use of reflectively coherent quantified belief on a quick look, but it's in service of that closed-form proof. That term only occurs three times on AF. Reflective trust is about internal consistency and decision processes relative to beliefs and goals, and it also doesn't seem to have caught on as common terminology.
So the reflective stability term is what I'm looking for, and should turn up more related work. Thanks!
This sounds way too capable to be safe. Although someone is probably working on this right now, this line of thought getting traction might increase the number of people doing it 10x. Maybe that's good, since GPT-4 probably isn't smart enough to kill us, even with an agent wrapper. It will just scare the pants off of us.
Aligning the wrapper is somewhat similar to my suggestion of aligning an RL critic network head, such as humans seem to use. Align the captain, not the crew. And let the captain use the crew's smarts without giving them much say in what to do or how to update them.
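To make the captain-and-crew picture a bit more concrete, here's a minimal sketch in Python (all names, classes, and numbers are hypothetical illustrations, not anyone's actual proposal): the "crew" proposes plans and predicts their outcomes but is never trained on the captain's evaluations, while the "captain" is a small critic head that scores predicted outcomes and is the only component updated from the reward signal we control.

```python
import numpy as np

rng = np.random.default_rng(0)

class Crew:
    """Capable but untrusted component: proposes plans and predicts outcomes.
    Its parameters are never updated by the captain's evaluations."""
    def propose(self, state, n=8):
        # Hypothetical stand-in: candidate plans as random feature vectors.
        return [rng.normal(size=state.shape) for _ in range(n)]

    def predict_outcome(self, state, plan):
        # Predicted next-state features if this plan were executed.
        return state + 0.1 * plan

class Captain:
    """Aligned critic head: a small linear value model over outcome features."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr

    def value(self, outcome):
        return float(self.w @ outcome)

    def choose(self, state, crew):
        # Use the crew's smarts (proposals and predictions) without giving it
        # any say in which plan gets picked or how the captain updates.
        plans = crew.propose(state)
        outcomes = [crew.predict_outcome(state, p) for p in plans]
        best = int(np.argmax([self.value(o) for o in outcomes]))
        return plans[best], outcomes[best]

    def update(self, outcome, reward):
        # Only the captain learns from the (human-controlled) reward signal.
        error = reward - self.value(outcome)
        self.w += self.lr * error * outcome

# Toy loop: the crew supplies the smarts, the captain supplies the values.
state = rng.normal(size=4)
crew, captain = Crew(), Captain(dim=4)
for step in range(100):
    plan, predicted = captain.choose(state, crew)
    reward = -np.sum((predicted - 1.0) ** 2)   # stand-in for an aligned signal
    captain.update(predicted, reward)
    state = predicted
```

The point of the split is only that the trusted, trainable surface area stays small; how well that works with a crew far smarter than this toy one is exactly the open question.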
I also don't prioritize immunizing against disinformation. And of course this is a "haha, we're all going to die" joke. I'm going to hope for an agentized virus including GPT-4 calls, roaming the internet and saying the scariest possible stuff, without being quite smart enough to kill us all. That'll learn 'em.
I'm not saying OpenAI is planning that. Or that they're setting a good example. Just let's hope for that.
I'll definitely check these out, thanks! Reflective stability sounds exactly right.
I don't think Montague dealt with that issue much if at all. But it's been a long time since I read the book.
My biggest takeaway from Tomasello's work was his observation that humans pay far more attention to other humans than monkeys do to monkeys. Direct reward for social approval is one possible mechanism, but it's also possible that some other bias in the system is responsible. I think hardwired reward for social approval is probably a real mechanism. But the correlation between people's approval and more direct rewards of food, water, and shelter may also play a large role in making human approval and disapproval a conditioned stimulus (or a fully "substituted" stimulus). But I don't think that distinction is very relevant for guessing the scope of the critic's association.
But inevitably these are self-terminating when they conflict strongly with more basic survival values.
I completely agree. This is the basis of my explanation for how humans could attribute value to abstract representations and not wirehead. In sum, a system smart enough to learn about the positive values of several-steps-removed conditioned stimuli can also learn many indicators of when those abstractions won't lead to reward. These may be cortical representations of planning-but-not-doing, or other indicators in the cortex of the difference between reality and imagination. The weaker nature of simulation representations may be enough to distinguish the two, and it should certainly be enough to ensure that real rewards and punishments always have a stronger influence, keeping imagination ultimately under the control of reality.
If you've spent the afternoon wireheading by daydreaming about how delicious that fresh meat is, you'll be very hungry in the evening. Something has gone very wrong, in much the same way as if you chose to hunt for game where there is none. In both cases, the system is going to have to learn where the wrong decision was made and the wrong strategy was followed. If you're out of a job and out of money because you've spent months arguing with strangers on the internet about your beloved concept of freedom and the type of political policy that will provide it, something has similarly gone wrong. You might downgrade your estimated value of those concepts and theories, and you might downgrade the value of arguing on the internet with strangers all day.
The same problem arises with any use of value estimates to make prediction-based decisions. It could be that the dopamine system is not involved in these predictions. But given the data that dopamine spiking activity is ubiquitous[1], even when no physical or social rewards are present, it seems likely to me that the system is working the same way in abstract domains as it is known to work in concrete ones.
I need to find this paper, but don't have time right now. The finding was that rodents exploring a new home cage exhibit dopamine spiking activity something like once a second or so on average. I have a clear memory of the claim, but didn't evaluate the methods closely enough to be sure the claim was well supported. If I'm wrong about this, I'd change my mind about the system working this way.
This could be explained by curiosity as an innate reward signal, and that might well be part of the story. But you'd still need to explain why animals don't die by exploring instead of finding food. The same core explanation works for both: imagination and curiosity are both constrained to be weaker signals than real physical rewards.
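As a toy illustration of that constraint (a hedged sketch only, not a model of the dopamine system), a simple TD(0) learner can let imagined rollouts nudge its value estimates while capping their reward magnitude and learning rate well below those of real experience, so real rewards and punishments always dominate. All parameters here are made up for illustration.

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)                          # value estimates
gamma, lr_real, lr_imagined = 0.9, 0.1, 0.02    # imagination learns more weakly

def td_update(s, r, s_next, lr):
    """Standard TD(0) update: delta = r + gamma * V(s') - V(s)."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += lr * delta
    return delta

def real_step(s, r, s_next):
    # Real experience updates values at full strength.
    return td_update(s, r, s_next, lr_real)

def imagined_step(s, r_imagined, s_next, cap=0.5):
    # Imagined rewards are clipped to a fraction of what real rewards can deliver,
    # and learned from more slowly, so daydreaming can't outweigh reality.
    return td_update(s, np.clip(r_imagined, -cap, cap), s_next, lr_imagined)

# A little real experience, then a lot of wishful imagination:
real_step(0, 1.0, 1)
for _ in range(100):
    imagined_step(0, 10.0, 1)   # fantasizing about huge payoffs
print(V[0])  # stays bounded by the clipped, down-weighted imagined signal
```

The analogous claim for brains would be that simulated experience simply can't drive the critic as hard as real food, water, or pain can, which is what keeps the curious or daydreaming animal from starving.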
I agree with all of the premises. This timeline is short even for AGI safety people, but it also seems quite plausible.
I think there are people thinking about aligning true intelligence (that is, agentic, continually learning, and therefore self-teaching and probably self-improving in architecture). Unfortunately, that doesn't change the logic, because those people tend to have very pessimistic views on our odds of aligning such a system. I put Nate Soares, Eliezer Yudkowsky, and others in that camp.
There is a possible solution: build AGI that is human-like. The better humans among us are safely and stably aligned. Many individuals would be safe stewards of humanity's future, even if they changed and enhanced themselves along the road.
Creating a fully humanlike AGI is an unlikely solution, since the timeline for that would be even longer than the timelines for effective human upgrades via AI enhancement through BCI.
But there is already work on roughly human-like AGI. I put DeepMind's focus on deep RL agents in this category. And there are proposed solutions that would produce at least short-term, if not long-term, alignment of that type of system. Steve Byrnes has proposed one such solution, and I've proposed a similar one.
Even partial success at this type of solution might keep loosely brainlike AGI aligned long enough for other solutions to be brought into play.
I think there's a delay in outreach for three reasons.
I do think the community is moving toward focusing more on this angle. And that we probably should.
Sorry for the obscure reference. Alignment Forum is the professional variant of Less Wrong. It has membership by invitation only, which means you can trust the votes and comments to be better informed, and from real people and not fake accounts.
That's a good suggestion. But at some point you have to let it die or wrap it up. It occurred to me while Eliezer was repeatedly trying to get Lex back onto the you're-in-a-box-thinking-faster thought experiment: when I'm frustrated with people for not getting it, I'm often probably boring them. They don't even see why they should bother to get it.
You have to know when to let an approach die, or otherwise change tack.