Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. Now I'm applying what I've learned to the study of AI alignment. 

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  

More on approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are.  Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll do the obvious thing: design it to follow instructions. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done.  An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range; our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments


The important thing for alignment work isn't the median prediction; if we only had an alignment solution by then, we'd have a 50% chance of dying for lack of one.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.

Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.

I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.

LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.

In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.

I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending and stop if it hits $X and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the prospect of a big bot disaster more seriously.
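For concreteness, here's a toy Python version of the kind of safeguard quoted above: track cumulative spending, stop at a cap, and check with the user. The Action class, ask_human(), and SpendGuard are illustrative placeholders of my own, not any particular agent framework.

```python
# Toy sketch of a spend-tracking safeguard for an agent (illustrative names only).
from dataclasses import dataclass

@dataclass
class Action:
    description: str
    cost: float  # dollars this action would spend

def ask_human(question: str) -> bool:
    """Stand-in for checking with the user; always declines in this sketch."""
    print(f"HUMAN CHECK NEEDED: {question}")
    return False

class SpendGuard:
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def approve(self, action: Action) -> bool:
        # Block and ask the human once the budget cap would be exceeded.
        if self.spent + action.cost > self.budget:
            return ask_human(f"Budget of ${self.budget} would be exceeded by "
                             f"'{action.description}'. Proceed anyway?")
        self.spent += action.cost
        return True

guard = SpendGuard(budget=100.0)
for act in [Action("buy ad slot", 60.0), Action("hire copywriter", 70.0)]:
    if guard.approve(act):
        print(f"Executing: {act.description}")
    else:
        print(f"Blocked: {act.description}")
```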

Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.

Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact so that improvements in each make the rest of the system work better and easier to improve.
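To make that executive-function/episodic-memory interaction concrete, here's a minimal Python sketch of such a loop. The llm() stub stands in for any chat-model API call, and keyword-overlap retrieval stands in for Pinecone-style vector search over saved text; all names are illustrative rather than taken from the linked post.

```python
# Minimal sketch of an LLM cognitive-architecture loop: an outer script
# (executive function) plus a naive episodic memory standing in for a vector store.

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model API."""
    return f"[model response to: {prompt[:60]}...]"

class EpisodicMemory:
    """Stores text snippets; keyword overlap stands in for embedding search."""
    def __init__(self):
        self.episodes: list[str] = []

    def store(self, text: str) -> None:
        self.episodes.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = set(query.lower().split())
        scored = sorted(self.episodes,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[:k]

def executive_loop(goal: str, steps: int = 5) -> None:
    """Outer script: retrieve memories, plan the next action, act, record."""
    memory = EpisodicMemory()
    for i in range(steps):
        context = "\n".join(memory.retrieve(goal))
        plan = llm(f"Goal: {goal}\nRelevant memories:\n{context}\n"
                   f"Propose the single next action.")
        result = llm(f"Simulate carrying out this action: {plan}")
        memory.store(f"Step {i}: planned {plan!r}; outcome {result!r}")

if __name__ == "__main__":
    executive_loop("make money selling shoes while making the world better")
```

The point of the sketch is the coupling: a better retrieve() makes the outer script's plans better, and a better outer script produces more useful episodes to store.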

 

 

  1. ^

    I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to come up with. It looked like the ethical/capitalist reasoning of a pretty intelligent person; but also a fairly ethical one.

Thank you.

What an oddly off-topic but perfect question. As it happens, that's something I've thought about a lot. Here's the old version: Capabilities and alignment of LLM cognitive architectures

And how to align it: Internal independent review for language model agent alignment

These are both older versions. I was worried about pushing capabilities at the time, but progress has been going that direction anyway, so I'm working on updated versions that are clearer.

I've been catching up on your recent work in the past couple of weeks; it seems on-target for my projected path to AGI.


At first I thought oh no, Connor is planning to Start Over and Do It Right. And people will follow him because Connor is awesome. And then we're more likely to all die, because there isn't time to start over and nobody has a plan to stop progress toward ASI on the current route.

Then I saw the section on Voyager. Great, I thought; Connor is going to make a better way to create language model agents with legible and faithful chains of thought (in structured code). Those seem like our best route to survival, since they're as alignable as we're going to get, and near the default trajectory. Conjecture's previous attempt to make LMA COEMs seemed like a good idea. Hopefully everyone else is hitting walls just as fast, but an improved technique can still beat training for complex thought in ways that obscure the chain of thought.

Then I saw implications that those steps are going to take a lot of time, and the note that of course we die if we go straight to AGI. Oh dang, back to thought #1: Connor will be building perfect AGI while somebody else rushes to get there first, and a bunch of other capable, motivated people are going to follow him instead of trying to align the AGI we are very likely going to get.

The narrow path you reference is narrow indeed. I have not heard a single plan for pause that comes to grips with how difficult it would be to enforce internationally, and the consequences of Russia and China pushing for AGI while we pause. It might be our best chance, but we haven't thought it through.

So I for one really wish Connor and Conjecture would put their considerable talents toward what seems to me like a far better "out": the possibility that foundation model-based AGI can be aligned well enough to follow instructions even without perfect alignment or a perfectly faithful chain of thought.

I realize you currently think this isn't likely to work, but I can't find a single place where this discussion is carried to its conclusion. It really looks like we simply don't know yet. All of the discussions break down into appeals to intuition and frustration with those from the "opposing camp" (of the nicest, most rational sort when they're on LW). And how likely would it have to be to beat the slim chance we can pause ASI development for long enough?

This is obviously a longer discussion, but I'll make just one brief point about why that might be more likely than many assume. You appear to be assuming (I'm sure with a good bit of logic behind it) that we need our AGI to be highly aligned for success - that if network foundation models do weird things sometimes, that will be our doom.

Making a network into real AGI that's reflective, agentic, and learns continuously introduces some new problems. But it also introduces a push toward coherence. Good humans can have nasty thoughts and not act on them or be corrupted by them. A coherent entity might need to be only 51% aligned, not the 99.99% you're shooting for. Particularly if that alignment is strictly toward following instructions, so there's corrigibility and a human in the loop.

Some of this is in Internal independent review for language model agent alignment and Instruction-following AGI is easier and more likely than value aligned AGI, but I haven't made that coherence point clearly. I think it's another Crux of disagreement on alignment difficulty that I missed in that writeup - and one that hasn't been resolved.

Edit: it seems like a strategy could split the difference by doing what you're describing, but accelerating much faster if you thought agent coherence could take care of some alignment slop.

I for one don't want to die while sticking to principles and saying I told you so when we're close to doom; I want to take our best odds of survival - which seems to include really clarifying which problems we need to solve.

Good point.

Alignment theory and AGI prediction spring to mind again; there it's not just our self-concepts at stake, but the literal fate of the world.

Answer by Seth Herd

Motivated reasoning/confirmation bias.

As Scott Alexander said in his review of Julia Galef's The Scout Mindset:

Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias.

He goes on to argue that this bias is the source of polarization in society, which is distorting our beliefs and setting us at each other's throats. How could someone believe such different things unless they're either really stupid or lying to conceal their selfishness? I think this is right, and I think it's at play even in the best rationalist communities like LessWrong. I think it's particularly powerful in difficult domains, like AGI prediction and alignment theory. When there's less real evidence, biases play a larger role.

I reached this conclusion independently while studying those and the remaining ~149 biases listed on Wikipedia at that point. You can get a little more rational by making your estimates carefully. That covers most of the biases. But the world is being destroyed by people believing what is comfortable to believe instead of what the evidence suggests. This is usually also what they already believe, so the definition of confirmation bias is highly overlapping with motivated reasoning. 

I studied the brain basis of cognitive biases for four years while funded by an IARPA program; I thought it was more worthwhile than the rest of what we were doing in cognitive neuroscience, so I kept up with it as part of my research for the remaining four years I was in the field.

I think motivated reasoning is a better conceptual term for understanding what's going on, but let's not quibble about terminology. I'm going to mostly call it motivated reasoning, MR, but you can take almost everything I'm going to say and apply it to confirmation bias, because mostly it's comfortable to keep believing what we already do. We chose to believe it partly because it was comfortable, and now it fits with all of our other beliefs, so changing it and re-evaluating the rest of our connected beliefs is uncomfortable.

Wait, you're saying: I'm a rationalist! I don't just believe what's comfortable!

Yes, that's partly true. Believing in seeking truth when it's hard does provide some resistance to motivated reasoning. A hardcore rationalist actually enjoys changing their mind sometimes. But it doesn't confer immunity. We still have emotions, and it's still more comfortable to think that we're already right because we're good rationalists who've already discerned the truth.

There are two ways confirmation bias works. One is that it's easier to think of confirming evidence than disconfirming evidence. The associative links tend to be stronger. When you're thinking of a hypothesis you tend to believe, it's easy to think of evidence that supports it. 

The stronger one is that there's a miniature Ugh field[1] surrounding thinking about evidence and arguments that would disprove a belief you care about. It only takes a flicker of a thought to make the accurate prediction about where considering that evidence could lead: admitting you were wrong, and doing a bunch of work re-evaluating all of your related beliefs. Then there's a little unconscious yuck feeling when you try to pay attention to that evidence.

This is just a consequence of how the brain estimates the value of predicted outcomes and uses that to guide its decision-making, including its micro-decisions about what to attend to. I wrote a paper reviewing all of the neuroscience behind this, Neural mechanisms of human decision-making, but it's honestly kind of crappy based on the pressure to write for a super-specialized audience, and my reluctance at the time to speed up progress on brainlike AGI. So I recommend Steve Byrnes' valence sequence over that complex mess; it perfectly describes the psychological level, and he's basing it on those brain mechanisms even though he's not directly talking about them. And he's a better writer than I am. 

Trapped priors is at least partly overlapping with confirmation bias. Or it could even just be strong priors. The issue is that everyone has seen different evidence and arguments - and we've very likely spent more time attending to evidence that supports our original hypothesis, because of the subtle push of motivated reasoning.

Motivated reasoning isn't even strictly speaking irrational. Suppose there's some belief that really doesn't make a difference in your daily life, like that there's a sky guy with a cozy afterlife, or which of two similar parties should receive your vote (which will almost never actually change any outcomes). Here the two definitions of rationality diverge: believing the truth is now at odds with doing what works. It will obviously work better to believe what your friends and neighbors believe, so you won't be in arguments with them and they'll support you more when you need it.

If we had infinite cognitive capacity, we could just believe the truth while claiming to believe whatever works. And we could keep track of all of the evidence instead of picking and choosing which to attend to.

But we don't. So motivated reasoning, confirmation bias, and the resulting tribalism (which happens when other emotions like irritation and outrage get involved in our selection of evidence and arguments) are powerful factors, even for a devoted rationalist.

The only remedy I know of is to cultivate enjoying being wrong. This involves giving up a good bit of one's self-concept as a highly intelligent individual. This gets easier if you remember that everyone else is also doing their thinking with a monkey brain that can barely chin itself on rationality.

Thanks for asking this question; it's a very smart question to ask. And I've been meaning to write about this on LW and haven't prioritized doing a proper job, so it's nice to have an excuse to do a brief writeup.

 

  1. ^

    See also Defeating Ugh Fields In Practice for some interesting and useful review.

Yes, the math crowd is saying something like "give us a hundred years and we can do it!". And nobody is going to give them that in the world we live in.

Fortunately, math isn't the best tool to solve alignment. Foundation models are already trained to follow instructions given in natural language. If we make sure this is the dominant factor in foundation model agents, and use it carefully (don't say dumb things like "go solve cancer, don't bug me with the hows and whys, just git er done as you see fit", etc.), this could work.

We can probably achieve technical intent alignment if we're even modestly careful and pay a modest alignment tax. You've now read my other posts making those arguments.

Unfortunately, it's not even clear the relevant actors are willing to be reasonably cautious or pay a modest alignment tax.

The other threads are addressed in responses to your comments on my linked posts.

We can now see some progress with o1 and similar models. They are doing some training of the "outer loop" (to the limited extent they have one) with RL, but r1 and QwQ still produce very legible CoTs.

So far.

See also my clarification on how an opaque CoT would still allow some internal review, but probably not an independent one, in this other comment.

See also Daniel Kokotajlo's recent work on a "Shoggoth/Face" system that maintains legibility, and his other thinking on this topic. Maintaining legibility seems quite possible, but it does bear an alignment tax. This could be as low as a small fraction if the CoT largely works well when it's condensed to language. I think it will; language is made for condensing complex concepts in order to clarify and communicate thinking (including communicating it to future selves to carry on with).

It won't be perfect, so there will be an alignment tax to be paid. But understanding what your model is thinking is very useful for developing further capabilities as well as for safety, so I think people may actually implement it if the tax turns out to be modest, maybe something like 50% greater compute during training and similar during inference.

On RSI, see The alignment stability problem and my response to your comment on Instruction-following AGI...

WRT true value alignment, I agree that this is just a stepping stone to that better sort of alignment. See Intent alignment as a stepping-stone to value alignment.

I agree that including non-linguistic channels is going to be a strong temptation. Language does nicely summarize most of our really abstract thought, so I don't think it's necessary. But there are many training practices that would destroy the legible chain of thought needed for external review. See the case for CoT unfaithfulness is overstated for the inverse.

Legible CoT is actually not necessary for internal action review. You do need to be able to parse what the action is for another model to predict and review its likely consequences. And it works far better to review things at a plan level rather than action-by-action, so the legible CoT is very useful. But if the system is still trained to respond to prompts, you could still use the scripted internal review no matter how opaque the internal representations had become. But you couldn't really make that review independent if you didn't have a way to summarize the plan so it could be passed to another model, like you can with language.
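As a rough illustration of that scripted review pattern, here's a Python sketch in which the agent summarizes its plan in language and a separate reviewer model predicts consequences and can veto before execution. actor_model(), reviewer_model(), and reviewed_step() are hypothetical stand-ins of mine, not code from the linked posts.

```python
# Hedged sketch of scripted plan review: a legible plan summary is handed to
# a separate reviewer model, which is what makes the review *independent*.

def actor_model(prompt: str) -> str:
    """Stand-in for the agent's own model."""
    return f"[actor output for: {prompt[:50]}...]"

def reviewer_model(prompt: str) -> str:
    """Stand-in for an independent model with no stake in the plan."""
    return "APPROVE"  # or "VETO: <predicted harm>"

def reviewed_step(plan: str) -> str:
    # Summarize the plan in plain language so it can be passed to another model.
    summary = actor_model(f"Summarize this plan in plain language: {plan}")
    verdict = reviewer_model(
        f"Predict the consequences of this plan and answer APPROVE or "
        f"VETO with a reason:\n{summary}")
    if verdict.startswith("VETO"):
        return f"Plan blocked by review: {verdict}"
    return actor_model(f"Execute plan: {plan}")

print(reviewed_step("spend $500 on an ad campaign for the shoe store"))
```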

 

BTW, your comment was accidentally formatted as a quote along with the bit you meant to quote from the post. Correcting that would make it easier for others to parse, but it was clear to me.

I guess I didn't address RSI in enough detail. The general idea is to have a human in the loop during RSI, and to talk extensively with the current version of your AGI about how this next improvement could disrupt its alignment before you launch it.

WRT "I don't want his attempted in any light-cone I inhabit", well, neither do I. But we're not in charge of the light cone.

All we can do is convince the people who currently very much ARE on the road to attempting exactly this to not do it - and saying "it's way too risky and I refuse to think about how you might actually pull it off" is not going to do that.

Or else we can try to make it work if it is attempted.

Both paths to survival involve thinking carefully about how alignment could succeed or fail on our current trajectory.
