Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.

If you're new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  

More on approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
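
As a toy illustration (my own sketch, not an implementation of any particular system), the basic loop of such a language model cognitive architecture might look something like the following. Here `call_llm` is a hypothetical stand-in for whatever chat-model API is used, and the memory and planning pieces are deliberately minimal.

```python
# Minimal sketch of a "language model cognitive architecture":
# an LLM call wrapped with an episodic memory store and a simple
# executive loop that plans before acting. Illustrative only.

from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real language-model API call."""
    return f"[model output for: {prompt[:40]}...]"


@dataclass
class EpisodicMemory:
    """Stores past episodes so the agent can draw on its own experience."""
    episodes: list[str] = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Toy retrieval: the k most recent episodes mentioning the query.
        hits = [e for e in self.episodes if query.lower() in e.lower()]
        return hits[-k:]


@dataclass
class LMCognitiveArchitecture:
    """LLM + episodic memory + an executive planning loop."""
    memory: EpisodicMemory = field(default_factory=EpisodicMemory)

    def step(self, goal: str, observation: str) -> str:
        context = "\n".join(self.memory.recall(goal))
        # "Executive function": plan explicitly before acting, conditioned
        # on the goal, relevant memories, and the current observation.
        plan = call_llm(
            f"Goal: {goal}\nRelevant memories:\n{context}\n"
            f"Observation: {observation}\nPlan the next action."
        )
        action = call_llm(f"Plan: {plan}\nGive the single next action.")
        self.memory.store(f"Obs: {observation} | Plan: {plan} | Act: {action}")
        return action
```

A real system would of course need much richer retrieval, metacognition, and learning, but even this skeleton shows where episodic memory and executive function slot in around the LLM.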

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'd continue using the alignment target developers currently use: instruction-following. It's counterintuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

There are significant problems to be solved in prioritizing instructions; for instance, an agent would need to prioritize more recent instructions over previous ones, including hypothetical future instructions.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to attract enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), the chance that we don't survive long-term as a species, is in the 50% range: too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments

The important thing for alignment work isn't the median prediction; if we only had an alignment solution by that median date, we'd still have a 50% chance of AGI arriving before it, and of dying from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think Hofstadter's Law applies to AGI development even if not to compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's distributions are much wider.

There are a lot of merits to avoiding unnecessary premises when they might be wrong.

There are also a lot of merits for reasoning from premises when they allow more progress, and they're likely to be correct. That is, of course, what I'm trying to do here.

Which of these factors is larger has to be evaluated on the specific instances. There's lots more to be said about those in this case, but I don't have time to dig into it now, and it's worth a full post and discussion.

Yes, I think it does make sense to think of this as a continuum, something I haven't emphasized to date. There's also at least one more dimension: how many (and which) humans you're trying to align to. There's a little more on this in Conflating value alignment and intent alignment is causing confusion.

IF is definitely an attempt to sidestep the difficulties of value alignment, at least partially and temporarily.

What we want from an instruction-following system is exactly what you say: one that does what we mean, not what we say. And getting that perfectly right would demand a perfect understanding of our values. BUT it's much more fault-tolerant than a value-aligned system. The Principal can specify what they mean as much as they want, and the AI can ask for clarification as much as it thinks it needs to - or in accord with the Principal's previous instructions to "check carefully about what I meant before doing anything I might hate" or similar.

If done correctly, value alignment would solve the corrigibility problem. But that seems far harder than using corrigibility, in the form of instruction-following, to solve the value alignment problem.

I am definitely thinking of IF as it applies to systems with capability for unlimited autonomy. Intent alignment as a concept doesn't end at some level of capability - although I think we often assume it would.

How it would understand "the right thing" is the question. But yes, intent alignment as I'm thinking of it does scale smoothly into value alignment plus corrigibility if you can get it right enough.

(A small rant, sorry) In general, it seems you're massively overanchored on current AI technology, to an extent that it's stopping you from clearly reasoning about future technology.

You are right that I am addressing AGI with a lot of similarities to LLMs. This is done in the interest of reasoning clearly about future technologies. I think good reasoning is a mix of predicting the most likely forms of AGI and reasoning more broadly. Perhaps I didn't make it clear enough in the post that I'm primarily addressing LLM-based AGI. Much of my alignment work (informed by my systems neuroscience work) is on routes from LLMs to AGI. In this theory, LLMs/foundation models are expanded (by adding memory systems and training/scaffolding them for better metacognition) into loosely brainlike cognitive architectures. In those posts I elaborate reasons to think such scaffolded LLMs may soon be "real AGI" in the sense of reasoning and learning about any topic, including themselves and their own cognition and goals (although that sort of AGI wouldn't be dramatically superhuman in any area, and would initially be subhuman in some capabilities).

If you have an alternate theory of the likely form of first takeover-capable AGI, I'd love to hear it! It's good to reason broadly where possible, and I think a lot of the concerns are general to any AGI at all or any network-based AGI. But constraining alignment work to address specific likely types of AGI lets us reason much more specifically, which is a lot more useful in the worlds where that type of AGI is what we really are faced with aligning. 

You're talking about AGI here. An agent capable of autonomously doing research, playing games with clever adversaries, detecting and patching its own biases, etc. It should be obvious that you can't use current LLM flaws as a method of extrapolating the adversarial robustness of this program.

Yes, good point. I didn't elaborate here, but I do think there's a good chance that the more coherent, intelligent, and introspective nature of any real AGI might make jailbreaking a non-issue.  But jailbreaking might still be an issue, because the core thought generator in this scenario is an advanced LLM.

No. You're entirely ignoring inner alignment difficulties. The main difficulties. There are several degrees of freedom in goal specification that a pure goal target fails to nail down. These lead to an unpredictable reflective equilibrium.

Yes, I am entirely ignoring inner alignment difficulties. I thought I'd made that clear by saying earlier:

There are substantial technical challenges [including] problems with specifying goals well enough that they won’t be misgeneralized or misinterpreted (from the designer’s perspective). There are serious implementational problems for any alignment target. [...] Here I am leaving aside the more general challenges, and addressing only those that are specifically relevant to instruction-following (and in some cases corrigibility) as an alignment target.

I didn't use the term "inner alignment" because I don't find it intuitive or clarifying; there isn't a clear division between inner and outer, and they feel like jargon. So I use misgeneralization, which I feel encompasses inner misalignment as well as other (IMO more urgent) concerns. Maybe I should get on board and use inner and outer alignment just to speak the lingua franca of the realm.

Thanks, that's all relevant and useful!

Simplest first: I definitely envision a hierarchy of reporting and reviewing questionable requests. That seems like an obvious and cheap route to partly address the jailbreaking/misuse issues.

I've also envisioned smarter LLM agents "thinking through" the possible harms of their actions, and you're right that does need at least a pretty good grasp on human values. Their grasp on human values is pretty good and likely to get better, as you say. I haven't thought of this as value alignment, though, because I've assumed that developers will try to weigh following instructions over adhering to their moral strictures. But you're right that devs might try a mix of the two, instead of rules-based refusal training. And the value alignment part could become dominant whether by accident or on purpose.

I haven't worked through attempting to align powerful LLM agents to human values in as much detail as I've tried to think through IF alignment. It's seemed like losing corrigibility is a big enough downside that devs would try to keep IF dominant over any sort of values-based rules or training. But maybe not, and I'm not positive that value alignment couldn't work. Specifying values in language (a type of Goals selected from learned knowledge: an alternative to RL alignment) has some nontrivial technical challenges but seems much more viable than just training for behavior that looks ethical. (Here it's worth mentioning Steve Byrnes' non-behaviorist RL, in which some measure of representations is part of the reward model, like trying to reward the model only when it's thinking about doing good for a human.)

Language in the way we use it is designed to generalize well, so language might accurately convey goals in terms of human values for full value alignment. But you'd have to really watch for unintended but sensible interpretations - actually your sequence AI, Alignment, and Ethics (esp. 3-5) is the most thorough treatment I know of why something like "be good to everyone" or slightly more careful statements like "give all sentient beings as much empowerment as you can" will likely go very differently than the designer intended - even if the technical implementation goes perfectly!

I don't think this goes well by default, but I'm not sure an LLM architecture given goals in language very carefully, and designed very carefully to "want" to follow them, couldn't pull off value alignment.

I also think someone might be tempted to try it pretty quickly after developing AGI. This might happen if it seemed like proliferation of IF AGI was likely to lead to disastrous misuse, so a value-aligned recursively self-improving sovereign was our best shot. Or it might be tried based on worse or less altruistic logic.

Anyway, thanks for your thoughts on the topic.

I am sorry for your loss. Death is natural but it is so, so bad.

I'm assuming you're posting this here in part to foster a discussion of this tension between preventing deaths in the short term and taking more risks on killing everyone by getting alignment wrong if we rush toward AGI.

This tension is probably going to become more widespread as the concept of AGI becomes more prominent. Many people will want faster progress, in hopes of saving themselves or their loved ones. Longtermism is pretty dominant here on LW, but it is very much a minority view in society at large. Thus, this urge to rush will have to be countered by spreading an awareness of how rushing toward AGI improves the odds of survival for older people while risking the lives of their children and grandchildren. And all of the glorious generations to follow - while most people aren't longtermist, the idea of unimaginable flourishing does hold some weight in their minds, and it isn't that hard to imagine in broad form.

I face this dilemma myself. At 50 and in imperfect health, my likely end falls somewhere in the middle of my predicted range of hitting longevity takeoff. Any small speedup or slowdown might shift my odds substantially. I don't know what I'd do if I had real power over our rate of progress, but I don't. So I'll continue advocating that we slow down as much as we can, while also working as fast as we can to align our first AGI/ASIs. That speed will improve our odds of collective survival in the likely case that we can't slow down substantially. And it might even save a few more of the precious unique minds now alive.

I found this super useful, thank you so much! Relating the process of doing good work to mindfulness meditation clicks for me, and I'll be trying it. Gently putting attention back on-topic and practicing noticing when it's strayed is mindfulness, and it's addressing a key challenge in getting good work done. Treating the rest of the challenges in doing research as opportunities for learning about yourself and growing - if you pay attention to them - also makes sense.

I'm a bit suspicious of meditation as an end in itself. But getting better at research (or at relationships, another suggested focus for mindfulness and growth) is a worthy application!

I wish I could attend the retreat; it would be good to really practice this attitude, and your offered rate is great. But the French Pyrenees is a long, expensive trip from where I'm at. So I'll try applying my own system of habit modification to this goal, and let you know how it goes.

Why do you ask? This is a somewhat interesting question, but I don't usually spend time on it. I think alignment/AI thinkers don't think about it much because we're usually more concerned with getting an AGI to reliably pursue any target. If we got it to actually have humanity's happiness as its goal, in the way we meant it and would like it, we'd just see what it does and enjoy the result. But getting it to reliably do anything at all is one problem, and making that thing something we actually want is another huge problem. See A case for AI alignment being difficult for a well-written intro on why most of us think alignment is at least fairly hard.

I don't spend much time thinking about different specific value alignment targets because I think we should first focus on how to achieve any of them. I couldn't see exactly what the World Values Survey was from that link at a quick glance, but I'm not sure the details matter. It would probably produce a vastly better future than a value target like "solve hard problems" or "make me a lot of money" would; there are probably more future-proofed targets that would be even better; but steering away from the worst and toward the better is my primary goal right now, because I don't think we have that in hand at all.
