Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers make it able to "think for itself" in all the ways that make humans capable and dangerous.

If you're new to alignment, see the Research Overview section below. Field veterans who are curious about my particular take and approach should see the More on My Approach section at the end of the profile.

Important posts:

  • On LLM-based agents as a route to takeover-capable AGI:
    • LLM AGI will have memory, and memory changes alignment
    • Brief argument for short timelines being quite possible
    • Capabilities and alignment of LLM cognitive architectures
      • Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
  • AGI risk interactions with societal power structures and incentives:
    • Whether governments will control AGI is important and neglected
    • If we solve alignment, do we die anyway?
      • Risks of proliferating human-controlled AGI
    • Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
  • On the psychology of alignment as a field:
    • Cruxes of disagreement on alignment difficulty
    • Motivated reasoning/confirmation bias as the most important cognitive bias
  • On technical alignment of LLM-based AGI agents:
    • System 2 Alignment on how developers will try to align LLM agent AGI
    • Seven sources of goals in LLM agents, a brief problem statement
    • Internal independent review for language model agent alignment
  • On AGI alignment targets assuming technical alignment:
    • Problems with instruction-following as an alignment target
    • Instruction-following AGI is easier and more likely than value aligned AGI
    • Goals selected from learned knowledge: an alternative to RL alignment
  • On communicating AGI risks:
    • Anthropomorphizing AI might be good, actually
    • Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term
    • AI scares and changing public beliefs

 

Research Overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 
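To make that picture a bit more concrete, here is one toy way the human-in-the-loop, instruction-following idea could be sketched in code. It's an illustration only, not a real design or anything I'm proposing to build; `propose_action`, `ask_human`, and `execute` are hypothetical placeholders.

```python
# Toy sketch of instruction-following with human-in-the-loop error correction.
# Every name here is a hypothetical placeholder, not a real system.

def corrigible_loop(propose_action, execute, ask_human, instructions):
    """Plan toward the current instructions, check consequential steps with a
    human, and fold any correction back into the instruction list."""
    while True:
        action, high_impact = propose_action(instructions)  # e.g., an LLM planner
        if action is None:                                   # nothing left to do
            break
        if high_impact:                                      # flag consequential steps
            correction = ask_human(f"About to do: {action}. Any correction?")
            if correction:                                   # human stays in the loop
                instructions.append(correction)
                continue                                     # re-plan with the new instruction
        execute(action)
```

Everything interesting is of course hidden inside how "high impact" gets judged and how faithfully the planner actually follows the instruction list; the sketch just shows where the human sits in the loop.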

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.  

More on My Approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
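Schematically, such a cognitive architecture is just an LLM wrapped in episodic memory and an executive loop. The sketch below is purely illustrative: `call_llm`, `EpisodicMemory`, and the naive keyword retrieval are stand-ins invented for the example, not any particular system.

```python
# Schematic sketch of a language model cognitive architecture: a plain LLM
# plus episodic memory and a simple executive loop. All names are invented
# stand-ins for illustration.

class EpisodicMemory:
    def __init__(self):
        self.episodes = []

    def store(self, episode: str):
        self.episodes.append(episode)

    def recall(self, query: str, k: int = 3):
        # Placeholder retrieval; a real system would use embedding search.
        words = query.lower().split()
        return [e for e in self.episodes if any(w in e.lower() for w in words)][:k]

def executive_step(call_llm, memory: EpisodicMemory, goal: str) -> str:
    """One cycle of goal-directed control: recall relevant experience,
    have the LLM plan the next action, and record what happened."""
    context = memory.recall(goal)
    action = call_llm(f"Goal: {goal}\nRelevant episodes: {context}\nNext action:")
    memory.store(f"goal={goal}; action={action}")
    return action
```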

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll continue with the alignment target developers currently use: instruction-following. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping itself aligned as it grows smarter.

There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions. 
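As a minimal illustration of just the recency part of that rule (the `Instruction` format below is invented for the example, not a proposal):

```python
# Minimal illustration of the recency rule above: among conflicting
# instructions from authorized principals, the newest one wins.

from dataclasses import dataclass

@dataclass
class Instruction:
    text: str
    timestamp: float   # e.g., seconds since epoch
    authorized: bool   # issued by someone allowed to instruct the agent

def governing_instruction(conflicting: list[Instruction]) -> Instruction | None:
    """Return the most recent authorized instruction from a conflicting set."""
    candidates = [i for i in conflicting if i.authorized]
    return max(candidates, key=lambda i: i.timestamp, default=None)
```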

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments
AI Timelines
Seth Herd · 2y

The important thing for alignment work isn't the median prediction; if we only had an alignment solution by the median date, we'd have a 50% chance of dying for lack of one before then.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development, if not to compute progress.

Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's distributions are much wider.

Incommensurability
Seth Herd · 12h

Seems like the king could still understand with a decent explanation, particularly if he bothers to ask about the effects before using it.

If anyone builds it, everyone will plausibly be fine
Seth Herd · 4d

Agreed; the alignment plan sketched here skips over why alignment for a merely-human agent should be a lot easier than for a superhuman one, or why instruction-following should be easier than value alignment. I think both are probably true, but to a limited and uncertain degree. See my other comment here for more.

If anyone builds it, everyone will plausibly be fine
Seth Herd · 4d

I agree with your claim as stated; 98% is overconfident.

I have in the past placed a good bit of hope on the basin-of-alignment idea, although my hopes were importantly different in that they start below the human level. The human level is exactly when you get large context shifts like "oh hey, maybe I could escape and become god-emperor... I don't have human limitations. If I could, maybe I should? Not even thinking about it would be foolish..."

Working through the logic made me a good bit more pessimistic. I just wrote a post on why I made that shift: LLM AGI may reason about its goals and discover misalignments by default.

And that was on top of my previous recognition that my scheme of instruction-following, laid out in Instruction-following AGI is easier and more likely than value aligned AGI, has problems I hadn't grappled with (even though I'd gone into some depth): Problems with instruction-following as an alignment target.

Could this basin of instruction-following still work? Sure! Maybe!

Is it likely enough by default that we should be pressing full speed ahead while barely thinking about that approach? No, obviously not! Pretty much nobody will say "oh it's only a 50% chance of everyone dying? Well then by all means let's rush right ahead with no more resources for safety work!"

That's basically why I think MIRI's strategy is sound, or at least well thought out. The expert pushback to their 98% will be along the lines of "that's far overconfident! Why, it's only [90%-10%] likely!" That is not reassuring enough for most people who care whether they or their kids get to live. (And I expect really well-thought-out estimates will not be near the lower end of that range.)

The point MIRI is making is that expert estimates go as high as 98% plus. That's their real opinion; they know the counterarguments. 

I do think EY is far overconfident, and this does create a real problem for anyone who adopts his estimate. They will want to work on a pause INSTEAD of working on alignment, which I think is a severe tactical error given our current state of uncertainty. But for practical purposes, I doubt enough people will go that high, so it won't create a problem of neglecting other possible solutions; instead it will create a few people who are pretty passionate about working for shutdown, and that's probably a good thing.

I find it reasonably likely that the basin of instruction-following alignment you describe won't work by default (the race dynamics and motivated reasoning play a large role), but that modest improvements in our understanding, the general water level of concern, and/or the race incentives themselves might be enough to make it work. So efforts in those directions are probably highly useful.

I think this discussion about the situation we're actually in is a very useful side-effect of their publicity efforts on that book. Big projects don't often succeed on the first try without a lot of planning. And to me the planning around alignment looks concerningly lacking. But there's time to improve it, even in the uncomfortably possible case of short timelines!

Does My Appearance Primarily Matter for a Romantic Partner?
Seth Herd · 4d

Oh, and I should emphasize that if you stay in shape, most of being physically attractive is handled. Nice work!

Some cultures and people care about fashionable fancy clothes and gear more than others. I like to avoid the ones that are most materialistic.

Does My Appearance Primarily Matter for a Romantic Partner?
Seth Herd · 4d

I have ideas, as a fellow cheapskate!

Get a nice sun hat; Sun Day and Real Deal and Barmah all have snazzy-looking widebrims (with the critical wire brim for shaping). I've been hat shopping for years even though I don't wear them much and still don't appear to be. These are barely more expensive than the "practical" shapeless outdoor gear hats, and they work as well by most standards.

Black gaffers tape works better for most purposes than duct tape, and it hides rather than advertises your DIY stylings.

The issue of looking like a cheapskate with repaired or worn gear and clothes is separate. Having near-new used gear and clothes instead of near-dead is only a little more expensive and does send a different message about your interest and ability to have money and take care of yourself.

Beanie in pocket is a dilemma I've faced. I unzip my jacket sometimes but that one I don't have a good solution for. Maybe in a bag? A snazzy or unique bag instead of a battered and dirty one is another cheap way to improve your curb appeal.

All in all I think you're doing great to just spend a bit of time thinking about this stuff. It's been useful for me to review it too. There's low-hanging fruit. It's not all or none.

A Thoughtful Defense of AI Writing
Seth Herd · 4d

I think sophistry was originally a philosophical tradition that was heavily criticized for focusing on style over substance.

Interesting points on the not-but form; I'll try to try it!

Does My Appearance Primarily Matter for a Romantic Partner?
Seth Herd · 4d

I hear you, but I'm not sure there's much of a tradeoff, even if appearance doesn't matter much for you outside of a romantic partner.

Doing small amounts of it is super easy. It doesn't require expensive clothes, for instance; cheap knockoffs of things people like, or used versions, work fine. I guess being physically attractive in the important sense of being physically fit is a large amount of effort, but that mostly pays dividends by improving your mood and energy levels in proportion to the effort. It's hard to do and I struggle to make it a habit, but in theory it's pretty likely a net win to spend a little time on exercise and a little discomfort and mental effort on not eating too much.

Does My Appearance Primarily Matter for a Romantic Partner?
Seth Herd · 4d

Thank you, corrected! I must've grabbed it from where Steve discussed it in his excellent Social status posts and misremembered it.

Shortform
Seth Herd · 4d

I like your emphasis on good research. I agree that the best current research does probably trade 1:1 with crunch time.

I think we should apply the same qualification to theoretical research. Well-directed theory is highly useful; poorly-directed theory is almost useless in expectation.

I think theory directed specifically at LLM-based takeover-capable systems is neglected, possibly in part because empiricists focused on LLMs distrust theory, while theorists tend to dislike messy LLMs.

Posts
  • Seth Herd's Shortform (6 karma, 2y, 66 comments)
  • LLM AGI may reason about its goals and discover misalignments by default (Ω, 51 karma, 7d, 5 comments)
  • Problems with instruction-following as an alignment target (Ω, 49 karma, 4mo, 14 comments)
  • Anthropomorphizing AI might be good, actually (35 karma, 5mo, 6 comments)
  • LLM AGI will have memory, and memory changes alignment (73 karma, 6mo, 15 comments)
  • Whether governments will control AGI is important and neglected (28 karma, 6mo, 2 comments)
  • Will LLM agents become the first takeover-capable AGIs? (Q, 37 karma, 7mo, 10 comments)
  • OpenAI releases GPT-4.5 (34 karma, 7mo, 12 comments)
  • System 2 Alignment (35 karma, 7mo, 0 comments)
  • Seven sources of goals in LLM agents (23 karma, 7mo, 3 comments)
  • OpenAI releases deep research agent (78 karma, 8mo, 21 comments)
Wikitag Contributions
  • Guide to the LessWrong Editor (2 months ago)
  • Guide to the LessWrong Editor (2 months ago, +91)
  • Outer Alignment (5 months ago, +54/-21)
  • Outer Alignment (5 months ago, +81/-187)
  • Outer Alignment (5 months ago, +9/-8)
  • Outer Alignment (5 months ago, +94/-150)
  • Outer Alignment (5 months ago, +1096/-13)
  • Language model cognitive architecture (a year ago, +223)
  • Corrigibility (2 years ago, +472)