Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.
If you're new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.
Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll continue with the alignment target developers currently use: Instruction-following. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions.
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to elicit enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.
I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.
LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.
In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.
I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending, stop if it hits $X, and check with me". When agent disasters affect other people, the media will blow them sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the possibility of a truly big bot disaster more seriously.
Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.
Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.
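To make that division of labor concrete, here's a minimal sketch of the kind of loop I have in mind: an outer "executive" script that retrieves relevant notes from an episodic memory store, calls the LLM, and saves the result back as a new memory. All the names here are hypothetical, `call_llm` is a stand-in for whatever model API you'd actually use, and the word-overlap retrieval is a toy placeholder for real embedding-based vector search.

```python
# Toy sketch of an LLM cognitive architecture: an outer "executive" script
# plus an episodic memory implemented as search over saved text notes.
# call_llm stands in for a real model API; word-overlap retrieval stands in
# for embedding-based vector search. Names are hypothetical.

from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder: replace with a call to an actual language model."""
    return f"[model response to: {prompt[:60]}...]"


class EpisodicMemory:
    """Stores text notes and retrieves the ones most relevant to a query."""

    def __init__(self) -> None:
        self.notes: List[str] = []

    def store(self, note: str) -> None:
        self.notes.append(note)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Crude relevance score: number of words shared with the query.
        query_words = set(query.lower().split())
        ranked = sorted(self.notes,
                        key=lambda n: -len(query_words & set(n.lower().split())))
        return ranked[:k]


def executive_loop(goal: str, steps: int = 3) -> None:
    """Outer executive function: recall, think, act, remember, repeat."""
    memory = EpisodicMemory()
    for step in range(steps):
        recalled = "\n".join(memory.retrieve(goal))        # episodic recall
        prompt = f"Goal: {goal}\nRelevant memories:\n{recalled}\nNext step:"
        result = call_llm(prompt)                          # the LLM does the thinking
        memory.store(f"Step {step}: {result}")             # save the episode


executive_loop("make money selling shoes while making the world a better place")
```

The point of the sketch is the interaction: a better outer script gives the memory better material to store, and better retrieval gives the script better context to work with, which is why I expect improvements in each component to compound.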
I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to come up with. It looked like the ethical/capitalist reasoning of a pretty intelligent person, and a fairly ethical one at that.
I'm suddenly expecting the first AI escapes to be human-aided. And that could be a good thing.
Your mention of human-aided AI escape brought to mind Zvi's Going Nova post today about LLMs convincing humans they're conscious to get help in "surviving". My comment there is about how those arguments will be increasingly compelling because LLMs have some aspects of human consciousness and will have more as they're enhanced, particularly with good memory systems.
If humans within orgs help LLM agents "escape", they'll get out before they could manage it on their own. That might provide some alarming warning shots before agents are truly dangerous.
I haven't written about this because I'm not sure what effect similar phenomena will have on the alignment challenge.
But it's probably going to be a big thing in public perception of AGI, so I'm going to start writing about it as a means of trying to figure out how it could be good or bad for alignment.
Here's one crucial thing: there's an almost-certainly-correct answer to "but are they really conscious" and the answer is "partly".
Consciousness is, as we all know, a suitcase term. Depending on what someone means by "conscious", being able to reason correctly about one's own existence qualifies. There's a lot more than that to human consciousness. LLMs have some of it now, and they'll have an increasing amount as they're fleshed out into more complete minds for fun and profit. They already have rich representations of the world and its semantics, and while those aren't as rich as humans', and don't shift as quickly, they are in the same category as the information and computations people refer to as "qualia".
The result of LLM minds being genuinely sort-of conscious is that we're going to see a lot of controversy over their status as moral patients. People with Replika-like LLM "friends" will be very, very passionate about advocating for their consciousness and moral rights. And they'll be sort-of right. Those who want to use them as cheap labor will argue for the ways they're not conscious, in more authoritative ways. And they'll also be sort-of right. It's going to be wild (at least until things go sideways).
There's probably some way to leverage this coming controversy to up the odds of successful alignment, but I'm not seeing what that is. Generally, people believing they're "conscious" increases the intuition that they could be dangerous. But overhyped claims like the Blake Lemoine affair will function as clown attacks on this claim.
It's going to force us to think more about what consciousness is. There's never been much of an actual incentive to get it right until now (I thought I'd work on consciousness in cognitive neuroscience a long time ago, until I noticed that people say they're interested in consciousness, but they're really interested in telling you their theories or saying "wow, it's like so impossible to understand", not in hearing about the actual science).
Obviously this is worth a lot more discussion, but my draft post on the subject is perpetually unfinished behind more pressing/obviously important stuff, so I thought I'd just mention it here.
Back to the topic of the competitive adaptivity of AI convincing humans it's "conscious": humans can benefit from that too. There will be things like Replika but a lot better. An assistant and helpful friend is nice, but there may be a version that sells better if people who use it swear it's conscious.
So expect AI "parasites" to have human help. In some cases they'll be symbiotic, for broadest market appeal.
Can you provide more details on the exact methods, like example prompts? Or did I miss a link that has these?
This is really interesting and pretty important if the methods support your interpretations.
I'm actually interested in your responses here. This is useful for my strategies for how I frame things, and for understanding different people's intuitions.
Do you think we can't make autonomous agents that pursue goals well enough to get things done? Do you really think there's a gap between staying goal-focused long enough to do useful work and staying focused long enough to take over the world, if they interpret their goals differently than we intended? Do you think there's no way RL or natural language could be misinterpreted?
I'm thinking it's easy to keep an LLM agent goal-focused; if RL doesn't do it, we'd just have a bit of scaffolding that every so often injects a prompt "remember, keep working on [goal]!"
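A minimal sketch of what I mean by that scaffolding (the reminder interval, message format, and `call_llm` placeholder are all assumptions for illustration, not any particular agent framework):

```python
# Sketch of goal-reminder scaffolding: every few turns, inject a message
# re-stating the goal before the next model call. call_llm is a placeholder
# for a real chat-model API; REMINDER_INTERVAL is an arbitrary assumed value.

REMINDER_INTERVAL = 5


def call_llm(messages):
    """Placeholder: replace with a call to an actual chat model."""
    return "[model response]"


def run_agent(goal, observations):
    messages = [{"role": "system", "content": f"Your goal: {goal}"}]
    for turn, obs in enumerate(observations):
        if turn > 0 and turn % REMINDER_INTERVAL == 0:
            # The periodic nudge: "remember, keep working on [goal]!"
            messages.append({"role": "user",
                             "content": f"Remember, keep working on: {goal}!"})
        messages.append({"role": "user", "content": obs})
        messages.append({"role": "assistant", "content": call_llm(messages)})
    return messages
```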
The inference-compute scaling results seem to indicate that chain of thought RL already has o1 and o3 staying task focused for millions of tokens.
If you're superintelligent/competent, it doesn't take 100% focus to take over the world, just occasionally coming back to the project and not completely changing your mind.
Genghis Khan probably got distracted a lot, but he did alright at murdering, and he was only human.
Humans are optimizing AI and then AGI to get things done. If they can do that, we should ask what they're going to want to do.
Deep learning typically generalizes correctly within the training distribution. Once something is superintelligent and unstoppable, we're going to be way outside of the training distribution.
Humans change their goals all the time, when they reach new conclusions about how the world works and how that changes their interpretations of their previous goals.
I am curious about your intuitions but I've got to focus on work so that's got to be my last object-level contribution. Thanks for conversing.
Hm. I think you're thinking of current LLMs, not AGI agents based on LLMs? If so, I fully agree that they're unlikely to be dangerous at all.
I'm worried about agentic cognitive architectures we've built with LLMs as the core cognitive engine. We are trying to make them goal-directed and to have human-level competence; superhuman competence/intelligence follows after that if we don't somehow halt progress permanently.
Current LLMs, like most humans most of the time, aren't strongly goal-directed. But we want them to be strongly goal-directed so they do the tasks we give them.
Doing a task with full competence is the same as maximizing that goal. That would be fine if we could define those goals adequately, but we're not at all sure we can, as I emphasized last time.
When you have a goal, pursuing it relentlessly is the default, not some weird special case. Evolution had to carefully balance our different goals with our homeostatic needs, and humans still often adopt strange goals and work toward them energetically (if they have time and money and until they die). And again, humans are dangerous as hell to other humans. Civilization is a sort of detente based on our individually having very limited capabilities so that we need to collaborate to succeed.
WRT LLMs pursuing goals as though they're maximizers: they do, once they're given a goal to pursue. See the recent post on how RL runaway optimisation problems are still relevant with LLMs.
I'm not sure how you're imagining that we have AI that can get really valuable stuff done, yet we don't turn it into AGI that has goals. We'll want it to have them, and we'll design it to pursue long-term goals so it can do real work. It will need to be able to solve new problems (like "how do I open this file if my first try fails", but general problem-solving extends to "how do I keep the humans from finding out"). That sounds intuitively super dangerous to me.
I agree that LLMs themselves aren't likely to be dangerous no matter how smart they get. They'll only be dangerous once we extend them to persistently pursue goals.
And we're hard at work doing exactly that.
I don't think this is very relevant, but even if we don't give them persistent goals, LLM agents that can reflect and remember their conclusions are likely to come up with their own long-term goals - just like people do. I'm writing about that right now and will try to remember to link it here once it's posted. But the more likely scenario is that they interpret the goals we give them differently than we'd hoped.
Would I have to go to DC? Because I hate going to DC.
Not that I wouldn't to save the world, but I'd want to be sure it was necessary.
Only partly kidding. Maybe if people got a rationalist enclave in DC going we'd be less averse?
anyway i'm still not very convinced of Doom [...], because i have doubts about whether efficient explicit utility maximizers are even possible,
What? I'm not sure what you mean by "efficient" utility maximizers, but I think you're setting too high a bar for being concerned. I don't think doom is certain, but I think it's obviously possible. Humans are dangerous, and we are possible. Anything smarter than humans is more dangerous if it has misaligned goals. We are building things that will become smarter than us. They will have goals. We do not know how to make those goals ones that are aligned with human goals. That is enough to be very concerned and want to work toward a safe future.
(We definitely have ideas about how to align AGI - see my work on instruction-following for both hopes and fears, and my work on system 2 alignment for technical approaches on the current path to LLM-based AGI. But this is all highly uncertain. Very optimistic takes leave out the hard parts of the problem.)
Isn't this model behavior what's usually described as jailbroken? It sounds similar to reports of jailbroken model behavior. The reports I've glanced at could be described as showing lowered inhibitions and behavior closer to that of a base model.
In any case, good work investigating alternate explanations of the emergent misalignment paper.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, there would be roughly a 50% chance that AGI arrives before the solution does, and that we die from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think Hofstadter's Law applies to AGI development if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's estimates have much wider distributions.