My primary email is seth dot herd at gee mail dot com (or message me here where I won't miss it amongst spam).
In short, I'm applying my conclusions from 23 years of research in the computational cognitive neuroscience of complex human thought to the study of AI alignment.
I'm exhilarated and a bit frightened by it.
Alignment is the study of how we can make sure our AIs' goals are aligned with humanity's goals. So far, AIs haven't really had goals of their own, nor been smart enough to be worth worrying about, so this can sound like paranoia or science fiction. But recent breakthroughs in AI make it quite possible that we'll have genuinely smarter-than-human AIs with their own goals sooner than we're ready for them. If their goals don't align well enough with ours, they'll probably outsmart us and get their way, possibly much to our chagrin. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could actually accomplish that for all of humanity. So I focus on finding alignment solutions.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by making following instructions the agent's central goal. This still leaves the huge problem of a multipolar scenario with multiple humans in charge of ASIs, but that problem might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function. I've focused on the emergent interactions that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my best theories. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.
I think that the field of AGI alignment is "pre-paradigmatic": we don't know what we're doing yet. We don't have anything like a consensus on what problems need to be solved, or how to solve them. So I spend a lot of my time thinking about this, in relation to specific problems and approaches. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with zero episodic memory and very little executive function for planning and goal-directed self-control. Adding those capabilities and others might expand LLMs into working cognitive architectures with human-plus abilities in all relevant areas. My work since then has convinced me that we could probably also align such AGI/ASI to keep following human instructions, by putting such a goal at the center of their decision-making process and therefore their "psychology", and then using the aligned proto-AGI as a collaborator in keeping it aligned as it grows smarter.
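As a rough illustration only (not a real implementation; every name and prompt below is invented), here's what putting instruction-following at the center of a language model agent's decision loop might look like, with llm() standing in for any chat-model call:

```python
# A minimal sketch, not a real system: an agent loop in which every candidate
# action is checked against the central goal of following the human
# principal's instructions before it is executed. All names are invented for
# illustration; `llm` stands in for any prompt-to-completion call.
from typing import Callable

def propose_action(llm: Callable[[str], str], task: str, history: list[str]) -> str:
    context = "\n".join(history[-10:])
    return llm(f"Task: {task}\nRecent steps:\n{context}\nPropose the single next action:")

def follows_instructions(llm: Callable[[str], str], action: str, instructions: str) -> bool:
    verdict = llm(
        "Standing instructions from the human principal:\n"
        f"{instructions}\n\nProposed action:\n{action}\n\n"
        "Answer YES only if the action clearly follows the instructions, otherwise NO."
    )
    return verdict.strip().upper().startswith("YES")

def run_agent(llm: Callable[[str], str], task: str, instructions: str, max_steps: int = 20) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        action = propose_action(llm, task, history)
        # The central goal: when in doubt, defer to the human rather than act.
        if not follows_instructions(llm, action, instructions):
            print(f"Pausing for human review: {action}")
            return
        history.append(action)
        print(f"Executing: {action}")
```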
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to engage enough careful critique of my ideas to know if this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), my estimate of the odds we don't survive long-term as a species, is in the 50% range: too complex to call. That's despite having a pretty good mix of knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is probably far overstating their certainty.
That all makes sense. To expand a little more on some of the logic:
It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.
I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.
On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.
In that case, I think a partial pause would have a negative expected value, as the current lead decays and more people who take the risks less seriously get into the lead by circumventing the pause.
This makes me highly unsure whether a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the alignment taxes are too high.
The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.
This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.
It does seem clear that creating mechanisms and political will for a pause is a good idea.
Advocating for more safety work also seems clear cut.
To this end, I think it's true that you create more political capital by successfully pushing for policy.
A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now when even most alignment folks think we're years from AGI.
So perhaps the low-hanging fruit is pushing for voluntary RSPs, and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.
There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.
Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.
I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.
LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.
In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.
I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending and stop if it hits $X and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the prospect of a big bot disaster more seriously.
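To make that concrete, here's a toy Python sketch of the kind of spending guard I mean; the cap, names, and wording are all invented for illustration:

```python
# Illustrative toy only: the kind of "stop and check with me" spending guard
# described above. The cap, names, and prompt text are invented for the example.
class BudgetGuard:
    """Tracks an agent's spending and halts for human approval past a cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def authorize(self, amount_usd: float, description: str) -> bool:
        if self.spent_usd + amount_usd > self.cap_usd:
            # Hand control back to the human instead of spending past the cap.
            answer = input(
                f"Agent wants to spend ${amount_usd:.2f} on {description!r}; "
                f"${self.spent_usd:.2f} of ${self.cap_usd:.2f} already spent. Approve? [y/N] "
            )
            if answer.strip().lower() != "y":
                return False
        self.spent_usd += amount_usd
        return True

if __name__ == "__main__":
    guard = BudgetGuard(cap_usd=500.0)
    if guard.authorize(120.0, "ad campaign"):
        print("Proceeding with purchase.")
    else:
        print("Purchase blocked pending human review.")
```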
Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.
Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (pinecone or other vector search over saved text files) will interact so that improvements in each make the rest of the system work better and easier to improve.
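For intuition about that interaction, here's a toy Python sketch (all names invented; keyword overlap stands in for Pinecone-style vector search, and llm is assumed to be any prompt-to-completion callable):

```python
# A toy sketch of the executive/episodic-memory interaction, with invented
# names: an "executive" outer script that plans step by step, plus an
# "episodic memory" of saved text notes. Keyword overlap stands in for real
# embedding/vector search; `llm` is any callable mapping a prompt to text.
from typing import Callable

class EpisodicMemory:
    """Saves text notes and recalls the ones most relevant to a query."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def save(self, note: str) -> None:
        self.notes.append(note)

    def recall(self, query: str, k: int = 3) -> list[str]:
        words = set(query.lower().split())
        # Crude relevance score; a real system would rank by embedding similarity.
        ranked = sorted(self.notes, key=lambda n: -len(words & set(n.lower().split())))
        return ranked[:k]

def executive_loop(llm: Callable[[str], str], goal: str, steps: int = 5) -> None:
    memory = EpisodicMemory()
    for i in range(steps):
        recalled = "\n".join(memory.recall(goal))
        step = llm(f"Goal: {goal}\nRelevant past notes:\n{recalled}\nWhat is the next step?")
        memory.save(f"Step {i}: {step}")  # better memory makes later planning better
        print(step)
```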
I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to think of. It looked like the ethical/capitalist reasoning of a pretty intelligent person, and a fairly ethical one at that.
After reading all the comments threads, I think there's some framing that hasn't been analyzed adequately:
Why would humans be testing AGIs this way if they have the resources to create a simulation that will fool a superintelligence?
Also, the risk of humanity being wiped out seems different and worse while that ASI is attempting a takeover; during that time, the humans are probably an actual threat.
Finally, leaving humans around would seem to pose a nontrivial risk that they'll eventually spawn a new ASI that could threaten the original.
The Dyson sphere is just a tiny part of the universe so using that as the fractional cost seems wrong. Other considerations in both directions would seem to dominate it.
It's the first, there's a lot of uncertainty. I don't think anyone is lying deliberately, although everyone's beliefs tend to follow what they think will produce good outcomes. This is called motivated reasoning.
I don't think this changes the situation much, except to make it harder to coordinate. Rushing full speed ahead while we don't even know the dangers is pretty dumb. But some people really believe the dangers are small so they're going to rush ahead. There aren't strong arguments or a strong consensus for the danger being extremely high, even though looking at opinions of the most thorough thinkers puts risks in the alarmingly high, 50% plus range.
Add to this disagreement the fact that most people are neither longtermist nor utilitarian; they'd like a chance to get rich and live forever even if it risks humanity's future.
I guess I have no idea what you mean by "consciousness" in this context. I expect consciousness to be fully explained and still real. Ah, consciousness. I'm going to mostly save the topic for if we survive AGI and have plenty of spare time to clarify our terminology and work through all of the many meanings of the word.
Edit - or of course if something else was meant by consciousness, I expect a full explanation to indicate that thing isn't real at all.
I'm an eliminativist or a realist depending on exactly what is meant. People seem to be all over the place on what they mean by the word.
Consciousness is not at all epiphenomenal, it's just not the whole mind and not doing everything. We don't have full control over our behavior, but we have a lot. While the output bandwidth is low, it can be applied to the most important things.
I think this is dependent on reading strategy, which is dependent on cognitive style. Someone who skims a lot is frequently making active decisions about what to read while reading, so they're skilled at this and not bothered by footnotes. I love footnotes. This style may be more characteristic of a fast-attention cognitive style (and ADHD-spectrum loosely defined).
For those with what I like to call attention surplus disorder :) who do not skim much, I can see the problem.
One strategy is to simply not read any footnotes on your first pass. Footnotes are supposed to be optional to understanding the series of ideas in the writing. Then, if you're interested enough to get further into the details, you go back and read some or all of the footnotes.
I agree that we could use footnotes better by either using them one way and stating it, or providing a brief cue to how it's used in the text.
I strongly disagree that footnotes as classically used are not useful. And having any sort of hypertext improves the situation.
Footnotes are usually used to mean "here are some more thoughts/facts/claims related to those you just read before the footnote mark". Sometimes those will be in a whole different reference. After you glance at a couple, you know how this author is using them.
Appropriate use of footnotes is part of good writing. As such, it's dependent on the topic, the author, the reader, and their goals in writing/reading. And thus very much a matter of individual taste and opinion.
Endnotes of varied use, without two-way hypertext links, on the other hand, should die in a fire.
That sprang to my mind as the perfect solution to this problem.
Great reference! I found myself explaining this repeatedly but without the right terminology. The "but comparative advantage!" argument is quite common among economists trying to wrap their head around AI advances.
I think it applies for worlds with tool/narrow AI, but not with AGI that can do whole jobs for much lower wages than any human can do anything.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd still face roughly a 50% chance that AGI arrives before it, and that we die from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.
Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.