Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers build it to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal issues and the sociology of the field, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.
Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
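For concreteness, here's a minimal sketch of what I mean by a language model cognitive architecture. It's purely illustrative (the class names, the keyword-overlap retrieval, and the text-in/text-out `llm` callable are placeholder assumptions, not any lab's actual design): an LLM wrapped in an episodic memory store and an executive loop that plans, acts, records what happened, and checks progress toward the goal.

```python
# Illustrative sketch only: an LLM plus the two systems it currently lacks,
# episodic memory and an executive (planning/self-monitoring) loop.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EpisodicMemory:
    """Stores past episodes and retrieves those most relevant to a new task."""
    episodes: List[str] = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def recall(self, query: str, k: int = 3) -> List[str]:
        # Crude keyword-overlap retrieval; a real system would use embeddings.
        words = set(query.lower().split())
        ranked = sorted(self.episodes,
                        key=lambda e: len(words & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]


@dataclass
class LMCognitiveArchitecture:
    """An LLM wrapped with episodic memory and a simple executive loop."""
    llm: Callable[[str], str]  # any text-in/text-out model; a stand-in assumption
    memory: EpisodicMemory = field(default_factory=EpisodicMemory)

    def run(self, goal: str, max_steps: int = 5) -> str:
        context = "\n".join(self.memory.recall(goal))
        plan = self.llm(f"Past episodes:\n{context}\n\nMake a step-by-step plan for: {goal}")
        result = ""
        for step in range(1, max_steps + 1):
            result = self.llm(f"Plan:\n{plan}\n\nExecute step {step} toward: {goal}")
            self.memory.store(f"goal: {goal} | step {step}: {result}")   # episodic record
            verdict = self.llm(f"Goal: {goal}\nLatest result: {result}\nIs the goal met? yes/no")
            if verdict.strip().lower().startswith("yes"):                # executive check
                break
        return result
```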
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research.
I think it's fairly likely that alignment isn't impossibly hard, but also not easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).
One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It also allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that this is the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to attract enough careful critique of my ideas to know if this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Sorry, that was awfully vague. I'm probably referring to younger kids than you are. Although there's also going to be a lot of variance in when kids develop different introspective skills and conceptual understanding of their own minds.
I agree that not knowing you have a mind doesn't prevent you from having an experience. What I meant was that at some young age (I'm thinking up to age five, though even 5-year-olds could be more advanced), a kid will not be able to conceptualize that they are a mind, or, alternately phrased, that they have a mind. Nonetheless, they are a mind that is doing a bunch of complex processing and remembering some of it.
I definitely agree that phenomenal consciousness and intelligence are different. Discussions of consciousness usually break down in difficulties with terminology and how to communicate about different aspects of consciousness. I haven't come up with good ways to talk about this stuff.
I wouldn't make that argument. I just don't see the point of keeping it real.
It just seems like going virtual opens up a lot of possibilities with no downside. If you want consistency and real work, put it in your sim. Share it with other real people if you want to compromise on what worlds you'll inhabit and what challenges you'll face.
If you want real people who are really suffering so there are real stakes to play for, well, that's an orthogonal issue. I'd rather see nonconsensual suffering eliminated.
So: Why prefer the Real? What's it got that the Virtual doesn't?
Trying to align LLMs just doesn't seem optional to me. It's what's happening whether or not we like it.
Agreed on all points. Including that your writing a more detailed version, aimed more at the prosaic crowd, probably isn't the best next step. That's what I was trying to do in LAMRAG, and it was no more successful than this. That's despite me starting much closer to the standard prosaic alignment/LLM-based model of AGI internals.
I think one place this argument may break down for people is the metaphor of building for the ocean as a difficult project. Maybe the lake is a lot like the ocean, and ocean storms just aren't that bad, so you can just build it to double the strength it would need on the lake and you're good to go.
My vague take on this is that what we're doing in training now is a far cry from what a nascent AGI will see in deployment, so the metaphor holds. I wonder if some well-considered optimists are assuming we'll dramatically improve training, including thinking a lot harder about what AGIs will face in deployment, before they're deployed.
If I were confident developers would at least try to do that, I'd be a good bit more optimistic.
Just some thoughts.
More thoughts have arisen. Whatever the next step is, I think this work of clarifying and specifying the arguments could be critical. I don't think there's much chance that development will slow down, let alone stop, based on arguments, unless we produce far better arguments that alignment is hard and likely to fail. The abstract arguments here can just be countered by equally abstract arguments that alignment is possible because humans have it, and hey, Claude seems to be doing pretty well, so a future better version of it should be fine. That apparent equality of arguments allows motivated reasoning to play tiebreaker. And more people are currently motivated toward than away from AGI.
OTOH I do think development might be delayed when the public gets involved. They have less motivated reasoning toward assuming alignment is easy, so the obvious intuition "if nobody knows how dangerous it is, we should stop!" combined with "ummm I'd like to not be permanently jobless" might make a powerful political movement for slowing/stopping. But that's an entirely separate project from arguing the case for alignment difficulty on its merits. And I don't know the first thing about public relations/marketing.
If you can't tell the difference, how could you care which is which?
I'm talking about blocking your memories of living in a simulated world.
Consciousness is very unlikely to be a binary property. Most things aren't. But there appears to be a very strong tendency for even rationalists to make this assumption in how they frame and discuss the issue.
The same is probably true of moral worth.
Taken this way, LLMs (and everything else) are partly conscious and partly deserve moral consideration. What you consider consciousness and what you consider morally worthy are to some degree matters of opinion, but they very much depend on facts about the minds involved, so there are routes forward.
IMO current LLMs probably have a small amount of what we usually call phenomenal consciousness or qualia. They have rich internal representations and can introspect and reflect on them. But neither is nearly as rich as in a human, particularly an adult human who's learned a lot of introspection skills (including how to "play back" and interrogate contents of global workspace). Kids don't even know they have minds, let alone what's going on in there; figuring out how to figure that out is quite a learning process.
What people usually mean by "consciousness" seems to be "what it's like to be a human" which involves everything about brain function, focusing particularly on introspection - what we can directly tell about human brain function. But human consciousness is just one point on a multidimensional spectrum of different mind-properties including types of introspection.
I would question anyone who's nice to LLMs but eats factory-farmed meat. Skilled use of language is a really weird and somewhat self-centered criterion for moral worth.
Anyway, I'm also nice to LLMs because why not, and I think they probably appreciate it a tiny bit.
Future versions will have a lot more consciousness by various definitions. That's when these discussions will become widespread and perhaps even make some progress, at least in select circles like these.
One big payoff is the effect on AI safety. I expect anti-AI-slavery movements to slow down progress somewhat, maybe a lot if we don't undercut them (because motivated reasoning from not wanting to lose your job will pull people toward "oh yeah obviously they're conscious and shouldn't be enslaved!" even if the reality is that they're barely conscious). On the other hand, "free the AI" movements could be dangerous if we're trying to control maybe-misaligned TCAI.
Honesty about their likely status would make a pleasant tiebreaker in this dilemma.
This has a lot of overlap with my recent post LLM AGI may reason about its goals and discover misalignments by default, and the followup post I'm working on now that further explores whether we should train LLMs on reasoning about their goals. Prompting them to reason extensively about goals during training has the effect of revealing potential future misalignments to them, as you discuss.
I'm curious what you think about that framing.
Even humans have taken over the world. Something a little smarter should have a fairly easy time.
I do agree that there's a soft limit above human level for LLMs/agents, but it's not a hard limit and it's not right at human level.
I think this sometimes has a good explanation in the 80/20 rule, which itself is based on a pretty deep tendency for things to be complex. So changing things often has some low-hanging fruit (in the idealized case, the 80% you can get for 20% of the effort). It's easy to eliminate some clutter, harder to eliminate the rest.
This isn't always the case. In many of your examples, there's no obvious reason that cutting to zero would be harder. In some there are.
Here's another one: instead of cutting clutter to zero or time with that friend to zero, maybe you use that time/effort to go get some low-hanging fruit in other areas of life optimization?
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by then, we'd still face a 50% chance of AGI arriving first, and of dying from that lack.
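As a toy illustration of that point (made-up distribution parameters, not anyone's actual forecast): an alignment solution that arrives exactly at the median predicted AGI date comes too late in about half of the possible futures, by the definition of a median.

```python
# Toy illustration with a made-up lognormal timeline; not anyone's actual forecast.
import numpy as np

rng = np.random.default_rng(0)
years_until_agi = rng.lognormal(mean=2.3, sigma=0.6, size=100_000)  # hypothetical samples

median_year = np.median(years_until_agi)
p_too_late = np.mean(years_until_agi < median_year)  # solution ready only at the median

print(f"median years until AGI: {median_year:.1f}")
print(f"P(AGI arrives before that solution): {p_too_late:.2f}")  # ~0.50 by definition
```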
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not to compute progress.
Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya and Ege's have much wider distributions.