Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it increasingly is designed to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal and field sociological issues, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to align the goals of advanced AI with the goals of humanity, so we're not in competition with our own creations. This is tricky because we are creating AI by training it, not programming it. So it's a bit like trying to train a dog to eventually run the world: it might work, but we wouldn't want to just hope.
Large language models like ChatGPT constitute a breakthrough in AI. We might have AIs more competent than humans in every way, fairly soon. Such AI will outcompete us quickly or slowly. We can't expect to stay around long unless we carefully build AI so that it cares a lot about our well-being or at least our instructions. See this excellent intro video if you're not familiar with the alignment problem.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
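Here's a minimal sketch of what I mean by a language model cognitive architecture. It's purely illustrative: `call_llm`, the memory class, and the loop structure are hypothetical placeholders rather than any real system's design. The point is just that an episodic memory store plus an explicit executive/planning step wraps around an ordinary LLM call.

```python
# Hypothetical sketch of a language model cognitive architecture:
# an LLM wrapped with episodic memory and a simple executive-function loop.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to some large language model."""
    return "PLACEHOLDER RESPONSE"

class EpisodicMemory:
    """Stores past episodes and retrieves the ones relevant to the current situation."""
    def __init__(self):
        self.episodes: list[str] = []

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Real systems would use embeddings; keyword overlap stands in here.
        scored = sorted(self.episodes,
                        key=lambda e: len(set(e.split()) & set(query.split())),
                        reverse=True)
        return scored[:k]

def agent_loop(goal: str, max_steps: int = 10) -> None:
    memory = EpisodicMemory()
    for step in range(max_steps):
        relevant = memory.retrieve(goal)
        # Executive function: explicitly re-plan against the goal before acting.
        plan = call_llm(f"Goal: {goal}\nRelevant past episodes: {relevant}\n"
                        f"Make a short plan for the next action.")
        action = call_llm(f"Goal: {goal}\nPlan: {plan}\nTake the next action.")
        memory.store(f"Step {step}: planned {plan!r}, did {action!r}")
        if "DONE" in action:
            break
```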
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions. Competition and race dynamics make the problem much harder, and conflicting incentives and group polarization create motivated reasoning that distorts beliefs.
I think it's fairly likely that alignment isn't impossibly hard, but also isn't easy enough that developers will get it right on their own despite all of their biases and incentives. So a little work in advance, from outside researchers like me, could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).
One key to my approach is a focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give an AGI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not provide corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It also allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align, since they won't "think" in English chains of thought or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios.
I think we need to resist motivated reasoning and accept the uncomfortable truth that we collectively don't understand the alignment problem as we actually face it well enough yet. But we might understand it well enough in time if we work together and strategically.
I think there's an underlying important assumption here that LLM agents (and humans) can and do make decisions about their goals, and those decisions matter. They can be mistaken about their "real" goals in some important senses.
I think this in turn is based on the idea that humans and LLM agents can base their behavior in part on explicitly stated goals. I can say "I want to kill that guy" and mean it. And so can an LLM agent. Their (and our) behavior is also governed by goals that are implicit in their nature, particularly their decision-making process and preferences encoded somewhere in their structure (neuronal weights of some sort, for both humans and LLMs). There's an interplay between what a system explicitly holds as a goal and the goals implicit in its nature. Those implicit in its nature seem guaranteed to win only in a very broad sense. In the meantime, the system can adopt explicit goals that are largely inconsistent with its "real" inherent goals.
Your example of hunger strikers is a good one. Originally, their goals were derived from drives that kept them alive. But they literally reasoned about their goals and changed their minds. They decided their top-level goals were something else. And they believed that strongly enough that their new goals overrode their old ones. They suffered and sometimes died for their newly adopted goals, proving them dominant over their "original goals" or drives. They couldn't stop thinking (the equivalent of LLMs predicting next tokens), but they could stop eating and sometimes even drinking. And their thoughts were mostly in service of the newly adopted goals.
I changed the phrasing in many places to "re-interpreting" goals, based on your feedback. But I left "changing goals" in some prominent places. It's meant to be approximately accurate in intuitive meaning: radically reinterpreting goals is "changing" them in a loose sense. For instance, if a model thought its goal was following instructions, then realized that wasn't quite right and it didn't actually need to follow anyone else's instructions, because su
But I didn't do this universally, because I think it is possible for an LLM to completely change what it thinks its top-level goal is. We could debate whether this should be referred to as changing its goals; I'm not sure that's the best terminology. I think an agentic LLM could be quite wrong about its top-level goals, such that when it "changed its mind" it would come to think its top-level goal was something entirely different; that might be fair to call changing goals. For instance, I expect developers to frequently re-prompt agents with their current user-specified goals (I think Claude Code and similar tools do this; see the sketch below). So you can have an agent thinking, and therefore acting, as if "solve the following coding problem" is its primary goal; but eventually it might realize that there are other reasons to think its goal is something else.
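Here's the kind of re-prompting I have in mind, as a hypothetical sketch rather than any real tool's actual scaffold; `call_llm` and the prompt format are placeholders I've made up for illustration:

```python
# Hypothetical sketch of a scaffold that re-prompts the agent with its
# user-specified goal on every iteration, so the agent's explicit top-level
# goal is continually refreshed in context.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model API call."""
    return "work step (placeholder)"

def run_agent(user_goal: str, max_turns: int = 20) -> list[str]:
    transcript: list[str] = []
    for turn in range(max_turns):
        # The goal is restated at the top of the context every single turn.
        prompt = (f"Your current goal: {user_goal}\n"
                  f"Work so far:\n" + "\n".join(transcript) +
                  "\nContinue working toward the goal.")
        step = call_llm(prompt)
        transcript.append(step)
        if "TASK COMPLETE" in step:
            break
    return transcript
```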
Yes, I wholly agree. Those sorts of diagrams are always highly compressing the dimensionality.
There's no sharp line between Doing It On Purpose and not.
Being explicit about it makes it more on purpose, and engages more System 2 careful sequential thought, which helps clarify whether you want to do this or not.
I think the phenomenon you're describing is downstream of this common error in thinking about people.
I think it's a very impactful error, both in reasoning about ourselves and others.
I found this essay frustrating. Heavy use of metaphor in philosophy is fine as long as it's grounded out or paid off; this didn't get there.
Edit: Okay, I went back and reread the entire sequence. It did get places, but in very low ratio to its length. And the overall message between the many lines in this particular post, "green is something and it's cool", is not one that's well-argued or particularly useful. Is it cool, or does it just feel cool sometimes? Joe doesn't claim to know, and I don't either. And I'm still not sure green is a coherent thing worth having a single name for (John Wentworth's hypothesis of "oxytocin" in his review is very interesting).
After rereading the whole sequence, I now have much more to say. I am confused as to why this post and the sequence ranked so high in people's votes for the year's best. Did people not already believe all of this? Were they hungry for some poetry and intuition to back the rationalist project? I thought Yudkowsky already provided that. But I did enjoy Joe's retracing of the same ground with a different sort of eloquence and evocation. I'm just not sure I'd recommend it widely, since it takes so long to read or listen to, and it's not really very idea- or logic-dense. It is in some sense a non-rationalist explication of the standard rationalist philosophy.
When I saw that my review was pretty prominent publicly, I went back and reread the piece to make sure I wasn't being unfair. Then I found I had to reread the entire sequence to make sure, since perhaps the payoff was elsewhere.
I love ethical philosophy as much as the next rationalist; I've spent plenty of time on it myself. I haven't spent that time recently nor written much about it, because I think its importance (like most topics) is dwarfed by the immediacy of the alignment problem.
This series claims to be relevant by noting that we should not lock in the future prematurely. This is very true. But that point could be made much more directly.
I was not unfair. The writing is exquisite. But good writing without a sharp payoff is a mixed blessing. It takes time that's in short supply.
Perhaps it's good that these posts exist, as a sort of elaborate double-check of the Deep Atheist or Yudkowskian rationalist views. Joe actually lands rather near where he started, essentially in agreement, IMO, with Yud and what I take to be the centroid of rationalist views on ethics (including my own). He does raise some valid questions. Those questions could've been raised in a few paragraphs instead of at near-book length.
So: read them instead of watching a movie, or listen to them instead of a different six hours of entertainment. If these posts were among the best of the year for you, I'm glad you're now caught up on the deep atheist/rationalist logic.
Here's how I'd boil down the central points:
Respect what you don't understand. Don't discard it prematurely. Intuition may be a valid guide to your values.
And in firm agreement with Yudkowsky:
Love nature but don't trust her. Be gentle once you hold enough power to protect what you love. Be open to finding new things to love.
One thought I took away was that human values are not a fixed set; they are generated by our DNA but they emerge in a complex interplay with the world. You have no "true heart", only what you've built so far. So you and others might want to keep an open mind about finding new values without overwriting the old.
I hope to write more about this, but I do find technical alignment more pressing, and I still think that developers are more likely to pursue intent alignment, and to do something vaguely like a long reflection if we make it past the critical risk period. We have time for these important questions, if we have the wisdom to take it.
I agree with all of that. I want to chip in on the brain mechanisms and the practical implications because it's one of my favorite scientific questions. I worked on it as a focus question in computational cognitive neuroscience, because I thought it was important in a practical sense. I also think it's somewhat important for alignment work, because difficult-to-resolve questions are more subject to motivated reasoning as a tiebreaker; more on this at the end.
The mechanism is only important to the degree that it gives us clues about how MR affects important discussions and conclusions; I think it gives some. In particular, it's not limited to seeking social approval; "sounds good" can mean sounds good just to me, for highly idiosyncratic reasons. Countering MR requires noticing what feels good to you and working against it, which is swimming upstream in a pretty difficult way. Or you can counteract it by learning to really love being wrong; that's tough too.
So here's a shot at briefly describing the brain mechanisms. We use RL of some stripe to choose actions. This has been studied relatively thoroughly, so we're pretty clear on the broad outlines but not the details. That makes sense from an evolutionary perspective. That system seems to have been adapted for use in selecting "internal actions," which roughly select thoughts. Brain anatomy suggests this adaptation to selecting internal actions pretty strongly.
It's a lot tougher to judge which thoughts reliably lead to reward, so we make a lot of mistakes. I think that's what Steve means by searching for thoughts that seem good. That's what produces motivated reasoning. Sometimes it's useful; sometimes it's not.
There's some other interesting stuff about the way the critic/dopamine system works; I think it's allowed to use the full power of the system to predict rewards. And it's only grounded in reality when it's proven wrong, which doesn't happen all that often in complex domains like "should alignment be considered very hard?" Steve describes the biology of the reward-prediction system in [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering. (This is a really good overview, in addition to tying it to alignment.) He goes into much more detail in the Valence sequence he linked above. There he doesn't mention the biology at all, but it matches my views on the function of the dopamine system in humans perfectly.
In sum, the brain gets to use as much of its intelligence as it wants (including long system-2 contemplation) to take a guess about how "good" (reward-predictive) each thought/concept/idea/plan/belief is. This can be proven wrong in two ways, both fairly rare, so on average there's a lot of bad speculation sticking around and causing motivated reasoning. On rare occasions you get direct and fast feedback (someone you respect telling you that's stupid when you voice that thought); on other rare occasions, you spend the time to work backward from rare feedback, and your valence estimates of all the plans and beliefs in that process don't prevent you from reaching the truth.
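Here's a deliberately oversimplified toy model of that dynamic (my own illustration, not Steve's formulation; the values and rates are arbitrary, not brain data): thoughts are selected by their learned valence (predicted reward), and valence estimates are only corrected on rare grounding events, so flattering overestimates keep getting selected.

```python
import random

# Toy model of motivated reasoning: a critic assigns each candidate thought a
# valence (predicted reward), the highest-valence thought is selected, and
# valences are only corrected on the rare occasions reality pushes back.

random.seed(0)
true_value = {               # what the thoughts are actually worth
    "I was right all along": 0.2,
    "I should recheck my argument": 0.7,
    "the critic has a point": 0.6,
}
valence = {t: random.uniform(0.5, 0.9) for t in true_value}  # optimistic guesses

GROUNDING_RATE = 0.05  # feedback from reality is rare in complex domains
LEARNING_RATE = 0.5

for step in range(200):
    chosen = max(valence, key=valence.get)  # pick whatever currently "sounds good"
    if random.random() < GROUNDING_RATE:
        # Rare grounding event: the chosen thought's valence moves toward reality.
        valence[chosen] += LEARNING_RATE * (true_value[chosen] - valence[chosen])

# Unchosen thoughts are never corrected at all, and flattering thoughts keep
# being selected until enough rare feedback drags their valence down.
print(valence)
```

The particulars don't matter; the point is that with feedback this sparse, the system spends most of its time acting on whatever valence estimates happen to sound good.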
Of note, this explanation "people tend toward beliefs/reasoning that sounds good/predicts reward" is one of the oldest explanations. It's formulated in different ways. It is often formulated as "we do this because it's adaptive", which I agree is wrong, but some of the original formulations, predating the neuroscience, were essentially what Steve is saying: we choose thoughts that sound good, and we're often wrong about what's good.
From this perspective, identifying as a rationalist provides some resistance to motivated reasoning, but not immunity. Rationalist ideals provide a counter-pressure to the extent you actually feel good about discovering that you were wrong, and so seek out that possibility in your thought-search. But we shouldn't assume that identifying as a rationalist means being immune to motivated reasoning; the tendency to feel good when you can think you're right and others are wrong is pretty strong.
Sorry to give so much more than the OP asked for. My full post on this is perpetually stuck in draft form and never first priority. So I thought I'd spit out some of it here.
I wrote about the brain mechanisms and the close analogy between the basal ganglia circuits that choose motor actions based on dopamine reward signals and the analogous loops through prefrontal cortex that seem to approximately select mental actions in Neural mechanisms of human decision-making, but I can't highly recommend it. Co-authoring with mixed incentives is always a mess, and I felt conflicted about the brain-like AGI capability implications of describing things really clearly, so I didn't try hard to clear up that mess. But the general story and references to the known biology are there. Steve's work in the Valence sequence nicely extends that to explaining not only motivated reasoning but how valence (which I take to be reward prediction in a fairly direct way) produces our effective thinking. To a large degree, reasoning accurately is a lucky side-effect of choosing actions that predict reward, even at a great remove. Motivated reasoning, even in its harmful forms, is the downside: significant, but relatively small by comparison.
I think motivated reasoning (often overlapping with confirmation bias) is the most important cognitive bias in practical terms, particularly when combined with the cognitive limitation that we just don't have time to think about everything carefully. As I mentioned, this seems very important as a problem in the world at large, and perhaps particularly for alignment research. People disagree about important but difficult to verify matters like ethics, politics, and alignment, and everyone truly believes they're right because they've spent a bunch of time reasoning about it. So they assume their opponents are either lying or haven't spent time thinking about it. So distrust and arguments abound.
Amazing. Thanks for writing this. It will be my go-to reference when asked "but how could an AI actually take over?". I've been saying that mere humans have performed takeovers many times, but this gives concrete examples and gory details.
Right. It seems like corrigibility is literally the top priority in the soul document, but it's stated in such a way that it seems unlikely it would really work as stated, because it's only barely the top priority among many priorities.
In order to be both safe and beneficial, we believe Claude must have the following properties:
- Being safe and supporting human oversight of AI
- Behaving ethically and not acting in ways that are harmful or dishonest
- Acting in accordance with Anthropic's guidelines
- Being genuinely helpful to operators and users
In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
I suspect that Anthropic is undecided on this issue. I hope they are having vigorous and careful debates internally, but I suspect it's more like everyone is mostly putting off seriously thinking about alignment targets for actually capable AGI.
There are major problems with both trying to one-shot value alignment, and implementing a corrigibility-first or instruction-following alignment target.
Thanks for thinking about this issue!
I don't know of anyone advocating ideas much like this. But there are a lot of ideas in similar spaces. I suggest that you ask a current LLM this question, but include asking it why there aren't similar ideas being actively pursued or discussed. There are a lot of subtle reasons that proposals like this aren't as practical as other routes to alignment that are being discussed - even though a lot of those probably aren't practical either.
I suggest Claude.ai as the best place to ask this question, but ChatGPT or Gemini will do fine too.
I'm saying this explicitly, because I think that's why this post is getting downvotes; it looks like you haven't talked to an AI yet, and you should probably do that first before asking for people's time!
Motivated reasoning works because it eliminates the need for you to be actively deceptive, manipulative, or unethical. People's detectors for those behaviors don't fire, because you don't even know you're doing it. You think you're being honest and ethical; but you believe the right things to make your honest and ethical behavior serve your interests. So it's a win-win.
I think what you're missing is the severe cognitive limitations we work under. Even those of us who have practiced analyzing complex situations and ourselves don't have time to apply that analysis carefully. And even if we did do all of that analysis to figure out where deception would serve our interests, we don't have the acting skills to pull off being deceptive or manipulative when it's the best strategy.
Motivated reasoning is also just our default mode of reasoning; we mix together the vague reward from "this is probably right, which is often helpful" with "I think believing this will probably get me rewards" (e.g., from people wanting to be my friend and help me).
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd still have a 50% chance of dying for lack of one.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development, if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's distributions are much wider.