Seth Herd

Research overview:

If you don't already know the arguments for why aligning AGI is probably the most important and pressing question of our time, please see this excellent intro. There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that are readily implementable for the types of AGI we're most likely to develop first. 

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacked suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by giving them a central goal of following instructions. This still leaves the huge problem of a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function. I've focused on the emergent interactions that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my best theories. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  My primary email is seth dot herd at gee mail dot com. 

More on approach

I think that the field of AGI alignment is "pre-paradigmatic": we don't know what we're doing yet. We don't have anything like a consensus on what problems need to be solved, or how to solve them. So I spend a lot of my time thinking about this, in relation to specific problems and approaches. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with zero episodic memory and very little executive function for planning and goal-directed self-control. Adding those capabilities and others might expand LLMs into working cognitive architectures with human-plus abilities in all relevant areas. My work since then has convinced me that we could probably also align such AGI/ASI to keep following human instructions, by putting such a goal at the center of their decision-making process and therefore their "psychology", and then using the aligned proto-AGI as a collaborator in keeping it aligned as it grows smarter.
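
To make this concrete, here is a minimal toy sketch of the kind of loop I have in mind: an outer "executive" script that breaks an instruction into steps, calls an LLM for each step, and keeps a simple episodic record of results. It's purely illustrative and built on my own simplifying assumptions; call_llm is a hypothetical stand-in for whatever chat-model API you'd use, not any particular system's design.

    # Toy illustration: an outer "executive" script that plans, executes steps
    # with an LLM, and keeps a running episodic record of what happened.
    from typing import List

    def call_llm(prompt: str) -> str:
        """Hypothetical stand-in for any chat-model API call."""
        raise NotImplementedError

    def run_instruction(instruction: str, max_steps: int = 10) -> List[str]:
        episodic_memory: List[str] = []  # naive append-only memory of past steps
        plan = call_llm(
            f"Break this instruction into short numbered steps:\n{instruction}"
        ).splitlines()

        for step in plan[:max_steps]:
            recent = "\n".join(episodic_memory[-5:])  # recall a few recent episodes
            result = call_llm(
                f"Instruction: {instruction}\n"
                f"Recent results:\n{recent}\n"
                f"Carry out this step and report the result: {step}"
            )
            episodic_memory.append(f"{step} -> {result}")
        return episodic_memory

Real systems would add much more (tool use, self-critique, persistent goals), but even this skeleton shows how memory and executive control wrap around an LLM rather than living inside it.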

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to elicit enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Wiki Contributions

Comments

Seth Herd

MIRI's communications strategy, with the public and with us

This is a super short, sloppy version of my draft "cruxes of disagreement on alignment difficulty" mixed with some commentary on MIRI 2024 Communications Strategy  and their communication strategy with the alignment community.

I have found MIRI's strategy baffling in the past. I think I'm understanding it better after spending some time going deep on their AI risk arguments. I wish they'd spend more effort communicating with the rest of the alignment community, but I'm also happy to try to do that communication. I certainly don't speak for MIRI.

On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.

The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership. 

Both of those camps consist of highly intelligent, highly rational people. Their disagreement should bother us for two reasons.

First, we probably don't know what we're talking about yet. We as a field don't seem to have a good grip on the core issues. Very different but highly confident estimates of the problem strongly suggest this.

Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it's less difficult, we should primarily work hard on alignment.

MIRI must argue that alignment is very unlikely to succeed if we push forward. Those who think we can align AGI will argue that it's possible.

This suggests a compromise position: we should both work hard on alignment, and we should slow down progress to the extent we can, to provide more time for alignment. We needn't discuss shutdown much amongst ourselves, because it's not really an option. We might slow progress, but there's almost zero chance of humanity relinquishing the prize of strong AGI.

But I'm not arguing for this compromise, just suggesting that it might be a spot we want to end up at. I'm not sure.

I suggest this because movements often seem to succumb to infighting. People who look mostly aligned from the outside fight each other, and largely nullify each other's public communications by publicly calling each other wrong and sort of stupid and maybe bad. That gives just the excuse the rest of the world wants to ignore all of them; even the experts think it's all a mess and nobody knows what the problem really is and therefore what to do. Because time is of the essence, we need to be a more effective movement than the default. We need to keep applying rationality to the problem at all levels, including internal coordination.

Therefore, I think it's worth clarifying why we have such different beliefs. So, in brief, sloppy form:

MIRI's risk model:

  1. We will develop better-than-human AGI that pursues goals autonomously
  2. Those goals won't match human goals closely enough
  3. Doom of some sort

That's it. Pace of takeoff doesn't matter. Means of takeover doesn't matter.

I mention this because even well-informed people seem to think there are a lot more moving parts to that risk model, making it less likely. This comment on the MIRI strategy post is one example. 

I find this risk model highly compelling. We'll develop goal-directed AGI because that will get stuff done; it's an easy extension of highly useful tool AI like LLMs; and it's a fascinating project. That AGI will ultimately be enough smarter than us that it's going to do whatever it wants. Whether that takes a day or a hundred years doesn't matter. It will improve and we will improve it. It will ultimately outsmart us. What matters is whether its goals match ours closely enough. That is the project of alignment, and there's much to discuss about how hard it is to make its goals match ours closely enough.

Cruxes of disagreement on alignment difficulty

I spent some time recently going back and forth through discussion threads, trying to identify why people continue to disagree after applying a lot of time and rationality practice. Here's a very brief sketch of my conclusions:

Whether we factor in humans' and society's weaknesses

I list this first because I think it's the most underappreciated. It took me a surprisingly long time to understand how much of MIRI's stance depends on this premise. Having seen it, I thoroughly agree. People are brilliant, for an entity trying to think with the brain of an overgrown lemur. Brilliant people do idiotic things, driven by competition and a million other things. And brilliant idiots organizing a society amplifies some of our cognitive weaknesses while mitigating others. MIRI leadership has occasionally said things to the effect of: alignment might be fairly easy, and there would still be a very good chance we'd fuck it up. I agree. If alignment is actually kind of difficult, that puts us into the region where we might want to be really really careful in how we approach it.

Alignment optimists are sometimes thinking something like: "sure, I could build a safe aircraft on my first try. I'd get a good team and we'd think things through and make models. Even if another team was racing us, I think we'd pull it off". Then the team would argue and develop rivalries, communication would prove harder than expected so that portions of the effort would turn out, too late, not to fit the plan, corners would be cut, and the outcome would be difficult to predict.

Societal "alignment" is worth mentioning here. We could crush it at technical alignment, getting rapidly-improving AGI that does exactly what we want and still get doom.  It would probably be aligned to do exactly what its creators want, not have full value alignment with humanity - see below. They probably won't have the balls or the capabilities to try for a critical act that prevents others from developing similar AGI (even if they have the wisdom). So we'll have a multipolar scenario with few to many AGIs under human control. There will be human rivalries, supercharged and dramatically changed by having recursively self-improving AGIs to do their bidding and perhaps fight their wars. What does global game theory look like when the actors can develop entirely new capabilities? Nobody knows. Going to war first might look like the least-bad option.

Intuitions about how well alignment will generalize

The original alignment thinking held that explaining human values to AGI would be really hard. But that seems to actually be a strength of LLMs; they're wildly imperfect, but (at least in the realm of language) seem to understand our values rather well; for instance, much better than they understand physics or taking-over-the-world level strategy. So, should we update and think that alignment will be easy? The Doomimir and Simplicia dialogues capture the two competing intuitions very well: Yes, it's going well; but AGI will probably be very different than LLMs, so most of the difficulties remain.

I have yet to find a record of real rationalists putting in the work to get farther in this debate. If somebody knows of a dialogue or article that gets deeper into this disagreement, please let me know! Discussions trail off into minutiae and generalities. This is one reason I'm worried we're trending toward polarization despite our rationalist ambitions. 

The other aspect of this debate is how close we have to get to matching human values for acceptable success. One intuition is that "value is fragile" and network representations are vague and hard to train, so we're bound to miss. But we don't have a good understanding of either how close we need to get (But exactly how complex and fragile? got little useful discussion), or how well training networks hits the intended target, especially for near-future networks addressing complex real-world problems like "what would this human want". 

For my part, I think there are important points on both sides: LLMs understanding values relatively well is good news, but AGI will not be a straightforward extension of LLMs, so many problems remain.

What alignment means. 

One mainstay of claiming alignment is near-impossible is the difficulty of "solving ethics" - identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect - this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.

I think this is the intuition of most of those who focus on current networks. Christiano's relative optimism is based on his version of corrigibility, which overlaps highly with the instruction-following I think people will actually pursue for the first AGIs. But this massive disagreement often goes overlooked. I don't know which view is right; instruction-following or intent alignment might lead inevitably to doom from human conflict, and so not be adequate. We've barely started to think about it (please point me to the best thinking you know of for multipolar scenarios with RSI AGI).

What AGI means.

People have different definitions of AGI. Current LLMs are fairly general and near-human-level, so the term "AGI" has been watered down to the point of meaninglessness. We need a new term. In the meantime, people are talking past each other, and their p(doom) estimates mean totally different things. Some are saying that near-term tool AGI is very low risk, which I agree with; others are saying that further developments of autonomous superintelligence seem very dangerous, which I also agree with.

Second, people have totally different gears-level models of AGI. Some of those are much easier to align than others. We don't talk much about gears-level models of AGI because we don't want to contribute to capabilities, but not doing that massively hampers the alignment discussion.

Edit: Additional advanced crux: Do coherence theorems prevent corrigibility?

I initially left this out, but it deserves a place as I've framed the question here. The post What do coherence arguments actually prove about agentic behavior? reminded me about this one. It's not on most people's radar, but I think it's the missing piece of the puzzle that gets Eliezer from the maybe 90% p(doom) implied by all of the above to 99%+. 

The argument is roughly that a superintelligence is going to need to care about future states of the world in a consequentialist fashion, and if it does, it's going to resist being shut down or having its goals change. This is why he says that "corrigibility is anti-natural." The counterargument, nicely and succinctly stated by Steve Byrnes here (and in greater depth in the post he links in that thread) is that, while AGI will need to have some consequentialist goals, it can have other goals as well. I think this is true; I just worry about the stability of a multi-goal system under reflection, learning, and self-modification.

Sorry to harp on it, but having both consequentialist and non-consequentialist goals describes my attempt at stable, workable corrigibility in instruction-following ASI. Its consequentialist goals are always subgoals of  the primary goal: following instructions.

Implications

I think those are the main things, but there are many more cruxes that are less common. 

This is all in the interest of working toward within-field cooperation, by way of trying to understand why MIRI's strategy sounds so strange to a lot of us. MIRI leadership's thoughts are many and complex, and I don't think they've done enough to boil them down for easy consumption by those who don't have the time to go through massive amounts of diffuse text.

There are also interesting questions about whether MIRI's goals can be made to align with those of us who think that alignment is not trivial but is achievable. I'd better leave that for a separate post, as this has gotten pretty long for a "short form" post.

Context

This is an experiment in writing draft posts as short form posts. I've spent an awful lot of time planning, researching, and drafting posts that I haven't finished yet. Given how easy it was to write this (with previous draft material), relative to how difficult I find it to write a top-level post, I will be doing more, even if nobody cares. If I get some useful feedback or spark some useful discussion, better yet. 


The important thing for alignment work isn't the median prediction; if we only had an alignment solution by that median date, we'd still face a 50% chance that AGI arrives sooner and we die from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development, if not to compute progress.

Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as the evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have a negative expected value, as the current lead decayed and more people who take the risks less seriously got into the lead by circumventing the pause.

This makes me highly unsure if a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now, when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.

I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.

LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.

In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.

I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending and stop if it hits $X and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the possibility of a big bot disaster more seriously.
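
As a purely illustrative sketch of the kind of safeguard I mean (the names BudgetGuard and ask_human are made up for this example, not any framework's API):

    # Toy safeguard: refuse or escalate any spend that would exceed a hard budget.
    from typing import Callable

    class BudgetGuard:
        def __init__(self, budget_usd: float):
            self.budget_usd = budget_usd
            self.spent_usd = 0.0

        def approve_spend(self, amount_usd: float,
                          ask_human: Callable[[str], bool]) -> bool:
            """Allow a purchase while under budget; otherwise stop and ask the human."""
            if self.spent_usd + amount_usd > self.budget_usd:
                approved = ask_human(
                    f"Agent wants to spend ${amount_usd:.2f}; "
                    f"${self.spent_usd:.2f} of ${self.budget_usd:.2f} already spent. Approve?"
                )
                if not approved:
                    return False
            self.spent_usd += amount_usd
            return True

The point isn't the code; it's that this kind of check is cheap to add, so social and regulatory pressure can plausibly make it the default.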

Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.

Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, and faster, than you'd intuitively think. Improvements to the executive function (the outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.
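
For the episodic-memory half, a bare-bones version of "vector search over saved text" is just: embed each saved snippet and return the snippets most similar to the current query. The sketch below is illustrative only; embed stands in for a real embedding model or hosted vector store, and this isn't the architecture from the linked post.

    # Bare-bones episodic memory: store (embedding, text) pairs and recall the
    # entries most similar to a query by cosine similarity.
    import math
    from typing import Callable, List, Tuple

    Vector = List[float]

    def cosine(a: Vector, b: Vector) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    class EpisodicMemory:
        def __init__(self, embed: Callable[[str], Vector]):
            self.embed = embed  # hypothetical embedding function
            self.entries: List[Tuple[Vector, str]] = []

        def store(self, text: str) -> None:
            self.entries.append((self.embed(text), text))

        def recall(self, query: str, k: int = 3) -> List[str]:
            q = self.embed(query)
            ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
            return [text for _, text in ranked[:k]]

The executive script stores each step's result with store() and pulls relevant past episodes back into the prompt with recall(); that's the interaction between the two components described above.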

 

 

  1. ^

    I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to come up with. It looked like the ethical/capitalist reasoning of a pretty intelligent person, and a fairly ethical one.

I intended to refer to understanding the concept of manipulation adequately to avoid it if the AGI "wanted" to.

As for understanding the concept of intent, I agree that "true" intent is very difficult to understand, particularly if it's projected far into the future. That's a huge problem for approaches like CEV. The virtue of the approach I'm suggesting is that it entirely bypasses that complexity (while introducing new problems). Instead of inferring "true" intent, the AGI just "wants" to do what the human principal tells it to do. The human gets to decide what their intent is. The machine just has to understand what the human meant by what they said, and the human can clarify that in a conversation. I'm thinking of this as do what I mean and check (DWIMAC) alignment. More on this in Instruction-following AGI is easier and more likely than value aligned AGI.

I'll read your article.

Seth Herd

Moloch is the name of this force, and rent-seeking is one of its faces.

I think this is basically correct, although as others have noted it doesn't completely counteract progress.

There are forms of rent-seeking from sources other than land ownership, like rising college tuition. Arguably, zoning is a separate form of rent-seeking that's not directly based on land ownership, but on control of government to make one's own life better at the expense of others' opportunities.

Those two are more clearly Moloch. Competition for good degrees and good zoning drives prices as high as people will pay.

Excellent point. But these changes are much smaller than the 100x wealth increase, which implies that there is a very strong poverty-inducing force; it's just not completely negating progress.

This doesn't address how the equilibrium would change if such a basic income becomes universal.

Thank you!

The link to your paper is broken. I've read the Christiano piece. And some/most of the CEV paper, I think.

Any working intent alignment solution needs to prevent the AGI from deliberately changing the human's intent. That is a solvable problem with an AGI that understands the concept.

Asking people to listen to a long presentation is a bigger ask than a concise presentation with more details than the current post. Got anything in between?
