Sammy Martin

MTAIR project and Center on Long-term Risk, PhD candidate working on cooperative AI, Philosophy and Physics BSc, AI MSc at Edinburgh. Interested in philosophy, longtermism and AI Alignment. I write science fiction at


Essentially, the problem is that 'evidence that shifts Bio Anchors weightings' is quite different, more restricted, and much harder to define than the straightforward 'evidence of impressive capabilities'. However, the reason that I think it's worth checking if new results are updates is that some impressive capabilities might be ones that shift bio anchors weightings. But impressiveness by itself tells you very little.

I think a lot of people with very short timelines are imagining the only possible alternative view as being 'another AI winter, scaling laws bend, and we don't get excellent human-level performance on short-term language-specified tasks anytime soon', and don't see the further question of figuring out exactly what human-level on e.g. MMLU would imply.

This is because the alternative to very short timelines from (your weightings on) Bio Anchors isn't another AI winter, rather it's that we do get all those short-term capabilities soon, but have to wait a while longer to crack long-term agentic planning because that doesn't come "for free" from competence on short-term tasks, if you're as sample-inefficient as current ML is.

So what we're really looking for isn't systems getting progressively better and better at short-horizon language tasks. That's something that either the lifetime-anchor Bio Anchors view or the original Bio Anchors view predicts, and we need something that discriminates between the two.

We have some (indirect) evidence that original Bio Anchors is right. Namely, its being wrong would imply that evolution missed an obvious open goal to make bees and mice generally intelligent long-term planners. Also, human beings generally aren't vastly better than evolution at designing things anyway, and the lifetime anchor would imply that AGI is a glaring exception to this general trend.

As evidence, this has the advantage of being about something that really happened: human beings are the only human-level general intelligence that exists so far, so we have very good reasons to think matching the human brain is sufficient. However, it has the disadvantage of all the usual disanalogies between evolution and its requirements, and human designers and our requirements. Maybe this just is one of those situations where we can outdo evolution: that's not especially unlikely.


What's the evidence on the other side (i.e. against original bio anchors and for the lifetime anchor)?

There are two kinds that I tend to hear. One is that short-horizon competence is enough for dangerous/transformative capabilities. E.g. the claim that if you can build something that's "human level/superhuman at charisma/persuasion/propaganda/manipulation, at least on short timescales" that represents a gigantic existential risk factor that condemns us to disaster further down the line (the AI PONR idea), or that at this point actors with bad incentives will be far too influential/wealthy/advancing the SOTA in AI.

However, I'd consider this changing the subject: essentially it's not an argument for AGI takeover soon, rather it's an argument for 'certain narrow AIs are far more dangerous than you realize'. That means you have to go all the way back to the start and argue for why such things would be catastrophic in the first place. We can't rely on the simple "it'll be superintelligent and seize a DSA".

Suppose we get such narrow AIs, that can do most short-term tasks for which there's data, but don't generalize to long horizons consistently. This scenario 10 years from now looks something like: AI automates away lots of jobs, can do certain kinds of short-term persuasion and manipulation, can speed up capabilities and alignment research, but not fully replace human researchers. Some of these AIs are agentic and possibly also misaligned (in ways that are detectable and fall far short of the ability to take over, since by assumption they aren't competitive with humans at long-term planning). This certainly seems wild and full of potential danger, where slowing down progress could be much harder. It also looks like a scenario with far more attention on AI alignment than today, where the current funders of alignment research are much wealthier than now, and with plenty of obvious examples of what the problem is to catch people's attention. Overall, it doesn't seem like a scenario where (current AI alignment researchers + whoever else is working on it in 10 years) have considerably less leverage over the future than now: it could easily be more.


The other reason for favouring the lifetime anchor is that you get long-horizon competence for free once you're excellent at (a given list of) short-horizon tasks. This is arguing, more or less, that for the tasks that matter, current architectures are brainlike in their efficiency, such that the lifetime anchor makes more sense. A lot of the arguments in favour of this have a structure roughly like: look at a wide-ranging comprehension benchmark like MMLU; when an AI is human level on all of this, it'll be able to keep a train of thought running continuously, keep a working memory and plan over very long timescales the same way humans do.

As evidence, this has the significant advantage of being relevant and not having to deal with the vagaries of what tradeoffs evolution may have made differently to human engineers. It has the disadvantage of being fiction. Or at least evidence that's not yet been observed. You see AIs getting more and more impressive at a wider range of short-horizon tasks, which is roughly compatible with either view, but you don't observe the described outcome of them generalizing out to much longer-term tasks than that.

So, to return to the original question, what would count as (additional) evidence in favour of the lifetime anchor? The answer clearly can't be "nothing", since if we build AGI in 5 years, that counts.

I think the answer is, anything that looks like unexpectedly cheap, easy, 'for free' generalization from relatively shorter to relatively longer horizon tasks (e.g. from single reasoning steps to many reasoning steps) without much fine-tuning.

This is different from many of the other signs of impressiveness we've seen recently: just learning lots of shorter-horizon tasks without much transfer between them, being able to point models successfully at particular short-horizon tasks with good prompting, getting much better at a wider range of tasks that can only be done over short horizons. All of these are expected on either view.
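A rough way to state this, as a sketch rather than anything taken from the Bio Anchors report itself: write $L$ for the lifetime-anchor view, $O$ for the original Bio Anchors view, and $E$ for some new result (these symbols are my own labels). Bayes' rule in odds form says:

$$\frac{P(L \mid E)}{P(O \mid E)} = \frac{P(E \mid L)}{P(E \mid O)} \cdot \frac{P(L)}{P(O)}$$

If $E$ is about equally expected on both views, then $P(E \mid L) / P(E \mid O) \approx 1$ and the weightings barely move, however impressive $E$ is. Only a result that $L$ predicts much more strongly than $O$, like cheap generalization from shorter to longer horizons, shifts the odds noticeably.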

This unexpected evidence is very tricky to operationalize. Default bio anchors assumes we'll see a certain degree of generalizing from shorter to longer horizon tasks, and that we'll see AI get better and better sample-efficiency on few-shot tasks, since it assumes that in 20 or so years we'll get enough of such generalization to get AGI. I guess we just need to look for 'more of it than we expected to see'?

That seems very hard to judge, since you can't read off predictions about subhuman capabilities from bio anchors like that.

Does that mean the Socratic Models result from a few weeks ago, which does involve connecting more specialised models together, is a better example of progress?

The Putin case would be better if he were convincing Russians to make massive sacrifices or do something that will backfire and kill them, like start a war with NATO, and I don't think he has that power. For example, he rushed to deny that Russia was sending conscripts to Ukraine, for fear of the effect that would have on public opinion.

Is Steven Pinker ever going to answer for destroying the Long Peace?

It's really not at all good that we're going into a period of much heightened existential risk (from AGI, but also other sources) under Cold War-like levels of international tension.

I think there's actually a ton of uncertainty here about just how 'exploitable' human civilization ultimately is. We could imagine that since actual humans (e.g. Hitler) by talking to people have seized large fractions of Earth's resources, we might not need an AI that's all that much smarter than a human. On the other hand, we might just say that attempts like that are filtered through colossal amounts of luck and historical contingency and actually to reliably manipulate your way to controlling most of humanity you'd need to be far smarter than the smartest human.

I think there are a few things that get in the way of doing detailed planning for outcomes where alignment is very hard and takeoff very fast. This post by David Manheim discusses some of the problems:

One is that there's no clarity, even among people who've made AI research their professional career, about alignment difficulty or takeoff speed. So getting buy-in in advance of clear warning signs will be extremely hard.

The other is that the strategies that might help in situations with hard alignment are at cross purposes to ones in Paul-like worlds with slow takeoff and easy alignment: promoting differential progress vs. creating some kind of global policing system to shut down AI research.

One thing to consider, in terms of finding a better way of striking a balance between deferring to experts and having voters invested, is epistocracy. Jason Brennan talks about why, compared to just having a stronger voice for experts in government, epistocracy might be less susceptible to capture by special interests.

I think this is a good description of what agent foundations is and why it might be needed. But the binary of 'either we get alignment by default or we need to find the True Name' isn't how I think about it.

Rather, there's some unknown parameter, something like 'how sharply does the pressure towards incorrigibility ramp up, what capability level does it start at, how strong is it'?

Setting this at 0 means alignment by default. Setting this higher and higher means we need various kinds of Prosaic alignment strategies which are better at keeping systems corrigible and detecting bad behaviour. And setting it at 'infinity' means we need to find the True Names/foundational insights.


My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes.

Maybe one way of getting at this is to look at ELK - if you think the simplest dumbest ELK proposals probably work, that's Alignment by Default. The harder you think prosaic alignment is, the more complex an ELK solution you expect to need. And if you think we need agent foundations, you think we need a worst-case ELK solution.

Much of the outreach efforts are towards governments, and some to AI labs, not to the general public.

I think that because of the way crisis governance often works, if you're the designated expert in a position to provide options to a government when something's clearly going wrong, you can get buy-in for very drastic actions (see e.g. COVID lockdowns). So the plan is partly to become the designated experts.

I can imagine (not sure if this is true) that even though an 'all of the above' strategy like you suggest seems on paper like it would be the most likely to produce success, you'd get less buy-in from government decision-makers and be less trusted by them in a real emergency if you'd previously been causing trouble with grassroots advocacy. So maybe that's why it's not been explored much.

This post by David Manheim does a good job of explaining how to think about governance interventions, depending on different possibilities for how hard alignment turns out to be:

Like I said in my first comment, the in-practice difficulty of alignment is obviously connected to timeline and takeoff speed.

But you're right that you're talking about the intrinsic difficulty of alignment vs. takeoff speed in this post, not the in-practice difficulty.

But those are also still correlated, for the reasons I gave, mainly that a discontinuity is an essential step in Eliezer-style pessimism and fast takeoff views. I'm not sure how close this correlation is.

Do these views come apart in other possible worlds? I.e. could you believe in a discontinuity to a core of general intelligence but still think prosaic alignment can work?

I think that potentially you can, if you think there are still enough capabilities in pre-HLMI AI (pre-discontinuity) to help you do alignment research before dangerous HLMI shows up. But prosaic alignment seems to require more assumptions to be feasible assuming a discontinuity, like that the discontinuity doesn't occur before all the important capabilities you need to do good alignment research are available.
