Sammy Bahia-Martin. Philosophy and Physics BSc, AI MSc at Edinburgh, starting a PhD at King's College London. Interested in ethics, general philosophy and AI Safety.


Pros and cons of working on near-term technical AI safety and assurance
Answer by SDM · Jun 19, 2021

It depends somewhat on what you mean by 'near-term interpretability'. If you apply that term to research into, for example, improving the stability of, and our ability to access, the 'inner world models' held by large opaque language models like GPT-3, then there's a strong argument that ML-based 'interpretability' research might be one of the best ways of directly working on alignment research.

See this discussion for more:

Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.

Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don't feel scalable to situations when we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff.

So language model transparency/interpretability tools might be useful on the basis of pro 2), and also 1) to some extent, because they will help build tools for interpreting TAI systems and also help align them ahead of time.

1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.

2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback.

3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.

4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.

Taboo "Outside View"

The way I understand it, 'outside view' is relative, and basically means 'relying on more reference-class forecasting / less gears-level modelling than whatever the current topic of discussion is relying on'. So if we're discussing a gears-level model of how a computer chip works in the context of whether we'll ever get a 10 OOM improvement in computing power, bringing up Moore's law and general trends would be using an 'outside view'.

If we're talking about very broad trend extrapolation, then the inside view is already not very gears-level. So suppose someone says GWP is improving hyperbolically so we'll hit a singularity in the next century. An outside view correction to that would be 'well for x and y reasons we're very unlikely a priori to be living at the hinge of history so we should lower our credence in that trend extrapolation'. 

So someone bringing up broad priors or the anti-weirdness heuristic when we're talking about extrapolating trends would be moving to a 'more outside' view. Someone bringing up a trend when we're talking about a specific model would be using an 'outside view'. In each case, you're sort of zooming out to rely on a wider selection of (potentially less relevant) evidence than you were before.


Note that what I'm trying to do here isn't to counter your claim that the term isn't useful anymore but just to try and see what meaning the broad sense of the term might have, and this is the best I've come up with. Since what you mean by outside view shifts dependent on context, it's probably best to use the specific thing that you mean by it in each context, but there is still a unifying theme among the different ideas.

Open and Welcome Thread – June 2021

Everyone says the Culture novels are the best example of an AI utopia, and even though it's a cliché to mention the Culture, it's a cliché for a good reason. Don't start with Consider Phlebas (the first one), but otherwise just dive in. My other recommendation is the Commonwealth Saga by Peter F Hamilton and the later Void Trilogy - it's not on the same level of writing quality as the Culture, although still a great story, but it depicts an arguably superior world to that of the Culture, with more unequivocal support of life extension and transhumanism.

The Commonwealth has effective immortality; a few of its downsides are even noticeable (their culture and politics are a bit more stagnant than we might like), but there's never any doubt at all that it's worth it, and it's barely commented on in the story. The latter-day Void Trilogy Commonwealth is probably the closest a work of published fiction has come to depicting a true eudaemonic utopia that lacks the problems of the Culture.

Looking for reasoned discussion on Geert Vanden Bossche's ideas?

According to this article, the key claim of (2) - that natural antibodies are superior to vaccine antibodies and are permanently replaced by them - is just wrong ('absolute unvarnished nonsense' was the quote). One or the other is right, and we just need someone who actually knows immunology to tell us which.

Principally, antibodies against SARS-CoV-2 could be of value if they are neutralizing. Bossche presents no evidence to support that natural IgM is neutralizing (rather than just binding) SARS-CoV-2.

Covid 6/3: No News is Good News

We have had quite significant news accumulating this week; Delta (Indian) COVID-19 has been shown to be (central estimate) 2.5 times deadlier than existing strains and the central estimate of its increased transmissibility is down a bit, but still 50-70% on top of B.1.1.7 (just imagine if we'd had to deal with Delta in early 2020!), though with a fairly small vaccine escape. This thread gives a good summary of the modelling and likely consequences for the UK, and also more or less applies to most countries with high vaccination rates like the US.

For the US/UK this situation is not too concerning, certainly not compared to March or December 2020, and if restrictions are held where they are now, R_t will likely soon go under 1 as vaccination rates increase. However, there absolutely can be a large exit wave that could push up hospitalizations and, in the reasonable worst case, lead to another lockdown. Also, the outlook for the rest of the world is significantly worse than it looked just a month ago thanks to this variant - see this by Zeynep Tufekci.

The data is preliminary, and I really hope that the final estimate ends up as low as possible. But coupled with what we are observing in India and in Nepal, where it is rampant, I fear that the variant is a genuine threat.

In practical terms, to put it bluntly, it means that the odds that the pandemic will end because enough people have immunity via getting infected rather than being vaccinated just went way up. 

Effective Altruists may want to look to India-like oxygen interventions in other countries over the next couple of months.

What will 2040 probably look like assuming no singularity?
Answer by SDM · May 16, 2021

You cover most of the interesting possibilities on the military technology front, but one thing you don't mention that might matter, especially considering the recent near-breakdowns of some of the nuclear weapons treaties (e.g. New START), is the further proliferation of nuclear weapons. This includes fourth-generation nuclear weapons like nuclear shaped-charge warheads, pure fusion and sub-kiloton devices, and tactical nuclear weapons - and more countries fitting cruise missiles or drones with nuclear capability, which might be a destabilising factor. If laser technology is sufficiently developed, we may also see other forms of directed-energy weapons becoming more common, such as electron beam weapons or electrolasers.

Covid 5/13: Moving On

For anyone reading, please consider following in Vitalik's footsteps and donating to the GiveIndia Oxygen fundraiser, which likely beats GiveWell's top charities in terms of life-years saved per dollar.

One of the more positive signs I've seen recently is that well-informed elite opinion (going by, for example, the Economist's editorials) has started to shift towards scepticism of institutions and a recognition of how badly they've failed. Among the people who matter for policymaking, the scale of the failure has not been swept under the rug. See here:

We believe that Mr Biden is wrong. A waiver may signal that his administration cares about the world, but it is at best an empty gesture and at worst a cynical one.


Economists’ central estimate for the direct value of a course is $2,900—if you include factors like long covid and the effect of impaired education, the total is much bigger. 

This strikes me as the sort of remark I'd expect to see in one of these comment threads, which has to be a good sign.

In that same issue, we also saw the first serious attempt that I've seen to calculate the total death toll of Covid, accounting for all reporting biases, throughout the world. The Economist was the only publication I've seen that didn't parrot the almost-meaningless official death toll figures. The true answer is, of course, horrifying: between 7.1m and 12.7m dead, with a central estimate of 10.2m - this unfortunately means that we ended up with the worst case scenario I imagined back in late February. Moreover, we appear to currently be at the deadliest point of the entire pandemic.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Great post! I'm glad someone has outlined in clear terms what these failures look like, rather than the nebulous 'multiagent misalignment', as it lets us start on a path to clarifying what (if any) new mitigations or technical research are needed.

The agent-agnostic perspective is a very good innovation for thinking about these problems - the line between agentive and non-agentive behaviour is often not clear, and it's not as if there is a principled metaphysical distinction between the two (e.g. Dennett and the Intentional Stance). Currently, big corporations can be weakly modelled this way while individual humans are fully agentive, but transformative AI will bring up a whole range of more and less agentive things that will fill in the rest of this spectrum.


There is a sense in which, if the outcome is something catastrophic, there must have been misalignment, and if there was misalignment then in some sense at least some individual agents were misaligned. Specifically, the systems in your Production Web weren't intent-aligned because they weren't doing what we wanted them to do, and were at least partly deceiving us. Assuming this is the case, 'multipolar failure' requires some subset of intent misalignment. But it's a special subset because it involves different kinds of failures to the ones we normally talk about.

It seems like you're identifying some dimensions of intent alignment as those most likely to be neglected because they're the hardest to catch, or because there will be economic incentives to ensure AI isn't aligned in that way, rather than saying that there is some sense in which the transformative AI in the production web scenario is 'fully aligned' but still produces an existential catastrophe.

I think that the difference between your Production Web and Paul Christiano's subtle creeping Outer Alignment failure scenario is just semantic - you say that the AIs involved are aligned in some relevant sense while Christiano says they are misaligned.

The further question then becomes: how clear is the distinction between multiagent alignment and 'all of alignment except multiagent alignment'? This is where your claim of 'problems before solutions' actually does become an issue - given that the systems going wrong in Production Web aren't intent-aligned (I think you'd agree with this), at a high level the overall problem is the same in single- and multi-agent scenarios.

So for it to be clear that there is a separate multiagent problem to be solved, we have to have some reason to expect that the solutions currently intended to solve single agent intent alignment aren't adequate, and that extra research aimed at examining the behaviour of AI e.g. in game theoretic situations, or computational social choice research, is required to avert these particular examples of misalignment.

A related point - as with single agent misalignment, the Fast scenarios seem more certain to occur, given their preconditions, than the slow scenarios.

A certain amount of stupidity and lack of coordination persisting for a while is required in all the slow scenarios - for instance, the systems involved in Production Web being allowed to proliferate and be used more and more, even if an opportunity to coordinate and shut the systems down exists and there are reasons to do so. There isn't an exact historical analogy for that type of stupidity so far, though a few things come close (e.g. the Covid response, the lead-up to WW2, the Cuban Missile Crisis).

As with single agent fast takeoff scenarios, in the fast stories there is a key 'treacherous turn' moment where the systems suddenly go wrong, which requires much less lack of coordination to be plausible than the slow Production Web scenarios.

Therefore, multipolar failure is less dangerous if takeoff is slower, but the difference in risk between slow vs fast takeoff for multipolar failure is unfortunately a lot smaller than the slow vs fast risk difference for single agent failure (where the danger is minimal if takeoff is slow enough). So multiagent failures seem like they would be the dominant risk factor if takeoff is sufficiently slow.

SDM's Shortform

Yes, it's very oversimplified - in this case 'capability' just refers to whatever enables RSI, and we assume that it's a single dimension. Of course it isn't, but we assume the capability can be modelled this way as a very rough approximation.

Physical limits are another thing the model doesn't cover - you're right to point out that in the intelligence explosion/full RSI scenarios the graph goes vertical only for a time, until some limit is hit.

SDM's Shortform

Update to 'Modelling Continuous Progress'

I made an attempt to model intelligence explosion dynamics in this post, by attempting to make the very oversimplified exponential-returns-to-exponentially-increasing-intelligence model used by Bostrom and Yudkowsky slightly less oversimplified.

This post tries to build on a simplified mathematical model of takeoff which was first put forward by Eliezer Yudkowsky and then refined by Bostrom in Superintelligence, modifying it to account for the different assumptions behind continuous, fast progress as opposed to discontinuous progress. As far as I can tell, few people have touched these sorts of simple models since the early 2010s, and no one has tried to formalize how newer notions of continuous takeoff fit into them. I find that it is surprisingly easy to accommodate continuous progress, and that the results are intuitive and fit with what has already been said qualitatively about continuous progress.

The page includes python code for the model.

This post doesn't capture all the views of takeoff - in particular it doesn't capture the non-hyperbolic faster growth mode scenario, where marginal intelligence improvements are exponentially increasingly difficult, and therefore we get a (continuous or discontinuous switch to a) new exponential growth mode rather than runaway hyperbolic growth.

But I think that by modifying the f(I) function that determines how RSI capability varies with intelligence we can incorporate such views.

In the context of the exponential model given in the post, that would correspond to an f(I) function which would result in a continuous switch (how continuous being determined by the size of d) to a single faster exponential growth mode.

But I think the model still roughly captures the intuition behind scenarios that involve either a continuous or a discontinuous step to an intelligence explosion.

Given the model assumptions, we see how the different scenarios look in practice:

If we plot potential AI capability over time, we can see how no new growth mode (brown) vs a new growth mode (all the rest), the presence of an intelligence explosion (red and orange) vs not (green and purple), and the presence of a discontinuity (red and purple) vs not (orange and green) affect the takeoff trajectory.
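The post's own python code isn't reproduced here, so the following is only a minimal sketch of the kind of model being described. The specific forms are my assumptions, not the post's exact equations: f(I) is taken to be a logistic ramp whose steepness d controls how continuous the switch is, and the growth law is assumed to be dI/dt = c·I·(1 + f(I)).

```python
import math

# Illustrative sketch only: f(I) is an *assumed* logistic ramp, and the growth
# law dI/dt = c*I*(1 + f(I)) is an assumed stand-in for the post's model.

def f(I, d, I_agi=10.0):
    """Assumed RSI-capability function: ramps from ~0 to ~1 around I = I_agi.
    Large d gives a near-discontinuous switch; small d a gradual one."""
    return 1.0 / (1.0 + math.exp(-d * (I - I_agi)))

def simulate(d, c=0.05, I0=1.0, dt=0.01, steps=2000):
    """Euler-integrate dI/dt = c*I*(1 + f(I, d)); returns the trajectory."""
    I = I0
    traj = [I]
    for _ in range(steps):
        I += dt * c * I * (1.0 + f(I, d))
        traj.append(I)
    return traj

gradual = simulate(d=0.5)   # smooth transition to the faster growth mode
sharp = simulate(d=50.0)    # near-discontinuous switch
```

Plotting `gradual` and `sharp` against time would reproduce the qualitative distinction between a continuous switch and a near-discontinuity that the scenarios above describe.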
