Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Cross-posted to the EA forum.


Like last year and the year before, I’ve attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to that of a securities analyst with regard to possible investments. To my knowledge no-one else has attempted to do this, so I've once again undertaken the task.

This year I have included several groups not covered in previous years, and read more widely in the literature.

My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should give a sense for the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2019 budgets to get a sense of urgency.

Note that this document is quite long, so I encourage you to just read the sections that seem most relevant to your interests, probably the sections about the individual organisations. I do not recommend you skip to the conclusions!

I’d like to apologize in advance to everyone doing useful AI Safety work whose contributions I may have overlooked or misconstrued.

Methodological Considerations

Track Records

Judging organisations on their historical output is naturally going to favour more mature organisations. A new startup, whose value all lies in the future, will be disadvantaged. However, I think that this is correct. The newer the organisation, the more funding should come from people with close knowledge. As organisations mature, and have more easily verifiable signals of quality, their funding sources can transition to larger pools of less expert money. This is how it works for startups turning into public companies and I think the same model applies here.

This judgement involves analysing a large number of papers relating to Xrisk that were produced during 2018. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric. I also attempted to include papers during December 2017, to take into account the fact that I'm missing the last month's worth of output from 2017, but I can't be sure I did this successfully.

This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, who all do a lot of work on other issues.

We focus on papers, rather than outreach or other activities. This is partly because papers are much easier to measure: while there has been a large increase in interest in AI safety over the last year, it’s hard to work out who to credit for this. It is also partly because I think progress has to come by persuading AI researchers, which I think comes through technical outreach and publishing good work, not popular/political work.


My impression is that policy on technical subjects (as opposed to issues that attract strong views from the general population) is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. top ML researchers at Google, CMU & Baidu) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant opposition to GM foods or nuclear power. We don't want the 'us-vs-them' situation, that has occurred with climate change, to happen here. AI researchers who are dismissive of safety law, regarding it as an imposition and encumbrance to be endured or evaded, will probably be harder to convince of the need to voluntarily be extra-safe - especially as the regulations may actually be totally ineffective. The only case I can think of where scientists are relatively happy about punitive safety regulations, nuclear power, is one where many of those initially concerned were scientists themselves. Given this, I actually think policy outreach to the general population is probably negative in expectation.

If you’re interested in this I’d recommend you read this blog post (also reviewed below).


I think there is a strong case to be made that openness in AGI capacity development is bad. As such I do not ascribe any positive value to programs to ‘democratize AI’ or similar.

One interesting question is how to evaluate non-public research. For a lot of safety research, openness is clearly the best strategy. But what about safety research that has, or potentially has, capabilities implications, or other infohazards? In this case it seems best if the researchers do not publish it. However, this leaves funders in a tough position – how can we judge researchers if we cannot read their work? Maybe instead of doing top secret valuable research they are just slacking off. If we donate to people who say “trust me, it’s very important and has to be secret” we risk being taken advantage of by charlatans; but if we refuse to fund, we incentivize people to reveal possible infohazards for the sake of money. (Is it even a good idea to publicise that someone else is doing secret research?)

With regard to published research, in general I think it is better for it to be open access, rather than behind journal paywalls, to maximise impact. Reducing this impact by a significant amount in order for the researcher to gain a small amount of prestige does not seem like an efficient way of compensating researchers to me. Thankfully this does not occur much with CS papers as they are all on arXiv, but it is an issue for some strategy papers.

More prosaically, organisations should make sure to upload the research they have published to their website! Having gone to all the trouble of doing useful research it is a shame how many organisations don’t take this simple step to significantly increase the reach of their work.

Research Flywheel

My basic model for AI safety success is this:

  1. Identify interesting problems
    1. As a byproduct this draws new people into the field through nerd-sniping
  2. Solve interesting problems
    1. As a byproduct this draws new people into the field through credibility and prestige
  3. Repeat

One advantage of this model is that it produces both object-level work and field growth.

There is also some value in arguing for the importance of the field (e.g. Bostrom’s Superintelligence) or addressing criticisms of the field.

Noticeably absent are strategic pieces. In previous years I have found these helpful; however, lately fewer seem to yield incremental updates to my views, so I generally ascribe lower value to these. This does not apply to technical strategy pieces, about e.g. whether CIRL or Amplification is a more promising approach.

Near vs Far Safety Research

One approach is to research things that will make contemporary ML systems more safe, because you think AGI will be a natural outgrowth from contemporary ML, and this is the only way to get feedback on your ideas. I think of this approach as being exemplified by Concrete Problems. You might also hope that even if ML ends up leading us into another AI Winter, the near-term solutions will generalize in a useful way, though this is of course hard to judge. To the extent that you endorse this approach, you would probably be more likely to donate to CHAI.

Another approach is to try to reason directly about the sorts of issues that will arise with superintelligent AI, and won’t get solved anyway / rendered irrelevant as a natural side effect of ordinary ML research. To the extent that you endorse this approach, you would probably be more likely to donate to MIRI, especially for their Agent Foundations work.

I am not sure how to relatively value these two things.

There are a number of other topics that often get mentioned as AI Safety issues. I generally do not think it is important to support organisations or individuals working on these issues unless there is some direct read-through to AGI safety.

I have heard it argued that we should become experts in these areas in order to gain credibility and influence for the real policy work. However, I am somewhat sceptical of this, as I suspect that as soon as a domain is narrow-AI-solved it will cease to be viewed as AI.

Autonomous Cars

My view is that the localised nature of any tragedies plus the strong incentive alignment mean that private companies will solve this problem by themselves.


While technological advances continually mechanise and replace labour in individual categories, they also open up new ones. Contemporaneous unemployment has more to do with poor macroeconomic policy and inflexible labour markets than robots. AI strong enough to replace humans in basically every job is essentially AGI-complete. At that point we should be worried about survival, and if we solve the alignment problem well enough to prevent extinction we will likely have also solved it well enough to prevent mass unemployment (or at least the negative effects of such, if you believe the two can be separated).

There has been an increase in interest in a ‘Basic Income’ – an unconditional cash transfer given to all citizens – as a solution to AI-driven unemployment. I think this is a big mistake, and largely motivated reasoning by people who would have supported it anyway. In a Hansonian scenario, all meat-based humanity has is our property rights. If property rights are strong, we will become very rich. If they are weak, and the policy is that every agent gets a fair share, all the wealth will be eaten up as Malthusian EMs massively outnumber physical humans, driving the basic income down to the price of some cycles on AWS.


The vast majority of discussion in this area seems to consist of people who are annoyed that ML systems are learning based on the data, rather than based on the prejudices/moral views of the writer. While in theory this could be useful for teaching people about the difficulty of the alignment problem, the complexity of human value, etc., in practice I doubt this is the case. This presentation is one of the better I have seen on the subject.

Other Existential Risks

Some of the organisations described below also do work on other existential risks, for example GCRI, FLI and CSER. I am not an expert on other Xrisks, so I find work on them hard to evaluate, but it seems likely that many people who care about AI Alignment will also care about them, so I will mention publications in these areas. The exception is climate change, which is highly non-neglected.

Financial Reserves

Charities like having financial reserves to provide runway, and guarantee that they will be able to keep the lights on for the immediate future. This could be justified if you thought that charities were expensive to create and destroy, and were worried about this occurring by accident due to the whims of donors.

Donors prefer charities not to hold too much in reserves. Firstly, those reserves are cash that could be being spent on outcomes now, by either the specific charity or others. Valuable future activities by charities are supported by future donations; they do not need to be pre-funded. Additionally, having reserves increases the risk of organisations ‘going rogue’, because they are insulated from the need to convince donors of their value.

As such, in general I do not give full credence to charities saying they need more funding because they want more than a year of runway in the bank. A year’s worth of reserves should provide plenty of time to raise more funding.

It is worth spending a moment thinking about the equilibrium here. If donors target a lower runway number than charities, charities might curtail their activities to allow their reserves to last for longer. At this lower level of activities, donors would then decide a lower level of reserves is necessary, and so on, until eventually the overly conservative charity ends up with a budget of zero, with all the resources instead given to other groups who turn donations into work more promptly. This allows donor funds to be turned into research more quickly.

I estimated reserves = (cash and grants) / (2019 budget – committed annual funding). In general I think of this as something of a measure of urgency. This is a simpler calculation than many organisations (MIRI, CHAI etc.) shared with me, because I want to be able to compare consistently across organisations. I attempted to compare the amount of reserves different organisations had, but found this rather difficult. Some organisations were extremely open about their financing (thank you CHAI!). Others were less so. As such these should be considered suggestive only.
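As a rough illustration of the reserves estimate above, here is a minimal Python sketch of the arithmetic. The function name `runway_years` and the dollar figures are hypothetical, chosen only to show the calculation; they are not any organisation's actual numbers.

```python
def runway_years(cash_and_grants: float,
                 budget_2019: float,
                 committed_annual_funding: float = 0.0) -> float:
    """Reserves in years: (cash and grants) / (2019 budget - committed annual funding)."""
    # The part of next year's budget that is not already covered by committed funding.
    uncovered_budget = budget_2019 - committed_annual_funding
    if uncovered_budget <= 0:
        # Committed funding already covers the whole budget, so reserves never run down.
        return float("inf")
    return cash_and_grants / uncovered_budget


# Hypothetical figures, for illustration only.
example = runway_years(cash_and_grants=3_000_000,
                       budget_2019=2_500_000,
                       committed_annual_funding=500_000)
print(example)  # 1.5
```

On this measure a lower number suggests more urgency, though as noted the inputs are only as good as what each organisation chose to share.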

Donation Matching

In general I believe that charity-specific donation matching schemes are somewhat dishonest, despite my having provided matching funding for at least one in the past.

Ironically, despite this view being espoused by GiveWell (albeit in 2011), this is basically OpenPhil’s policy of, at least in some cases, artificially limiting their funding to 50% of a charity’s need, which some charities argue (though not OpenPhil themselves, that I recall) effectively provides a 1:1 match for outside donors. I think this is bad. In the best case this forces outside donors to step in, imposing marketing costs on the charity and research costs on the donors. In the worst case it leaves valuable projects unfunded.

Obviously cause-neutral donation matching is different and should be exploited. Everyone should max out their corporate matching programs if possible, and things like the annual Facebook Match and the quadratic-voting match were great opportunities.

Poor Quality Research

Partly thanks to the efforts of the community, the field of AI safety is considerably more well respected and funded than was previously the case, which has attracted a lot of new researchers. While generally good, one side effect of this (perhaps combined with the fact that many low-hanging fruits of the insight tree have been plucked) is that a considerable amount of low-quality work has been produced. For example, there are a lot of papers which can be accurately summarized as asserting “just use ML to learn ethics”. Furthermore, the conventional peer review system seems to be extremely bad at dealing with this issue.

The standard view here is just to ignore low quality work. This has many advantages, for example 1) it requires little effort, 2) it doesn’t annoy people. This conspiracy of silence seems to be the strategy adopted by most scientific fields, except in extreme cases like anti-vaxers.

However, I think there are some downsides to this strategy. A sufficiently large milieu of low-quality work might degrade the reputation of the field, deterring potentially high-quality contributors. While low-quality contributions might help improve Concrete Problems’ citation count, they may use up scarce funding.

Moreover, it is not clear to me that ‘just ignore it’ really generalizes as a community strategy. Perhaps you, enlightened reader, can judge that “How to solve AI Ethics: Just use RNNs” is not great. But is it really efficient to require everyone to independently work this out? Furthermore, I suspect that the idea that we can all just ignore the weak stuff is somewhat an example of typical mind fallacy. Several times I have come across people I respect according respect to work I found blatantly rubbish. And several times I have come across people I respect arguing persuasively that work I had previously respected was very bad – but I only learnt they believed this by chance! So I think it is quite possible that many people will waste a lot of time as a result of this strategy, especially if they don’t happen to move in the right social circles.

Finally, I will note that the two examples which spring to mind of cases where the EA community has forthrightly criticized people for producing epistemically poor work – namely Intentional Insights and ACE – seem ex post to have been the right thing to do, although in both cases the targets were inside the EA community, rather than vaguely-aligned academics.

Having said all that, I am not a fan of unilateral action, so will largely continue to abide by this non-aggression convention. My only deviation here is to make it explicit – though see this by 80,000 Hours.

The Bay Area

Much of the AI and EA communities, and especially the EA community concerned with AI, is located in the Bay Area, especially Berkeley and San Francisco. This is an extremely expensive place, and is dysfunctional both politically and socially. A few months ago I read a series of stories about abuse in the Bay and was struck by how many things I considered abhorrent were in the story merely as background. In general I think the centralization is bad, but if there must be centralization I would prefer it be almost anywhere other than Berkeley. Additionally, I think many funders are geographically myopic, and biased towards funding things in the Bay Area. As such, I have a mild preference towards funding non-Bay-Area projects. If you’re interested in this topic I recommend reading this or this or this.

Organisations and Research

MIRI: The Machine Intelligence Research Institute

MIRI is the largest pure-play AI existential risk group. Based in Berkeley, it focuses on mathematics research that is unlikely to be produced by academics, trying to build the foundations for the development of safe AIs. It was founded by Eliezer Yudkowsky and is led by Nate Soares.

Historically they have been responsible for much of the germination of the field, including advocacy, but are now focused on research. In general they do very ‘pure’ mathematical work, in comparison to other organisations with more ‘applied’ ML or strategy focuses. I have historically been impressed with their research.

Their agent foundations work is basically trying to develop the correct way of thinking about agents and learning/decision making by spotting areas where our current models fail and seeking to improve them.


Garrabrant and Demski's Embedded Agency Sequence is a short sequence of blog posts outlining MIRI's thinking about Agent Foundations. It describes the issues about how to reason about agents that are embedded in their environment. I found it to be a very intuitive explanation of many issues that MIRI is working on. However, little of it will be new to someone who has worked through MIRI's previous, less accessible work on the subject.

Yudkowsky and Christiano's Challenges to Christiano's Capability Amplification Proposal discusses Eliezer's objections to Paul's Amplification agenda in back-and-forth blog format. Eliezer has several objections. At a high level, Paul is attempting a more direct solution, working largely within the existing ML framework, vs MIRI's desire to work on things like agent foundations first. Eliezer is concerned that most aggregation/amplification methods do not preserve alignment, and that finding one that does (and building the low level agents) is essentially as hard as solving the alignment problem. Any loss of alignment would be multiplied with every level of amplification. Thirdly, there may be many problems that need sequential work - additional bandwidth does not suffice. Additionally, he objects that Paul's ideas would likely be far too slow, due to the huge amount of human input required. This was an interesting post, but I think it could have been clearer. Researchers from OpenAI were also named authors on the paper.

Yudkowsky's The Rocket Alignment Problem is a blog post presenting a Galileo-style dialogue/analogy for why MIRI is taking a seemingly indirect approach to AI Safety. It was enjoyable, but I'm not sure how convincing it would be to outsiders. I guess if you thought a deep understanding of the target domain was never necessary it could provide an existence proof.

Demski's An Untrollable Mathematician Illustrated provides a very accessible explanation of some results about logical induction.

MIRI researchers also appeared as co-authors on:

Non-disclosure policy

Last month MIRI announced their new policy of nondisclosure-by-default:

[G]oing forward, most results discovered within MIRI will remain internal-only unless there is an explicit decision to release those results, based usually on a specific anticipated safety upside from their release.

This is a significant change from their previous policy. As of circa a year ago my understanding was that MIRI would be doing secret research largely in addition to their current research programs, not that all their programs would become essentially secret.

At the same time secrecy at MIRI is not entirely new. I’m aware of at least one case from 2010 where they decided not to publish something for similar reasons; as far as I’m aware this thing has never been ‘declassified’ – indeed perhaps it has been forgotten.

In any case, one consequence of this is that for 2018 MIRI has published essentially nothing. (Exceptions to this are discussed above).

I find this very awkward to deal with.

On the one hand, I do not want people to be pressured into premature disclosure for the sake of funding. This space is sufficiently full of infohazards that secrecy might be necessary, and in its absence researchers might prudently shy away from working on potentially risky things - in the same way that no-one in business sends sensitive information over email any more. MIRI are in exactly the sort of situation that you would expect might give rise to the need for extreme secrecy. If secret research is a necessary step en route to saving the world, it will have to be done by someone, and it is not clear there is anyone much better.

On the other hand, I don’t think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple ones would be “we failed to produce anything publishable” or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.”

Additionally, by hiding the highest quality work we risk impoverishing the field, making it look unproductive and unattractive to potential new researchers.

One possible solution would be for the research to be done by impeccably deontologically moral people, whose moral code you understand and trust. Unfortunately I do not think this is the case with MIRI. (I also don’t think it is the case with many other organisations, so this is not a specific criticism of MIRI, except insomuch as you might have held them to a higher standard than others).

Another possible solution would be for major donors to be insiders, who read the secret stuff and can verify it is worth supporting. If the organisation also wanted to keep small donors the large donors could give their seal of approval; otherwise the organisation could simply decide it did not need them any more. However, if MIRI are adopting this strategy they are keeping it a secret from me! Perhaps this is reassuring about their ability to keep secrets.

Perhaps we hope that MIRI employees would leak information of any wrongdoing, but not leak potential info-hazards?

Finally, I will note that MIRI have been very generous with their time in attempting to help me understand what they are doing.


According to MIRI they have around 1.5 years of expenses in reserve, and their 2019 estimated budget is around $4.8m. This does not include the potential purchase of a new office they are considering.

There is prima facie counterfactually valid matching funding available from REG’s Double Up Drive.

If you wanted to donate to MIRI, here is the relevant web page.

FHI: The Future of Humanity Institute

FHI is a well-established research institute, affiliated with Oxford and led by Nick Bostrom. Compared to the other groups we are reviewing they have a large staff and large budget. As a relatively mature institution they produced a decent amount of research over the last year that we can evaluate. They also do a significant amount of outreach work.

Their research is more varied than MIRI's, including strategic work, work directly addressing the value-learning problem, and corrigibility work.


Armstrong and O'Rourke's ‘Indifference’ methods for managing agent rewards provides an overview of Stuart's work on Indifference. These are methods that try to prevent agents from manipulating a certain event, or ignore it, or change utility function without trying to fight it. In the paper they lay out extensive formalism and prove some results. Some but not all will be familiar to people who have been following his other work in the area. The key to understanding why the utility function in the example is defined the way it is, and vulnerable to the problem described in the paper, is that we do not directly observe age - hence the need to base it on wristband status. I found the example a little confusing because it could also be solved by just scaling up the punishment for mis-identification that is caught, in line with Becker's Crime and Punishment: An Economic Approach (1974), but this approach wouldn't work if you didn't know the probabilities ahead of time. Overall I thought this was an excellent paper. Researchers from ANU were also named authors on the paper.
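To make the 'scale up the punishment' alternative concrete, here is a minimal sketch of the Becker-style arithmetic. It assumes a simplified model, not taken from the paper, in which misbehaving yields a fixed gain and a penalty is paid only when caught; the function names and numbers are hypothetical.

```python
def deterrent_penalty(gain: float, p_caught: float) -> float:
    """Smallest penalty that makes the expected net gain from misbehaving zero."""
    if not 0 < p_caught <= 1:
        raise ValueError("p_caught must be in (0, 1]")
    # Expected penalty is p_caught * penalty, so set it equal to the gain.
    return gain / p_caught


def expected_net_gain(gain: float, penalty: float, p_caught: float) -> float:
    """Expected payoff of misbehaving under the simplified model above."""
    return gain - p_caught * penalty


# Designer assumes a 25% chance of catching mis-identification (hypothetical).
penalty = deterrent_penalty(gain=1.0, p_caught=0.25)
print(penalty)                                  # 4.0
print(expected_net_gain(1.0, penalty, 0.25))    # 0.0: deterred if the assumption holds
print(expected_net_gain(1.0, penalty, 0.125))   # 0.5: not deterred if detection is rarer
```

The last line shows why this fix depends on knowing the probabilities ahead of time: a penalty calibrated to the wrong detection rate leaves misbehaviour profitable.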

Armstrong and Mindermann's Impossibility of deducing preferences and rationality from human policy argues that you cannot infer human preferences from the actions of people who may be irrational in unknown ways. The basic point is quite trivial - that arbitrary irrationalities can mean that any set of values could have produced the observed actions - but at the same time I hadn't internalised why this would be a big problem for the IRL framework, and in any case it is good to have important things written down. More significantly, they also showed that 'simplicity' assumptions will not save us - the 'simplest' solution will (almost definitely) be degenerate. This suggests we do need to 'hard code' some priors about human values into the AI - they suggest beliefs about truthful human utterances (though of course as speech acts are acts all the same, it seems that some of the same problems occur again at this level of meta). Alternatives (not mentioned in the paper) could be to look to psychology or biology (e.g. Haidt or evolutionary biology). Overall I thought this was an excellent paper.
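The basic point can be shown with a toy example. The sketch below is my own one-step illustration rather than the paper's formalism, and all names (`rational`, `anti_rational`, `make_constant`, the drink actions and reward values) are hypothetical. It splits an observed choice into a 'planner' (how rewards become actions) and a reward function, and shows that very different pairs explain the same behaviour.

```python
from typing import Callable, Dict, List, Tuple

Reward = Dict[str, float]
Planner = Callable[[Reward], str]

ACTIONS: List[str] = ["tea", "coffee", "water"]


def rational(reward: Reward) -> str:
    """Picks the action with the highest reward."""
    return max(ACTIONS, key=lambda a: reward[a])


def anti_rational(reward: Reward) -> str:
    """Picks the action with the lowest reward."""
    return min(ACTIONS, key=lambda a: reward[a])


def make_constant(action: str) -> Planner:
    """A planner that ignores the reward entirely and always picks `action`."""
    return lambda reward: action


observed_action = "coffee"

likes_coffee: Reward = {"tea": 0.2, "coffee": 1.0, "water": 0.0}
hates_coffee: Reward = {a: -v for a, v in likes_coffee.items()}
indifferent: Reward = {a: 0.0 for a in ACTIONS}

candidates: List[Tuple[str, Planner, Reward]] = [
    ("rational + likes coffee", rational, likes_coffee),
    ("anti-rational + hates coffee", anti_rational, hates_coffee),
    ("constant planner + indifferent", make_constant(observed_action), indifferent),
]

# Every (planner, reward) pair reproduces the observed behaviour exactly.
compatible = [name for name, planner, reward in candidates
              if planner(reward) == observed_action]
print(compatible)
```

All three pairs are compatible with the observation, so behaviour alone cannot tell us whether the person likes coffee, hates it, or does not care.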

Armstrong and O'Rourke's Safe Uses of AI Oracles suggests two possible safe Oracle designs. The first takes advantage of Stuart's trademark indifference results to build an oracle whose reward is only based on cases where the output after being automatically verified is deleted, and hence cannot attempt to manipulate humanity. I thought this was clever, and it's nice to see some payoff from the indifference machinery he's been working on, though this Oracle only works for NP-style questions, and assumes the verifier cannot be manipulated - which is a big assumption. The paper also includes a simulation of such an Oracle, showing how the restriction affects performance. The rest of the paper describes the more classic technique of restricting an Oracle to give answers simple enough that we hope they're not potentially manipulative, and frequently re-starting the Oracle. Researchers from ANU were also named authors on the paper.

Dafoe's AI Governance: A Research Agenda is an introduction to the issues faced in AI governance for future policy researchers. It seems to do a good job of this. As lowering barriers to entry is important for new fields, this is potentially a very valuable document if you are highly concerned about the governance side of AI. In particular, it covers policy work to address threats from general artificial intelligence as well as near-term narrow AI issues, which is a major plus to me. In some ways it feels similar to Superintelligence.

Sandberg's Human Extinction from Natural Hazard Events provides a detailed overview of extinction risks from natural events. The paper is both detailed and broad, and is something of an updated version of part of Bostrom and Cirkovic's Global Catastrophic Risks. His conclusion is broadly that man-made risks are significantly larger than natural ones. As with any Anders paper it contains a number of interesting anecdotes - for example I hadn't realised that people in 1910 were concerned that Halley's Comet might poison the atmosphere!

Schulze and Evans's Active Reinforcement Learning with Monte-Carlo Tree Search provides an algorithm for efficient reinforcement-learning when learning the reward is costly. In most RL designs the agent always sees the reward; however, this would not be the case with CIRL, because the rewards require human input, which is expensive, so we have to ration it. Here Sebastian and Owain produce a new algorithm, BAMCP++, that tries to address this in an efficient way. The paper provides simulations to show the near-optimality of this algorithm in some scenarios vs failure of rivals, and some theoretical considerations for why things like Thompson Sampling would struggle.

Brundage et al.'s The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation is a massively collaborative policy document on the threats posed by narrow AI. Aimed primarily at policymakers, it does a good job of introducing a wide variety of potential threats. However, it does not really cover existential risks at all, so I suspect the main benefit (from our point of view) is that of credibility-building for later. I am in general sceptical of politicians' ability to help with AI safety, so I relatively downweight this. But if you were concerned about bad actors using AI to attack, this is a good paper for you. Researchers from OpenAI and CSER were also named authors on the paper.

Bostrom's The Vulnerable World Hypothesis introduces and discusses the idea of worlds that will be destroyed 'by default' when they reach a certain level of technological advancement. He distinguishes between a variety of different cases, like if it is easy for individuals to develop weapons of mass destruction, with intuitive names like 'Type-2b vulnerability', and essentially argues for a global police state (or similar) to reduce the risk. It contained a bunch of interesting anecdotes - for example I hadn't realised what little influence the scientists in the Manhattan Project had on the eventual political uses of nukes. However, given its origin I actually found this paper didn't add much new. The areas where it could have added - for example, discussing novel ways of using cryptography to enable surveillance without totalitarianism, discussing Value Drift as a form of existential risk that might be impossible to solve without something like this, or the risks of global surveillance itself being an existential risk (as ironically covered in Caplan's chapter of Global Catastrophic Risks) - were left with only cursory discussion. Additionally, given the nature of governments, I do not think that supporting surveillance is a very neglected area.

Lewis et al.'s Information Hazards in Biotechnology discusses issues around dangerous biology research. They provide an overview, including numerous examples of dangerous discoveries and the policies that were used and their merits.

FHI researchers also appeared as co-authors on:


OpenPhil awarded FHI $13.4m earlier this year, spread out over 3 years, largely (but not exclusively) to fund AI safety research. Unfortunately the write-up I found on the website was even more minimal than last year’s and so is unlikely to be of much assistance to potential donors.

They are currently in the process of moving to a new larger office just west of Oxford.

FHI didn’t reply to my emails about donations, and seem to be more limited by talent (though there are problems with this phrase) than by money, so the case for donating here seems weaker. But it could be a good place to work!

If you wanted to donate to them, here is the relevant web page.

CHAI: The Center for Human-Compatible AI

The Center for Human-Compatible AI, founded by Stuart Russell in Berkeley, launched in August 2016. They have produced a lot of interesting work, especially focused around inverse reinforcement learning. They are significantly more applied and ML-focused than MIRI or FHI (who are more ‘pure’) or CSER or GCRI (who are more strategy-focused). They also do work on non-xrisk related AI issues, which I generally think are less important, but which perhaps have solutions that can be re-used for AGI safety.


Shah's AI Alignment Newsletter is a weekly email of interesting new developments relevant to AI Alignment. It is amazingly detailed. I struggle writing this; I don't know how he keeps track of it all. Overall I thought it was an excellent project.

Mindermann and Shah et al.'s Active Inverse Reward Design turns the reward design process into an interactive one where the agent can 'ask' questions. The idea, as I understand it, is that instead of the programmers creating a one-and-done training reward function which the agent learns about, the agent learns from the reward function, is cognizant of its uncertainties (Inverse Reward Design) and then queries the designer in such a way as to reduce its uncertainty. This seems like exploring the designer's value space in the same way that an RL agent explores its environmental space. It seems like a very clever idea to me, though I would have liked to see more examples in the paper.

Hadfield-Menell and Hadfield's Incomplete Contracting and AI alignment analogises the problem of AI alignment with the economics literature on incentive alignment (for humans). The analysis is generally good, and might lead to useful followups, though most of the readthroughs they drew from the principal-agent literature seem like they are already appreciated in the AI safety community. There was some somewhat novel stuff about signalling models, and about Aghion & Tirole's 1997 paper on incomplete contracting that seemed interesting but I didn't really understand or have time to look into. It also did a nice job of pointing out how much the human problem of incomplete contracting is solved by humans being embedded in a moral and social order, and thus able and willing to do what 'obviously' is 'common sense' in unclear situations - a solution which unfortunately seems FAI-complete for our case. Researchers from OpenAI were also named authors on the paper.

Reddy et al.'s Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behaviour attempt to infer values from agents with incorrect world-models (pace Armstrong and Mindermann's Impossibility paper). They attempt to avoid the impossibility result by first deducing agent beliefs on a task with known goals, and then using those beliefs to infer goals on a new task. While there might not be any tasks with known human goals, you might hope that there are different areas where human goals and beliefs are more or less well understood, which could be utilised by a related approach. As such I was quite pleased by this paper. They also have an n=12 user trial.

Tucker et al.'s Inverse Reinforcement Learning for Video Games apply an IRL algorithm to an Atari game. Given that proving that alignment-congeniality can be achieved with little loss of efficacy is important for convincing the field, and how much status is applied to success at video games, I think this is a good area to pursue.

Filan's Bottle Caps aren't Optimisers is a short blog post about how to identify agents. It argues this is important because we don't want to accidentally create agents.

Milli et al.'s Model Reconstruction from Model Explanations show it is easier to reconstruct a model with queries about gradients than levels. Asking "what are the partial derivatives at this point?" gives more information, and hence makes it easier to reverse-engineer the model, than asking "what is the output at this point?". The paper is framed as being about the desire by some people to make AI models 'accountable' by making them 'explain' their decisions. I think this is not very important, but it does seem to have some relevance to efficiently reconstructing latent *human* value models. Given that we can only query humans so many times, it is important to make efficient use of these queries. Instead of asking "Would you pull the lever?" many times, instead ask "Which factors would make you more likely to pull the lever?". In some sense asking for partial derivatives seems like n queries (for an n-dimensional space), but given that many (most?) of these are likely to be locally negligible this might be an efficient way to help extract human preferences.
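To make the gradients-versus-levels point concrete, here is a minimal sketch using a toy linear model. The linear model, the hidden weights and all function names are my own illustrative assumptions; the paper's actual setting is richer than this. The sketch only shows why a single gradient query can carry as much information as n separate output queries.

```python
# Hypothetical hidden model f(x) = w . x + b that we may only query.
HIDDEN_W = [2.0, -1.0, 0.5]
HIDDEN_B = 3.0


def query_output(x):
    """Answer 'what is the output at this point?'."""
    return sum(w_i * x_i for w_i, x_i in zip(HIDDEN_W, x)) + HIDDEN_B


def query_gradient(x):
    """Answer 'what are the partial derivatives at this point?'.

    For a linear model the gradient is the weight vector everywhere.
    """
    return list(HIDDEN_W)


def reconstruct_from_outputs(n):
    """Recover (w, b) using output queries only: needs n + 1 queries."""
    origin = [0.0] * n
    b = query_output(origin)
    queries = 1
    w = []
    for i in range(n):
        basis = [0.0] * n
        basis[i] = 1.0
        # f(e_i) - f(0) isolates the i-th weight.
        w.append(query_output(basis) - b)
        queries += 1
    return w, b, queries


def reconstruct_from_gradient(n):
    """Recover (w, b) with one gradient query plus one output query."""
    origin = [0.0] * n
    w = query_gradient(origin)
    b = query_output(origin)
    return w, b, 2


if __name__ == "__main__":
    print(reconstruct_from_outputs(3))
    print(reconstruct_from_gradient(3))
```

In this toy case both routes recover the same model, but the number of output queries grows with the dimension while the gradient route stays at two queries.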

Shah et al.'s Value Learning Sequence is a short sequence of blog posts outlining the specification problem. This is basically how to specify even in theory what we might want an AI to do. It is a nice introduction to many of the issues, like why imitation learning is not enough. Most of what has been published so far is not that new, though apparently it is still ongoing. Researchers from FHI also contributed posts.

Reddy et al.'s Shared Autonomy via Deep Reinforcement Learning design an RL system that is intended to operate simultaneously with a human, preventing the human from taking very bad actions, despite not fully understanding the human's goals.

Hadfield-Menell et al.'s Legible Normativity for AI Alignment: The Value of Silly Rules build an RL/Game Theory model for why we might want AI agents to obey and enforce even 'silly' rules. Basically the idea is that fidelity to, and enforcement of, silly rules provides credible signals that important rules will also be enforced - and their failure to be enforced is also useful information that the group is not strong enough to defend itself so agents can quit earlier. I was a little confused by the conclusion, which suggested that agents would have to learn the difference between silly and non-silly rules. Wouldn't this undermine the signalling value?

CHAI researchers also appeared as co-authors on:


Based on detailed financials they shared with me I estimate they have around 2 years worth of expenses in reserve (including grants promised but not yet disbursed), with a 2019 budget of around $3m.

If you wanted to donate to them, here is the relevant web page.

CSER: The Center for the Study of Existential Risk

CSER is an existential risk focused group located in Cambridge. Like GCRI they do work on a variety of existential risks, with more of a focus on strategy than FHI, MIRI or CHAI.

Strategic work is inherently tied to outreach, like lobbying the UK government, which is hard to evaluate and assign responsibility for.

In the past I have criticised them for a lack of output. It is possible they had timing issues whereby a substantial amount of work was done in earlier years but only released more recently. In any case they have published more in 2018 than in previous years.

CSER’s researchers seem to select a somewhat eclectic group of research topics, which I worry may reduce their effectiveness.


Liu and Price's Ramsey and Joyce on deliberation and prediction discusses whether agents can have credences on which decision they'll make while they're in the process of deciding. This builds on their previous work in Heart of DARCness. The relevance to AI safety is presumably via MIRI's 5-10 problem, and how to model agents who think about themselves as part of the world, which I didn't appreciate when I read Heart of DARCness. In particular, it discusses agents with sub agents. Having said that, a lot of the paper seemed to rest on terminological distinctions.

Currie's Existential Risk, Creativity & Well-Adapted Science argues that the professionalisation of science encourages 'cautious' research, whereas Xrisk requires more creativity. Essentially it argues that many institutional factors push scientists towards exploitation over exploration. In general I found this convincing, though pace Currie I think the small number of Professorships compared to the number of PhDs actually *encourages* risk-taking, as the value of out-of-the-money call options increases with volatility. I found his argument that Xrisk research needs unusually large amounts of creativity not entirely convincing - while I agree that novel threats like AI require this, his example of solar flares seems like the sort of threat that could be addressed in a diligent, rather than genius, fashion. The paper has some pertinence for how we fund the Xrisk movement - in particular I think it pulls in favour of many small grants to 'citizen scientists', rather than large grants towards organisations.
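The option analogy can be checked numerically. The sketch below uses the standard Black-Scholes formula for a European call, which is my own choice of pricing model rather than anything from the paper, and the spot, strike and volatility figures are purely illustrative assumptions.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_value(spot, strike, vol, years, rate=0.0):
    """Black-Scholes value of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (
        vol * math.sqrt(years)
    )
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


if __name__ == "__main__":
    # An out-of-the-money call: the strike is well above the current price.
    for vol in (0.2, 0.4, 0.8):
        print(vol, round(call_value(100.0, 150.0, vol, 1.0), 2))
```

The printed values rise with volatility: when the payoff only arrives if you clear a high bar, a wider spread of outcomes is worth more, which is the sense in which a scarce Professorship could reward riskier research.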

Rees's On The Future is a quick-read pop-sci book about the future of humanity. It includes a brief discussion of AI risk, and the section on the risks posed by high-energy physics experiments was new to me. Many topics are discussed only in a very cursory way however, and I agree with Robin's review - the book would have benefited from being proofread by an economist, or simply someone who does not share the author's political views.

Shahar and Shapira's Civ V AI Mod is a mod for Civ V (PC game) that adds superintelligence research into the game. This is the novel publicity effort I alluded to last year. It generated some media attention, which seemed less bad than I expected.

Currie's Introduction: Creativity, Conservatism & the Social Epistemology of Science is a general introduction to some issues about how risk-taking (or not) institutional science is.

Shahar's Mavericks and Lotteries describes various ways in which allocating research funding by lottery, rather than through peer review, might be better. In particular he argues it would make institutional science less conservative. I am sceptical of this, however: the proposals still feature filtering proposals for being "good enough", and in equilibrium the standard for being "good enough" may just rise to where the peer review standard was before. Additionally, I'm not sure I see a very strong link to existential risk - I guess OpenPhil could adopt randomisation? Expecting to reform all of science funding as a path to Xrisk reduction seems *very* indirect.

Currie's Geoengineering Tensions discusses the pros and cons of geoengineering, and the difficulties of doing experiments in the field. It discusses two tensions: firstly the moral hazard risk, and secondly the difficulty of doing the necessary experiments given the conservatism of institutional science.

Adrian Currie edited a ‘special issue’, Futures of Research in Catastrophic and Existential Risk, which I think is basically a journal of articles they in some sense commissioned or collected. Currie and Ó hÉigeartaigh's Working together to face humanity's greatest threats: Introduction to The Future of Research on Catastrophic and Existential Risk provides an overview of the topics discussed in the edition. In general these are not so much concerned with object-level existential risks as with the meta-work of developing the field. Unfortunately I have not had time to review all the articles it contains that were not authored by CSER researchers, though Jones et al.'s Representation of future generations in United Kingdom policy-making, which advocated for a Parliamentary committee for future generations, looks interesting, as one was indeed subsequently created. CSER claim, as seems plausible, that many of these papers would not have counterfactually existed without CSER’s role as a catalyst. The topics discussed include a variety of existential risks.

CSER researchers also appeared as co-authors on the following papers:


Based on some very rough numbers shared with me I estimate they have around 1.25 years worth of expenses in reserve, with an annual budget of around $1m.

If you wanted to donate to them, here is the relevant web page.

GCRI: Global Catastrophic Risks Institute

The Global Catastrophic Risks Institute is a geographically dispersed group run by Seth Baum. They have produced work on a variety of existential risks, including AI and non-AI risks. Within AI they do a lot of work on the strategic landscape, and are very prolific.

They are a significantly smaller organisation than most of the others reviewed here, and in 2018 only one of their researchers (Seth) was full time. In the past I have been impressed with their high research output to budget ratio, and that continued this year. At the moment they seem to be somewhat subscale as an organisation - Seth seems to have been responsible for a large majority of their 2018 work - and are trying to grow.

Here is their annual write-up.

Adam Gleave, winner of the 2017 donor lottery, chose to give some money to GCRI; here is his thought process. He was impressed with their nuclear war work (which I’m not qualified to judge), and recommended GCRI focus more on quality and less on quantity, which seems plausible to me. GCRI tell me they are attentive to the issue and have made institutional changes to try to effect change.

GCRI also shared some other considerations with me that I cannot disclose, which may have affected my overall conclusion in addition to the considerations listed above.


Baum et al.'s Long-Term Trajectories of Human Civilization provides an analysis of possible ways the future might go. They discuss four broad trajectories: status quo, catastrophe, technological transformation, and astronomical colonisation. The scope is very broad but the analysis is still quite detailed; it reminds me of Superintelligence a bit. I think this paper has a strong claim to becoming the default reference for the topic. Researchers from FHI, FRI were also named authors on the paper.

Baum's Resilience to Global Catastrophe provides a brief introduction to ideas around resilience to disasters. The points it made seem true, but are obviously more applicable to non-AGI based threats that leave more scope for recovery.

Baum's Uncertain Human Consequences in Asteroid Risk Analysis and the Global Catastrophe Threshold discusses the consequences of asteroid impact. He reviews some of the literature, and discusses the idea of important thresholds for impact. One idea I hadn't come across before was the risk that an asteroid impact might be mistaken for a nuclear attack and cause a war - an interesting risk because all we need to do to avoid it is see the asteroid coming. However, I'm not an expert in the field, so struggle to judge how novel or incremental the paper is.

Baum and Barrett's A Model for the Impacts of Nuclear War goes through the various impacts of nuclear war. It seems diligent and useful for future researchers or policymakers as a reference, though it is not my area of expertise.

Baum et al.'s A Model for the Probability of Nuclear War describes and decomposes the many possible routes to nuclear war. It also contains an interesting and extensive database of 'near-miss' scenarios.

Baum's Superintelligence Skepticism as a Political Tool discusses the risk of motivated scepticism about AI risks in order to protect funding for researchers and avoid regulation for corporations. This seems like a plausible risk, though we should be careful attributing disingenuous motivations to opponents - though it is certainly true that the AI safety community seems to be the target of more misinformation than you might expect. I think the paper might have benefitted from contrasting this with the risks of regulatory capture, which seem to operate in the other direction. Without doing so the political discussion was somewhat partisan - in both misinformation papers virtually all the examples of bad actors were right wing groups, though perhaps most readers might find this agreeable!

Baum's Countering Superintelligence Misinformation discusses ways to improve debate around superintelligence through countering misinformation. These are mainly different forms of education, plus criticism of people for saying false things. I thought that the sections about ways of addressing misinformation once it exists were generally quite sophisticated, though I am sceptical of some of them as I don't think AI safety is very amenable to popular or state pressure.

Baum et al.'s Modelling and Interpreting Expert Disagreement about Artificial Intelligence attempts to put numbers on Bostrom and Goertzel's credences for various AI risk factors and compare. They try to break down the disagreement into three statements, interpret the two thinkers' statements as probabilities for those statements, and then assign their own probability for which thinker is correct. I'm a bit confused by the last step - it seems that by doing so you're basically ensuring the output will be equal to your own credence (by the law of total probability).
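A worked version of that last point, as a sketch under my own simplifying assumptions: let X be one of the statements, and let B and G be the events that Bostrom or Goertzel respectively is the correct thinker, treated as mutually exclusive and exhaustive. Each thinker's stated probability is read as a conditional probability. The symbols are hypothetical labels of mine, not notation from the paper.

```latex
\begin{align*}
P(X) &= P(X \mid B)\,P(B) + P(X \mid G)\,P(G) \\
     &= p_B \, q + p_G \, (1 - q)
\end{align*}
```

Here p_B and p_G are the probabilities attributed to the two thinkers and q is the modeller's own probability that Bostrom is the correct one. Since q is supplied by the modeller, the resulting P(X) is just the modeller's own credence in X written out in two terms.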

Umbrello and Baum's Evaluating Future nanotechnology: The Net Societal Impacts of Atomically Precise Manufacturing discusses the possible impacts of nanotechnology on society. Most of the discussion is quite broad, and could apply to economic growth in general. I was surprised how little value the authors assigned to greatly increasing the wealth of humanity.


GCRI spent around $140k in 2018, and are aiming to raise $1.5m to cover the next three years, for a target annual budget of ~$500k. This would allow them to employ their (3) key staff full time and have some money for additional hiring.

This large jump makes it a little hard to calculate runway in a comparable fashion to other organisations. They currently have around $280k, having recently received a $250k donation. But is it unfair to include this donation, given they received it subsequently to some other organisations telling me about their finances? All organisations should look progressively better funded as giving season goes on!

In any case it seems relatively clear that they have been and probably continue to be at the moment more funding constrained than most other organisations. The part-time nature of many of their staff makes their cost structure more variable and less fixed, suggesting this limited runway is less of an existential threat than it would be at some other organisations – they’re not about to disband - though clearly this is still undesirable.

It seems credible that more funding would allow them to hire their researchers full time, which seems like a relatively low-risk method of scaling. If they can preserve their current productivity this could be valuable, though my impression is many small organisations become less productive as they scale, as high initial productivity may be due to founder effects that revert to the mean.

If you want to donate to GCRI, here is the relevant web page.

GPI: The Global Priorities Institute

The Global Priorities Institute is an academic research institute, led by Hilary Greaves, working on EA philosophy within Oxford. I think of their mission as attempting to provide a home so that high quality academics can have a respectable academic career while working on the most important issues. At the moment they mainly employ philosophers, but they tell me they are planning to hire more economists in the future.

They are relatively new but many of their employees are extremely impressive and their working papers (linked on the EA forum, not on their main website) seem very good to me. At this stage I wouldn’t expect them to have reached run-rate productivity, so would expect this to increase in 2019.

They shared with me abstracts of a number of papers and so on they were working on which seemed interesting and useful. As academic philosophy goes it is very tightly focused on important, decision-relevant issues - however it is not directly AI Safety work.

They allow their employees to spend 50% (!) of their time working on non-GPI projects, to help attract talent. However, the Trammell paper mentioned below was one of these projects, and I thought it was very good, so maybe in practice this does not represent a halving of their cost-effectiveness.

CEA are also spawning a new independent Forethought Foundation for Global Priorities Research, which seems to be very similar to GPI except not part of Oxford.


Mogensen's Long-termism for risk averse altruists argues that risk aversion should make altruists *more*, not *less*, interested in preventing existential risks. This is basically for the same reason that risk aversion causes people to buy insurance. You should be risk averse in outcomes, not in the direct impacts of your actions. This argument is totally obvious now but I'd never heard anyone mention it until two months ago, which suggests it is real progress. Overall I thought this was an excellent paper.

Trammell's Fixed-Point Solutions to the Regress Problem in Normative Uncertainty argues that we can avoid infinite metaethical regress through fixed-point results. This seems like an alternative to Will's work on Moral Uncertainty in some senses. Basically the idea is that if the 'choiceworthiness' of different theories are cardinal at every level in their hierarchy, we can prove a unique fixed point. This is significant to the extent we think that AIs are going to have to learn how to do moral reasoning, perhaps without the aid of humans' convenient "just don't think about it" hack. It's also in some ways a nice response to this SlateStarCodex article.


They have a 2019 budget of around $1.5m, and shared with me a number of examples of types of people they might like to hire in the future, with additional funding.

Apparently Oxford University rules mean that all their hires have to be pre-funded for the entire duration of their (4-5 year) contract.

If you wanted to donate to GPI, here is the link.

ANU: Australian National University

Australian National University has produced a surprisingly large number of relevant papers and researchers over time.


Everitt et al.'s AGI Safety Literature Review - I was glad to see someone else attempting to do the same thing I have! Readers of this article might enjoy reading it, as it has much the same purpose. For academics new to the field it could function as a useful overview, introducing but not really arguing for many important points. Its main value probably comes from one-sentence descriptions of a large number of papers, which could be a useful launching point for research. Literature reviews can also help raise the status of the field. However, it is less likely to add much new insight to those familiar with the field, as it doesn’t really engage with any of the arguments in depth.

Everitt et al.'s Reinforcement Learning with a Corrupted Reward Channel examines how noisy reward inputs can drastically degrade reinforcement learner performance, and some possible solutions. Unsurprisingly, CIRL features as a possible solution. It's also nice to see ANU-Deepmind collaboration. This paper was actually written last year, but I mention it here for completeness as I think I missed it previously; I haven't reviewed it in depth. Researchers from Deepmind were also named authors on the paper.

EDIT: one paper redacted on author request, pending improved second version.

ANU researchers were also named as co-authors on the following papers:


Given their position as part of ANU I suspect it would be difficult for individual donations to appreciably support their work. Additionally, one of their top researchers, Tom Everitt, has now joined Deepmind.

BERI: The Berkeley Existential Risk Initiative

EDIT: After publishing, the Berkeley Existential Risk Initiative requested I remove this section. As a professional courtesy I am reluctantly complying, and rescind any suggestion that BERI may be a good place to donate. I apologize for any inconvenience caused to readers.


Ought

Ought is a San Francisco based non-profit researching the viability of automating human-like cognition. The focus is on approaches that are “scalable” in the sense that better ML or more compute makes them increasingly helpful for supporting and automating deliberation without requiring additional data generated by humans. The idea, as with amplification, is that we can achieve safety guarantees by making agents that reason in individual explicit and comprehensible steps, iterated many times over, as opposed to the dominant more black-box approaches of mainstream ML. Ought does research on computing paradigms that support this approach and experiments with human participants to determine whether this class of approaches is promising. But I admit I understand what they do less well than with other groups.

Their work doesn’t fit neatly into the model of the above groups - they’re not focused on publishing research papers, at least at the moment. Partly as a result of this, and as a new group, I feel like I don’t have quite as good a grasp on exactly their status as with other groups - which is of course primarily a fact about my epistemic state, rather than them.


Stuhlmüller's Factored Cognition outlines the ideas behind their implementation of Christiano-style amplification. They built a web app where people take questions and recursively break them down into simpler questions that can be solved in isolation. At the moment this is for humans, to try to test whether this sort of amplification of distillation and answering could work. It seems like they have put a fair bit of thought into the ontology.
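The recursive breakdown described above can be sketched in a few lines. This is only a toy illustration of the general pattern, not Ought's app or API: the function names and the 'sum a list of numbers' task are hypothetical stand-ins for a human (or model) decomposing and answering real questions.

```python
def answer(question, decompose, solve_directly, combine):
    """Answer a question by recursively splitting it into simpler ones.

    decompose(question)      -> list of sub-questions (empty if simple enough)
    solve_directly(question) -> answer to a question needing no breakdown
    combine(question, subs)  -> answer built only from the sub-answers
    """
    sub_questions = decompose(question)
    if not sub_questions:
        return solve_directly(question)
    sub_answers = [
        answer(q, decompose, solve_directly, combine) for q in sub_questions
    ]
    return combine(question, sub_answers)


# Toy task: "what is the sum of these numbers?", where each worker is only
# ever asked to add at most two sub-results together.
def split_in_half(numbers):
    if len(numbers) <= 1:
        return []
    middle = len(numbers) // 2
    return [numbers[:middle], numbers[middle:]]


def read_single(numbers):
    return numbers[0] if numbers else 0


def add_up(_question, sub_answers):
    return sum(sub_answers)


if __name__ == "__main__":
    print(answer((1, 2, 3, 4, 5), split_in_half, read_single, add_up))
```

The key property is that each step sees only one small question in isolation, so the overall answer depends on whether the decomposition preserves what matters.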

Evans et al.'s Predicting Human Deliberative Judgments with Machine Learning attempts to make progress on building ML systems that remain well-calibrated (i.e. the system "knows what it knows") in AI-complete settings (i.e. in settings where current ML algorithms can’t possibly do well on every possible input). To do this they collect a dataset of human judgements on complex issues (weird fermi estimations and political fact-checking) and then look at how people's estimates for these questions changed as they were allowed more time. This is important because someone's rapid judgement of an issue is evidence as to what their eventual slow judgement will be. In some cases you might be able to predict that there is no need to give the human more time; their 30 second answer is probably good enough. This could be useful if you are trying to produce a large training set of judgements about complex topics. I also admire the authors' honesty that the results of their ML system were less good than they expected. They also discussed problems with their dataset; this was definitely my experience when trying to use the site. Researchers from FHI were also named authors on the paper.


Based on numbers they shared with me I estimate they have around half a year’s worth of expenses in reserve, with a projected 2019 budget of around $1m.

Additional funding sounds like it would go towards reserves and additional researchers and programmers, including a web developer, probably mainly continuing working on Factored Cognition.

Ought asked me to point out that they have applied for an OpenPhil grant renewal but expect to still have room for more funding afterwards.

AI Impacts

AI Impacts is a small Berkeley-based group that does high-level strategy work, especially on AI timelines, somewhat associated with MIRI.

Adam Gleave, winner of the 2017 donor lottery, chose to give some money to AI Impacts; here is his thought process. He was impressed with their work, although sceptical of their ability to scale.


Carey wrote Interpreting AI Compute Trends, which argues that cutting-edge ML research projects have been getting dramatically more expensive. So much so that the trend will have to stop, suggesting that (one driver of) AI progress will slow down over the next 3.5-10 years. Additionally, he points out that we are also nearing the processing capacity (though not scanning capacity) required to model human brains. (Note that this was a guest post by Ryan, who works for FHI)

Grace's Likelihood of discontinuous progress around the development of AGI discusses 11 different arguments for AGI to have a discontinuous impact, and finds them generally unconvincing. This is important from a strategy point of view because it suggests we should have more time to see AGI coming, potentially also making it clear to sceptics. Overall I found the article clear and generally convincing.

McCaslin's Transmitting fibers in the brain: Total length and distribution of lengths analyses how much neural fibre there is in the human brain, and the distribution of long vs short. My understanding is this is related to how many neurons in human brains are dedicated to moving information around, rather than computation, which might be important because it is an additional form of capacity that is often overlooked when people talk about FLOPS and MIPS, and so might affect your estimates for when we have enough hardware capacity for neuromorphic AI. However, I might be misunderstanding, as I found the motivation a little unclear.

Grace's Human Level Hardware Timeline attempts to estimate how long until we have human-level hardware at human cost. Largely based on earlier work, they estimate "a 30% chance we are already past human-level hardware (at human cost), a 45% chance it occurs by 2040, and a 25% chance it occurs later."

They have gathered a collection of examples of discontinuous progress in history, to attempt to produce something of a reference class for how likely this is with AGI - see for example the Burj Khalifa, the Eiffel Tower, rockets. It would be nice to see how many possible examples they investigated and found were not discontinuous.


According to numbers they shared with me, AI Impacts spent around $90k in 2018 on two part-time employees. In 2019 they plan to significantly increase, to ~$360k and hire multiple new workers. They have just over $400k in current funding, suggesting a bit over a year of runway at this elevated rate, or many years at their 2018 rate.
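The runway arithmetic here is simply reserves divided by annual spending. A minimal sketch, rounding 'just over $400k' down to $400k as an assumption (the helper name is hypothetical):

```python
def runway_years(reserves, annual_budget):
    """Years of spending covered by current reserves at a fixed annual rate."""
    return reserves / annual_budget


if __name__ == "__main__":
    reserves = 400_000  # assumption: 'just over $400k' rounded down
    print(round(runway_years(reserves, 360_000), 2))  # planned 2019 rate
    print(round(runway_years(reserves, 90_000), 2))   # 2018 rate
```

At the planned 2019 rate this gives a bit over one year, and at the 2018 rate a bit over four years, consistent with the description above.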

Similar to GCRI, there is some risk that small groups may have a high productivity due to founder effects, and this might revert to the mean as they scale.

MIRI seems to administer their finances on their behalf; donations can be made here.

OpenAI

OpenAI is a San Francisco based AGI startup charity, with a large focus on safety. It was founded in 2015 with money largely from Elon Musk.


Christiano et al.'s Supervising Strong Learners by Amplifying Weak Experts lays out Paul's amplification ideas in a paper - or at least one implementation of them. Basically the idea is that there are many problems where it is too expensive to produce training signals directly, so we will do so indirectly. We do this by iteratively breaking up the task into sub-tasks, using the agent to help with each sub-task, and then training the agent on the human's overall judgement, aided by the agent's output on the subtasks. Hopefully as the agent becomes strong it also gets better at the subtasks, improving the training set further. We also train a second agent to be able to predict good subtasks to go for, and to predict how the human will use the outputs from the subtasks. I'm not sure I understand why we don't train the agent on its performance of the subtasks (except that it is expensive to evaluate there?) I think the paper might have been a bit clearer if it had included an example of the algorithm being used in practice with a human in the loop, rather than purely algorithmic examples. Hopefully this will come in the future. Nonetheless this was clearly a very important paper. Overall I thought this was an excellent paper.

Irving, Christiano and Amodei's AI Safety via Debate explores adversarial 'debate' between two or more advanced agents, competing to be judged the most helpful by a trusted but limited agent. This is very clever. It's an extension of the grand Christiano project of trying to devise ways of amplifying simple, trusted agents (like humans) into more powerful ones - designing a system that takes advantage of our trust in the weak agent to ensure compliance in the stronger. Imagine we basically have a courtroom situation, where two highly advanced legal teams, with vast amounts of legal and forensic expertise, try to convince a simple but trusted agent (the jury) that they're in the right. Each side is trying to make its 'arguments' as simple as possible, and point out the flaws in the other's. As long as refuting lies is easy relative to lying, honesty should be the best strategy... so agents constrained in this way will be honest, and not even try dishonesty! Like a courtroom where both legal teams decide to represent the same side. The paper contains some nice examples, including AlphaGo as an analogy and a neat MNIST simulation, and an interactive website. Overall I thought this was an excellent paper.

The OpenAI Charter is their statement of values with regard to AGI research. It seems to contain the things you would want it to: benefit of all, fiduciary duty to humanity. Most interestingly, it also includes "if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years”", a clause which seems very sensible. Finally, it also notes that, like MIRI, they anticipate reducing their conventional publishing.

Amodei and Hernandez's AI and Compute attempts to quantify the computing power used for recent major AI developments like ResNets and AlphaGo. They find it has been doubling approximately every 3-4 months, dramatically faster than you would expect from Moore’s law – especially if you had been reading articles about the end of Moore’s law! This is due to a combination of the move to specialist hardware (initially GPUs, and now AI ASICs) and companies simply spending a lot more dollars. This is not a theory paper, but has direct relevance for timeline prediction and strategy that depends on whether or not there will be a hardware overhang.
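To make the gap concrete, here is a rough sketch converting doubling times into annual growth factors. The 3 and 4 month figures come from the range quoted above; the 24-month Moore's law doubling time is an assumption (a commonly quoted figure), and `annual_growth` is just an illustrative helper name.

```python
def annual_growth(doubling_months):
    """Multiplicative growth over 12 months for a given doubling time."""
    return 2 ** (12 / doubling_months)


# Doubling times for AI training compute, from the range quoted in the text.
fast = annual_growth(3)
slow = annual_growth(4)

# Assumption: Moore's law taken as a doubling roughly every 24 months.
moore = annual_growth(24)

print(f"AI compute: {slow:.0f}x to {fast:.0f}x per year")
print(f"Moore's law (assumed 24-month doubling): {moore:.2f}x per year")
```

Under these assumptions compute for the largest training runs grows by roughly an order of magnitude per year, against about 1.4x per year from hardware improvement alone.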

Christiano's Universality and Security Amplification describes how Amplification hopes to enhance security by protecting against adversarial inputs (attacks). The hope is that the process of breaking down queries into sub-queries that is at the heart of the Amplification idea can leave us with queries of sufficiently low complexity that they are human-secure. I'm not sure I really understood what this post adds to others in Paul's arsenal, mainly because I haven’t been following these as closely as perhaps I should have.

Researchers from OpenAI were also named as coauthors on:


Given the strong funding situation at OpenAI, as well as their safety team’s position within the larger organisation, I think it would be difficult for individual donations to appreciably support their work. However, it could be an excellent place to apply to work.

Google Deepmind

As well as being arguably the most advanced AI research shop in the world, Google’s London-based Deepmind has a very sophisticated AI Safety team.


Leike et al.'s AI Safety Gridworlds introduces an open-source set of environments for testing ML algorithms for safety. Progress in ML has been considerably aided by the availability of common toolsets like MNIST or the Atari games. Here the Deepmind safety team have produced a set of environments designed to test algorithms' ability to avoid a number of safety-related failure modes, like Interruptibility, Side Effects, Distributional Shifts and Reward Hacking. This hopefully not only makes such testing more accessible, it also makes these issues more concrete. Ideally it would shift the Overton window: maybe one day it will be weird to read an ML paper that does not contain a section describing performance on the Deepmind Gridworlds. This is clearly not a panacea; it is easy to 'fake' passing the test by giving the agent information it shouldn't have, it is better to prove safety results than tack them on, and there is always a risk of Goodharting. But this seems to me to be clearly a significant step forward. My enthusiasm is only slightly tempered by the fact that only one paper published in the following year citing the paper made use of the Gridworld suite, though Alex Turner's excellent post on Impact measures did as well. Overall I thought this was an excellent paper. Researchers from ANU were also named authors on the paper.

Krakovna's Specification Gaming Examples in AI provides a collection of different cases where agents have optimised their reward function in surprising/undesirable fashion. The spreadsheet of 45 examples might have some research value, but my guess is most of the value is as evidence of the problem.

Krakovna et al.'s Measuring and avoiding side effects using relative reachability invents a new way of defining 'impact', which is important if you want to minimise it, based on how many states' achievability is affected. Essentially it takes some set of possible states, and then punishes the agent for reducing the attainability of these states. The post also includes a few simulations in the AI Gridworld.
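The following toy sketch illustrates the flavour of the idea. It is a simplified, undiscounted reading of the description above, not the paper's implementation: reachability is treated as all-or-nothing, and the four-state 'vase' world, the state names and both function names are invented for illustration.

```python
from collections import deque


def reachable(transitions, start):
    """Set of states reachable from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reachability_penalty(transitions, current, baseline):
    """Fraction of all states reachable from the baseline but not from current."""
    lost = reachable(transitions, baseline) - reachable(transitions, current)
    return len(lost) / len(transitions)


# Hypothetical toy world: an agent can move left/right, or break a vase.
# Breaking the vase is irreversible, so the 'intact' states become unreachable.
TOY_WORLD = {
    "intact_left": ["intact_right", "broken_left"],
    "intact_right": ["intact_left", "broken_right"],
    "broken_left": ["broken_right"],
    "broken_right": ["broken_left"],
}

print(reachability_penalty(TOY_WORLD, "intact_right", "intact_left"))  # merely moved
print(reachability_penalty(TOY_WORLD, "broken_right", "intact_left"))  # broke the vase
```

Moving without breaking the vase incurs no penalty, since every state stays attainable, while breaking it cuts off half of the states and is penalised accordingly.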

Leike et al.'s Scalable agent alignment via reward modeling: a research direction outlines the Deepmind agenda for bootstrapping human evaluations to provide feedback for RL agents. Similar in some ways to the Christiano project, the idea is that your main RL agent simultaneously learns its reward function and about the world. The human's ability to provide good reward feedback is improved by training smaller agents who help him judge which rewards to provide. The paper goes into a number of familiar potential problems, and potential avenues of attack on those issues. I think the news here is more that the Deepmind (Safety) team is focusing on this, rather than the core ideas themselves. The paper also reviews a lot of related work.

Gasparik et al.'s Safety-first AI for autonomous data centre cooling and industrial control describes the main safety measures Google put in place to ensure their ML-driven datacenter cooling system didn't go wrong.

Ibarz et al.'s Reward Learning from Human Preferences and Demonstrations in Atari combines RL and IRL as two different sources of information for the agent. If you think both ideas have some value, it makes sense that combining them further improves performance.

Leibo et al.'s Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents creates an environment for comparing humans and RL agents on the same tasks. Given the goal of getting AI agents to behave in ways humans approve of is closely related to the goal of making them behave like humans, this seems like a potentially useful tool.

Ortega et al.'s Building safe artificial intelligence: specification, robustness and assurance provides an introduction to various problems in AI Safety. The content is unlikely to be new to readers here; it is significant insomuch as it represents a summary of the (worthwhile) priorities of Deepmind('s safety team). They decompose the issue into specification, robustness and assurance.

Researchers from Deepmind were also named as coauthors on the following papers:


As they are part of Google, I think it would be difficult for individual donors to directly support their work. However, it could be an excellent place to apply to work.

Google Brain

Google Brain is Google’s other highly successful AI research group.


Kurakin et al. wrote Adversarial Attacks and Defences Competition, which summarises the NIPS 2017 competition on Adversarial Attacks, including many of the strategies used. If you're not familiar with the area this could be a good introduction.

Brown and Olsson wrote Introducing the Unrestricted Adversarial Examples Challenge, which launches a new 2-sided challenge, for designing systems resistant to adversarial examples, and then finding adversarial examples. The difference here is in allowing a much broader class of adversarial examples, rather than just small perturbations. This seems like a significantly more important class, so it is good they are attempting to move the field in this direction.

Gilmer et al. wrote Motivating the Rules of the Game for Adversarial Example Research, which argues that the adversarial example literature has overly focused on a narrow class of imperceptibly-changed images. In most realistic cases the adversary has a much wider scope of possible attacks. Importantly for us, the general question is also more similar to the sorts of distributional shift issues that are likely to arise with AGI. To the extent this paper helps push researchers towards more relevant research it seems quite good.


As they are part of Google, I think it would be difficult for individual donors to directly support their work. However, it could be an excellent place to apply to work.

EAF / FRI: The Effective Altruism Foundation / Foundational Research Institute

EAF is a German/Swiss effective altruist group, led by Jonas Vollmer and Stefan Torges, that undertakes a number of activities. They do research on a number of fundamental long-term issues, many related to how to reduce the risks of very bad AGI outcomes, published through the Foundational Research Institute (FRI). Their website suggests that FRI and WAS (Wild Animal Suffering) are two equal sub-organisations, but apparently this is not the case - essentially everything EAF does is FRI now, and they just let WAS use their legal entity and donation interface. EAF also have Raising for Effective Giving, which encourages professional poker players to donate to effective charities, including MIRI.

In the past they have been rather negative utilitarian, which I have always viewed as an absurd and potentially dangerous doctrine. If you are interested in the subject I recommend Toby Ord’s piece on the subject. However, they have produced research on why it is good to cooperate with other value systems, making me somewhat less worried.


Oesterheld's Approval-directed agency and the decision theory of Newcomb-like problems analyses which decision theories are instantiated by RL agents. The paper analyses the structure of RL agents of various kinds and maps them mathematically to either Evidential or Causal Decision theory. Given how much we discuss decision theory it is surprising in retrospect that no-one (to my knowledge) had previously looked to see which ones our RL agents were actually instantiating. As such I found this an interesting paper.

Baumann's Using Surrogate Goals to Deflect Threats discusses using a decoy utility function component to protect against threats. The idea is that agents run the risk of counter-optimisation at the hands of an extortionist, but this could be protected against by redefining their utility function to add a pointless secondary goal (like avoiding the creation of a certain dimensioned platinum sphere). An opponent would find it easier to extort the agent by negatively optimising the surrogate goal. This doesn't prevent the agent from giving in to the threats, but it does reduce the damage if the attacker has to follow through on their threat. The paper discusses many additional details, including the multi-agent case, and the interaction between this and other defence mechanisms. My understanding is that they and Eliezer both (independently?) came up with this idea. One thing I didn't quite understand is the notion of attacker-hostile surrogates - surely they would just be ignored?

Sotala and Gloor's Superintelligence as a Cause or Cure for Risks of Astronomical Suffering is a review article for the various ways the future might contain a lot of suffering. It does a good job of going through possibilities, though I felt it was overly focused on suffering as a bad outcome - there are many other bad things too!

Sotala's Shaping economic incentives for collaborative AGI argues that encouraging collaborative norms in AI with regard to narrow AI will encourage those norms in the future for AGI due to cultural lock-in. Unfortunately it is not clear how to go about doing this. Researchers from FHI were also named authors on the paper.


Based on their blog post, they currently have around a year and a half’s worth of reserves, with a 2019 budget of $925,000.

As EAF have in the past worked on a variety of cause areas, donors might worry about fungibility. EAF tell me that they are now basically entirely focused on AI related work, and that WAS research is funded by specifically allocated donations, which would imply this is not a concern, though I note that several WAS people are still listed on their team page.

Readers who want to donate to EAF/FRI can do so here.

Foresight Institute

The Foresight Institute is a Palo Alto-based group focusing on AI and nanotechnology. Originally founded in 1986 (!), they seem to have been somewhat re-invigorated recently by Allison Duettmann. Unfortunately I haven’t had time to review them in detail.

A large part of their activity seems to be in organising ‘salon’ discussion / workshop events.

Duettmann et al.'s Artificial General Intelligence: Coordination and Great Powers summarises the discussion at the 2018 Foresight Institute Strategy Meeting on AGI. Researchers from FHI and FLI were also named authors on the paper.

Readers who want to donate to Foresight can do so here.

FLI: The Future of Life Institute

The Future of Life Institute was founded to do outreach, including running the Puerto Rico conference. Elon Musk donated $10m for the organisation to re-distribute; given the size of the donation it has rightfully come to somewhat dominate their activity.

In 2018 they ran a second grantmaking round, giving $2m split between 10 different people. These grants were more focused on AGI than the previous round, which included a large number of narrow AI projects. In general the grants went to university professors. They have now awarded most of the $10m.

Unfortunately I haven’t had time to review them in detail.

Readers who want to donate to FLI can do so here.

Median Group

The Median Group is a new group for research on global catastrophic risks, with researchers from MIRI, OpenPhil and Numerai. As a new group they lack the sort of track record that would make them easily amenable to analysis. Current projects they’re working on include AI timelines, forest fires, and climate change impacts on geopolitics.

I don’t know that much about them because the contact email listed on the website does not work.


Taylor et al. wrote Insight-based AI timeline model, which made an insight-based model for the time to AGI. They first produced a list of important insights that have (plausibly) contributed towards AGI. Surprisingly, they find there has been a roughly constant rate of insight production since 1945. They then model time-to-AGI using a Pareto distribution for the number of insights required. This is a novel (to me, at least) method that I liked.
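As a rough illustration of how such a model can turn into a timeline distribution, here is a minimal Monte Carlo sketch. It is one possible reading of the description above, not the authors' code: the insight count, the insight rate and the Pareto shape parameter are all made-up placeholder values, and `sample_years_to_agi` is a hypothetical helper.

```python
import random
import statistics


def sample_years_to_agi(insights_so_far, insights_per_year, alpha, rng):
    """Draw one sample of years remaining under a Pareto prior on insights needed.

    A Pareto variable conditioned on exceeding x0 is distributed as
    x0 * Pareto(alpha), so we scale a standard Pareto draw by the
    number of insights already found.
    """
    total_required = insights_so_far * rng.paretovariate(alpha)
    remaining = total_required - insights_so_far
    return remaining / insights_per_year  # constant rate of insight production


rng = random.Random(0)  # fixed seed so runs are repeatable

# Placeholder inputs, chosen only for illustration.
samples = [
    sample_years_to_agi(insights_so_far=150, insights_per_year=2.0, alpha=1.0, rng=rng)
    for _ in range(10_000)
]

print(f"Median years remaining: {statistics.median(samples):.0f}")
```

The heavy tail of the Pareto distribution means the mean of such samples can be far larger than the median, so the choice of shape parameter matters a great deal to the conclusions.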

Convergence Analysis

Convergence Analysis is a new group, led by Justin Shovelain, aiming to do strategic work. They are too new to have any track record.

Other Research

I would like to emphasise that there is a lot of research I didn't have time to review, especially in this section, as I focused on reading organisation-donation-relevant pieces. For example, Kosoy's The Learning-Theoretic AI Alignment Research Agenda seems like a worthy contribution.


Lipton and Steinhardt's Troubling Trends in Machine Learning Scholarship critiques a number of developments in the ML literature that they think are bad. Basically, they argue that a lot of papers obfuscate explanation vs speculation, obscure the true source of improvement in their papers (often just hyper-parameter tuning), use maths to impress rather than clarify, and use common English words for complex terms, thereby smuggling in unnecessary connotations. It's unclear to me, however, to what extent these issues retard progress on safety vs capabilities. I guess to the extent that safety requires clear understanding, whereas capabilities can be achieved in a more messy fashion, these trends are bad and should be pushed back on.

Jilk's Conceptual-Linguistic Superintelligence discusses the need for AGI to have a conceptual-linguistic facility. Contra recent AI developments - e.g. AlphaZero does not have a linguistic ability - he argues that AIs will need linguistic ability to understand much of the human world. He also discusses the difficulties that Rice's theorem imposes on AI self-improvement, though this has been well discussed before.

Cave and Ó hÉigeartaigh's An AI Race for Strategic Advantage: Rhetoric and Risks argues that framing AI development as a 'race', or an 'arms race', is bad. Much of their reasoning is not new, and was previously published by e.g. Baum's On the Promotion of Safe and Socially Beneficial Artificial Intelligence. Instead I think of the target audience here as being policymakers and other AI researchers: this is a paper aiming to influence global strategy, not research EA strategy. Having said that, their discussion of why we should actively confront AI race rhetoric, rather than trying to simply avoid it, was novel, at least to me. It also apparently won best paper at the AAAI/ACM conference on Artificial Intelligence, Ethics, and Society. Researchers from CSER were also named authors on the paper.

Liu et al.'s A Survey on Security Threats and Defensive Techniques of Machine Learning: A Data Driven View reviews security threats to contemporary ML systems. This basically addresses the concerns raised in Amodei et al.'s Concrete Problems about Distributional Shifts between training and test data, and how to ensure robustness.

Sarma and Hay's Robust Computer Algebra, Theorem Proving, and Oracle AI discusses computer algebra systems as potentially important classes of Oracles, and tries to provide concrete safety-related work that could be done. Their overview of Question-Answering-Systems, Computer-Algebra-Systems and Interactive-Theorem-Provers was interesting to me, as I didn't have much familiarity with them. They argue that CAS use heuristics that sometimes lead to invalid inferences, while ITPs are very inefficient, and suggest projects to help integrate the two, to produce more reliable math oracles. I think of this paper as being a bit like a specialised version of Amodei et al's Concrete Problems, but the connection between the projects here and the end goal of FAI is a little harder for me to grasp. Additionally, the paper seems to have been in development since 2013?

Manheim and Garrabrant's Categorizing Variants of Goodhart's Law classifies different types of situations where a proxy measure ceases to be a good proxy when you start relying on it. This is clearly an important topic for AI safety, insomuch as we are hoping to design AIs that will not fall victim to it. The paper provides a nice disambiguation of different kinds of situation, bringing conceptual clarity even if it's not a deep mathematical result. Researchers from MIRI were also named authors on the paper.
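A small simulation may help make the basic phenomenon concrete. The sketch below covers only one simple variant, sometimes called regressional Goodhart, in which the proxy equals the true value plus independent noise; the Gaussian distributions, sample sizes and variable names are all arbitrary choices for illustration and are not taken from the paper.

```python
import random

rng = random.Random(0)  # fixed seed so runs are repeatable

# Each candidate has a true value we care about and a noisy proxy we can measure.
candidates = []
for _ in range(10_000):
    true_value = rng.gauss(0, 1)
    proxy = true_value + rng.gauss(0, 1)  # proxy = truth + independent noise
    candidates.append((proxy, true_value))

# Select hard on the proxy: keep the top 1% by measured score.
top = sorted(candidates, reverse=True)[:100]
mean_proxy = sum(p for p, _ in top) / len(top)
mean_true = sum(t for _, t in top) / len(top)

print(f"Mean proxy score of selected candidates: {mean_proxy:.2f}")
print(f"Mean true value of selected candidates:  {mean_true:.2f}")
```

The selected candidates are genuinely better than average, but their true value falls well short of what the proxy suggests, because selecting on the proxy also selects for favourable noise.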

Ngo and Pace's Some cruxes on impactful alternatives to AI policy work discusses the advantages and disadvantages of AI policy work. They try to find the 'crux' of their disagreement - the small number of statements they disagree about which determine which side of the issue they come down on. Researchers from Deepmind were also named authors on the paper.

Awad et al.'s The Moral Machine Experiment did a massive online interactive survey of 35 *million* people to determine their moral preferences with regard to autonomous cars. They found that people prefer: saving more people rather than fewer; saving humans over animals; saving young (including unborn children) over old; lawful people over criminals; executives over homeless; fit over fat; females over males; and pedestrians over passengers. I thought this was very interesting, and applaud them for actually looking for people's moral intuitions, rather than just substituting the values of the programmers/politicians. They also analyse how these values differ between cultures. Overall I thought this was an excellent paper.

Green's Ethical Reflections on Artificial Intelligence reviews various ethical issues about AI from a Christian perspective. Given the dominance of utilitarian thinking on the subject, it was nice to see an explicitly Christian contribution that displayed familiarity with the literature, with safety as #1 and #3 on the list of issues. "therefore it must be the paramount goal of ethics to maintain human survival."

Eth's The Technological Landscape Affecting Artificial General Intelligence and the Importance of Nanoscale Neural Probes presents arguments for favouring whole-brain-emulation as a pathway to human-level AI over de novo AGI, and suggests that nanoscale neural probe research could be a good way to differentially advance WBE vs merely human-inspired Neuromorphic AGI. The paper builds on a lot of arguments in Bostrom's Superintelligence. It seems clear that neuromorphic AGI is undesirable - the question is between de novo and WBE, which unfortunately seem to have neuromorphic 'in between' them from a technological requirement point of view. Daniel presents some good arguments for the relative safety of WBE (some of which were already in Bostrom), for example that WBEs would help provide training data from de novo AGI, though I was sceptical of the idea that the identity of the first WBEs would be determined by public debate. An especially good point was that even if nanoscale neural probes accelerate neuromorphic almost as much as WBEs, because the two human-inspired paths are closely linked and hence more likely to hit closer in time than de novo, neural probe research is more likely to cause WBE to overtake neuromorphic than neuromorphic to overtake de novo.

Turchin's Could slaughterbots wipe out humanity? Assessment of the global catastrophic risk posed by autonomous weapons provides a series of Fermi-calculation-like estimates of the danger posed by weaponised drones. He concludes that while they are very difficult to defend against, and their cost is coming down, it is unlikely they would be the driving force behind human extinction.

Bogosian's Implementation of Moral Uncertainty in Intelligent Machines argues for using Will's metanormativity approach to moral uncertainty as a way of addressing moral disagreement in AI design. I'm always glad to see more attention given to Will's thesis, which I thought was very good, and the application to AI is an interesting one. I'm not quite sure how it would interact with a value-learning system - is the idea that the agent is updating all of its moral theories as new evidence comes in? Or that it has some value-learning approaches that are sharing credence with pre-programmed non-learning systems? I was a bit confused by his citing Greene (2001) as comparing the dispersion of issue and theory level disagreement on moral issues, but I don't think this actually affects the conclusions of the paper at all, and am less concerned than Kyle is about the scaling properties of the algorithm. I also liked his prudential argument for why moral partisans should agree to this compromise, though I note that virtue ethicists, for whom the character of the agent (not merely the results) matters, may not be convinced. Finally, I think he actually understated the extent to which debates about decision procedures are less vicious than those about object-level issues, as virtually all the emotion about voting systems seems to be generated by object-level partisans who believe that changing the voting system will help them achieve their object-level political goals.

rk and Sempere's AI development incentive gradients are not uniformly terrible argues that the 'openness is bad' conclusion from Armstrong et al's Racing to the Precipice is basically because of the discontinuity in success probability in their model. This seems true to me, and reduced my credence that openness was bad. Researchers from FHI were also named authors on the paper.

Liu et al.'s Governing Boring Apocalypses: A new typology of existential vulnerabilities and exposures for existential risk research discusses the broad risk landscape. They provide a number of breakdowns of possible risks, including many non-AI. I think the main use is the relatively policymaker-friendly framing.

Bansal and Weld's A Coverage-Based Utility Model for Identifying Unknown Unknowns designs a model for efficiently utilising a scarce human expert to discover false-positive regions.

Dai's A general model of safety-oriented AI development provides a very brief generalisation of the sort of inductive strategies for AI safety I had been referring to as 'Christiano-like'.


Roman Yampolskiy edited a 500-page anthology on AI Safety, available for purchase here. Unfortunately I haven’t had time to read every article; here is a review by someone who has.

The first half of the book, Concerns of Luminaries, is basically re-prints of older articles. As such readers will probably mainly be interested in the second half, whose articles I think are all original to this volume.

Misc other news

OpenPhil gave Carl Shulman $5m to re-grant, of which some seems likely to end up funding useful AI safety work. Given Carl’s intellect and expertise this seems like a good use of money to me.

OpenPhil are also funding seven ML PhD students ($1.1m over five years) through their ‘AI Fellows’ program. I have read their published research and some of it seems quite interesting – I found Noam’s Safe and Nested Subgame Solving for Imperfect-Information Games particularly interesting, partly as I didn’t have much prior familiarity with the subject. Most of their work thus far does not seem very AI Safety relevant, with some exceptions like this blog post by Jon Gauthier. But given the timeline for academic work and the mid-year announcement of the fellowships I think it’s probably too early to see if they will produce any AI Safety relevant work.

If you like podcasts, you might enjoy these 80,000 Hours podcasts. If not, they all have complete transcripts.

80,000 Hours also wrote a guide on how to transition from programming or CS into ML.

Last year I mentioned that the EA Long Term Future Fund did not seem to be actually making grants. After a series of criticisms on the EA forum by Henry Stanley and Evan Gaensbauer, CEA has now changed the management of the funds and committed to a regular series of grantmaking. However, I’m skeptical this will solve the underlying problem. Presumably they organically came across plenty of possible grants – if this was truly a ‘lower barrier to giving’ vehicle than OpenPhil they would have just made those grants. It is possible, however, that more managers will help them find more non-controversial ideas to fund. Here is a link to their recent grants round.

If you’re reading this, you probably already read SlateStarCodex. If not, you might enjoy this article he wrote this year about AI Safety.

In an early proof of the viability of cryonics, LessWrong has been brought back to life. If like me you find the new interface confusing you can view it through GreaterWrong. Relatedly there is integration with the Alignment Forum, to provide a place for discussion of AI Alignment issues that is linked to LessWrong. This seems rather clever to me.

Zvi Mowshowitz and Vladimir Slepnev have been organizing a series of AI Safety prizes, giving out money for the articles they were most impressed with in a certain time frame.

Deepmind’s work on Protein Folding proved quite successful, winning the big annual competition by a significant margin. This seemed significant to me mainly because ‘solving the protein folding problem’ has been one of the prototypical steps between ‘recursively self-improving AI’ and ‘singleton’ since at least 2001.

Berkeley offered a graduate-level course in AGI Safety. are attempting to create a two-sided marketplace where you can buy or sell idle GPU capacity. This seems like the sort of thing that probably will not succeed, but if something like it did that’s another piece of evidence for hardware overhang.

The US Department of Commerce suggested a ban on AI exports, presumably inspired by previous bans on cryptography exports.


The size of the field continues to grow, both in terms of funding and researchers. Both make it increasingly hard for individual donors.

As I have once again failed to reduce charity selection to a science, I’ve instead attempted to subjectively weigh the productivity of the different organisations against the resources they used to generate that output, and donate accordingly.

My constant wish is to promote a lively intellect and independent decision-making among my readers; hopefully my laying out the facts as I see them above will prove helpful to some readers. Here is my eventual decision, rot13'd so you can come to your own conclusions first if you wish:

Qrfcvgr univat qbangrq gb ZVEV pbafvfgragyl sbe znal lrnef nf n erfhyg bs gurve uvtuyl aba-ercynprnoyr naq tebhaqoernxvat jbex va gur svryq, V pnaabg va tbbq snvgu qb fb guvf lrne tvira gurve ynpx bs qvfpybfher. Nqqvgvbanyyl, gurl nyernql unir n ynetre ohqtrg guna nal bgure betnavfngvba (rkprcg creuncf SUV) naq n ynetr nzbhag bs erfreirf.

Qrfcvgr SUV cebqhpvat irel uvtu dhnyvgl erfrnepu, TCV univat n ybg bs cebzvfvat cncref va gur cvcryvar, naq obgu univat uvtuyl dhnyvsvrq naq inyhr-nyvtarq erfrnepuref, gur erdhverzrag gb cer-shaq erfrnepuref’ ragver pbagenpg fvtavsvpnagyl vapernfrf gur rssrpgvir pbfg bs shaqvat erfrnepu gurer. Ba gur bgure unaq, uvevat crbcyr va gur onl nern vfa’g purnc rvgure.

Guvf vf gur svefg lrne V unir nggrzcgrq gb erivrj PUNV va qrgnvy naq V unir orra vzcerffrq jvgu gur dhnyvgl naq ibyhzr bs gurve jbex. V nyfb guvax gurl unir zber ebbz sbe shaqvat guna SUV. Nf fhpu V jvyy or qbangvat fbzr zbarl gb PUNV guvf lrne.

V guvax bs PFRE naq TPEV nf orvat eryngviryl pbzcnenoyr betnavfngvbaf, nf 1) gurl obgu jbex ba n inevrgl bs rkvfgragvny evfxf naq 2) obgu cevznevyl cebqhpr fgengrtl cvrprf. Va guvf pbzcnevfba V guvax TPEV ybbxf fvtavsvpnagyl orggre; vg vf abg pyrne gurve gbgny bhgchg, nyy guvatf pbafvqrerq, vf yrff guna PFRE’f, ohg gurl unir qbar fb ba n qenzngvpnyyl fznyyre ohqtrg. Nf fhpu V jvyy or qbangvat fbzr zbarl gb TPEV ntnva guvf lrne.

NAH, Qrrczvaq naq BcraNV unir nyy qbar tbbq jbex ohg V qba’g guvax vg vf ivnoyr sbe (eryngviryl) fznyy vaqvivqhny qbabef gb zrnavatshyyl fhccbeg gurve jbex.

Bhtug frrzf yvxr n irel inyhnoyr cebwrpg, naq V nz gbea ba qbangvat, ohg V guvax gurve arrq sbe nqqvgvbany shaqvat vf fyvtugyl yrff guna fbzr bgure tebhcf.

NV Vzcnpgf vf va znal jnlf va n fvzvyne cbfvgvba gb TPEV, jvgu gur rkprcgvba gung TPEV vf nggrzcgvat gb fpnyr ol uvevat vgf cneg-gvzr jbexref gb shyy-gvzr, juvyr NV Vzcnpgf vf fpnyvat ol uvevat arj crbcyr. Gur sbezre vf fvtavsvpnagyl ybjre evfx, naq NV Vzcnpgf frrzf gb unir rabhtu zbarl gb gel bhg gur hcfvmvat sbe 2019 naljnl. Nf fhpu V qb abg cyna gb qbangr gb NV Vzcnpgf guvf lrne, ohg vs gurl ner noyr gb fpnyr rssrpgviryl V zvtug jryy qb fb va 2019.

Gur Sbhaqngvbany Erfrnepu Vafgvghgr unir qbar fbzr irel vagrerfgvat jbex, ohg frrz gb or nqrdhngryl shaqrq, naq V nz fbzrjung zber pbaprearq nobhg gur qnatre bs evfxl havyngreny npgvba urer guna jvgu bgure betnavfngvbaf.

V unira’g unq gvzr gb rinyhngr gur Sberfvtug Vafgvghgr, juvpu vf n funzr orpnhfr ng gurve fznyy fvmr znetvany shaqvat pbhyq or irel inyhnoyr vs gurl ner va snpg qbvat hfrshy jbex. Fvzvyneyl, Zrqvna naq Pbairetrapr frrz gbb arj gb ernyyl rinyhngr, gubhtu V jvfu gurz jryy.

Gur Shgher bs Yvsr vafgvghgr tenagf sbe guvf lrne frrz zber inyhnoyr gb zr guna gur cerivbhf ongpu, ba nirentr. Ubjrire, V cersre gb qverpgyl rinyhngr jurer gb qbangr, engure guna bhgfbhepvat guvf qrpvfvba.

V nyfb cyna gb fgneg znxvat qbangvbaf gb vaqvivqhny erfrnepuref, ba n ergebfcrpgvir onfvf, sbe qbvat hfrshy jbex. Gur pheerag fvghngvba, jvgu n ovanel rzcyblrq/abg-rzcyblrq qvfgvapgvba, naq hcsebag cnlzrag sbe hapregnva bhgchg, frrzf fhobcgvzny. V nyfb ubcr gb fvtavsvpnagyl erqhpr bireurnq (sbe rirelbar ohg zr) ol abg univat na nccyvpngvba cebprff be nal erdhverzragf sbe tenagrrf orlbaq univat cebqhprq tbbq jbex. Guvf jbhyq or fbzrjung fvzvyne gb Vzcnpg Pregvsvpngrf, juvyr ubcrshyyl nibvqvat fbzr bs gurve vffhrf.

However, I wish to emphasise that all the above organisations seem to be doing good work on the most important issue facing mankind. It is the nature of making decisions under scarcity that we must prioritize some over others, and I hope that all organisations will understand that this necessarily involves negative comparisons at times.

Thanks for reading this far; hopefully you found it useful. Apologies to everyone who did valuable work that I excluded; I have no excuse other than procrastination, Crusader Kings II, and starting work at a new hedge fund.


I have not in general checked all the proofs in these papers, and similarly trust that researchers have honestly reported the results of their simulations.

I was a Summer Fellow at MIRI back when it was SIAI, volunteered briefly at GWWC (part of CEA) and previously applied for a job at FHI. I am personal friends with people at MIRI, FHI, CSER, CHAI, GPI, BERI, OpenAI, Deepmind, Ought and AI Impacts but not really at ANU, EAF/FRI, GCRI, Google Brain, Foresight, FLI, Median, Convergence (so if you’re worried about bias you should overweight them… though it also means I have less direct knowledge) (also sorry if I’ve forgotten any friends who work for the latter set!). However I have no financial ties beyond being a donor and have never been romantically involved with anyone who has ever been at any of the organisations.

I shared drafts of the individual organisation sections with representatives from MIRI, FHI, CHAI, CSER, GCRI, GPI, BERI, Ought, AI Impacts, and EAF/FRI.

I’d like to thank Greg Lewis and my anonymous reviewers for looking over this. Any remaining mistakes are of course my own. I would also like to thank my wife for tolerating all the time I have invested/wasted on this.

EDIT: Removed language about BERI, at their request.


Amodei, Dario and Hernandez, Danny - AI and Compute - 2018-05-16 -

Armstrong, Stuart; O'Rourke, Xavier - ‘Indifference’ methods for managing agent rewards - 2018-01-05 -

Armstrong, Stuart; O'Rourke, Xavier - Safe Uses of AI Oracles - 2018-06-05 -

Armstrong, Stuart; Mindermann, Soren - Impossibility of deducing preferences and rationality from human policy - 2017-12-05 -

Avin, Shahar; Wintle, Bonnie; Weitzdorfer, Julius; Ó hÉigeartaigh, Seán; Sutherland, William; Rees, Martin - Classifying Global Catastrophic Risks - 2018-02-23 -

Awad, Edmond; Dsouza, Sohan; Kim, Richard; Schulz, Jonathan; Henrich, Joseph; Shariff, Azim; Bonnefon, Jean-Francois; Rahwan, Iyad - The Moral Machine Experiment - 2018-10-24 -

Bansal, Gagan; Weld, Daniel - A Coverage-Based Utility Model for Identifying Unknown Unknowns - 2018-04-25 -

Basu, Chandrayee; Yang, Qian; Hungerman, David; Singhal, Mukesh; Dragan, Anca - Do You Want Your Autonomous Car to Drive Like You? - 2018-02-05 -

Batin, Mikhail; Turchin, Alexey; Markov, Sergey; Zhila, Alisa; Denkenberger, David - Artificial Intelligence in Life Extension: from Deep Learning to Superintelligence - 2017-08-31 -

Baum, Seth - Countering Superintelligence Misinformation - 2018-09-09 -

Baum, Seth - Resilience to Global Catastrophe - 2018-11-29 -

Baum, Seth - Superintelligence Skepticism as a Political Tool - 2018-08-22 -

Baum, Seth - Uncertain Human Consequences in Asteroid Risk Analysis and the Global Catastrophe Threshold - 2018-07-28 -

Baum, Seth; Armstrong, Stuart; Ekenstedt, Timoteus; Haggstrom, Olle; Hanson, Robin; Kuhlemann, Karin; Maas, Matthijs; Miller, James; Salmela, Markus; Sandberg, Anders; Sotala, Kaj; Torres, Phil; Turchin, Alexey; Yampolskiy, Roman - Long-Term Trajectories of Human Civilization - 2018-08-08 -

Baum, Seth; Barrett, Anthony - A Model for the Impacts of Nuclear War - 2018-04-03 -

Baum, Seth; Barrett, Anthony; Yampolskiy, Roman - Modelling and Interpreting Expert Disagreement about Artificial Intelligence - 2018-01-27 -

Baum, Seth; Neufville, Robert; Barrett, Anthony - A Model for the Probability of Nuclear War - 2018-03-08 -

Baumann, Tobias - Using Surrogate Goals to Deflect Threats - 2018-02-20 -

Becker, Gary - Crime and Punishment: An Economic Approach - 1974-01-01 -

Bekdash, Gus - Using Human History, Psychology and Biology to Make AI Safe for Humans - 2018-04-01 -

Berberich, Nicolas; Diepold, Klaus - The Virtuous Machine - Old Ethics for New Technology - 2018-06-27 -

Blake, Andrew; Bordallo, Alejandro; Hawasly, Majd; Penkov, Svetlin; Ramamoorthy, Subramanian; Silva, Alexandre - Efficient Computation of Collision Probabilities for Safe Motion Planning - 2018-04-15 -

Bogosian, Kyle - Implementation of Moral Uncertainty in Intelligent Machines - 2017-12-01 -

Bostrom, Nick - The Vulnerable World Hypothesis - 2018-11-09 -

Brown, Noam; Sandholm, Tuomas - Safe and Nested Subgame Solving for Imperfect-Information Games - 2017-05-08 -

Brown, Noam; Sandholm, Tuomas - Solving Imperfect-Information Games via Discounted Regret Minimization - 2018-09-11 -

Brown, Tom; Olsson, Catherine; Google Brain Team, Research Engineers - Introducing the Unrestricted Adversarial Examples Challenge - 2018-09-03 -

Carey, Ryan - Interpreting AI Compute Trends - 2018-07-10 -

Cave, Stephen; Ó hÉigeartaigh, Seán - An AI Race for Strategic Advantage: Rhetoric and Risks - 2018-01-16 -

Christiano, Paul - Techniques for Optimizing Worst-Case Performance - 2018-02-01 -

Christiano, Paul - Universality and Security Amplification - 2018-03-10 -

Christiano, Paul; Shlegeris, Buck; Amodei, Dario - Supervising Strong Learners by Amplifying Weak Experts - 2018-10-19 -

Cohen, Michael; Vellambi, Badri; Hutter, Marcus - Algorithm for Aligned Artificial General Intelligence - 2018-05-25 -

Cundy, Chris; Filan, Daniel - Exploring Hierarchy-Aware Inverse Reinforcement Learning - 2018-07-13 -

Currie, Adrian - Existential Risk, Creativity & Well-Adapted Science - 2018-07-22 -

Currie, Adrian - Geoengineering Tensions - 2018-04-30 -

Currie, Adrian - Introduction: Creativity, Conservatism & the Social Epistemology of Science - 2018-09-27 -

Currie, Adrian; Ó hÉigeartaigh, Seán - Working together to face humanity's greatest threats: Introduction to The Future of Research on Catastrophic and Existential Risk - 2018-03-26 -

Dafoe, Allen - AI Governance: A Research Agenda - 2018-08-27 -

Dai, Wei - A general model of safety-oriented AI development - 2018-06-11 -

Demski, Abram - An Untrollable Mathematician Illustrated - 2018-03-19 -

DeVries, Terrance; Taylor, Graham - Leveraging Uncertainty Estimates for Predicting Segmentation Quality - 2018-07-02 -

Dobbe, Roel; Dean, Sarah; Gilbert, Thomas; Kohli, Nitin - A Broader View on Bias in Automated Decision-Making: Reflecting on Epistemology and Dynamics - 2018-07-06 -

Doshi-Velez, Finale; Kim, Been - Considerations for Evaluation and Generalization in Interpretable Machine Learning - 2018-08-24 -

Duettmann, Allison; Afanasjeva, Olga; Armstrong, Stuart; Braley, Ryan; Cussins, Jessica; Ding, Jeffrey; Eckersley, Peter; Guan, Melody; Vance, Alyssa; Yampolskiy, Roman - Artificial General Intelligence: Coordination and Great Powers - 1900-01-00 -

Erdelyi, Olivia; Goldsmith, Judy - Regulating Artificial Intelligence: Proposal for a Global Solution - 2018-02-01 -

Eth, Daniel - The Technological Landscape Affecting Artificial General Intelligence and the Importance of Nanoscale Neural Probes - 2017-08-31 -

Evans, Owain; Stuhlmuller, Andreas; Cundy, Chris; Carey, Ryan; Kenton, Zachary; McGrath, Thomas; Schreiber, Andrew - Predicting Human Deliberative Judgments with Machine Learning - 2018-07-13 -

Everitt, Tom; Krakovna, Victoria; Orseau, Laurent; Hutter, Marcus; Legg, Shane - Reinforcement Learning with a Corrupted Reward Channel - 2017-05-23 -

Everitt, Tom; Lea, Gary; Hutter, Marcus - AGI Safety Literature Review - 2018-05-22 -

Filan, Daniel - Bottle Caps aren't Optimisers - 2018-11-21 -

Fisac, Jaime; Bajcsy, Andrea; Herbert, Sylvia; Fridovich-Keil, David; Wang, Steven; Tomlin, Claire; Dragan, Anca - Probabilistically Safe Robot Planning with Confidence-Based Human Predictions - 2018-05-31 -

Garnelo, Marta; Rosenbaum, Dan; Maddison, Chris; Ramalho, Tiago; Saxton, David; Shanahan, Murray; Teh, Yee Whye; Rezende, Danilo; Eslami, S M Ali - Conditional Neural Processes - 2018-07-04 -

Garrabrant, Scott; Demski, Abram - Embedded Agency Sequence - 2018-10-29 -

Gasparik, Amanda; Gamble, Chris; Gao, Jim - Safety-first AI for autonomous data centre cooling and industrial control - 2018-08-17 -

Gauthier, Jon; Ivanova, Anna - Does the brain represent words? An evaluation of brain decoding studies of language understanding - 2018-06-02 -

Ghosh, Shromona; Berkenkamp, Felix; Ranade, Gireeja; Qadeer, Shaz; Kapoor, Ashish - Verifying Controllers Against Adversarial Examples with Bayesian Optimization - 2018-02-26 -

Gilmer, Justin; Adams, Ryan; Goodfellow, Ian; Andersen, David; Dahl, George - Motivating the Rules of the Game for Adversarial Example Research - 2018-07-20 -

Grace, Katja - Human Level Hardware Timeline - 2017-12-22 -

Grace, Katja - Likelihood of discontinuous progress around the development of AGI - 2018-02-23 -

Green, Brian Patrick - Ethical Reflections on Artificial Intelligence - 2018-06-01 -

Hadfield-Menell, Dylan; Andrus, McKane; Hadfield, Gillian - Legible Normativity for AI Alignment: The Value of Silly Rules - 2018-11-03 -

Hadfield-Menell, Dylan; Hadfield, Gillian - Incomplete Contracting and AI alignment - 2018-04-12 -

Haqq-Misra, Jacob - Policy Options for the radio Detectability of Earth - 2018-04-02 -

Hoang, Lê Nguyên - A Roadmap for the Value-Loading Problem - 2018-09-04 -

Huang, Jessie; Wu, Fa; Precup, Doina; Cai, Yang - Learning Safe Policies with Expert Guidance - 2018-05-21 -

Ibarz, Borja; Leike, Jan; Pohlen, Tobias; Irving, Geoffrey; Legg, Shane; Amodei, Dario - Reward Learning from Human Preferences and Demonstrations in Atari - 2018-11-15 -

IBM - Bias in AI: How we Build Fair AI Systems and Less-Biased Humans - 2018-02-01 -

Irving, Geoffrey; Christiano, Paul; Amodei, Dario - AI Safety via Debate - 2018-05-02 -

Janner, Michael; Wu, Jiajun; Kulkarni, Tejas; Yildirim, Ilker; Tenenbaum, Joshua - Self-Supervised Intrinsic Image Decomposition - 2018-02-05 -

Jilk, David - Conceptual-Linguistic Superintelligence - 2017-07-31 -

Jones, Natalie; O’Brien, Mark; Ryan, Thomas - Representation of future generations in United Kingdom policy-making - 2018-03-26 -

Koller, Torsten; Berkenkamp, Felix; Turchetta, Matteo; Krause, Andreas - Learning-based Model Predictive Control for Safe Exploration - 2018-09-22 -

Krakovna, Victoria - Specification Gaming Examples in AI - 2018-04-02 -

Krakovna, Victoria; Orseau, Laurent; Martic, Miljan; Legg, Shane - Measuring and avoiding side effects using relative reachability - 2018-06-04 -

Kurakin, Alexey; Goodfellow, Ian; Bengio, Samy; Dong, Yinpeng; Liao, Fangzhou; Liang, Ming; Pang, Tianyu; Zhu, Jun; Hu, Xiaolin; Xie, Cihang; Wang, Jianyu; Zhang, Zhishuai; Ren, Zhou; Yuille, Alan; Huang, Sangxia; Zhao, Yao; Zhao, Yuzhe; Han, Zhonglin; Long, Junjiajia; Berdibekov, Yerkebulan; Akiba, Takuya; Tokui, Seiya; Abe, Motoki - Adversarial Attacks and Defences Competition - 2018-03-31 -

Lee, Kimin; Lee, Kibok; Lee, Honglak; Shin, Jinwoo - A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks - 2018-10-27 -

Lehman, Joel; Clune, Jeff; Misevic, Dusan - The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities - 2018-08-14 -

Leibo, Joel; de Masson d'Autume, Cyprien; Zoran, Daniel; Amos, David; Beattie, Charles; Anderson, Keith; Castañeda, Antonio García; Sanchez, Manuel; Green, Simon; Gruslys, Audrunas; Legg, Shane; Hassabis, Demis; Botvinick, Matthew - Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents - 2018-02-04 -

Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane - Scalable agent alignment via reward modeling: a research direction - 2018-11-19 -

Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane - AI Safety Gridworlds - 2017-11-28 -

Lewis, Gregory; Millett, Piers; Sandberg, Anders; Snyder-Beattie, Andrew; Gronvall, Gigi - Information Hazards in Biotechnology - 2018-11-12 -

Lipton, Zachary; Steinhardt, Jacob - Troubling Trends in Machine Learning Scholarship - 2018-07-26 -

Liu, Chang; Hamrick, Jessica; Fisac, Jaime; Dragan, Anca; Hedrick, J Karl; Sastry, S Shankar; Griffiths, Thomas - Goal Inference Improves Objective and Perceived Performance in Human-Robot Collaboration - 2018-02-06 -

Liu, Hin-Yan; Lauta, Kristian Cedervall; Maas, Matthijs Michiel - Governing Boring Apocalypses: A new typology of existential vulnerabilities and exposures for existential risk research - 2018-03-26 -

Liu, Qiang; Li, Pan; Zhao, Wentao; Cai, Wei; Yu, Shui; Leung, Victor - A Survey on Security Threats and Defensive Techniques of Machine Learning: A Data Driven View - 2018-02-13 -

Liu, Yang; Price, Huw - Ramsey and Joyce on deliberation and prediction - 2018-08-30 -

Lütjens, Björn; Everett, Michael; How, Jonathan - Safe Reinforcement Learning with Model Uncertainty Estimates - 2018-10-19 -

Malinin, Andrey; Gales, Mark - Predictive Uncertainty Estimation via Prior Networks - 2018-10-08 -

Manheim, David; Garrabrant, Scott - Categorizing Variants of Goodhart's Law - 2018-04-10 -

Martinez-Plumed, Fernando; Loe, Bao Sheng; Flach, Peter; Ó hÉigeartaigh, Seán; Vold, Karina; Hernandez-Orallo, Jose - The Facets of Artificial Intelligence: A Framework to Track the Evolution of AI - 2018-08-21 -

McCaslin, Tegan - Transmitting fibers in the brain: Total length and distribution of lengths - 2018-03-29 -

Menda, Kunal; Driggs-Campbell, Katherine; Kochenderfer, Mykel - EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning - 2018-07-22 -

Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum; Roff, Heather; Allen, Gregory C.; Steinhardt, Jacob; Flynn, Carrick; Ó hÉigeartaigh, Seán; Beard, Simon; Belfield, Haydn; Farquhar, Sebastian; Lyle, Clare; Crootof, Rebecca; Evans, Owain; Page, Michael; Bryson, Joanna; Yampolskiy, Roman; Amodei, Dario - The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation - 2018-02-20 -

Milli, Smitha; Schmidt, Ludwig; Dragan, Anca; Hardt, Moritz - Model Reconstruction from Model Explanations - 2018-07-13 -

Mindermann, Soren; Shah, Rohin; Gleave, Adam; Hadfield-Menell, Dylan - Active Inverse Reward Design - 2018-11-16 -

Mogensen, Andreas - Long-termism for risk averse altruists - 1900-01-00 -

Ngo, Richard; Pace, Ben - Some cruxes on impactful alternatives to AI policy work - 2018-10-10 -

Noothigattu, Ritesh; Bouneffouf, Djallel; Mattei, Nicholas; Chandra, Rachita; Madan, Piyush; Varshney, Kush; Campbell, Murray; Singh, Moninder; Rossi, Francesca - Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration - 2018-09-21 -

Nushi, Besmira; Kamar, Ece; Horvitz, Eric - Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure - 2018-09-19 -

Oesterheld, Caspar - Approval-directed agency and the decision theory of Newcomb-like problems - 2017-12-21 -

OpenAI - OpenAI Charter - 2018-04-09 -

Ortega, Pedro; Maini, Vishal; Safety Team, Deepmind - Building safe artificial intelligence: specification, robustness and assurance - 2018-09-27 -

Papernot, Nicolas; McDaniel, Patrick - Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning - 2018-03-13 -

Raghunathan, Aditi; Steinhardt, Jacob; Liang, Percy - Certified Defenses Against Adversarial Examples - 2018-01-29 -

Rainforth, Tom; Kosiorek, Adam; Le, Tuan Anh; Maddison, Chris; Igl, Maximilian; Wood, Frank; Teh, Yee Whye - Tighter Variational Bounds are Not Necessarily Better - 2018-06-25 -

Ratner, Ellis; Hadfield-Menell, Dylan; Dragan, Anca - Simplifying Reward Design through Divide-and-Conquer - 2018-06-07 -

Reddy, Siddharth; Dragan, Anca; Levine, Sergey - Shared Autonomy via Deep Reinforcement Learning - 2018-05-23 -

Reddy, Siddharth; Dragan, Anca; Levine, Sergey - Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behaviour - 2018-10-20 -

Rees, Martin - On The Future - 2018-10-16 -

rk; Sempere, Nuno - AI development incentive gradients are not uniformly terrible - 2018-11-12 -

Ruan, Wenjie; Huang, Xiaowei; Kwiatkowska, Marta - Reachability Analysis of Deep Neural Networks with Provable Guarantees - 2018-05-06 -

Sadigh, Dorsa; Sastry, Shankar; Seshia, Sanjit; Dragan, Anca - Planning for Autonomous Cars that Leverage Effects on Human Actions - 2016-06-01 -

Sandberg, Anders - Human Extinction from Natural Hazard Events - 2018-02-01 -

Sarma, Gopal; Hay, Nick - Mammalian Value Systems - 2017-12-31 -

Sarma, Gopal; Hay, Nick - Robust Computer Algebra, Theorem Proving, and Oracle AI - 2017-12-31 -

Sarma, Gopal; Hay, Nick; Safron, Adam - AI Safety and Reproducibility: Establishing Robust Foundations for the Neuropsychology of Human Values - 2018-09-08 -

Schulze, Sebastian; Evans, Owain - Active Reinforcement Learning with Monte-Carlo Tree Search - 2018-03-13 -

Shah, Rohin - AI Alignment Newsletter - 2018 -

Shah, Rohin; Christiano, Paul; Armstrong, Stuart; Steinhardt, Jacob; Evans, Owain - Value Learning Sequence - 2018-10-29 -

Avin, Shahar - Mavericks and Lotteries - 2018-09-25 -

Avin, Shahar; Shapira, Shai - Civ V AI Mod - 2018-01-05 -

Shaw, Nolan P.; Stockel, Andreas; Orr, Ryan W.; Lidbetter, Thomas F.; Cohen, Robin - Towards Provably Moral AI Agents in Bottom-up Learning Frameworks - 2018-03-15 -

Sotala, Kaj - Shaping economic incentives for collaborative AGI - 2018-06-29 -

Sotala, Kaj; Gloor, Lukas - Superintelligence as a Cause or Cure for Risks of Astronomical Suffering - 2017-08-31 -

Stuhlmuller, Andreas - Factored Cognition - 2018-04-25 -

Taylor, Jessica; Gallagher, Jack; Maltinsky, Baeo - Insight-based AI timeline model - 2018 -

The Future of Life Institute - Value Alignment Research Landscape - 1900-01-00 -

Trammell, Philip - Fixed-Point Solutions to the Regress Problem in Normative Uncertainty - 2018-08-29 -

Tucker, Aaron; Gleave, Adam; Russell, Stuart - Inverse Reinforcement Learning for Video Games - 2018-10-24 -

Turchin, Alexey - Could slaughterbots wipe out humanity? Assessment of the global catastrophic risk posed by autonomous weapons - 2018-03-19 -

Turchin, Alexey; Denkenberger, David - Classification of Global Catastrophic Risks Connected with Artificial Intelligence - 2018-05-03 -

Turner, Alex - Towards a New Impact Measure - 2018-09-18 -

Umbrello, Steven; Baum, Seth - Evaluating Future nanotechnology: The Net Societal Impacts of Atomically Precise Manufacturing - 2018-04-30 -

Conitzer, Vincent; Sinnott-Armstrong, Walter; Borg, Jana Schaich; Deng, Yuan; Kramer, Max - Moral Decision Making Frameworks for Artificial Intelligence - 2017-02-12 -

Wang, Xin; Chen, Wenhu; Wang, Yuan-Fang; Wang, William Yang - No Metrics are Perfect: Adversarial Reward Learning for Visual Storytelling - 2018-07-09 -

Wu, Yi; Srivastava, Siddharth; Hay, Nicholas; Du, Simon; Russell, Stuart - Discrete-Continuous Mixtures in Probabilistic Programming: Generalised Semantics and Inference Algorithms - 2018-06-13 -

Wu, Yueh-Hua; Lin, Shou-De - A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents - 2018-09-10 -

Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Lesser, Victor; Yang, Qiang - Building Ethics into Artificial Intelligence - 2018-07-13 -

Yudkowsky, Eliezer - The Rocket Alignment Problem - 2018-10-03 -

Yudkowsky, Eliezer; Christiano, Paul - Challenges to Christiano's Capability Amplification Proposal - 2018-05-19 -

Zhou, Allen; Hadfield-Menell, Dylan; Nagabandi, Anusha; Dragan, Anca - Expressive Robot Motion Timing - 2018-02-05 -


Rot13's content, hidden using spoiler markup:

Despite having donated to MIRI consistently for many years as a result of their highly non-replaceable and groundbreaking work in the field, I cannot in good faith do so this year given their lack of disclosure. Additionally, they already have a larger budget than any other organisation (except perhaps FHI) and a large amount of reserves.

Despite FHI producing very high quality research, GPI having a lot of promising papers in the pipeline, and both having highly qualified and value-aligned researchers, the requirement to pre-fund researchers’ entire contract significantly increases the effective cost of funding research there. On the other hand, hiring people in the bay area isn’t cheap either.

This is the first year I have attempted to review CHAI in detail and I have been impressed with the quality and volume of their work. I also think they have more room for funding than FHI. As such I will be donating some money to CHAI this year.

I think of CSER and GCRI as being relatively comparable organisations, as 1) they both work on a variety of existential risks and 2) both primarily produce strategy pieces. In this comparison I think GCRI looks significantly better; it is not clear their total output, all things considered, is less than CSER’s, but they have done so on a dramatically smaller budget. As such I will be donating some money to GCRI again this year.

ANU, Deepmind and OpenAI have all done good work but I don’t think it is viable for (relatively) small individual donors to meaningfully support their work.

Ought seems like a very valuable project, and I am torn on donating, but I think their need for additional funding is slightly less than some other groups.

AI Impacts is in many ways in a similar position to GCRI, with the exception that GCRI is attempting to scale by hiring its part-time workers to full-time, while AI Impacts is scaling by hiring new people. The former is significantly lower risk, and AI Impacts seems to have enough money to try out the upsizing for 2019 anyway. As such I do not plan to donate to AI Impacts this year, but if they are able to scale effectively I might well do so in 2019.

The Foundational Research Institute have done some very interesting work, but seem to be adequately funded, and I am somewhat more concerned about the danger of risky unilateral action here than with other organisations.

I haven’t had time to evaluate the Foresight Institute, which is a shame because at their small size marginal funding could be very valuable if they are in fact doing useful work. Similarly, Median and Convergence seem too new to really evaluate, though I wish them well.

The Future of Life institute grants for this year seem more valuable to me than the previous batch, on average. However, I prefer to directly evaluate where to donate, rather than outsourcing this decision.

I also plan to start making donations to individual researchers, on a retrospective basis, for doing useful work. The current situation, with a binary employed/not-employed distinction, and upfront payment for uncertain output, seems suboptimal. I also hope to significantly reduce overhead (for everyone but me) by not having an application process or any requirements for grantees beyond having produced good work. This would be somewhat similar to Impact Certificates, while hopefully avoiding some of their issues.

As usual, this is an excellent resource. Thanks so much. I've PM'd you with about 5 typos / minor errors.

Also, I'm stealing this:

In an early proof of the viability of cryonics, LessWrong has been brought back to life.

Thanks! I have fixed most of the typos.

OpenPhil gave Carl Shulman $5m to re-grant

I didn't realise this was happening. Is there somewhere we can read about grants from this fund when/if they occur?

The vast majority of discussion in this area seems to consist of people who are annoyed at ML systems are learning based on the data, rather than based on the prejudices/moral views of the writer.

While many writers may take this flawed view, there's also a very serious problem here.

Decision-making question: Let there be two actions A and ~A. Our goal is to obtain outcome G. If P(G | A) > P(G | ~A), should we do A?

The correct answer is “maybe.” All distributions of P(A,G) are consistent with scenarios in which doing A is the right answer, and scenarios in which it’s the wrong answer.

If you adopt a rule “do A, if P(G | A) > P(G | ~A)”, then you get AI systems which tell you never to go to the doctor, because people who go to the doctor are more likely to be sick. You may laugh, but I’ve actually seen an AI paper where a neural net for diagnosing diabetes was found to be checking every other diagnosis of the patient, in part because all diagnoses are correlated with doctor visits.

The moral of the story is that it is in general impossible to make decisions based purely on observational statistics. It comes down to the difference between P(G | A) and P(G | do(A)). The former is defined by counting the co-occurrences of A and G; the latter is defined by writing G as a deterministic function of A (and other variables) plus random noise.

This is the real problem of bias: the decisions an AI makes may not actually produce the outcomes predicted by the data, because the data itself was influenced by previous decisions.
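A minimal sketch of the doctor example, to make the gap between P(G | A) and P(G | do(A)) concrete. Everything here is an assumption for illustration: the variable names, the single confounder (being sick) and all of the probabilities are made up, not taken from any dataset or paper. A is "visit the doctor" and G is "healthy next month".

```python
# Toy structural model: sickness S influences both the decision to visit
# the doctor (A) and the later health outcome (G).
# All numbers are hypothetical and chosen only to illustrate the point.
P_SICK = 0.3
P_VISIT_GIVEN_SICK = {True: 0.9, False: 0.1}   # P(A=1 | S)
P_HEALTHY = {                                   # P(G=1 | S, A)
    (True, True): 0.70,    # sick and treated
    (True, False): 0.30,   # sick and untreated
    (False, True): 0.95,   # not sick: the visit makes no difference
    (False, False): 0.95,
}


def p_sick(sick):
    """Prior probability of the sickness state."""
    return P_SICK if sick else 1.0 - P_SICK


def observational(visit):
    """P(G | A=visit): condition on who actually chose to visit."""
    numerator = 0.0
    denominator = 0.0
    for sick in (True, False):
        p_visit = P_VISIT_GIVEN_SICK[sick]
        p_action = p_visit if visit else 1.0 - p_visit
        joint = p_sick(sick) * p_action          # P(S, A)
        numerator += joint * P_HEALTHY[(sick, visit)]
        denominator += joint
    return numerator / denominator


def interventional(visit):
    """P(G | do(A=visit)): force A for everyone, so S keeps its prior."""
    return sum(p_sick(s) * P_HEALTHY[(s, visit)] for s in (True, False))


if __name__ == "__main__":
    for visit in (True, False):
        print(
            f"visit={visit}: "
            f"P(G|A)={observational(visit):.3f}  "
            f"P(G|do(A))={interventional(visit):.3f}"
        )
```

Under these assumed numbers, the observational comparison makes visiting the doctor look harmful, because visitors are mostly the sick, while the interventional comparison shows that visiting helps. The rule "do A if P(G | A) > P(G | ~A)" would therefore pick the wrong action. The interventional formula used here is only valid because sickness is, by construction, the sole confounder in this toy model.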

The third part of this slide deck explains the problem very well, with lots of references:

Source: I’m involved in a couple of causal inference projects.

Thanks for the informative post as usual.

Full-disclosure: I'm a researcher at UC Berkeley financially supported by CHAI, one of the organisations reviewed in this post. However, this comment is just my personal opinion.

Re: location, I certainly agree that an organization does not need to be in the Bay Area to do great work, but I do think location is important. In particular, there's a significant advantage to working in or near a major AI hub. The Bay Area is one such place (Berkeley, Stanford, Google Brain, OpenAI, FAIR) but not the only one; e.g. London (DeepMind, UCL) and Montreal (MILA, Brain, et al) are also very strong.

I also want to push back a bit on the assumption that people working for AI alignment organisations will be involved with EA and rationalist communities. While it may be true in many cases, at CHAI I think it's only around 50% of staff. So whether these communities are thriving or not in a particular area doesn't seem that relevant to me for organisational location decisions.

I definitely agree that being near AI hubs is helpful, and I'd be interested in supporting any credible new groups that started in other hubs.

Thanks for that extra info on CHAI staff. In general my objections to the bay area are partly about the EA/LW culture there, and partly about the broader culture. I did end up donating to CHAI despite this!

On the theme of 'what about my other contributions', here are two with my name on them that I'd point to as similarly important to the one that was included:

Promoted to curated. This is an amazing resource for anyone trying to get a sense of the AI Alignment landscape, and I am really thankful for all the effort that must have gone into writing this.

It is possible they had timing issues whereby a substantial amount of work was done in earlier years but only released more recently. In any case they have published more in 2018 than in previous years.

(Disclosure: I am executive director of CSER) Yes. As I described in relation to last year's review, CSER's first postdoc started in autumn 2015, most started in mid 2016. First stages of research and papers began being completed throughout 2017, most papers then going to peer-reviewed journals. 2018 is more indicative of run-rate output, although 2019 will be higher.

Throughout 2016-2017, considerable CSER leadership time (mine in particular) has also gone on getting CFI up and running, which will increase our output on AI safety/strategy/governance (although CFI also separately works on near term and non-AI safety-related topics).

Thank you for another detailed review! (response cross-posted to EA forum too)

In the past [EAF/FRI] have been rather negative utilitarian, which I have always viewed as an absurd and potentially dangerous doctrine. If you are interested in the subject I recommend Toby Ord’s piece on the subject. However, they have produced research on why it is good to cooperate with other value systems, making me somewhat less worried.

(I work for FRI.) EAF/FRI is generally "suffering-focused", which is an umbrella term covering a range of views; NU would be the most extreme form of that, and some of us do lean that way, but many disagree with it and hold some view which would be considered much more plausible by most people (see the link for discussion). Personally I used to lean more NU in the past, but have since shifted considerably in the direction of other (though still suffering-focused) views.

Besides the research about the value of cooperation that you noted, this article discusses reasons why the expected value of x-risk reduction could be positive even from a suffering-focused view; the paper of mine referenced in your post also discusses why suffering-focused views should care about AI alignment and cooperate with others in order to ensure that we get aligned AI.

And in general it's just straightforwardly better and (IMO) more moral to try to create a collaborative environment where people who care about the world can work together in support of their shared points of agreement, rather than trying to undercut each other. We are also aware of the unilateralist's curse and do our best to discourage any other suffering-focused people from doing anything stupid.

Excellent and useful review. Definitely the kind of thing that I would like to encourage in the future, and which also holds historical interest - what kind of progress was made in 2018?

My take+decision on the MIRI issue, in ROT13 to continue the pattern

Nabgure (zvabe) "Gbc Qbabe" bcvavba. Ba gur ZVEV vffhr: nterr jvgu lbhe pbapreaf, ohg pbagvahr qbangvat, sbe abj. V nffhzr gurl'er shyyl njner bs gur ceboyrz gurl'er cerfragvat gb gurve qbabef naq jvyy nqqerff vg va fbzr snfuvba. Vs gurl qb abg zvtug nqwhfg arkg lrne. Gur uneq guvat vf gung ZVEV fgvyy frrzf zbfg qvssreragvngrq va nccebnpu naq gnyrag bet gung pna hfr shaqf (if BcraNV naq QrrcZvaq naq jryy-shaqrq npnqrzvp vafgvghgvbaf)

May I recommend spoiler markup? Just start the line with >!

Another (minor) "Top Donor" opinion. On the MIRI issue: agree with your concerns, but continue donating, for now. I assume they're fully aware of the problem they're presenting to their donors and will address it in some fashion. If they do not might adjust next year. The hard thing is that MIRI still seems most differentiated in approach and talent org that can use funds (vs OpenAI and DeepMind and well-funded academic institutions)

Thanks for doing this! I couldn't figure out how.

Thanks for sharing, seems like a reasonable take to me.

Shah et al.'s Value Learning Sequence is a short sequence of blog posts outlining the specification problem.

The link goes to the Embedded Agency sequence, not the value learning sequence.

(Cross-posted to the EA forum.) (Disclosure: I am executive director of CSER.) Thanks again for a wide-ranging and helpful review; this represents a huge amount of work and is a tremendous service to the community. For the sake of completeness, I include below 14 additional publications authored or co-authored by CSER researchers during the relevant time period that are not covered above (and one that falls just outside it but was not previously featured):

Global catastrophic risk:

Ó hÉigeartaigh. The State of Research in Existential Risk

Avin, Wintle, Weitzdorfer, O hEigeartaigh, Sutherland, Rees (all CSER). Classifying Global Catastrophic Risks

International governance and disaster governance:

Rhodes. Risks and Risk Management in Systems of International Governance.


Rhodes. Scientific freedom and responsibility in a biosecurity context.

Just missing the cutoff for this review, but not included last year and so possibly of interest, is our bioengineering horizon scan (published November 2017): Wintle et al (incl Rhodes, O hEigeartaigh, Sutherland). Point of View: A transatlantic perspective on 20 emerging issues in biological engineering.

Biodiversity loss risk:

Amano (CSER), Szekely… & Sutherland. Successful conservation of global waterbird populations depends on effective governance (Nature publication)

CSER researchers as coauthors:

(Environment) Balmford, Amano (CSER) et al. The environmental costs and benefits of high-yield farming

(Intelligence/AI) Bhatnagar et al (incl Avin, O hEigeartaigh, Price): Mapping Intelligence: Requirements and Possibilities

(Disaster governance): Horhager and Weitzdorfer (CSER): From Natural Hazard to Man-Made Disaster: The Protection of Disaster Victims in China and Japan

(AI) Martinez-Plumed, Avin (CSER), Brundage, Dafoe, O hEigeartaigh (CSER), Hernandez-Orallo: Accounting for the Neglected Dimensions of AI Progress

(Foresight/expert elicitation) Hanea… & Wintle: The Value of Performance Weights and Discussion in Aggregated Expert Judgments

(Intelligence) Logan, Avin et al (incl Adrian Currie): Uncovering the Neural Correlates of Behavioral and Cognitive Specialization

(Intelligence) Montgomery, Currie et al (incl Avin). Ingredients for Understanding Brain and Behavioral Evolution: Ecology, Phylogeny, and Mechanism

(Biodiversity) Baynham Herdt, Amano (CSER), Sutherland (CSER), Donald. Governance explains variation in national responses to the biodiversity crisis

(Biodiversity) Evans et al (incl Amano). Does governance play a role in the distribution of invasive alien species?

Outside of the scope of the review, we produced on request a number of policy briefs for the United Kingdom House of Lords on future AI impacts; horizon-scanning and foresight in AI; and AI safety and existential risk, as well as a policy brief on the bioengineering horizon scan. Reports/papers from our 2018 workshops (on emerging risks in nuclear security relating to cyber; nuclear error and terror; and epistemic security) and our 2018 conference will be released in 2019.

Thanks again!

See next year's post here.

In the FHI's indifference paper, they define policies as mapping observation-action histories to a distribution over actions instead of just actions ("π : H → ∆(A)"). Why is that? Is that common? Does it mean the agent is stochastic?

I didn't look at that particular paper, but that definition sounds like a reasonable way of doing it, since that way your results apply to both stochastic and deterministic agents. A deterministic policy is a special case of a stochastic policy, where the distribution over actions assigns one action 100% probability of being taken and all other actions a 0% probability. So if you define policies as mapping from histories to distributions of actions, that allows for both deterministic and stochastic agents.
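The point that a deterministic policy is a special case of a stochastic one can be made concrete with a small sketch. Everything here is a hypothetical illustration rather than anything taken from the FHI paper: the two-action set, the names `as_stochastic` and `sample_action`, and the representation of a distribution as a dictionary of probabilities are all assumptions.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical toy setting: a history is a tuple of past observations and
# actions, and a distribution over actions is a dict of probabilities.
ACTIONS = ("left", "right")
History = Tuple[str, ...]
Distribution = Dict[str, float]


def as_stochastic(
    deterministic: Callable[[History], str]
) -> Callable[[History], Distribution]:
    """Lift a history -> action map into a history -> distribution map."""
    def policy(history: History) -> Distribution:
        chosen = deterministic(history)
        # Point mass: probability 1 on the chosen action, 0 on all others.
        return {a: (1.0 if a == chosen else 0.0) for a in ACTIONS}
    return policy


def sample_action(
    policy: Callable[[History], Distribution],
    history: History,
    rng: random.Random,
) -> str:
    """Draw one action from the distribution the policy assigns to a history."""
    dist = policy(history)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def always_left(history: History) -> str:
    """A deterministic policy that ignores the history."""
    return "left"


def coin_flip(history: History) -> Distribution:
    """A genuinely stochastic policy: uniform over both actions."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
```

With this framing, `sample_action(as_stochastic(always_left), (), random.Random(0))` always returns `"left"`, while `coin_flip` can return either action, so any result stated for the distribution-valued type covers both kinds of agent.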

Yeah. I think I did notice it talking about a stochastic policy at one point, and on reflection I don't see any other reasonable way to do that. This interpretation also accords with making the agent's actions part of the observation history. If they were a pure function of the observations, we wouldn't need them to be there.

Typo thread: "The vast majority of discussion in this area seems to consist of people who are annoyed at ML systems are learning based on the data." I think that should be "...ML systems that are learning..." or "...who are annoyed that ML systems...".

Very thorough, and it's very worthwhile that posts like this are made.

I would like to emphasise that there is a lot of research I didn't have time to review, especially in this section, as I focused on reading organisation-donation-relevant pieces. For example, Kosoy's The Learning-Theoretic AI Alignment Research Agenda seems like a worthy contribution.

I would like to note that my research is funded by MIRI, so it is somewhat organisation-donation-relevant.

Some dates in your bibliography look like 1905-07-10, which seems to be an error.
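One possible explanation, offered only as an assumption since nothing in the thread confirms it: a bare year such as 2018 was read by a spreadsheet as a serial day number. Under the common spreadsheet convention of counting days from 1899-12-30, serial 2018 lands exactly on 1905-07-10. The helper name `serial_to_date` below is hypothetical.

```python
from datetime import date, timedelta

# Epoch commonly used when converting spreadsheet serial day numbers
# (valid for serials after the spreadsheet's fictitious 1900-02-29).
SPREADSHEET_EPOCH = date(1899, 12, 30)


def serial_to_date(serial: int) -> date:
    """Interpret an integer as a spreadsheet serial day number."""
    return SPREADSHEET_EPOCH + timedelta(days=serial)


# A publication year mistakenly treated as a day count.
print(serial_to_date(2018).isoformat())  # 1905-07-10
```

If that is what happened, entries showing a 1905 date would most likely be plain 2018 entries whose year column was formatted as a date.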

I compiled my full bibliography and, for the sake of completeness, am putting it here.

List of my AI Safety related articles (many are coauthored), published, drafted and planned:

Artificial Intelligence in Life Extension: from Deep Learning to Superintelligence - published, Informatica

Military AI as a Convergent Goal of Self-Improving AI - published, "AI Safety and Security"

Classification of Global Catastrophic Risks Connected with Artificial Intelligence - published, "AI and Society"

Predictions of the Near-Term Global Catastrophic Risks of Artificial Intelligence - published (under the name "Assessing the future plausibility of catastrophically dangerous AI") in "Futures"

The Global Catastrophic Risks Connected with Possibility of Finding Alien AI During SETI - published in "Journal of the British Interplanetary Society"

Classification of the Global Solutions of the AI Safety Problem - won a Good AI prize, submitted.

Could slaughterbots wipe out humanity? Assessment of the global catastrophic risk posed by autonomous weapons - draft

Message to Any Future AI: “There are several instrumental reasons why exterminating humanity is not in your interest” - not intended to be published in current form, but probably the most important of all works, as it is actionable in current form. Scheduled for revision in 2019.

"Decisive strategic advantage via Narrow AI" - LW post, submitted.

Levels of self-improvement - draft, LW post

The map of "Levels of defence" in AI safety - LW post

"Possible Dangers of the Unrestricted Value Learners" - LW post

"AI nanny via human upload" - early draft

"Catching treacherous turn: different ideas about AI boxing" - early draft

Hidden assumptions in the idea that humans have values - AI Safety Camp project, to be finished in early 2019.