AI safety: three human problems and one AI issue

by Stuart_Armstrong, 19th May 2017



Crossposted at the Intelligent agent foundation.

There have been various attempts to classify the problems in AI safety research. These range from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at the issues we are trying to solve or avoid. And most of these issues are problems about humans.

Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:

  • Humans don't know their own values (sub-issue: humans know their values better in retrospect than in prediction).
  • Humans are not agents and don't have stable values (sub-issue: humanity itself is even less of an agent).
  • Humans have poor predictions of an AI's behaviour.

And the central AI issue is:

  • AIs could become extremely powerful.

Obviously, if humans were agents, knew their own values, and could predict whether a given AI would follow those values or not, there would be no problem. Conversely, if AIs were weak, then the human failings wouldn't matter so much.

The points about human values are relatively straightforward, but what's the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, in order to act seemingly against our interests and values.

If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we're reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This both makes humans susceptible to manipulation and makes human values hard to define.

Finally, the issue of humans having poor predictions of AI is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environment, then you're essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.


How to classify methods and problems

That's well and good, but how do various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.

It seems to me that:


  • Friendly AI is trying to solve the values problem directly.
  • IRL and Cooperative IRL are also trying to solve the values problem. The greatest weakness of these methods is the not agents problem.
  • Corrigibility/interruptibility are also addressing the issue of humans not knowing their own values, using the sub-issue that human values are clearer in retrospect. These methods also overlap with poor predictions.
  • AI transparency is aimed at getting round the poor predictions problem.
  • Laurent's work on carefully defining the properties of agents is mainly also about solving the poor predictions problem.
  • Low impact and Oracles are aimed squarely at preventing AIs from becoming powerful. Methods that restrict the Oracle's output implicitly accept that humans are not agents.
  • Robustness of the AI to changes between testing and training environments, degradation and corruption, etc. ensures that humans won't be making poor predictions about the AI.
  • Robustness to adversaries is dealing with the sub-issue that humanity is not an agent.
  • The modular approach of Eric Drexler is aimed at preventing AIs from becoming too powerful, while reducing our poor predictions.
  • Logical uncertainty, if solved, would reduce the scope for certain types of poor predictions about AIs.
  • Wireheading, when the AI takes control of the reward channel, is a problem because humans don't know their values (and hence use an indirect reward) and because humans make poor predictions about the AI's actions.
  • Wireheading, when the AI takes control of the human, is as above, but is also a problem of humans not being agents.
  • Incomplete specifications are either a problem of not knowing our own values (and hence missing something important in the reward/utility) or of making poor predictions (when we thought that a situation was covered by our specification, but it turned out not to be).
  • AIs modelling human knowledge seem to be mostly about getting round the fact that humans are not agents.

Putting this all in a table:


                                Values  Not Agents  Poor Predictions  Powerful
Friendly AI                     X       -           -                 -
IRL and Cooperative IRL         X       X           -                 -
Corrigibility/interruptibility  X       -           X                 -
AI transparency                 -       -           X                 -
Laurent's work                  -       -           X                 -
Low impact and Oracles          -       X           -                 X
Robustness                      -       -           X                 -
Robustness to adversaries       -       X           -                 -
Modular approach                -       -           X                 X
Logical uncertainty             -       -           X                 -
Wireheading (reward channel)    X       -           X                 -
Wireheading (human)             X       X           X                 -
Incomplete specifications       X       -           X                 -
AIs modelling human knowledge   -       X           -                 -


Further refinements of the framework

It seems to me that the third category - poor predictions - is the most likely to be expandable. For the moment, it just incorporates all our lack of understanding about how AIs would behave, but it might be more useful to subdivide it.


Comments

I think values are confusing because they aren't a natural kind. The first decomposition that made sense was 2 axes: stated/revealed and local/global.

Stated local values are optimized for positional goods, stated global values are optimized for alliance building, revealed local values are optimized for basic needs/risk avoidance, and revealed global values barely exist and, when they do, are semi-random, based on mimesis and other weak signals (humans are not automatically strategic, etc.).

Trying to build a coherent picture out of various outputs of 4 semi independent processes doesn't quite work. Even stating it this way reifies values too much. I think there are just local pattern recognizers/optimizers doing different things that we have globally applied this label of 'values' to because of their overlapping connotations in affordance space and because switching between different levels of abstraction is highly useful for calling people out in sophisticated hard to counter ways in monkey politics.

Also useful to think of local/global as Dyson's birds and frogs, or surveying vs navigation.

I'm unfamiliar with existing attempts at value decomposition if anyone knows of papers etc.

On predictions, humans treating themselves and others as agents seems to lead to a lot of problems. Could also deconstruct poor predictions based on which sub-system it runs into the limits of: availability, working memory, failure to propagate uncertainty, inconsistent time preferences...can we just invert the bullet points from superforecasting here?

If we create AI around a human upload, or a model of a human mind, it solves some of the problems:

1) It will, by definition, have the same values and the same value structure as a human being; in short, human uploading solves value uploading.

2) It will also not be an agent.

3) We could predict human upload behaviour based on our experience with predicting human behaviour.

And it will not be very powerful or very capable of strong self-improvement, because of the messy internal structure.

However, it could still be above human level because of acceleration of hardware and some tweaking. Using it we could construct primitive AI Police or AI Nanny, which will prevent the creation of any other types of AIs.

Convergent instrumental goals would make agent-like things become agents if they can self-modify (humans can't do this to any strong extent).

If we make a model of a specific human, for example a morally sane and rationally educated person with an excellent understanding of all of the above, he could choose the right level of self-improvement, as he would understand the dangers of becoming too much of an instrumental-goal-oriented agent. I don't know any such person in real life, btw.

Looks interesting. Thanks for doing this. It would be useful to me to get links to some of the things you mention (like Eric Drexler's work, which I am not familiar with).

I think there might be some categories missed here; it only presents the AI building problems. There are further classes of problem, such as social problems. For example, the problem of all agreeing not to develop AI until the other problems are solved.

There are also questions around whether we can make effective agents at all. We have existence proof on effective non-agent intelligences, but none around effective agent intelligences.

We invented the model of rational agents for humans, then we had to add lots and lots of exceptions (heuristics and biases). So many that you have disavowed it in this post. Yet we keep with the same model for AIs. Perhaps we need a Kuhnian paradigm shift in our understanding of agents.

Also, I think there are other problem sets if physics is such that AI hits a limit for a while and is only very powerful rather than stupidly powerful, e.g. not powerful enough to take over the whole world but powerful enough to destroy the whole world.

Then you have to deal with potential communities of AIs and their dynamics, which might fit into poor predictions (but is magnified by having a multiplicity of AIs).