This is a cross-post from my blog.

Artificial Intelligence safety researchers are concerned with the behaviour of “highly capable” AI systems. One of the challenges of this line of research is the fact that it’s hard to say, looking at today’s moderately capable AI systems, what highly capable systems will do or how they will work.

There are two intuitions that suggest understanding highly capable systems is particularly important for safety research:

  1. Highly capable AI systems could be very useful, so we expect that if they can be built they probably will be
  2. Highly capable AI systems raise much more serious safety concerns than less capable systems

This seems to make sense to me, but what does anyone actually mean by “highly capable”? I know Shane Legg and Marcus Hutter have had a go at understanding the related idea of “intelligence”: they collected many different definitions and proposed their own definition, which they informally summarise as

Intelligence measures an agent’s ability to achieve goals in a wide range of environments.

I think this is a good definition, especially the more precise version Legg and Hutter offer later in their paper. It also does a good job of explaining the second intuition: a system with high intelligence can achieve many different goals in many environments, so a very intelligent system can achieve malign goals even in environments where people are trying to stop it from doing so. We therefore want to be quite sure that the goals a highly capable system is pursuing are not malign, which is one of the major concerns of AI safety researchers.

It can also account, to some extent, for the first intuition: a system that can achieve many different goals can, presumably, do a better job of achieving our goals than a less intelligent system. However, in this case there is a gap in the argument: we have to assume that, because the system can achieve some goal, it is likely to achieve our goals. That is, a more intelligent AI system is more useful if that intelligence can be controlled.

This matters because AI safety research is trying to come up with proposals for highly capable AI systems that are both safe and competitive with unsafe systems. My proposal is that competitiveness is not just a function of intelligence (per Legg and Hutter), but of effectiveness, which can be roughly understood as “controllable intelligence”.

AI effectiveness

Highly effective AI systems are systems that are both highly intelligent and controllable (though maybe imperfectly so). I am not saying that highly effective systems are necessarily safe, just that, along with goal-achieving capabilities, control mechanisms are a critical part of AI systems, and we need to consider both if we want to understand what future systems might be capable of.

More specifically, I want to propose that we consider effectiveness to be the degree to which a system augments a person’s existing capabilities:

An AI system X is at least as effective as a system Y if a person in possession of X can do everything the same person could do if they had Y but not X, in the same time or less. X is strictly more effective than Y if a person in possession of X can also do at least one thing in less time than the same person in possession of Y.
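The definition above can be sketched in code. This is only an illustration under assumptions that are not part of the definition itself: a person’s capabilities with a given system are represented as a hypothetical dictionary from task names to completion times, a task missing from the dictionary is treated as impossible, and a task that is possible with X but not with Y is counted as “done in less time”.

```python
from typing import Dict

# Hypothetical representation: task name -> time (in hours) the person needs
# when they have the given system. A missing task cannot be done at all.
Times = Dict[str, float]


def at_least_as_effective(x: Times, y: Times) -> bool:
    """True if every task doable with Y is doable with X in the same time or less."""
    return all(task in x and x[task] <= t_y for task, t_y in y.items())


def strictly_more_effective(x: Times, y: Times) -> bool:
    """True if X is at least as effective as Y and better on at least one task."""
    if not at_least_as_effective(x, y):
        return False
    # Faster on a task that both systems enable.
    faster = any(x[task] < t_y for task, t_y in y.items())
    # Assumption: a task that is impossible with Y counts as taking longer.
    newly_possible = any(task not in y for task in x)
    return faster or newly_possible


# Made-up numbers, purely for illustration.
with_x = {"classify images": 2.0, "draft report": 5.0}
with_y = {"classify images": 3.0}

print(at_least_as_effective(with_x, with_y))    # True
print(strictly_more_effective(with_x, with_y))  # True
print(at_least_as_effective(with_y, with_x))    # False
```

Note that the relation is only a partial order: two systems can each enable something the other does not, in which case neither is at least as effective as the other.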

The reason for defining effectiveness like this is that it has a more direct bearing on a system’s usefulness than the definition of intelligence above. By definition, someone with a more effective AI system can do more than someone with a less effective system, and so (we might presume), the more effective system is more desirable.

This definition has at least one weird feature: the effectiveness of an AI system will change over time. Someone in possession of AlexNet in 2012 could classify very large sets of images more accurately than someone without it (that is, sets of images too large for a person to manually classify). Today, however, there are plenty of free tools that substantially outperform AlexNet, so a human with AlexNet no longer has a capability advantage over the same human without.

In the future, there may be many AI systems that are extremely effective by today’s standards, but by this definition humans using them may have no capability advantage over humans without them. However, because they enable a whole lot of things that are not possible today, these systems could all pose serious safety concerns that researchers today want to address. We can amend the idea of effectiveness to be a time-dependent measure:

An AI system X is at least as 2020-effective as a system Y if a person in possession of X can do everything the same person could do if they had Y and other technology available in the year 2020, but not X. (etc.)

In 2030 there will probably be many extremely 2020-effective AI systems, even if they are not particularly 2030-effective. However, 2020-effectiveness is not necessarily a good measurement of the desirability of a new system in 2030. When assessing the desirability of a new AI system, it doesn’t matter that it does things that would have been considered very impressive 10 years ago if it does not achieve impressive things today.

The way it looks to me is that effectiveness relative to a fixed base year is a measure of particular interest when we want to think about what might go wrong with highly effective AI systems, while effectiveness relative to what’s available right now is of particular interest when we want to think about which kinds of systems are more likely to be adopted. However, it may be the case that one is somewhat predictable from the other: for example, the most 2030-effective AI systems might also be among the most 2020-effective systems. In this case, the measure can help us think about both questions.

Effectiveness also depends quite substantially on the background knowledge of the person. A specialist might be able to accomplish a lot with a given system, while a layperson may not be able to do anything with it. I’m not sure how to deal with this feature at this stage.

Decomposing effectiveness into intelligence and controllability

In the situation where the AI does all of the hard work, I think it is possible to decompose effectiveness into “intelligence” and “controllability”. If a person has the ability to perfectly transmit their goals to an AI system, and they rely on it entirely to achieve the goal, then whether or not the goal is achieved depends entirely on the AI system’s ability to achieve goals, which is intelligence in the Legg-Hutter sense. Therefore, any differences between “effectiveness” and “intelligence” must come from the inability of a person to perfectly transmit their goals to the system.

On the other hand, we are interested in how much the person and the system can achieve together, so the person may perform important steps that the system cannot do. It is not so obvious to me if or how we could decompose the capability of this hybrid system into “intelligence” and “controllability”. One option would be to consider second-order control: can the person A + AI system successfully achieve a third party’s (“person B’s”) goals? The problem here is: if person A cannot perfectly align the AI system with their own goals, then what is the “goal” of the person A + AI system? We need some way of determining this if we want to apply Legg and Hutter’s definition of intelligence.

Would it be useful to test effectiveness?

To test effectiveness, we can pair a person with an AI system and ask them, in a limited time, to perform a number of tasks. We can repeat this many times for different people and different AI systems, and get an overall measure of which people + systems can perform which tasks.

The results of such a test are likely to be interesting only if performance on some tasks is generalisable to performance on other tasks. If all AI systems are highly specialised, then we may find that system X can enable someone to accomplish task 1 but not any of tasks 2-100, while system Y can enable task 2 but not 1 or 3-100 etc. In this case, no system would be more effective than any other by the original definition.
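To make the specialised case concrete, here is a minimal sketch of how the results of such a test might be compared. It assumes a hypothetical results table that maps each system to the set of tasks a person completed with it inside the time limit, and it ignores differences between people and exact completion times. The system and task names are made up.

```python
from typing import Dict, List, Set, Tuple


def dominates(a: Set[str], b: Set[str]) -> bool:
    """True if every task completed with the second system was also completed with the first."""
    return b <= a


def comparable_pairs(results: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """List every ordered pair (p, q) where system p dominates system q."""
    names = sorted(results)
    return [
        (p, q)
        for p in names
        for q in names
        if p != q and dominates(results[p], results[q])
    ]


# Fully specialised systems: each enables a different single task.
specialised = {"X": {"task 1"}, "Y": {"task 2"}}
print(comparable_pairs(specialised))  # []

# A more general system: X enables everything Y does, and more.
general = {"X": {"task 1", "task 2"}, "Y": {"task 2"}}
print(comparable_pairs(general))  # [('X', 'Y')]
```

An empty list in the first case corresponds to the situation described above: the test produces results, but no system comes out as more effective than any other.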

If performance is not generalisable, then it seems that we live in a world where we are unlikely to encounter unexpected dangers from AI. So even though we would not learn much of interest about the systems tested, it would be valuable to know if generalisation is never possible. It is also possible that present AI systems are strictly specialised while future systems are general, although I would expect that we would see some trend in this direction at some point before systems became highly general.

A particularly interesting and difficult question from the point of view of safety is whether performance is generalisable from safe tasks to unsafe tasks. If a human + AI system can accomplish tasks 1-100, which for argument’s sake involve safe things like event planning, publishing mathematical research and game playing, can we guess whether they could accomplish unsafe (or potentially unsafe) things like achieving a large amount of political influence or large scale homicide? We don’t want to test these things directly, so we won’t have past examples to help us make predictions. This makes assessing this kind of generalisation particularly difficult.

If there is a test of effectiveness that is interesting — for example, because the performance of human + AI systems turns out to be quite generalisable — it could end up accelerating the speed of AI advancement because researchers will be interested in performing better on an effectiveness benchmark. If safety research needs more time to “catch up” to AI capability, this could be a negative effect of measuring effectiveness. On the other hand, it could also speed up safety research by offering safety researchers a better benchmark on which they need to compete with regular AI capability.

Safety implications of effectiveness

A system that is ineffective is unlikely to present safety concerns, because no-one can figure out how to use it to do anything they couldn’t do already. If task performance is generalisable, a human+AI system that is highly effective presents safety concerns as it is likely to achieve high performance on safe and unsafe tasks.

It would be interesting to me if more than this could be said, but it’s not obvious that it can. For example, the degree to which an AI system is controllable places a limit on its effectiveness: a completely uncontrollable system seems unlikely to help anyone do anything. Thus a highly effective system will be controllable to some extent. However, it is possible for a system to usually be controllable while still exhibiting catastrophic failures of control often enough for such events to be highly likely overall. So this simplistic argument doesn’t establish that highly effective AI systems won’t be unsafe due to control failures.
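A small piece of arithmetic illustrates how a system can be controllable on almost every use and still make a control failure highly likely overall. This is a sketch under a strong simplifying assumption that is not argued for here: each use fails independently with the same probability. The function name and the numbers are illustrative only.

```python
def prob_at_least_one_failure(p_failure: float, n_uses: int) -> float:
    """Probability of one or more failures in n_uses independent uses.

    Assumes each use fails independently with probability p_failure.
    """
    return 1.0 - (1.0 - p_failure) ** n_uses


# Illustrative numbers: the system behaves as intended 99.9% of the time,
# and it is used ten thousand times.
overall = prob_at_least_one_failure(p_failure=0.001, n_uses=10_000)
print(overall > 0.99)  # True
```

Under these assumptions, per-use reliability that sounds high says little about safety once the system is used many times.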


The proposed definition of effectiveness and the arguments I’ve offered here are vague. Nevertheless, I think there are reasons to think that better understanding AI effectiveness could be helpful to safety researchers:

  • To know which kinds of systems are likely to be developed or deployed in the future, thinking about their effectiveness could be more useful than thinking about their intelligence 
    • This is especially important for thinking about the alignment tax
  • We might be able to estimate levels of effectiveness at which systems may be dangerous
  • We might be able to measure AI effectiveness and predict how this will change in the future 
    • Though the definition given here is not suitable for this task
  • Theoretical understanding of the relationship between controllability, intelligence and effectiveness could help clarify questions about whether future AI systems are likely to be dangerous by default

