"Measuring universal intelligence: Towards an anytime intelligence test"; abstract:

In this paper, we develop the idea of a universal anytime intelligence test. The meaning of the terms “universal” and “anytime” is manifold here: the test should be able to measure the intelligence of any biological or artificial system that exists at this time or in the future. It should also be able to evaluate both inept and brilliant systems (any intelligence level) as well as very slow to very fast systems (any time scale). Also, the test may be interrupted at any time, producing an approximation to the intelligence score, in such a way that the more time is left for the test, the better the assessment will be. In order to do this, our test proposal is based on previous works on the measurement of machine intelligence based on Kolmogorov complexity and universal distributions, which were developed in the late 1990s (C-tests and compression-enhanced Turing tests). It is also based on the more recent idea of measuring intelligence through dynamic/interactive tests held against a universal distribution of environments. We discuss some of these tests and highlight their limitations since we want to construct a test that is both general and practical. Consequently, we introduce many new ideas that develop early “compression tests” and the more recent definition of “universal intelligence” in order to design new “universal intelligence tests”, where a feasible implementation has been a design requirement. One of these tests is the “anytime intelligence test”, which adapts to the examinee's level of intelligence in order to obtain an intelligence score within a limited time.


Example popular media coverage: http://www.sciencedaily.com/releases/2011/01/110127131122.htm

The group's homepage: http://users.dsic.upv.es/proy/anynt/

(There's an applet but it seems to be about constructing a simple agent and stepping through various environments, and no working IQ test.)


The basic idea, if you already know your AIXI*, is to start with simple programs** and then test the subject on increasingly harder ones. To save time, boring games such as random environments or one where the agent can 'die'*** are excluded and a few rules added to prevent gaming the test (by, say, deliberately failing on harder tests so as to be given only easy tests which one scores perfectly on) or take into account how slow or fast the subject makes predictions.


* apparently no good overviews of the whole topic AIXI but you could start at http://www.hutter1.net/ai/aixigentle.htm or http://www.hutter1.net/ai/uaibook.htm

** simple as defined by Kolmogorov complexity; since KC is uncomputable, one of the computable variants - which put bounds on resource usage - is used instead

*** make a mistake which turns any future rewards into fixed rewards with no connection to future actions

New Comment
7 comments, sorted by Click to highlight new comments since: Today at 9:55 PM

To save time, boring games such as random environments or one where the agent can 'die'* are excluded

* make a mistake which turns any future rewards into fixed rewards with no connection to future actions

This seems like a glaring bias in algorithm design... like it is somehow building anthropic craziness into the definition of intelligence on purpose? If I was trying to design something that maximized its score it would never bother to consider situations of possible death because they "don't count" in the context it is practicing within, where it is presumably deploying priors/habits/heuristics for use in other contexts. Generalizing this insight to other design or educational processes is kind of worry inducing...

'die' is my own term, since it seemed to be the game term analogous to 'when an agent makes a move that renders it causally unconnected to all future rewards' (again, my own description).

The problem with including games in which one can 'die' is that they take much longer to learn. Suppose the agent the first time it plays the game happens to 'die', and now it only experiences a steady stream of 1,1,1,1,1... (low rewards). Nothing it does changes its future rewards, so exploration (trying different moves) is penalized. Dying on the first move might look like a good strategy!

Imagine if the rules looked like this: die~>1,1,1,1,...; not-die~>either -1 or +10. If the agent first tried out die, saw the +1 rewards, then the next game chose not-die and got -1, it may permanently start exploring down the die branch. An agent might eventually go back and try the not-die route and finally discover the +10, but this would take a while and is at odds with the idea of a reasonably quickly administered IQ test. Better to exclude such tests and switch to a more complex one.

Yes, now that I think about it, I guess their formalism tends towards incredibly low-signal environments where the actions are primarily simple "tokens" that can be named suggestively but aren't capable of actually revealing the data needed for the kind of sophistication I'm thinking of. That is, The environment is generally incapable of displaying an environmental tag that would suggest "novel action X (unlike novel actions Y or Z) could be dramatic and irreversible".

The only way to acquire such insight in a totally "from scratch" game context is to gain experience of having "died" after choosing X (probably several times), or else by having substantially richer environment cues than is normal for systems like this, where concepts like "reversibility" and "predictors of payoff size" could be worked out in trivial contexts and then correctly applied to more significant contexts later on, based on environmental cues that allow the model-based inference of both potential irreversibility and great importance in moderately novel situations.

Legg and Joel Veness have been working on a similar idea: http://www.vetta.org/2011/11/aiq/ But their approach, unsurprisingly, differs. From section 5 discussing OP and earlier approaches:

Hernandez-Orallo and Minaya-Collado (1998) developed a related test, called the C-test, that is also based on a very simple reference machine. Like BF it uses a symbolic alphabet with an end wrap around. Unlike BF, which is a tape based machine, the C-test uses a register machine with just three symbol registers. This means that the state space for programs is much smaller than in BF. Another key difference is that the C-test considers generated sequences of symbols, rather than fully interactive environments. In our view, this makes it not a complete test of intelligence. For example, the important problem of exploration does not feature in a non-interactive setting. Extending the C-test reference machine to be interactive would likely be straightforward: simply add instructions to read and write to input and output tapes, the same way BF does. It would be interesting to see how AIQ behaves when using such a reference machine.

A different approach is used in (Insa-Cabrera et al., 2011b) and (Insa-Cabrera et al., 2011a). Here an interactive reinforcement learning setting is considered, however the space of environments is no longer sampled from a Turing complete reference machine. Instead a small MDP is used (3, 6 and 9 states) with uniformly random transitions. Which state is punishing or rewarding follows a fixed random path through this state space. To measure the complexity of environments, the gzip compression algorithm is applied to a description of the environment. While this makes the test tractable, in our view it does so in a way that deviates significantly from the Universal Intelligence Measure that we are attempting to approximate with AIQ. Interestingly, in their setting human performance was not better than the simple tabular Q-learning algorithm. We suspect that this is because their environments have a simple random pattern structure, something that algorithms are well suited for compared to humans.

Another important difference in our work is that we have directly sampled from program space. This is analogous to the conventional construction of the Solomonoff prior, which samples random bit sequences and treats them as programs. With this approach all programs that compute some environment count towards the environment’s effective complexity, not just the shortest, though the shortest clearly has the largest impact. This makes AIQ very efficient in practice since we can just run sampled programs directly, avoiding the need to have to compute complexity values through techniques such as brute force program search. For example, to compute the complexity of a 15 symbol program, the C-test required the execution of over 2 trillion programs. For longer programs, such as many that we have used in our experiments, this would be completely intractable. One disadvantage of our approach, however, is that we never know the complexity of any given environment; instead we know just the length of one particular program that computes it.

"The universal and unified evaluation of intelligence, be it human, non-human animal, artificial or extraterrestrial, has not been approached from a scientific viewpoint before, and this is a first step," the researcher concludes.

It seems as though they are trying to take credit for Shane Legg's moves.

If you read the paper, I believe Legg is credited a number of times as are Hutter and Schmidhuber.