The following includes a fictionalized account of a conversation I had with Professor Viliam Lisý at EAGx Prague, with most of the details just plain made up because I forgot how it actually went. Special thanks to Professor Dušan D. Nešić, whom I mistakenly thought I'd had this conversation with, and who ended up providing useful feedback after a very confused discussion on WhatsApp. Credit also goes to Justis from LessWrong, who kindly provided some excellent feedback prior to publication. Any seemingly bad arguments presented are due to my flawed retelling, and are not Dušan's, Justis', or Viliam's.
"AGI has already been achieved. We did it. PaLM has achieved general intelligence, game over, you lose."
"On the contrary, PaLM has achieved nothing of the sort. It is as far from general intelligence as a rock is from a baby."
"You are correct, of course. I completely concede the point, for the purpose of this conversation. Regardless, this brings up a very important question: What would count as “general intelligence” to you?"
"I'm not sure exactly what you're asking."
"What test could be performed which, if failed, would ensure (or at least make likely) that you were not dealing with an AGI, while if passed, would force you to say “yep, that’s an AGI all right”?"
Testing for a minimum viable AGI
The professor was quiet for a moment, deep in thought. Finally, he answered.
“If the AI can replace more than half of all jobs humans can currently do, then it is definitely an AGI—as an average human can do an average number of jobs after a finite training period, it should be no different for an Artificial General Intelligence.”
"Hmm. Your test is technically valid as an answer to my question, but it's too exclusionary. What you are testing for is an AI with capabilities that would exceed those of any human being. There is not one individual on this earth, living or dead, who can do more than half of all jobs humans currently do, and certainly not one who can perform better than the average worker in that many fields. Your test would capture superintelligent AGI just fine, but it would fail at identifying human-level general intelligence. In a way, this test indicates a general conflation between superintelligence and AGI, which is clearly not correct, if we wish to consider ourselves an instance of a 'general intelligence'."
We parted ways that night considering, without resolution, what a “minimum” AGI test would look like, a test which would capture as many potential AGIs as possible without including false positives. We could not agree on, or even fully define, what testable properties an AGI must have at a minimum (or what a non-AGI can have at a maximum before you can no longer call its intelligence “narrow”). We also discussed how to kill everyone on Earth, but that’s a story for another day.
Why is the minimum viable AGI question worth asking?
When debating others, one of the most important steps in the discussion process is making sure that you understand the other person’s position. If you’ve misidentified where your disagreement lies, the debate won't be productive.
One of the most important—and controversial—topics in AI safety is AGI timelines. When is AGI likely to arrive, and once it’s here, how long (if ever) will we have until it all goes FOOM? Resolving this question has important practical ramifications, both for the short and long term. If we can’t agree on what we mean when we say “AGI,” debating AGI timelines becomes a meaningless exchange of words, with neither side understanding the other. I’ve seen people argue that AGI will never exist, and even if we can get an AI to do everything a human can do, that won’t be “true” general intelligence. I’ve seen people say that Gato is a general intelligence, and we are living in a post-AGI world as I type this. Both of these people may make the exact same practical predictions on what the next few years will look like, but will give totally different answers when asked about AGI timelines! This leads to confusion and needless misunderstandings, which I think all parties would rather avoid. As such, I would like to suggest a set of standardized, testable definitions for talking about AGI. We can have different levels of generality, but they must be clearly distinguishable, with different labels, to ensure we’re all on the same page here.
Following is a list of suggestions for these different definitional levels of AGI. I invite discussion and criticism, and would like to eventually see a “canonical” list which we can all refer back to for future discussion. This would preferably be ultimately published in a journal or preprint by a trustworthy figure in AI research, to facilitate easy citation. If you are someone who would be able to help publish something like this, please reach out to me. Consider the following a very rough, incomplete draft of what such a list might look like.
Partial List of (Mostly Testable) AGI Definitions
- “Nano AGI” — Qualifies if it can perform above random chance (at a statistically significant level) on a multiple-choice test found online that it was not explicitly trained on.
- "Micro AGI" — Qualifies if it can reach either state-of-the-art (SOTA) or human-level performance on two or more AI benchmarks which have been mentioned in 10+ papers published in the past year, and which were not explicitly present in its training data.
- “Yitzian AGI” — Qualifies if it can perform at the level of an average human or above on multiple (2+) tests which were originally designed for humans, and which were not explicitly present in its training data.
- “OG Turing AGI” — Qualifies if it can “pass” as a woman in a chat room (with a non-expert tester) for ten minutes, with a success rate higher than a randomly selected cisgender American male.
- "Weak Turing AGI" — Qualifies if it can pass a 10-minute text-based Turing test where the judges are randomly selected Americans.
- “Standard Turing AGI” — Qualifies if it can reliably pass a Turing test of the type that would win the Loebner Silver Prize.
- "Gold Turing AGI" — Qualifies if it can reliably pass a 2-hour Turing test of the type that would win the Loebner Gold Prize.
- "Truck AGI" — Qualifies if it can successfully drive a truck from the East Coast to the West Coast of America.
- "Book AGI" — Qualifies if it can write a 200+ page book (using a one-paragraph-or-less prompt) which makes it to the New York Times Bestseller list.
- "IMO AGI" — Qualifies if it can pass the IMO Grand Challenge.
- "Anthonion AGI" — Qualifies if it is A) Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize, B) Able to score 90% or more on a robust version of the Winograd Schema Challenge (e.g. the "Winogrande" challenge or comparable data set for which human performance is at 90+%), C) Able to score 75th percentile (as compared to the corresponding year's human students) on the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data, D) Able to learn the classic Atari game "Montezuma's revenge" (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play.
- "Barnettian AGI" — Qualifies if it is A) Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files during the course of their conversation, B) Has general robotic capabilities, of the type able to autonomously, when equipped with appropriate actuators and when given human-readable instructions, satisfactorily assemble a circa-2021 Ferrari 312 T4 1:8 scale automobile model, C) Achieve at least 75% accuracy in every task and 90% mean accuracy across all tasks in the Q&A dataset developed by Dan Hendrycks et al., D) Able to get top-1 strict accuracy of at least 90.0% on interview-level problems found in the APPS benchmark introduced by Dan Hendrycks, Steven Basart et al.
- "Lawyer AGI" — Qualifies if it can win a formal court case against a human lawyer, where it is not obvious how the case will resolve beforehand.
- “Lisy-Dusanian AGI” — Qualifies if it can replace more than half of all jobs humans can currently do.
- “Lisy-Dusanian+ AGI” — Qualifies if it can replace all jobs humans can currently do in a cost-effective manner.
- “Hyperhuman AGI” — Qualifies if there is nothing any human can do (using a computer) that it cannot do.
- "Kurzweilian AGI" — Qualifies if it "could successfully perform any intellectual task that a human being can."
- “Impossible AGI” — never qualifies; no silicon-based intelligence will ever be truly general enough.
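As a concrete illustration of the weakest definition above, the "Nano AGI" criterion ("above random chance at a statistically significant level") can be operationalized with an exact one-sided binomial test. This is a minimal sketch of my own, not part of the original proposal, and the 35/100 score is a made-up example:

```python
from math import comb

def above_chance_p(correct: int, n: int, options: int = 4) -> float:
    """One-sided exact binomial p-value: probability of getting at least
    `correct` answers right out of `n` questions by guessing uniformly
    among `options` choices."""
    p = 1.0 / options
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(correct, n + 1))

# A hypothetical model scoring 35/100 on a 4-option multiple-choice test:
print(above_chance_p(35, 100) < 0.05)  # significant at the 5% level
```

Note how low this bar is: on a 100-question, 4-option test, anything above roughly 33 correct answers (versus an expected 25 from guessing) already clears the 5% threshold.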
As for my personal opinion, I think that all of these definitions are far from perfect. If we set a definitional standard for AGI that we ourselves cannot meet, then such a definition is clearly too narrow. A plausible definition of "general intelligence" must include the vast majority of humans, unless you're feeling incredibly solipsistic. And yet almost all of the above tests (with the exception of Turing's) cannot be passed by the vast majority of humans alive! Clearly, our current tests are too exclusionary, and I would like to see an effort to create a "maximally inclusive test" for general intelligence which the majority of humans would be able to pass. Is Turing's criterion as inclusive as we can go, or is it possible to improve it further without including clearly non-intelligent entities as well? I hope this post will encourage further thought on the matter, if nothing else.
If you want to add or change anything please let me know in the comments! I will strike out prior versions of names/descriptions if anything changes, so they can be referred back to.
Past work on AGI definitions exists, of course, often in the context of prediction markets and general AI benchmarks. I am not an expert in the field, and expect to have missed many important technical definitions. My wording may also be unacceptably imprecise at times. As such, I expect that I will ultimately need to partner with an expert to make a list which can be practically usable for formal researchers.
One designed for humans to answer, and picked more-or-less arbitrarily from the plethora of multiple-choice tests easily searchable on Google.
This is just to ensure that the benchmarks aren't being created for the purpose of passing this test, but they can be older, "easy" benchmarks, as long as they're still being actively cited in current literature.
This was originally "qualifies if it can beat random chance at multiple (2+) tasks," a much weaker test, but it was pointed out to me that the definition could be interpreted in a whole bunch of contradictory ways, and is almost trivially weak. I'm also still not fully sure how to define "tasks."
This is based on the original paper laying out the Turing test, which is actually quite interesting (and almost certainly queer-coded, imo!), and is worth an in-depth essay of its own.
Adapted from https://arxiv.org/abs/1705.08807
Taken almost word-for-word from Anthony's excellent Metaculus question here:
Taken almost word-for-word from Matthew_Barnett's excellent Metaculus question here:
An 'adversarial' Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor.
or the equivalent of a
Top-1 accuracy is distinguished, as in the paper, from top-k accuracy in which k outputs from the model are generated, and the best output is selected.
This was vaguely inspired by https://sd-marlow.medium.com/cant-finish-what-you-don-t-start-7532078952d2, in particular the line:
understanding criminal law, or the history of the Roman Empire, is more of an application of AGI
This basic definition was proposed by professor Viliam Lisý in his conversation with me, but the exact wording used here was suggested by professor Dušan D. Nešić.
From Ray Kurzweil's 1990 book The Age of Intelligent Machines.
With due respect to Kurzweil, I think his definition is rather flawed, to be honest (personal rant incoming). Name me a single human who can "successfully perform any intellectual task that a human being can." Try to find even one person who is successful at every task anyone can possibly do. Such a person does not exist. All humans are better in some areas and worse in others, if only because we do not have infinite time to learn every possible skillset (though to be fair, most other definitions on this list run into the same issue). See the closing paragraph of this post for more.
Your list is missing my favorite measure of AI, the Wozniak Test. To pass the Wozniak Test, a robot must be able to enter a random American home, brew coffee using the home's coffee machine, and serve it to a human. The robot must NOT have a coffee maker built in; that would be cheating.
I'm delighted to have been cited in this post. However, I must now note that this operationalization is out of date. I have a new question on Metaculus that I believe provides a more thorough and clearer definition of AGI than the one referenced here. I will quote the criteria in full:
I'm working on these lines to create an easy to understand numeric evaluation scale for AGIs. The dream would be something like: "Gato is AGI level 3.5, while the average human is 8.7." I believe the scale should factor in that no single static test can be a reliable test of intelligence (any test can be gamed and overfitted).
A good reference on the subject is "The Measure of All Minds" by José Hernández-Orallo.
Happy to share a draft, send me a DM if interested.
I think that building compound metrics here is just another way to provide something for people to Goodhart - but I've written much more about the pros and cons of different approaches elsewhere, so I won't repeat myself here.
Thanks for the link, I will check it out.
"I’ve seen people argue that AGI will never exist, and even if we can get an AI to do everything a human can do, that won’t be “true” general intelligence. I’ve seen people say that Gato is a general intelligence, and we are living in a post-AGI world as I type this. Both of these people may make the exact same practical predictions on what the next few years will look like, but will give totally different answers when asked about AGI timelines!"
This is an amazingly good point. It's also made me realise that I don't have a solid definition of what "AGI" means to me either. More importantly, coming up with a definition would not solve the general case - even if I had a precise definition of what I meant, I'd have to rewrite it every time I wanted to speak about AGI.
Excellent post, and I would definitely like to see more knowledgeable people than I make predictions based on these definitions, such as "I wouldn't worry about an AI that passed <Definition X> but would be very worried about one that passed <Definition Y>" or "I think we're 50% likely to get <Definition Z> by <Year>".
I concur with your last paragraph, and see it as a special case of rationalist taboo (taboo "AGI"). I'd personally like to see a set of AGI timeline questions on Metaculus where only the definitions differ. I think it would be useful for the same forecasters to see how their timeline predictions vary by definition; I suspect there would be a lot of personal updating to resolve emergent inconsistencies (extrapolating from my own experience, and also from ACX prediction market posts IIRC), and it would be interesting to see how those personal updates behave in the aggregate.
It's probably worth noting you seem to be empirically wrong: I'm pretty confident I'd be able to do >half of human jobs, with maybe ~3 weeks of training, if I was able to understand all human languages (obviously not in parallel!) Many others here would be able to do the same.
The criterion is not as hard as it seems, because there are many jobs like cashiers or administrative workers or assembly line workers which are not that hard to learn.
Depends on how you define the measure over jobs. If you mean "the jobs of half of all people," probably true. If you mean "half of the distinct jobs as they are classified by NAICS or similar," I think I disagree.
This seems generally correct and important - great work identifying an issue and deciding to try to work on it!
As XKCD points out, adding definitions just increases their number instead of replacing existing ones. Given that, the partial list has lots of things that are not (quite) ready to be operationalized or are not decidable, and they are nearly duplicative of other definitions that aren't being referenced or compared.
The most helpful thing would now be to spend more time to go through the current definitions and tests, and catalogue them. A few you're missing, off the top of my head:
- Ross Gruetzemacher's definitions - Transformative AI, versus Radically transformative AI
- HLMI for job replacement, based on this - https://www.openphilanthropy.org/blog/some-background-our-views-regarding-advanced-artificial-intelligence#Sec1
- The various definitions from here: https://parallel-forecast.github.io/AI-dict/ (where I've noted that lots are ambiguous, etc.) And I'd be happy if you wanted to add to that / include more things in the list there.
Strong upvote for tackling an important problem. I've tried to write something along these lines, and made no progress.
But I still see lots of room for improvement.
I'd like to see some more sophisticated versions of the Turing tests: use judges that have a decent track record on Turing tests, and have them take longer than 2 hours.
I don't think the Nano AGI test should rely on statistical significance - that says more about the sample size than about the effect size.
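The point about significance versus effect size can be made concrete with a quick sketch (mine, not the commenter's, using a normal approximation to the binomial): the same tiny 2-point edge over 25% chance accuracy is nowhere near significant on a 100-question test, yet overwhelmingly significant on a 10,000-question one.

```python
from math import erf, sqrt

def p_value(accuracy: float, n: int, chance: float = 0.25) -> float:
    """One-sided p-value (normal approximation) for scoring `accuracy`
    on an n-question test when pure guessing would average `chance`."""
    se = sqrt(chance * (1 - chance) / n)  # standard error of the mean
    z = (accuracy - chance) / se
    return 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail normal probability

# The same 27% accuracy, at two different test sizes:
print(p_value(0.27, 100))     # not significant (p well above 0.05)
print(p_value(0.27, 10_000))  # highly significant (p far below 0.001)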
Improved versions of the Turing test seem like a natural place to start. We've probably learned more about what language models are capable of in the last two years (since the release of GPT-3) than in all previous years. The Feigenbaum test looks much better to me than the Loebner Silver Prize, for example.
An ape is also generally intelligent; so is a dolphin or a dog. None of the definitions presented brings such animal intelligence into its fold. And if a dog is generally intelligent, then perhaps so is a cat, a rat, and even a reptile, even without a System 2 or a neocortex.
If anyone tries to draw a line and say that an ape is not generally intelligent, or that an ape is but a dog is not, and so on, they have the wrong approach to AGI.
Interesting! Do you have any ideas for how to operationalize that view?
How familiar are you with Chollet's paper "On the Measure of Intelligence"? He disagrees a bit with the idea of "AGI" but if you operationalize it as "skill acquisition efficiency at the level of a human" then he has a test called ARC which purports to measure when AI has achieved human-like generality.
This seems to be a good direction, in my opinion. There is an ARC challenge on Kaggle, and so far AI is far below the human level. On the other hand, "being good at a lot of different things", i.e. task performance across one or many tasks, is obviously very important to understand, and Chollet's definition is independent of that.
ARC is a nice attempt. I also participated in the original challenge on Kaggle. The issue is that the test can be gamed (as everyone on Kaggle did) by brute-forcing over solution strategies.
An open-ended or interactive version of ARC may solve this issue.
A definition of or a test for AGI should not be based on an agent’s level of knowledge or ability because the intelligence of a general-purpose intelligent agent should always be increasing. Humans as babies start out with very little intelligence. We gain intelligence as we grow and experience the world. Therefore the measure of general-purpose intelligence should be based on the ability to learn and adapt and not on what is known or can be done. An AGI agent should be able to learn and understand any knowledge we could discover and learn to perform any skill we can learn how to do provided that it is given the required interface devices to interact with the environment.
It has to be able to answer questions about itself and improve statements.
Intelligence is problem-solving ability, measured by the ability to pass elementary school tests. If you can score 50% on a grade 1 test but only 49% on a grade 2 test, then you are at a grade 1 level. General intelligence refers to problem-solving ability across multiple modalities, which is measured by your percentile on a standardized test. For example, a child whose score on a standardized IQ test qualified them for an academic enrichment program would be considered 'gifted'. I think this is an absurd metric for personhood because it's discriminatory, and existence is experiential. I think Turing completeness is the most important metric for personhood, because it's easy to add an internal state, a reward function, and a self-attention mechanism, and to teach free will to any Turing-complete machine. However, if you really care about ranking intelligence, then you can use competitive games as a testing environment, where beating a team in an organized event makes you at least as competent as the team you beat. Accuracy improves when there is a prize for winning, because participants have an incentive to perform at their peak. For general intelligence I would weigh teamwork-oriented games with a Nash equilibrium most highly, because they reward accurate predictions, spatial awareness, probabilistic models, and empathy. Though a better metric of empathy would be performing self-disadvantageous acts of altruism, like sperm whales defending another species from predators.
The G stands for general, so I don't see why you would need a bunch of special-purpose definitions of AGI. You seem to disbelieve in general intelligence for the reasons in this footnote:
... But general intelligence doesn't have to be a claim about a specific humans, it can be a claim about humans in aggregate ... and still be distinct from superintelligence.
If it's true that no human can perform every intellectual task doable by at least one human (which is probably the case) then "can perform any intellectual task a human can" can't be a reasonable criterion for calling a specific AI system "intelligent" or "generally intelligent" or whatever. So, to whatever extent Kurzweil's criterion is intended to be used that way, maybe it's a bad criterion.
As you say, we can apply the term to populations rather than individuals, and maybe it's interesting to ask not "when will there be a computer system that can do whatever humans can?" but "when will computer systems, collectively, be able to do all the things humans can, collectively?".
AI and AGI aren't supposed to be synonyms. Defining AGI in terms of a specific human's capabilities is pretty pointless. Defining it in terms of an average or a maximum distinguishes AGI from AI (and ASI).
I don't see why they can't both be interesting