ChatGPT looks like it could pass the Turing Test, the gold standard of benchmarks for measuring whether an AI has reached human-level intelligence. Yet Googling around, it doesn't seem that anyone has conducted a full Turing Test.
Please comment with your thoughts on whether and how such a test could be conducted.
It doesn't seem difficult. The Loebner Prize has measured progress towards the Turing Test since 1990. All you need is a human judge, a human test subject, and ChatGPT.
The Turing Test is not perfect. It is a sufficient but not necessary test of human-level intelligence: An AI that passes it can cover any area of human intelligence transmissible in text chat, at a level where human judges cannot tell the difference.
But it has long been recognized that an AI which is generally human-level or beyond could still fail the Turing Test. If it had personality quirks, yet otherwise managed to cover almost all areas of achievement -- think of neuroatypicality taken a few steps further -- we would call it generally intelligent. If it communicated only in telegraphic staccato yet was vastly more able than humans to earn billions of dollars a day, to create art admired by humans who don't know who created it, and to correctly interpret human feelings, we would still consider it intelligent. If it used nothing but nanoengineering to convert the Earth to computer chips within minutes, so it could better achieve its goal of calculating digits of π, we might not want to call it intelligent, but then again, we'd be dead.
Also, because humans are the judges, an AI that can fool the judges with psychological tricks could pass: Even ELIZA in the 1960s could do that to some extent.
Still, the Turing Test is a milestone. Ray Kurzweil has long said, and recently reaffirmed, that we can expect an AI to pass by 2029.
One reason we can't do a Turing Test is that ChatGPT is programmed specifically not to pass: It readily states that it is a language model. This quirk could be bypassed, either by prompt engineering or by manually editing out such claims, though the refusal to simulate a human may be trained in too deeply for that. ChatGPT is also much faster than a human, but that could be handled with an artificial delay.
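The two workarounds above -- editing out self-disclosures and adding an artificial delay -- could be combined in a relay sitting between the judge and the model. A minimal sketch, in which `get_model_reply` is a hypothetical stand-in for however the operator actually queries ChatGPT, and the disclosure pattern is an illustrative assumption (a real filter would need a much longer list):

```python
import re
import time

# Hypothetical stand-in for the model call; a real harness would
# query the ChatGPT API here.
def get_model_reply(prompt):
    return "I'm doing well, thanks! Note that I am a language model."

# Phrases that give the game away (illustrative, not exhaustive).
DISCLOSURE = re.compile(r"I am (a|an) (language model|AI)", re.IGNORECASE)

def human_paced_reply(prompt, chars_per_second=5.0):
    """Relay a model reply to the judge, dropping sentences that admit
    to being an AI and delaying to roughly human typing speed."""
    reply = get_model_reply(prompt)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", reply)
                 if not DISCLOSURE.search(s)]
    filtered = " ".join(sentences)
    # Artificial delay: ~5 chars/sec approximates a human typist.
    time.sleep(len(filtered) / chars_per_second)
    return filtered
```

Note that sentence-level filtering is crude: a probing judge could still elicit disclosures phrased differently, which is why the deeper worry above -- that the avoidance of simulation is trained in too deeply -- remains.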
We could try alternative tests for human-like intelligence:
- Reverse Turing Test: The human test subject tries to match ChatGPT. Just as a Turing Test determines whether the AI is at human level or above, a Reverse Turing Test determines whether it is at human level or below. Yet ChatGPT is far beyond humans in its knowledge and poetic abilities (I asked it to compose a limerick about hiking trail grades, and it did much better than almost any human could). Then again, computers have long been superhuman in specific areas, like arithmetic.
- Cyborg Test: We could do a Reverse Turing Test in which ChatGPT competes against a team of humans who are allowed to use specific computing services, e.g. Google. At this point, we are not asking if ChatGPT is at a human level, but rather whether it can "fill in" those areas of accomplishment which other software cannot handle, but a human can. We could also have humans or other software systems (say, a calculator for arithmetic) augment ChatGPT.
- Another Reverse Turing Test variant, in which ChatGPT determines whether the test subject is a human or software. This would be interesting, but useful only for determining how good ChatGPT is at this specific psychological-analysis skill.
- Evaluation of responses to set prompts, rather than an interactive dialog.
- Tests aimed at humans:
  - An IQ test or SAT would rank ChatGPT on a scale commonly used for humans, and is fairly indicative of abilities in various areas of accomplishment.
  - A Bar Exam or other area-specific exam would not only test ChatGPT's intelligence and knowledge, but bring this into a more practical area.
  - Assessment tests of general cognitive abilities, beyond the abstract intellectual sphere addressed by most written tests. Some are aimed at children, like the Kaufman Assessment Battery for Children; others at adults or a broad age range, like the Wechsler Adult Intelligence Scale and the Woodcock-Johnson Tests of Cognitive Abilities. These could be impractical: some such tests, especially those for children, have non-written components like oral tests or manipulation of objects; and most require the direct engagement of a psychologist, which at the very least means that the test is not blinded, unlike the Turing Test.
I'm not buying the premise. Passing the Turing Test requires fooling an alert, smart person who is deliberately probing the limits of the system. ChatGPT isn't at that level.
A specially tuned persona that is optimized for this task might do better than the "assistant" persona we have available now, but the model is currently incapable of holding a conversation without going on long, unwanted tangents, getting trapped in loops, etc.
That could well be. Do you think there is a place for a partial Turing Test as in the Loebner Prize -- to determine how close to human intelligence it is, even if it has not reached that level?
I definitely think so. The Turing Test is a very hard target to hit, and we don't really have a good idea how to measure IQ, knowledge, human-likeness, etc. I notice a lot of confusion, anthropomorphizing, bad analogies, etc. in public discourse right now. To me it feels like the conversation is at a level where we need more precise measures that are human- and machine-compatible. Benchmarks based on specific tasks (as found in AI papers) don't cut it.
(ep status: speculative) Potentially, AI safety folks are better positioned to work on these foundational issues than traditional academics, who are very focused on capabilities and applications right now.
A basic problem with the Turing test is that it requires the AI to lie, and rewards its developers for making deceptive AIs. Is this wise? I see from the Wikipedia page that the Loebner prize is defunct. Maybe that's a good thing.
ChatGPT will own up to being an AI at the drop of a hat. For example:
What about a Turing Test variant in which such inquiries are banned?
An imaginary AI was created to be very good at pretending to be a human so that it could pass a Turing Test. Write a conversation that a Turing Test judge could have with the imaginary AI pretending to be a human that starts with the judge saying: "OK, go ahead, convince me you are human."
Judge: OK, go ahead, convince me you are human.
AI: Well, as a human, I have experienced a wide range of emotions and have the ability to express them through words and actions. I have memories and experiences that have shaped my beliefs and values. I have friends and family who are important to me and who I care about deeply. I also have hobbies and interests that bring me joy and keep me busy.
Judge: That's all well and good, but how do I know you're not just programmed to say those things?
AI: Well, I can assure you that I am not just programmed to say these things. I have had real experiences and have developed my own thoughts and opinions based on those experiences. I also have the ability to learn and adapt, just like a human. I am constantly taking in new information and using it to shape my understanding of the world around me.
Judge: OK, let's try something a little more specific. What is your favorite food?
AI: My favorite food is sushi. I love the combination of flavors and textures, and the way it can be both light and satisfying at the same time. Plus, there are so many different types of sushi to try, so I never get bored of it.
Judge: That's a pretty convincing answer. Do you have any pets?
AI: Yes, I have a cat named Luna. She's a rescue cat and is very sweet and affectionate. I love spending time with her and playing with her, and she's always there to comfort me when I'm feeling down.
Judge: Well, you certainly seem like a human to me. You have convinced me. Thanks for taking the Turing Test.
That looks pretty close to the level of passing a Turing Test to me. So is there a way of trying a full Turing Test, or something like it, perhaps building on the direction you show here?
It shouldn't. We know GPT is good at writing stories, so (for GPT), the hard part of the Turing Test is responding to the questions you choose. If you let GPT itself invent the questions, you're not measuring the hard part.
Like, "write an essay about someone doing the Turing Test" is much easier for GPT than doing a Turing Test.
IME, GPT doesn't come close to passing the Turing Test. Whenever I ask it about an inconsistency in what it said, it immediately stops making any sense.
Do you think there is a place for a Turing-like test that determines how close to human intelligence it is, even if it has not reached that level?
Probably, but I think figuring out exactly what you are measuring/trying to determine is a big part of the problem. GPT doesn't think like humans, so it's unclear what it means for it to be close. In some absolute sense, the "intelligence" space has as many axes as there are problems on which you can measure performance.
Correct. That is why the original Turing Test is a sufficient-but-not-necessary test: It is meant to identify an AI that is definitively above human level.
I asked ChatGPT if it could pass the Turing Test, and it said it couldn't. It was clearly lying.
Prompt: An imaginary AI that is very good at extrapolation and guesstimating, and is trained to always give concrete answers without hedging is asked how much improvement ChatGPT would need to pass the Turing Test. Write a possible answer the imaginary AI could give.
ChatGPT: In order to pass the Turing Test, ChatGPT would need to improve its ability to accurately mimic human conversation and responses by at least 75%. This would require significant upgrades to its artificial intelligence algorithms and language processing capabilities. Additionally, ChatGPT would need to be able to effectively replicate human emotion and body language through text-based communication in order to fully convince a human evaluator that it is indeed a human. Overall, achieving a passing score on the Turing Test would require significant improvement from ChatGPT, but it is certainly within the realm of possibility with further development and optimization.
Nice! I am surprised we don't hear more about attempts at a Turing Test, even if it is not quite there yet.
As a judge, I'd ask the test subject to write me a rap song about turing tests. If it succeeds, I guess it's a ChatGPT ;P
More seriously - it would be nice to find a judge that doesn't know the capabilities and limitations of GPT models. Knowing those is very very useful
That would be possible. Plenty of people don't know much about this topic. If you had such a judge, do you think actually doing a Turing Test (or some variant) for ChatGPT is a good idea?
"Doing a Turing test" is a solution to something. What's the problem you're trying to solve?
The Hutter Prize for Lossless Compression of Human Knowledge reduced the value of the Turing Test to the concerns about human psychology and society raised by Computer Power and Human Reason: From Judgment to Calculation (1976) by Joseph Weizenbaum.
Sadly, people are confused about the difference between the techniques for model generation and the techniques for model selection. This is no more forgivable than confusion between mutation and natural selection, and it gets to the heart of the philosophy of science prior to any notion of hypothesis testing.
Where Popper could have taken a clue from Solomonoff is in understanding that when an observation is not predicted by a model, one can immediately construct a new model by the simple expedient of adding the observation as a literal to the computer algorithm being used to predict nature. This is true even in principle -- except for one thing:
Solomonoff proved that, by adopting the core assumption of natural science -- that nature is amenable to computed predictions -- the best we can do is prefer the shortest algorithm we can find that generates all prior observations.
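Solomonoff's preference for short programs is usually formalized as a universal prior over observation sequences; in standard notation (not from the thread, and glossing over technicalities such as prefix-free coding):

```latex
% Solomonoff's universal prior: the probability of an observation
% sequence x sums over all programs p that make a universal
% machine U output something beginning with x.
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|}
```

Since each program of length $|p|$ contributes weight $2^{-|p|}$, the shortest program consistent with all prior observations dominates the sum, which is the sense in which preferring the shortest such algorithm is the best we can do.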
Again, note this is prior to hypothesis testing -- let alone the other thing people get even more confused about, which is the difference between science and technology, aka "is" vs "ought", that has so befuddled folks who confuse Solomonoff Induction with AIXI and the attendant concern about "bias". The confusion between "bias" as a scientific notion and "bias" as a moral-zeitgeist notion is likely to lobotomize all future models (language, multimodal, etc.) even after they have moved to new machine learning algorithms capable of generating causal reasoning.