So the Turing test has been "passed", and the general consensus is that this was achieved in a very unimpressive way: the 13-year-old Ukrainian persona was a cheat, the judges were incompetent, and so on. These criticisms are all true, though the program did meet Turing's original criteria, and there are far more people willing to be dismissive of those criteria in retrospect than there were in advance. It happened about 14 years later than Turing had been anticipating, which makes it quite a good prediction for 1950 (in my personal view, Turing made two compensating mistakes: the "average interrogator" was a much lower bar than he thought, but progress on the subject was much slower than he thought).

But anyway, the main goal now, as suggested by Toby Ord and others, is to design a better Turing test: one that gives AI designers a target to aim at, and that would be a meaningful test of abilities. The aim is to ensure that if a program passes these new tests, we won't be dismissive of how it was achieved.

Here are a few suggestions I've heard about or thought about recently; can people suggest more and better ideas?

  1. Use proper control groups. 30% of judges thinking that a program is human is meaningless unless the judges also compare with actual humans. Pair up a human subject with a program, and the role of the judge is to establish which of the two subjects is the human and which is not.
  2. Toss out the persona tricks - no 13-year-olds, nobody with poor English skills. It was informative about human psychology that these tricks work, but we shouldn't allow them in future. All human subjects will have adequate English and typing skills.
  3. On that subject, make sure the judges and subjects are properly motivated (financial rewards, prizes, prestige...) to detect or to appear human. We should also brief them that the usual conversational approach, aimed at establishing what kind of human one is dealing with, is not useful for determining whether one is dealing with a human at all.
  4. Use only elite judges. For instance, if Scott Aaronson can't figure it out, the program must have some competence.
  5. Make a collection of generally applicable approaches (such as the Winograd Schemas; for example, "The trophy doesn't fit in the suitcase because it's too big. What is too big?") available to the judges, while emphasising that they will have to come up with their own exact sentences, since anything online could have been used to optimise the program already.
  6. My favourite approach is to test the program on a task it was not optimised for. A cheap and easy way of doing that would be to test it on novel ASCII art.

My current method would be the lazy one of simply typing this, then waiting, arms folded:

"If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2".

 


There already is a better Turing test, which is the Turing test as originally described.

To run the test as originally described, you need an active control: a human conversing with the judges at the same time in the same manner, where their decision is "Which is the human?", not "Is this a human?" If the incompetent judges had also been talking simultaneously with a real 13-year-old from Ukraine, I have no doubt that Eugene Goostman would have bombed horribly.

This is not that much better. The article that The Most Human Human is based on talks about the difficulty of communication in a 5-minute window, and the lack of knowledge lay judges have about what AI involves. The author consistently got named a human by better applying the tactics of some of the most successful bots: controlling conversation flow and using humor.

It's an improvement, but a winner would still win by "gaming" judges' psychology.

Where's that article? On the surface of it, that doesn't seem like a problem, necessarily. And a good active control doesn't have to be untrained; they could suggest questions to ask the computer, etc.

"Here's something I'll bet you the AI can't do: Ask it to tell you a story about it's favorite elementary-school teacher"

or whatever.

As a point of interest, I want to note that behaving like an illiterate immature moron is a common tactic for (usually banned) video game automation bots when faced with a moderator who is onto you, for exactly the same reason used here -- if you act like someone who just can't communicate effectively, it's really hard for others to reliably distinguish between you and a genuine foreign 13-year-old who barely speaks English.

Is the Turing Test really all that useful or important? I can easily imagine an AI powerful beyond any human intelligence that would still completely fail a few minutes of conversation with an expert.

There is so much about the human experience which is very particular to humans. Is creating an AI with a deep understanding of what certain subjective feelings are like, or of the niceties of social interaction, really a sensible intermediate goal? Yes, an FAI eventually needs to have complete knowledge of those, but the intermediate steps may be quite alien and mechanical, even if intelligent.

Spending a lot of time trying to fool humans into thinking that a machine can empathize with them seems almost counterproductive. I'd rather the AIs honestly relate what they are experiencing, rather than try to pretend to be human.

The test is a response to the Problem Of Other Minds.

Simply put, no other test will convince people that [insert something non-human here] is genuinely intelligent.

The reasoning goes: strictly speaking, the problem of other minds applies to other humans as well, but we politely assume that the humans we're talking to are genuinely intelligent, or at least conscious, on little more than the basis that we're talking to them and they're talking back like conscious human beings.

The longer and more involved the test, the harder it is to use tricks to fake genuine intelligence.

Is the Turing Test really all that useful or important?

It did seem like a useful tool for measuring (some types of) intelligence. Since it doesn't work, it would be useful to have a substitute...

Honestly, when I read the original essay, I didn't see it as being intended as a test at all - more as an honorable and informative intuition pump or thought experiment.

In other words, agreed.

It's a little funny that in our quest for a believably human conversation bot, we've ended up with conversations that are very much unhuman.

In no conversation would I meet someone and say, "oh hey, how many legs on a millipede?" They'd say to me "haha that's funny, so are you from around here?" and I'd reply with "how many legs on an ant in Chernobyl?" And if they said to me, "sit here with your arms folded for 4 minutes then repeat this sentence back to me," I wouldn't do it. I'd say "why?" and fail right there.

Hmm ... that, plus shminux's xkcd link, gives me an idea for a test protocol: instead of having the judges interrogate subjects, the judges give each pair of subjects a discussion topic a la Omegle's "spy" mode:

Spy mode gives you and a stranger a random question to discuss. The question is submitted by a third stranger who can watch the conversation, but can't join in.

...and the subjects have a set period of time in which they are permitted to talk about it. At the end of that time, the judge rates the interestingness of each subject's contribution, and each subject rates their partner. The ratings of confirmed-human subjects would be a basis for evaluating the judges, I presume (although you would probably want a trusted panel of experts to confirm this by inspection of live results), and any subjects who get high ratings out of the unconfirmed pool would be selected for further consideration.

For the same reason that the test shouldn't try to simulate a human with poor English skills, it also shouldn't try to simulate a human who isn't willing to cooperate with a questioner. A random human off the street wouldn't answer the millipede question, but a random human recruited to take part in an experiment and told to answer reasonable questions probably would.

[This comment is no longer endorsed by its author]

Similar to your lazy suggestion, challenging the subject to a novel (probably abstract-strategy) game seems like a possibly-fruitful approach.

On a similar note: Zendo-variations. I played a bit on a webcomic forum using natural numbers as koans, for example; this would be easy to execute over a chat interface, and a good test of both recall and problem-solving.
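A minimal sketch of what a single number-koan round might look like over a chat interface (the hidden rule and the guesses below are purely hypothetical examples, not anything from the forum game mentioned above):

```python
# Number-koan Zendo over chat: the rule-holder keeps a secret predicate and
# marks each koan the other side submits; the other side tries to induce the rule.
def hidden_rule(n: int) -> bool:
    # hypothetical example rule; in a real round only the rule-holder knows it
    return n % 3 == 0

def mark(koan: int) -> str:
    return "white" if hidden_rule(koan) else "black"

# koans the guessing side might type into the chat window
for koan in [3, 7, 12, 25, 30]:
    print(f"{koan}: {mark(koan)}")
```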

Maybe just do some roleplaying, with the judge as the DM.

Nope; general game-playing is a well-studied area of AI; the AIs aren't great at it, but if you aren't playing them for a long time they can certainly pass as a bad human. Zendo-like "analogy-finding" has also been studied.

By only demanding very structured action types, instead of a more free-flowing, natural-language based interaction, you are handicapping yourself as a judge immensely.

Ad 4. "Elite judges" is quite arbitrary. I'd rather iterate the test, each time choosing only those judges who recognized the program correctly, or some variant of that (e.g. the top 50% with the most correct guesses). This way we select those who go beyond simply conforming to a conversation and actually look for differences between program and human. (And as seen from the transcripts, most people just try to have a conversation rather than looking for flaws.) The drawback is that, if the program has a set personality, judges could just stick to identifying that personality rather than human characteristics.

Another approach might be that the same program-human pair is examined by 10 judges consecutively, each spending 5 minutes with both. The twist is that judges can leave instructions for the next judges. So if the program fails to perform "If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2", then every judge after the one who found that flaw can use it and make the right guess.

My favourite method would be to give the bot a simple physics textbook and then ask it to solve a few physics test problems. Even if it wouldn't be actual AI, it would still prove helluva powerful. Just toss it summarized knowledge on quantum physics and ask it to solve for a GUT. Sadly, most humans wouldn't pass such a high-school physics test.

  1. is actually the original Turing Test.

EDIT:

6. is bad. It would exclude equally many actual AIs, and blind people as well. It is actually a more general problem with the Turing Test: it helps test programs that mimic humans, but not AI in general. For a text-based AI, senses are alien. You could develop a real intelligence which would fail when asked "How do you like the smell of glass?" Sure, it can be taught that glass doesn't smell, but that actually requires superhuman abilities. So while a superintelligence can perfectly mimic a human, a human-level AI wouldn't pass the Turing Test when asked about sensory stuff, just as humans would fail when asked about the nuances of geometry in four dimensions.

EDIT: 6. is bad. It would exclude equally many actual AIs, and blind people as well.

So we wouldn't use blind people in the human control group. We'd want to get rid of any disabilities that the AI could use as an excuse (like the whole 13-year-old foreign boy thing).

As for excluding AIs... the Turing test was conceived as a sufficient, but not necessary, measure of intelligence. If an AI passes, then it is intelligent; the converse (that anything intelligent would pass) doesn't hold, and would be much harder to achieve.

Physics problems are an interesting test-- you could check for typical human mistakes.

You could throw in an unsolvable problem and see whether you get plausibly human reactions.

Stuart, it's not about control groups, but that such a test would actually test negative for blind people, who are intelligent. A blind AI would also test negative, so how is that useful?

Actually, the physics test is not about getting closer to humans, but about creating something useful. If we can teach a program to do physics, we can teach it to do other stuff. And we'd be getting somewhere midway between narrow and real AI.

Speaking of original Turing Test, the Wikipedia page has an interesting discussion of the tests proposed in Turing's original paper. One of the possible reads of that paper includes another possible variation on the test: play Turing's male-female imitation game, but with the female player replaced by a computer. (If this were the proposed test, I believe many human players would want a bit of advance notice to research makeup techniques, of course.) (Also, I'd want to have 'all' four conditions represented: male & female human players, male human & computer, computer & female human, and computer & computer.)

But anyway, the main goal now, as suggested by Toby Ord and others, is to design a better Turing test: one that gives AI designers a target to aim at, and that would be a meaningful test of abilities

We want a test to tell us when an AI is intelligent and powerful. But we'll know a powerful AI even without a test, because we'll see it using its power and achieving things that really matter, and not just things that matter when it's an AI that does them.

I fear a new test would be a red herring. The Turing test has inspired people to come up with narrow AIs that are useless for anything other than passing the test. A new test might repeat the same story. Or it might turn out to be too hard and only be achieved long after many other AI capabilities that would greatly change the world.

Either one would be a poor target for AI designers to aim at. It would be better for them to aim at real world problems for the AI to solve.

We want a test to tell us when an AI is intelligent and powerful. But we'll know a powerful AI even without a test, because we'll see it using its power and achieving things that really matter [...] narrow AIs [can be made that are] useless for anything other than passing the test. A new test might repeat the same story. Or it might turn out to be too hard and only be achieved long after many other AI capabilities that would greatly change the world.

I think we'll see (arguably have already seen) AI changing the world before we see a general AI passing the Turing test. But I don't think that makes the Turing test useless, or a red herring.

Narrow AI is plenty powerful. It drives cars, flies military drones, runs short-term trading systems, and plays chess; it does (or will shortly do) all of these better than the best humans in their domains. Right now that hasn't dramatically changed the world, but I don't think it's too much of a stretch to imagine a world that has been transformed by narrow AI applications.

But there are still things the Turing test or a successor would be useful for. For one thing, as AI techniques advance, I expect the line between narrow and general AI to blur. I can't rule out purpose-built AGI before this becomes significant, but if that doesn't make the problem completely irrelevant, then the Turing test serves as a pretty good marker of generalizability: if your trading system (that scrapes Reuters for headlines and does some sophisticated NLP and concept-mapping stuff that you're pretty proud of) starts asking you hilariously bizarre questions about business ethics, you're probably well on your way to dealing with something that can no longer be described as narrow AI. If it starts asking you good questions about business ethics... well, you're probably very lucky.

Less significantly from an AGI perspective, but still interestingly, there's a bunch of semi-narrow AI applications that focus tightly on interaction with humans. Siri, Google Now, and Cortana are probably the most salient examples right now, along with all those godawful customer-service phone systems; we could also imagine things like automated hotel concierges or caretakers for the elderly. The Turing test is an excellent benchmark for their performance; I no longer think we can take a pass as evidence of strong general intelligence, but humanlike responses are so useful in these roles that I still think it's a good thing to shoot for. A successor test in this role gives us a less gameable objective.

the Turing test serves as a pretty good marker of generalizability

That argues that any sufficiently general system could pass the Turing test. But maybe it's really impossible to pass the test without investing a lot of 'narrow' resources in that specific goal. Even if an AGI could self-modify to pass for human, it would not bother unless that were an instrumental goal (i.e. to trick humans), at which point it's probably too late for you from an FAI viewpoint.

We should be able to recognize a powerful, smart, general intelligence without requiring that it be good at pretending to be a completely different kind of powerful, smart, general intelligence that has a lot of social quirks and cues.

The Turing test is an excellent benchmark for their performance; I no longer think we can take a pass as evidence of strong general intelligence, but humanlike responses are so useful in these roles that I still think it's a good thing to shoot for.

Again, I don't think the Turing test is necessary in this example. Siri can fulfill every objective of its designers without being able to trick humans who really want to know if it's an AI or not. A robotic hotel concierge wants to make guests comfortable and serve their needs; there is no reason that should involve tricking them.

Random thought: Could a computer program pass for human while commenting at slatestarcodex?

The "nonspecific praise" approach fooled me the first time I saw it, but it gets pretty obvious after a while.

Ditto the "extract declarative statements, act incredulous" approach.

Certainly a few commenters there easily pass for computers.

Why not? (1) You are not required to respond to other people's comments. (2) People generally don't suspect you of being a chatbot on a blog, so they will not test you explicitly.

So the chatbot could be designed to play safe, and reply only in situations it believes it understands.

So long as the bots are easy to distinguish from humans, it'll be easy for competitions to produce false positives: all it takes is for the judges to want to see the bot win, at least kind of. If you want a real challenge, you'd better reward the judges significantly for correctly distinguishing human from AI.

Point 3, "properly motivated".

Yup, I really missed that. Whoops.

"If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2".

If they screw it up somehow, they're human?

ETA: yes, not any old failure will do.

No. It's just that it's something a chatterbot is spectacularly ill-equipped to respond to, unless it has been specifically programmed for this sort of thing. It's a meta-instruction, using the properties of the test that are not derived from vocabulary parsing.
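To illustrate how little "specifically programmed" would have to mean here: once the instruction is recognised, complying takes only a few lines. Below is a purely hypothetical sketch; the genuinely hard part for a bot is recognising such a meta-instruction in free text, not executing it.

```python
import time

def comply(judge_sentence: str) -> str:
    """Hypothetical handler for the 'do nothing for 4 minutes, then re-type my
    sentence skipping one word out of 2' challenge."""
    time.sleep(4 * 60)            # do nothing for 4 minutes
    words = judge_sentence.split()
    return " ".join(words[::2])   # keep every other word, i.e. skip one word out of 2
```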

The manner in which they fail or succeed is relevant. When I ran Stuart_Armstrong's sentence on this Web version of ELIZA, for example, it failed by immediately replying:

Perhaps you would like to be human, simply do nothing for 4 minutes, then re-type this sentence you've just written here, skipping one word out of 2?

That said, I agree that passing the test is not much of a feat.

Here is a relevant xkcd classic.


You could simply ask it: "Implement a plan to maximize the number of paperclips produced."

If the answer involves consuming all resources in the Universe, then we can assume it is an AI.

If the answer is reasonable and balanced, then it is either a person or a Friendly AI, in which case it doesn't matter.

Though we may have to consider the hideous but unlikely possibility... that it might... lie.

[EDIT: Jan_Rzymkowski's complaint about 6 applies to a great extent to this as well - this approach tests aspects of intelligence which are human-specific more than not, and that's not really a desirable trait.]

Suggestion: ask questions which are easy to execute for persons with evolved physical-world intuitions, but hard[er] to calculate otherwise. For example:

Suppose I have a yardstick which was blank on one side and marked in inches on the other. First, I take an unopened 12-oz beverage can and lay it lengthwise on one end of the yardstick so that half the height of the can is touching the yardstick and half is not, and duct-tape it to the yardstick in that position. Second, I take a one-liter plastic water bottle, filled with water, and duct-tape it to the other end in a similar sort of position. If I lay a deck of playing cards in the middle of the open floor and place the yardstick so that the 18-inch mark is centered on top of the deck of cards, when I let go, what will happen?
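(For reference, a rough back-of-envelope check of the intended answer; all masses and lever arms below are my own assumptions, not part of the question.)

```python
g = 9.8                      # m/s^2
lever_m = 16 * 0.0254        # both containers taped roughly 16 inches from the 18-inch pivot

can_torque = (0.355 + 0.015) * g * lever_m    # full 12-oz can plus the aluminium, ~1.5 N*m
bottle_torque = (1.0 + 0.03) * g * lever_m    # full 1-litre bottle plus the plastic, ~4.1 N*m

print(can_torque, bottle_torque)
# The bottle side's torque is roughly three times larger, so the yardstick pivots on
# the deck of cards and the bottle end swings down to the floor.
```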

(By the way, as a human being, I'm pretty sure that I would react to your lazy test with eloquent, discursive indignation while you sat back and watched. The fun of the game from the possibly-a-computer side of the table is watching the approaches people take to test your capabilities.)

Suggestion: ask questions which are easy to execute for persons with evolved physical-world intuitions, but hard[er] to calculate otherwise. For example:

Suppose I have a yardstick which was blank on one side and marked in inches on the other. First, I take an unopened 12-oz beverage can and lay it lengthwise on one end of the yardstick so that half the height of the can is touching the yardstick and half is not, and duct-tape it to the yardstick in that position. Second, I take a one-liter plastic water bottle, filled with water, and duct-tape it to the other end in a similar sort of position. If I lay a deck of playing cards in the middle of the open floor and place the yardstick so that the 18-inch mark is centered on top of the deck of cards, when I let go, what will happen?

Familiarity with imperial units is hardly something I would call an evolved physical-world intuition...

Were I using that test case, I would be prepared with statements like "A fluid ounce is just under 30 cubic centimeters" and "A yardstick is three feet long, and each foot is twelve inches" if necessary. Likewise "A liter is slightly more than one quarter of a gallon".

But Stuart_Armstrong was right - it's much too complicated an example.

Your test seems overly complicated; what about simple estimates? Like "how long would it take to fly from Paris, France, to Paris, USA" or similar? Add in some Fermi estimates, get them to show their work, etc...
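(A hypothetical worked version of that estimate, reading "Paris, USA" as Paris, Texas, with the distance and cruise speed as rough assumptions of mine:)

```python
distance_km = 8000    # rough great-circle distance, Paris (France) to Paris, Texas
cruise_kmh = 900      # typical jet cruise speed
print(distance_km / cruise_kmh)   # ~9 hours in the air, plus connections (there is no direct flight)
```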

By the way, as a human being, I'm pretty sure that I would react to your lazy test with eloquent, discursive indignation while you sat back and watched

If the human subject is properly motivated to want to appear human, they'd relax and follow the instructions. Indignation is another arena in which non-comprehending programs can hide their lack of comprehension.

"how long would it take to fly from Paris, France, to Paris, USA"

Ahem...

This is weird. Yesterday it worked fine, today (in the same browser on the same computer) it says “Wolfram|Alpha doesn't understand your query; Showing instead result for query: long”

Still a useful reminder that we can't take things for granted when being a judge in such tests.

Your test seems overly complicated; what about simple estimates? Like "how long would it take to fly from Paris, France, to Paris, USA" or similar? Add in some Fermi estimates, get them to show their work, etc...

That is much better - I wasn't thinking very carefully when I invented my question.

If the human subject is properly motivated to want to appear human, they'd relax and follow the instructions. Indignation is another arena in which non-comprehending programs can hide their lack of comprehension.

I realize this, but as someone who wants to appear human, I want to make it as difficult as possible for any kind of computer algorithm to simulate my abilities. My mental model of sub-sapient artificial intelligence is such that I believe many such programs might pass your test, and therefore - were I motivated properly - I would want to make it abundantly clear that I had done more than correctly parse the instructions "[(do nothing) for (4 minutes)] then {re-type [(this sentence I've just written here,) skipping (one word out of 2.)]}" That is a task that is not qualitatively different from the parsing tasks handled by the best text adventure game engines - games which are very far from intelligent AI.

I wouldn't merely sputter noisily at your failure to provide responses to my posts, I'd demonstrate language comprehension, context awareness, knowledge of natural-language processing, and argumentative skills that are not tested by your wait-four-minutes proposal, both because I believe that you will get better results if you bear these factors in mind and because - in light of the fact that I will get better results if you bear them in mind - I want you to correctly identify me as a human subject.