On the Loebner Silver Prize (a Turing test)

On a longer time horizon, full AI R&D automation does seem like a possible intermediate step to Loebner silver. For July 2024, though, that path is even harder to imagine.

The trouble is that July 2024 is so soon that even GPT-5 likely won't be released by then.

Altman stated a few days ago that they have no plans to start training GPT-5 within the next 6 months. That'd put the earliest training start at Dec 2023.
We don't know much about how long GPT-4 pre-trained for, but let's say 4 months. Given that frontier models have taken progressively longer to train, we should expect GPT-5 to take no less time, which puts the earliest end of its pre-training at Mar 2024.
GPT-4 spent 6 months on fine-tuning and testing before release, and Brockman has stated that future models should be expected to take at least that long. That puts GPT-5's earliest release in Sep 2024.
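
The timeline arithmetic above can be checked with a short sketch. The durations are the ones estimated in the text; pinning the training start to the first day of December 2023 and the deadline to the last day of July 2024 are my own assumptions, and `add_months` is a hypothetical helper written just for this illustration.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a first-of-month date forward by a whole number of months."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


# Durations come from the estimate in the text; the day-of-month is an assumption.
TRAINING_START = date(2023, 12, 1)  # earliest start of GPT-5 training
PRETRAINING_MONTHS = 4              # guessed length of GPT-4's pre-training
POST_TRAINING_MONTHS = 6            # GPT-4's fine-tuning and testing period
DEADLINE = date(2024, 7, 31)        # assumed cutoff: end of July 2024

pretraining_done = add_months(TRAINING_START, PRETRAINING_MONTHS)
earliest_release = add_months(pretraining_done, POST_TRAINING_MONTHS)

print("pre-training done by:", pretraining_done)
print("earliest release by: ", earliest_release)
print("before the deadline? ", earliest_release <= DEADLINE)
```

Under these assumptions the two dates land on the first of April and the first of October 2024, which is the end of March and the end of September in the text's month-level counting. Either way, the earliest release falls after July 2024.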

Without GPT-5 as a possibility, it'd need to be some other project (Gato 2? Gemini?) or some extraordinary system built using existing models (via fine-tuning, retrieval, inner thoughts, etc.). The gap between existing chatbots and Loebner-silver seems huge though, as I discussed in the post; none of those options seems up to the challenge.

Full AI R&D automation would face all of the above hurdles, perhaps with the added challenge of being even harder than Loebner-silver. After all, the Loebner-silver fake human doesn't need to be a genius researcher, since very few humans are. The only aspect in which the automation seems easier is that the system doesn't need to fake being a human (such as by dumbing down its capabilities), and that seems relatively minor by comparison.

What's involved in winning the Loebner Silver Prize?

The link covers up to the 2018 prize. According to Wikipedia, the competition was held once more in 2019, with a dramatically different format (notably not involving judges), and then discontinued. I think the resolution criterion is intended to refer to a more traditional format with judges.

There are two parts to the contest: the selection process and the finals. The selection process is only relevant for deciding which bots to include in the finals, but it's nonetheless interesting to read its transcript: https://web.archive.org/web/20181022113601/https://www.aisb.org.uk/media/files/LoebnerPrize2018/Transcripts_2018.pdf. Mitsuku, the bot that would go on to win the bronze medal, would be considered very bad by today's standards. The selection process consists of 20 pre-decided questions with no follow-up, and my opinion is that an appropriately fine-tuned GPT-4 would likely be indistinguishable from a human on these questions.

However, the finals format is much more difficult. Format details:

Four judges, four bots, four humans, four rounds.

In each round, a judge is paired with one bot and one human, and there are 25 minutes of questioning.

The questioning is in instant-messaging style, much like https://www.humanornot.ai/.

To win the silver medal, the system must fool half the judges. (It seems to me that even a perfect bot would lose sometimes due to random chance, and I don't know how they account for that.)

It is not clear how much the judges and humans know about each other. If the judges know who the human confederates are, that would make the contest considerably harder.
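
The random-chance concern in the format details above can be made concrete with a small binomial sketch. It rests on assumptions that are mine, not the contest's: "half the judges" is read as at least 2 of 4, a perfect bot leaves each judge with an independent coin-flip guess, and `prob_at_least` is a hypothetical helper written for this illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial tail: probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


JUDGES = 4         # four judges in the finals
NEEDED = 2         # assumed reading of "half the judges": at least 2 of 4
P_FOOLED = 0.5     # assumption: a perfect bot reduces each judge to a coin flip

win_perfect = prob_at_least(NEEDED, JUDGES, P_FOOLED)
print(f"perfect bot wins:  {win_perfect:.4f}")
print(f"perfect bot loses: {1 - win_perfect:.4f}")
```

Under these assumptions a perfect bot wins with probability 11/16 and loses with probability 5/16, so it would fail in roughly three out of ten contests purely by chance.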

How hard is it to win?

Though Mitsuku won the bronze prize in 2018, its transcripts show that it was clearly nowhere close to winning the silver prize. That's unsurprising, given that this was in the pre-GPT-3 era.

How much better would today's technology perform? My experience with https://www.humanornot.ai/ was that even just 2 minutes of questioning by an unpracticed judge can typically distinguish human from bot, unless the human is deliberately pretending to be a bot. The Loebner finals, by contrast, involve 100 minutes of questioning (four rounds of 25 minutes) by expert judges.

I can't emphasize enough how much harder the adversarial format makes it. If the bot has any weak point, you can tailor your questioning towards that weak point.

There is also the tricky issue of not showing too much capability, which Worswick discusses in his post:

Being humanlike is not the same as being intelligent though. If I were to ask you what the population of Norway is and you gave me the exact answer, I wouldn’t think that was very humanlike. A more human response would be something like, “no idea” and although this is certainly more human, it is neither intelligent or useful.

I'd guess that, if you have a bot that in all other respects can pass as human, this shortcoming could be addressed relatively easily by fine-tuning or maybe even prompt engineering alone. However, it does mean that an out-of-the-box system would fail.

Conclusion

At the time of writing, the Metaculus community assigns a 25% chance that a system of Loebner-silver-prize capability (along with the other resolution criteria) will exist by July 2024. It is hard for me to imagine how this could happen.

It's too bad that the Loebner prize is no longer held. It would've been a notable milestone for a bot to get a perfect score on selection questions comparable to the 2018 ones, which seems plausible with current technology. Seeing progress in the finals would also have helped with understanding how far we are from a silver medal win.
