The Metaculus question "When will the first weakly general AI system be devised, tested, and publicly announced?" has four resolution criteria, and in my opinion, reliably passing a Loebner-Silver-prize-equivalent Turing test is the hardest, since it is the only one that is adversarial. (See Wikipedia for background on the Loebner Prize.)
Though the Loebner Prize is no longer being run, we can get information about its format from the Wayback Machine: https://web.archive.org/web/20190220020219/https://aisb.org.uk/events/loebner-prize
The link covers up to the 2018 prize. According to Wikipedia, the competition was held once more in 2019, with a dramatically different format (notably not involving judges), and then discontinued. I think the resolution criterion is intended to refer to a more traditional format with judges.
There are two parts to the contest: the selection process and the finals. The selection process is only relevant for deciding which bots to include in the finals, but it's nonetheless interesting to read its transcript: https://web.archive.org/web/20181022113601/https://www.aisb.org.uk/media/files/LoebnerPrize2018/Transcripts_2018.pdf. The bot that would go on to win the bronze medal, Mitsuku, would be considered very bad by today's standards. The selection process consists of 20 pre-decided questions with no follow-up, and my opinion is that an appropriately fine-tuned GPT-4 would likely be indistinguishable from a human on these questions.
However, the finals format is much more difficult.
It is not clear how much the judges and humans know about each other. If the judges know who the human confederates are, that would make the contest considerably harder.
Mitsuku's creator, Steve Worswick, wrote a recap of his 2018 win, and also posted Mitsuku's finals transcripts.
Though Mitsuku won the bronze prize in 2018, its transcripts make clear that it was nowhere close to winning the silver prize. That's unsurprising, given that this was the pre-GPT-3 era.
How much better would today's technology perform? My experience with https://www.humanornot.ai/ was that even just 2 minutes of questioning by an unpracticed judge is typically enough to distinguish a human from a bot, unless the human is deliberately pretending to be a bot. The Loebner finals, by contrast, involve 100 minutes of questioning by expert judges.
I can't emphasize enough how much harder the adversarial format makes it. If the bot has any weak point, you can tailor your questioning towards that weak point.
There is also a tricky issue of not showing too much capability, which Worswick discusses in his post.
> Being humanlike is not the same as being intelligent though. If I were to ask you what the population of Norway is and you gave me the exact answer, I wouldn’t think that was very humanlike. A more human response would be something like, “no idea” and although this is certainly more human, it is neither intelligent or useful.
I'd guess that, if you have a bot that in all other respects can pass as human, this shortcoming could be addressed relatively easily by fine-tuning or maybe even prompt engineering alone. However, it does mean that an out-of-the-box system would fail.
At the time of writing, the Metaculus community assigns a 25% chance to a system of Loebner-silver-prize capability (along with the other resolution criteria) existing by July 2024. It is hard for me to imagine how this could happen.
It's too bad that the Loebner Prize is no longer held. It would have been a notable milestone for a bot to get a perfect score on selection questions comparable to the 2018 ones, which seems plausible with current technology. Seeing progress in the finals would also have helped with understanding how far we are from a silver medal win.
Focus instead on imagining how we could get complete AI software R&D automation by then; that is both more important than Loebner-silver-prize capability and implies it (or rather, it implies that the capability will be reached after a brief period sometimes called an "intelligence explosion").
On a longer time horizon, full AI R&D automation does seem like a possible intermediate step to Loebner silver. For July 2024, though, that path is even harder to imagine.
The trouble is that July 2024 is so soon that even GPT-5 likely won't be released by then.
Without GPT-5 as a possibility, it would need to be some other project (Gato 2? Gemini?) or some extraordinary system built on top of existing models (via fine-tuning, retrieval, inner thoughts, etc.). The gap between existing chatbots and Loebner-silver seems huge, though, as I discussed in the post; none of those approaches seems up to the challenge.
Full AI R&D automation would face all of the above hurdles, perhaps with the added challenge of being even harder than Loebner-silver. After all, the Loebner-silver fake human doesn't need to be a genius researcher, since very few humans are. The only aspect in which the automation seems easier is that the system doesn't need to fake being a human (such as by dumbing down its capabilities), and that seems relatively minor by comparison.