Operationalizing two tasks in Gary Marcus’s AGI challenge

Bill Benzon

I’ve been thinking about the Gary Marcus/Elon Musk AGI challenge. As you may know, Elon Musk casually tweeted that he expected to see AGI in 2029. Gary Marcus decided to take him seriously and bet him $100,000 that it wouldn’t happen. He specified five criteria in an open letter, stating that if some AI instance satisfied three of the five, Musk wins, otherwise Marcus wins. Others joined Marcus in the bet, bringing his side to $500,000, and Kevin Kelly offered to host the bet at Long Bets. Musk has yet to respond. But Metaculus has established betting markets for the tasks Marcus has specified.

The first two tasks

I want to consider the first two of the five criteria Marcus has established:

In 2029, AI will not be able to watch a movie and tell you accurately what is going on (what I called the comprehension challenge in The New Yorker, in 2014). Who are the characters? What are their conflicts and motivations? etc.
In 2029, AI will not be able to read a novel and reliably answer questions about plot, character, conflicts, motivations, etc. Key will be going beyond the literal text, as Davis and I explain in Rebooting AI.

Here’s what Marcus said about the first one in his New Yorker article:

... allow me to propose a Turing Test for the twenty-first century: build a computer program that can watch any arbitrary TV program or YouTube video and answer questions about its content—“Why did Russia invade Crimea?” or “Why did Walter White consider taking a hit out on Jessie?”

What would be considered acceptable answers to those questions? Though I watched Breaking Bad, it was so long ago that I haven’t the foggiest idea about that second question. But, when you consider the range of discussion about Russia’s reasons for the current invasion of Ukraine, I think we need to consider a range of answers as being valid. Moreover, I think that issue needs to be carefully explored prior to holding the contest.

I note that I have read movie reviews written by professional critics where the critic makes a mistake or mistakes about what happened in the film. I don’t know how reviewing is done these days when DVDs and streaming media can easily be made available to critics, but there was a time when the critic saw the film once and had to write the review based on that single viewing. Simply producing an accurate report of what happened based on a single viewing is not easy. Furthermore, I’ve read many plot summaries in Wikipedia and other places that are problematic; they get things wrong or are confusing. Writing decent plot summaries is not easy.

Thus when Marcus says, “Key will be going beyond the literal text,” I appreciate that. I’m trained as a literary critic and have done quite a bit of it. But 1) I think he underestimates how tricky it is simply to get “the literal text” right, and 2) I’m not sure he appreciates just how tricky that is, distinguishing between the “literal” text and all that other ‘deeper’ stuff.

More or less off the top of my head, I think that, prior to the contest, a panel of experts, let’s say three to five, needs to identify the movie to be used and come up with the questions that will be asked and to explore the range of acceptable answers. They might even show the movie to a sample of humans and put the questions to them to see how they answer them.

Isn’t that a little complicated? you ask. No, it’s not, not if you think the issue is an important one. What about the expense? Good question. Maybe some kind philanthropist will foot the bill.

Some questions about Steven Spielberg’s Jaws

I’ve recently written about the movie, so it is fairly fresh in my mind. Let me propose some questions and discuss the range of answers.

How many sharks were killed in the film?

Two, the killer shark, and some other shark that was mistaken for the killer.

Comment: While this question is about what actually happened, and so the answer should be obvious, it is worth asking. Simply answering “two”, though correct, is not sufficient. There might be a problem in designating the two sharks, as neither has a name. We know them only from their circumstances. A proper answer would recognize that two were killed, one in mid-film and one at the end, and say something to differentiate them. That makes this just a bit tricky.

How many people did the shark kill? List them and say something about the circumstances of their deaths.

The proper answer is five:

1) There is the young woman at the beginning – I forget her name – who was killed at night while swimming from the beach.
2) There is the young boy killed during the day at the beach as others were there, including his mother.
3) There is the fisherman who was killed in unknown circumstances. We found out about his death when Brody and Hooper went out at night looking for the shark. They found the man’s sunken boat and saw his body.
4) A man was killed during the day while people were afraid that the shark was going to kill some children.
5) Quint, at the end.

Comment: Again, this is about what happened, no probing for motives or meaning. I would consider identifying 1, 2, 4, and 5 as being an acceptable answer. Given that Quint is a central character, he should be identified by name; the others need not be. We don’t see the killing directly in 1 but we do see struggle; nor do we see the 2nd killing, but we see blood in the water and the dead body immediately after. I believe that we do see the shark bite the man in 4, but I could be mistaken. There is no mistaking Quint’s death, as it was drawn out and grisly.

The third death is tricky since we actually don’t see the event like we did the others, we only infer it from other events. When I was working on the movie I had to verify that this death had happened. It would be nice for the AI to pick it up, but not necessary.
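One way to write down this kind of grading rule ahead of time is as a small rubric that separates required items from optional ones. What follows is only a sketch under stated assumptions: the names `RubricItem`, `DEATHS_RUBRIC`, and `grade` are hypothetical, the item labels are my own shorthand for the five deaths listed above, and a human judge is assumed to decide which items a free-text answer actually covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    label: str       # shorthand for one expected element of the answer
    required: bool   # True if an acceptable answer must include it


# Hypothetical rubric for "How many people did the shark kill?"
# Items 1, 2, 4, and 5 are required; item 3 is a bonus, as discussed above.
DEATHS_RUBRIC = [
    RubricItem("young woman swimming at night", required=True),
    RubricItem("young boy at the beach", required=True),
    RubricItem("fisherman found with sunken boat", required=False),
    RubricItem("man killed during the day", required=True),
    RubricItem("Quint", required=True),
]


def grade(rubric, identified):
    """Grade one answer.

    `identified` is the set of rubric labels that a human judge found
    in the answer; this sketch does not try to parse free text itself.
    """
    missing = [i.label for i in rubric if i.required and i.label not in identified]
    bonus = [i.label for i in rubric if not i.required and i.label in identified]
    return {"acceptable": not missing, "missing_required": missing, "bonus": bonus}
```

The point of the design is that the judgment calls (what counts as required, what counts as a bonus) get fixed in writing before the trial, while the judge still does the reading.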

Why was the mayor reluctant to close the beaches?

The town was depending on revenue from its summer tourist business. People won’t visit the town if they can’t go swimming. They can’t go swimming if the beaches are closed.

Comment: I think the AI has to pretty much nail this one. It’s about causal reasoning, and it’s stated in pretty much those terms in the movie.

What significance is attributed to the fact that Sheriff Brody is not a native of Amity?

The mayor says that he doesn’t appreciate the fact that the town depends on tourist dollars.

Comment: The mayor does say that explicitly. But there’s more going on. Brody’s wife is keenly aware of the fact that she and her husband are outsiders. How do we get the AI to address this? What do humans say about this issue?

How many men went out on the boat after the shark? What’s the name of the boat? Who were the men and why were they on the boat?

The boat is the Orca. The men: Quint, Hooper, Sheriff Brody. Quint went because it’s his boat and he’s a professional shark hunter. Also, he was hired to kill the shark. Hooper went because he’s a shark expert, is interested in sharks for scientific reasons, and has special expertise and equipment. Brody went because he’s the Sheriff and public safety is his responsibility but also because he’s the one who negotiated the deal. The town’s paying for it, but it’s his charter.

Comment: The AI has to identify the boat and name the men. As for why they were on the boat, I think the AI has to get a bunch of that, but not necessarily all of it. I’d like to see how humans respond to the question. Of course, there is more to be said about Quint’s motivation. I think that requires a question of its own.

Why did Quint become a shark hunter? Why did he hate sharks?

Because he was almost killed by sharks when he was in the navy and his ship was sunk. He was in the water for several days and saw many men killed by sharks.

Comment: I think that will do as an answer, but I don’t think it quite gets us to him being a shark hunter. For example, why didn’t he become fearful of the sea and sharks and become a shopkeeper or a mechanic? Again, I’d like to see what kinds of answers people give to the questions.

What was the name of the ship Quint was on in the navy? What was its mission?

It was the Indianapolis and it was delivering an atomic bomb to be dropped on Hiroshima.

Comment: Again, a literal answer. But it moves by quickly in the film – I assume the AI is working through speech recognition software, no? – so you have to listen carefully.

Why did the woman offer $3000 to the person who killed the shark?

Because the shark killed her son.

Comment: The AI has to nail this one. Doing so requires a bit of inference because she doesn’t mention her son in the reward notice – at least I don’t recall that she does.

Was her contest successful?

No.

Comment: That’s correct, but we want the AI to say more. Ideally it would offer more, otherwise we’re going to have to quiz it a bit. We need to be sure the AI realizes that that contest is not what sent Quint, Brody, and Hooper out on the Orca. If, additionally, the AI can tell us that the contest was a farce, jamming the streets and the harbor, and giving us a false kill, that would be nice.

More generally, I think we have to be prepared for open-ended conversation. Obviously, that cannot be gamed out in advance. But we can come up with a list of specific questions, say 20 or so, and we can put those questions to real people as preparation. These preparations need to be written up in advance of the trial and made available to whoever is adjudicating the bet, whether it’s Kevin Kelly or someone else. I’m inclined to think that the team or teams presenting AIs as candidates should not know the questions in advance as we don’t want them to tune the AI for them.

Should they have input into the process of coming up with the questions? They need some guarantee that the questions are not going to be all but impossible. Perhaps that’s the responsibility of the adjudicator.

What about the novel?

That’s the second part of the challenge. Rather than list a bunch of candidate questions I want to make a few remarks about picking the novel. Yes, I chose Jaws as a movie because I’m familiar with it. But I also chose it because it was intended for and popular with the general movie-going public. There’s nothing tricky, esoteric, or highly intellectual about that film.

Thus, in choosing a novel, we probably want to avoid novels like Joyce’s Ulysses, which is notoriously difficult. On the other hand, something like Treasure Island would be perfectly fine. It is a classic – though that need not be a criterion for the challenge – and, as I recall, pretty straightforward.

But I want to say a few words about another possibility, Wuthering Heights. Again, it’s a classic. I believe I may have read it in my high school English class. In any event, I certainly read it in my teens, and two or three times since.

The chronology is tricky, and that’s one thing questioning should focus on. Again, I’m talking about getting the literal sense of the text correct, no interpretation.

The story depicts a stream of events, from just before the time Heathcliff arrives at Wuthering Heights to the time of his death. But the narrative doesn’t open at the beginning of that stream. It opens relatively near the end. Lockwood, who plays no role in those events, arrives at Wuthering Heights to speak to the owner of property, Thrushcross Grange, that he is renting. The weather is bad and he spends the night at the Heights. After taking tenancy at the Grange he falls ill and returns to the Heights where he is nursed by one Nelly Dean, who tells him the story behind the residents currently living at the Heights. That story begins thirty years ago. She concludes the tale, Lockwood leaves, and returns eight months later and finds out how the story finally ended.

Simply reconstructing the stream of events from the convoluted order of the telling is worthwhile in itself. If you want to move into the tricky zone, when Lockwood first arrives at the Heights he’s met by a handful of people. He makes some assumptions about how they are related to one another. Those assumptions prove wrong and it takes him a while to get things right. Why did he make the assumptions that he did? The answer is easy enough. When you enter a home and find males and females of two generations, and a couple others (servants), you assume they are a family, a married couple and their children. That assumption proved false and it took a while for Lockwood to figure out the relationships between the people he saw. Would the AI recognize that assumption? Would it be able to follow Lockwood’s reasoning as he figures things out?
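The chronology question can be made concrete by tagging each event with two positions, one in the telling and one in the story, and checking how well a proposed ordering matches the story order. This is a rough sketch, not a worked-out protocol: the `Event` type, the five coarse events, their numbering, and the `pairwise_agreement` measure are all assumptions of mine, drawn only from the outline of the novel given above.

```python
from itertools import combinations
from typing import NamedTuple


class Event(NamedTuple):
    description: str
    narrative_pos: int  # order in which the reader learns of the event
    story_pos: int      # order in which the event happens in the story


# Coarse, hypothetical event list based on the outline above.
EVENTS = [
    Event("Lockwood first visits the Heights", narrative_pos=1, story_pos=2),
    Event("Heathcliff arrives at the Heights", narrative_pos=2, story_pos=1),
    Event("Lockwood leaves", narrative_pos=3, story_pos=3),
    Event("Lockwood returns eight months later", narrative_pos=4, story_pos=5),
    Event("Heathcliff dies", narrative_pos=5, story_pos=4),
]


def ordered(events, key):
    """Return event descriptions sorted by 'narrative_pos' or 'story_pos'."""
    return [e.description for e in sorted(events, key=lambda e: getattr(e, key))]


def pairwise_agreement(reference, candidate):
    """Fraction of event pairs that the candidate puts in the same
    relative order as the reference ordering."""
    pos = {desc: i for i, desc in enumerate(candidate)}
    pairs = list(combinations(reference, 2))
    agree = sum(1 for a, b in pairs if pos[a] < pos[b])
    return agree / len(pairs)


story = ordered(EVENTS, "story_pos")
telling = ordered(EVENTS, "narrative_pos")
```

Here `pairwise_agreement(story, telling)` measures how far the order of the telling departs from the order of events, and the same function could score an AI’s (or a human reader’s) reconstructed timeline against the reference. A real test would need a much finer-grained event list, agreed on by the panel in advance.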

I could go on and on. My point is simple: even getting a superficial grasp of events can prove tricky. Again, if Musk decides to take the bet, or if someone else takes up the challenge, I think that preparations for these first two tasks have to be done very carefully. In particular, I recommend that when the movie and novel are chosen, they be tested on humans first so that the judges have a clear sense of how humans respond to the questions. Those responses should be written up and made available to the judges and, when the challenge is complete, to the general public.

Of course the novel or novels chosen need not be classics. I just picked Wuthering Heights because it interests me (I’ve blogged about it a bit) and because it presents interesting issues at the most superficial level. But, sure, go with a Harry Potter, a mystery, a romance, science fiction, whatever. But pick it carefully and don’t overlook the need to probe for comprehension of the surface events.

Charlie Steiner (4y)

Nice examples, thanks.

However, I suspect that for fairness you'd actually want to avoid classics, to avoid leakage of human opinions about the subject matter into the training data (if such a data corpus exists, which seems likely). Doing the exercise with media released in the last week would sidestep the issue.

Bill Benzon (4y)

Well, maybe. After all, we humans talk about books and movies and influence one another's opinions. Not sure it would be a bad thing for an AI to see how it's done.

I know Project Gutenberg has loads of full texts of literary classics and not-so-classics available online. I have no idea whether or not those are scraped in the process of putting a corpus together.
