A New Test Claims It Can Tell If AI Has Common Sense

The "Winograd Schema Challenge" requires human-like reasoning to answer multiple-choice questions.

Jul 28 2014, 5:45pm
Terry Winograd, whose work inspired the test. Image: Flickr/Lisa Padilla

As machines get smarter, we're still trying to figure out how to measure how smart they actually are. When is artificial intelligence sufficiently genuine? When is a computer truly as smart as a human being? The Turing Test is the best-known test of AI, and pits bots against humans on the basis that a winning machine should be indistinguishable from a human conversation partner. A competition announced today takes a different tack, rooting itself instead in that elusive human quality that's so difficult to replicate: common sense.

US software company Nuance Communications just announced its sponsorship of the annual "Winograd Schema Challenge," which will be run by the nonprofit Commonsense Reasoning, at the AAAI conference in Quebec. According to Nuance, it's a "more accurate measure of genuine machine intelligence" than the Turing Test.

When chatbot Eugene Goostman passed the Turing Test earlier this year, many people called foul, though to be fair to the bot, it was more the test itself than its performance that was at fault. Pitching Goostman as a young boy with English as a second language might have been a bit of a sneaky move (or a clever one, depending on your view), but it at least served to reveal the drawbacks of using the Turing Test as a real benchmark.

There are already alternatives, like the Lovelace Test, which is rather more demanding: it requires a machine to create something original, something it wasn't designed to produce, to demonstrate its superior intelligence. That's a pretty high bar, perhaps even an unreachable one, and comparing it to the Turing Test shows how subjective the notion of true "intelligence" really is.

The Winograd Schema doesn't ask for quite such originality, and it doesn't rely on fooling humans for a machine to pass, but it still requires machines to reason in a human-like way.

The test was first proposed in a 2011 paper led by researcher Hector Levesque, and is named after computer scientist Terry Winograd. To pass, a machine answers multiple-choice questions, each with two possible answers: one right, one wrong. It sounds easy, and to humans it is, but the nature of the questions makes them particularly tricky for a machine that doesn't have natural language abilities.

Take the example proposed by Winograd, which inspired the test:

The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.

The question is, who are "they"? If the sentence uses "feared," "they" are obviously the councilmen; if it uses "advocated," the answer is the demonstrators. 
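For the programmers in the audience, the structure of a schema can be sketched in a few lines of Python. This is only an illustration of the idea: the `WinogradSchema` class and its field names are made up for this example and don't come from Levesque's paper or from any challenge software.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class WinogradSchema:
    """Hypothetical container for one schema: a sentence with a swappable word."""
    template: str                # sentence with a {word} slot for the special word
    pronoun: str                 # the ambiguous pronoun to resolve
    candidates: Tuple[str, str]  # the two possible referents
    answers: Dict[str, str]      # special word -> correct referent

    def sentence(self, word: str) -> str:
        # Fill the slot with one of the two special words.
        return self.template.format(word=word)

    def is_correct(self, word: str, guess: str) -> bool:
        # A guess is right only for the matching version of the sentence.
        return self.answers[word] == guess


councilmen = WinogradSchema(
    template=("The city councilmen refused the demonstrators a permit "
              "because they {word} violence."),
    pronoun="they",
    candidates=("the councilmen", "the demonstrators"),
    answers={"feared": "the councilmen", "advocated": "the demonstrators"},
)

for word, referent in councilmen.answers.items():
    print(councilmen.sentence(word), "->", referent)
```

Swapping a single word flips the right answer while the rest of the sentence stays identical, which is the whole trick.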

Ernest Davis, who worked on the paper proposing the test with Levesque, has published a library of potential questions that could be included in this type of challenge. The requirements are that they have to be easy for humans to figure out, not easy to solve by shallow cues such as "selectional restrictions" (where only one of the two nouns could sensibly fit the word in question) or grammatical giveaways (like one noun being singular and the other plural, which would make it easy to figure out which one the pronoun refers to), and "Google-proof" so machines can't just look up the answer in existing text.

Below are just a handful of the many examples Davis lists. Have a go and check if you're as intelligent as a human:

Paul tried to call George on the phone, but he wasn't [successful/available]. Who was not [successful/available]? 

The drain is clogged with hair. It has to be [cleaned/removed]. What has to be [cleaned/removed]? 

Ann asked Mary what time the library closes, [but/because] she had forgotten. Who had forgotten? 

Sam broke both his ankles and he's walking with crutches. But a month or so from now they should be [better/unnecessary]. What should be [better/unnecessary]? 

Look! There is a [shark/minnow] swimming right below that duck! It had better get away to safety fast! What needs to get away to safety? 

Grace was happy to trade me her sweater for my jacket. She thinks it looks [great/dowdy] on her. What looks [great/dowdy] on Grace? 

Got the picture? I'm not entirely sure it's possible to walk with two broken ankles, even with crutches, nor how likely a shark is to swim near a duck compared with a minnow, but that's irrelevant: the point is that for each given version of the sentence, the answer is logical if you have a human brain. If you're unsure that's the case, see the answers at the end of this piece.

Of course, just as with other tests, the definition of "intelligence" here is pretty limited, and based largely on language capabilities. But then again, you could say the same for many of the standardised tests we use to measure human intelligence. Looking at the above questions, I was reminded mostly of taking the 11-Plus exam, a test that used to be given to kids in the UK on leaving primary school, and that included verbal reasoning questions along the lines of "Proceed is to (advance/resume/move) as recede is to (rewind/withdraw/recoil)." (Don't feel bad; a recent test by the Mail on Sunday found seven out of eight parents would fail.)

As with school kids, testing "real" intelligence, as opposed to academic ability or simple exam technique, remains elusive in AI, and it seems reasonable that any machine we deem to be intelligent in a properly human sense would really have to pass a range of different tests: maybe a smattering of Turing, Lovelace, Winograd, and, I don't know, a timed personal essay?

In the meantime, at least a program that's adept at natural language could probably make a more eloquent sexting bot.

Answers: Paul/George; the drain/the hair; Mary/Ann; the ankles/the crutches; the duck/the minnow; the jacket/the sweater
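
For anyone who wants to see why a machine can't bluff its way through, here is a rough Python sketch that encodes the answer key above and scores two lazy strategies. The function names and the scoring setup are assumptions made for illustration, not part of the official challenge, and "the fish" is a stand-in label for the shark or minnow in the duck question.

```python
# Each entry: (candidates in order of mention, {special word: correct referent}).
# Taken from the answer key above; "the fish" stands for the shark or minnow.
ANSWER_KEY = [
    (("Paul", "George"), {"successful": "Paul", "available": "George"}),
    (("the drain", "the hair"), {"cleaned": "the drain", "removed": "the hair"}),
    (("Ann", "Mary"), {"but": "Mary", "because": "Ann"}),
    (("the ankles", "the crutches"),
     {"better": "the ankles", "unnecessary": "the crutches"}),
    (("the fish", "the duck"), {"shark": "the duck", "minnow": "the fish"}),
    (("the sweater", "the jacket"), {"great": "the jacket", "dowdy": "the sweater"}),
]


def first_mentioned(candidates, word):
    """Lazy strategy: always pick the first noun, ignoring the special word."""
    return candidates[0]


def last_mentioned(candidates, word):
    """Lazy strategy: always pick the second noun, ignoring the special word."""
    return candidates[1]


def accuracy(resolver):
    """Fraction of all sentence versions the resolver answers correctly."""
    total = correct = 0
    for candidates, answers in ANSWER_KEY:
        for word, referent in answers.items():
            total += 1
            correct += resolver(candidates, word) == referent
    return correct / total


print(accuracy(first_mentioned))  # 0.5
print(accuracy(last_mentioned))   # 0.5
```

Because the two versions of each sentence have opposite answers, any strategy that ignores the swapped word is right exactly half the time, which is no better than flipping a coin.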