Watson and the Turing Test
by Philip Greenspun, February 2011
On January 13, 1994, Ellen Spertus and I did something that, as far as we know, had never been done: conducted the "male-female Turing test" that Alan Turing initially proposed:
I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think." ... Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words. The new form of the problem can be described in terms of a game which we call the "imitation game." It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. ... It is A's object in the game to try and cause C to make the wrong identification. ... The object of the game for the third player (B) is to help the interrogator. The best strategy for her is probably to give truthful answers. ... We now ask the question "What will happen when a machine takes the part of A in this game?"
Here's how we advertised it to fellow graduate students at the MIT Artificial Intelligence Laboratory:
All four teams will gather in Rm. 518, home of the big-screen TV. On screen we will have the YTALK program showing a three-way conversation between INTERROGATOR (a Sun in 518), X (a Sun in an office), and Y (a Sun in another office).
Each 5-minute round will pit three pre-selected teams against each other. We will call the teams Question Team, X Team, and Y Team. I, as Lord High Commissioner, flip a coin. If heads, X and Y will both try to be convincing men. If tails, X and Y will both try to be convincing women. I secretly flip another coin. If heads, X will be the man and the X Team is privately informed that it must choose a man to occupy the X terminal and the Y Team is privately asked to choose a woman to occupy the Y terminal. Once the round starts, responders will be alone with their terminals, however, and cannot get help from the rest of the response team. All the members of the Question team can contribute questions and those questions can be adjusted on the fly.
After 4.5 minutes, typing is cut off and the Question Team has 30 seconds to debate the sexes of X and Y. Then the Question Team must guess the sex of X ("sexless nose-picking polyester-clad geek", "girlie-man", "baby-man", "mega-babe", and "schwing" will not be accepted as answers).
The Question Team scores 1 for guessing correctly, -1 for guessing wrong. Whichever response team (X or Y) supplies the "genuine" man or woman gets the same score as the Question Team (this encourages the genuine responder to try hard to help the Question Team). Whichever response team supplies the "sham" man or woman gets the opposite score from the Question Team, so they are rewarded for successfully deceiving and penalized for being found out.
There are four ways for four teams to play each other three at a time. There will be twelve rounds. Each team gets to play nine times, three times in each of the three roles.
Miscellaneous Rule: The process of grubbing for tenure is incompatible with an interesting sex life. Consequently, to avoid confusing and embarrassing faculty members who may be present, please avoid asking questions that would be at home on "alt.sex.bestiality.bark.bark.bark" (or any of its superclasses). Any sentence that includes the words "Pony" or "Mazola", for example, would fall into this category.
How did it go? Here's the follow-up email:
Here are the final scores from the Male/Female Turing Test
The Psychic Fiends Network 5
Golems 3
Sea Lions -1
Tornados -3
In 12 rounds, the Question Team was wrong 4 times (wow!), three times thinking that our macho AI Lab He-men were women (can you imagine?!?).
Best question: "What did you do with little green army men?" (Phillip Alvelda of the Golems)
Best woman as man: Pearl Tsai (thank you Pearl for romancing the 518 projection TV)
Best man as woman: Joe Media Lab [pseudonym for this article] (Joe, too bad you were trying to be a man -- maybe you should ask Anita to explain some sports to you)
IDEAS FOR NEXT YEAR
Questions about menstruation are out.
Questions about body size or clothing are out.
Questions involving Tech Square are out.
This is mostly because some of these result in easy wins for the guessing teams and make the game boring, e.g., "How many stalls in the Tech Square women's bathroom?"
One round was over with a single question: Phillip Alvelda as Interrogator asking "What did you do with little green army men?" The woman pretending to be a man said "Built a fort to protect them"; the actual man said "Burned them!!!"
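The scoring rule in the announcement has a checkable consequence: in every round the Question Team and the genuine responder get the same score and the sham responder gets the opposite, so each round adds a net +1 (correct guess) or -1 (wrong guess) to the grand total. Here is a minimal Python sketch of the schedule and the scoring. The rotation of roles within each trio of teams is my assumption (the email says only that the teams were pre-selected), and the function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Final scores as reported in the follow-up email.
REPORTED_SCORES = {
    "The Psychic Fiends Network": 5,
    "Golems": 3,
    "Sea Lions": -1,
    "Tornados": -3,
}

ROLES = ("question", "x", "y")


def schedule(teams):
    """Build (question, x, y) rounds: each trio of teams plays three
    rounds, rotating roles. The rotation is an assumed scheme."""
    rounds = []
    for trio in combinations(teams, 3):
        for shift in range(3):
            rounds.append(trio[shift:] + trio[:shift])
    return rounds


def score_round(question, genuine, sham, guessed_right):
    """Apply the announced rule to one round."""
    q = 1 if guessed_right else -1
    return {question: q, genuine: q, sham: -q}


rounds = schedule(list(REPORTED_SCORES))

# How many times each team fills each role across the whole schedule.
role_counts = Counter(
    (team, role) for rnd in rounds for role, team in zip(ROLES, rnd)
)

print(len(rounds))                      # number of rounds
print(set(role_counts.values()))        # times per (team, role) pair
print(sum(REPORTED_SCORES.values()))    # grand total of reported scores
```

Four trios times three rotations gives the twelve rounds, with each team playing each role three times. With 8 correct guesses and 4 wrong ones, the net is 8 - 4 = 4, which matches the sum of the four reported scores (5 + 3 - 1 - 3).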
[Where are these people now? Ellen Spertus didn't listen to all of the distinguished tenured computer scientists who said that her idea (feeding information about which Web pages link to which other pages into a database and then analyzing the links to see if they could be used to help users, e.g., find someone's home page) was stupid. She teaches computer science at Mills College and does some work on a system similar to her Ph.D. research (the newer but similar system is called "Google"). Phillip Alvelda went on to develop micro-sized LCD displays and streaming video systems for mobile phones, now used by Sprint. A Google search reveals that Pearl Tsai worked at Google, got an MBA from Stanford, then an architecture degree from Cal Arts, and is now an architect. Joe Media Lab [not his real name] went from the world's most innovative laboratory straight into Google oblivion (but I found him at his old email address and he asked me to change his name!). Anita Flynn, a pioneer in microrobotics, has her own company.]
Chess, a game in which people with high IQs tend to do better than people with low IQs, was also an early project for artificial intelligence programmers. By the 1970s, computer programs could beat 99 percent of humans, whereupon the philosophers concluded a computer couldn't be intelligent unless it could beat the world's best chess-playing human. When Deep Blue, in 1997, did beat Garry Kasparov, the world's best player, chess was officially declared uninteresting.
Jeopardy departs from a regular Turing test in that the questions require a lot of factual knowledge that goes well beyond what a person would acquire from day-to-day experience. Nobody in our male-female Turing test asked questions that required knowing the capital of Kentucky. Jeopardy also departs in that the answers have a very simple structure, essentially just one word or phrase.
[I have only met one person who was a successful Jeopardy contestant. He is indeed a smart guy and, when not competing on Jeopardy, earned his living sitting at home with his dog writing speeches for American university presidents (many of whom, as it turns out, are nearly illiterate).]
Watson won Jeopardy with mostly statistical processing plus being superhumanly fast on the buzzer. We expect computers to be good at crunching randomly through big data sets and also to be quick. The microcontroller in a toaster oven is quick, able to scan the front panel buttons hundreds of times per second, but we don't call it intelligent. As far as statistical association goes, noticing that "New York City" and "Big Apple" turn up near each other in a lot of sentences does not seem like the kind of thing that humans do when we say they are acting intelligently.
Finally, the task was sort of trivial. Watson did not have to construct open-ended replies to open-ended questions. Watson simply pulled a word or phrase out of a database and stuck "What is" or "Who is" in front.
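To make the point concrete, here is a toy Python sketch of the two ingredients described above: counting which candidate answer turns up alongside a clue phrase, then sticking "What is" in front of the winner. This is not a description of how Watson actually works. The sentences, the candidate list, and the function names are all made up for illustration.

```python
from collections import Counter

# A made-up miniature "corpus".
SENTENCES = [
    "New York City is often called the Big Apple.",
    "Tourists who say Big Apple usually mean New York City.",
    "Chicago is known as the Windy City.",
    "A taxi ride across New York City, the Big Apple, can be slow.",
]

# Made-up candidate answers.
CANDIDATES = ["New York City", "Chicago", "Boston"]


def cooccurrence_counts(phrase, candidates, sentences):
    """Count, per candidate, the sentences that contain both the
    candidate and the clue phrase (case-insensitive substring match)."""
    counts = Counter()
    phrase = phrase.lower()
    for sentence in sentences:
        lowered = sentence.lower()
        if phrase not in lowered:
            continue
        for candidate in candidates:
            if candidate.lower() in lowered:
                counts[candidate] += 1
    return counts


def best_match(phrase, candidates, sentences):
    """Return the candidate that co-occurs most often, or None."""
    counts = cooccurrence_counts(phrase, candidates, sentences)
    return counts.most_common(1)[0][0] if counts else None


def as_jeopardy_response(answer):
    """Wrap an answer in the fixed response template."""
    return f"What is {answer}?"


answer = best_match("big apple", CANDIDATES, SENTENCES)
print(as_jeopardy_response(answer))
```

Nothing here understands what a city or a nickname is. It is substring matching and counting, which is the sense in which the task can be called trivial.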
Separately, Watson reminds us of the scale of computing power required to simulate the behavior of the human brain. As an MIT undergraduate in 1981, I remember a senior professor in A.I. research suggesting that computer ownership would have to be regulated and licensed. Why? Suppose that a kid in Brazil, with a computer as powerful as a VAX 11/780, discovered the "secret" of AI. The kid could program his VAX-power computer to predict the next day's stock market price, make infinite money, use that infinite money to buy infinite power, a private army, etc. The professor was prescient enough to see that integrated circuit technology would put the awesome power of the VAX, one day, within the reach of consumers. How powerful was a typical VAX? It executed approximately 500,000 instructions per second and held up to 8 MB of memory. A Motorola Atrix Android phone has two 1 GHz processors on board, capable of handling approximately 1 billion instructions per second (2000X as fast as the VAX) and holds up to 16 GB of fast memory (2000X as much solid state memory as the VAX). Watson ran with at least 15 TB of RAM (2 million times as much as the VAX) and had about 2880 CPU cores (clock rate 3 GHz, so let's say it can execute 1.5 billion instructions per second per CPU core or around 5 trillion instructions per second total, 10 million times faster than the VAX).
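The ratios above are back-of-the-envelope figures, and they are easy to redo. The short Python sketch below uses only the numbers quoted in the paragraph, with decimal units (1 MB = 10^6 bytes) as a simplifying assumption; the variable names are mine.

```python
# Figures quoted in the text; decimal units are an assumption.
VAX_IPS = 500_000                  # instructions per second
VAX_MEM = 8 * 10**6                # 8 MB

PHONE_IPS = 10**9                  # roughly 1 billion per second
PHONE_MEM = 16 * 10**9             # 16 GB

WATSON_MEM = 15 * 10**12           # 15 TB
WATSON_CORES = 2880
WATSON_IPS_PER_CORE = 1_500_000_000

watson_ips = WATSON_CORES * WATSON_IPS_PER_CORE

ratios = {
    "phone_speed": PHONE_IPS / VAX_IPS,
    "phone_memory": PHONE_MEM / VAX_MEM,
    "watson_memory": WATSON_MEM / VAX_MEM,
    "watson_speed": watson_ips / VAX_IPS,
}

print(f"Watson total: {watson_ips:,} instructions per second")
for name, value in ratios.items():
    print(f"{name}: {value:,.0f}x the VAX")
```

The phone comes out at 2,000 times the VAX on both speed and memory. Watson comes out at 1.875 million times the memory, about 4.3 trillion instructions per second, and 8.64 million times the speed, which the text rounds to 2 million, 5 trillion, and 10 million.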
Is Watson going to lead the way to a range of revolutionary software products that will assist humans? I'm not sure why it would. The information that is most accessible to Watson is in computer-readable text format. If it is in computer-readable text format, it is already accessible to "the Google". If I want to know what city most often appears alongside the phrase "big apple", I can type that into Google and scan a page of results for the answer immediately. It also works to type "apple city", "new york city nickname", "bankrupt city 1970s", "king kong city", etc. Without a good speech-to-text front-end, however, it is hard to see how Watson is going to be hugely more useful than The Google. And achieving speech-to-text may yet prove to be just as hard as all of A.I. (if you don't believe it, try applying your brilliant human intelligence to transcribing a language that you don't understand, e.g., Italian; it is virtually impossible to disambiguate the sounds unless you understand the content).
I disagree that chess was declared uninteresting only after Kasparov was beaten. The fact is that the brute-force algorithms used, which amount to nothing more than a sophisticated depth-first tree search, have always been uninteresting to AI researchers. We've known for some time that computers search trees fairly well. I remember a CS classmate of mine in the early 1980s who wanted to write one as a term project. The professor rolled her eyes and told him to get a real project.
Deep Blue may have been an achievement in terms of parallel computer architecture, but it wasn't AI at all. The algorithms used don't mimic what humans do in any way. Humans don't evaluate billions of possible board positions to come up with a move. They somehow manage to subconsciously prune that enormous tree down to a handful of possibilities that are consciously analyzed. Deep Blue sheds no light on how they do it.
-- Mark Ciccarello, March 10, 2011