Cloze Deletion Test as a Measure of AI
The Turing test is the best test we have for AI. The standard Turing test measures the ability to mimic a human. The idea is that being indistinguishable from a human requires at least human-level intelligence.
There are some shortcomings with this test. It’s somewhat subjective. It depends on the judge, what questions they ask, and how much time they spend with the AI.
It’s also just a pass or fail test. There is no measure of how well or how poorly the AI did.
And it’s not automatable, so we can’t test many variations of an AI to see which do better, and we can’t use it on the internet to detect spam bots.
What if we want to create a more formal test? How about, instead of trying to mimic a human, we try to predict what a human would say? This tests for the same sort of things as the Turing test: understanding of language, the world, and human behavior. In theory, if your predictions were near perfect, you could sample from them, and those samples would be indistinguishable from an actual human’s behavior.
But it also makes the problem harder. It’s not enough to just generate one response that seems human. You need to know the full range of responses humans could give, and predict the probability of each.
One proposed test along these lines is the Hutter Prize. The Hutter Prize tests compression algorithms on a sample of Wikipedia. Why compression? Well, compression is closely linked to AI. The ability to compress data requires the ability to predict it, and to model it well. But compression is an unnecessary technical detail. Why not just test for prediction ability directly?
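To make the link between prediction and compression concrete, here is a minimal sketch with made-up probabilities. An ideal arithmetic coder spends about -log2(p) bits on a symbol that the model assigned probability p, so a model that predicts the data well compresses it into fewer bits:

```python
import math

def code_length_bits(probabilities):
    """Total bits needed to encode a sequence, given the probability
    the model assigned to each symbol (ideal arithmetic coding)."""
    return sum(-math.log2(p) for p in probabilities)

# Hypothetical per-symbol probabilities two models assigned to the same text.
good_model = [0.5, 0.9, 0.8, 0.6]  # confident, mostly correct predictions
weak_model = [0.1, 0.2, 0.3, 0.1]  # spread-out, uncertain predictions

print(code_length_bits(good_model))  # fewer bits
print(code_length_bits(weak_model))  # more bits
```

The probabilities here are invented for illustration; the point is only that the better predictor yields the shorter encoding.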
The simplest way to do this is with something called a cloze deletion test. A cloze test is just deleting random _____ from sentences, and asking you to fill them back in.
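Generating such a test is mechanical. Here is a small sketch (function name and defaults are my own) that blanks out random words from a sentence and keeps the answers:

```python
import random

def make_cloze(sentence, n_blanks=1, seed=0):
    """Delete n_blanks random words from a sentence, returning the
    blanked sentence and the deleted words keyed by position."""
    words = sentence.split()
    rng = random.Random(seed)
    indices = rng.sample(range(len(words)), n_blanks)
    answers = {i: words[i] for i in indices}
    for i in indices:
        words[i] = "_____"
    return " ".join(words), answers

blanked, answers = make_cloze(
    "The quick brown fox jumps over the lazy dog", n_blanks=2)
print(blanked)
print(answers)
```

A model is then scored on how well it fills the blanks back in.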
Now, is this actually testing for AGI? Can’t simple statistics and Markov chain models do well on this kind of thing?
The thing is, it doesn’t matter. Markov chains can do okay. The current state of the art is recurrent neural networks. Humans do better still. There is a gradient from simple methods to more complicated ones, and true AI should do the best of all.
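For a sense of what the simple end of that gradient looks like, here is a sketch of a first-order word Markov model: it predicts the next word purely from counts of which word followed which in the training text.

```python
from collections import Counter, defaultdict

class MarkovModel:
    """First-order Markov model: predict the next word from the current one."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, text):
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, word):
        """Return (next_word, probability) pairs, most likely first."""
        nexts = self.counts[word]
        total = sum(nexts.values())
        return [(w, c / total) for w, c in nexts.most_common()]

model = MarkovModel()
model.train("the cat sat on the mat and the cat slept")
print(model.predict("the"))  # "cat" comes out as the most likely continuation
```

This captures local word statistics and nothing else, which is exactly why it gets the easy blanks and fails on the ones that require understanding.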
The scale is not necessarily linear, though. Simple statistics will get a bunch of the easy problems. More complicated methods will get some of the harder ones. Humans should get the really hard ones. Just because a Markov chain bot does half as well as a human does not mean it is half as intelligent. The scores are only useful for ranking methods.
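One common way to produce such a ranking is average log loss: the mean negative log2 probability a model assigned to each correct fill-in. The models and numbers below are invented for illustration; only the ordering matters, not the magnitudes.

```python
import math

def average_log_loss(true_words, predicted_probs):
    """Mean -log2(p) the model assigned to each correct answer.
    Lower is better; use it to rank models, not as a linear
    measure of intelligence."""
    losses = [-math.log2(probs.get(word, 1e-9))
              for word, probs in zip(true_words, predicted_probs)]
    return sum(losses) / len(losses)

answers = ["fox", "dog"]
# Hypothetical probability distributions from two models over each blank.
simple_model = [{"fox": 0.1, "cat": 0.5}, {"dog": 0.2}]
better_model = [{"fox": 0.4, "cat": 0.2}, {"dog": 0.6}]

print(average_log_loss(answers, simple_model))  # higher loss: ranks lower
print(average_log_loss(answers, better_model))  # lower loss: ranks higher
```

The unseen-word fallback probability (1e-9) is an arbitrary choice here; any small constant keeps the score finite when a model misses an answer entirely.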
The consequence of this is that natural language modelling is the way forward in AI. When AI gets good enough, we should be able to train it purely on a text corpus. We could then talk to it directly. It may not have any vision, hearing, or ability to interact with the world at all.