By their second birthday, children are learning the names of things. What’s this? Cat. And this? Whale. Very good. What’s this color? Red! That’s right. You love red. The human brain is good at making some cognitive tasks look easy—when they aren’t easy at all. Teaching software to recognize objects, for example, has been a challenge in computer science. And up until a few years ago, computers were pretty terrible at it.
Earlier this year, Microsoft revealed its image recognition software was wrong just 4.94% of the time—it was the first to beat an expert human error rate of 5.1%. A month later, Google reported it had achieved a rate of 4.8%.
Now, Chinese search giant Baidu says its specialized supercomputer, Minwa, has bested Google with an error rate of 4.58%. Put another way, these programs can correctly recognize everyday stuff over 95% of the time. That’s amazing.
And how AI researchers got to this point is equally impressive.
Big Data, Meet Deep Learning
Just five years ago, in 2010, the error rate of the top software was over six times greater (28.2%) than it is today. But how does one even begin to measure something like that? The benchmark for image recognition is an annual contest called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). At the heart of the contest is a horde of images collected from online sources like Flickr and search engines.
Contestants train their algorithms on 1.2 million labeled images spanning 1,000 categories. They’re then given a set of unlabeled images, and their software must name what is in them. It’s about as close to an objective benchmark as you can get. (And designing such a benchmark is no walk in the park.)
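How might such a contest be scored? The article doesn't spell out the rule, so here is a minimal sketch under one assumption: a guess counts as wrong when the true label is missing from the software's top k ranked guesses (ILSVRC classification results are commonly reported with k = 5). The function name and the toy labels below are hypothetical, made up purely for illustration.

```python
def top_k_error(predictions, true_labels, k=5):
    """Fraction of images whose true label is absent from the top-k guesses.

    predictions: one list of ranked guesses (best first) per image.
    true_labels: the correct label for each image, in the same order.
    """
    if len(predictions) != len(true_labels):
        raise ValueError("need exactly one list of guesses per label")
    if not true_labels:
        raise ValueError("need at least one image to score")
    misses = sum(
        1
        for guesses, truth in zip(predictions, true_labels)
        if truth not in guesses[:k]
    )
    return misses / len(true_labels)


# Hypothetical ranked guesses for four images (not real contest data).
predictions = [
    ["cat", "lynx", "dog", "fox", "tiger"],
    ["whale", "shark", "dolphin", "seal", "boat"],
    ["dog", "wolf", "fox", "cat", "coyote"],
    ["tulip", "rose", "poppy", "daisy", "lily"],
]
true_labels = ["cat", "dolphin", "fox", "orchid"]

top5 = top_k_error(predictions, true_labels, k=5)  # only the orchid is missed
top1 = top_k_error(predictions, true_labels, k=1)  # only the cat is a first guess
print(f"top-5 error: {top5:.2%}, top-1 error: {top1:.2%}")
```

Allowing five guesses is forgiving by design: a photo may contain several objects, and only one of them carries the official label.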
The first two years of the contest were somewhat unremarkable. But in 2012, there was a breakthrough.
The winning team employed a technique called deep learning. They not only leveled the competition, they significantly improved accuracy. In 2013, 24 teams competed—more than the previous three years combined—and there were 36 teams last year.
The vast majority now employ deep learning.
Between 2013 and 2014 alone, the error rate was more than halved, and precision on harder tasks, like not only noting the existence of an object in an image but also pinpointing its location, doubled.
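What does it mean to "pinpoint" an object? The article doesn't say how location is graded. One widely used measure, assumed here, is intersection over union (IoU): the area where the predicted box and the true box overlap, divided by the total area they cover together. The boxes and the 0.5 pass mark in this sketch are illustrative assumptions, not contest data.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes don't touch.
    overlap_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    overlap_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = overlap_w * overlap_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


truth = (10, 10, 110, 110)        # hypothetical hand-drawn box around a cat
loose_guess = (60, 10, 160, 110)  # shifted halfway off the cat
tight_guess = (20, 10, 120, 110)  # shifted only slightly

PASS_MARK = 0.5  # assumed threshold for calling a box "correct"
for name, guess in [("loose", loose_guess), ("tight", tight_guess)]:
    iou = intersection_over_union(truth, guess)
    verdict = "hit" if iou >= PASS_MARK else "miss"
    print(f"{name}: IoU = {iou:.3f} ({verdict})")
```

A single number like this lets a contest grade "where" as strictly as it grades "what".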
The deep learning method, loosely based on the brain’s layered neural networks, feeds on big data. The bigger, the better. Given thousands of cat pictures, a program learns what a cat looks like and can recognize it again in the future. The basic idea is founded on a decades-old technique, but modern deep learning algorithms make use of much larger datasets and bigger artificial neural networks.
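To make "layered" a little more concrete, here is a toy sketch, nowhere near the scale of the systems described above. It assumes one small hidden layer, sigmoid units, and plain gradient descent, and it learns from four labeled two-number examples rather than thousands of cat pictures. Every name and setting in it is an illustrative choice, not any team's actual method.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Pass one input through the hidden layer, then a single output unit."""
    # Each weight row ends with a bias weight, hence the appended 1.0.
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output


def train(data, n_hidden=3, epochs=2000, lr=0.5, seed=0):
    """Full-batch gradient descent; returns the weights and the loss per epoch."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    losses = []
    for _ in range(epochs):
        g_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g_out = [0.0] * (n_hidden + 1)
        total = 0.0
        for x, y in data:
            hidden, p = forward(x, w_hidden, w_out)
            p_safe = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            total -= y * math.log(p_safe) + (1 - y) * math.log(1.0 - p_safe)
            d_out = p - y  # cross-entropy gradient at the output unit
            for j, h in enumerate(hidden + [1.0]):
                g_out[j] += d_out * h
            for j, h in enumerate(hidden):
                d_hid = d_out * w_out[j] * h * (1.0 - h)  # error pushed back a layer
                for i, v in enumerate(x + [1.0]):
                    g_hidden[j][i] += d_hid * v
        losses.append(total / len(data))
        # Nudge every weight a little in the direction that reduces the error.
        for j in range(n_hidden + 1):
            w_out[j] -= lr * g_out[j] / len(data)
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_hidden[j][i] -= lr * g_hidden[j][i] / len(data)
    return w_hidden, w_out, losses


# Toy labeled examples: the label is 1 when exactly one input is "on".
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
w_hidden, w_out, losses = train(data)
print(f"loss at start: {losses[0]:.3f}, loss at end: {losses[-1]:.3f}")
```

The loss should fall as training proceeds, though exact values depend on the seed and settings. Real systems follow the same recipe with vastly more layers, connections, and data.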
Google has giant server farms at its beck and call, and Baidu says its Minwa supercomputer would rank among the world’s top 300 machines (if it weren’t specially designed to run deep learning algorithms). Minwa’s design allows artificial neural networks with hundreds of billions of connections. According to Baidu, that’s significantly bigger than any other network in existence, and it is now, evidently, top of the class at object recognition.
Computers Can Label Cat Pictures…and What Else?
To the AI expert, this is heady stuff. But who else cares? Well, members of the crowd who manually labeled ImageNet’s millions of images might. If, like Baidu and Google, you want to organize the world’s online information, stretching your reach beyond text to include photographs and videos, automation is critical.
But the possibilities go further than better organizing Flickr or Google Images.
Deep learning algorithms trained on medical images, for example, may find new patterns useful for diagnosis or learn to spot what cancerous lesions look like. This could prove a useful diagnostic tool for doctors. And these powers of recognition won’t be limited to physicians and researchers in lab coats.
In the near term, all we’d need is a smartphone app connected to the cloud. (Voice recognition works like this, which is why it’s only available when you’re online.) Baidu has even suggested it might be able to shrink a version of its Minwa algorithm down to run on a standard smartphone’s graphics processor.
Ren Wu, a Baidu researcher, recently showed a prototype smartphone app that can recognize various dog breeds using a compressed version of Minwa’s algorithm.
“If you know how to tap the computational power of a phone’s GPUs, you can actually recognize on the fly directly from the image sensor,” Wu said.
This might be useful for handheld medical diagnostics at home or in the field. And no longer limited to cat pictures online, you could image and identify the cat right in front of you, or trees or flowers on a hike.
And consider robotics. A few years back, we profiled a robotic arm made by Industrial Perception (later acquired by Google). Using Kinect-like 3D computer vision, the arm could look at a haphazard stack of boxes, determine each box’s orientation, decide where to grasp, and move it to a pallet.
This is significant because, until recently, robots mainly operated in circumstances pre-prepared with surgical precision—those boxes would have had to be stacked perfectly. Day by day, that’s less and less the case. Robots like Industrial Perception’s are getting better at dealing with real world messiness.
Future robots may box up objects in front of them like the Terminator outlining John Connor in its field of view. Which is crucial. You can’t complete your mission if you don’t know what you’re looking at. Next-gen models will see, identify, and take action. Like, knock this guy out and co-opt his wrap-around sunglasses.
Only, we’ll probably program them to ask nicely instead.
This Isn’t HAL 9000, and It Doesn’t Have to Be
This isn’t general AI. It’s just another piece of the puzzle. But that’s OK. It’s a powerful piece.
Deep learning enables more than just image recognition. Video is next. It may further improve speech recognition software and machine learning in general. Google’s DeepMind project, for example, is making better-than-human AI gamers using a combination of deep learning and reinforcement learning algorithms.
An early demonstration showed DeepMind’s algorithm learning to play and ultimately beat the classic Atari game Breakout. In a paper released earlier this year, DeepMind researchers said the algorithm has mastered 49 classic video games, performing at least as well as a professional human games tester.
Training narrow artificial intelligence, like this, on special tasks is powerful in its own right. Big data will only get bigger and robots more widely used. Whether we’re stacking pallets, hopping in a driverless car, or searching for a meaningful signal in all the noise—narrow AI will prove an extremely useful companion.