That’s because most actions — say, dancing — are actually a series of smaller actions. If an image depicted a person with hands in the air and hip cocked to one side, it would be difficult to know what that person was doing.
Researchers at MIT and U.C. Irvine have developed a new algorithm that can detect actions in video much more effectively than past efforts. It does so by applying lessons from the natural-language grammars that computer scientists have already taught computers to parse.
“We see an analogy here, which is, if you have a complex action — like making tea or making coffee — that has some subactions, we can basically stitch together these subactions and look at each one as something like verb, adjective, and adverb,” said MIT post-doctoral researcher Hamed Pirsiavash in a news release.
Just as in some languages nouns can go before or after verbs but adjectives have to precede nouns, the subactions that make up a particular action follow ordering rules of their own. In making tea, the preparer could put a tea bag into an empty cup before or after putting the water on to boil, but no matter what, the kettle will go on the stove before the water is poured.
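To make the tea example concrete, here is a minimal Python sketch of an ordering rule. It is an illustration only, not the researchers' algorithm: the subaction names and the `MUST_PRECEDE` rule set are hypothetical, and a real action grammar would be learned from video rather than written by hand.

```python
# Toy illustration only: the names and the rule below are hypothetical,
# not taken from the MIT / U.C. Irvine system.
# Each pair (a, b) means subaction a must happen before subaction b.
MUST_PRECEDE = {
    ("kettle_on_stove", "pour_water"),
}

def follows_grammar(sequence):
    """Return True if the sequence respects every ordering rule."""
    position = {step: i for i, step in enumerate(sequence)}
    for earlier, later in MUST_PRECEDE:
        if earlier not in position or later not in position:
            return False  # a subaction named in a rule is missing
        if position[earlier] >= position[later]:
            return False  # the two subactions happened in the wrong order
    return True

# The tea bag is free to move; the kettle is not.
print(follows_grammar(["tea_bag_in_cup", "kettle_on_stove", "pour_water"]))
print(follows_grammar(["kettle_on_stove", "tea_bag_in_cup", "pour_water"]))
print(follows_grammar(["pour_water", "kettle_on_stove", "tea_bag_in_cup"]))
```

The first two orderings pass because the tea bag is unconstrained relative to the kettle; the third fails because the water is poured before the kettle goes on the stove.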
The grammar model has another advantage: The algorithm can make educated guesses about partially completed actions in a streaming video. The software makes its best guess as to what the action is and subsequently revises it if necessary.
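The guess-then-revise behavior can be sketched in a few lines of Python. This is a simplified stand-in, not the published method: the `ACTIONS` table, its subaction names, and the position-by-position scoring are all assumptions made for illustration, and a real grammar would allow many orderings per action.

```python
# Hypothetical subaction templates, one fixed ordering per action.
ACTIONS = {
    "make_tea": ["boil_water", "add_tea_bag", "pour_water"],
    "make_coffee": ["boil_water", "add_grounds", "pour_water"],
}

def match_score(seen, template):
    """Count positions where the observed subaction matches the template."""
    return sum(a == b for a, b in zip(seen, template))

def streaming_guesses(observed):
    """Yield the best-matching action after each newly observed subaction."""
    seen = []
    for subaction in observed:
        seen.append(subaction)
        # Ties go to whichever action is listed first in ACTIONS.
        yield max(ACTIONS, key=lambda name: match_score(seen, ACTIONS[name]))

stream = ["boil_water", "add_grounds", "pour_water"]
print(list(streaming_guesses(stream)))
# → ['make_tea', 'make_coffee', 'make_coffee']
```

After the first subaction the two candidates are tied, so the sketch offers a provisional guess; once the coffee grounds appear, it revises that guess.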
“We’ve known for a very long time that the things that people do are made up of subactivities. The problem is we don’t know what the pieces are,” said David Forsyth, a professor of computer science at the University of Illinois at Urbana-Champaign who was not involved in the project. “There’s a fairly substantial open problem here. I wouldn’t have said that the open problem has completely gone away, but the method itself is very powerful.”
Yet, just as different languages have different grammatical structures, different actions do as well. Based on many examples of the same action fed to it by its human instructors, the machine-learning-driven artificial intelligence program will sketch out a grammar for each individual action. The software will recognize only specific actions that it’s been trained to handle. It’s been trained, so far, to identify moves in some common athletic activities. (A good chunk of the work on motion identification algorithms sees high-level sports as a significant user base.)
But that doesn’t mean the software won’t be useful. Thanks to the grammar model, the program works faster and with lower memory demands than previous attempts to solve the same problem. It also doesn’t require any special clothing, as some gesture-recognition programs do. The program can also discard more of its own hypotheses faster if they don’t adhere to its grammatical rules, easing its power demands. Memory demands remain fixed regardless of the length of the video.
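One way to picture both ideas, early pruning and memory that does not grow with the video, is the Python sketch below. It is a deliberately simple stand-in and not the researchers' implementation: the `ACTIONS` table and subaction names are hypothetical, and each hypothesis is reduced to a single progress counter.

```python
# Hypothetical subaction templates, one fixed ordering per action.
ACTIONS = {
    "make_tea": ["boil_water", "add_tea_bag", "pour_water"],
    "make_coffee": ["boil_water", "add_grounds", "pour_water"],
}

def prune_stream(observed):
    """Consume subactions one at a time and return the surviving hypotheses.

    Only one counter per candidate action is kept, so the state never grows
    with the length of the stream; the stream itself is not stored.
    """
    alive = {name: 0 for name in ACTIONS}  # action -> index of next expected subaction
    for subaction in observed:
        for name in list(alive):
            template = ACTIONS[name]
            index = alive[name]
            if index < len(template) and template[index] == subaction:
                alive[name] = index + 1  # hypothesis still fits: advance it
            else:
                del alive[name]  # breaks this action's rules: discard it
    return alive

# A generator stands in for a video stream that is never held in memory.
print(prune_stream(iter(["boil_water", "add_grounds"])))
# → {'make_coffee': 2}
```

The tea hypothesis is dropped as soon as the coffee grounds appear, so no further work is spent on it, and the bookkeeping stays the same size however long the stream runs.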
With smaller computing demands, average Joes might be able to use the software. Users would fire up their webcams and ask the computer to tell them whether they remembered to do something, such as take their medication, or whether they completed an action, such as a physical therapy exercise or a golf putt, correctly. Down the road, the software could help robots work with humans in relatively unstructured work environments.
As with any advance in artificial intelligence, legitimate privacy questions spring to mind. But this algorithm’s move to put the smarts in the hands of a user and develop them for narrowly targeted uses seems like a reasonable way to get the benefits of savvy computers without too much of the creep factor.