Sam was six months old when he first strapped a lightweight camera onto his forehead.
For the next year and a half, the camera captured snippets of his life. He crawled around the family’s pets, watched his parents cook, and cried on the front porch with grandma. All the while, the camera recorded everything he heard.
What sounds like a cute toddler home video is actually the basis for a daring question: Can AI learn language like a child? The answer could also reveal how children rapidly acquire language and concepts at an early age.
A new study in Science describes how researchers used Sam’s recordings to train an AI to understand language. With just a tiny portion of one child’s life experience over a year, the AI was able to grasp basic concepts—for example, a ball, a butterfly, or a bucket.
The AI, called Child’s View for Contrastive Learning (CVCL), roughly mimics how we learn as toddlers by matching sight to audio. It’s a very different approach from the one taken by large language models like those behind ChatGPT or Bard. These models’ uncanny ability to craft essays, poetry, or even podcast scripts has thrilled the world. But they need to digest trillions of words from a wide variety of news articles, screenplays, and books to develop these skills.
Kids, by contrast, learn with far less input and rapidly generalize their learnings as they grow. Scientists have long wondered if AI can capture these abilities with everyday experiences alone.
“We show, for the first time, that a neural network trained on this developmentally realistic input from a single child can learn to link words to their visual counterparts,” study author Dr. Wai Keen Vong at NYU’s Center for Data Science said in a press release about the research.
Child’s Play
Children easily soak up words and their meanings from everyday experience.
At just six months old, they begin to connect words to what they’re seeing—for example, a round bouncy thing is a “ball.” By two years of age, they know roughly 300 words and their concepts.
Scientists have long debated how this happens. One theory says kids learn to match what they’re seeing to what they’re hearing. Another suggests language learning requires a broader experience of the world, such as social interaction and the ability to reason.
It’s hard to tease these ideas apart with traditional cognitive tests in toddlers. But we may get an answer by training an AI through the eyes and ears of a child.
M3GAN?
The new study tapped a rich video resource called SAYCam, which includes data collected from three kids between 6 and 32 months old using GoPro-like cameras strapped to their foreheads.
Twice every week, the cameras recorded around an hour of footage and audio as the children nursed, crawled, and played. All audible dialogue was transcribed into “utterances”—words or sentences spoken before the speaker or conversation changes. The result is a wealth of multimedia data from the perspective of babies and toddlers.
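As a toy illustration of what counts as an utterance here, the short Python snippet below groups a transcript into chunks that end whenever the speaker changes (ignoring conversation boundaries for simplicity). The speaker labels and transcript format are invented for the example and are not drawn from the SAYCam data.

def split_into_utterances(transcript):
    # transcript: list of (speaker, text) pairs in the order they were spoken.
    utterances, previous_speaker, current = [], None, []
    for speaker, text in transcript:
        if speaker != previous_speaker and current:
            # Speaker changed, so close off the current utterance.
            utterances.append(" ".join(current))
            current = []
        current.append(text)
        previous_speaker = speaker
    if current:
        utterances.append(" ".join(current))
    return utterances

# Two speakers produce two utterances.
print(split_into_utterances([("mom", "Look, there's a baby."), ("dad", "Wow, that is a big ball.")]))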
For the new system, the team designed two neural networks with a “judge” to coordinate them. One translated first-person visuals into the whos and whats of a scene—is it a mom cooking? The other deciphered words and meanings from the audio recordings.
The two systems were then correlated in time so the AI learned to associate correct visuals with words. For example, the AI learned to match an image of a baby to the words “Look, there’s a baby” or an image of a yoga ball to “Wow, that is a big ball.” With training, it gradually learned to separate the concept of a yoga ball from a baby.
“This provides the model a clue as to which words should be associated with which objects,” said Vong.
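To make the idea concrete, here is a minimal sketch of a CLIP-style contrastive objective of the kind the article describes, written in Python with PyTorch. It assumes two placeholder encoders that map video frames and transcribed utterances into a shared embedding space; the function names, batch layout, and temperature value are illustrative assumptions, not the study’s actual code.

import torch
import torch.nn.functional as F

def contrastive_loss(frame_embeddings, utterance_embeddings, temperature=0.07):
    # Normalize so a dot product becomes a cosine similarity.
    frames = F.normalize(frame_embeddings, dim=-1)          # shape: (batch, dim)
    utterances = F.normalize(utterance_embeddings, dim=-1)  # shape: (batch, dim)

    # Similarity between every frame and every utterance in the batch.
    logits = frames @ utterances.T / temperature

    # The "correct" partner for each frame is the utterance recorded at the same moment.
    targets = torch.arange(len(frames))

    # Pull co-occurring pairs together and push mismatched pairs apart, in both directions.
    loss_frames_to_words = F.cross_entropy(logits, targets)
    loss_words_to_frames = F.cross_entropy(logits.T, targets)
    return (loss_frames_to_words + loss_words_to_frames) / 2

Trained this way, an image of a yoga ball and the utterance “Wow, that is a big ball” end up close together in the shared space, while the same image and an unrelated utterance drift apart.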
The team then trained the AI on videos spanning roughly a year and a half of Sam’s life. Together, they amounted to over 600,000 video frames paired with 37,500 transcribed utterances. Although the numbers sound large, they cover only about one percent of Sam’s waking hours and are peanuts compared to the amount of data used to train large language models.
Baby AI on the Rise
To test the system, the team adapted a common cognitive test used to measure children’s language abilities. They showed the AI four new images—a cat, a crib, a ball, and a lawn—and asked which one was the ball.
Overall, the AI picked the correct image around 62 percent of the time. Its performance nearly matched that of a state-of-the-art algorithm trained on 400 million image and text pairs from the web—orders of magnitude more data than was used in the study. The team also found that pairing video frames with the utterances spoken alongside them was crucial: when they shuffled the frames and their associated utterances, the model completely broke down.
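Under the hood, that test boils down to a nearest-neighbor lookup in the shared embedding space. The short Python sketch below illustrates the idea; the embeddings and variable names are placeholders for exposition, not the authors’ evaluation code.

import torch
import torch.nn.functional as F

def pick_image(word_embedding, candidate_image_embeddings):
    # Cosine similarity between the word and each candidate image.
    word = F.normalize(word_embedding, dim=-1)                # shape: (dim,)
    images = F.normalize(candidate_image_embeddings, dim=-1)  # shape: (4, dim)
    similarities = images @ word
    # The model's answer is the most similar image; with four options, chance is 25 percent.
    return int(torch.argmax(similarities))

Against that 25 percent chance baseline, picking the right image about 62 percent of the time from so little training data is what makes the result stand out.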
The AI could also “think” outside the box and generalize to new situations.
In another test, it was trained on Sam’s perspective of a picture book as his parent said, “It’s a duck and a butterfly.” Later, he held up a toy butterfly when asked, “Can you do the butterfly?” When challenged with multicolored butterfly images—ones the AI had never seen before—it correctly identified three out of four examples of “butterfly” with above 80 percent accuracy.
Not all word concepts scored the same. For instance, “spoon” was a struggle. But it’s worth pointing out that, like a tough reCAPTCHA, the training images were hard to decipher even for a human.
Growing Pains
The AI builds on recent advances in multimodal machine learning, which combines text, images, audio, or video to train a machine brain.
With input from just a single child’s experience, the algorithm was able to capture how words relate to each other and link words to images and concepts. It suggests that, for toddlers, hearing words and matching them to what they’re seeing helps build their vocabulary.
That’s not to say other brain processes, such as social cues and reasoning, don’t come into play. Adding these components to the algorithm could potentially improve it, the authors wrote.
The team plans to continue the experiment. For now, the “baby” AI learns only from still image frames and has a vocabulary composed mostly of nouns. Integrating video segments into the training could help the AI learn verbs, because video captures movement.
Adding intonation to speech data could also help. Children learn early on that a mom’s “hmm” can have vastly different meanings depending on the tone.
But overall, combining AI and life experiences is a powerful new method to study both machine and human brains. It could help us develop new AI models that learn like children, and potentially reshape our understanding of how our brains learn language and concepts.
Image Credit: Wai Keen Vong