The idea of reinforcement learning—or learning based on reward—has been around for so long it’s easy to forget we don’t really know how it works.

If DeepMind’s new bombshell paper in Nature is any indication, a common approach in AI, one that led to machines defeating humanity at the game of Go, may hold the answer.

We all subconsciously learn complex behaviors in response to positive and negative feedback, but how that works in the brain remains a century-long mystery. By examining a powerful variant of reinforcement learning, dubbed distributional reinforcement learning, that outperforms original methods, the team suggests that the brain may simultaneously represent multiple predicted futures in parallel. Each future is assigned a different probability, or chance of actually occurring, based on reward.

Here’s the kicker: the team didn’t leave it as an AI-inspired hypothesis. In a collaboration with a lab at Harvard University, they recorded straight from a mouse’s brain and found signs of their idea encoded in its reward-processing neurons.

“Neuroscience and AI have always inspired one another. Reinforcement learning is one of the central approaches in contemporary AI research and was originally motivated by thinking about how rewards could be used to reinforce certain behaviors,” said Dr. Demis Hassabis, co-founder and CEO of DeepMind. “In this new work, we find that distributional reinforcement learning…may also be at work in human biology.”

Dopamine Rush

You’ve heard of “reward-based learning.” Take Pavlov’s dogs, a famous experiment in the early 1900s. Russian physiologist Ivan Pavlov repeatedly rang a bell before feeding his dogs, and found the dogs learned to instinctively salivate as soon as they heard the bell, no food in sight. They learned that the bell—something not associated with food at all—predicted food, the reward.

Neuroscientists eventually figured out that dopamine, a chemical messenger active in reward circuits in the brain, is involved in processing reward signals. Although dopaminergic neurons—neurons that release dopamine—are often associated with a “rush” or “high,” that’s not to say they make us “happy” or “feel good” per se.

Rather, dopaminergic neurons are high-rollers at a betting game. They constantly make predictions about the chance of receiving a reward, and only change how much dopamine they release if the prediction is off. When a bet is spot-on (mom calls you to lunch, you get lunch), the neurons stay quiet. If the reward is better or larger than predicted, however, they amp up their activity and shoot off packets of dopamine; if the result is worse or less than expected, the neurons lower their activity. With time, dopaminergic neurons learn to adjust their predictions to better match the real world. In other words, dopamine doesn’t equate to “happiness” or a “high”; rather, it responds to prediction errors. We learn by trying to correct those errors.
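The betting logic can be sketched in a few lines of Python. This is a minimal illustration, not a model taken from the paper: the function name `dopamine_like_update`, the learning rate, and the reward values are all assumptions chosen for clarity.

```python
def dopamine_like_update(prediction, reward, learning_rate=0.2):
    """Return (prediction_error, new_prediction) for a single outcome.

    Hypothetical helper: the error is simply reward minus prediction,
    and the prediction moves a fraction of the way toward the outcome.
    """
    error = reward - prediction  # > 0: better than expected, < 0: worse
    new_prediction = prediction + learning_rate * error
    return error, new_prediction


# The same reward (1.0) arrives on every trial, so the error shrinks
# as the prediction catches up with reality.
prediction = 0.0
errors = []
for _ in range(30):
    error, prediction = dopamine_like_update(prediction, reward=1.0)
    errors.append(error)

# Once the prediction is accurate, only surprises produce a signal.
surprise_error, _ = dopamine_like_update(prediction, reward=2.0)  # bigger reward
omission_error, _ = dopamine_like_update(prediction, reward=0.0)  # reward withheld
```

In this sketch the first error is large, the last is close to zero, a bigger-than-usual reward yields a positive error, and a withheld reward yields a negative one, mirroring the "amp up" and "lower their activity" responses described above.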

These ideas aren’t just hypotheses. Using electrodes and chemical probes inside living brains, neuroscientists can watch dopaminergic neurons in action as animals are challenged with tasks that eventually lead to rewards—food, water, sex, and drugs.

But a question remained: do dopaminergic neurons work as a unit, encoding for a single expected outcome? Or do they diverge in their predictions?

Meanwhile in AI Land…

While neuroscientists busily examined living brains, AI researchers directly tested their high-level ideas in machines.

Enter reinforcement learning. When challenged with a task, an AI algorithm starts out with random predictions. It then takes an action, observes whether it gets a reward, and adjusts its predictions based on reality. After millions of trials, the AI hopefully minimizes its prediction errors, meaning that it knows exactly how to solve the task. Step by step, it can then tackle extremely complex problems, such as beating a human champion at Go.
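As a rough sketch of that loop, assume a toy task with three possible actions, each paying a reward with a different hidden probability. The task, the function name `run_bandit`, and the "explore at random 10 percent of the time" rule (often called epsilon-greedy) are illustrative assumptions, far simpler than the systems that play Go.

```python
import random


def run_bandit(reward_probs, trials=10000, learning_rate=0.02,
               epsilon=0.1, seed=0):
    """Learn one reward prediction per action by trial and error."""
    rng = random.Random(seed)
    # Start out with random predictions, one per action.
    predictions = [rng.random() for _ in reward_probs]
    for _ in range(trials):
        if rng.random() < epsilon:
            # Occasionally explore a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Otherwise take the action currently predicted to pay best.
            action = max(range(len(predictions)), key=lambda a: predictions[a])
        # Observe whether a reward arrives.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Adjust the prediction for that action toward reality.
        predictions[action] += learning_rate * (reward - predictions[action])
    return predictions


predictions = run_bandit([0.1, 0.5, 0.9])
best_action = max(range(len(predictions)), key=lambda a: predictions[a])
```

After enough trials, the learned predictions should sit close to the hidden reward probabilities, and the action with the highest prediction is the one that actually pays most often.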

That’s the standard narrative. When AI researchers dug into the weeds, however, they immediately faced a head-scratcher: how do you represent the chance for reward? The traditional approach is to give it an average number—a general “gut feeling,” so to speak, based on classic reward-learning theory in neuroscience. But in the real world, the chance of getting a reward is never perfectly averaged into a fixed number.

“When playing the lottery, for example, people expect to either win big, or win nothing—no one is thinking about getting the average outcome,” said first author Dr. Will Dabney, a research scientist at DeepMind.

In 2017, DeepMind researchers decided to encode that randomness into reinforcement learning. Rather than assigning a single number to the predicted reward, they swapped it for a full probability distribution. In distributional reinforcement learning, the AI algorithm predicts a whole spectrum of future rewards using a set of predictors: some are more optimistic and amplify their reward signals when the reward is larger than expected; others are more pessimistic, lowering their reward signals when it’s smaller than predicted.
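One simplified way to picture this is a group of predictors that all see the same rewards but weigh good and bad surprises differently. The sketch below is an assumption-laden toy, not DeepMind's implementation: the function name `train_predictors`, the `optimism_levels`, the learning rate, and the rewards (whole numbers from 1 to 7, drawn at random) are hypothetical.

```python
import random


def train_predictors(rewards, optimism_levels, base_rate=0.02):
    """Train one reward prediction per optimism level.

    An optimism level near 1 means positive surprises move the
    prediction a lot and negative surprises barely at all; a level
    near 0 means the opposite. A level of 0.5 treats both equally,
    which recovers the classic single-average learner.
    """
    predictions = [0.0 for _ in optimism_levels]
    for reward in rewards:
        for i, optimism in enumerate(optimism_levels):
            error = reward - predictions[i]
            # Asymmetric step size: the sign of the error picks the weight.
            rate = base_rate * (optimism if error > 0 else 1.0 - optimism)
            predictions[i] += rate * error
    return predictions


rng = random.Random(0)
rewards = [float(rng.randint(1, 7)) for _ in range(20000)]  # hypothetical rewards
optimism_levels = [0.1, 0.3, 0.5, 0.7, 0.9]
predictions = train_predictors(rewards, optimism_levels)
```

In this toy, the balanced predictor lands near the average reward of 4, while the pessimistic and optimistic ones settle below and above it. Taken together, the predictions say something about the spread of possible rewards, which a single averaged number cannot.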

Sound familiar? The ties to the brain’s reward-based predictive powers are hard to ignore. This led the DeepMind team to ask: can we see a similar distribution in individual dopaminergic neural predictions as well?

Come Together

Partnering with Harvard, the teams tested out their idea in the brains of mice.

The mice were taught to work for a juice reward, but how much tasty juice—a drop or a gush—was highly unpredictable, basically relying on the roll of a seven-sided die. As the mice struggled to predict their next sugary “hit,” the team watched the electrical activity of individual dopaminergic neurons in their brains’ reward regions.

In contrast to neuroscience canon, the team said, dopaminergic neurons didn’t act as one. Rather than collectively encoding for a single expected outcome, they were each “tuned” to a different prediction, with some expecting a larger amount of juice reward, and others less hopeful, predicting smaller volumes. When the team mapped out the whole swath of predictions, the distribution closely matched reality—the distribution of actual rewards.

“We found that dopaminergic neurons in the brain were each tuned to different levels of pessimism or optimism. If they were a choir, they wouldn’t all be singing the same note, but harmonizing,” the authors wrote.

In other words, they seemed to operate on very similar principles to distributional reinforcement learning, a powerful method in AI.

That’s great news for both AI and neuroscience. Although AI researchers already know empirically that distributional reinforcement learning, combined with deep neural networks, is extremely powerful, the new data further validates it as a potential path towards AI that learns in a manner more similar to human brains, the only example of high-level intelligence we know of.

“When we’re able to demonstrate that the brain employs algorithms like those we are using in our AI work, it bolsters our confidence that those algorithms will be useful in the long run—that they will scale well to complex real-world problems, and interface well with other computational processes. There’s a kind of validation involved: if the brain is doing it, it’s probably a good idea,” said senior author Dr. Matt Botvinick, Director of Neuroscience Research at DeepMind.

But neuroscience also has much to gain. If the brain follows a similar reward “program,” what happens if it preferentially listens to optimistic rather than pessimistic dopaminergic neurons? Does it cause impulsivity or depression? What mechanisms in the brain’s reward circuits, or single dopaminergic neurons, can shift reward prediction? Do these changes lead to drug addiction, gambling, and other unhealthy behaviors? And by borrowing ideas “back” from AI, can we finally understand what motivates us, why we take risks, and what makes us tick?

“As the fields of AI and neuroscience try to understand the nature of intelligence, we continually borrow methods and inspiration from each other. I feel extremely lucky to have been able to add to this ongoing conversation, as we try to piece together the fundamental algorithms behind learning and intelligence, be it biological or artificial,” said Dabney.

Image Credit: Image by Raman Oza from Pixabay

Shelly Xuelai Fan is a neuroscientist-turned-science writer. She completed her PhD in neuroscience at the University of British Columbia, where she developed novel treatments for neurodegeneration. While studying biological brains, she became fascinated with AI and all things biotech. Following graduation, she moved to UCSF to study blood-based factors that rejuvenate aged brains. She is the ...
