Bigger is better—or at least that’s been the attitude of those designing AI language models in recent years. But now DeepMind is questioning this rationale, and says giving an AI a memory can help it compete with models 25 times its size.
When OpenAI released its GPT-3 model in June 2020, it rewrote the rulebook for language AIs. The lab's researchers showed that simply scaling up the size of a neural network and the data it was trained on could significantly boost performance on a wide variety of language tasks.
Since then, a host of other tech companies have jumped on the bandwagon, developing their own large language models and achieving similar boosts in performance. But despite the successes, concerns have been raised about the approach, most notably by former Google researcher Timnit Gebru.
In the paper that led to her being forced out of the company, Gebru and colleagues highlighted that the sheer size of these models and their datasets makes them even more inscrutable than the average neural network, which is already notorious for being a black box. This is likely to make detecting and mitigating bias in these models even harder.
Perhaps an even bigger problem they identify is the fact that relying on ever more computing power to make progress in AI means that the cutting edge of the field lies out of reach for all but the most well-resourced commercial labs. The seductively simple proposition that just scaling models up can lead to continual progress also means that fewer resources go into looking for promising alternatives.
But in new research, DeepMind has shown that there might be another way. In a series of papers, the team explains how they first built their own large language model, called Gopher, which is more than 60 percent larger than GPT-3. Then they showed that a far smaller model imbued with the ability to look up information in a database could go toe-to-toe with Gopher and other large language models.
The researchers have dubbed the smaller model RETRO, which stands for Retrieval-Enhanced Transformer. Transformers are the specific type of neural network used in most large language models; they train on huge amounts of text to predict what comes next, which is how they learn to respond to questions or prompts from a human user.
RETRO also relies on a transformer, but it has been given a crucial augmentation. As well as making predictions about what text should come next based on its training, the model can search a database of two trillion tokens of text for passages written in similar language that could improve its predictions.
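To get a feel for the idea, here is a minimal sketch in Python of what retrieval-enhanced prediction looks like. It is not DeepMind's code: the toy embedding function, the tiny in-memory database, and the way the retrieved text is combined with the prompt are all stand-ins for the trained encoder, the trillions-of-tokens database, and the in-network attention that RETRO actually uses.

```python
import numpy as np

# Toy illustration of retrieval-enhanced prediction (not DeepMind's implementation).
# Assumptions: embed() stands in for a frozen text encoder, and DATABASE is a
# small in-memory list instead of RETRO's massive token database.

DATABASE = [
    "The Eiffel Tower is located in Paris, France.",
    "Gopher is a 280-billion-parameter language model from DeepMind.",
    "Transformers predict the next token in a sequence.",
]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hash-seeded toy embedding; a real system would use a trained encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

DB_VECTORS = np.stack([embed(t) for t in DATABASE])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k database chunks whose embeddings are closest to the query."""
    q = embed(query)
    sims = DB_VECTORS @ q / (np.linalg.norm(DB_VECTORS, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [DATABASE[i] for i in top]

def generate(prompt: str) -> str:
    """Sketch: condition generation on both the prompt and the retrieved chunks."""
    neighbours = retrieve(prompt, k=1)
    context = " ".join(neighbours)
    # A real RETRO model attends to the retrieved chunks inside the network;
    # here we simply show the two inputs being combined.
    return f"[model conditioned on prompt: '{prompt}' + retrieved: '{context}']"

print(generate("How many parameters does Gopher have?"))
```

The key point the sketch illustrates is that the retrieval step happens at prediction time, so the model's "knowledge" lives partly in the database rather than entirely in its weights.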
The researchers found that a RETRO model with just 7 billion parameters could outperform the 178-billion-parameter Jurassic-1 transformer made by AI21 Labs on a wide variety of language tasks, and even beat the 280-billion-parameter Gopher model on most of them.
As well as cutting down the amount of training required, the researchers point out that the ability to see which chunks of text the model consulted when making predictions could make it easier to explain how it reached its conclusions. The reliance on a database also opens up opportunities for updating the model’s knowledge without retraining it, or even modifying the corpus to eliminate sources of bias.
Interestingly, the researchers showed that they can take an existing transformer and retro-fit it to work with a database by retraining a small section of its network. These models easily outperformed the original, and even got close to the performance of RETRO models trained from scratch.
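Here is a rough sketch of what such retrofitting might look like, assuming a PyTorch-style setup: the original block's weights are frozen, and only a small new cross-attention module that attends over retrieved chunks is trained. The class name, layer sizes, and training setup below are illustrative assumptions, not DeepMind's actual architecture.

```python
import torch
import torch.nn as nn

# Sketch of "retrofitting" an existing transformer with retrieval. Assumption:
# the pretrained weights stay frozen and only a small new cross-attention
# block over retrieved chunks is trained. Names and sizes are illustrative.

class RetrofittedBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, d_model: int = 512):
        super().__init__()
        self.pretrained = pretrained_block
        for p in self.pretrained.parameters():   # keep the original weights fixed
            p.requires_grad_(False)
        # New, trainable cross-attention over encoded retrieved text chunks.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        hidden = self.pretrained(hidden)          # frozen original computation
        attended, _ = self.cross_attn(hidden, retrieved, retrieved)
        return self.norm(hidden + attended)       # residual connection

# Only the new parameters go to the optimizer, so "retraining a small section
# of the network" really does mean training a small fraction of the weights.
d_model = 512
block = RetrofittedBlock(nn.Linear(d_model, d_model), d_model)
trainable = [p for p in block.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

hidden = torch.randn(2, 16, d_model)     # (batch, sequence, features)
retrieved = torch.randn(2, 32, d_model)  # encoded retrieved chunks
out = block(hidden, retrieved)
print(out.shape)  # torch.Size([2, 16, 512])
```

The appeal of this kind of approach is cost: most of the expensive pretraining is reused as-is, and only the small retrieval-facing component needs to learn anything new.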
It’s important to remember, though, that RETRO is still a large model by most standards; it’s nearly five times larger than GPT-3’s predecessor, GPT-2. And it seems likely that people will want to see what’s possible with an even bigger RETRO model with a larger database.
DeepMind certainly thinks further scaling is a promising avenue. In the Gopher paper they found that while increasing model size didn’t significantly improve performance in logical reasoning and common-sense tasks, in things like reading comprehension and fact-checking the benefits were clear.
Perhaps the most important lesson from RETRO is that scaling models isn’t the only—or even the fastest—route to better performance. While size does matter, innovation in AI models is also crucial.
Image Credit: DeepMind