Date: 2025-02-11

For most of artificial intelligence’s history, many researchers expected that building truly capable systems would need a long series of scientific breakthroughs: revolutionary algorithms, deep insights into human cognition, or fundamental advances in our understanding of the brain. While scientific advances have played a role, recent AI progress has revealed an unexpected insight: A lot of the recent improvement in AI capabilities has come simply from scaling up existing AI systems.1

Here, scaling means deploying more computational power, using larger datasets, and building bigger models. This approach has worked surprisingly well so far.2 Just a few years ago, state-of-the-art AI systems struggled with basic tasks like counting.3,4 Today, they can solve complex math problems, write software, create extremely realistic images and videos, and discuss academic topics.

This article will provide a brief overview of scaling in AI over the past years. The data comes from Epoch, an organization that analyzes trends in computing, data, and investments to understand where AI might be headed.5 Epoch maintains the most extensive dataset on AI models and regularly publishes key figures on AI growth and change.

What Is Scaling in AI Models?

Let’s briefly break down what scaling means in AI. Scaling is about increasing three main things during training, which typically need to grow together:

• The amount of data used for training the AI;

• The model’s size, measured in “parameters”;

• Computational resources, often called "compute" in AI.

The idea is simple but powerful: Bigger AI systems, trained on more data and using more computational resources, tend to perform better. Even without substantial changes to the algorithms, this approach often leads to better performance across many tasks.6

Here is another reason why this is important: As researchers scale up these AI systems, the systems not only improve at the tasks they were trained on but can sometimes develop new abilities that they did not have at a smaller scale.7 For example, language models initially struggled with simple arithmetic tests like three-digit addition, but larger models could handle these easily once they reached a certain size.8 The transition wasn't a smooth, incremental improvement but a more abrupt leap in capabilities.

This abrupt jump in capability, rather than steady improvement, can be concerning. If, for example, models suddenly develop unexpected and potentially harmful behaviors simply as a result of getting bigger, those behaviors would be harder to anticipate and control.

This makes it important to track the three scaling metrics: data, parameters, and compute.

What Are the Three Components of Scaling Up AI Models?

Data: scaling up the training data

One way to view today's AI models is by looking at them as very sophisticated pattern recognition systems. They work by identifying and learning from statistical regularities in the text, images, or other data on which they are trained. The more data the model has access to, the more it can learn about the nuances and complexities of the knowledge domain in which it’s designed to operate.9

In 1950, Claude Shannon built one of the earliest examples of “AI”: a robotic mouse named Theseus that could "remember" its path through a maze using simple relay circuits. Each wall Theseus bumped into became a data point, allowing it to learn the correct route. The total number of walls or data points was 40. You can find this data point in the chart; it is the first one.

While Theseus stored simple binary states in relay circuits, modern AI systems utilize vast neural networks, which can learn much more complex patterns and relationships and thus process billions of data points.

All recent notable AI models, especially large, state-of-the-art ones, rely on vast amounts of training data. With the y-axis displayed on a logarithmic scale, the chart shows that the data used to train AI models has grown exponentially: from 40 data points for Theseus to trillions of data points for the largest modern systems, in a little more than seven decades.

Since 2010, the training data has doubled approximately every nine to ten months. You can see this rapid growth in the chart, shown by the purple line extending from the start of 2010 to October 2024, the latest data point as I write this article.10

Datasets used for training large language models, in particular, have experienced an even faster growth rate, tripling in size each year since 2010. Large language models process text by breaking it into tokens—basic units the model can encode and understand. A token doesn't directly correspond to one word, but on average, three English words correspond to about four tokens.
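The two ways of describing growth used above, a doubling time in months and a multiplication factor per year, can be converted into each other. The following Python sketch shows the arithmetic. The function names are illustrative and not taken from Epoch, and the calculation assumes smooth exponential growth, which real trends only approximate.

```python
import math


def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months, given its doubling time."""
    return 2 ** (12 / doubling_months)


def doubling_time_months(annual_factor: float) -> float:
    """Months needed to double, given the growth factor per year."""
    return 12 * math.log(2) / math.log(annual_factor)


if __name__ == "__main__":
    # Training data in general: doubling every nine to ten months.
    for months in (9, 10):
        factor = annual_growth_factor(months)
        print(f"Doubling every {months} months = x{factor:.2f} per year")

    # Language-model datasets: tripling each year.
    print(f"Tripling each year = doubling every {doubling_time_months(3):.1f} months")
```

Under these assumptions, doubling every nine to ten months works out to growth of roughly 2.3 to 2.5 times per year, while tripling each year corresponds to a doubling time of about 7.6 months.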

GPT-2, released in 2019, is estimated to have been trained on 4 billion tokens, roughly equivalent to 3 billion words. To put this in perspective, as of September 2024, the English Wikipedia contained around 4.6 billion words.11 In comparison, GPT-4, released in 2023, was trained on almost 13 trillion tokens, or about 9.75 trillion words.12 This means that GPT-4’s training data was equivalent to over 2,000 times the amount of text of the entire English Wikipedia.
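The comparison above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it uses the rule of thumb of three words per four tokens and the estimates quoted in the text, and the variable and function names are chosen here for illustration.

```python
# Rule of thumb from the text: three English words are about four tokens.
WORDS_PER_TOKEN = 3 / 4


def tokens_to_words(tokens: float) -> float:
    """Convert a token count into an approximate English word count."""
    return tokens * WORDS_PER_TOKEN


# Estimates quoted in the article.
gpt2_tokens = 4e9        # GPT-2: about 4 billion tokens
gpt4_tokens = 13e12      # GPT-4: almost 13 trillion tokens
wikipedia_words = 4.6e9  # English Wikipedia, September 2024

gpt2_words = tokens_to_words(gpt2_tokens)
gpt4_words = tokens_to_words(gpt4_tokens)
wikipedia_multiple = gpt4_words / wikipedia_words

print(f"GPT-2: about {gpt2_words:.2e} words")
print(f"GPT-4: about {gpt4_words:.2e} words")
print(f"GPT-4 training text is about {wikipedia_multiple:,.0f} times English Wikipedia")
```

The ratio comes out a little above 2,100, which matches the "over 2,000 times" figure. Because every input is an estimate, the result is best read as an order of magnitude.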

As we use more data to train AI systems, we might eventually run out of high-quality human-generated materials like books, articles, and research papers. Some researchers predict we could exhaust useful training materials within the next few decades.13 While AI models themselves can generate vast amounts of data, training AI on machine-generated materials could create problems, making the models less accurate and more repetitive.14

Parameters: scaling up the model size

Increasing the amount of training data lets AI models learn from much more information than ever before. However, to pick up on the patterns in this data and learn effectively, models need what are called "parameters". Parameters are a bit like knobs that can be tweaked to improve how the model processes information and makes predictions. As the amount of training data grows, models need more capacity to capture all the details in the training data. This means larger datasets typically require the models to have more parameters to learn effectively.

Early neural networks had hundreds or thousands of parameters. With its simple maze-learning circuitry, Theseus was a model with just 40 parameters—equivalent to the number of walls it encountered. Recent large models, such as GPT-3, boast up to 175 billion parameters.15 While the raw number may seem large, this roughly translates into 700 GB if stored on a disk, which is easily manageable by today’s computers.
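The storage figure follows from multiplying the parameter count by the number of bytes used for each parameter. The sketch below assumes four bytes per parameter (a 32-bit floating-point number) and decimal gigabytes. The text does not state the precision, so this is an assumption chosen because it reproduces the 700 GB figure, and the function name is illustrative.

```python
def model_size_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate storage for a model's parameters, in decimal gigabytes.

    bytes_per_param=4 assumes 32-bit floats; this is an assumption,
    not something stated in the article.
    """
    return n_params * bytes_per_param / 1e9


gpt3_params = 175e9  # GPT-3: up to 175 billion parameters

print(f"32-bit parameters: {model_size_gb(gpt3_params):.0f} GB")
print(f"16-bit parameters: {model_size_gb(gpt3_params, bytes_per_param=2):.0f} GB")
```

With 16-bit parameters the same model would need about half the space, so the 700 GB figure depends on how the parameters are stored.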

The chart shows how the number of parameters in AI models has skyrocketed over time. Since 2010, the number of AI model parameters has approximately doubled every year. The highest estimated number of parameters recorded by Epoch is 1.6 trillion in the QMoE model.