The artificial intelligence industry is obsessed with size. Bigger algorithms. More data. Sprawling data centers that could, in a few years, consume enough electricity to power whole cities.
This insatiable appetite is why OpenAI—which is on track to make $3.7 billion in revenue but lose $5 billion this year—just announced it’s raised $6.6 billion more in funding and opened a line of credit for another $4 billion.
Eye-popping numbers like these make it easy to forget size isn’t everything.
Some researchers, particularly those with fewer resources, are aiming to do more with less. AI scaling will continue, but those algorithms will also get far more efficient as they grow.
Last week, researchers at the Allen Institute for Artificial Intelligence (Ai2) released a new family of open-source multimodal models competitive with state-of-the-art models like OpenAI’s GPT-4o—but an order of magnitude smaller. Called Molmo, the models range from 1 billion to 72 billion parameters. GPT-4o, by comparison, is estimated to top a trillion parameters.
It’s All in the Data
Ai2 said it accomplished this feat by focusing on data quality over quantity.
Models like GPT-4o that are fed billions of examples are impressively capable. But they also ingest a ton of low-quality information. All this noise consumes precious computing power.
To build their new multimodal models, Ai2 started from a backbone of existing large language models and vision encoders. They then compiled a more focused, higher-quality dataset of around 700,000 images and 1.3 million captions to train new models with visual capabilities. That may sound like a lot, but it’s on the order of 1,000 times less data than what’s used in proprietary multimodal models.
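The technical paper has the full details, but the general recipe is a familiar one: a pretrained vision encoder feeds image features through a small "connector" that projects them into the language model's embedding space. The PyTorch sketch below illustrates that idea only; the class, dimensions, and layer choices are placeholders, not Ai2's code.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Toy connector: projects vision-encoder patch features into the
    language model's embedding space so they can be fed to the LLM as
    extra tokens. Dimensions are illustrative, not Molmo's."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Stand-in for the output of a frozen vision encoder on one image.
patches = torch.randn(1, 576, 1024)
image_tokens = VisionLanguageConnector()(patches)
print(image_tokens.shape)  # torch.Size([1, 576, 4096])
```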
Instead of writing captions, the team asked annotators to record 60- to 90-second verbal descriptions answering a list of questions about each image. They then transcribed the descriptions—which often stretched across several pages—and used other large language models to clean up, crunch down, and standardize them. They found that this simple switch, from written to verbal annotation, yielded far more detail with little extra effort.
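The article doesn't name the team's annotation tooling, but the same two-step idea, transcribe a spoken description and then have a language model tidy it up, can be sketched with off-the-shelf open models. The model names, audio filename, and prompt below are illustrative assumptions, not what Ai2 used.

```python
from transformers import pipeline

# Step 1: transcribe the annotator's 60- to 90-second spoken description.
# (Model choice here is an assumption; the team's actual tooling isn't named.)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
raw_transcript = asr("annotator_description.wav")["text"]

# Step 2: have a language model clean up and standardize the transcript
# into a dense caption, keeping detail but dropping filler.
cleaner = pipeline("text-generation", model="Qwen/Qwen2-7B-Instruct")
prompt = (
    "Rewrite this spoken image description as a concise, well-structured "
    "caption. Keep every visual detail and drop filler words.\n\n"
    f"Transcript: {raw_transcript}\n\nCaption:"
)
caption = cleaner(prompt, max_new_tokens=300, return_full_text=False)[0]["generated_text"]
print(caption)
```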
Tiny Models, Top Dogs
The results are impressive.
According to a technical paper describing the work, the team’s largest model, Molmo 72B, roughly matches or outperforms state-of-the-art closed models—including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro—across 11 academic benchmarks as well as by user preference. Even the smaller Molmo models, which are a tenth the size of the largest, compare favorably to state-of-the-art models.
Molmo can also point to the things it identifies in images. This kind of skill might help developers build AI agents that identify buttons or fields on a webpage to handle tasks like making a reservation at a restaurant. Or it could help robots better identify and interact with objects in the real world.
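Because the weights are public, developers can experiment with pointing directly. The sketch below assumes the allenai/Molmo-7B-D-0924 checkpoint on Hugging Face and the custom processing and generation helpers bundled with it; the screenshot filename and prompt are made up, and the exact interface is worth verifying against the model card.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Loads the released weights plus their bundled custom code; the helper
# methods below (processor.process, generate_from_batch) come from that
# remote code, so check the model card for the current interface.
repo = "allenai/Molmo-7B-D-0924"
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

# Ask the model to point at a UI element in a screenshot (hypothetical file).
image = Image.open("booking_page.png")
inputs = processor.process(images=[image], text='Point to the "Reserve a table" button.')
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=100, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)  # expected to contain pixel coordinates an agent could act on
```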
Ai2 CEO Ali Farhadi acknowledged it’s debatable how much benchmarks can tell us. But we can use them to make a rough model-to-model comparison.
“There are a dozen different benchmarks that people evaluate on. I don’t like this game, scientifically… but I had to show people a number,” Farhadi said at a Seattle release event. “Our biggest model is a small model, 72B, it’s outperforming GPTs and Claudes and Geminis on those benchmarks. Again, take it with a grain of salt; does this mean that this is really better than them or not? I don’t know. But at least to us, it means that this is playing the same game.”
Open-Source AI
In addition to being smaller, Molmo is open-source. This matters because it means people now have a free alternative to proprietary models.
There are other open models beginning to compete with the top dogs on some benchmarks. Meta’s Llama 3.1 405B, for example, is the first scaled-up open-weights large language model. But it’s not multimodal. (Meta released multimodal versions of its smaller Llama models last week. It may do the same for its biggest model in the months to come.)
Molmo is also more open than Llama. Meta’s models are best described as “open-weights” models, in that the company releases model weights but not the code or data used in training. The biggest Molmo model is based on Alibaba Cloud’s open-weights Qwen2 72B—which, like Llama, doesn’t include training data or code—but Ai2 did release the dataset and code they used to make their model multimodal.
Also, Meta limits commercial use to products with under 700 million users. In contrast, Molmo carries an Apache 2.0 license. This means developers can modify the models and commercialize products with few limitations.
“We’re targeting researchers, developers, app developers, people who don’t know how to deal with these [large] models. A key principle in targeting such a wide range of audience is the key principle that we’ve been pushing for a while, which is: make it more accessible,” Farhadi said.
Nipping at the Heels
There are a few things of note here. First, while the makers of proprietary models try to monetize their models, open-source alternatives with similar capabilities are arriving. These alternatives, as Molmo shows, are also smaller and more flexible, meaning they can run locally. They’re legitimate competition for companies raising billions on the promise of AI products.
“Having an open source, multimodal model means that any startup or researcher that has an idea can try to do it,” Ofir Press, a post-doc at Princeton University, told Wired.
At the same time, working with images and text is old hat for OpenAI and Google. The companies are pulling ahead again by adding advanced voice capabilities, video generation, and reasoning skills. With billions in new investment and access to a growing hoard of quality data from deals with publishers, the next generation of models could raise the stakes again.
Still, Molmo suggests that even as the biggest companies plow billions into scaling the technology, open-source alternatives may not be far behind.
Image Credit: Resource Database / Unsplash