“Open” AI models have a lot to give. The practice of sharing source code with the public spurs innovation and democratizes AI as a tool.
Or so the story goes. A new analysis in Nature puts a twist on the narrative: Most supposedly “open” AI models, such as Meta’s Llama 3, are hardly that.
Rather than encouraging or benefiting small startups, the “rhetoric of openness is frequently wielded in ways that…exacerbate the concentration of power” in large tech companies, wrote David Widder at Cornell University, Meredith Whittaker at Signal Foundation, and Sarah West at AI Now Institute.
Why care? Debating AI openness may seem purely academic. But with the growing use of ChatGPT and other large language models, policymakers are scrambling to catch up. Should these models be allowed in schools or companies? What guardrails should be in place to protect against misuse?
And perhaps most importantly, most AI models are controlled by Google, Meta, and other tech giants, which have the infrastructure and financial means to either develop or license the technology—and in turn, guide the evolution of AI to meet their financial incentives.
Lawmakers around the globe have taken note. This year, the European Union adopted the AI Act, the world’s first comprehensive legislation aimed at ensuring AI systems are “safe, transparent, non-discriminatory, and environmentally friendly.” As of September, there were over 120 AI bills in Congress addressing privacy, accountability, and transparency.
In theory, open AI models can meet those needs. But “when policy is being shaped, definitions matter,” wrote the team.
In the new analysis, they broke down the concept of “openness” in AI models across the entire development cycle and pinpointed how the term can be misused.
What Is ‘Openness,’ Anyway?
The term “open source” is nearly as old as software itself.
At the turn of the century, small groups of computing rebels released code for free software that anyone could download and use in defiance of corporate control. They had a vision: open-source software, such as freely available word processors similar to Microsoft’s, could level the playing field for the little guys and give access to people who couldn’t afford the technology. The code also became a playground where eager software engineers fiddled with it to uncover flaws in need of fixing, resulting in more usable and secure software.
With AI, the story’s different. Large language models are built from numerous layers of interconnected artificial “neurons.” As with their biological counterparts, the structure of those connections heavily influences how well a model performs a specific task.
Models are trained by scraping the internet for text, images, and, increasingly, video. As this training data flows through a model’s neural network, training adjusts the strengths of the connections between its artificial neurons, dubbed “weights,” so the model generates the desired outputs. Most systems are then evaluated by people to judge the accuracy and quality of the results.
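To make that weight-adjustment loop concrete, here is a minimal, hypothetical PyTorch sketch of a single training step; the toy one-layer “model,” the inputs, and the learning rate are illustrative stand-ins, not anything from the paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for a language model: one linear layer in place of billions of weighted connections.
model = nn.Linear(in_features=8, out_features=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Hypothetical training example: an input and the output we want the model to produce.
x = torch.randn(1, 8)
target = torch.randn(1, 8)

optimizer.zero_grad()
prediction = model(x)               # data flows through the network
loss = loss_fn(prediction, target)  # measure how far the output is from the desired one
loss.backward()                     # work out how each weight contributed to the error
optimizer.step()                    # nudge the weights to reduce that error
```

Real systems repeat a step like this over trillions of pieces of training data, which is where the enormous compute bill comes from.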
The problem? Understanding these systems’ internal processes isn’t straightforward. Unlike with traditional software, sharing only an AI model’s weights and code, without the underlying training data, makes it difficult for others to detect potential bugs or security threats.
This means previous concepts from open-source software are being applied in “ill-fitting ways to AI systems,” wrote the team, leading to confusion about the term.
Openwashing
Current “open” AI models span a range of openness, but overall, they have three main characteristics.
One is transparency, or how much detail about an AI model’s setup its creator publishes. EleutherAI’s Pythia series, for example, allows anyone to download the source code, underlying training data, and full documentation. The models are also licensed for wide reuse, meeting the definition of “open source” from the Open Source Initiative, a non-profit that has defined the term as it has evolved over nearly three decades. In contrast, Meta’s Llama 3, although described as open, only lets people build on its AI through an API (a sort of interface that lets different software communicate without sharing the underlying code) or download just the model’s weights to tinker with, subject to restrictions on their use.
“This is ‘openwashing’ systems that are better understood as closed,” wrote the authors.
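To make the transparent end of that spectrum concrete, here is a minimal sketch, assuming the Hugging Face transformers library is installed, that downloads one of the small, fully published Pythia checkpoints and runs it locally; the specific checkpoint and prompt are illustrative choices, not examples from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pythia's code, weights, and training data are all published, so the full model
# can be pulled down and run locally, with no API acting as a gatekeeper.
model_name = "EleutherAI/pythia-70m"  # smallest checkpoint in the series, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Open-source software", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

An API-only model, by contrast, is reachable solely through a hosted endpoint, with its weights and training data out of view.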
A second characteristic is reusability, in that openly licensed data and details of an AI model can be used by other people (although often only through a cloud service; more on that later). The third characteristic, extensibility, lets people fine-tune existing models for their specific needs.
“[This] is a key feature championed particularly by corporate actors invested in open AI,” wrote the team. There’s a reason: Training AI models requires massive computing power and resources, often only available to large tech companies. Llama 3, for example, was trained on 15 trillion tokens, the chunks of text (roughly words or word fragments) that a model processes. These choke points make it hard for startups to build AI systems from scratch. Instead, they often retrain “open” systems to adapt them to a new task or run more efficiently. Stanford’s Alpaca model, based on Llama, for example, drew interest because it could run on a laptop.
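As a hedged illustration of what that kind of extensibility looks like in practice, the sketch below attaches lightweight LoRA adapters, via the Hugging Face peft library, to the same small Pythia checkpoint, so that only a tiny fraction of weights would be retrained for a new task; the configuration values are arbitrary examples, not a recipe from the paper or from Alpaca.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Start from an existing open model rather than training one from scratch.
base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

# LoRA adds small trainable adapter matrices alongside the frozen base weights,
# which is how smaller players adapt large models without large-scale compute.
config = LoraConfig(
    r=8,                                 # adapter rank (illustrative value)
    lora_alpha=16,
    target_modules=["query_key_value"],  # attention projection in Pythia's GPT-NeoX architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1 percent of the base model's weights
```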
There’s no doubt that many people and companies have benefited from open AI models. But to the authors, those models may also be a barrier to the democratization of AI.
The Dark Side
Many large-scale open AI systems today are trained on cloud servers, the authors note. The UAE’s Technology Innovation Institute developed Falcon 40B and trained it on Amazon’s AWS servers. MosaicML’s AI is “tied to Microsoft’s Azure.” Even OpenAI has partnered with Microsoft to offer its new AI models at a price.
While cloud computing is extremely useful, it limits who can actually run AI models to a handful of large companies—and their servers. Stanford’s Alpaca eventually shut down partially due to a lack of financial resources.
Secrecy around training data is another concern. “Many large-scale AI models described as open neglect to provide even basic information about the underlying data used to train the system,” wrote the authors.
Large language models process huge amounts of data scraped from the internet, some of it copyrighted, which has led to a number of ongoing lawsuits. When datasets aren’t readily made available, or when they’re incredibly large, it’s tough to fact-check a model’s reported performance or to tell whether the datasets “launder others’ intellectual property,” according to the authors.
The problem gets worse with development frameworks, often built by large tech companies, that minimize the time spent “[reinventing] the wheel.” These pre-written pieces of code, workflows, and evaluation tools help developers quickly build on an AI system. Most tweaks, however, don’t change the underlying model itself. In other words, whatever problems or biases exist inside the models can also propagate to downstream applications.
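As a rough sketch of why that propagation happens, consider a hypothetical downstream application built with the transformers pipeline helper: the wrapper adds prompting and post-processing, but the published weights, and whatever biases they absorbed during training, are used exactly as shipped.

```python
from transformers import pipeline

# The pipeline wraps the pre-trained model as-is; nothing below retrains or audits it,
# so any bias baked into the weights flows straight into this application's outputs.
generator = pipeline("text-generation", model="EleutherAI/pythia-70m")  # illustrative model choice

reply = generator(
    "Customer complaint: the package arrived late. Suggested reply:",
    max_new_tokens=30,
)
print(reply[0]["generated_text"])
```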
An AI Ecosystem
To the authors, developing AI that’s more open isn’t about evaluating one model at a time. Rather, it’s about taking the whole ecosystem into account.
Most debates on AI openness miss the larger picture. As AI advances, “the pursuit of openness on its own will be unlikely to yield much benefit,” wrote the team. Instead, the entire cycle of AI development—from setting up, training, and running AI systems to their practical uses and financial incentives—has to be considered when building open AI policies.
“Pinning our hopes on ‘open’ AI in isolation will not lead us to that world,” wrote the team.