In 2020, artificial intelligence company OpenAI stunned the tech world with its GPT-3 machine learning algorithm. After ingesting a broad slice of the internet, GPT-3 could generate writing that was hard to distinguish from text authored by a person, do basic math, write code, and even whip up simple web pages.
OpenAI followed up GPT-3 with more specialized algorithms that could seed new products, like an AI called Codex to help developers write code and the wildly popular (and controversial) image-generator DALL-E 2. Then late last year, the company upgraded GPT-3 and dropped a viral chatbot called ChatGPT—by far, its biggest hit yet.
Now, a rush of competitors is battling it out in the nascent generative AI space, from new startups flush with cash to venerable tech giants like Google. Billions of dollars are flowing into the industry, including a $10-billion follow-up investment by Microsoft into OpenAI.
This week, after months of rather over-the-top speculation, OpenAI’s GPT-3 sequel, GPT-4, officially launched. In a blog post, interviews, and two reports, OpenAI said GPT-4 is better than GPT-3 in nearly every way.
More Than a Passing Grade
GPT-4 is multimodal, which is a fancy way of saying it was trained on both images and text and can identify, describe, and riff on what’s in an image using natural language. OpenAI said the algorithm’s output is higher quality, more accurate, and less prone to bizarre or toxic outbursts than prior versions. It also outperformed the upgraded GPT-3 (called GPT-3.5) on a slew of standardized tests, placing among the top 10 percent of human test-takers on the bar licensing exam for lawyers and scoring either a 4 or a 5 on 13 out of 15 college-level advanced placement (AP) exams for high school students.
To show off its multimodal abilities—which have yet to be offered more widely as the company evaluates them for misuse—OpenAI president Greg Brockman sketched a schematic of a website on a pad of paper during a developer demo. He took a photo and asked GPT-4 to create a webpage from the image. In seconds, the algorithm generated and implemented code for a working website. In another example, described by The New York Times, the algorithm suggested meals based on an image of food in a refrigerator.
The company also outlined its work to reduce risk inherent in models like GPT-4. Notably, the raw algorithm was complete last August. OpenAI said it then spent six months working to improve the model and rein in its excesses.
Much of this work was accomplished by teams of experts poking and prodding the algorithm and giving feedback, which was then used to refine the model with reinforcement learning. The version launched this week is an improvement on the raw version from last August, but OpenAI admits it still exhibits known weaknesses of large language models, including algorithmic bias and an unreliable grasp of the facts.
By this account, GPT-4 is a big improvement technically and makes progress mitigating, but not solving, familiar risks. In contrast to prior releases, however, we’ll largely have to take OpenAI’s word for it. Citing an increasingly “competitive landscape and the safety implications of large-scale models like GPT-4,” the company opted to withhold specifics about how GPT-4 was made, including model size and architecture, computing resources used in training, what was included in its training dataset, and how it was trained.
Ilya Sutskever, chief scientist and cofounder at OpenAI, told The Verge “it took pretty much all of OpenAI working together for a very long time to produce this thing” and lots of other companies “would like to do the same thing.” He went on to suggest that as the models grow more powerful, the potential for abuse and harm makes open-sourcing them a dangerous proposition. But this is hotly debated among experts in the field, and some pointed out the decision to withhold so much runs counter to OpenAI’s stated values when it was founded as a nonprofit. (OpenAI reorganized as a capped-profit company in 2019.)
The algorithm’s full capabilities and drawbacks may not become apparent until access widens further and more people test (and stress) it. Before Microsoft reined it in, the company’s Bing chatbot caused an uproar as users pushed it into bizarre, unsettling exchanges.
Overall, the technology is quite impressive—like its predecessors—but also, despite the hype, more iterative than GPT-3. With the exception of its new image-analyzing skills, most abilities highlighted by OpenAI are improvements and refinements of older algorithms. Not even access to GPT-4 is novel. Microsoft revealed this week that it secretly used GPT-4 to power its Bing chatbot, which had recorded some 45 million chats as of March 8.
AI for the Masses
While GPT-4 may not be the step change some predicted, the scale of its deployment almost certainly will be.
GPT-3 was a stunning research algorithm that wowed tech geeks and made headlines; GPT-4 is a far more polished algorithm that’s about to be rolled out to millions of people in familiar settings like search bars, Word docs, and LinkedIn profiles.
In addition to its Bing chatbot, Microsoft announced plans to offer services powered by GPT-4 in LinkedIn Premium and Office 365. These will be limited rollouts at first, but as each iteration is refined in response to feedback, Microsoft could offer them to the hundreds of millions of people using its products. (Earlier this year, the free version of ChatGPT hit 100 million users faster than any app in history.)
It’s not only Microsoft layering generative AI into widely used software.
Google said this week it plans to weave generative algorithms into its own productivity software, including Gmail and Google Docs, Slides, and Sheets, and will offer developers API access to PaLM, a GPT-4 competitor, so they can build their own apps on top of it. Other models are coming too. Facebook recently gave researchers access to its open-source LLaMA model (it was later leaked online), while a Google-backed startup, Anthropic, and China’s tech giant Baidu rolled out their own chatbots, Claude and Ernie, this week.
As models like GPT-4 make their way into products, they can be updated behind the scenes at will. OpenAI and Microsoft continually tweaked ChatGPT and Bing as feedback rolled in. ChatGPT Plus users (a $20/month subscription) were granted access to GPT-4 at launch.
It’s easy to imagine GPT-5 and other future models slotting into the ecosystem being built now as simply, and invisibly, as a smartphone operating system that upgrades overnight.
If there’s anything we’ve learned in recent years, it’s that scale reveals all.
It’s hard to predict how new tech will succeed or fail until it makes contact with a broad slice of society. The next months may bring more examples of algorithms revealing new abilities and breaking or being broken, as their makers scramble to keep pace.
“Safety is not a binary thing; it is a process,” Sutskever told MIT Technology Review. “Things get complicated any time you reach a level of new capabilities. A lot of these capabilities are now quite well understood, but I’m sure that some will still be surprising.”
Longer term, when the novelty wears off, bigger questions may loom.
The industry is throwing spaghetti at the wall to see what sticks. But it’s not clear generative AI is useful—or appropriate—in every instance. Chatbots in search, for example, may not outperform older approaches until they’ve proven to be far more reliable than they are today. And the cost of running generative AI, particularly at scale, is daunting. Can companies keep expenses under control, and will users find products compelling enough to vindicate the cost?
Also, the fact that GPT-4 makes progress on but hasn’t solved the best-known weaknesses of these models should give us pause. Some prominent AI experts believe these shortcomings are inherent to the current deep learning approach and won’t be solved without fundamental breakthroughs.
Factual missteps and biased or toxic responses in a fraction of interactions are less impactful when numbers are small. But on a scale of hundreds of millions or more, even less than a percent equates to a big number.
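To make the arithmetic concrete, here is a minimal sketch in Python. The function name, the 0.5 percent failure rate, and the interaction counts are all hypothetical figures chosen for illustration; none of them are measured error rates for GPT-4 or any other model.

```python
def expected_flawed_responses(total_interactions: int, failure_rate: float) -> float:
    """Expected number of flawed responses for a given per-interaction failure rate.

    Both arguments are illustrative assumptions, not measured values.
    """
    if total_interactions < 0:
        raise ValueError("total_interactions must be non-negative")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return total_interactions * failure_rate


if __name__ == "__main__":
    hypothetical_rate = 0.005  # assumed: half a percent of interactions go wrong
    for total in (10_000, 1_000_000, 100_000_000):
        flawed = expected_flawed_responses(total, hypothetical_rate)
        print(f"{total:>11,} interactions -> roughly {flawed:,.0f} flawed responses")
```

Under these assumed numbers, the same half-percent rate that yields about 50 flawed responses across 10,000 interactions yields about 500,000 across 100 million.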
“LLMs are best used when the errors and hallucinations are not high impact,” Matthew Lodge, the CEO of Diffblue, recently told IEEE Spectrum. Indeed, companies are appending disclaimers warning users not to rely on them too much—like keeping your hands on the steering wheel of that Tesla.
It’s clear the industry is eager to keep the experiment going though. And so, hands on the wheel (one hopes), millions of people may soon begin churning out presentation slides, emails, and websites in a jiffy, as the new crop of AI sidekicks arrives in force.
Image Credit: Luke Jones / Unsplash