In early 2019 OpenAI, a startup co-founded by Elon Musk devoted to ensuring artificial general intelligence is safe for humanity, announced it had created a neural network for natural language processing called GPT-2. In what some saw as a publicity stunt and others as a sign of an imminent robot apocalypse, OpenAI initially chose not to publicly release the text generator. Because the tool could produce text realistic enough that it was, in some cases, hard to distinguish from human writing, its creators worried GPT-2 could be appropriated as an easy way for bad actors to crank out lots of fake news or propaganda.
Fake news has certainly become a widespread and insidious problem, and in a year when we’re dealing with both a global pandemic and the possible re-election of Donald Trump as the US president, it seems like a more powerful and lifelike text-generating AI is one of the last things we need right now.
Despite the potential risks, though, OpenAI announced late last month that GPT-2’s successor is complete. It’s called—you guessed it—GPT-3.
A paper published by OpenAI researchers on the pre-print server arXiv describes GPT-3 as an autoregressive language model with 175 billion parameters. That is a lot: for comparison, the final version of GPT-2, released in November 2019, had 1.5 billion parameters, and Microsoft’s Turing Natural Language Generation model, released for a private demo in February, had 17 billion.
A “parameter” is a numeric value, such as a connection weight, that a machine learning model learns from its training data. So how did OpenAI go from 1.5 billion of these to 175 billion? Contrary to what you might guess based on GPT-3’s massive size, the tech behind it isn’t more advanced than that of similar tools, and it contains no new training methods or architectures; its creators simply built a much larger network of the same kind and scaled up the quantity of input data by an order of magnitude.
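To make that jump in size concrete, here is a rough back-of-the-envelope sketch in Python. It leans on assumptions that are not stated in this article: a commonly used rule of thumb that a transformer holds roughly 12 × layers × width² weights (ignoring embeddings), and layer counts and widths as reported in the GPT-2 and GPT-3 papers. Treat the output as an estimate, not an official figure.

```python
# Rough estimate only. Assumes the rule of thumb
#   parameters ~= 12 * n_layers * d_model**2
# which ignores embeddings and biases.

def approx_params(n_layers: int, d_model: int) -> int:
    """Approximate number of weights in a transformer's stacked layers."""
    return 12 * n_layers * d_model ** 2

# Assumed model shapes (taken from the published papers, not this article).
gpt2 = approx_params(n_layers=48, d_model=1600)
gpt3 = approx_params(n_layers=96, d_model=12288)

print(f"GPT-2: ~{gpt2 / 1e9:.2f} billion parameters")
print(f"GPT-3: ~{gpt3 / 1e9:.1f} billion parameters")
print(f"GPT-3 is roughly {gpt3 / gpt2:.0f} times larger")
```

Under these assumptions the estimates land close to the 1.5 billion and 175 billion figures quoted above, which suggests the growth comes from a deeper, wider network rather than a new kind of design.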
Most of the data came from Common Crawl, a non-profit that scans the open web every month, downloads content from billions of HTML pages, and makes it available in a special format for large-scale data mining. In 2017 the average monthly “crawl” yielded over three billion web pages. Common Crawl has been doing this since 2011, and has petabytes of data in over 40 different languages. The OpenAI team applied filtering techniques to improve the overall quality of the data and supplemented it with curated datasets like Wikipedia.
GPT stands for Generative Pretrained Transformer. The “transformer” part refers to a neural network architecture introduced by Google in 2017. Rather than looking at words one at a time in sequential order, text or speech generators built on this design model the relationships between all the words in a sentence at once. Each pair of words gets an “attention score,” which sets how much weight one word carries when the network processes the other. Essentially, this is a complex way of saying the model is weighing how likely it is that a given word will be preceded or followed by another word, and how much that likelihood changes based on the other words in the sentence.
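For readers who want to see the attention idea in miniature, the sketch below computes attention scores for a three-word sentence in plain Python. It is a simplification under stated assumptions, not OpenAI’s implementation: the word vectors are made up, real transformers first pass each vector through learned query, key, and value matrices, and GPT-style models also hide later words from earlier ones. None of that is modeled here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Score every word against every other word, then blend the vectors.

    Simplification: the raw vectors act as queries, keys, and values.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Scaled dot product between this word and every word in the sentence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # New representation: attention-weighted average of all word vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(dim)]
        )
    return weights, outputs

# Hypothetical 2-dimensional "embeddings" for a three-word sentence.
sentence = ["the", "cat", "sat"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, outputs = self_attention(vectors)
for word, row in zip(sentence, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sentence, itself included. The rows always sum to 1, so they behave like the weights described above.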
By finding the relationships and patterns between words in a giant dataset, the algorithm learns without human-labeled examples: it is trained to predict the next word, and the text itself supplies the right answer. This is called unsupervised machine learning. And it doesn’t end with words: GPT-3 can also figure out how concepts relate to each other, and discern context.
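The idea of learning from raw text with no human labels can be shown with a toy example. The Python sketch below simply counts which word follows which in a made-up sentence and uses those counts to guess the next word. This is an illustrative stand-in and an assumption on our part: GPT-3 is a large neural network, not a table of counts, but in both cases the text itself provides the training signal.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which; the text supplies its own answers."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Made-up training text, purely for illustration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "sat"))
```

Nobody told the program that “cat” tends to follow “the”; it picked that up from the word order alone. A model like GPT-3 learns far richer patterns, but the setup is similar in spirit.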
In the paper, the OpenAI team notes that GPT-3 performed well when tasked with translation, answering questions, and doing reading comprehension-type exercises that required filling in the blanks where words had been removed. They also say the model was able to do “on-the-fly reasoning,” and that it generated sample news articles 200 to 500 words long that were hard to tell apart from ones written by people.
The authors acknowledge that GPT-3 could be misused in several ways, including to generate misinformation, spam, and phishing messages, to abuse legal and governmental processes, and even to write fake academic essays. More than a few high school seniors would certainly jump at the chance to have an AI write their college admissions essay (but among the potential misuses of this tool, that’s the least of our worries).
At the beginning of this year, an editor at The Economist gave GPT-2 a list of questions about what 2020 had in store. The algorithm predicted economic turbulence, “major changes in China,” and no re-election for Donald Trump, among other things. It’s a bit frightening to imagine what GPT-3 might predict for 2021 once we input all the articles from 2020, which is turning out to be a historic year in a pretty terrible way.
For now, though, no one outside OpenAI has access to GPT-3; the company hasn’t put out any details of when, how, or whether the algorithm will be released to the public. It could happen in phases, similar to GPT-2. But the sheer size of the new version presents added complications; according to Joe Davison, a research engineer at a startup that’s also working on natural language processing, “The computational resources needed to actually use GPT-3 in the real world make it extremely impractical.”
In the meantime, though, OpenAI has a newly minted supercomputer custom-made by Microsoft for machine learning research. This will make it easier to quickly improve GPT-3’s abilities, and maybe even start work on the model’s next iteration in the not-too-distant future.
How powerful might these natural language processing algorithms get? Perhaps it’s simultaneously a comfort and a shortfall to think that even after being fed years’ worth of the internet’s entire pool of knowledge, no model could have predicted what 2020 would bring. But then again, no human could have, either.
Image Credit: Willi Heidelbach from Pixabay