A language model is a probability distribution over a sequence of words—which is to say, it gives the likelihood that a particular sequence of words will appear in natural text or speech. An accurate language model can be used for a wide variety of NLP tasks, including (most directly) story generation and (more surprisingly) question answering, arithmetic, and computer programming.
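In standard notation (a textbook formulation, not anything specific to GPT-3), a language model factors the probability of a word sequence into successive next-word predictions:

```latex
P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```

Text generation then amounts to repeatedly sampling the next word from the conditional distribution and appending it to the context.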
GPT-3 is the latest in a series of increasingly capable language models for natural language processing (NLP). GPT-3 is a deep neural network—specifically, a Generative Pretrained Transformer. It contains 175 billion parameters and was trained primarily on the Common Crawl dataset, which comprises nearly a trillion words. GPT-3 was created by OpenAI and introduced in May 2020 in the paper “Language Models are Few-Shot Learners.” It has since inspired a great deal of buzz—but how does it actually perform, and what does that mean for further progress in the field?
What’s Novel about GPT-3?
With 175 billion parameters, GPT-3 is two orders of magnitude larger than its direct predecessor, GPT-2, which has 1.5 billion parameters. It is also one order of magnitude larger than Microsoft’s language model, Turing NLG, which was released in February 2020.
More remarkably, GPT-3 also provides a much simpler way of applying the model to NLP tasks. Previous language models were applied to an NLP task using a traditional fine-tuning approach: the language model is first pre-trained on the generic task of estimating the probability of word sequences, then fine-tuned for each specific NLP task on a large corpus of training examples for that task. The language model provides an initial task model via transfer learning, and the task model is then tuned using traditional machine learning. Consequently, this approach suffers from the two problems machine learning generally struggles with: amassing training examples is difficult and costly; and the learned models are brittle, in that they overfit the training examples and don’t generalize well to new tasks, even very similar ones.
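As a concrete illustration, here is a minimal sketch of that two-phase pipeline in PyTorch. The tiny model, toy data, and the `next_word`/`task_head` names are illustrative assumptions, not the architecture or training setup of any published system:

```python
import torch
import torch.nn as nn

VOCAB, DIM, CLASSES = 1000, 64, 2

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.next_word = nn.Linear(DIM, VOCAB)    # pre-training head
        self.task_head = nn.Linear(DIM, CLASSES)  # fine-tuning head

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return hidden

model = TinyLM()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters())

# Phase 1: generic pre-training -- predict each next word in unlabeled text.
text = torch.randint(0, VOCAB, (8, 21))            # toy "corpus" batch
logits = model.next_word(model(text[:, :-1]))      # predict word i+1 from words 1..i
opt.zero_grad()
loss_fn(logits.reshape(-1, VOCAB), text[:, 1:].reshape(-1)).backward()
opt.step()

# Phase 2: task fine-tuning -- requires a labeled corpus for *each* new task.
examples = torch.randint(0, VOCAB, (8, 20))        # toy task inputs
labels = torch.randint(0, CLASSES, (8,))           # toy task labels
logits = model.task_head(model(examples)[:, -1])   # classify from final hidden state
opt.zero_grad()
loss_fn(logits, labels).backward()
opt.step()
```

The point of the sketch is the second phase: every new task needs its own labeled examples and its own round of gradient updates.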
GPT-3 removes the need for traditional fine-tuning of models for each NLP task. It can be used directly for a wide variety of tasks, given only a small amount of guidance on the task. GPT-3 was evaluated under three levels of guidance: “few-shot learning,” in which the user provides a small number of demonstrations of the task (typically 10 to 100, which is very small compared with the size of traditional datasets for training neural networks); “one-shot learning,” in which the user provides only one demonstration; and “zero-shot learning,” in which the user provides only a brief, natural language description of the task rather than demonstrating it.
For example, GPT-3 has been used for the NLP task of machine translation between English and French. Under the “few-shot learning” condition, it is presented with a handful of English passages and their French equivalents. Under the “zero-shot learning” condition, it is presented with only the instruction: “Translate English to French.”
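A small sketch of how those three regimens differ as raw prompt text, using the translation example above (the `=>` formatting is an assumption; the exact prompt layout used in GPT-3’s evaluation may differ):

```python
demos = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]

def build_prompt(description, demos, query):
    """Assemble a prompt: task description, zero or more demos, then the query."""
    lines = [description]
    for english, french in demos:
        lines.append(f"{english} => {french}")
    lines.append(f"{query} =>")  # the model completes this final line
    return "\n".join(lines)

print(build_prompt("Translate English to French:", demos, "mint"))
```

With `demos` holding several pairs the prompt is few-shot; with exactly one pair it is one-shot; with an empty list it degenerates to the zero-shot instruction alone.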
To the extent that GPT-3 performs well on NLP tasks under any of these training regimens, it directly addresses the two main problems confronting machine learning.
Does GPT-3 Work?
In many ways, GPT-3 represents an important new breakthrough in NLP, but its performance in testing has been mixed. GPT-3’s successes on a variety of tasks, detailed below, demonstrate that size alone can account for some significant advances in NLP capabilities. At the same time, these advances do not extend uniformly across tasks.
Through the construction of a series of ever-larger neural language models, ranging from ELMo’s 100 million parameters in 2018 to GPT-3’s 175 billion parameters in 2020, performance on NLP tasks has steadily improved. Arguably, task performance is a direct function of the size of the language model. GPT-3’s performance on NLP tasks seems to follow that trend line, but head-to-head comparisons have not yet been conducted. Instead, GPT-3’s evaluation so far has focused on the qualitative shift in training regimens described above.
GPT-3 has been evaluated on over two dozen NLP tasks under “human-like” training conditions to test its generality. The tasks range from ones that use the language model directly (such as sentence completion) to tasks that use the model indirectly, and sometimes in mysterious ways, such as solving arithmetic problems. In most of these comparisons, GPT-3’s competition is a fine-tuned model, which might perform better than GPT-3 but suffers from the traditional problems of machine learning described above. The complete results on all of the NLP tasks are reported in the GPT-3 paper; some of the more notable results are summarized below.
Direct tests of language modeling
The LAMBADA test requires models to predict the last word of paragraph-length stories. In the zero-shot setting, GPT-3 achieves 76% accuracy, a gain of eight points on the previous state of the art.
The HellaSwag test involves picking the best ending to a story. In the one-shot setting, GPT-3 achieves 78% accuracy. This falls short of the current state of the art, the fine-tuned ALUM model, which achieves 85%.
The StoryCloze test involves selecting the sentence that best completes a five-sentence story out of multiple choices. GPT-3 achieves 83% in the zero-shot setting and 88% in the few-shot setting. This is four points lower than the best fine-tuned model, but it improves over previous zero-shot results by about 10 points.
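The two story-completion benchmarks reduce to the same underlying language-model operation: score each candidate ending and keep the likeliest. Below is a minimal sketch of that selection step, with a toy word-overlap scorer standing in for the model’s actual per-token log-probabilities:

```python
def pick_ending(score, story, candidates):
    """Return the candidate ending the scorer rates as most likely."""
    return max(candidates, key=lambda ending: score(story, ending))

def toy_score(context, continuation):
    # Stand-in scorer so the sketch runs: rewards reusing context words.
    # A real evaluation would instead sum the language model's per-token
    # log-probabilities for `continuation` given `context`.
    ctx = set(context.lower().split())
    words = continuation.lower().strip(".").split()
    return sum(w in ctx for w in words) / max(len(words), 1)

story = "Anna grabbed her umbrella because heavy rain was forecast."
endings = ["She stayed dry under the umbrella.", "She won the chess match."]
print(pick_ending(toy_score, story, endings))  # -> the umbrella ending
```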
Question answering
The question-answering task has traditionally been approached by first using an information retrieval (IR) system to find relevant passages of text in a corpus, then using a trained model to generate an answer from those passages. This approach is called “open book” question answering. GPT-3 was tested on the harder “closed book” task, which does not have the benefit of an IR system to reduce the search space. It was tested on three question-answering tasks.
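Schematically, the two setups differ only in whether a retrieval step precedes the model. In the sketch below, `search` and `generate` are hypothetical stand-ins for a real IR system and language model, not actual APIs:

```python
def open_book_answer(question, corpus, search, generate):
    # An IR step first narrows the corpus to a few relevant passages.
    passages = search(question, corpus, top_k=3)
    prompt = "\n".join(passages) + f"\nQ: {question}\nA:"
    return generate(prompt)

def closed_book_answer(question, generate):
    # No retrieval step: the model must answer from whatever knowledge
    # is stored in its parameters.
    return generate(f"Q: {question}\nA:")
```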
When tested using the TriviaQA reading comprehension dataset, GPT-3 achieves 71% in the few-shot setting. This beats the state-of-the-art system, which was fine-tuned for the task, by seven points.
On a harder task, the WebQuestions benchmark for question answering, GPT-3 achieves 41% in the few-shot setting. This is comparable to the best systems, which were all fine-tuned for the task. And on an even harder task, the Natural Questions dataset, GPT-3 achieves 30% in the few-shot setting, underperforming the best fine-tuned model by seven points.
Translation
GPT-3 was trained on texts drawn from English (93% by word count) and other languages (7%).
In the zero-shot setting, GPT-3 performs poorly on translation. In the one-shot setting it is nearly competitive. In the few-shot setting, GPT-3 improves to match the best unsupervised machine translation models.
Common sense reasoning
These tasks require physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad-knowledge question answering. The PhysicalQA dataset asks common-sense questions about the physical world, such as: “To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?” GPT-3 tops the leaderboard with a few-shot performance of 83%, as compared with human performance of 95%.
The AI2 Reasoning Challenge (ARC) involves multiple-choice questions taken from 3rd- to 9th-grade science exams. On its harder Challenge set, GPT-3 achieves 51% in the few-shot setting, underperforming the state-of-the-art fine-tuned models by 27 points.
Finally, the OpenBookQA dataset requires multi-step reasoning. GPT-3 improves significantly from zero- to few-shot settings but is still over 20 points short of the best fine-tuned model.
Arithmetic
GPT-3 was also tested on addition, subtraction, and multiplication problems. For addition and subtraction, the problems ranged from single digit (e.g. 3+5=?) to five digits (e.g. 46371-35790=?). For multiplication, only two-digit problems were presented.
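Here is a sketch of how such problems can be posed to a language model as plain text in the few-shot setting (the “Q:/A:” framing and the demonstration count are assumptions, not the paper’s exact prompt format):

```python
import random

def arithmetic_prompt(n_demos=3, digits=2, op="+"):
    """Build a few-shot arithmetic prompt ending in an unanswered question."""
    rng = random.Random(0)
    lines = []
    for _ in range(n_demos + 1):
        a = rng.randrange(10 ** (digits - 1), 10 ** digits)
        b = rng.randrange(10 ** (digits - 1), 10 ** digits)
        answer = a + b if op == "+" else a - b
        lines.append(f"Q: What is {a} {op} {b}? A: {answer}")
    # Drop the final answer; the model must supply it as a completion.
    lines[-1] = lines[-1].rsplit(" ", 1)[0]
    return "\n".join(lines)

print(arithmetic_prompt())
```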
GPT-3 performed well on small problems: it achieved 100% accuracy on two-digit addition, 99% on two-digit subtraction, 80% on three-digit addition, and 94% on three-digit subtraction. Its performance declined on larger problems, to 25% accuracy on four-digit operations and 9% on five-digit operations; even so, the above-chance results suggest at least some capacity to generalize to larger numbers of digits. It achieved 29% accuracy on two-digit multiplication, and 21% on single-digit combined operations (e.g. 9*(7+5)).
Some initial tests were conducted to confirm that GPT-3 was not relying solely on a memory of arithmetic facts. The training corpus was searched for all of the three-digit arithmetic problems in the test set. Out of 2,000 addition problems, only 17 (0.8%) were found to come directly (verbatim) from the training corpus; and out of 2,000 subtraction problems, only two (0.1%) were found in the corpus.
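A check of this kind amounts to a simple verbatim search, sketched below with tiny stand-ins for the actual corpus and test set:

```python
def contamination_rate(corpus_text, test_problems):
    """Fraction of test problems that appear verbatim in the training text."""
    hits = sum(problem in corpus_text for problem in test_problems)
    return hits / len(test_problems)

corpus_text = "... the sum 348 + 217 = 565 appears somewhere in the crawl ..."
test_problems = ["348 + 217 = 565", "512 - 129 = 383"]
print(f"{contamination_rate(corpus_text, test_problems):.1%}")  # 50.0%
```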
Does GPT-3 Usher in Natural Language Understanding and General AI?
Using the narrow definition of “language model” (i.e., a probability distribution over a sequence of words), GPT-3 is remarkably strong. It makes very accurate predictions when filling in blanks or extending a sequence of words, in ways that are sensible both syntactically and semantically.
Is that enough to enable natural language understanding, rather than just processing, or even general AI? It seems doubtful.
GPT-3’s lackluster performance on tasks that require even simple forms of common-sense inference and reasoning, as described above, is telling. Similar reservations have been voiced by other commentators. Nevertheless, GPT-3’s “pseudo-understanding” of textual information can improve current NLP systems, and may empower future ones in ways that have not yet been conceived.