Ethics report: Why were Timnit Gebru and Margaret Mitchell fired from Google? 1-page group report with references.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Emily M. Bender∗
[email protected]
University of Washington
Seattle, WA, USA

Timnit Gebru∗
[email protected]
Black in AI
Palo Alto, CA, USA

Angelina McMillan-Major
[email protected]
University of Washington
Seattle, WA, USA

Shmargaret Shmitchell
[email protected]
The Aether

∗Joint first authors

ABSTRACT
The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

CCS CONCEPTS
• Computing methodologies → Natural language processing.

ACM Reference Format:
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Conference on Fairness, Accountability, and Transparency (FAccT '21), March 3–10, 2021, Virtual Event, Canada. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3442188.3445922

1 INTRODUCTION
One of the biggest trends in natural language processing (NLP) has been the increasing size of language models (LMs) as measured by the number of parameters and size of training data. Since 2018 alone, we have seen the emergence of BERT and its variants [39, 70, 74, 113, 146], GPT-2 [106], T-NLG [112], GPT-3 [25], and most recently Switch-C [43], with institutions seemingly competing to produce ever larger LMs. While investigating properties of LMs and how they change with size holds scientific interest, and large LMs have shown improvements on various tasks (§2), we ask whether enough thought has been put into the potential risks associated with developing them and strategies to mitigate these risks.

We first consider environmental risks. Echoing a line of recent work outlining the environmental and financial costs of deep learning systems [129], we encourage the research community to prioritize these impacts. One way this can be done is by reporting costs and evaluating works based on the amount of resources they consume [57]. As we outline in §3, increasing the environmental and financial costs of these models doubly punishes marginalized communities that are least likely to benefit from the progress achieved by large LMs and most likely to be harmed by the negative environmental consequences of their resource consumption. At the scale we are discussing (outlined in §2), the first consideration should be the environmental cost.

Just as environmental impact scales with model size, so does the difficulty of understanding what is in the training data. In §4, we discuss how large datasets based on texts from the Internet overrepresent hegemonic viewpoints and encode biases potentially damaging to marginalized populations.
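To make the cost-reporting recommendation above concrete, the following is a minimal sketch of the kind of accounting that [57, 129] call for. All of the figures (accelerator count, power draw, training time, datacenter PUE, grid carbon intensity, and electricity price) are illustrative placeholders of our own choosing, not measurements from any cited work.

# Back-of-the-envelope accounting for one training run (illustrative numbers only).
NUM_ACCELERATORS = 512     # assumed number of GPUs/TPUs used for training
POWER_DRAW_KW = 0.3        # assumed average draw per accelerator, in kW
TRAINING_DAYS = 30         # assumed wall-clock training time
PUE = 1.5                  # assumed datacenter power usage effectiveness
CO2_KG_PER_KWH = 0.4       # assumed grid carbon intensity
USD_PER_KWH = 0.10         # assumed electricity price

hours = TRAINING_DAYS * 24
energy_kwh = NUM_ACCELERATORS * POWER_DRAW_KW * hours * PUE
print(f"Energy consumed:  {energy_kwh:,.0f} kWh")
print(f"CO2 emitted:      {energy_kwh * CO2_KG_PER_KWH / 1000:,.1f} tonnes")
print(f"Electricity cost: ${energy_kwh * USD_PER_KWH:,.0f}")

Reporting numbers of this kind alongside accuracy results is one way to let the community weigh performance gains against resource consumption.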
In collecting ever larger datasets we risk incurring documentation debt. We recommend mitigating these risks by budgeting for curation and documentation at the start of a project and only creating datasets as large as can be sufficiently documented.

As argued by Bender and Koller [14], it is important to understand the limitations of LMs and put their success in context. This not only helps reduce hype which can mislead the public and researchers themselves regarding the capabilities of these LMs, but might encourage new research directions that do not necessarily depend on having larger LMs. As we discuss in §5, LMs are not performing natural language understanding (NLU), and only have success in tasks that can be approached by manipulating linguistic form [14]. Focusing on state-of-the-art results on leaderboards without encouraging deeper understanding of the mechanism by which they are achieved can cause misleading results as shown in [21, 93] and direct resources away from efforts that would facilitate long-term progress towards natural language understanding, without using unfathomable training data.

Furthermore, the tendency of human interlocutors to impute meaning where there is none can mislead both NLP researchers and the general public into taking synthetic text as meaningful. Combined with the ability of LMs to pick up on both subtle biases and overtly abusive language patterns in training data, this leads to risks of harms, including encountering derogatory language and experiencing discrimination at the hands of others who reproduce racist, sexist, ableist, extremist or other harmful ideologies reinforced through interactions with synthetic language. We explore these potential harms in §6 and potential paths forward in §7.

We hope that a critical overview of the risks of relying on ever-increasing size of LMs as the primary driver of increased performance of language technology can facilitate a reallocation of efforts towards approaches that avoid some of these risks while still reaping the benefits of improvements to language technology.

2 BACKGROUND
Similar to [14], we understand the term language model (LM) to refer to systems which are trained on string prediction tasks: that is, predicting the likelihood of a token (character, word or string) given either its preceding context or (in bidirectional and masked LMs) its surrounding context. Such systems are unsupervised and when deployed, take a text as input, commonly outputting scores or string predictions. Initially proposed by Shannon in 1949 [117], some of the earliest implemented LMs date to the early 1980s and were used as components in systems for automatic speech recognition (ASR), machine translation (MT), document classification, and more [111]. In this section, we provide a brief overview of the general trend of language modeling in recent years. For a more in-depth survey of pretrained LMs, see [105].
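As a concrete illustration of the string-prediction framing above, and of the pre-neural n-gram models discussed next, the following is a minimal sketch of a bigram LM in Python. The toy corpus and the absence of smoothing are our own simplifications, not details taken from any of the works cited here.

from collections import Counter, defaultdict

# Toy corpus standing in for the billions of tokens real LMs are trained on.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often each token follows each context token.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_token_distribution(prev):
    """P(next token | previous token), estimated by relative frequency (no smoothing)."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# The core LM task: score possible continuations of a given context.
print(next_token_distribution("the"))   # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(next_token_distribution("sat"))   # {'on': 1.0}

Larger n-gram orders and larger corpora sharpen these estimates, which is precisely the scaling pressure described in the next paragraph.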
Before neural models, n-gram models also used large amounts of data [20, 87]. In addition to ASR, these large n-gram models of English were developed in the context of machine translation from another source language with far fewer direct translation examples. For example, [20] developed an n-gram model for English with a total of 1.8T n-grams and noted steady improvements in BLEU score on the test set of 1797 Arabic translations as the training data was increased from 13M tokens.

The next big step was the move towards using pretrained representations of the distribution of words (called word embeddings) in other (supervised) NLP tasks. These word vectors came from systems such as word2vec [85] and GloVe [98] and later LSTM models such as context2vec [82] and ELMo [99] and supported state of the art performance on question answering, textual entailment, semantic role labeling (SRL), coreference resolution, named entity recognition (NER), and sentiment analysis, at first in English and later for other languages as well. While training the word embeddings required a (relatively) large amount of data, it reduced the amount of labeled data necessary for training on the various supervised tasks. For example, [99] showed that a model trained with ELMo reduced the necessary amount of training data needed to achieve similar results on SRL compared to models without, as shown in one instance where a model trained with ELMo reached the maximum development F1 score in 10 epochs as opposed to 486 without ELMo. This model furthermore achieved the same F1 score with 1% of the data as the baseline model achieved with 10% of the training data. Increasing the number of model parameters, however, did not yield noticeable increases for LSTMs [e.g. 82].

Transformer models, on the other hand, have been able to continuously benefit from larger architectures and larger quantities of data. Devlin et al. [39] in particular noted that training on a large dataset and fine-tuning for specific tasks leads to strictly increasing results on the GLUE tasks [138] for English as the hyperparameters of the model were increased. Initially developed as Chinese LMs, the ERNIE family [130, 131, 145] produced ERNIE-Gen, which was also trained on the original (English) BERT dataset, joining the ranks of very large LMs. NVIDIA released the MegatronLM which has 8.3B parameters and was trained on 174GB of text from the English Wikipedia, OpenWebText, RealNews and CC-Stories datasets [122]. Trained on the same dataset, Microsoft released T-NLG,[1] an LM with 17B parameters. OpenAI's GPT-3 [25] and Google's GShard [73] and Switch-C [43] have pushed the scale of large LMs up by orders of magnitude, to 175B, 600B, and 1.6T parameters, respectively. Table 1 summarizes a selection of these LMs in terms of training data size and parameters. As increasingly large amounts of text are collected from the web in datasets such as the Colossal Clean Crawled Corpus [107] and the Pile [51], this trend of increasingly large LMs can be expected to continue as long as they correlate with an increase in performance.

Year  Model                     # of Parameters  Dataset Size
2019  BERT [39]                 3.4E+08          16GB
2019  DistilBERT [113]          6.60E+07         16GB
2019  ALBERT [70]               2.23E+08         16GB
2019  XLNet (Large) [150]       3.40E+08         126GB
2020  ERNIE-Gen (Large) [145]   3.40E+08         16GB
2019  RoBERTa (Large) [74]      3.55E+08         161GB
2019  MegatronLM [122]          8.30E+09         174GB
2020  T5-11B [107]              1.10E+10         745GB
2020  T-NLG [112]               1.70E+10         174GB
2020  GPT-3 [25]                1.75E+11         570GB
2020  GShard [73]               6.00E+11         –
2021  Switch-C [43]             1.57E+12         745GB

Table 1: Overview of recent large language models
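To give a sense of what the parameter counts in Table 1 mean in practice, the following sketch converts a few of them into rough storage requirements for the model weights alone. The bytes-per-parameter figures are our own assumptions (typical 32-bit and 16-bit float widths), and the numbers ignore optimizer state, activations, and sharding, so they understate what training actually requires.

# Rough storage needed just to hold the weights of some models from Table 1.
# Bytes-per-parameter values are assumptions (fp32 = 4 bytes, fp16 = 2 bytes).
models = {
    "BERT [39]": 3.4e8,
    "GPT-3 [25]": 1.75e11,
    "Switch-C [43]": 1.57e12,
}

for name, params in models.items():
    fp32_gb = params * 4 / 1e9
    fp16_gb = params * 2 / 1e9
    print(f"{name:15s} {params:.2e} params  ~{fp32_gb:8.1f} GB fp32  ~{fp16_gb:8.1f} GB fp16")

Even at half precision, the largest models in Table 1 cannot fit on a single accelerator, which is part of why their training and serving costs, and hence their environmental footprint, grow so quickly.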
A number of these models also have multilingual variants such as mBERT [39] and mT5 [148] or are trained with some amount of multilingual data, such as GPT-3, where 7% of the training data was not in English [25]. The performance of these multilingual models across languages is an active area of research. Wu and Dredze [144] found that while mBERT does not perform equally well across all 104 languages in its training data, it performed better at NER, POS tagging, and dependency parsing than monolingual models trained with comparable amounts of data for four low-resource languages. Conversely, [95] surveyed monolingual BERT models developed with more specific architecture considerations or additional monolingual data and found that they generally outperform mBERT.

[1] https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/