The Mad-Libs-esque pretraining task that BERT uses — called masked-language modeling — isn’t new. In fact, it’s been used as a tool for assessing language comprehension in humans for decades. For Google, it also offered a practical way of enabling bidirectionality in neural networks, as opposed to the unidirectional pretraining methods that had previously dominated the field. “Before BERT, unidirectional language modeling was the standard, even though it is an unnecessarily restrictive constraint,” said Kenton Lee, a research scientist at Google.
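To see what that fill-in-the-blank task actually looks like, here is a minimal sketch, assuming the publicly released bert-base-uncased checkpoint and the open-source Hugging Face transformers library as one convenient way to query a pretrained BERT (neither is described in this article; they are just a readily available stand-in):

```python
# A minimal look at masked-language modeling, assuming the publicly
# released bert-base-uncased checkpoint and the Hugging Face
# "transformers" library as a convenient interface to it.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Because BERT reads the whole sentence at once, words on both sides of
# the blank ("went to the ...", "... to see a doctor") inform its guess.
for prediction in fill_mask("She went to the [MASK] to see a doctor."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

The point of the exercise is the bidirectionality Lee describes: a left-to-right model could only use the words before the blank, while the masked task lets the network lean on context from both directions at once.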
Each of these three ingredients — a deep pretrained language model, attention and bidirectionality — existed independently before BERT. But until Google released its recipe in late 2018, no one had combined them in such a powerful way.
Refining the Recipe
Like any good recipe, BERT was soon adapted by cooks to their own tastes. In the spring of 2019, there was a period “when Microsoft and Alibaba were leapfrogging each other week by week, continuing to tune their models and trade places at the number one spot on the leaderboard,” Bowman recalled. When an improved version of BERT called RoBERTa first came on the scene in August, the DeepMind researcher Sebastian Ruder dryly noted the occasion in his widely read NLP newsletter: “Another month, another state-of-the-art pretrained language model.”
BERT’s “pie crust” incorporates a number of structural design decisions that affect how well it works. These include the size of the neural network being baked, the amount of pretraining data, how that pretraining data is masked and how long the neural network gets to train on it. Subsequent recipes like RoBERTa result from researchers tweaking these design decisions, much like chefs refining a dish.
In RoBERTa’s case, researchers at Facebook and the University of Washington increased some ingredients (more pretraining data, longer input sequences, more training time), took one away (a “next sentence prediction” task, originally included in BERT, that actually degraded performance) and modified another (they made the masked-language pretraining task harder). The result? First place on GLUE — briefly. Six weeks later, researchers from Microsoft and the University of Maryland added their own tweaks to RoBERTa and eked out a new win. As of this writing, yet another model called ALBERT, short for “A Lite BERT,” has taken GLUE’s top spot by further adjusting BERT’s basic design.
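One of RoBERTa's documented tweaks to the masking itself is easy to sketch. BERT's original masking was static: the blanks were chosen once, during preprocessing, and reused on every pass over the data. RoBERTa masks dynamically, drawing a fresh pattern each time a sentence is shown to the model, so the blanks keep moving and the same text yields more distinct prediction problems. A rough illustration in plain Python (the helper function is just for exposition; real implementations work on token IDs in batches, and 15 percent is BERT's standard masking rate):

```python
# A rough sketch of static vs. dynamic masking in plain Python.
import random

def mask_tokens(tokens, mask_prob=0.15):
    """Replace roughly mask_prob of the tokens with the [MASK] symbol."""
    return [t if random.random() > mask_prob else "[MASK]" for t in tokens]

sentence = "scientific studies have shown a link between smoking and cancer".split()

# Static masking (BERT): the pattern is drawn once, before training,
# so the model sees the same blanks on every pass over the data.
static_pattern = mask_tokens(sentence)
for epoch in range(3):
    print(f"epoch {epoch} (static) :", " ".join(static_pattern))

# Dynamic masking (RoBERTa): a fresh pattern is drawn each time the
# sentence is served to the model, so the blanks keep moving around.
for epoch in range(3):
    print(f"epoch {epoch} (dynamic):", " ".join(mask_tokens(sentence)))
```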
“We’re still figuring out what recipes work and which ones don’t,” said Facebook’s Ott, who worked on RoBERTa.
Still, just as perfecting your pie-baking technique isn’t likely to teach you the principles of chemistry, incrementally optimizing BERT doesn’t necessarily impart much theoretical knowledge about advancing NLP. “I’ll be perfectly honest with you: I don’t follow these papers, because they are extremely boring to me,” said Linzen, the computational linguist from Johns Hopkins. “There is a scientific puzzle there,” he grants, but it doesn’t lie in figuring out how to make BERT and all its spawn smarter, or even in figuring out how they got smart in the first place. Instead, “we are trying to understand to what extent these models are really understanding language,” he said, and not “picking up weird tricks that happen to work on the data sets that we commonly evaluate our models on.”
In other words: BERT is doing something right. But what if it’s for the wrong reasons?
Clever but Not Smart
In July 2019, two researchers from Taiwan’s National Cheng Kung University used BERT to achieve an impressive result on a relatively obscure natural language understanding benchmark called the argument reasoning comprehension task. Performing the task requires selecting the appropriate implicit premise (called a warrant) that will back up a reason for arguing some claim. For example, to argue that “smoking causes cancer” (the claim) because “scientific studies have shown a link between smoking and cancer” (the reason), you need to presume that “scientific studies are credible” (the warrant), as opposed to “scientific studies are expensive” (which may be true, but makes no sense in the context of the argument). Got all that?
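Mechanically, plugging a model like BERT into a task like this is straightforward, which is part of its appeal. A deliberately simplified sketch, assuming the Hugging Face transformers library (the generic bert-base-uncased checkpoint below stands in for a model actually fine-tuned on the argument reasoning data, which is what you would need for meaningful scores, and the scoring scheme is illustrative rather than the Taiwanese team's exact setup):

```python
# A simplified sketch of framing warrant selection as sentence-pair scoring.
# Assumes the Hugging Face "transformers" library; "bert-base-uncased" here
# stands in for a checkpoint fine-tuned on the argument reasoning task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

claim_and_reason = ("Smoking causes cancer because scientific studies "
                    "have shown a link between smoking and cancer.")
warrants = ["Scientific studies are credible.",
            "Scientific studies are expensive."]

scores = []
for warrant in warrants:
    # BERT reads the argument and the candidate warrant as one sentence pair.
    inputs = tokenizer(claim_and_reason, warrant, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Treat the logit for the "supported" class as the warrant's score.
    scores.append(logits[0, 1].item())

print("chosen warrant:", warrants[scores.index(max(scores))])
```

Each candidate warrant is scored against the claim-plus-reason, and the higher-scoring one wins. The fill-in-the-classifier pattern, pretrained BERT underneath and a thin task-specific layer on top, is the standard way models like this get applied to benchmarks of this kind.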