AI models are proving to be more and more powerful as they increase in size, and performance improvements from scale have not yet plateaued, according to researchers at Google.
While neural networks have grown, are they really any smarter? Companies are making larger and larger language-processing systems, though they still suffer from the same weaknesses: they can generate toxic, biased, and inaccurate text. Experts have argued against making language models larger, comparing the technology to “stochastic parrots” that don’t understand language and simply regurgitate patterns seen in training data.
The algorithms may spit out racist remarks, produce misinformation, or memorize personally identifiable information. The safety and ethical risks involved in building such systems increase as they grow in size, prompting academics to argue against scaling up: it’s just making a bad situation worse. Some believe more time and effort should be spent inventing new algorithms that are smaller and less computationally intensive, instead of just making existing architectures larger.
A 540-billion-parameter, transformer-based text-processing-and-generating system just built by researchers at Google, however, shows the performance of language models can still improve with size.
“We evaluated [Pathways Language Model] (PaLM) on hundreds of language understanding and generation tasks, and found that it achieves state-of-the-art few-shot performance across most tasks, by significant margins in many cases,” Sharan Narang and Aakanksha Chowdhery, software engineers at Google Research, said.
PaLM was better at a wide range of tasks, from question-answering and reading comprehension to common sense reasoning, than OpenAI’s GPT-3, Nvidia and Microsoft’s Megatron-Turing NLG, and DeepMind’s Chinchilla and Gopher language models, the Googlers claimed. PaLM is bigger, and contains more parameters than all of these models.
It can also generate code, and seems to perform comparably to OpenAI’s Codex 12B model despite being trained on less Python code, according to results published in a recent paper [PDF].
PaLM excels in another area: training efficiency. It was trained using 6,144 chips across two Cloud TPU v4 Pods, Google’s largest training system configuration to date. According to the team, the software was more efficient to train than other language models.
“The goal is always to optimize the parallelism strategy, model architecture, and compiler implementation together to maximize the FLOPS utilization,” Aakanksha Chowdhery told The Register.
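For a rough sense of what FLOPS utilization measures, here is a minimal Python sketch. It assumes the common rule of thumb that training a dense transformer costs about six floating-point operations per parameter per token. The function name and every input figure below are hypothetical illustrations, not numbers reported by Google.

```python
def flops_utilization(n_params, tokens_per_second, n_chips, peak_flops_per_chip):
    """Fraction of the hardware's theoretical peak throughput used in training.

    Assumes roughly 6 FLOPs per parameter per token (forward plus backward
    pass), an approximation that ignores attention-specific costs.
    """
    achieved_flops = 6 * n_params * tokens_per_second
    peak_flops = n_chips * peak_flops_per_chip
    return achieved_flops / peak_flops


# Hypothetical inputs chosen for illustration only.
utilization = flops_utilization(
    n_params=1e9,
    tokens_per_second=50_000,
    n_chips=8,
    peak_flops_per_chip=100e12,
)
print(f"{utilization:.1%}")
```

The closer the ratio gets to 1, the less of the hardware's capacity is lost to communication, idle time, and other overheads.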
Despite PaLM’s capabilities, it still generates offensive and untruthful text and reflects biases in its training data. For example, it is more likely to associate Muslims with violence or terrorism stereotypes. Like other language models, PaLM was trained on text scraped from the internet. In fact, 50 percent of its training data come from conversations on social media websites.
“Our analysis reveals that our training data, and consequently PaLM, do reflect various social stereotypes and toxicity associations around identity terms,” the team admitted in the paper. “Removing these associations, however, is non-trivial; for instance, filtering off content that is deemed toxic by an automated tool may disproportionately exclude content about or authored by marginalized subgroups in the training data.”
PaLM’s capabilities and limitations are partly due to it memorizing snippets of its training data. It has a memorization rate of 40 percent for examples that appear more than 500 times in the dataset, compared with 0.75 percent for an example that appears once. Memorization is a double-edged sword: it’s useful for recalling factual information, but it also makes the system more likely to learn prejudices.
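To illustrate how a memorization rate of this kind can be measured, the Python sketch below counts an example as memorized when a model, prompted with the start of a training sequence, reproduces the rest verbatim. That definition, the prompt length, and the `generate` callable are assumptions made for illustration rather than details taken from this article, and the toy "model" is just a lookup of one remembered sequence.

```python
def memorization_rate(examples, generate, prompt_len=50):
    """Share of examples whose continuation the model reproduces exactly.

    `examples` is a list of token lists; `generate(prompt, n)` is a
    hypothetical callable returning the model's next `n` tokens.
    """
    if not examples:
        return 0.0
    memorized = 0
    for tokens in examples:
        prompt, target = tokens[:prompt_len], tokens[prompt_len:]
        if generate(prompt, len(target)) == target:
            memorized += 1
    return memorized / len(examples)


# Toy stand-in for a model: it has "memorized" only one sequence.
seen = [1, 2, 3, 4, 5, 6]
unseen = [9, 8, 7, 6, 5, 4]


def toy_generate(prompt, n):
    # Recite the remembered sequence if the prompt matches its start.
    if prompt == seen[:len(prompt)]:
        return seen[len(prompt):len(prompt) + n]
    return [0] * n


rate = memorization_rate([seen, unseen], toy_generate, prompt_len=3)
print(rate)
```

Running the same check separately on examples grouped by how often they appear in the dataset would give per-frequency rates comparable in kind to the figures quoted above.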
Still, the researchers claim PaLM “shows breakthrough capabilities on numerous very difficult tasks.” It is apparently able to explain jokes, solve multi-step arithmetic problems, and repair broken code. “Further understanding of risks and benefits of these models is a topic of ongoing research, together with developing scalable solutions that can put guardrails against malicious uses of language models,” Narang and Chowdhery said.
PaLM is being used for research purposes. Googlers developed the model as a proof of concept for scaling up a language model using the company’s Pathways architecture. The goal is to experiment with the new technique to one day build a single AI system that can generalize across thousands or millions of tasks and is trained on different types of data. ®