Google just built a record breaking 1.6 Trillion parameter AI language model

By Matthew Griffin Intelligence and the Senses 26th January 2021

WHY THIS MATTERS IN BRIEF

The theory goes that the larger the AI language model the better that AI will be at translating languages and holding natural human-like conversations.

Love the Exponential Future? Join our XPotential Community, future proof yourself with courses from XPotential University, connect, watch a keynote, or browse my blog.

Parameters are the key to machine learning algorithms, and arguably the more they have the better they are at accomplishing specific tasks. They’re the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well.

Facebook's bots unexpectedly created their own language

For example, OpenAI’s GPT-3 language model, which is one of the largest language models ever trained at 175 billion parameters, can write articles, convince bloggers it’s human, help cheating students get passing grades on coursework, and even complete basic code. All for starters.

In what might be one of the most comprehensive tests of this correlation to date, Google researchers developed and benchmarked techniques they claim enabled them to train a language model containing more than a trillion parameters. They say their 1.6 trillion parameter model, which appears to be the largest of its size to date, achieved an up to 4 times speedup over the previously largest Google-developed language model (T5-XXL). All of which means that Google’s AI could very well be the world’s best performing language AI.

A controversial AI can allegedly reveal your personality from just a photo

As the researchers note in a paper detailing their work, large-scale training like this is an effective path toward powerful models. Simple architectures, backed by large datasets and parameter counts, surpass far more complicated algorithms. But effective, large-scale training is extremely computationally intensive. That’s why the researchers pursued what they call the Switch Transformer, a “sparsely activated” technique that uses only a subset of a model’s weights, or the parameters that transform input data within the model.

The Switch Transformer builds on a mix of experts, an AI model paradigm first proposed in the early ’90s. The rough concept is to keep multiple experts, or models specialized in different tasks, inside a larger model and have a “gating network” choose which experts to consult for any given data.

The novelty of the Switch Transformer is that it efficiently leverages hardware designed for dense matrix multiplications — mathematical operations widely used in language models — such as GPUs and Google’s tensor processing units (TPUs). In the researchers’ distributed training setup, their models split unique weights on different devices so the weights increased with the number of devices but maintained a manageable memory and computational footprint on each device.

New study suggests that AI's experience cognitive decline as they age

In an experiment, the researchers pretrained several different Switch Transformer models using 32 TPU cores on the Colossal Clean Crawled Corpus, a 750GB-sized dataset of text scraped from Reddit, Wikipedia, and other web sources. They tasked the models with predicting missing words in passages where 15 percent of the words had been masked out, as well as other challenges, like retrieving text to answer a list of increasingly difficult questions.

The researchers claim their 1.6-trillion-parameter model with 2,048 experts (Switch-C) exhibited “no training instability at all,” in contrast to a smaller model (Switch-XXL) containing 395 billion parameters and 64 experts. However, on one benchmark — the Sanford Question Answering Dataset (SQuAD) — Switch-C scored lower (87.7) versus Switch-XXL (89.6), which the researchers attribute to the opaque relationship between fine-tuning quality, computational requirements, and the number of parameters.

Sony's newest TV has it's own AI brain to improve your viewing pleasure

This being the case, the Switch Transformer led to gains in a number of downstream tasks. For example, it enabled an over 7 times pre-training speedup while using the same amount of computational resources, according to the researchers, who demonstrated that the large sparse models could be used to create smaller, dense models fine-tuned on tasks with 30 percent of the quality gains of the larger model. In one test where a Switch Transformer model was trained to translate between over 100 different languages, the researchers observed “a universal improvement” across its ability to translate 101 languages, with 91 percent of the languages benefitting from an over 4 times speedup compared with a baseline model.

“Though this work has focused on extremely large models, we also find that models with as few as two experts improve performance while easily fitting within memory constraints of commonly available GPUs or TPUs,” the researchers wrote in the paper. “We cannot fully preserve the model quality, but compression rates of 10 to 100 times are achievable by distilling our sparse models into dense models while achieving circa 30 percent of the quality gain of the expert model.”

Coinbase CEO says AI should have its own crypto wallets

In future work, the researchers plan to apply the Switch Transformer to “new and across different modalities,” including image and text. They believe that model sparsity can confer advantages in a range of different media, as well as multimodal models.

Unfortunately, the researchers’ work didn’t take into account the impact of these large language models in the real world. Models often amplify the biases encoded in this public data; a portion of the training data is not uncommonly sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.” Other studies, like one published in April by Intel, MIT, and Canadian AI initiative CIFAR researchers, have found high levels of stereotypical bias from some of the most popular models, including Google’s BERT and XLNet, OpenAI’s GPT-2, and Facebook’s RoBERTa. This bias could be leveraged by malicious actors to foment discord by spreading misinformation, disinformation, and outright lies that “radicalize individuals into violent far-right extremist ideologies and behaviors,” according to the Middlebury Institute of International Studies.

World's first vaccine entirely designed by an AI heads to human trials

It’s unclear whether Google’s policies on published machine learning research might have played a role in this. Reuters reported late last year that researchers at the company are now required to consult with legal, policy, and public relations teams before pursuing topics such as face and sentiment analysis and categorizations of race, gender, or political affiliation. And in early December, Google fired AI ethicist Timnit Gebru, reportedly in part over a research paper on large language models that discussed risks, including the impact of their carbon footprint on marginalized communities and their tendency to perpetuate abusive language, hate speech, microaggressions, stereotypes, and other dehumanizing language aimed at specific groups of people.

Matthew Griffin / About Author

Matthew Griffin is a multi-award winning Futurist and expert in Disruption and Innovation, Geopolitics, Leadership, and Technology, who NASA have described as a "walking encyclopaedia of the future" and a "futurist Polymath." 15-time best selling author of the "Codex of the Future" series, Matthew is the Founder and Futurist in Chief of the 311 Institute, a global Futures and Deep Futures advisory firm working with royal households, world leaders, G7, G20, and G77 governments, NGOs, and multi-national mid and mega cap firms to help them explore, shape, and lead the next 50 years of business and society.

An award-winning YouTube creator with over a million followers, with an unrivalled global reach and impact, Matthew is a highly sought-after international keynote speaker, lecturer, and mentor who collaborates with global leaders through the United Nations Alliance of Civilizations (UNAOC) and United Nations General Assembly (UNGA) to shape pivotal initiatives such as the UN’s AI for Humanity program, the United Nations Conference of the Parties (UN COP), and the World Economic Forum in Davos.

As the former Global Head of Cloud, National Security, and Enterprise Sales for companies including Atos, Dell-EMC, and IBM, Matthew has a proven track record of building multi-billion dollar business units and turning failing divisions into market leaders. His ability to identify, analyse, and communicate the implications of hundreds of emerging technologies and trends is unparalleled, and his insights are trusted by many of the world’s most respected organisations, including ABB, Accenture, Adidas, AON, ARM, BCG, Centrica, Citi, Coca-Cola, Dentons, Deloitte, Dow Jones, EY, Google, KPMG, Lego, Legal & General, LinkedIn, Microsoft, PepsiCo, Qualcomm, RWE, Samsung, Siemens AG and Siemens Energy, T-Mobile, UBS, VISA, Walmart, Workday, Worldpay and many others.

Regularly featured in the global media including the AP, BBC, Bloomberg, CNBC, Discovery, Forbes, Khaleej Times, Telegraph, TIME, ViacomCBS, WIRED, and the WSJ, Matthews mission is to help organisations create a fair and sustainable future whose benefits are shared by everyone irrespective of their ability, background, or circumstances.