Researchers warn we could run out of data to train AIs by 2026

By Matthew Griffin Intelligence and the Senses 7th December 2023

WHY THIS MATTERS IN BRIEF

Today’s largest AIs have been trained on almost all of the freely available data on the internet and in open source datasets, and they’re running out of new data to learn from.

As Artificial Intelligence (AI) reaches the peak of its popularity, researchers have warned that the industry might be running out of training data, the fuel that runs powerful AI systems. This could slow the growth of AI models, especially large language models like ChatGPT and GPT-4, and may even alter the trajectory of the AI revolution. It could also slow the growth of malicious criminal AIs such as DarkBertGPT, which I talked about recently, and in that case it might be a good thing.

But why is a potential lack of data an issue, considering how much there is on the web? And is there a way to address the risk?

We need a lot of data to train powerful, accurate, and high-quality AI algorithms. For instance, the algorithm powering ChatGPT was originally trained on 570 gigabytes of text data, or about 300 billion words.

Similarly, the Stable Diffusion algorithm, which is behind many AI image-generating apps, was trained on the LAION-5B dataset, which comprises 5.8 billion image-text pairs. If an algorithm is trained on an insufficient amount of data, it will produce inaccurate or low-quality outputs.

The quality of the training data is also important, which is one of the reasons why AI giant OpenAI is now trying to create new data partnerships. Low-quality data such as social media posts or blurry photographs is easy to source but isn’t sufficient to train high-performing AI models.

Text taken from social media platforms might be biased or prejudiced, or may include disinformation or illegal content which could be replicated by the model. For example, when Microsoft tried to train its AI bot using Twitter content, it learned to produce racist and misogynistic outputs.

This is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. The Google Assistant was trained on 11,000 romance novels taken from self-publishing site Smashwords to make it more conversational.

The AI industry has been training AI systems on ever-larger datasets, which is why we now have high-performing models such as ChatGPT or DALL-E 3. At the same time, research shows online data stocks are growing much more slowly than datasets used to train AI.

In a paper published last year, a group of researchers predicted we will run out of high-quality text data before 2026 if current AI training trends continue. They also estimated low-quality language data will be exhausted sometime between 2030 and 2050, and low-quality image data between 2030 and 2060.
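The reasoning behind this kind of forecast can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' method: the function name `exhaustion_year` and every number in the usage example are hypothetical, chosen only to show how a fast-growing training dataset eventually catches up with a slowly growing stock of data.

```python
def exhaustion_year(start_year, stock, stock_growth,
                    dataset, dataset_growth, max_years=200):
    """Return the first year in which the training dataset is at least as
    large as the available data stock, or None if that does not happen
    within max_years.

    stock and dataset share the same unit (for example, words).
    stock_growth and dataset_growth are yearly growth rates, e.g. 0.07 = 7%.
    All inputs are assumptions supplied by the caller.
    """
    year = start_year
    for _ in range(max_years + 1):
        if dataset >= stock:
            return year
        # Compound both quantities by one year.
        stock *= 1 + stock_growth
        dataset *= 1 + dataset_growth
        year += 1
    return None


# Hypothetical inputs: a stock 30 times larger than today's dataset,
# growing 7% a year, while the dataset grows 50% a year.
print(exhaustion_year(2022, stock=30.0, stock_growth=0.07,
                      dataset=1.0, dataset_growth=0.5))
```

The outcome depends entirely on the assumed growth rates, which is why such forecasts are given as ranges rather than single dates.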

AI could contribute up to $15.7 trillion to the world economy by 2030, according to accounting and consulting group PwC. But running out of usable data could slow down its development.

While the above points might alarm some AI fans, the situation may not be as bad as it seems. There are many unknowns about how AI models will develop in the future, as well as a few ways to address the risk of data shortages.

One opportunity is for AI developers to improve algorithms so they use the data they already have more efficiently.

It’s likely in the coming years they will be able to train high-performing AI systems using less data, and possibly less computational power. This would also help reduce AI’s carbon footprint.

Another option is to use AI to create synthetic data to train systems. In other words, developers can simply generate the data they need, curated to suit their particular AI model.

Several projects are already using synthetic content, often sourced from data-generating services such as Mostly AI. This will become more common in the future.

Developers are also searching for content outside the free online space, such as that held by large publishers and offline repositories. Think about the millions of texts published before the internet. Made available digitally, they could provide a new source of data for AI projects.

News Corp, one of the world’s largest news content owners, which has much of its content behind a paywall, recently said it was negotiating content deals with AI developers. Such deals to access hitherto private datasets would force AI companies to pay for training data, whereas so far they have mostly scraped it off the internet for free.

Content creators have protested against the unauthorized use of their content to train AI models, with some suing companies such as Microsoft, OpenAI, and Stability AI. Being remunerated for their work may help redress some of the power imbalance that exists between creatives and AI companies.
