OpenAI seeks new private AI training datasets to help them realise AGI

By Matthew Griffin Intelligence and the Senses 8th December 2023

WHY THIS MATTERS IN BRIEF

AI companies have trained their giant AI models on almost all of the freely available datasets, and now they want private datasets to help them reach AGI.

It’s an open secret that the data sets used to train Artificial Intelligence (AI) models are deeply flawed, which is one of the reasons we continue to see bias in many AI models. Image corpora tend to be US- and Western-centric, partly because Western images dominated the internet when the data sets were compiled. And as most recently highlighted by a study out of the Allen Institute for AI, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases.

Models amplify these flaws in harmful ways. Now, OpenAI says that it wants to combat them by partnering with outside institutions to create new, hopefully improved data sets – something that will also help alleviate the problem of running out of valuable AI training data, potentially as early as 2026.

This week OpenAI announced Data Partnerships, an effort to collaborate with third-party organizations to build public and private data sets for AI, as well as Artificial General Intelligence (AGI) model training.

In a blog post, OpenAI says Data Partnerships is intended to “enable more organizations to help steer the future of AI and AGI” and “benefit from models that are more useful.” In doing so, it would also make OpenAI’s AI models even better and, eventually, more profitable.

“To ultimately make [AI] that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures and languages, which requires as broad a training data set as possible,” OpenAI writes. “Including your content can make AI models more helpful to you by increasing their understanding of your domain.”

As a part of the Data Partnerships program, OpenAI says that it’ll collect “large-scale” data sets that “reflect human society” and that aren’t easily accessible online today. While the company plans to work across a wide range of modalities, including images, audio and video, it’s particularly seeking data that “expresses human intention” (e.g. long-form writing or conversations) across different languages, topics and formats.

OpenAI says it’ll work with organizations to digitize training data where needed, using a combination of Optical Character Recognition (OCR) and automatic speech recognition tools, and removing sensitive or personal information if necessary.

At the start, OpenAI’s looking to create two types of data sets: an open source data set that’d be public for anyone to use in AI model training, and a set of private data sets for training proprietary – or specialist – AI models. The private sets are intended for organizations that wish to keep their data private but want OpenAI’s models to have a better understanding of their domain, OpenAI says. So far, OpenAI’s worked with the Icelandic Government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic, and with the Free Law Project to improve its models’ understanding of legal documents.

“Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” OpenAI writes.

So, can OpenAI do better than the many data-set-building efforts that’ve come before it? I’m not so sure — minimizing data set bias is a problem that’s stumped many of the world’s experts. At the very least, I’d hope that the company’s transparent about the process — and about the challenges it inevitably encounters in creating these data sets.

Despite the blog post’s grandiose language, there also seems to be a clear commercial motivation here: to improve the performance of OpenAI’s models at the expense of others, and with no compensation to speak of for the data owners. I suppose that’s well within OpenAI’s right. But it seems a little tone deaf in light of open letters and lawsuits from creatives alleging that OpenAI’s trained many of its models on their work without their permission or payment.
