OpenAI seeks new private AI training datasets to help them realise AGI

0 12

By Matthew Griffin Intelligence and the Senses 8th December 2023

WHY THIS MATTERS IN BRIEF

AI companies have trained their giant AI models on almost all of the freely available datasets, and now they want private datasets to help them reach AGI.

Love the Exponential Future? Join our XPotential Community, future proof yourself with courses from XPotential University, read about exponential tech and trends, connect, watch a keynote, or browse my blog.

It’s an open secret that the data sets used to train Artificial Intelligence (AI) models are quite deeply flawed, which is just one of the reasons why we continue to see bias in many AI models. Image corpora tends to be US and Western-centric, partly because Western images dominated the internet when the data sets were compiled. And as most recently highlighted by a study out of the Allen Institute for AI, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases.

New first as malware bricks Baracuda email gateways forcing a full rip out

Models amplify these flaws in harmful ways. Now, OpenAI says that it wants to combat them by partnering with outside institutions to create new, hopefully improved data sets – something that will also help alleviate the problem of running out of valuable AI training data by as early as 2026 …

The Future of AI, by Futurist Keynote Matthew Griffin

This week OpenAI announced Data Partnerships, an effort to collaborate with third-party organizations to build public and private data sets for AI, as well as Artificial General Intelligence (AGI) model training.

In a blog post, OpenAI says Data Partnerships is intended to “enable more organizations to help steer the future of AI and AGI” and “benefit from models that are more useful.” And in doing so make OpenAI’s AI models even better – and eventually more profitable.

Pittsburgs AI traffic lights help commuters get there faster

“To ultimately make [AI] that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures and languages, which requires as broad a training data set as possible,” OpenAI writes. “Including your content can make AI models more helpful to you by increasing their understanding of your domain.”

As a part of the Data Partnerships program, OpenAI says that it’ll collect “large-scale” data sets that “reflect human society” and that aren’t easily accessible online today. While the company plans to work across a wide range of modalities, including images, audio and video, it’s particularly seeking data that “expresses human intention” (e.g. long-form writing or conversations) across different languages, topics and formats.

OpenAI says it’ll work with organizations to digitize training data if necessary, using a combination of Optical Character Recognition (OCR) and automatic speech recognition tools and removing sensitive or personal information if necessary.

Researchers hacked a Tesla's autopilot using three stickers on the road

At the start, OpenAI’s looking to create two types of data sets: an open source data set that’d be public for anyone to use in AI model training and a set of private data sets for training proprietary – or specialist – AI models. The private sets are intended for organizations that wish to keep their data private but want OpenAI’s models to have a better understanding of their domain, OpenAI says; so far, OpenAI’s worked with the Icelandic Government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic and with the Free Law Project to improve its models’ understanding of legal documents.

“Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” OpenAI writes.

So, can OpenAI do better than the many data-set-building efforts that’ve come before it? I’m not so sure — minimizing data set bias is a problem that’s stumped many of the world’s experts. At the very least, I’d hope that the company’s transparent about the process — and about the challenges it inevitably encounters in creating these data sets.

Jeff bezos announces target to build the world's first Moon base in 2023

Despite the blog post’s grandiose language, there also seems to be a clear commercial motivation, here, to improve the performance of OpenAI’s models at the expense of others — and without compensation to the data owners to speak of. I suppose that’s well within OpenAI’s right. But it seems a little tone deaf in light of open letters and lawsuits from creatives alleging that OpenAI’s trained many of its models on their work without their permission or payment.

Matthew Griffin / About Author

Matthew Griffin, multi-award winning Futurist and named Futurist of the Year 2024, has been described as a "Walking encyclopaedia of the future" by NASA and a futurist polymath. One of the world's most renowned futurists and strategic foresight experts Matthew is the 15 times author of the blockbuster "Codex of the Future" series, and is the Founder and Futurist in Chief of the 311 Institute, a global Futures and Deep Futures advisory firm working across the next 50 years, XPotential University, the world's first free futures and foresight university, and the World Futures Forum which works with the United Nations to solve the worlds greatest challenges. Matthew is an in demand international keynote, acclaimed university lecturer and mentor, and host of the hit Fanatical Futurist podcast.

A rare talent in his past Matthew helped build and run several multi-billion dollar business units for Atos, Dell-EMC, and IBM, and his ability to identify, track, and explain the impacts of hundreds of emerging technologies and trends on global business, culture, and society has earned him a powerful reputation and a roster of clients that include royal households, world leaders, G7, G20, and G77+ governments, and many of the world's most respected brands including ABB, Accenture, Adidas, AON, ARM, BCG, Centrica, Citi Group, Coca Cola, Dentons, Deloitte, Disney, Dow, EY, KPMG, Lego, Legal & General, LinkedIn, Microsoft, PepsiCo, Qualcomm, RWE, Samsung, T-Mobile, UBS, VISA, and many others. He was also the only futurist invited to talk at the UN COP28 held in Dubai alongside world leaders.

Regularly featured in the global media including the AP, BBC, Bloomberg, CNBC, Discovery, Forbes, Khaleej Times, Telegraph, TIME, ViacomCBS, WIRED, and the WSJ, Matthews mission is to help organisations create a fair and sustainable future whose benefits are shared by everyone irrespective of their ability, background, or circumstances.