OpenAI seeks new private AI training datasets to help them realise AGI – Matthew Griffin | Keynote Speaker & Master Futurist

OpenAI seeks new private AI training datasets to help them realise AGI

WHY THIS MATTERS IN BRIEF

AI companies have trained their giant AI models on almost all of the freely available datasets, and now they want private datasets to help them reach AGI.

 


It’s an open secret that the data sets used to train Artificial Intelligence (AI) models are deeply flawed, which is one of the reasons why we continue to see bias in many AI models. Image corpora tend to be US- and Western-centric, partly because Western images dominated the internet when the data sets were compiled. And as most recently highlighted by a study out of the Allen Institute for AI, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases.

 


Models amplify these flaws in harmful ways. Now, OpenAI says that it wants to combat them by partnering with outside institutions to create new, hopefully improved data sets – something that will also help alleviate the problem of running out of valuable AI training data, a problem that could arrive as early as 2026.

 


This week OpenAI announced Data Partnerships, an effort to collaborate with third-party organizations to build public and private data sets for training AI and Artificial General Intelligence (AGI) models.

In a blog post, OpenAI says Data Partnerships is intended to “enable more organizations to help steer the future of AI and AGI” and “benefit from models that are more useful.” In doing so, it would also make OpenAI’s own AI models even better – and eventually more profitable.

 


“To ultimately make [AI] that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures and languages, which requires as broad a training data set as possible,” OpenAI writes. “Including your content can make AI models more helpful to you by increasing their understanding of your domain.”

As a part of the Data Partnerships program, OpenAI says that it’ll collect “large-scale” data sets that “reflect human society” and that aren’t easily accessible online today. While the company plans to work across a wide range of modalities, including images, audio and video, it’s particularly seeking data that “expresses human intention” (e.g. long-form writing or conversations) across different languages, topics and formats.

OpenAI says it’ll work with organizations to digitize training data where needed, using a combination of Optical Character Recognition (OCR) and automatic speech recognition tools, and to remove sensitive or personal information if necessary.
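
To make the "removing sensitive or personal information" step concrete, here is a minimal sketch of what redacting digitized text could look like. It is an illustration only, not OpenAI’s pipeline: the function name `redact_pii` and both patterns are hypothetical, and a real system would need far broader coverage (names, addresses, ID numbers and so on).

```python
import re

# Hypothetical patterns for illustration only; real PII detection
# needs much broader coverage than emails and phone-like numbers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
print(redact_pii(sample))
```

Note that the person’s name in the sample is left untouched, which is exactly the kind of gap that makes simple pattern matching insufficient on its own.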

 


At the start, OpenAI’s looking to create two types of data sets: an open source data set that’d be public for anyone to use in AI model training and a set of private data sets for training proprietary – or specialist – AI models. The private sets are intended for organizations that wish to keep their data private but want OpenAI’s models to have a better understanding of their domain, OpenAI says; so far, OpenAI’s worked with the Icelandic Government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic and with the Free Law Project to improve its models’ understanding of legal documents.

“Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” OpenAI writes.

So, can OpenAI do better than the many data-set-building efforts that’ve come before it? I’m not so sure — minimizing data set bias is a problem that’s stumped many of the world’s experts. At the very least, I’d hope that the company’s transparent about the process — and about the challenges it inevitably encounters in creating these data sets.

 


Despite the blog post’s grandiose language, there also seems to be a clear commercial motivation here: to improve the performance of OpenAI’s models at the expense of others, and without any compensation to speak of for the data owners. I suppose that’s well within OpenAI’s right. But it seems a little tone deaf in light of open letters and lawsuits from creatives alleging that OpenAI’s trained many of its models on their work without their permission or payment.
