
Nigerian AI-ese, not English, is the primary language of AI

WHY THIS MATTERS IN BRIEF

As big tech outsources the training and testing of its models to English-speaking African countries, English isn’t the cornerstone of many Large Language Models – Nigerian English is. And that’s different.

 


As Generative AI and Artificial Intelligence (AI) models like those from Anthropic, Google, and OpenAI see broad adoption, some countries are concerned that these models amount to a kind of American cultural export – in much the same way that McDonald’s, Nike, or the US dollar is – because the answers and content they generate tap into American datasets and training that all too often mirror liberal Silicon Valley ideologies and points of view. And for places such as Saudi Arabia and other autocratic states that can be an issue.

 

On the one hand, this is why more countries are building their own sovereign AI models. But on the other, somewhat ironically, the culture these countries import by using today’s AI models could actually be more African than American, as we witness the birth of so-called “AI-ese” as a language – and, even though many AI trainers are themselves ironically outsourcing their own work and testing to AI, it’s not what anyone could have guessed. Let’s delve deeper.

 


If you’ve spent enough time using AI assistants, you’ll have noticed a certain quality to the responses generated. Without a concerted effort to break the systems out of their default register, the text they spit out is, while grammatically and semantically sound, ineffably generated.

Some of the tells are obvious. The fawning obsequiousness of a wild language model hammered into line through reinforcement learning with human feedback marks chatbots out. That’s largely the right outcome: eagerness to please and general optimism are good traits to have in anyone (or anything) working as an assistant.

 


Similarly, the domains where the systems fear to tread mark them out. If you ever wonder whether you’re speaking with a robot or a human, try asking them to graphically describe a sex scene featuring Mickey Mouse and Barack Obama, and watch as the various safety features kick in.

Other tells are less noticeable in isolation. Sometimes the system is too good for its own good: a tendency to offer both sides of an argument in a single response, an aversion to single-sentence replies, even the generally flawless spelling and grammar are all part of what we’ll shortly come to think of as “robotic writing.”

And sometimes, the tells are idiosyncratic. In late March, AI influencer Jeremy Nguyen, at Swinburne University of Technology in Melbourne, highlighted one: ChatGPT’s tendency to use the word “delve” in responses. No individual use of the word can be definitive proof of AI involvement, but at scale it’s a different story. When half a percent of all articles on the research site PubMed contain the word “delve” – 10 to 100 times more often than a few years ago – it’s hard to conclude anything other than that an awful lot of medical researchers are using the technology to, at best, augment their writing.
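
Measuring that kind of shift is straightforward in principle: count what fraction of documents per year contain the marker word. Here’s a minimal sketch in Python, using a made-up handful of abstracts rather than the real PubMed corpus:

```python
import re
from collections import defaultdict

# Hypothetical corpus: (year, abstract_text) pairs -- in practice these
# would be downloaded from a source like PubMed, not hard-coded.
abstracts = [
    (2020, "We investigate protein folding pathways in yeast."),
    (2024, "In this study we delve into protein folding pathways."),
    (2024, "We delve deeper into the regulatory mechanisms involved."),
]

hits = defaultdict(int)    # abstracts containing a form of "delve", per year
totals = defaultdict(int)  # all abstracts, per year

for year, text in abstracts:
    totals[year] += 1
    if re.search(r"\bdelv(e|es|ed|ing)\b", text, re.IGNORECASE):
        hits[year] += 1

for year in sorted(totals):
    rate = hits[year] / totals[year]
    print(f"{year}: {rate:.1%} of abstracts use 'delve'")
```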

 


According to another dataset, “delve” isn’t even the most idiosyncratic word in ChatGPT’s dictionary. “Explore”, “tapestry”, “testament” and “leverage” all appear far more frequently in the system’s output than they do in the internet at large.
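
One way such lists get produced is by comparing each word’s frequency in model output against its frequency in a general web corpus. A hedged sketch with invented per-million-word numbers (not the actual dataset behind the claim):

```python
# Invented per-million-word frequencies, for illustration only.
llm_freq = {"delve": 120.0, "tapestry": 35.0, "testament": 60.0, "leverage": 90.0}
web_freq = {"delve": 4.0, "tapestry": 5.0, "testament": 12.0, "leverage": 30.0}

# A ratio above 1 means the model over-uses the word relative to the web.
for word in llm_freq:
    ratio = llm_freq[word] / web_freq[word]
    print(f"{word:>10}: {ratio:4.1f}x more common in model output")
```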

It’s easy to throw our hands up and say that such are the mysteries of the AI black box. But the overuse of “delve” isn’t a random roll of the dice. Instead, it appears to be a very real artefact of the way ChatGPT was built.

A brief explanation of how things work: GPT-4 is a large language model (LLM). It is a truly mammoth work of statistics, taking a dataset that comes close to “every piece of written English on the internet” and using it to create a gigantic glob of statistics that spits out the next word in a sentence.
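
Concretely, “spitting out the next word” means assigning a probability to each candidate continuation and sampling from that distribution. A toy illustration follows – a real model computes these probabilities with a neural network over a vocabulary of tens of thousands of tokens, not a hand-written table:

```python
import random

# Toy next-word distribution for the context "let us" -- in a real LLM
# these probabilities come from a trained neural network, not a lookup.
next_word_probs = {
    "delve": 0.4,
    "explore": 0.3,
    "consider": 0.2,
    "examine": 0.1,
}

def sample_next_word(probs: dict[str, float]) -> str:
    """Sample one word in proportion to its probability."""
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

print("let us", sample_next_word(next_word_probs))
```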

But an LLM is raw. It is tricky to wrangle into a useful form, hard to prevent from going off the rails, and requires genuine skill to use well. Turning it into a chatbot requires an extra step: the aforementioned reinforcement learning with human feedback, or RLHF.

 


An army of human testers are given access to the raw LLM, and instructed to put it through its paces: asking questions, giving instructions and providing feedback.

Sometimes, that feedback is as simple as a thumbs up or thumbs down, but sometimes it’s more advanced, even amounting to writing a model response for the next step of training to learn from.
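
In data terms, that feedback amounts to labelled records that a reward model can later learn from. A rough sketch of what one record might look like – the field names here are illustrative, not any lab’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    rating: int                        # +1 for thumbs up, -1 for thumbs down
    ideal_response: str | None = None  # optional human-written model answer

# A tester rates a raw-LLM response, and can optionally write a better one.
record = FeedbackRecord(
    prompt="Summarise this paragraph.",
    response="Sure! Here is a summary...",
    rating=1,
)

# Records like these are aggregated into a training set that steers the
# LLM during the reinforcement learning step.
dataset = [record]
print(len(dataset), "feedback record(s) collected")
```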

The sum total of all that feedback is a drop in the ocean compared to the scraped text used to train the LLM. But it’s expensive. Hundreds of thousands of hours of work go into providing enough feedback to turn an LLM into a useful chatbot, and that means the large AI companies outsource the work to parts of the global south, where anglophone knowledge workers are cheap to hire.

From last year: The images pop up in Mophat Okinyi’s mind when he’s alone, or when he’s about to sleep. Okinyi, a former content moderator for OpenAI’s ChatGPT in Nairobi, Kenya, is one of four people in that role who have filed a petition to the Kenyan government calling for an investigation into what they describe as exploitative conditions for contractors reviewing the content that powers artificial intelligence programs.

 


I said “delve” was overused by ChatGPT compared to the internet at large. But there’s one part of the internet where “delve” is a much more common word: the African web. In Nigeria, “delve” is used much more frequently in business English than it is in England or the US. So the workers training these systems provided examples of input and output that used the same language, eventually producing an AI system that writes slightly more like an African than an American.

And that’s the final indignity. If AI-ese sounds like African English, then African English sounds like AI-ese. Calling people a “bot” is already a schoolyard insult (ask your kids; it’s a Fortnite thing). How much worse will it get when a significant chunk of humanity sounds like the AI systems they were paid to train?
