When Hordes of Little AI Chatbots Are More Useful Than Giants Like ChatGPT

Matteo Palma and Stuart Mills
Oct 01, 2023

AI is developing rapidly. ChatGPT has become the fastest-growing online service in history. Google and Microsoft are integrating generative AI into their products. And world leaders are excitedly embracing AI as a tool for economic growth.

As we move beyond ChatGPT and Bard, we’re likely to see AI chatbots become less generic and more specialized. AIs are limited by the data they are trained on to get better at what they do: in this case, mimicking human speech and providing users with useful answers.

Training often casts the net wide, with AI systems absorbing thousands of books and web pages. But a more select, focused set of training data could make AI chatbots even more useful for people working in particular industries or living in certain areas.

The Value of Data

An important factor in this evolution will be the growing costs of amassing training data for advanced large language models (LLMs), the type of AI that powers ChatGPT. Companies know data is valuable: Meta and Google make billions from selling advertisements targeted with user data. But the value of data is now changing. Meta and Google sell data “insights”; they invest in analytics to transform many data points into predictions about users.

Data is valuable to OpenAI, the developer of ChatGPT, in a subtly different way. Imagine a tweet: “The cat sat on the mat.” This tweet is not valuable for targeted advertisers. It says little about a user or their interests. Maybe, at a push, it could suggest interest in cat food and Dr. Seuss.

But for OpenAI, which is building LLMs to produce human-like language, this tweet is valuable as an example of how human language works. A single tweet cannot teach an AI to construct sentences, but billions of tweets, blogposts, Wikipedia entries, and so on, certainly can. For instance, the advanced LLM GPT-4 was probably built using data scraped from X (formerly Twitter), Reddit, Wikipedia and beyond.

The AI revolution is changing the business model for data-rich organizations. Companies like Meta and Google have been investing in AI research and development for several years as they try to exploit their data resources.

Organizations like X and Reddit have begun to charge third parties for API access, the system used to scrape data from these websites. Data scraping costs companies like X money, as they must spend more on computing power to fulfill data queries.

Moving forward, as organizations like OpenAI look to build more powerful versions of their GPT models, they will face greater costs for acquiring data. One solution to this problem might be synthetic data.

Going Synthetic

Synthetic data is created from scratch by AI systems to train more advanced AI systems so that they improve. It is designed to perform the same task as real training data but is generated by AI.

It’s a new idea, but it faces many problems. Good synthetic data needs to be different enough from the original data it’s based on in order to tell the model something new, while similar enough to tell it something accurate. This can be difficult to achieve. Where synthetic data is just convincing copies of real-world data, the resulting AI models may struggle with creativity, entrenching existing biases.

Another problem is the “Hapsburg AI” problem. This suggests that training AI on synthetic data will cause a decline in the effectiveness of these systems—hence the analogy using the infamous inbreeding of the Hapsburg royal family. Some studies suggest this is already happening with systems like ChatGPT.
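The intuition behind this decline can be shown with a toy simulation. This is only a sketch under strong assumptions, not how real LLMs are trained: the "model" here is just a table of word counts, and the corpus, the function name `resample_generations`, and all sizes are hypothetical. Each generation is trained only on text produced by the previous one, and rare words that happen not to be sampled are lost for good.

```python
import random
from collections import Counter


def resample_generations(corpus, generations, sample_size, seed=0):
    """Repeatedly 'train' on a corpus by counting words, then 'generate'
    a new corpus by sampling words in proportion to those counts.

    Returns the number of distinct words present at each generation.
    """
    rng = random.Random(seed)
    distinct_per_generation = [len(set(corpus))]
    for _ in range(generations):
        counts = Counter(corpus)                 # "training": learn word frequencies
        words = list(counts)
        weights = [counts[w] for w in words]
        # "generation": the next corpus is made only of the model's own output
        corpus = rng.choices(words, weights=weights, k=sample_size)
        distinct_per_generation.append(len(set(corpus)))
    return distinct_per_generation


# Hypothetical starting corpus: a few common words and many rare ones.
human_corpus = ["the"] * 50 + ["cat"] * 20 + ["mat"] * 20 + [f"rare{i}" for i in range(30)]

history = resample_generations(human_corpus, generations=10, sample_size=120)
print(history)  # distinct words per generation; the count can shrink but never grow
```

Because each new corpus can only contain words the previous one already had, diversity can fall but never recover, which is the inbreeding analogy in miniature.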

One reason ChatGPT is so good is that it uses reinforcement learning from human feedback (RLHF), where people rate its outputs in terms of accuracy. If synthetic data generated by an AI has inaccuracies, AI models trained on this data will themselves be inaccurate. So the demand for human feedback to correct these inaccuracies is likely to increase.

However, while most people would be able to say whether a sentence is grammatically accurate, fewer would be able to comment on its factual accuracy—especially when the output is technical or specialized. Inaccurate outputs on specialist topics are less likely to be caught by RLHF. If synthetic data means there are more inaccuracies to catch, the quality of general-purpose LLMs may stall or decline even as these models “learn” more.
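The argument in the last two paragraphs can be put as simple arithmetic. The sketch below assumes, purely for illustration, that the share of wrong outputs surviving human review is the error rate multiplied by the share of errors raters miss. The function name `uncaught_error_rate` and every number in `scenarios` are hypothetical, not measurements.

```python
def uncaught_error_rate(error_rate: float, catch_rate: float) -> float:
    """Share of all outputs that are wrong AND slip past human raters."""
    return error_rate * (1.0 - catch_rate)


# All numbers below are hypothetical, chosen only to illustrate the argument.
# Each entry is (share of outputs that are wrong, share of errors raters catch).
scenarios = {
    "general topic, real data": (0.05, 0.90),
    "specialist topic, real data": (0.05, 0.30),
    "specialist topic, synthetic data": (0.10, 0.30),
}

for name, (error_rate, catch_rate) in scenarios.items():
    print(f"{name}: {uncaught_error_rate(error_rate, catch_rate):.3f}")
```

Under these assumed figures, lowering the catch rate (specialist topics) and raising the error rate (synthetic training data) both increase the share of mistakes that survive review, which is why the two effects compound.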

Little Language Models

These problems help explain some emerging trends in AI. Google engineers have revealed that there is little preventing third parties from recreating LLMs like GPT-3 or Google’s LaMDA AI. Many organizations could build their own internal AI systems, using their own specialized data, for their own objectives. These will probably be more valuable for these organizations than ChatGPT in the long run.

Recently, the Japanese government noted that developing a Japan-centric version of ChatGPT is potentially worthwhile to their AI strategy, as ChatGPT is not sufficiently representative of Japan. The software company SAP has recently launched its AI “roadmap” to offer AI development capabilities to professional organizations. This will make it easier for companies to build their own, bespoke versions of ChatGPT.

Consultancies such as McKinsey and KPMG are exploring the training of AI models for “specific purposes.” Guides on how to create private, personal versions of ChatGPT can be readily found online. Open source systems, such as GPT4All, already exist.

As development challenges—coupled with potential regulatory hurdles—mount for generic LLMs, it is possible that the future of AI will be many specific little—rather than large—language models. Little language models might struggle if they are trained on less data than systems such as GPT-4.

But they might also have an advantage in terms of RLHF, as little language models are likely to be developed for specific purposes. Employees who have expert knowledge of their organization and its objectives may provide much more valuable feedback to such AI systems, compared with generic feedback for a generic AI system. This may overcome the disadvantages of less data.

This article is republished from The Conversation under a Creative Commons license. Read the original article.

Image Credit: Mohamed Nohassi / Unsplash

I carried out my doctoral studies under the supervision of Professor Paolo Samorì, at the Institute of Supramolecular Science and Engineering (ISIS) (founded by Nobel laureate Professor J.M. Lehn) of the University Louis Pasteur in Strasbourg, France. During my doctoral studies I investigated the nanoscale structural and electronic properties of supramolecular assemblies for organic electronics, using scanning probe techniques. My doctoral work was awarded the “Young scientist award” by the European Materials Research Society. More recently I have been working as a postdoctoral scientist in the departments of Mechanical Engineering and Applied Physics at Columbia University (New York, U.S.A.) as part of the groups of Professor James Hone and Dr. Shalom Wind, and in close collaboration with Professor Colin Nuckolls's group. At Columbia I have focused my research efforts on the use of surface chemistry and nanofabrication strategies to control (bio)molecular self-assembly at the nanometer scale, for i) high-throughput monitoring of biomolecular interactions at the single-molecule level, and ii) controlled self-assembly of nanostructures in materials science. Since September 2013 I have been a Lecturer in Chemistry, and Principal Investigator, in the School of Biological and Chemical Sciences at Queen Mary University of London.

I am an assistant professor of economics at the University of Leeds. I focus on behavioral economics and digital economy. My research interests also include nudge theory, artificial intelligence, public policymaking, and economic philosophy.
