AI Has a Secret: We’re Still Not Sure How to Test for Human Levels of Intelligence

Two of San Francisco’s leading players in artificial intelligence have challenged the public to come up with questions capable of testing the capabilities of large language models (LLMs) like Google Gemini and OpenAI’s o1. Scale AI, which specializes in preparing the vast tracts of data on which the LLMs are trained, teamed up with the Center for AI Safety (CAIS) to launch the initiative, Humanity’s Last Exam.

Featuring prizes of $5,000 for those who come up with the top 50 questions selected for the test, Scale and CAIS say the goal is to test how close we are to achieving “expert-level AI systems” using the “largest, broadest coalition of experts in history.”

Why do this? The leading LLMs are already acing many established tests in intelligence, mathematics, and law, but it’s hard to be sure how meaningful this is. In many cases, they may have pre-learned the answers due to the gargantuan quantities of data on which they are trained, including a significant percentage of everything on the internet.

Data is fundamental to this whole area. It is behind the paradigm shift from conventional computing to AI, from “telling” to “showing” these machines what to do. This requires good training datasets, but also good tests. Developers typically evaluate their models on data that hasn’t already been used for training, known in the jargon as “test datasets.”
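To make the idea of a test dataset concrete, here is a minimal sketch of a held-out split in Python. The function name, the 20 percent test fraction, and the placeholder questions are illustrative assumptions, not anything specified by Scale AI or CAIS.

```python
import random

def split_holdout(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into disjoint train and test lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(examples)      # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test items are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical placeholder data standing in for real exam questions.
examples = [f"question-{i}" for i in range(10)]
train, test = split_holdout(examples)

# The test is only meaningful if no test item was seen during training.
overlap = set(train) & set(test)
print(len(train), len(test), len(overlap))
```

The difficulty described in this article is that, for LLMs trained on much of the internet, this overlap is hard to guarantee to be empty.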

If LLMs are not already able to pre-learn the answers to established tests like bar exams, they probably will be soon. The AI analytics site Epoch AI estimates that 2028 will mark the point at which AIs will effectively have read everything ever written by humans. An equally important challenge is how to keep assessing AIs once that Rubicon has been crossed.

Of course, the internet is expanding all the time, with millions of new items being added daily. Could that take care of these problems?

Perhaps, but this bleeds into another insidious difficulty, referred to as “model collapse.” As the internet becomes increasingly flooded by AI-generated material which recirculates into future AI training sets, this may cause AIs to perform increasingly poorly. To overcome this problem, many developers are already collecting data from their AIs’ human interactions, adding fresh data for training and testing.

Some specialists argue that AIs also need to become embodied: moving around in the real world and acquiring their own experiences, as humans do. This might sound far-fetched until you realize that Tesla has been doing it for years with its cars. Another opportunity involves human wearables, such as Meta’s popular smart glasses by Ray-Ban. These are equipped with cameras and microphones and can be used to collect vast quantities of human-centric video and audio data.

Narrow Tests

Yet even if such products guarantee enough training data in the future, there is still the conundrum of how to define and measure intelligence—particularly artificial general intelligence (AGI), meaning an AI that equals or surpasses human intelligence.

Traditional human IQ tests have long been controversial for failing to capture the multifaceted nature of intelligence, encompassing everything from language to mathematics to empathy to sense of direction.

There’s an analogous problem with the tests used on AIs. There are many well-established tests covering such tasks as summarizing text, understanding it, drawing correct inferences from information, recognizing human poses and gestures, and machine vision.

Some tests are being retired, usually because the AIs are doing so well at them, but they’re so task-specific as to be very narrow measures of intelligence. For instance, the chess-playing AI Stockfish is way ahead of Magnus Carlsen, the highest scoring human player of all time, on the Elo rating system. Yet Stockfish is incapable of doing other tasks such as understanding language. Clearly it would be wrong to conflate its chess capabilities with broader intelligence.
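For readers unfamiliar with Elo, the rating gap translates into an expected score through a standard logistic formula. The sketch below uses that formula; the two ratings are made-up round numbers for illustration only, not official figures for any player or engine, and human and engine ratings come from separate rating pools in practice.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Illustrative ratings only (assumed values, not real ones).
human_rating = 2800
engine_rating = 3500

engine_expected = elo_expected_score(engine_rating, human_rating)
human_expected = elo_expected_score(human_rating, engine_rating)

print(f"engine expected score: {engine_expected:.3f}")
print(f"human expected score:  {human_expected:.3f}")
```

A gap of several hundred points means the stronger side is expected to take nearly every point, which is exactly why such a narrow test says little about broader intelligence.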

But with AIs now demonstrating broader intelligent behavior, the challenge is to devise new benchmarks for comparing and measuring their progress. One notable approach has come from French Google engineer François Chollet. He argues that true intelligence lies in the ability to adapt and generalize learning to new, unseen situations. In 2019, he came up with the “abstraction and reasoning corpus” (ARC), a collection of puzzles in the form of simple visual grids designed to test an AI’s ability to infer and apply abstract rules.

Unlike previous benchmarks that test visual object recognition by training an AI on millions of images, each with information about the objects contained, ARC gives it minimal examples in advance. The AI has to figure out the puzzle logic and can’t just learn all the possible answers.
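A toy version of this kind of puzzle can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the grids, the four candidate rules, and the brute-force search are invented for the example and are far simpler than real ARC tasks or any real solver.

```python
def identity(grid):
    return [row[:] for row in grid]

def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return [row[:] for row in grid[::-1]]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

# A deliberately tiny, hypothetical space of rules the solver may consider.
CANDIDATE_RULES = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def infer_rule(examples):
    """Return the name of the first rule consistent with every example pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in examples):
            return name
    return None

# Only two demonstration pairs: the rule must be inferred, not memorized.
examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[2, 2, 0], [0, 3, 0]], [[0, 2, 2], [0, 3, 0]]),
]
rule_name = infer_rule(examples)

# Apply the inferred rule to a grid the solver has never seen.
test_input = [[5, 0, 0], [0, 0, 7]]
prediction = CANDIDATE_RULES[rule_name](test_input)
print(rule_name, prediction)
```

Real ARC puzzles are hard precisely because the space of possible rules is open-ended, so no short, fixed list of candidates like this one would cover them.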

Though the ARC tests aren’t particularly difficult for humans to solve, there’s a prize of $600,000 for the first AI system to reach a score of 85 percent. At the time of writing, we’re a long way from that point. Two recent leading LLMs, OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet, both score 21 percent on the ARC public leaderboard (known as the ARC-AGI-Pub).

Another recent attempt using OpenAI’s GPT-4o scored 50 percent, but somewhat controversially because the approach generated thousands of possible solutions before choosing the one that gave the best answer for the test. Even then, this was still reassuringly far from triggering the prize—or matching human performances of over 90 percent.

While ARC remains one of the most credible attempts to test for genuine intelligence in AI today, the Scale/CAIS initiative shows that the search continues for compelling alternatives. (Fascinatingly, we may never see some of the prize-winning questions. They won’t be published on the internet, to ensure the AIs don’t get a peek at the exam papers.)

We need to know when machines are getting close to human-level reasoning, with all the safety, ethical, and moral questions this raises. At that point, we’ll presumably be left with an even harder exam question: how to test for a superintelligence. That is a more mind-bending task still, and one we need to figure out.

This article is republished from The Conversation under a Creative Commons license. Read the original article.

Image Credit: Steve Johnson / Unsplash

Andrew Rogoyski
Andrew’s experience spans 30 years in industry, government, and academia. Originally a physicist at the Rutherford Appleton Lab, Andrew joined Logica at the height of the early AI boom, a decade later moving to space consultancy Esys, then became MD of QinetiQ’s Space Division, where early AI techniques were being applied to applications like satellite imagery. Andrew subsequently worked as a strategist, specializing in innovation and cyber security, including secondment to Cabinet Office, before becoming CGI’s vice president of cyber security, where AI methods were used for threat detection. Andrew joined Roke Manor Research as Innovation Director, developing a number of products and services that utilized leading-edge AI techniques. Andrew returned to academia as director of innovation and partnerships at Surrey’s new Institute of People-Centered Artificial Intelligence, a group that leverages the University’s 35 years in AI by developing a new focus on creating AI solutions that focus on delivering benefit to people and society.