Researchers Break Open AI’s Black Box—and Use What They Find Inside to Control It
Date: 2026-02-23
A new tool takes relatively few resources to chart algorithms' inner workings and steer their behavior.
The inner workings of large AI systems remain largely opaque, raising significant safety and trust issues. Researchers have now developed a technique to extract and manipulate the internal concepts governing model behavior, providing a new way to understand and steer their activity.
Modern AI models are marvels of engineering, but even their creators remain in the dark about how they represent knowledge internally. This is why subtle shifts in prompting can produce surprisingly different outputs. Simply asking a model to show its work before answering often improves accuracy, while certain deliberately malicious prompts can override built-in safety features.
This has motivated significant research aimed at teasing out the patterns of activity in these models’ neural networks that correspond to specific concepts. Investigators hope to use these methods to better understand why models behave in certain ways and potentially modify their behavior on the fly.
Now researchers have unveiled an efficient new way of extracting concepts from models that works across language, reasoning, and vision algorithms. In a paper in Science, the researchers used these concepts to both monitor and effectively steer model behavior.
“Our results illustrate the power of internal representations for advancing AI safety and model capabilities,” the authors write. “We showed how these representations enabled model steering, through which we exposed vulnerabilities and improved model capabilities.”
Key to the team’s approach is a new algorithm called the Recursive Feature Machine (RFM). They trained the algorithm on pairs of prompts—some containing a concept of interest, others not—and then identified patterns of activity in the model’s neural network tracking each concept.
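To make the paired-prompt idea concrete, here is a minimal sketch of pulling a concept direction out of two groups of activations. It is not the Recursive Feature Machine: it swaps in a much simpler stand-in, the difference between the two group means. The function names and the synthetic "activations" (Gaussian noise with a shift along one coordinate) are assumptions made purely for illustration.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(with_concept, without_concept):
    """Unit vector pointing from the 'without' mean to the 'with' mean.

    A difference-of-means stand-in, NOT the RFM algorithm from the paper.
    """
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [d / norm for d in diff]


def toy_activations(n, dim, shift, rng):
    """Hypothetical stand-in for hidden states: Gaussian noise, plus a
    shift along coordinate 0 that plays the role of the concept."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += shift
        rows.append(row)
    return rows


rng = random.Random(0)
pos = toy_activations(200, 8, shift=3.0, rng=rng)  # prompts with the concept
neg = toy_activations(200, 8, shift=0.0, rng=rng)  # prompts without it
direction = concept_direction(pos, neg)
```

In this toy setup the recovered direction lines up almost entirely with coordinate 0, the axis along which the two groups were made to differ. Real hidden states would come from a model's layers rather than a random number generator.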
This allows the algorithm to learn "concept vectors"—essentially patterns of activity that nudge the model in the direction of a specific concept. The vectors can be used to modify the model’s internal processes when it’s generating an output to steer it toward or away from specific concepts or behaviors.
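One common way to picture this kind of steering is as adding a scaled copy of the concept vector to a hidden state, with the sign of the scale deciding whether the model is pushed toward or away from the concept. The sketch below assumes that additive form; the article does not spell out the exact update the researchers use, and the vectors, the `steer` helper, and the `strength` parameter are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, concept_vector, strength):
    """Return a new vector: hidden + strength * concept_vector."""
    if len(hidden) != len(concept_vector):
        raise ValueError("dimension mismatch")
    return [h + strength * v for h, v in zip(hidden, concept_vector)]


# Hypothetical 4-dimensional hidden state and unit-length concept vector.
hidden = [0.5, -1.0, 2.0, 0.0]
concept = [0.0, 1.0, 0.0, 0.0]

toward = steer(hidden, concept, strength=4.0)   # push toward the concept
away = steer(hidden, concept, strength=-4.0)    # push away from it

# Projection onto the concept vector: a simple "how much concept" score.
score_before = dot(hidden, concept)
score_toward = dot(toward, concept)
score_away = dot(away, concept)
```

Here the projection onto the concept vector moves from -1.0 to 3.0 when steering toward the concept and to -5.0 when steering away. The same projection could serve as a crude monitoring signal.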
To test the approach, the researchers asked GPT-4o to produce 512 concepts across five concept classes and generate training data on each. They extracted concept vectors from the data and used the vectors to steer the behavior of several large AI models.
The approach worked well across a broad range of model types, including large language models, vision-language models, and reasoning models. Surprisingly, they found newer, larger, and better-performing models were actually more steerable than some smaller ones.
Crucially, the team showed they could use the technique to expose and address serious vulnerabilities in the models. In one test, they created a vector for the concept of “anti-refusal,” which allowed them to bypass the built-in safety features that normally prevent vision-language models from giving advice on how to take drugs. But they also learned a vector for “anti-deception,” which they successfully used to steer a model away from giving misleading answers.
One of the study’s more interesting findings was that the extracted features were transferable across languages. A concept vector learned with English training data could be used to alter outputs in other languages. The researchers also found they could combine multiple concept vectors to manipulate model behavior in more sophisticated ways.
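Combining concept vectors can be pictured as a weighted sum, with each weight setting how strongly, and in which direction, its concept is applied. This is a sketch under that assumption, not the paper's procedure. The concept names `formal` and `cheerful`, the weights, and the three-dimensional vectors are invented for illustration.

```python
def combine(vectors, weights):
    """Weighted sum of concept vectors that share one dimension."""
    if len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    dim = len(vectors[0])
    combined = [0.0] * dim
    for vec, w in zip(vectors, weights):
        if len(vec) != dim:
            raise ValueError("dimension mismatch")
        for i, x in enumerate(vec):
            combined[i] += w * x
    return combined


# Hypothetical unit-length concept vectors.
formal = [1.0, 0.0, 0.0]
cheerful = [0.0, 1.0, 0.0]

# Steer toward "formal" (weight 2.0) and away from "cheerful" (weight -1.0).
mixed = combine([formal, cheerful], [2.0, -1.0])

# Apply the combined vector to a hypothetical hidden state.
hidden = [0.25, 0.5, 1.0]
steered = [h + m for h, m in zip(hidden, mixed)]
```

Because the two toy vectors are orthogonal, each weight affects only its own coordinate. Concept vectors from a real model would generally overlap, so changing one weight could shift the others' effect too.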
But the new technique’s real power is in its efficiency. It took fewer than 500 training samples and less than a minute of processing time on a single Nvidia A100 GPU to identify activity patterns associated with a concept and steer towards it.
The researchers say this could make it possible to systematically map concepts within large AI models. It could also lead to ways of tweaking model behavior after training that are more efficient than existing methods.
The approach is still a long way from delivering complete model transparency. But it’s a useful addition to the growing arsenal of model analysis tools that will become increasingly important as AI pushes deeper into all of our lives.