The inner workings of large AI systems remain largely opaque, raising significant safety and trust issues. Researchers have now developed a technique to extract and manipulate the internal concepts governing model behavior, providing a new way to understand and steer their activity.

Modern AI models are marvels of engineering, but even their creators remain in the dark about how they represent knowledge internally. This is why subtle shifts in prompting can produce surprisingly different outputs. Simply asking a model to show its work before answering often improves accuracy, while certain deliberately malicious prompts can override built-in safety features.

This has motivated significant research aimed at teasing out the patterns of activity in these models’ neural networks that correspond to specific concepts. Investigators hope to use these methods to better understand why models behave in certain ways and potentially modify their behavior on the fly.

Now researchers have unveiled an efficient new way of extracting concepts from models that works across language, reasoning, and vision algorithms. In a paper in Science, the researchers used these concepts to both monitor and effectively steer model behavior.

“Our results illustrate the power of internal representations for advancing AI safety and model capabilities,” the authors write. “We showed how these representations enabled model steering, through which we exposed vulnerabilities and improved model capabilities.”

Key to the team’s approach is a new algorithm called the Recursive Feature Machine (RFM). They trained the algorithm on pairs of prompts, some containing a concept of interest and others not, and then identified the patterns of activity in the model’s neural network that track each concept.
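To make the idea of finding a concept in a model’s activity concrete, here is a minimal sketch in Python. It is not the RFM algorithm from the paper: it swaps in a much simpler stand-in, the difference between average activations for prompts with and without a concept. The synthetic data, the function name `concept_direction`, and the 16-dimensional activations are all assumptions made for illustration.

```python
import numpy as np


def concept_direction(acts_with, acts_without):
    """Return a unit vector pointing from the 'without' mean to the 'with' mean.

    Simplified stand-in (difference of means), NOT the paper's RFM algorithm.
    Each input is an array of shape (num_prompts, hidden_dim).
    """
    diff = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return diff / np.linalg.norm(diff)


# Hypothetical data: random numbers stand in for a real model's activations.
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[3] = 1.0  # assume the concept lives along coordinate 3

acts_without = rng.normal(size=(200, dim))
acts_with = rng.normal(size=(200, dim)) + 4.0 * true_dir

v = concept_direction(acts_with, acts_without)

# Projecting activations onto v gives a score that tracks the concept.
scores_with = acts_with @ v
scores_without = acts_without @ v
print("mean score with concept:   ", scores_with.mean())
print("mean score without concept:", scores_without.mean())
```

In this toy setup the recovered direction lines up closely with the planted one, so the projection score separates the two groups of prompts. A real system would read activations from a model’s layers rather than generate them at random.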

This allows the algorithm to learn “concept vectors,” essentially patterns of activity that nudge the model in the direction of a specific concept. The vectors can be used to modify the model’s internal processes while it is generating an output, steering it toward or away from specific concepts or behaviors.
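One common way to apply such a vector, assumed here for illustration rather than taken from the paper, is to add a scaled copy of it to the model’s hidden activations during generation. The sketch below shows only that arithmetic on a small array. The function name `steer`, the `strength` parameter, and the example numbers are hypothetical.

```python
import numpy as np


def steer(hidden, concept_vec, strength):
    """Shift every hidden state along a concept direction.

    Assumed additive steering: hidden + strength * unit(concept_vec).
    Positive strength pushes toward the concept, negative pushes away.
    """
    unit = concept_vec / np.linalg.norm(concept_vec)
    return hidden + strength * unit


# Hypothetical hidden states for two tokens in a 3-dimensional toy model.
hidden = np.array([[0.5, -1.0, 2.0],
                   [1.5, 0.0, -0.5]])
concept_vec = np.array([0.0, 3.0, 4.0])

toward = steer(hidden, concept_vec, 2.0)   # nudge toward the concept
away = steer(hidden, concept_vec, -2.0)    # nudge away from the concept

print(toward)
print(away)
```

In a real model this shift would be applied inside the network at one or more layers while it generates text, and the strength would need tuning: too small has no visible effect, while too large can degrade the output.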

To test the approach, the researchers asked GPT-4o to produce 512 concepts across five concept classes and generate training data on each. They extracted concept vectors from the data and used the vectors to steer the behavior of several large AI models.