Breaking Into AI’s Black Box: Anthropic Maps the Mind of Its Claude Large Language Model

The opaque inner workings of AI systems are a barrier to their broader deployment. Now, startup Anthropic has made a major breakthrough in our ability to peer inside artificial minds.

One of the great strengths of deep learning neural networks is that they can, in a certain sense, think for themselves. Unlike previous generations of AI, which were painstakingly hand-coded by humans, these algorithms come up with their own solutions to problems by training on reams of data.

This makes them much less brittle and easier to scale to large problems, but it also means we have little insight into how they reach their decisions. That makes it hard to understand or predict errors or to identify where bias may be creeping into their output.

A lack of transparency limits deployment of these systems in sensitive areas like medicine, law enforcement, or insurance. More speculatively, it also raises concerns around whether we would be able to detect dangerous behaviors, such as deception or power seeking, in more powerful future AI models.

Now though, a team from Anthropic has made a significant advance in our ability to parse what’s going on inside these models. They’ve shown they can not only link particular patterns of activity in a large language model to both concrete and abstract concepts, but they can also control the behavior of the model by dialing this activity up or down.

The research builds on years of work on “mechanistic interpretability,” where researchers reverse engineer neural networks to understand how the activity of different neurons in a model dictates its behavior.

That’s easier said than done because the latest generation of AI models encodes information in patterns of activity, rather than in particular neurons or groups of neurons. That means individual neurons can be involved in representing a wide range of different concepts.

The researchers had previously shown they could extract activity patterns, known as features, from a relatively small model and link them to human-interpretable concepts. But this time, the team decided to analyze Anthropic’s Claude 3 Sonnet large language model to show the approach could work on commercially useful AI systems.

They trained another neural network on the activation data from one of Sonnet’s middle layers of neurons, and it was able to pull out roughly 10 million unique features related to everything from people and places to abstract ideas like gender bias or keeping secrets.
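To make the idea concrete, here is a rough Python sketch of what such a second network might look like. Anthropic's published work describes it as a sparse autoencoder; everything else here is an assumption for illustration: the two-neuron "layer," the three hand-picked feature directions, the weights, and the function names are all hypothetical, and no training is shown. The sketch only demonstrates the shape of the computation: an activation is mapped to a handful of non-negative feature activities, most of them zero, and then rebuilt as a weighted sum of feature directions.

```python
from typing import List

Vector = List[float]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def encode(activation: Vector, enc_weights: List[Vector], enc_bias: Vector) -> Vector:
    """Map a layer activation to non-negative feature activities."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def decode(features: Vector, directions: List[Vector]) -> Vector:
    """Rebuild the activation as a weighted sum of feature directions."""
    out = [0.0] * len(directions[0])
    for strength, direction in zip(features, directions):
        for i, d in enumerate(direction):
            out[i] += strength * d
    return out


def sae_loss(activation: Vector, features: Vector, rebuilt: Vector,
             l1_coeff: float = 0.1) -> float:
    """Reconstruction error plus an L1 penalty that rewards sparse features."""
    error = sum((a - r) ** 2 for a, r in zip(activation, rebuilt))
    sparsity = sum(abs(f) for f in features)
    return error + l1_coeff * sparsity


# Hypothetical toy setup: 2 neurons, 3 features (more features than neurons).
DIRECTIONS = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
# Hand-picked (not trained) encoder: a scaled copy of the directions plus a
# negative bias, so only a feature closely aligned with the input switches on.
ENC_WEIGHTS = [[10.0 * d for d in direction] for direction in DIRECTIONS]
ENC_BIAS = [-9.0, -9.0, -9.0]

activation = [0.6, 0.8]
features = encode(activation, ENC_WEIGHTS, ENC_BIAS)
reconstruction = decode(features, DIRECTIONS)
print(features, reconstruction, sae_loss(activation, features, reconstruction))
```

In this toy case both neurons are active, yet only the third feature switches on, which is the sense in which a feature is a pattern of activity rather than a single neuron.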

Interestingly, they found that features for similar concepts were clustered together, with considerable overlap in active neurons. The team says this suggests that the way ideas are encoded in these models corresponds to our own conceptions of similarity.
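One way to picture this kind of clustering is to treat each feature as a direction and compare directions with cosine similarity. The short Python sketch below does that for three made-up features. The vectors, the labels, and the choice of cosine similarity as the distance measure are assumptions for illustration; the article does not say how closeness was measured.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_feature(name: str, features: Dict[str, List[float]]) -> str:
    """Return the other feature whose direction is closest to `name`."""
    query = features[name]
    others = {k: v for k, v in features.items() if k != name}
    return max(others, key=lambda k: cosine_similarity(query, others[k]))


# Hypothetical three-dimensional feature directions, invented for this example.
TOY_FEATURES = {
    "golden_gate_bridge": [0.9, 0.1, 0.0],
    "alcatraz_island": [0.8, 0.3, 0.1],
    "spam_email": [0.0, 0.2, 0.9],
}

print(nearest_feature("golden_gate_bridge", TOY_FEATURES))
```

With these invented numbers, the two landmark features point in nearly the same direction while the spam feature sits far from both.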

More pertinently though, the researchers also discovered that dialing up and down the activity of neurons involved in encoding these features could have significant impacts on the model’s behavior. For example, massively amplifying the feature for the Golden Gate Bridge led the model to force it into every response no matter how irrelevant, even claiming that the model itself was the iconic landmark.
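A simplified way to think about "dialing up" a feature is to add a multiple of that feature's direction to the model's internal activation. The Python sketch below shows the arithmetic on a two-number toy activation. The direction, the scale, and the function names are hypothetical, and the researchers' actual procedure may differ in its details; this is only meant to show why a large boost can swamp whatever else the activation was encoding.

```python
from typing import List


def steer(activation: List[float], direction: List[float], scale: float) -> List[float]:
    """Add `scale` times a feature direction to a layer activation."""
    return [a + scale * d for a, d in zip(activation, direction)]


def feature_strength(activation: List[float], direction: List[float]) -> float:
    """Project the activation onto a unit-length feature direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Hypothetical unit-length direction standing in for a "Golden Gate Bridge" feature.
BRIDGE_DIRECTION = [0.6, 0.8]

activation = [0.2, -0.1]  # toy activation with little of the feature present
boosted = steer(activation, BRIDGE_DIRECTION, 10.0)      # dial the feature up
suppressed = steer(activation, BRIDGE_DIRECTION, -10.0)  # dial the feature down

print(feature_strength(activation, BRIDGE_DIRECTION))
print(feature_strength(boosted, BRIDGE_DIRECTION))
print(feature_strength(suppressed, BRIDGE_DIRECTION))
```

Because the toy direction has unit length, the measured strength of the feature shifts by exactly the chosen scale, up or down, while the original signal stays small by comparison.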

The team also experimented with some more sinister manipulations. In one, they found that over-activating a feature related to spam emails could get the model to bypass restrictions and write one of its own. They could also get the model to use flattery as a means of deception by amping up a feature related to sycophancy.

The team says there’s little danger of attackers using the approach to get models to produce unwanted or dangerous output, mostly because there are already much simpler ways to achieve the same goals. But it could prove a useful way to monitor models for worrying behavior. Turning the activity of different features up or down could also be a way to steer models towards desirable outputs and away from less positive ones.

However, the researchers were keen to point out that the features they’ve discovered make up just a small fraction of all of those contained within the model. What’s more, extracting all features would take huge amounts of computing resources, even more than were used to train the model in the first place.

That means we’re still a long way from having a complete picture of how these models “think.” Nonetheless, the research shows that it is, at least in principle, possible to make these black boxes slightly less inscrutable.

Image Credit: mohammed idris djoudi / Unsplash

Edd Gent
I am a freelance science and technology writer based in Bangalore, India. My main areas of interest are engineering, computing and biology, with a particular focus on the intersections between the three.