Date: 2025-09-23

Training AI models on your data can provide powerful new insights, but it can also cause them to leak sensitive information. Now Google has released a new model designed from the ground up to prevent these kinds of privacy breaches.

Large language models are a promising way to extract valuable information from the piles of unstructured data most companies are sitting on. But much of this data is full of highly sensitive details about customers, intellectual property, and company finances.

That’s a problem because language models tend to memorize some of the data they’re trained on and can occasionally spit it back out verbatim. That can make it very hard to ensure these models don’t reveal private data to the wrong people in the wrong context.

One potential workaround is an approach called differential privacy, which allows you to extract insights from data without revealing the specifics of the underlying information. However, it makes training AI models significantly less effective, requiring more data and computing resources to achieve a given level of accuracy.

Now though, Google researchers have mapped the trade-offs between privacy guarantees, compute budgets, and data requirements to come up with a recipe for efficiently building privacy-preserving AI models. And they’ve used this playbook to create a 1-billion-parameter model called VaultGemma that performs on par with older models of similar sizes, showing privacy can be protected without entirely sacrificing capability.

“VaultGemma represents a significant step forward in the journey toward building AI that is both powerful and private by design,” the researchers write in a blog post.

Differential privacy involves injecting a small amount of noise, or random data, during the AI training process. This doesn’t change the overarching patterns and insights the model learns, but it obfuscates the contributions of particular data points. This makes it harder for the model to memorize specific details from the dataset that could later be regurgitated.
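To make the idea concrete, here is a minimal sketch of one noisy gradient step in the style of differentially private stochastic gradient descent (DP-SGD), a common way of adding noise during training. The article does not say exactly how VaultGemma's training applies noise, so treat this as an illustration only: the function name `dp_noisy_mean_gradient` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and the gradients are made-up numbers.

```python
import numpy as np


def dp_noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum them, add Gaussian noise, then average.

    Illustrative sketch of a DP-SGD-style step; all names are assumptions.
    """
    grads = np.asarray(per_example_grads, dtype=float)
    # L2 norm of each example's gradient (one row per example).
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Shrink any gradient whose norm exceeds clip_norm; leave the rest unchanged.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Gaussian noise scaled to the clipping bound masks any single example's contribution.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


# Usage with made-up gradients for three training examples.
rng = np.random.default_rng(0)
example_grads = np.array([[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]])
noisy_grad = dp_noisy_mean_gradient(
    example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng
)
```

Clipping caps how much any one example can move the model, and the noise is sized relative to that cap, which is why the overall pattern in the batch survives while individual contributions are blurred.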

However, the strength of the privacy guarantee, quantified by a parameter known as the privacy budget, depends on the amount of noise added in the training process: more noise means a smaller budget and stronger privacy. And the more noise you add, the less effective the training process and the more data and compute you have to use. These three factors interact in complicated ways that make it tricky to figure out the most efficient way to build a model with specific privacy guarantees and performance.
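A rough feel for the link between noise and the privacy budget comes from the textbook Gaussian mechanism, which covers a single noisy release rather than a full training run. The sketch below uses the classical bound, valid for a budget (epsilon) between 0 and 1, in which the required noise grows as the budget shrinks. It is not the accounting used for VaultGemma, which the article does not describe, and the function name `gaussian_sigma` is hypothetical.

```python
import math


def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Noise standard deviation from the classical Gaussian mechanism bound.

    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon, for 0 < epsilon < 1.
    Illustrative only: applies to one noisy release, not to a whole training run.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classical bound assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# A tighter budget (smaller epsilon) calls for more noise.
loose = gaussian_sigma(epsilon=0.9, delta=1e-5)
tight = gaussian_sigma(epsilon=0.3, delta=1e-5)
```

Under this bound, cutting the budget to a third triples the noise, which hints at why stronger privacy guarantees cost more data and compute to reach the same accuracy.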