What Google’s TurboQuant can and can’t do for AI’s spiraling cost





ZDNET’s key takeaways

  • Google’s TurboQuant can dramatically reduce AI memory usage.
  • TurboQuant is a response to the spiraling cost of AI.
  • A positive outcome is making AI more accessible by lowering inference costs.

With the cost of artificial intelligence skyrocketing thanks to soaring prices for computer components such as memory, Google last week responded with a proposed technical innovation called TurboQuant.

TurboQuant, which Google researchers discussed in a blog post, is another DeepSeek AI moment: a serious attempt to reduce the cost of AI. It could have a lasting benefit, making models much more efficient by cutting their memory usage. 

Also: What is DeepSeek AI? Is it safe? Here’s everything you need to know

Even so, just as DeepSeek did not stop massive investment in AI chips, observers say TurboQuant will likely lead to continued growth in AI investment. It’s the Jevons paradox: Make something more efficient, and it ends up increasing overall usage of that resource. 

However, TurboQuant is an approach that may help run AI locally by slimming the hardware demands of a large language model. 

More memory, more money 

The big cost factor for AI at the moment — and probably for the foreseeable future — is the ever-greater use of memory and storage technologies. AI is data-hungry, introducing a reliance on memory and storage unprecedented in the history of computing. 

TurboQuant, first described by Google researchers in a paper a year ago, employs “quantization” to reduce the number of bits and bytes required to represent the data. 

Also: Why you’ll pay more for AI in 2026, and 3 money-saving tips to try

Quantization is a form of data compression that uses fewer bits to represent the same value. In the case of TurboQuant, the focus is on what’s called the “key-value cache,” or, for shorthand, “KV cache,” one of the biggest memory hogs of AI. 
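The basic idea is easy to demonstrate. The sketch below is a toy uniform quantizer in Python, purely for illustration; it is not Google's TurboQuant algorithm, and the 8-bit setting is just an example. It maps 32-bit floats onto a grid of integer levels plus a scale and offset, cutting memory four-fold at the cost of a small, bounded reconstruction error.

```python
import numpy as np

# Toy uniform quantization (illustrative only, not TurboQuant itself):
# store 32-bit floats as 8-bit integers plus a scale and an offset.
def quantize(x, bits=8):
    levels = 2**bits - 1
    lo = float(x.min())
    scale = (float(x.max()) - lo) / levels or 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    # approximate reconstruction; error is at most half a grid step
    return q.astype(np.float32) * scale + lo

values = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale, lo = quantize(values)          # 1 KB instead of 4 KB
error = float(np.abs(values - dequantize(q, scale, lo)).max())
print(f"memory: {values.nbytes} -> {q.nbytes} bytes, max error {error:.4f}")
```

Storing the scale and offset alongside the integers is what lets the original values be approximately recovered; the fewer the bits, the coarser the grid and the larger that error.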

When you type into a chatbot such as Google’s Gemini, the AI compares what you’ve typed against a store of numeric representations that serves as a kind of database.

The thing that you type is called the query, and it is matched against data held in memory, called a key, to find a numeric match. Basically, it’s a similarity score. The key is then used to retrieve from memory exactly which words should be returned to you as the AI’s response, known as the value. 

Normally, every time you type, the AI model must calculate a new key and value, which can slow the whole operation. To speed things up, the machine retains a key-value cache in memory to store recently used keys and values. 
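That mechanism can be sketched in a few lines of Python. This is a minimal, single-head illustration of the caching idea, not any production implementation: each token's key and value are appended to the cache once, and every new query attends over everything cached so far instead of recomputing it.

```python
import numpy as np

# Minimal single-head KV cache (illustrative): keys and values are
# computed once per token and kept in memory for reuse.
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, query, new_key, new_value):
        self.keys.append(new_key)          # cache grows by one per token
        self.values.append(new_value)
        K = np.stack(self.keys)            # (tokens, dim)
        V = np.stack(self.values)
        scores = K @ query                 # similarity of query to each key
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()           # softmax over cached keys
        return weights @ V                 # weighted blend of cached values

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(5):                         # five "tokens" typed so far
    out = cache.attend(rng.standard_normal(8),
                       rng.standard_normal(8),
                       rng.standard_normal(8))
print(len(cache.keys))                     # 5 entries: memory grows with context
```

The trade is classic time-for-space: each token is processed faster, but the cache's memory footprint grows with every token in the conversation.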

The cache then becomes its own problem: The more you work with a model, the more memory the key-value cache takes up. “This scaling is a significant bottleneck in terms of memory usage and computational speed, especially for long context models,” according to Google lead author Amir Zandieh and colleagues.

Also: AI isn’t getting smarter, it’s getting more power hungry – and expensive

Making things worse, AI models are increasingly built to juggle more keys and values at once. That capacity, known as the context window, is the amount of input a model can consider in one go, and expanding it gives the model more to search over, potentially improving accuracy. Gemini 3, the current version, made a big leap in context window to one million tokens. Prior state-of-the-art models such as OpenAI’s GPT-4 had a context window of just 32,768 tokens. A larger context window also increases the amount of memory a key-value cache consumes. 
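The memory impact of a long context window is easy to estimate with back-of-envelope arithmetic. The model dimensions below are illustrative assumptions, not any specific model's published configuration; the point is that KV-cache size grows linearly with the number of tokens.

```python
# Back-of-envelope KV-cache sizing. The layer, head, and dimension
# counts are illustrative assumptions, not a real model's configuration.
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    # two tensors per layer (keys and values), fp16 = 2 bytes per number
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

for tokens in (32_768, 1_000_000):
    print(f"{tokens:>9,} tokens -> {kv_cache_bytes(tokens) / 2**30:.1f} GiB")
```

Under these assumptions a 32,768-token context needs a few gigabytes of cache, while a million-token context needs well over a hundred, which is the bottleneck TurboQuant's compression targets.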

Speeding up quantization for real-time use

The solution to that expanding KV cache is to quantize the keys and the values so the whole thing takes up less space. Zandieh and team claim in their blog post that the data compression is “massive” with TurboQuant.  “Reducing the KV cache size without compromising accuracy is essential,” they write.

Quantization has been used by Google and others for years to slim down neural networks. What’s novel about TurboQuant is that it’s meant to quantize in real time. Previous compression approaches reduced the size of a neural network at compile time, before it is run in production. 

Also: Nvidia wants to own your AI data center from end to end

That’s not good enough, observed Zandieh. The KV cache is a living digest of what’s learned at “inference time,” when people are typing to an AI bot, and the keys and values are changing. So, quantization has to happen fast enough and accurately enough to keep the cache small while also staying up to date. The “turbo” in TurboQuant implies this is a lot faster than traditional compile-time quantization. 

Two-stage approach

TurboQuant has two stages. First, the queries and keys are compressed. This can be done geometrically because queries and keys are vectors: lists of numbers that can be plotted as points on a graph and rotated around the origin. The researchers call their rotation scheme “PolarQuant.” By applying random rotations and then storing each vector in polar coordinates, they find a representation that uses far fewer bits while still preserving accuracy.
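A toy version of the polar idea looks like this. It is my own simplification for illustration, not the paper's algorithm: pair up a vector's coordinates, convert each pair to a magnitude and an angle, and store the angle in just three bits.

```python
import numpy as np

# Toy polar quantization (a simplification of the PolarQuant idea, not
# the published algorithm): treat consecutive coordinate pairs as 2-D
# points, keep their magnitudes, and quantize their angles to 3 bits.
def to_polar(v):
    x, y = v[0::2], v[1::2]
    return np.hypot(x, y), np.arctan2(y, x)   # (magnitude, angle) per pair

def quantize_angle(theta, bits=3):
    step = 2 * np.pi / 2**bits
    return np.round(theta / step).astype(int) % 2**bits  # angle index

def reconstruct(r, idx, bits=3):
    theta = idx * 2 * np.pi / 2**bits
    v = np.empty(2 * len(r))
    v[0::2], v[1::2] = r * np.cos(theta), r * np.sin(theta)
    return v

key = np.random.default_rng(1).standard_normal(128)
r, theta = to_polar(key)
approx = reconstruct(r, quantize_angle(theta))
# per-coordinate error is bounded by the magnitude times half an angle step
print(float(np.abs(key - approx).max()))
```

Because the angle error is at most half a quantization step, the reconstruction error is bounded by each pair's magnitude, which is why a random rotation (spreading magnitudes evenly) helps before quantizing.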

As they put it, “PolarQuant acts as a high-efficiency compression bridge, converting Cartesian inputs into a compact Polar ‘shorthand’ for storage and processing.”

(Illustration: PolarQuant converting Cartesian inputs into a polar shorthand. Source: Google)

The compressed vectors still introduce errors when the comparison is performed between the query and the key, a calculation known as the “inner product” of the two vectors. To fix that, the researchers use a second method, QJL, introduced by Zandieh in 2024. That asymmetric approach keeps one of the two vectors at full precision, so that multiplying a compressed (quantized) vector by an uncompressed one sacrifices far less accuracy than compressing both. 
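The benefit of keeping one side at full precision can be shown with a crude experiment. This is a simplified stand-in for the asymmetric idea, not the published QJL algorithm: compare the inner-product error when only the key, versus both query and key, is reduced to signs plus a scale.

```python
import numpy as np

# Crude illustration of asymmetric quantization (a stand-in for the QJL
# idea, not the published algorithm): quantizing only the key, while the
# query stays at full precision, loses less inner-product accuracy on
# average than quantizing both sides.
rng = np.random.default_rng(2)
dim, trials = 4096, 200
err_one_sided = err_two_sided = 0.0
for _ in range(trials):
    q = rng.standard_normal(dim)                 # query: full precision
    k = rng.standard_normal(dim)                 # key: to be quantized
    cq, ck = np.abs(q).mean(), np.abs(k).mean()  # per-vector scales
    exact = q @ k
    err_one_sided += abs(exact - q @ (ck * np.sign(k)))
    err_two_sided += abs(exact - (cq * np.sign(q)) @ (ck * np.sign(k)))
print(err_one_sided < err_two_sided)             # asymmetric wins on average
```

Averaged over many random vectors, the one-sided error is reliably smaller, which is the intuition behind quantizing the cached keys while leaving the live query untouched.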

They tested TurboQuant by applying it to Meta’s open-source Llama 3.1-8B AI model, and found that “TurboQuant achieves perfect downstream results across all benchmarks while reducing the key value memory size by a factor of at least 6x” — a six-fold reduction in the amount of KV cache needed.

The approach also differs from other methods for compressing the KV cache, such as the approach taken last year by DeepSeek, which constrained key and value searches to speed up inference.

Also: DeepSeek claims its new AI model can cut the cost of predictions by 75% – here’s how

In another test, using Google’s Gemma open-source model and models from French AI startup Mistral, “TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy,” they wrote, “all while achieving a faster runtime than the original LLMs (Gemma and Mistral).” 

“It is exceptionally efficient to implement and incurs negligible runtime overhead,” they observed.

(Chart: TurboQuant performance results. Source: Google)

Will AI be any cheaper?

Zandieh and team expect TurboQuant to have a significant impact on the production use of AI inference. “As AI becomes more integrated into all products, from LLMs to semantic search, this work in fundamental vector quantization will be more critical than ever,” they wrote. 

Also: Want to try OpenClaw? NanoClaw is a simpler, potentially safer AI agent

But will it really reduce the cost of AI? Yes and no. 

In an age of agentic AI, when programs such as OpenClaw operate autonomously, there are many parts to AI besides the KV cache. Other uses of memory, such as retrieving and storing database records, will ultimately affect an agent’s efficiency over the long term. 

Observers of the AI chip market argued last week that just as DeepSeek AI’s efficiency gains didn’t slow AI investment last year, neither will TurboQuant.

Vivek Arya, a Merrill Lynch analyst who follows AI chips, wrote to clients worried about DRAM maker Micron Technology that TurboQuant will simply make more efficient use of AI. The “6x improvement in memory efficiency [will] likely [lead] to 6x increase in accuracy (model size) and/or context length (KV cache allocation), rather than 6x decrease in memory,” wrote Arya.

Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast

What TurboQuant can do, though, is make some individual instances of AI more economical, especially for local deployment. 

For example, a swelling KV cache and longer context windows may prove less of a burden when running some AI models on limited hardware budgets. That will be a relief for users of OpenClaw who want their MacBook Neo or Mac mini to serve as a budget local AI server. 




