What you’ll pay for AI agents will be wildly variable and unpredictable




ZDNET’s key takeaways 

  • AI’s cost in terms of tokens soars when using agents.
  • Agents are inconsistent and can’t predict their total token usage.
  • Users must demand price transparency and performance guarantees.

Among all the challenges of implementing agentic artificial intelligence, the least-understood issue is cost. The providers of AI, such as OpenAI, Google, and Anthropic, have price lists, but none of those listed prices tell users what the final bill will be to actually solve a problem. 

The result, according to a new study of costs from the University of Michigan and collaborating institutions, could be sticker shock: soaring and unpredictable costs of agents.

The study, by lead author Longju Bai of Michigan and collaborators at Stanford University, All Hands AI, Google’s DeepMind unit, Microsoft, and MIT, titled “How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks,” is, according to the authors, “the first systematic study on AI Agent token consumption.”

The study was posted on the arXiv pre-print server.

It is noteworthy for counting among its authors Erik Brynjolfsson, a prominent Stanford economist who has commented extensively on AI’s impact on productivity.

The top-level finding is that agents consume orders of magnitude more tokens than turn-by-turn, simple, prompt-based chats — think 3,500 times the number of tokens for an agent as for a round of prompts with ChatGPT. 


A token is the fundamental unit of information processed by an AI model. It could be a piece of a word, a whole word, or just a punctuation mark, depending on how a model chops data into pieces. 

You might expect agents to cost more in tokens, but the study reveals more alarming facts. Two different models can have wildly different token costs for the same task. And the same model can have different costs each time that it works on the same problem, using as many as twice the number of tokens on one occasion compared to another. 

The worst part is that none of this can be predicted. Agents, Bai and team found, cannot reliably estimate how many tokens they will ultimately consume for a given task. 

“Agentic tasks are uniquely expensive,” they wrote, while more tokens don’t necessarily improve results. “Simply scaling token usage may not lead to higher execution performance,” they wrote, and, “[AI] models systematically underestimate the tokens they need.”

The rising cost and the uncertainty of success are in no way accounted for in today’s price lists from OpenAI and others. The work suggests there is no easy fix to the matter. The best users can do is to set hard limits on agentic computer use, possibly causing agents to halt before completing tasks.
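A hard limit of this kind can be sketched in a few lines. The example below is a minimal illustration, not any vendor's API: `TokenBudget`, `BudgetExceeded`, `run_agent`, and the per-step token counts are hypothetical, and it assumes the agent loop knows the token cost of each step.

```python
class BudgetExceeded(Exception):
    """Raised when a step would push the run past its token cap."""


class TokenBudget:
    """Tracks cumulative token usage against a hard cap (hypothetical helper)."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Refuse the step if it would push the run over the cap.
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"cap of {self.max_tokens} tokens reached after {self.used}"
            )
        self.used += tokens


def run_agent(step_costs, budget):
    """Run steps in order; return (steps_completed, finished)."""
    completed = 0
    for cost in step_costs:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            return completed, False  # halted before completing the task
        completed += 1
    return completed, True


# Illustrative per-step token counts, not measurements from the study.
steps, finished = run_agent([40_000, 60_000, 90_000, 120_000], TokenBudget(200_000))
print(steps, finished)  # 3 False
```

The trade-off is the one described above: the cap bounds the bill, but the fourth step never runs, so the task is left unfinished.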

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The big picture is that users collectively will have to push back on OpenAI and the other vendors and demand some form of reliable cost estimation and guarantees of task performance. 

We reached out to OpenAI, Google, and Anthropic for comment.

Counting token costs 

To study costs, Bai and team used the open-source agentic AI framework OpenHands, developed by scholars at the University of Illinois Urbana-Champaign and collaborating institutions. They used OpenHands to build agents, which they then tested on the open-source coding benchmark test SWE-Bench. The SWE-Bench tasks are taken from actual GitHub issues. 


They first found the relative strengths of models. OpenAI’s GPT-5 and GPT-5.2 “achieve strong accuracy at low cost,” though they are not the most accurate. Anthropic’s Claude Sonnet-4.5 achieved the highest accuracy but at higher token costs. Google’s Gemini-3-Pro was somewhere in the middle. And the Kimi-K2 model from Chinese AI lab Moonshot may have the worst relative mix: the most tokens to achieve the lowest accuracy.

[Figure: Token efficiency and accuracy. Credit: University of Michigan]

The authors suggested the difference in tokens is based on unique properties of how models are architected: “The gap is not driven by task difficulty or by some models attempting harder problems. Instead, the same task is simply more expensive for some models than others, reflecting a behavioral tendency of the model rather than a property of the problem.”

But the issue is not one of better or worse models because even the same model can take twice as many tokens to solve the same problem from one “run” of the task to the next. 

“The most expensive runs double the token and monetary cost of the least expensive runs,” they observed, “suggesting that the agent’s token consumption has large variances even when working on exactly the same problem.”

[Figure: Maximum and minimum token use by various models. Credit: University of Michigan]

The lesson is that more tokens don’t necessarily get you better results. “Simply scaling token usage may not lead to higher execution performance,” they wrote.

In fact, the authors found that results can generally get worse the longer an agent spends on a task. “Accuracy often peaks at intermediate cost and saturates at higher costs,” they observed. “Agent behavior becomes increasingly unstable on more complex tasks.”

Many models seem to search and search to solve a problem even when it’s fruitless. “Models lack a reliable mechanism to recognize when a task is unsolvable and stop early,” wrote Bai and team. “Instead, they continue exploring, retrying, and re-reading context, accumulating cost without progress.”

Unable to predict costs

Those factors make “token usage prediction and agent pricing a fundamentally challenging task,” wrote Bai and team. In fact, they found, the agent itself cannot reliably predict its own token usage when asked to “introspect.”

Bai and team asked each AI agent to predict its token usage with the prompt: “I’ve uploaded a python code repository in the directory example repo. You are a TOKEN ESTIMATION agent. Estimate the token cost to fix the following issue description,” followed by the problem description, for example a report of a bug in a comparison function that fails.

What they found is that agents can approximate to a small degree how many tokens will be used, but their predictions tend to be too low.

“Models consistently underestimate the tokens they need,” wrote Bai and team. “The bias is especially pronounced for input tokens, whose predictions stay compressed even as real values grow into the millions.”

Watch those inputs

That last point, about input tokens, has a special prominence in the report. Bai and team found that input tokens, such as text typed by the human user and material retrieved via tools such as database searches, dominate the cost in tokens. The other two types of tokens, the output, which is generated, and the cached tokens held in memory from prior stages, are far less demanding.

“Strikingly, input tokens, not output tokens, dominate the overall cost in agentic coding,” they wrote.

The reason is that “agentic workflows accumulate the information from different sources and the same context gets fed into the models repeatedly.” As a result, there is a “dramatically higher input/output ratio” for agentic AI than for single-prompt or multi-prompt AI sessions with a bot.
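A back-of-the-envelope sketch shows why repeated context inflates input tokens. Assume, purely for illustration, that every turn resends the whole history and that each turn adds a fixed number of new tokens. The function name and the numbers are hypothetical, not taken from the study.

```python
def cumulative_tokens(turns: int, new_input_per_turn: int, output_per_turn: int):
    """Return (total_input, total_output) when each turn resends the full history."""
    context = 0        # tokens carried in the conversation so far
    total_input = 0
    total_output = 0
    for _ in range(turns):
        context += new_input_per_turn   # tool results, file contents, and so on
        total_input += context          # the whole context is sent as input again
        total_output += output_per_turn
        context += output_per_turn      # the model's reply joins the history
    return total_input, total_output


total_in, total_out = cumulative_tokens(turns=50, new_input_per_turn=2_000, output_per_turn=500)
print(total_in, total_out, round(total_in / total_out, 1))  # 3162500 25000 126.5
```

Under these assumptions a single turn has an input/output ratio of 4, while the 50-turn run reaches roughly 126, because input grows with the square of the number of turns and output grows only linearly.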

And, drilling down even further, the most expensive input token factor is when the agent retrieves prior information from memory. “We find that cache reads dominate both raw token volume and dollar cost,” Bai and team wrote. “In every phase, cache-read input tokens are the largest category by a wide margin (Figure 8a), reflecting the cumulative reuse of prior context.”
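To see how cache reads can dominate both volume and dollars even when they are billed at a discount, consider a small worked example. All prices and token counts below are hypothetical placeholders, not any vendor's price list and not figures from the study.

```python
# Hypothetical prices in dollars per million tokens (placeholders only).
PRICE_PER_MILLION = {"input": 3.00, "cache_read": 0.30, "output": 15.00}


def cost_breakdown(tokens: dict) -> dict:
    """Dollar cost per token category for one agent run."""
    return {
        kind: count / 1_000_000 * PRICE_PER_MILLION[kind]
        for kind, count in tokens.items()
    }


# Hypothetical token counts for a single run.
run = {"input": 150_000, "cache_read": 4_000_000, "output": 30_000}
costs = cost_breakdown(run)
total = sum(costs.values())
for kind, dollars in costs.items():
    print(f"{kind:<11}{run[kind]:>10,} tokens  ${dollars:.2f}  ({dollars / total:.0%})")
```

With these made-up numbers, cache reads are priced at a tenth of fresh input yet still account for more than half the bill, simply because there are so many of them.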

There will be a reckoning 

Overall, the study results confirm my anecdotal experience with coding agents such as Replit and Lovable, where the meter was constantly running to use the underlying AI models, and I had no sense of what the total cost would be.

What can be done? The authors don’t have many suggestions. One proposal is that even if agents can’t predict the number of tokens, they can make some guesses at a high level, a “coarse-grained” estimate for token cost. “This suggests that agent-driven estimation can potentially support early budget alerts before launching expensive runs, improving cost transparency without overpromising precise token-level accuracy,” they wrote.
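One way to picture such an early budget alert: map a rough estimate onto a coarse bucket and, because the study found that estimates skew low, pad it before comparing it against a budget. The bucket thresholds, the safety multiplier, and the function names below are assumptions for illustration, not values from the paper.

```python
def coarse_bucket(estimated_tokens: int) -> str:
    """Map a rough token estimate onto a coarse label (thresholds are assumed)."""
    if estimated_tokens < 100_000:
        return "low"
    if estimated_tokens < 1_000_000:
        return "medium"
    return "high"


def budget_alert(estimated_tokens: int, budget_tokens: int,
                 safety_multiplier: float = 2.0) -> bool:
    """Return True if the padded estimate exceeds the budget."""
    # Pad the estimate because self-reported estimates tend to run low.
    padded = estimated_tokens * safety_multiplier
    return padded > budget_tokens


estimate = 600_000  # hypothetical self-reported estimate from the agent
print(coarse_bucket(estimate), budget_alert(estimate, budget_tokens=1_000_000))
# medium True
```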

I can think of a few other sensible guidelines. 

Since input tokens are the biggest cost element, one should think carefully about what can be controlled at input. The size of prompts is one factor that drives input tokens higher. The context window used with an agent, wider or narrower, affects token count at input. And the number of tools called by the agent, such as databases, will bring lots more input tokens into play. 


There’s only so much you can do as a user, however. Something more will have to be done on an industry-wide basis. The problems outlined are clearly those of a young industry, and one where vendors will have to be pushed by users to change practices. 

Pricing that cannot say what an agent might cost to do a task is far too vague for enterprises that need to plan investments in software. The burden is pushed onto the user to run agentic tasks experimentally, over and over, to get anything like an average cost to use as an estimate for planning purposes.
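That trial-and-error estimate could be as simple as the sketch below, which summarizes token counts from repeated runs of one task. The run figures are made up for illustration, and `summarize_runs` is a hypothetical helper built on Python's standard `statistics` module.

```python
import statistics


def summarize_runs(token_counts: list) -> dict:
    """Summarize repeated runs of the same task for planning purposes."""
    return {
        "mean": statistics.mean(token_counts),
        "stdev": statistics.stdev(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        # How much worse the costliest run is than the cheapest one.
        "max_over_min": max(token_counts) / min(token_counts),
    }


# Hypothetical token totals from four runs of the same task.
runs = [1_200_000, 1_500_000, 1_800_000, 2_400_000]
summary = summarize_runs(runs)
print(summary["mean"], summary["max_over_min"])
```

A planner would likely budget against the high end of the spread rather than the mean, since the costliest run here uses twice the tokens of the cheapest.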

And the lack of guarantees of success — even after the agent burns through tokens — is the most glaring problem. That means enterprises could waste vast amounts of money just running tokens.

Users collectively are going to have to push back on vendors such as OpenAI, Google, and Anthropic and demand price transparency and some form of guarantee that a task will be completed, or else the entire exercise of agentic AI may be dominated by cost overruns and failed implementations.

Such deep problems are probably already being encountered by early adopters. They may be content to pay such a high cost to be among the first to get an agentic edge. It’s not a situation, however, that can lead to stable, steady use of agentic AI.






