The cost of intelligence is falling fast. A model that matches GPT-4's release-day performance costs roughly a fortieth of what GPT-4 cost two years ago, and the trend is continuing.
The cost of frontier intelligence is not falling. Top-tier reasoning models cost roughly what they cost a year ago on the rate card, and substantially more in practice, because reasoning tokens have changed the unit economics of a single call. A task that used to take a 500-token answer can now generate 50,000 tokens of internal thinking on top of that answer, all billed at the output rate.
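To make the arithmetic concrete, here is a minimal sketch of the per-call output cost with and without thinking tokens. The price of $10 per million output tokens is a made-up placeholder, not a real rate card, and the function name is hypothetical. The multiplier does not depend on the price chosen.

```python
# Hypothetical price in dollars per million output tokens (an assumption, not a real rate).
OUTPUT_PRICE_PER_MTOK = 10.00


def output_cost(answer_tokens: int, thinking_tokens: int = 0,
                price_per_mtok: float = OUTPUT_PRICE_PER_MTOK) -> float:
    """Output cost of one call, with thinking tokens billed at the output rate."""
    return (answer_tokens + thinking_tokens) * price_per_mtok / 1_000_000


plain = output_cost(answer_tokens=500)
reasoning = output_cost(answer_tokens=500, thinking_tokens=50_000)
multiplier = reasoning / plain

print(f"plain: ${plain:.4f}  reasoning: ${reasoning:.4f}  multiplier: {multiplier:.0f}x")
```

With these numbers the rate card is unchanged, yet the same call costs about a hundred times more on the output side.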
Both trends are real, and they explain something most founders are noticing. AI is getting cheaper. Their AI bill is not.
The reason is partly that workloads are getting heavier. Agents take more tokens than chat. Reasoning models take more tokens than non-reasoning ones. Context windows have stretched and teams have filled them. The Jevons paradox is real: when a unit of something gets cheaper, total consumption tends to rise, and that is playing out across the industry.
But the deeper reason is that most AI products are running on the wrong models for the work they're doing. Sometimes the model is more powerful than the task requires and the bill is larger than it should be. Sometimes the model is too weak for the task and the product feels unreliable. Both come from the same root cause: a workload gets assigned to a model during development, the assignment works well enough to ship, and nobody revisits it.
This article isn't for teams treating tokens per dollar as their north star metric. Tokenmaxxing optimizes the wrong unit. The unit that matters is the workload, and the question is whether each workload is running where it should.
A better way to think about this is to treat every workload in a product as belonging to one of three tiers, with the assignment as something that moves over time.
Three tiers
Tier 1 is frontier intelligence. Opus, GPT-5, Gemini Ultra. These earn their keep on problems with a large solution space and real reasoning demands. Generating production code. Solving a mathematical problem. Working through an ambiguous brief where the answer isn't obvious until you've thought about it. A single Tier 1 call can cost twenty to a hundred times more than a Tier 3 call, which is fine when the task warrants the cost.
Tier 2 is the workhorse. Sonnet, GPT-mini, Gemini Flash. Most production traffic belongs here. The work involves structured generation with some judgment, multi-turn conversations where the model tracks state, or synthesis over a small retrieved context. A customer support reply that needs the right tone and the right facts. A meeting summary that decides what's worth including. A scoped agent that calls two or three tools in a known sequence.
Tier 3 is specialized. Borderline extractive tasks where the input goes in, a structured output comes out, and the definition of a correct answer is clear. Classifying a document into one of twelve categories. Extracting a geolocation from an address string. Tagging the entities in a paragraph. These workloads run cheaply on a small model trained for the job, often for a fraction of the cost of any general-purpose API.
The most common pattern in over-budget AI products is Tier 3 work running on Tier 1 models. The reverse is also common and harder to spot, which makes it the more dangerous of the two.
Routing first, distillation later
The reflex when an AI bill grows too large is to look for a clever optimization. Prompt compression. Caching. A distillation project. Those are real levers. They come second.
The first question is whether each workload is running on the right tier. Most teams default to a frontier model for everything because that's what they reached for during development, and the choice never got revisited. A reranking call running on GPT-5 when a tuned cross-encoder would do the same job for one percent of the cost. An entity extraction step using Opus when a fine-tuned 7B model would produce indistinguishable output. A classification gate calling a reasoning model when a simpler classifier would answer faster and cheaper.
The opposite mistake costs more in the long run. A team picks a cheap model early because the workload felt simple, ships it, and then starts hearing customer complaints about quality. Hallucinated facts in a summary. A support agent that gets the tone wrong on a sensitive ticket. A classification step that handles the obvious cases correctly but mangles the edge cases that matter most. The team treats these as prompt engineering problems and spends weeks tuning around them, when the actual fix is moving the workload up a tier.
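One way to keep both mistakes visible is to write the assignments down as data. The sketch below is illustrative only: the workload names and target tiers are assumptions made up for the example, and the target tier is still a human judgment that the code records without deriving it.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    FRONTIER = 1      # large solution space, real reasoning demands
    WORKHORSE = 2     # structured generation with some judgment
    SPECIALIZED = 3   # narrow task with a clear definition of correct


@dataclass(frozen=True)
class Workload:
    name: str
    current: Tier
    target: Tier      # a judgment call; nothing here computes it

    def finding(self) -> str:
        # A lower tier number means a more powerful model.
        if self.current < self.target:
            return "overpowered"
        if self.current > self.target:
            return "underpowered"
        return "ok"


# Hypothetical inventory for illustration.
workloads = [
    Workload("rerank", Tier.FRONTIER, Tier.SPECIALIZED),
    Workload("entity_extraction", Tier.FRONTIER, Tier.SPECIALIZED),
    Workload("sensitive_support_reply", Tier.SPECIALIZED, Tier.WORKHORSE),
    Workload("meeting_summary", Tier.WORKHORSE, Tier.WORKHORSE),
]

findings = {w.name: w.finding() for w in workloads}
for name, finding in findings.items():
    print(f"{name:26s} {finding}")
```

The point of the table is less the code than the habit: every workload has an explicit current tier and an explicit target that someone can revisit.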
Worse is when nobody complains. A common version of this is text-to-SQL running on an underpowered model. The queries it produces look plausible. They run. Some return wrong numbers, quietly, and feed dashboards that nobody is checking against ground truth. The model isn't dumb in isolation. The task requires understanding the schema and the organization's business context, and the cheap model has neither. (How to do text-to-SQL well is its own post.) The cost of getting this wrong is paid in decisions made on bad data, which is harder to detect and more expensive than any inference bill.
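A minimal sketch of what checking against ground truth can look like, using Python's built-in sqlite3 module. The table, the queries, and the `result_matches` helper are all hypothetical. The reference query stands in for a hand-verified answer, and the "generated" query is a made-up example of output that runs cleanly but answers a different question.

```python
import sqlite3


def result_matches(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    """Compare the rows of a generated query against a trusted reference query."""
    got = conn.execute(generated_sql).fetchall()
    want = conn.execute(reference_sql).fetchall()
    return sorted(got) == sorted(want)


# Toy in-memory data set, invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 100.0), (2, "refunded", 40.0), (3, "paid", 60.0)],
)

reference = "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
plausible_but_wrong = "SELECT SUM(amount) FROM orders"  # runs fine, counts refunds as revenue

good = result_matches(conn, reference, reference)
bad = result_matches(conn, plausible_but_wrong, reference)
print(good, bad)
```

Neither query raises an error, which is exactly why the wrong one goes unnoticed without a comparison like this.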
There is no playbook for getting these calls right. Observability matters, evaluation harnesses matter, but the decision of where a workload belongs is still a judgment call about the shape of the task, the cost of being wrong, and the variance in the inputs. Good machine learning has always been as much art as science, and the people who get these calls right do it on intuition built up over years of watching models succeed and fail in production. Arkanis has been doing this work a long time. That intuition is most of what the audit is, and it's the part that can't be replaced by a dashboard.
A routing audit catches both patterns in days. The savings on overspend and the reliability fixes on underspend usually come out of the same pass through the logs. Distillation is the harder follow-on, for workloads where even the right tier is expensive. Run the frontier model long enough to log inputs and outputs, then train a small specialist to reproduce the behavior. The specialist runs at a fraction of the cost.
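The logging half of that process can be sketched as a thin wrapper around the model call. Everything named here is an assumption for illustration: `stub_frontier_model` stands in for a real API client, and the JSONL file layout is one reasonable choice rather than a required format. Training the specialist itself is out of scope for the sketch.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple


def with_distillation_log(call_model: Callable[[str], str],
                          log_path: str) -> Callable[[str], str]:
    """Wrap a model call so every input/output pair is appended to a JSONL file."""
    def wrapped(prompt: str) -> str:
        output = call_model(prompt)
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"input": prompt, "output": output}) + "\n")
        return output
    return wrapped


def load_training_pairs(log_path: str) -> List[Tuple[str, str]]:
    """Read the log back as (input, output) pairs for a later fine-tuning job."""
    with open(log_path, encoding="utf-8") as f:
        return [(rec["input"], rec["output"]) for rec in map(json.loads, f)]


def stub_frontier_model(prompt: str) -> str:
    # Placeholder for a real frontier-model call.
    return prompt.upper()


log_path = os.path.join(tempfile.mkdtemp(), "extractions.jsonl")
extract = with_distillation_log(stub_frontier_model, log_path)
extract("acme corp acquired widgetco")
pairs = load_training_pairs(log_path)
print(pairs)
```

In practice the log would also need sampling, deduplication, and a review pass before it is fit to train on.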
A real example
A recent engagement involved a GraphRAG system. GraphRAG is worth understanding even if you haven't run into it, because it solves a problem most products eventually hit.
Most AI products that talk to documents use retrieval-augmented generation, or RAG. The system searches for passages that look relevant to the user's question, passes those passages to a language model, and the model writes an answer using them. RAG is a reasonable default for a lot of products.
The limitation shows up at scale. Imagine a company sitting on 50,000 customer support tickets from the past year, and a product team asking "what are the top five issues driving cancellations?" No single ticket answers that question. The answer is distributed across thousands of tickets, and no language model has a context window large enough to read them all at once. A standard RAG system pulls the ten or twenty tickets that look most similar to the question, the model summarizes those, and you get a confident answer based on at most 0.04% of the data.
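The coverage figure follows directly from the numbers in the example, as this short calculation shows. The function name is hypothetical.

```python
def coverage_percent(retrieved: int, corpus_size: int) -> float:
    """Share of the corpus the model actually reads, as a percentage."""
    return retrieved / corpus_size * 100


low = coverage_percent(10, 50_000)
high = coverage_percent(20, 50_000)
print(f"{low:.2f}% to {high:.2f}% of tickets are read")
```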
GraphRAG is built for this. It reads the corpus once, extracts the entities and the relationships between them into a knowledge graph, and writes summaries over regions of that graph. Questions get answered by searching across those summaries, which are compressed and therefore lossy, instead of trying to cram every relevant document into a prompt. The result reflects the whole dataset rather than the slice that happened to fit in the window. For any product whose value comes from synthesizing across a corpus, this is the technique that makes it possible.
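The shape of that pipeline can be shown with a toy sketch. It simplifies heavily and makes several assumptions: the triples are hand-written stand-ins for what an extraction model would return, connected components replace the community detection a real system would use, and the "summaries" are plain string joins where a real system would call a language model.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples standing in for extraction output.
triples = [
    ("billing", "causes", "cancellation"),
    ("double charge", "is_a", "billing"),
    ("login failure", "blocks", "mobile app"),
]


def build_graph(triples):
    """Undirected adjacency sets keyed by entity name."""
    graph = defaultdict(set)
    for subj, _relation, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def regions(graph):
    """Connected components, a simple stand-in for community detection."""
    seen, out = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        out.append(sorted(component))
    return out


graph = build_graph(triples)
region_list = regions(graph)
# Placeholder summaries; a real system would generate these with a model.
summaries = ["Region covering: " + ", ".join(r) for r in region_list]
print(summaries)
```

Queries then run against `summaries` rather than against the raw documents, which is what lets the answer draw on the whole corpus.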
The catch is the build cost. Constructing the graph means running every chunk of every document through a frontier model to extract entities and relationships. For a company indexing millions of documents, the indexing bill runs into six figures a month before a single user question gets asked.
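A back-of-the-envelope sketch shows how the bill gets there. Every number below is an assumption chosen for illustration: the document volume, chunks per document, tokens per chunk, and per-token prices are not taken from any real engagement or rate card.

```python
# All figures are hypothetical placeholders.
DOCS_PER_MONTH = 2_000_000
CHUNKS_PER_DOC = 10
INPUT_TOKENS_PER_CHUNK = 1_200    # chunk text plus extraction prompt
OUTPUT_TOKENS_PER_CHUNK = 300     # extracted entities and relationships
INPUT_PRICE_PER_MTOK = 3.00       # dollars per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00     # dollars per million output tokens

chunks = DOCS_PER_MONTH * CHUNKS_PER_DOC
input_cost = chunks * INPUT_TOKENS_PER_CHUNK / 1_000_000 * INPUT_PRICE_PER_MTOK
output_cost = chunks * OUTPUT_TOKENS_PER_CHUNK / 1_000_000 * OUTPUT_PRICE_PER_MTOK
total = input_cost + output_cost

print(f"chunks: {chunks:,}  input: ${input_cost:,.0f}  "
      f"output: ${output_cost:,.0f}  total: ${total:,.0f}")
```

Under these assumptions the extraction pass alone costs $162,000 for the month, before any query traffic.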
What worked was logging enough of those extractions to train a small specialist model on the documents that mattered. The specialist now handles extraction at a fraction of the original cost, with output quality indistinguishable from the frontier version.
The work was domain-specific. When the data shifts, say a move into a new vertical or a new customer with documents that look nothing like the training set, the specialist gets less reliable and someone has to retrain it. That maintenance is real work. It's part of why AI-native companies, even at scale, often look more like services companies on their income statement than the pure software companies they're compared to. Margins improve through operational discipline, not through code that ships once and runs forever.
Building that discipline is what separates AI companies running at software-like margins from the ones running at services margins across the board. Both are real businesses. The path is a choice.
Why this is hard to see from the inside
The framework is simple. Applying it to a product you built takes a kind of visibility most teams don't have.
API bills arrive as totals rather than workload-level cost breakdowns. Logs sit in different places for different providers. Quality issues get attributed to prompt engineering when the real problem is model fit. The team that built the product is too close to the code to sort it cleanly, and the finance side doesn't have the technical visibility to ask the right questions. Most companies operate with a vague sense that something is off, and no clear path to acting on it.
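If each call is logged with a workload tag, the breakdown the invoice lacks is a short aggregation away. The sketch below assumes a hypothetical log schema: the field names, workload names, and prices are invented for the example, and real provider logs will need mapping into something like it.

```python
from collections import defaultdict

# Hypothetical per-call log records; prices are dollars per million tokens.
records = [
    {"workload": "rerank", "input_tokens": 2000, "output_tokens": 50,
     "in_price": 3.0, "out_price": 15.0},
    {"workload": "rerank", "input_tokens": 2000, "output_tokens": 50,
     "in_price": 3.0, "out_price": 15.0},
    {"workload": "support_reply", "input_tokens": 1500, "output_tokens": 400,
     "in_price": 3.0, "out_price": 15.0},
    {"workload": "classify", "input_tokens": 800, "output_tokens": 10,
     "in_price": 3.0, "out_price": 15.0},
]


def cost_by_workload(records):
    """Total dollar cost per workload, most expensive first."""
    totals = defaultdict(float)
    for r in records:
        cost = (r["input_tokens"] * r["in_price"]
                + r["output_tokens"] * r["out_price"]) / 1_000_000
        totals[r["workload"]] += cost
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


breakdown = cost_by_workload(records)
for workload, cost in breakdown.items():
    print(f"{workload:15s} ${cost:.5f}")
```

The same grouping works for latency or error rates, which is how cost and reliability end up in one view.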
A two-week audit
Arkanis runs a two-week engagement that produces three things.
A complete map of every workload in the product, sorted by cost and by reliability. Some workloads are running on more model than they need. Others are running on less than they should. Both are findable in the logs.
A graduation plan for the top workloads. For each: which tier the work should run on, what it costs to make the move, what it saves or stabilizes once shipped, and how long the implementation takes.
The first changes, shipped during the engagement. Arkanis is a hands-on team. Meaningful improvements typically appear in the first few days, not at the end in a slide deck.
Two weeks of this work has saved clients hundreds of thousands of dollars in annualized inference cost and pulled brittle workloads back into reliable territory. The audit pays for itself before it ends.
If your AI bill is growing faster than your revenue, if your product is taking reliability hits that prompt tuning hasn't fixed, if your gross margin is becoming a fundraising problem, or if you suspect there are workloads that should move tiers in either direction but you don't have the in-house capacity to sort them out, get in touch.