Stop Burning Tokens: How to Get More From Your AI Spend

Most advice about cutting AI costs tells you to use it less. Fewer requests, tighter limits, lock it down. That's a tax on the thing that's making you faster, and it's the wrong place to start.

There's a much better lever, and almost nobody pulls it: the way you actually talk to the model. The same task, producing the same output, can cost wildly different amounts depending on how the request is built, and the gap is bigger than most people would believe, with no change to what the AI does.

This isn't about clever prompt-engineering tricks to get better answers. It's about not overpaying for the answers you're already getting. Here's where the money leaks, and how to stop it.

Output costs more than input

Here's the asymmetry nobody accounts for. With most providers, the tokens the model generates cost far more than the tokens you send in, often three to five times more. You're paying a premium for every word that comes out.

Which means a model that rambles is a model that's expensive. If you don't set a ceiling on the response, you're paying for it to pad, caveat, and restate the question before it gets to the point.

The fix is the simplest one on this list: cap your output. Set a sensible max output token limit on every call, and tell the model to be brief in the prompt. "Answer in one paragraph." "Return only the JSON." "No preamble." You'll be surprised how much shorter, and cheaper, the same answer gets when you stop letting it run on.

You're paying for the same context over and over

This is the big one, and it's invisible unless you go looking.

Models have no memory between calls. So to hold a conversation, most applications send the entire history back in every single request. Message one, message two, message three: by the tenth turn you're re-sending the previous nine every time, plus any documents or instructions that came with them.

You're not paying for ten messages. You're paying for the first message ten times, the second message nine times, and so on. The cost of a long conversation grows like a snowball, and the bill arrives long after anyone remembers why.

The fix is to stop resending things the model doesn't need. Summarise older turns instead of replaying them verbatim. Trim the context down to what's actually relevant to the current step. If you're feeding in documents, retrieve and send only the passages that matter rather than the whole file. The model rarely needs everything. It needs the right thing.

Cache the parts that repeat

If you run the same system prompt, the same instructions, or the same reference document on every request (and most applications do) you're paying full price to send identical text again and again.

Most major providers now offer prompt caching. You mark the part of the prompt that stays the same, and on subsequent calls it's served from cache at a fraction of the normal price. The repeated chunk still gets used; you just stop paying full whack for it every time.

This is the closest thing to free money on the list, and at small-business level almost nobody is using it yet. If you've got a long fixed system prompt or a knowledge base going into every call, caching it can take a meaningful slice straight off the bill with no change to behaviour at all.

Use the right model for the job

I've written about this at length before, so I'll keep it short: most businesses pick a capable, expensive model once and run everything through it forever.

But drafting an email, summarising a meeting, or answering a routine FAQ does not need a flagship model. A smaller, cheaper one produces identical output for a fraction of the cost. Match the model to the difficulty of the task, not to whatever was top of the list when you set it up. For a big chunk of routine work, you're paying premium rates for basic answers out of pure inertia.

Trim the bloat

Every word in your prompt costs money, and it costs it on every call. So the padding adds up fast.

Two places to look. First, your system prompt. These tend to grow over time as people bolt on instructions, and half of it ends up being polite filler the model doesn't need. Tighten it. Second, your examples. Few-shot examples are powerful, but each one is tokens you pay for every single time. Use the minimum number that gets the job done, and keep them short.

None of this is dramatic on its own. But trimming a verbose prompt that runs ten thousand times a day is the difference between a rounding error and a line item.

The catch: you can't optimise what you can't see

Here's the honest bit. Every technique above works, but "I think that's cheaper now" is not a number. You can do all of this and have no idea whether it actually moved the bill, because by default you can't see the impact of any single change.

So before you start optimising, get visibility. Know what each model, each integration, and each workflow is costing you now. Make a change. Watch the number move. That feedback loop is the difference between guessing and managing.

That's what SpendLil does. It sits between your application and your AI provider and tracks every call in real time: provider, model, tokens, cost, per key, per project, per team. Cap your output tokens and you'll see the average response cost drop. Switch to a cheaper model for routine work and you'll see exactly how much you saved. Optimise with the meter running, and you stop guessing.

The bottom line

You don't cut your AI bill by using less AI. You cut it by not overpaying for the AI you're already using.

Cap the output. Stop resending context. Cache what repeats. Right-size the model. Trim the prompt. Then measure, so you know it worked.

Same output. Same speed. A meaningfully smaller bill. Start there, the rest gets easier.

Not sure what you're spending?

Take the free AI Shadow Audit: a quick assessment that scores your AI spend visibility, compliance readiness, and data risk.

Take the audit →

Get the newsletter

Weekly updates on AI regulation, costs, and practical guides for UK businesses.

Subscribe →