Most businesses pick an AI provider once and never look at the bill again. A developer tries a model, it works, it becomes the default. Nobody compares. Nobody checks if a cheaper option would do the same job.
I ran the same tasks through three providers recently during a client demo. The cost differences were staggering.
The test
I sent identical prompts through OpenAI, Anthropic, and Google via SpendLil's gateway. Same questions, same complexity, same expected output. The only difference was the provider and model.
Here's what each request cost for a typical business task — roughly 500 tokens in, 1000 tokens out:
GPT-4o: £0.008
GPT-4o-mini: £0.0003
Claude Sonnet 4.5: £0.0048
Claude Haiku 4.5: £0.0015
Gemini 2.5 Flash: £0.00002
Gemini 2.0 Pro: £0.008
The most expensive option cost 400 times as much as the cheapest. For the same task. With comparable quality on most everyday use cases.
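To make the arithmetic concrete, here's a minimal sketch of how per-request cost falls out of token counts and per-million-token prices. The prices below are assumptions chosen to roughly reproduce the figures above, not published rate cards; check your provider's current pricing.

```python
# Illustrative per-request cost arithmetic. The per-million-token prices
# are assumptions, not published rates.

def request_cost(tokens_in: int, tokens_out: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost in pounds for one request, given per-million-token prices."""
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000

# A typical business task: ~500 tokens in, ~1,000 tokens out.
print(request_cost(500, 1000, 2.0, 7.0))    # flagship-class model: ~£0.008
print(request_cost(500, 1000, 0.06, 0.24))  # budget-class model:   ~£0.0003
```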
Where the cheap models genuinely work
Not every task needs a flagship model. For the majority of business AI use cases, the budget option produces output that's indistinguishable from the premium one.
Email drafts and replies. If you're using AI to draft customer emails, internal communications, or standard responses, GPT-4o-mini or Gemini Flash handle this comfortably. The output is clean, professional, and functional. Paying hundreds of times more for GPT-4o or Claude Sonnet to do the same thing is money wasted.
FAQ and customer support responses. Standard customer queries follow predictable patterns. A cheap model handles "What are your opening hours?" or "How do I reset my password?" just as well as a frontier model. Save the expensive tokens for the edge cases.
Data extraction and formatting. Pulling structured data from unstructured text, reformatting documents, extracting names and dates from emails — this is mechanical work. The cheap models are built for it.
Summarisation. Meeting notes, document summaries, article digests — again, the budget models produce clean summaries that are functionally identical to what you'd get from a model costing 15 times more.
Code completion and simple generation. Autocomplete, boilerplate generation, simple function writing, formatting code — GPT-4o-mini handles this without breaking a sweat. Your developers are likely running a premium model for work that doesn't need it.
Where you actually need the premium model
There are genuine use cases where the flagship models earn their price tag.
Complex multi-step reasoning. If you're asking the AI to analyse a legal contract, evaluate a business strategy, or work through a technical problem with multiple dependencies, the premium models produce noticeably better output. They hold context better, make fewer logical errors, and handle nuance that cheaper models miss.
Nuanced long-form writing. Blog posts, reports, proposals — anything where tone, structure, and persuasive quality matter. The cheap models can draft, but the premium models craft. If the output goes directly to a client or the public, it's worth paying for quality.
Advanced code generation. Building complex systems, debugging intricate issues, architecting solutions — this is where models like Claude Sonnet and GPT-4o justify their cost. The cheap models will attempt it but produce more errors that cost developer time to fix.
Analysis of large documents. Processing a 100-page contract, analysing a financial report, or synthesising information from multiple long documents — frontier models with large context windows handle this significantly better than their budget counterparts.
The 70/30 rule
In most businesses, roughly 70% of AI tasks are routine. Emails, summaries, data extraction, simple code, formatting, FAQ responses. These don't need a flagship model.
The remaining 30% — complex reasoning, nuanced writing, advanced code, large document analysis — that's where premium models earn their keep.
The problem is that most businesses run everything through the same model: 100% of tasks on a premium model when 70% of them would be handled just as well by something costing a fraction of the price.
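What routing by task type could look like, as a minimal sketch: it assumes each request can be tagged with a rough category. The category names and model identifiers are illustrative, not a prescribed taxonomy; adjust them to your own workload.

```python
# A minimal routing sketch. Categories and model names are illustrative.

ROUTINE = {"email", "faq", "extraction", "summary", "simple_code"}

def pick_model(task_category: str) -> str:
    """Route routine work to a budget model, complex work to a flagship."""
    if task_category in ROUTINE:
        return "gemini-2.5-flash"    # budget tier
    return "claude-sonnet-4-5"       # premium tier for reasoning-heavy work

print(pick_model("faq"))                # -> gemini-2.5-flash
print(pick_model("contract_analysis"))  # -> claude-sonnet-4-5
```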
The maths
Take a team of 10 people making 100 API calls per day. That's 1,000 calls daily, roughly 22,000 per month.
If everything runs through Claude Sonnet at £0.0048 per call, that's about £106 per month.
Switch the routine 70% to Gemini Flash at £0.00002 per call, and keep the complex 30% on Claude Sonnet. Your monthly cost drops to around £32. Same output quality for the work that matters. £74 per month saved.
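Here's the same arithmetic in code, so you can substitute your own call volumes and per-call costs:

```python
# The sums behind the figures above: 22,000 calls a month, split
# 70/30 between a budget and a premium model.

calls = 10 * 100 * 22               # 10 people x 100 calls/day x ~22 working days
premium, budget = 0.0048, 0.00002   # pounds per call, from the comparison above

all_premium = calls * premium
blended = calls * (0.7 * budget + 0.3 * premium)

print(f"All premium: £{all_premium:.0f}/month")  # ~£106
print(f"70/30 split: £{blended:.0f}/month")      # ~£32
```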
Now scale that. A company with 50 people using AI? The savings are in the hundreds. Multiple teams across an organisation? Thousands.
And that's before Anthropic's stealth tokenizer increase. Opus 4.7 shipped with a new tokenizer that generates up to 35% more tokens for the same text. The price per token didn't change. Your bill per request did. If you weren't tracking usage before, you definitely won't notice this kind of hidden increase.
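To see what that does to a bill, here's the same 500-in, 1,000-out task with 35% more tokens at unchanged per-token prices. The prices are assumptions for illustration only:

```python
# What a 35% token-count increase does at unchanged per-token prices.
# Prices are illustrative assumptions, not a real rate card.

price_in, price_out = 3.0, 15.0    # assumed pounds per million tokens
before = (500 * price_in + 1000 * price_out) / 1e6
after = (500 * 1.35 * price_in + 1000 * 1.35 * price_out) / 1e6

print(f"Before: £{before:.4f}  After: £{after:.4f}")  # same price list, 35% more per request
```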
How to figure out what you're overspending
Step one: find out which models your team is actually using. Not which ones they think they're using — which ones are in the code, in the API calls, in the config files.
Step two: categorise your use cases. Which tasks are routine? Which are complex? Be honest — most people overestimate how many of their tasks need a premium model.
Step three: run the numbers. Take your monthly API usage, split it by the 70/30 rule, and calculate what you'd pay if routine tasks used a budget model. The difference is your potential saving; a minimal sketch of this calculation follows these steps.
Step four: start tracking. A one-off audit helps but usage patterns change. New features get built, new team members join, models get switched without anyone noticing. You need ongoing visibility to keep costs optimised.
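Here's one way to sketch the step-three calculation, assuming you can export usage records with a per-call cost and a routine/complex tag. The field names are illustrative; adapt them to whatever your provider's export actually gives you.

```python
# A minimal sketch of step three. Field names ("category", "cost") are
# illustrative, not a real export schema.

def potential_saving(records, budget_cost_per_call=0.00002):
    """Estimate the saving from moving routine calls to a budget model."""
    routine = [r for r in records if r["category"] == "routine"]
    current = sum(r["cost"] for r in routine)
    rerouted = len(routine) * budget_cost_per_call
    return current - rerouted

usage = [
    {"category": "routine", "cost": 0.0048},  # email draft on a premium model
    {"category": "routine", "cost": 0.0048},  # FAQ reply on a premium model
    {"category": "complex", "cost": 0.0048},  # contract analysis: keep premium
]
print(f"£{potential_saving(usage):.4f} saved across these calls")
```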
How SpendLil helps
SpendLil sits between your application and your AI provider. Add one header to your API calls and every request is tracked: provider, model, tokens, cost, and tags.
You can see exactly which models are being used for which tasks. You can compare costs across providers in real time. When pricing changes — like Anthropic's tokenizer increase — SpendLil updates its cost calculations so you see the real impact immediately.
It also shows you where you can save. If 70% of your API calls are going through an expensive model for simple tasks, that shows up in the dashboard. The recommendation is obvious: switch those calls to a cheaper model and keep the premium for where it matters.
Your API keys are never stored. Requests are never blocked. If SpendLil goes down, your AI keeps running. One header. That's it.
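For illustration, here's what a one-header setup can look like with the OpenAI Python SDK. The gateway URL and header name below are placeholders, not SpendLil's documented values; check the docs for the real ones.

```python
# A hypothetical sketch of the one-header setup. The base_url and header
# name are placeholders, not SpendLil's documented values.

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",            # placeholder gateway endpoint
    default_headers={"X-Tracking-Key": "your-key-here"},  # placeholder header
)

# Requests pass through unchanged; the gateway records provider, model,
# tokens, and cost on the way past.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Draft a polite reply to this email..."}],
)
print(response.choices[0].message.content)
```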
The bottom line
The AI model your team picked six months ago was probably the right choice at the time. But pricing has changed, cheaper alternatives have improved, and nobody's gone back to check.
The difference between running everything on the default model and matching the right model to the right task could save your business hundreds or thousands of pounds per month. And with providers quietly increasing costs through tokenizer changes and usage restrictions, the businesses tracking their spend will be the ones who adapt. Everyone else will just absorb the increase.
Start with the question most businesses have never asked: which model are we using, and does it need to be that one?
Not sure what you're spending?
Take the free AI Shadow Audit — a quick assessment that scores your AI spend visibility, compliance readiness, and data risk.
Take the audit →
Get the newsletter
Weekly updates on AI regulation, costs, and practical guides for UK businesses.
Subscribe →