Your model isn't what's eating your budget. Your serving stack is — and most teams haven't looked at it since they shipped to production.

Here's a number that should reframe how you think about AI costs: inference is already 55% of AI infrastructure spending, up from 33% three years ago. By 2030, analysts expect it to hit 75–80%. Training gets all the press. Inference pays all the bills.

This matters because training and inference are fundamentally different problems. Training is a one-time event. Inference is a forever tax. Thousands of concurrent, latency-sensitive requests, around the clock. Most teams inherit a stack built for training and use it for production inference. The result: they benchmark new models obsessively and never look at how well they're serving the models they already have.

───

Midjourney is the proof.

In mid-2025, Midjourney migrated its image generation fleet from NVIDIA A100/H100 clusters to Google Cloud TPU v6e pods. Same models. Same output volume. Different hardware, matched to the actual workload.

CEO David Holz disclosed in a Discord message that the migration took six weeks and cut monthly inference costs from $2.1M to under $700K — a 65% reduction with an eleven-day payback period. Just a quiet infrastructure decision that saved $17M+ annually.

They didn't find a cheaper model. They ran the same models on hardware designed for production inference instead of training. That's the whole story.

───

Why does this happen?

GPUs are excellent for training — batch-oriented, predictable workloads. Production inference is different: latency-constrained, concurrent, with consistent request shapes at scale. That's exactly what TPUs and ASICs are built for. When your traffic is consistent and predictable, specialist hardware beats general-purpose GPUs on cost-per-token by a significant margin.

───

One thing to do this week:

Pull your GPU utilization during peak inference load. Under 60%? You're leaving money on the table. Ask yourself:

• Are you using continuous batching?
• Is your hardware matched to inference, or did it start as a training cluster?
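If you want to script the utilization check, here's a minimal sketch. It assumes `nvidia-smi` is on the box; the function names (`parse_utilization`, `peak_utilization_check`) and the 60% threshold are illustrative, not a standard tool. Run it during peak load, not off-hours.

```python
import csv
import io
import subprocess

def parse_utilization(csv_text: str) -> float:
    """Average GPU utilization (%) from nvidia-smi CSV output
    (--query-gpu=utilization.gpu --format=csv,noheader,nounits)."""
    values = [float(row[0]) for row in csv.reader(io.StringIO(csv_text)) if row]
    return sum(values) / len(values)

def peak_utilization_check(threshold: float = 60.0) -> None:
    # One sample across all GPUs on this host; for a real picture,
    # collect samples over your peak traffic window instead.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    avg = parse_utilization(out)
    verdict = "OK" if avg >= threshold else "below threshold: money on the table"
    print(f"average GPU utilization: {avg:.1f}% ({verdict})")
```

One snapshot lies; idle gaps between bursts are exactly what continuous batching is meant to fill, so average over your real peak window before drawing conclusions.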

You don't need to migrate to TPUs tomorrow. But know the size of the gap before you approve the next GPU spend request.

───

The companies pulling ahead on AI aren't running the best models. They're running good models extremely well.

───

Forward this to one person on your team who should be reading it. If someone forwarded this to you — subscribe at clustermind.io.

ClusterMind — independent analysis for AI infrastructure professionals. No vendor funding. No filler.
