An AI feature that delights users in the demo can quietly become your biggest line item at scale. These levers cut LLM costs without degrading the experience.
1. Prompt Caching
If every request shares a large fixed prefix (a system prompt, a document, few-shot examples), cache it. Cached tokens cost a fraction of fresh ones — often a 90% reduction on the repeated portion. Keep the stable content first and the variable content last.
2. Route to the Right Model
Don't use your most powerful model for everything. Send simple classification and extraction to a small, cheap model; reserve the flagship for genuinely hard reasoning. A router that picks per request can slash spend.
3. Trim the Context
- Retrieve only the top few relevant chunks for RAG, not everything.
- Summarize or compact long conversation histories instead of resending them whole.
- Cap output with sensible max_tokens — runaway generations are pure waste.
4. Batch Non-Urgent Work
For analytics, tagging, or overnight processing that isn't latency-sensitive, batch APIs run the same requests at roughly half price.
Measure First
Log tokens per request and cost per feature before optimizing. You'll usually find one or two endpoints driving most of the bill — fix those first.
