In 2023 every product team wanted to 'add AI.' In 2025, many of those experiments have either shipped or quietly died. The ones that survived share a common trait: they solve a specific, measurable problem rather than being AI for the sake of it.
After integrating LLMs into over a dozen client products — ranging from document processing systems to customer support assistants to code review tools — we've developed a set of patterns that consistently produce reliable, cost-effective results.
Start with the Output, Not the Model
Before choosing an LLM or writing a prompt, define exactly what a good output looks like. Write 10 examples of input/output pairs that represent success. This forces you to think about edge cases early and gives you an evaluation set before you've written a single line of code.
Teams that skip this step end up chasing vague quality improvements with no way to measure whether they're making progress.
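As a rough illustration, an evaluation set can start as a list of pairs plus a function that scores any candidate against them. This is a minimal sketch, not our production harness: the `Example` type, the sample tickets, and `keyword_baseline` (a stand-in for a real LLM call) are all hypothetical, and exact-match scoring only suits tasks with a single correct label.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    """One input/output pair describing what success looks like."""
    input_text: str
    expected: str


def evaluate(generate: Callable[[str], str], examples: list[Example]) -> float:
    """Return the fraction of examples whose output matches the expected label."""
    if not examples:
        raise ValueError("evaluation set is empty")
    passed = sum(
        1
        for ex in examples
        if generate(ex.input_text).strip().lower() == ex.expected.strip().lower()
    )
    return passed / len(examples)


# Hypothetical examples for a support-ticket classifier.
EXAMPLES = [
    Example("I was charged twice this month", "billing"),
    Example("The app crashes when I upload a PDF", "bug"),
    Example("How do I export my data?", "how-to"),
]


def keyword_baseline(text: str) -> str:
    """Stand-in for a real LLM call so the harness runs offline."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how-to"


score = evaluate(keyword_baseline, EXAMPLES)
```

Swapping `keyword_baseline` for a function that calls your model gives you a single number to track as prompts and models change.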
Prompt Engineering Is Engineering
System prompts are code. They should be version-controlled, reviewed, and tested like any other part of your application. We keep prompts in a dedicated file (or database row for dynamic prompts) and run them through a test suite of our 10+ example pairs before every deployment.
- Be explicit about output format — JSON schema, Markdown headings, bullet lists
- Include negative examples ('Do not include…') for recurring failure modes
- Set a persona and tone that matches your product's voice
- Keep system prompts under 1,000 tokens where possible — shorter prompts are faster and cheaper
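One way to make "be explicit about output format" testable is to validate every reply against the shape the prompt asks for. The sketch below assumes a hypothetical prompt that requests a JSON object with `category` and `confidence` fields; the field names and the allowed categories are illustrative only.

```python
import json

# Hypothetical label set that the system prompt asks the model to choose from.
ALLOWED_CATEGORIES = {"billing", "bug", "how-to"}


def parse_model_output(raw: str) -> dict:
    """Parse a model reply and check it against the expected shape.

    Raises ValueError so a test suite or caller can treat a malformed
    reply as a failure instead of passing bad data downstream.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return {"category": category, "confidence": float(confidence)}
```

Running a check like this over the example pairs before each deployment catches format regressions that a prompt edit can introduce.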
Model Selection: GPT-4o vs Claude 3.5 Sonnet
We've shipped features using both. In our experience, GPT-4o performs better on tasks that require following complex multi-step instructions, while Claude 3.5 Sonnet tends to produce cleaner structured output and handles very long documents better. For most classification or extraction tasks, the cheaper tiers (GPT-4o mini, Claude Haiku) are sufficient and roughly 10–20× cheaper per token at the time of writing.
Start with the cheapest model that meets your quality bar and upgrade only if you have measurable evidence that a more expensive model performs better on your specific task.
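The selection rule can be written down directly once each candidate has an evaluation score and a measured cost. This is a sketch under assumptions: the `Candidate` type, the model names, and every number below are placeholders, not benchmark results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_requests: float  # dollars, measured on your own traffic
    eval_score: float            # fraction of your evaluation set passed


def cheapest_passing(
    candidates: list[Candidate], quality_bar: float
) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the quality bar, or None."""
    passing = [c for c in candidates if c.eval_score >= quality_bar]
    return min(passing, key=lambda c: c.cost_per_1k_requests, default=None)


# Placeholder figures for illustration only.
CANDIDATES = [
    Candidate("large-model", cost_per_1k_requests=12.0, eval_score=0.97),
    Candidate("small-model", cost_per_1k_requests=0.8, eval_score=0.93),
]

choice = cheapest_passing(CANDIDATES, quality_bar=0.90)
```

With a bar of 0.90 the small model is chosen; raising the bar to 0.95 is what would justify paying for the larger one.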
Streaming for Perceived Performance
For user-facing features, stream the response token-by-token rather than waiting for the full completion. Watching text appear feels faster to a user than staring at a spinner for 3 seconds, even when the total time is the same. Both the OpenAI and Anthropic APIs support streaming via server-sent events. In Next.js, the Edge Runtime and `ReadableStream` make this straightforward to implement.
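To show what arrives over the wire, here is a simplified server-sent events reader in Python (in a Next.js app you would do the equivalent with `ReadableStream`). It is a sketch with stated assumptions: only `data:` fields and comment lines are handled, and the `{"text": ...}` payload shape and the `[DONE]` end marker are placeholders, since each provider defines its own event payloads.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Simplified: `event:`, `id:` and `retry:` fields are ignored, and an
    event that is not terminated by a blank line is discarded.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)


def iter_text_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Turn events into text fragments that the UI can append as they arrive."""
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # assumed end-of-stream marker
            return
        yield json.loads(payload)["text"]  # assumed payload shape


# Simulated stream; in practice these lines come from the HTTP response body.
SAMPLE_STREAM = [
    'data: {"text": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"text": "lo"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]

streamed_text = "".join(iter_text_chunks(SAMPLE_STREAM))
```

In a real handler each fragment would be forwarded to the browser immediately rather than joined at the end.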
Cost Control at Scale
- Cache deterministic prompts: if the same input should always produce the same output (for example, classification at temperature 0), cache the result in Redis with a TTL
- Summarise long documents with a cheaper model before sending them to your main LLM: a one-off summarisation pass is cheap, while repeated full-document processing by an expensive model is not
- Set `max_tokens` explicitly to prevent runaway completions
- Log every API call with input tokens, output tokens, and cost — you can't optimise what you can't measure
- Use batch APIs for non-real-time workloads (the OpenAI Batch API offers a 50% discount)
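The caching and cost-logging items above can be sketched together. Everything here is illustrative: `TTLCache` is an in-memory stand-in for Redis, `call_model` is a hypothetical function wrapping your provider SDK, and the prices passed to `call_cost` are whatever your provider currently charges per million tokens.

```python
import hashlib
import json
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for a Redis key with an expiry, for illustration only."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired, behave as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def cache_key(model: str, system_prompt: str, user_input: str) -> str:
    """Hash everything that influences the output, so a prompt change misses old entries."""
    payload = json.dumps([model, system_prompt, user_input], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(
    cache: TTLCache,
    call_model: Callable[[str, str, str], str],
    model: str,
    system_prompt: str,
    user_input: str,
) -> str:
    """Return a cached result when available, otherwise call the model and store it."""
    key = cache_key(model, system_prompt, user_input)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = call_model(model, system_prompt, user_input)
    cache.set(key, result)
    return result


def call_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of one call, given per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
```

Including the model name and system prompt in the key means a prompt revision never serves stale answers, at the price of a cold cache after each prompt change.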
Handling Failures Gracefully
LLM APIs have higher latency and lower reliability than typical databases. Design your integrations to degrade gracefully: show the user a fallback UI if the LLM call times out, implement exponential backoff for rate limit errors, and never block your main application flow waiting for an AI response if the feature isn't critical-path.
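A minimal sketch of that policy follows: retry rate-limit errors with exponential backoff and jitter, give up immediately on a timeout, and return `None` so the caller can render the fallback UI. `RateLimitError` is a hypothetical stand-in for whatever exception your provider SDK raises, and the default delays are arbitrary starting points.

```python
import random
import time
from typing import Callable, Optional


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit exception."""


def call_with_backoff(
    call: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `call`, retrying rate-limit errors; return None if no result is available."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                break  # out of attempts, fall through to the fallback
            # Double the ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
        except TimeoutError:
            break  # do not keep the user waiting on a slow call
    return None


def summary_or_fallback(call: Callable[[], str]) -> str:
    """Example caller: degrade to a static message instead of failing the page."""
    result = call_with_backoff(call)
    return result if result is not None else "Summary unavailable right now."
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays.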
Retrieval-Augmented Generation (RAG)
If your feature needs to answer questions about proprietary or recent data, RAG is almost always the right architecture. Embed your documents, store embeddings in a vector database (we've used pgvector, Pinecone, and Weaviate — pgvector is sufficient for most use cases under 1M documents), and retrieve relevant chunks at query time before sending them to the LLM.
RAG avoids the cost and complexity of fine-tuning, keeps your data out of training sets, and lets you update the knowledge base without retraining or redeploying a model.
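The retrieval step can be illustrated without any external services. This is a toy sketch: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database such as pgvector, and the sample chunks and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Place the retrieved chunks ahead of the question in the prompt."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge-base chunks.
CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Enterprise plans include single sign-on.",
]

top_chunk = retrieve("How many days do I have to request a refund?", CHUNKS, k=1)
```

Updating the knowledge base here means editing `CHUNKS`; in a real deployment it means inserting or re-embedding rows, with no change to the model itself.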
The Features Worth Building
- Document extraction (invoices, contracts, medical records) — high accuracy, clear ROI
- Customer support triage — classify and route tickets before a human reviews them
- Internal search over company knowledge bases — RAG over documentation or CRM notes
- Code review assistance — summarise diffs, flag patterns, suggest tests
- Report generation — turn structured data into readable prose
The features not worth building (yet): open-ended chatbots with no guardrails, autonomous agents for consequential decisions, and any feature where you can't define what a good output looks like.
Conclusion
LLMs are genuinely powerful, but they're a component in a larger system — not a product by themselves. The teams shipping durable AI features treat prompts as code, measure quality rigorously, control costs from day one, and build for graceful degradation. If you're planning an AI integration and want a second opinion on your architecture, we're happy to help.