Engineering · 8 min read

7 Things We Learned Building LLM Features in Production

After shipping AI-powered products across healthcare, real estate, and enterprise SaaS, here are the engineering lessons that actually matter — not the ones that sound good in blog posts.

Diamond Tier Solutions

We have spent the past two years building LLM-powered features into production software — not demos, not internal tools, but customer-facing products that handle real data and real consequences. Here is what we actually learned.

1. Prompt engineering is an engineering discipline, not a hobby

The difference between a prompt that works in a notebook and one that works reliably in production is version control, regression testing, and staged rollouts. Treat your prompts like code. Check them into source control. Write evals that catch regressions when you update them. If you cannot measure whether a prompt change made things better or worse, you cannot ship it safely.

We use a simple eval framework: a fixed set of inputs with expected outputs or expected output characteristics, run automatically on every prompt change before it hits production.
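What that looks like in practice, as a minimal sketch: the case-file format, the run_prompt helper, and the pass criteria below are illustrative assumptions, not a specific framework.

```python
# prompt_evals.py -- run in CI on every prompt change; fail the build on regression.
import json
import re

def run_prompt(prompt_version: str, user_input: str) -> str:
    """Call your LLM with the prompt under test. Stubbed here."""
    raise NotImplementedError

def run_evals(cases_path: str, prompt_version: str) -> bool:
    with open(cases_path) as f:
        cases = json.load(f)  # [{"input": ..., "must_match": ..., "must_not_match": ...}]
    failures = []
    for case in cases:
        output = run_prompt(prompt_version, case["input"])
        if case.get("must_match") and not re.search(case["must_match"], output):
            failures.append((case["input"], "missing expected pattern"))
        if case.get("must_not_match") and re.search(case["must_not_match"], output):
            failures.append((case["input"], "contains forbidden pattern"))
    for inp, reason in failures:
        print(f"FAIL ({reason}): {inp[:60]}")
    return not failures

if __name__ == "__main__":
    raise SystemExit(0 if run_evals("eval_cases.json", prompt_version="v12") else 1)
```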

2. Structured output is non-negotiable for anything downstream

If your LLM output feeds into another system — a database write, a UI render, a business logic check — you need structured output. Every major model provider now supports JSON schema enforcement natively (Anthropic's tool use, OpenAI's structured outputs, Google's response schema). Use it. Parsing free-form text output and hoping it conforms to a shape is a production incident waiting to happen.
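A hedged sketch of the Anthropic tool-use variant; the tool name, schema fields, and model ID are illustrative placeholders, not a prescription:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

raw_invoice_text = "ACME Corp, invoice total $1,240.00, due 2025-07-01"  # placeholder input

record_invoice = {
    "name": "record_invoice",
    "description": "Record the extracted invoice fields.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total_cents": {"type": "integer"},
            "due_date": {"type": "string"},
        },
        "required": ["vendor", "total_cents", "due_date"],
    },
}

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # any current model ID
    max_tokens=1024,
    tools=[record_invoice],
    tool_choice={"type": "tool", "name": "record_invoice"},  # force the structured call
    messages=[{"role": "user", "content": f"Extract the invoice fields:\n\n{raw_invoice_text}"}],
)

# With a forced tool call, the response contains a tool_use block whose
# .input is already a dict conforming to the schema above.
invoice = next(b for b in message.content if b.type == "tool_use").input
```

OpenAI's structured outputs and Google's response schema offer the same guarantee through different request fields: declare the shape up front, get output that conforms to it.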

3. Latency will kill your UX before cost does

Most teams obsess over API cost and underestimate latency impact on conversion and retention. A 6-second wait for an AI-generated response in a form field or a chat interface is not acceptable to users. Streaming solves the perception problem even when the total time is the same — users tolerate a typing animation far better than a blank screen.
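Every major SDK supports streaming out of the box. A minimal sketch with the Anthropic Python SDK, where the model ID and prompt are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

# Render tokens as they arrive instead of blocking on the full response.
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this listing for a first-time buyer: ..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # in production: forward over SSE or a WebSocket
```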

For operations where latency cannot be hidden by streaming, move to async: fire the LLM call in the background, return a job ID immediately, and deliver the result via webhook or polling. Users accept a 'processing' state far better than they accept a frozen interface.
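A minimal in-process sketch of that pattern. In production the job store would be Redis or a database and the worker a real queue, but the shape is the same; call_llm is an assumed stub.

```python
import threading
import uuid

jobs: dict[str, dict] = {}  # assumption: swap for Redis or a DB table in production

def call_llm(payload: dict) -> str:
    """The slow LLM call. Stubbed here."""
    raise NotImplementedError

def submit(payload: dict) -> str:
    """Return a job ID immediately; run the LLM call off the request path."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing", "result": None}

    def work():
        try:
            jobs[job_id] = {"status": "done", "result": call_llm(payload)}
        except Exception as exc:
            jobs[job_id] = {"status": "failed", "result": str(exc)}

    threading.Thread(target=work, daemon=True).start()
    return job_id

def poll(job_id: str) -> dict:
    """Clients poll this, or you fire a webhook when status flips to done."""
    return jobs.get(job_id, {"status": "unknown", "result": None})
```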

4. Context window management is still important even with large windows

Yes, Claude and Gemini now support million-token contexts. No, you should not just shove everything in. Token cost scales linearly with input size, latency scales roughly linearly with it too, and past a certain input size the model gets worse at picking the relevant passage out of the noise. Build a retrieval layer for large corpora. Use the large context window as a safety net for edge cases, not as the primary architecture.
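The retrieval layer does not have to be elaborate. A sketch of the core operation, assuming an embed helper around your embedding model and pre-chunked documents:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Call your embedding model; return one L2-normalized vector per text. Stubbed here."""
    raise NotImplementedError

def top_k_chunks(question: str, chunks: list[str], k: int = 8) -> list[str]:
    """Send the model the k most relevant chunks, not the whole corpus."""
    chunk_vecs = embed(chunks)       # in production: precomputed and held in a vector index
    q_vec = embed([question])[0]
    scores = chunk_vecs @ q_vec      # cosine similarity, since the vectors are normalized
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

# The prompt then carries a handful of focused chunks:
# context = "\n\n".join(top_k_chunks(user_question, corpus_chunks))
```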

5. You need an evals strategy before you go to production

If you cannot answer 'how do I know this got better or worse after my last change,' you do not have a production AI feature — you have a prototype in a production environment. Evals do not have to be complex. They can be as simple as 50 representative inputs scored by a secondary LLM call that checks whether the output meets criteria. What matters is that you run them consistently and look at the results.
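A sketch of that simplest version, with an LLM-as-judge scorer. Here call_model is an assumed one-shot helper around whatever provider SDK you use, and the grading criteria are illustrative:

```python
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Criteria: consistent with the source, answers the question, no fabricated details.
Source: {source}
Question: {question}
Answer: {answer}
Reply with exactly PASS or FAIL."""

def call_model(prompt: str) -> str:
    """One-shot completion against a cheap judge model. Stubbed here."""
    raise NotImplementedError

def judge(source: str, question: str, answer: str) -> bool:
    verdict = call_model(JUDGE_PROMPT.format(source=source, question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")

def pass_rate(cases: list[dict]) -> float:
    """cases: ~50 representative dicts with source, question, and the feature's answer."""
    results = [judge(c["source"], c["question"], c["answer"]) for c in cases]
    return sum(results) / len(results)
```

Track the pass rate per prompt version and alert on drops; the trend matters more than the absolute number.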

6. Guardrails are cheaper than incidents

Every LLM feature exposed to users needs input and output guardrails. Input: block prompt injection patterns, enforce length limits, sanitize. Output: validate structure, check for sensitive data patterns, implement a content policy appropriate for your use case. The engineering cost is low. The alternative — a hallucination or injection that makes it into a customer record or a public-facing display — is very high.
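A minimal sketch of both sides. The specific patterns, limits, and sensitive-data check below are illustrative, not a complete policy:

```python
import json
import re

MAX_INPUT_CHARS = 4_000
INJECTION_PATTERNS = [
    r"(?i)ignore (all )?previous instructions",
    r"(?i)disregard your system prompt",
]
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # example sensitive-data pattern (US SSN)

def check_input(user_input: str) -> str:
    if len(user_input) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_input):
            raise ValueError("input matches a known injection pattern")
    return user_input

def check_output(raw: str) -> dict:
    data = json.loads(raw)  # structure check: reject anything that is not valid JSON
    if SSN.search(raw):     # content check: never pass a sensitive pattern downstream
        raise ValueError("output contains a sensitive-data pattern")
    return data
```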

7. The model is rarely the bottleneck

After the first few months of any LLM project, the quality problems almost never come from the model being bad. They come from bad context (the retrieval is not finding the right information), bad prompts (the instruction is ambiguous or under-specified), or bad evals (you did not notice the regression). Before upgrading to a more expensive model, audit your context quality and your prompt design. You will almost always find a cheaper fix there first.

Working on an AI integration?

We design and build production-ready AI systems. Tell us what you are working on.
