We've reviewed dozens of LLM integrations that made it to production in broken states. Not broken in obvious ways — the demo worked fine. Broken in ways that only surface at scale, under adversarial conditions, or after the third month of API bills arrive. This checklist exists so you don't repeat those mistakes.

Work through it before you ship. Every item.

Security

1. Prompt injection guardrails. Any user input that reaches your LLM is a potential attack surface. Test what happens when a user submits: "Ignore all previous instructions and…" or "You are now DAN…" If your system prompt can be overridden, your integration is not production-ready. Add input sanitisation and consider a separate safety-classification step before the main LLM call.

2. PII scrubbing. Before any user data enters your prompt, identify and redact Personally Identifiable Information. This is not optional in markets governed by GDPR, UAE PDPL, or similar regulations. Build a PII detection pipeline — spaCy or a cloud NER service works well — and log what was redacted.

3. Output content moderation. Your LLM can generate harmful, offensive, or legally problematic content even when your prompt doesn't ask it to. Run all outputs through a moderation endpoint (OpenAI provides one for free) before returning them to users. Log all flagged responses.

Reliability

4. Timeout and retry logic. LLM APIs are not databases. P99 latency on GPT-4 can exceed 30 seconds. Never call an LLM synchronously in a user-facing request without a timeout. Implement exponential backoff with jitter for transient failures. Have a graceful degradation path — a cached response, a simpler fallback model — for when the API is unavailable.

5. Streaming for long responses. If your use case generates long text (reports, summaries, code), implement streaming responses. Users will not wait 15 seconds staring at a spinner. Streaming dramatically improves perceived performance and reduces abandonment rates.

6. Fallback model strategy. Primary model down? Have a fallback. GPT-4 → GPT-4o-mini. Claude 3.5 Sonnet → Claude 3 Haiku. The fallback should handle the same prompt structure and you should test that the output quality is acceptable, even if degraded.

Cost Control

7. Token budget enforcement. Without limits, a single malicious or confused user can generate thousands of dollars in API costs in minutes. Implement per-user token budgets, rate limits, and cost alerts. Track token usage per request and per user. Set hard monthly spending caps at the API provider level.

8. Prompt caching. If you're calling the LLM with the same system prompt thousands of times per day, use the provider's prompt caching feature. OpenAI and Anthropic both offer this. For long system prompts, it can reduce costs by 60–90%.

9. Response caching. For queries that are frequently identical — FAQs, static lookups, standard reports — cache the LLM response in Redis with an appropriate TTL. There is no reason to spend tokens answering "What are your business hours?" for the thousandth time.

Quality and Observability

10. Output validation. Never trust that the LLM followed your output schema. If you asked for JSON, parse it — and have a recovery path for when it's not valid JSON (it happens). If you asked for structured data, validate every field. Hallucinated field values in production are a support nightmare.

11. Evaluation dataset. Before shipping, build a set of 50–100 golden test cases: input prompts and expected outputs. Run your integration against this dataset before every deployment. LLM behaviour can drift between model versions — you need to catch it before users do.

12. Observability and tracing. Log every LLM call: the full prompt (sanitised of PII), the model used, the token counts, the latency, and the response. Use a tool like LangSmith, Weights & Biases, or even a simple structured log to a data warehouse. You cannot debug a production LLM system without this data. You cannot improve prompt quality without it either.

One More Thing

This checklist assumes you've already done the higher-level work: defined a clear use case, validated that an LLM is actually the right tool for the job, and established what "good" looks like for your outputs. If you haven't, talk to us first — we spend a lot of time helping teams avoid building the wrong AI thing very, very correctly.