Prompts are code — treat them that way
A prompt that works in the playground often breaks the moment real users ask slightly different questions. We have seen teams rewrite the same system prompt twelve times in a Slack thread because nobody tracked what changed or why the output got worse.
We treat prompts like any other production dependency: versioned, tested, and reviewed before they ship. That means prompt templates stored in git (or a prompt registry), evaluation sets drawn from actual user questions, and automated checks that flag when a change drops accuracy below an agreed threshold.
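As a rough illustration of such an automated check, here is a minimal sketch. It assumes the model is any callable that takes a prompt string and returns text, and that an answer counts as correct when it contains an expected substring. The names PromptVersion, EvalCase, accuracy, and gate, and the 0.9 default threshold, are hypothetical and do not come from any particular library or client project.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptVersion:
    version: str   # assumed identifier, e.g. a git tag or commit hash
    template: str  # must contain a {question} placeholder


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str  # substring the answer must contain to count as correct


def accuracy(prompt: PromptVersion, cases: list[EvalCase],
             model: Callable[[str], str]) -> float:
    """Fraction of cases whose answer contains the expected substring."""
    passed = sum(
        case.expected.lower()
        in model(prompt.template.format(question=case.question)).lower()
        for case in cases
    )
    return passed / len(cases)


def gate(prompt: PromptVersion, cases: list[EvalCase],
         model: Callable[[str], str], threshold: float = 0.9) -> bool:
    """True if this prompt version meets the agreed accuracy threshold."""
    return accuracy(prompt, cases, model) >= threshold
```

In a CI job, a False result from gate would block the merge, which is how a prompt change gets reviewed like any other code change. Substring matching is the simplest possible grader; a real evaluation set would likely need stricter or model-assisted grading.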
Retrieval-augmented generation done properly
RAG is not just "embed your PDFs and hope." Chunk size, overlap, metadata filtering, and re-ranking all affect whether the model sees the right paragraph. We start by looking at the documents you have — product manuals, policy PDFs, API docs, ticket history — and design a chunking strategy around how people actually query them.
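To make chunk size and overlap concrete, here is a minimal sketch of a word-based sliding-window chunker. The function name chunk_words and the default values are assumptions chosen for illustration; production pipelines often split on tokens, sentences, or document headings instead.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `size` words, each sharing
    `overlap` words with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

The overlap exists so that a sentence falling on a chunk boundary still appears whole in at least one chunk. Larger chunks carry more context per hit but dilute the match, which is why the right values depend on how people query the documents.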
For a spice manufacturer we indexed recipes, QC procedures, and supplier contracts separately so production queries did not pull in finance paragraphs. For a B2B marketplace we tagged chunks by product category so search stayed scoped. Small structural decisions like these matter more than swapping embedding models.
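The scoping idea can be sketched as a filter that runs before ranking. This is an illustrative sketch, not the implementation used in those projects: the Chunk type and its collection tag are hypothetical, and the word-overlap score is a deliberate stand-in for embedding similarity so the example runs without any external service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    collection: str  # hypothetical tag, e.g. "recipes", "qc", "contracts"


def score(query: str, chunk: Chunk) -> int:
    # Stand-in for embedding similarity: number of shared lowercase words.
    return len(set(query.lower().split()) & set(chunk.text.lower().split()))


def search(query: str, chunks: list[Chunk], collection: str,
           top_k: int = 3) -> list[Chunk]:
    """Rank only the chunks whose tag matches the requested collection."""
    scoped = [c for c in chunks if c.collection == collection]
    ranked = sorted(scoped, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]
```

Because filtering happens first, a chunk from the wrong collection cannot be returned no matter how well it matches the query text, and that guarantee holds regardless of which embedding model does the ranking.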
Evaluation before and after launch
We build a test set of 50–200 questions with expected answers or grading criteria. Every prompt revision runs against that set, so when accuracy drops we can trace it to the revision that introduced the drop. After launch we add misfires from production logs to the set so the pipeline improves over time instead of drifting.
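A minimal sketch of the two bookkeeping steps described above: comparing per-question results between two revisions, and folding production misfires back into the test set. It assumes results are stored as a mapping from question to pass/fail and the test set as a mapping from question to expected answer. Both function names and both data shapes are hypothetical.

```python
def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Questions that passed under the previous revision but fail
    (or are missing) under the new one."""
    return sorted(q for q, ok in before.items() if ok and not after.get(q, False))


def add_misfires(test_set: dict[str, str],
                 misfires: dict[str, str]) -> dict[str, str]:
    """Return a new test set with production misfires added.
    Existing cases win if a question appears in both."""
    merged = dict(misfires)
    merged.update(test_set)
    return merged
```

Looking at individual questions rather than only the aggregate score shows exactly what a revision broke, even when fixes elsewhere keep the overall accuracy flat.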