Testing non-deterministic LLM output without losing your mind
The quietest failure mode in AI-native shipping isn't model hallucination; it's regression. A change to the prompt, a model version bump, or a tweak to the retrieval pipeline silently degrades output quality, and the team only notices when a user complains. Unit tests don't catch it, because the assertions they can make about free-form text tend to be too weak: "the test passes if the output contains the word 'invoice'" says nothing about whether the output got worse.
The pattern that works in production: a golden-set evaluation harness running alongside unit tests in CI, with two distinct gates.
The golden set is a curated bundle of input → expected-style-output pairs. You don't assert exact equality; you assert that the new output is at least as good as the previous output on each example. Scoring can be a separate LLM grading on a rubric, an embedding-similarity score, or simple heuristics like length / format / keyword presence — whichever is cheapest to maintain for the use case.
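As a rough illustration of the cheapest option, here is a minimal sketch of a golden-set entry with a heuristic scorer and a per-example comparison against the previous output. Everything named here (GoldenExample, heuristic_score, regressed_examples, the keyword and length fields, the equal weighting) is a hypothetical choice for the example, not a prescribed schema. An LLM grader or embedding-similarity score could replace heuristic_score without changing the comparison.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    """One golden-set entry: an input plus style expectations, not an exact answer."""
    name: str
    prompt: str
    required_keywords: list[str] = field(default_factory=list)  # hypothetical field
    max_words: int = 200                                         # hypothetical field


def heuristic_score(example: GoldenExample, output: str) -> float:
    """Score an output in [0, 1] from keyword coverage and a length check."""
    text = output.lower()
    if example.required_keywords:
        hits = sum(1 for kw in example.required_keywords if kw.lower() in text)
        keyword_score = hits / len(example.required_keywords)
    else:
        keyword_score = 1.0
    length_score = 1.0 if len(output.split()) <= example.max_words else 0.0
    # Equal weighting is an arbitrary assumption made for illustration.
    return 0.5 * keyword_score + 0.5 * length_score


def regressed_examples(
    examples: list[GoldenExample],
    baseline_outputs: dict[str, str],
    new_outputs: dict[str, str],
) -> list[str]:
    """Return the names of examples whose new score is below the baseline score."""
    regressed = []
    for ex in examples:
        old = heuristic_score(ex, baseline_outputs[ex.name])
        new = heuristic_score(ex, new_outputs[ex.name])
        if new < old:
            regressed.append(ex.name)
    return regressed
```

The comparison is relative on purpose: the scorer never decides whether an output is "correct", only whether it scored lower than the last accepted output for the same input.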
The threshold doesn't need to be 100%. We usually set it at "no regression beyond a 5% noise floor" and let small fluctuations through. The real value isn't catching every drift — it's catching the cliff, the moment a prompt edit drops scores by 30% because someone forgot the system message.
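One way such a threshold could be expressed is sketched below. It assumes the 5% figure means a relative drop in the mean score across the golden set, which is one interpretation among several (a per-example tolerance would also be reasonable). The names NOISE_FLOOR, relative_drop, and passes_gate are hypothetical.

```python
from statistics import mean

# Assumption: the noise floor is a tolerated relative drop in the mean score.
NOISE_FLOOR = 0.05


def relative_drop(baseline_scores: list[float], new_scores: list[float]) -> float:
    """Relative drop in mean score; a positive value means the new run is worse."""
    base = mean(baseline_scores)
    if base == 0:
        return 0.0  # a baseline of zero leaves nothing to regress from
    return (base - mean(new_scores)) / base


def passes_gate(
    baseline_scores: list[float],
    new_scores: list[float],
    noise_floor: float = NOISE_FLOOR,
) -> bool:
    """Return True when the drop stays within the noise floor."""
    return relative_drop(baseline_scores, new_scores) <= noise_floor
```

With baseline scores of 0.90, 0.80, and 0.85, a new run scoring 0.88, 0.80, and 0.84 drops by roughly 1% and passes. A run scoring 0.60, 0.55, and 0.62 drops by roughly 30% and fails, which is the cliff case. In CI, a script would typically turn the boolean into a non-zero exit code to fail the job.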
The team-design implication: every engineer shipping LLM-touching code owns adding two or three golden examples for their feature. This becomes a team habit, like writing tests. Skip it and your eval set rots within months, which is worse than not having one, because a stale set keeps reporting green on features it no longer covers.
