CavalierAI
Stood up a golden-set eval pipeline + CI gate in 4 weeks; the team's prompt-regression-shipped rate dropped from ~30% of releases to under 3%.
Client / Project
CavalierAI is a Series A startup building an AI-native legal research product. Nine engineers, three lighthouse customers, and a frontier-model prompt at the heart of their product that nobody could change without holding their breath.
Problem
Every prompt edit was a coin flip. The team would tune a system message to improve one customer's queries and silently degrade another customer's queries. The lighthouse customers had started keeping their own informal regression logs and bringing them to weekly check-ins. The CTO had pushed the team to "be more careful" — which translated into pre-merge anxiety, not better tests.
Approach
A 1-engineer Embedded Engineer for 4 weeks. The deliverable was an evaluation pipeline integrated into their existing GitHub Actions CI: a curated golden set of 150 input/expected-output pairs covering each customer's edge cases, a rubric-based LLM grader, and a CI gate that blocks merges when scores drop more than 5% on any sub-bucket.
The harder work was cultural. We paired with each engineer for a session to write three golden examples for a feature they had recently shipped, so the team experienced the eval set as something they owned, not infrastructure imposed on them. The eval set now grows by ~10 examples per week as part of normal feature work.
Outcome
Prompt regressions that historically shipped to production dropped from ~30% of releases to under 3% inside the first month. The lighthouse customers stopped keeping informal regression logs because the team started catching the regressions before they shipped. Less visibly important but more strategically valuable: the team started experimenting with prompt changes more aggressively because the eval set caught them when they went wrong.
