RAG vs. fine-tuning: a working heuristic
Every other engagement we start has a team that thinks they need fine-tuning. Almost none of them do. Here's the rule we use:
Use RAG when the answer lives in your data. Customer support tickets, product docs, internal wikis, regulatory PDFs — anything where the right answer is "look it up." Retrieval is the cheaper, faster, more debuggable path. You can add new data without retraining anything. When the model is wrong, you can see which chunks it grounded on.
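To make the "you can see which chunks it grounded on" point concrete, here is a minimal sketch of the retrieval half of a RAG pipeline. It assumes a toy in-memory corpus and plain bag-of-words cosine similarity in place of a real embedding model and vector store. The names `CHUNKS`, `retrieve`, and `build_prompt` are hypothetical, chosen for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: each chunk keeps an id so a wrong answer can be traced
# back to the text the model was shown.
CHUNKS = [
    {"id": "docs/reset-password",
     "text": "To reset your password, open Settings and choose Reset password."},
    {"id": "docs/billing",
     "text": "Invoices are issued on the first day of each month."},
    {"id": "wiki/oncall",
     "text": "The on-call engineer rotates every Monday."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return up to k (score, chunk) pairs with nonzero similarity, best first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(c["text"]))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, chunk) for score, chunk in scored[:k] if score > 0]


def build_prompt(query, hits):
    """Assemble a prompt that labels every retrieved chunk with its source id."""
    context = "\n".join(f"[{chunk['id']}] {chunk['text']}" for _, chunk in hits)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    question = "How do I reset my password?"
    hits = retrieve(question, CHUNKS)
    for score, chunk in hits:
        print(f"{score:.2f}  {chunk['id']}")  # the debugging view: what was retrieved
    print(build_prompt(question, hits))
```

Adding new data here means appending to `CHUNKS`, and nothing is retrained. When an answer is wrong, logging the retrieved ids is usually the first place to look.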
Use fine-tuning when the answer lives in your style. Output format, tone, structured JSON schemas the base model gets subtly wrong, domain-specific reasoning patterns the model wasn't trained on. Style is the thing prompts can demonstrate but not consistently enforce — fine-tuning bakes the pattern in.
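One way to check whether a prompt is "demonstrating but not consistently enforcing" a format is to measure how often the model's raw outputs conform to the schema. The sketch below is illustrative only: the schema in `REQUIRED_FIELDS` is a hypothetical example, and the sample outputs are made up to show the check, not taken from any real model.

```python
import json

# Hypothetical target schema: field name -> required Python type.
REQUIRED_FIELDS = {"ticket_id": str, "priority": str, "tags": list}


def conforms(raw_output):
    """True if the output parses as a JSON object with every required field and type.

    Extra fields are tolerated; tighten this if your schema forbids them.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(name), typ) for name, typ in REQUIRED_FIELDS.items())


def compliance_rate(outputs):
    """Fraction of raw model outputs that conform to the schema."""
    if not outputs:
        raise ValueError("need at least one output to score")
    return sum(conforms(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    # Made-up outputs showing the subtle failures a prompt alone may not prevent.
    samples = [
        '{"ticket_id": "T-1", "priority": "high", "tags": ["billing"]}',
        '{"ticket_id": 7, "priority": "high", "tags": []}',  # wrong type
        'Sure! Here is the JSON you asked for: {"ticket_id": "T-2"}',  # chatty wrapper
    ]
    print(f"compliance: {compliance_rate(samples):.0%}")
```

If the rate stays stubbornly below what the downstream system needs even with good few-shot examples in the prompt, that is the kind of evidence that points toward fine-tuning.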
Use both, occasionally. A fine-tuned model that retrieves over fresh data is the right answer for a few specific cases — usually high-volume, latency-sensitive, structured-output systems. Most teams aren't there yet and shouldn't pretend they are.
The failure mode we see most often: a team spends two months building a fine-tuning pipeline because they decided their problem was "the model doesn't know our docs." Fine-tuning is almost never the fix for that. The problem is retrieval. Cheaper test: stand up a basic RAG pipeline in an afternoon and see whether the answers improve. If they do, you have your answer.
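The afternoon test can be as small as the sketch below: run the same handful of questions through the model with and without retrieval and compare. Everything here is an assumption for illustration. `baseline_fn` and `rag_fn` stand in for whatever functions call your model, and substring matching against an `expected` string is a crude stand-in for a proper evaluation.

```python
def answer_hit_rate(questions, answer_fn):
    """Fraction of questions whose answer contains the expected string (case-insensitive)."""
    if not questions:
        raise ValueError("need at least one question to score")
    hits = sum(
        1 for q in questions
        if q["expected"].lower() in answer_fn(q["question"]).lower()
    )
    return hits / len(questions)


def compare(questions, baseline_fn, rag_fn):
    """Score both pipelines on the same questions and report the difference."""
    baseline = answer_hit_rate(questions, baseline_fn)
    rag = answer_hit_rate(questions, rag_fn)
    return {"baseline": baseline, "rag": rag, "delta": rag - baseline}


if __name__ == "__main__":
    # Hypothetical eval set; in practice, pull 20 to 50 real questions with known answers.
    questions = [
        {"question": "What is the refund window?", "expected": "30 days"},
        {"question": "Who approves expenses?", "expected": "your manager"},
    ]

    # Stubs standing in for real model calls, so the sketch runs on its own.
    def baseline_fn(question):
        return "I am not sure."

    def rag_fn(question):
        canned = {
            "What is the refund window?": "Refunds are accepted within 30 days.",
            "Who approves expenses?": "Your manager approves expenses.",
        }
        return canned[question]

    print(compare(questions, baseline_fn, rag_fn))
```

A clearly positive `delta` on real questions is the signal that the problem was retrieval all along.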
