RAG or a 1M-token context window: pick one, not both
A year ago, RAG was the default answer to any "ground the LLM in our docs" problem. Today, several frontier models accept context windows of 1M tokens or more. The honest engineering question is no longer "how do I build retrieval?" but "do I need retrieval at all?"
The cost math has flipped for small corpora once you count infrastructure and engineering time alongside tokens per query. If your knowledge base is under ~500k tokens and doesn't change every hour, you're often better off putting the whole thing in the prompt than running a vector store, an embedding service, and a re-ranker. That means less infrastructure to operate and fewer retrieval-quality failure modes, and the model can usually locate the relevant passages on its own.
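One way to sanity-check the cost argument above is a back-of-the-envelope comparison. The sketch below is only an illustration: `CostInputs`, its fields, and every number you feed it are hypothetical placeholders, not real vendor prices. It also ignores output tokens, any caching discounts, and engineering time, so treat the result as a rough lower bound on what RAG has to save to pay for itself.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    corpus_tokens: int                   # size of the whole knowledge base
    rag_tokens_per_query: int            # retrieved chunks sent with each query
    usd_per_million_input_tokens: float  # placeholder price, not a real quote
    rag_fixed_usd_per_month: float       # placeholder: vector store, embeddings, upkeep


def long_context_monthly_usd(c: CostInputs, queries: int) -> float:
    """Every query pays for the full corpus as input tokens."""
    return queries * c.corpus_tokens / 1_000_000 * c.usd_per_million_input_tokens


def rag_monthly_usd(c: CostInputs, queries: int) -> float:
    """Fixed infrastructure cost plus a small prompt per query."""
    per_query = c.rag_tokens_per_query / 1_000_000 * c.usd_per_million_input_tokens
    return c.rag_fixed_usd_per_month + queries * per_query


def break_even_queries(c: CostInputs) -> float:
    """Monthly query volume above which RAG becomes the cheaper option."""
    extra_per_query = (
        (c.corpus_tokens - c.rag_tokens_per_query)
        / 1_000_000
        * c.usd_per_million_input_tokens
    )
    if extra_per_query <= 0:
        return float("inf")  # long context never costs more per query
    return c.rag_fixed_usd_per_month / extra_per_query
```

Plugging in your own corpus size, measured query volume, and actual pricing shows which side of the break-even point you sit on. Below it, the fixed cost of the retrieval stack dominates and the full-corpus prompt is cheaper.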
RAG still wins when the corpus is genuinely large (millions to billions of tokens), when documents change faster than you can rebuild the prompt, or when latency is tight enough that long-context inference doesn't fit your budget. It also wins when you need source attribution ("this came from doc X, page 4"), which long-context prompts struggle to surface reliably.
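The criteria above can be written down as a small checklist. This is a minimal sketch, not a definitive rule: the `Workload` type and its field names are hypothetical, the 500k-token cutoff is the rough figure used in this post, and "updates at least once an hour" is one reading of "faster than you can rebuild the prompt". The long-context latency should be something you measured, not a guess.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    corpus_tokens: int              # total size of the knowledge base
    updates_per_hour: float         # how often documents change
    latency_budget_s: float         # what the product can tolerate
    long_context_latency_s: float   # measured latency with the full corpus in the prompt
    needs_source_attribution: bool  # "this came from doc X, page 4"


def reasons_for_rag(w: Workload, max_prompt_tokens: int = 500_000) -> List[str]:
    """Collect every reason long context breaks down for this workload."""
    reasons = []
    if w.corpus_tokens > max_prompt_tokens:
        reasons.append("corpus is too large to put in the prompt")
    if w.updates_per_hour >= 1:
        reasons.append("documents change faster than the prompt can be rebuilt")
    if w.long_context_latency_s > w.latency_budget_s:
        reasons.append("long-context inference exceeds the latency budget")
    if w.needs_source_attribution:
        reasons.append("answers need reliable source attribution")
    return reasons


def choose_strategy(w: Workload, max_prompt_tokens: int = 500_000) -> str:
    """Long context by default; RAG only when at least one reason applies."""
    return "rag" if reasons_for_rag(w, max_prompt_tokens) else "long-context"
```

Returning the list of reasons, not just a verdict, keeps the decision reviewable: if the only reason for RAG is attribution, for example, that is a narrower problem than a corpus that does not fit.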
The mistake we see most often is teams reaching for RAG because that's what the tutorial taught, then spending three sprints maintaining a vector store for a 50-document FAQ. Long-context first, RAG when long-context breaks.
