Retrieval-augmented generation (RAG) lets a language model answer questions using *your* documents (policies, docs, a knowledge base) without retraining the model. It's one of the most practical patterns for building AI that knows your business. Here's the whole pipeline.
The Four Stages
- Chunk your documents into passages (a few hundred tokens each).
- Embed each chunk into a vector and store it in a vector database.
- Retrieve the most relevant chunks for a user's question via similarity search.
- Generate an answer by giving those chunks to the LLM as context.
Indexing: Chunk and Embed
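The indexing code in this section calls a `splitIntoChunks` helper without defining it. As a rough illustration only, here is one way such a helper might look. It is a sketch under stated assumptions, not the implementation of any particular library: "tokens" are approximated by whitespace-separated words, and the `chunk-N` id format is made up for the example.

```javascript
// Hypothetical helper: splits text into overlapping chunks.
// Assumption: tokens are approximated by whitespace-separated words. A real
// pipeline would likely count tokens with the embedding model's tokenizer.
function splitIntoChunks(text, { size = 500, overlap = 50 } = {}) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = size - overlap; // how far the window advances each iteration
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    const piece = words.slice(start, start + size);
    // The id format is illustrative; any stable unique id works.
    chunks.push({ id: `chunk-${chunks.length}`, text: piece.join(" ") });
    if (start + size >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Usage: 12 words, windows of 5 words that share 2 words with their neighbor.
const demoChunks = splitIntoChunks("a b c d e f g h i j k l", { size: 5, overlap: 2 });
console.log(demoChunks.map((c) => c.text));
// → ["a b c d e", "d e f g h", "g h i j k", "j k l"]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.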
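// Split docs, embed, and upsert into a vector store.
// splitIntoChunks, embed, and vectorDB are placeholders for your own chunker,
// embedding model client, and vector database client.
const chunks = splitIntoChunks(document, { size: 500, overlap: 50 });
for (const chunk of chunks) {
  const vector = await embed(chunk.text); // an embedding model
  // Store the text with the vector so retrieval can return it as context
  await vectorDB.upsert({ id: chunk.id, vector, text: chunk.text });
}
Querying: Retrieve and Generate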
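The `vectorDB.search` call in this section hides the similarity search itself. To show what it does conceptually, here is a minimal in-memory sketch that assumes cosine similarity and a brute-force scan. The names `cosineSimilarity` and `searchTopK` are hypothetical, and a real vector database would typically use an approximate index rather than scoring every record.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical stand-in for vectorDB.search: scores every stored record
// against the query vector and keeps the topK highest-scoring ones.
function searchTopK(records, queryVector, topK = 5) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(r.vector, queryVector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Usage with toy 2-dimensional vectors (real embeddings are far larger).
const toyRecords = [
  { id: "refunds", vector: [1, 0], text: "Refunds are issued within 14 days." },
  { id: "shipping", vector: [0, 1], text: "Orders ship in 2 business days." },
  { id: "returns", vector: [0.9, 0.1], text: "Returns need a receipt." },
];
console.log(searchTopK(toyRecords, [1, 0], 2).map((r) => r.id));
// → ["refunds", "returns"]
```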
// Embed the question with the same model used at indexing time
const queryVector = await embed(userQuestion);
// Fetch the 5 nearest chunks
const top = await vectorDB.search(queryVector, { topK: 5 });
const context = top.map((c) => c.text).join("\n\n");
// llm is a placeholder for your model client
const answer = await llm.generate({
  system: "Answer ONLY from the context. If it's not there, say you don't know.",
  prompt: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
});
What Makes RAG Good or Bad
- Chunking strategy often matters more than the choice of model: overlap and sensible boundaries prevent lost context.
- Ground the model hard: instruct it to answer only from retrieved context to reduce hallucination.
- Add citations so users can verify — return the source chunk alongside each answer.
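The grounding and citation points can be sketched together. The example below is one possible approach, not a standard API: it numbers each retrieved chunk in the prompt, asks the model to cite those numbers, and then maps `[n]` tags in the answer back to source chunks. The function names, the `[n]` tag convention, and the sample data are all assumptions made for illustration, and the model's answer is hard-coded in place of a real LLM call.

```javascript
// Builds a grounded prompt in which each retrieved chunk carries a numbered
// tag, so the model can cite "[1]", "[2]", and so on.
function buildGroundedPrompt(question, chunks) {
  const context = chunks
    .map((c, i) => `[${i + 1}] (source: ${c.id})\n${c.text}`)
    .join("\n\n");
  return {
    system:
      "Answer ONLY from the numbered context. Cite sources like [1]. " +
      "If the answer is not in the context, say you don't know.",
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  };
}

// Pairs an answer with the chunks it cited by scanning for [n] tags.
// Tags that point outside the retrieved chunks are ignored.
function attachCitations(answerText, chunks) {
  const cited = new Set();
  for (const match of answerText.matchAll(/\[(\d+)\]/g)) {
    const index = Number(match[1]) - 1;
    if (index >= 0 && index < chunks.length) cited.add(index);
  }
  const sources = [...cited]
    .sort((a, b) => a - b)
    .map((i) => ({ id: chunks[i].id, text: chunks[i].text }));
  return { answer: answerText, sources };
}

// Usage with made-up chunks and a hard-coded answer standing in for the LLM.
const retrieved = [
  { id: "policy-7", text: "Refunds are issued within 14 days." },
  { id: "policy-9", text: "Gift cards are non-refundable." },
];
console.log(buildGroundedPrompt("How long do refunds take?", retrieved).prompt);
const cited = attachCitations("Refunds take up to 14 days [1].", retrieved);
console.log(cited.sources.map((s) => s.id)); // → ["policy-7"]
```

Returning the cited chunk text alongside the answer lets the UI show users exactly which passage a claim came from.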
Start Simple
A basic RAG pipeline with good chunking usually beats a complex one with poor retrieval. Get the pipeline working end-to-end first, then add re-ranking and hybrid search.
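When you do reach the hybrid search step, one common way to merge a vector-search ranking with a keyword-search ranking is reciprocal rank fusion (RRF). The sketch below is a swapped-in illustration rather than something this article prescribes: it assumes each retriever returns an ordered list of document ids, the function name and sample ids are hypothetical, and `k = 60` is a commonly used default rather than a tuned value.

```javascript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) to a
// document's score, where rank starts at 1 for the top result.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const previous = scores.get(id) || 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Usage with made-up rankings from two hypothetical retrievers.
const vectorHits = ["doc-a", "doc-b", "doc-c"];
const keywordHits = ["doc-c", "doc-a", "doc-d"];
console.log(reciprocalRankFusion([vectorHits, keywordHits]));
// → ["doc-a", "doc-c", "doc-b", "doc-d"]
```

Documents that rank well in both lists rise to the top, which is why `doc-a` and `doc-c` lead here even though neither list agrees on the first result.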
