06.D

Interactive Retrieval-Augmented Generation (RAG)

RAG retrieves the most relevant snippets from a knowledge base and feeds them to an LLM as grounding context, so the answer is anchored in real source material rather than only in the model's parametric memory. Pick a question and watch the full pipeline run: embed → k-NN search → top-k context → prompt assembly → LLM stream.
$rag.pipeline --interactive --corpus=jun.cao
// pipeline
01 · question
02 · embedder svc (container)
POST embedder-svc:8080/embed
{ "text": "..." }
↓ n-dim query vector emitted
03 · k-NN search · vector index
04 · top-k context
05 · prompt assembly
06 · llm answer · streaming
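
Stage 02 of the pipeline posts the question text to the embedder service and receives a query vector. A minimal sketch of that call in Python might look like the following. The host, port, path, and request body come from the pipeline listing; the `http` scheme, the `Content-Type` header, and the `{"vector": [...]}` response shape are assumptions made for illustration, since the page does not show them. The helper names `build_embed_request` and `parse_embedding` are hypothetical.

```python
import json
import urllib.request
from typing import List

# Host, port and path are taken from the pipeline listing.
# Assumption: plain HTTP; the source does not state the scheme.
EMBED_URL = "http://embedder-svc:8080/embed"


def build_embed_request(text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one question."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        EMBED_URL,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def parse_embedding(raw: bytes) -> List[float]:
    """Parse a response body.

    Assumption: the service answers with {"vector": [...]}; the real
    response shape is not shown in the source.
    """
    payload = json.loads(raw)
    return [float(x) for x in payload["vector"]]


req = build_embed_request("What is RAG?")
```

Passing `req` to `urllib.request.urlopen` would send it, which only works where a service is actually reachable at that address. Keeping request construction separate from sending makes the payload easy to inspect without a running container.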
// what just happened

Your question was sent to a Dockerized embedder microservice, which calls an embedding model and returns a query vector. A k-nearest-neighbour search over the corpus's vector index then pulled the top-3 most relevant chunks. Those chunks became the context for a structured prompt (system rules, retrieved snippets, then the user's question), which was sent to an LLM. The LLM streamed back an answer grounded in the retrieved snippets, not just in the model's parametric memory. That's RAG in one round trip, the same shape we use in the production legal AI system at Visa Robot.
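
The retrieval and prompt-assembly steps described above can be sketched in a few lines of Python. This is an illustrative toy, not the production implementation: it assumes cosine similarity as the distance measure (the page does not name one), uses a brute-force scan in place of a real vector index, and uses hand-written 3-dimensional vectors in place of real embeddings. The function names, the prompt layout, and the sample chunks are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: Sequence[float],
    index: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Brute-force k-NN: score every chunk, keep the k most similar."""
    scored = [(cosine(query, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def assemble_prompt(system_rules: str, snippets: Sequence[str], question: str) -> str:
    """Order follows the text: system rules, retrieved snippets, question."""
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return f"{system_rules}\n\nContext:\n{context}\n\nQuestion: {question}"


# Toy corpus: (chunk text, made-up embedding).
index = [
    ("chunk A", [1.0, 0.0, 0.0]),
    ("chunk B", [0.8, 0.6, 0.0]),
    ("chunk C", [0.0, 1.0, 0.0]),
    ("chunk D", [0.6, 0.0, 0.8]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedder's output

hits = top_k(query_vector, index, k=3)
prompt = assemble_prompt(
    "Answer only from the context below.",
    [chunk for _, chunk in hits],
    "What is RAG?",
)
```

With these toy vectors the three retrieved chunks are A, B, and D, in that order; chunk C is orthogonal to the query and is dropped. The resulting `prompt` string is what would be sent to the LLM for streaming.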
