06.D

Interactive Retrieval-Augmented Generation (RAG)

RAG retrieves the most relevant snippets from a knowledge base and feeds them to an LLM as grounding context, so the answer is anchored in real source material rather than only in the model's parametric memory. Pick a question and watch the full pipeline run: embed → k-NN search → top-k context → prompt assembly → LLM stream.

corpus · 12 chunks

$rag.pipeline --interactive --corpus=jun.cao

>ready · select a question to run

// pipeline

step 1 / 6

01 · question

no question selected — click one above

02 · embedder svc (container)

POST embedder-svc:8080/embed

{ "text": "..." }

↓ n-dim query vector emitted
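
The request in this panel could be issued from Python roughly as follows. This is a minimal sketch using only the standard library. The `http://` scheme, the `embed` helper, and the `{"vector": [...]}` response shape are assumptions for illustration; the panel shows only the host, port, path, and the `{"text": ...}` request body.

```python
import json
import urllib.request

# Assumption: plain HTTP. The panel shows only "embedder-svc:8080/embed".
EMBEDDER_URL = "http://embedder-svc:8080/embed"


def build_embed_request(text: str) -> urllib.request.Request:
    """Build the POST request whose JSON body is {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        EMBEDDER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, timeout: float = 10.0) -> list[float]:
    """Send the request and return the query vector.

    Hypothetical response shape: {"vector": [0.12, -0.03, ...]}.
    Adjust the key to whatever the real service returns.
    """
    with urllib.request.urlopen(build_embed_request(text), timeout=timeout) as resp:
        payload = json.load(resp)
    return [float(x) for x in payload["vector"]]
```

Splitting request construction from the network call keeps the payload format checkable without a running container.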

03 · k-NN search · vector index

awaiting query vector
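
As an illustration of what this step computes, here is a brute-force k-NN sketch. It assumes cosine similarity as the metric and an in-memory list of `(chunk_id, vector)` pairs as the "index"; the page does not say which metric or index the demo uses. A real vector index would typically replace the exhaustive scan with an approximate search.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_search(
    query: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the k best, highest first."""
    scored = ((chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

With a 12-chunk corpus an exhaustive scan like this is entirely adequate; the cost only matters once the corpus grows by several orders of magnitude.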

04 · top-k context

awaiting search results

05 · prompt assembly

awaiting context
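
Prompt assembly can be sketched as plain string construction. The ordering (system rules, then retrieved snippets, then the user's question) follows the description on this page; the section labels `SYSTEM`, `CONTEXT`, `QUESTION` and the `[n]` snippet numbering are illustrative assumptions, not the demo's actual template.

```python
def assemble_prompt(system_rules: str, snippets: list[str], question: str) -> str:
    """Join system rules, numbered snippets, and the question into one prompt."""
    # Number the snippets so the model (and a reader) can refer to them by index.
    context = "\n\n".join(
        f"[{i}] {snippet.strip()}" for i, snippet in enumerate(snippets, start=1)
    )
    return (
        f"SYSTEM:\n{system_rules.strip()}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION:\n{question.strip()}"
    )
```

For example, `assemble_prompt("Answer only from the context.", top_chunks, user_question)` would produce the text handed to the LLM in the next step, where `top_chunks` and `user_question` are placeholders for the outputs of the earlier stages.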

06 · llm answer · streaming

awaiting prompt
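
Streaming means the answer is rendered piece by piece as it arrives instead of after the full response is complete. The sketch below shows the consuming side only. `fake_llm_stream` is a stand-in generator invented for this example, not a real client; an actual LLM SDK would supply its own iterator of text chunks.

```python
from typing import Iterable, Iterator


def fake_llm_stream(answer: str) -> Iterator[str]:
    """Stand-in for a streaming LLM client: yields the answer word by word."""
    words = answer.split(" ")
    for i, word in enumerate(words):
        # Re-attach the separating space to every word except the last.
        yield word if i == len(words) - 1 else word + " "


def render_stream(chunks: Iterable[str]) -> str:
    """Print each chunk as soon as it arrives and return the full answer."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)
```

Because `render_stream` accepts any iterable of strings, the stand-in can be swapped for a real streaming response without changing the rendering code.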

// what just happened

Your question was sent to a Dockerized embedder microservice, which calls an embedding model and returns a query vector. A k-nearest-neighbour search over the corpus's vector index then pulls the top 3 most relevant chunks. Those chunks become the context for a structured prompt (system rules, then the retrieved snippets, then the user's question), which is sent to an LLM. The LLM streams back an answer grounded in the retrieved snippets, not just in the model's parametric memory. That's RAG in one round-trip, the same shape we use in the production legal AI system at Visa Robot.