Lesson 6: Building a RAG
In the previous lessons, we’ve covered the key building blocks: generating embeddings, running models locally with Ollama, storing data in a vector database, combining traditional storage with vector search, and splitting documents into chunks for better retrieval.
Now it’s time to put it all together and build something practical: a Retrieval-Augmented Generation (RAG) system that uses everything you’ve learned so far to answer questions based on your own data.
What is RAG?
Retrieval-Augmented Generation (or RAG for short) is a way to make a Large Language Model (LLM) answer questions about information it hasn’t been trained on directly.
An LLM can only reply based on:
- What it learned during training (its built-in knowledge, which can be outdated or incomplete).
- What you tell it right now (extra information you add to the prompt).
If you ask it something outside of those two sources, it will try to guess, and that’s when you risk hallucinations (when an LLM makes things up). RAG fixes this by adding a retrieval step before we talk to the model:
1. User asks a question in natural language. Example: “What is the tank capacity?”
2. Convert the question into an embedding. This is just a list of numbers that captures the meaning of the question.
3. Search for similar embeddings in a vector database. The database (Qdrant in our case) finds the chunks of text most closely related to the question (see the code sketch right after this list).
4. Build a new prompt with the retrieved text. We give the LLM both the user’s question and the found chunks, basically whispering the answer into its ear.
5. LLM replies to the user. From the user’s perspective, it feels like the model “knows” the answer. In reality, it’s simply acting as a friendly interface, rephrasing the information we just gave it.
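To make steps 2 and 3 concrete, here is a minimal sketch of the retrieval half of the pipeline. It reuses the embedding model, Qdrant URL, and collection name from the exercise later in this lesson; treat it as an illustration rather than a finished program:
from langchain_ollama import OllamaEmbeddings
from qdrant_client import QdrantClient

question = "What is the tank capacity?"

# Step 2: convert the question into an embedding (a list of floats)
query_vec = OllamaEmbeddings(model="bge-m3").embed_query(question)

# Step 3: ask Qdrant for the chunks whose embeddings are closest to the question
client = QdrantClient(url="http://localhost:6333")
hits = client.query_points(
    collection_name="chunking",
    query=query_vec,
    limit=5,
    with_payload=True,
).points

for hit in hits:
    # Each hit carries a similarity score and the stored chunk text
    print(hit.score, (hit.payload or {}).get("text", ""))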
Implementation-wise, you create a prompt template that includes placeholders for the data retrieved from storage, as in the example below:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
---
Answer the question based on the above context: {question}
"""
You then populate the {context} placeholder with the data retrieved from storage and feed the LLM the augmented prompt; in practice, you’re asking the model a question and at the same time supplying the information it should use to answer.
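For example, you can fill the placeholders with Python’s built-in str.format; the chunk texts below are invented purely for illustration:
retrieved_chunks = [
    "The main tank holds 50 litres of fuel.",   # hypothetical chunk from the vector store
    "A reserve tank adds a further 5 litres.",  # hypothetical chunk from the vector store
]

# Join the chunks into one context block and slot it into the template
augmented_prompt = PROMPT_TEMPLATE.format(
    context="\n\n---\n\n".join(retrieved_chunks),
    question="What is the tank capacity?",
)
print(augmented_prompt)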
Exercise 6
1. Download the LLM and embedding models
First, we need to make sure that the LLM and embedding models have been downloaded, and that the Ollama service is up and running.
ollama pull mistral
ollama pull bge-m3
ollama serve
2. Create the query script
#!/usr/bin/env python3
import argparse
from typing import List
from qdrant_client import QdrantClient
from qdrant_client.models import Filter # kept for future filters if needed
from langchain.prompts import ChatPromptTemplate
from langchain_ollama import OllamaEmbeddings
from langchain_ollama import OllamaLLM
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "chunking"
EMBED_MODEL = "bge-m3"
LLM_MODEL = "mistral"
TOP_K = 5
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
---
Answer the question based on the above context: {question}
"""

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("query_text", type=str, help="The query text.")
    args = parser.parse_args()
    query_rag(args.query_text)


def query_rag(query_text: str):
    client = QdrantClient(url=QDRANT_URL)
    embeddings = OllamaEmbeddings(model=EMBED_MODEL)
    llm = OllamaLLM(model=LLM_MODEL)

    # Embed the question and retrieve the TOP_K most similar chunks from Qdrant
    query_vec = embeddings.embed_query(query_text)
    hits = client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vec,
        limit=TOP_K,
        with_payload=True,
        with_vectors=False,
    ).points

    # --- Build context + sources ---
    context_chunks: List[str] = []
    sources: List[str] = []
    for h in hits:
        payload = h.payload or {}
        text = payload.get("text", "")
        context_chunks.append(text)
        cid = payload.get("chunk_id")
        sources.append(cid)

    if not context_chunks:
        print("No results found in Qdrant.")
        return ""

    # Fill the prompt template with the retrieved context and the user's question
    context_text = "\n\n---\n\n".join(context_chunks)
    prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE).format(
        context=context_text, question=query_text
    )

    # Ask the LLM to answer using only the supplied context
    response_text = llm.invoke(prompt)
    print(f"Response: {response_text}\nSources: {sources}")
    return response_text


if __name__ == "__main__":
    main()
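3. Run a query
Assuming you save the script as query.py (the lesson doesn’t prescribe a file name), you can query your collection from the command line:
python3 query.py "What is the tank capacity?"
The script prints the model’s answer followed by the chunk_id values of the retrieved chunks, so each answer can be traced back to its source text.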