Lesson 13: Knowledge
Ground your agent's responses in your own documents using Retrieval-Augmented Generation (RAG).
- RAG Pipeline: How documents become searchable context.
- Document Conversion: Turning PDFs, Word docs, and URLs into text.
- Chunking Strategies: Fixed-size vs semantic chunking.
- Vector Search: Finding relevant content by meaning.
- Knowledge Integration: Connecting it all to your agent.
Why RAG?
LLMs know what they were trained on. They don't know:
- Your company's internal docs
- That PDF you downloaded yesterday
- Content behind a login wall
- Anything after their training cutoff
RAG solves this by retrieving relevant documents and injecting them into the prompt. The LLM generates responses grounded in your data, not just its training.
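Schematically, the augmented prompt looks something like this (an illustration only; the retrieved chunks and exact formatting depend on your framework):
# Chunks retrieved from the knowledge base for this question (illustrative values)
retrieved_chunks = [
    "Massaman curry paste combines dried chilies, lemongrass, galangal, and warm spices.",
    "Simmer the paste in coconut milk before adding meat, potatoes, and peanuts.",
]
# The retrieved text is injected into the prompt so the model answers from it
prompt = "Answer using only the context below.\n\n"
prompt += "Context:\n" + "\n".join(retrieved_chunks)
prompt += "\n\nQuestion: What goes into Massaman curry?"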
The RAG Pipeline
Ingestion (once per document):
Document → Convert to text → Chunk → Embed → Store in Qdrant
Query (every user message):
User question → Embed → Search Qdrant → Retrieve chunks → Inject into prompt → LLM responds
Let's break down each step.
Document Conversion
Before chunking, you need plain text. But documents come in many formats: PDF, Word, PowerPoint, HTML, even YouTube videos.
Tools like MarkItDown and Docling handle this conversion:
| Tool | Strengths |
|---|---|
| MarkItDown | Microsoft's tool. Handles Office formats, PDFs, HTML, YouTube transcripts, images (via OCR), audio (via transcription). Outputs clean Markdown. |
| Docling | IBM's tool. Strong on complex PDFs with tables, figures, and multi-column layouts. Preserves document structure. |
Both solve the same problem: take messy document formats and produce clean, parseable text. For most use cases, MarkItDown is simpler. For complex technical PDFs, Docling often does better.
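A minimal conversion with MarkItDown looks like this (the same convert-and-read pattern the ingestion script below uses; the URL is a placeholder):
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://example.com/cookbook.pdf")  # also accepts local paths and YouTube URLs
text = result.text_content                               # clean Markdown, ready for chunking
print(f"Parsed: {len(text)} chars")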
Chunking: Fixed-Size vs Semantic
Once you have text, you need to split it into chunks small enough to embed and retrieve efficiently. Two approaches:
Fixed-Size Chunking
Split text every N characters with some overlap:
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
Simple and predictable. But it might split mid-sentence or separate related content.
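As a quick sanity check of the overlap arithmetic, here is a hypothetical run on 1,500 characters of dummy text:
sample = "word " * 300            # 1,500 characters
chunks = chunk_text(sample)
print(len(chunks))                # 4 chunks, starting at offsets 0, 450, 900, 1350
print(len(chunks[0]))             # 500; only the final chunk is shorter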
Semantic Chunking
Uses embeddings to find natural topic boundaries. Compares consecutive sentences—when similarity drops below a threshold, that's a good place to split:
SemanticChunking(
    embedder=embedder,
    chunk_size=500,
    similarity_threshold=0.5,
)
Chunks stay coherent because splits happen where meaning changes.
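Under the hood the idea is simple: embed consecutive sentences and start a new chunk wherever similarity drops. A minimal sketch of that decision rule (embed() is a hypothetical stand-in for your embedder, not an Agno API):
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_points(sentences, embed, threshold=0.5):
    """Indices where a new chunk should begin: the sentence is no longer
    similar enough to the one before it."""
    vectors = [np.asarray(embed(s)) for s in sentences]   # embed() is hypothetical
    return [
        i for i in range(1, len(sentences))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]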
Which to Use?
For short, well-structured documents, the difference is often negligible. Fixed-size is faster and simpler.
For long documents with distinct sections (manuals, reports, books), semantic chunking tends to produce better retrieval because related content stays together.
In practice: start with fixed-size. If retrieval quality disappoints, try semantic.
Ingestion Scripts
Fixed-Size Ingestion
"""
Knowledge Ingestion Script
Usage: uv run tools/ingest_knowledge.py <source>
Examples:
uv run tools/ingest_knowledge.py https://example.com/doc.pdf
uv run tools/ingest_knowledge.py /path/to/file.docx
uv run tools/ingest_knowledge.py "https://youtube.com/watch?v=xxx"
Supported: PDF, Word, PowerPoint, Excel, HTML, YouTube, images, audio
"""
import sys
from dotenv import load_dotenv
from markitdown import MarkItDown
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.qdrant import Qdrant
from agno.db.postgres import PostgresDb
load_dotenv()
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
def main():
    source = sys.argv[1]
    print(f"Ingesting: {source}")

    # Parse document
    md = MarkItDown()
    result = md.convert(source)
    text = result.text_content
    print(f"Parsed: {len(text)} chars")

    # Chunk
    chunks = chunk_text(text)
    print(f"Chunks: {len(chunks)}")

    # Setup knowledge base
    embedder = OpenAIEmbedder(id="text-embedding-3-small")
    vector_db = Qdrant(collection="knowledge-demo", url="http://localhost:6333", embedder=embedder)
    contents_db = PostgresDb(
        db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
        knowledge_table="knowledge_contents",
    )
    knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)

    # Add chunks
    for i, chunk in enumerate(chunks):
        knowledge.add_content(
            name=f"{source}:chunk:{i}",
            text_content=chunk,
            metadata={"source": source, "chunk_index": i},
        )
    print(f"Done. Added {len(chunks)} chunks to knowledge base.")
main()
How it works:
- MarkItDown converts the source (PDF, URL, YouTube, etc.) to text
- Text is split into 500-character chunks with 50-char overlap
- Each chunk is embedded and stored in Qdrant
- Metadata tracks source and position for debugging
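A hypothetical run (the character and chunk counts depend entirely on the document):
uv run tools/ingest_knowledge.py https://example.com/cookbook.pdf
Ingesting: https://example.com/cookbook.pdf
Parsed: 48210 chars
Chunks: 108
Done. Added 108 chunks to knowledge base.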
Semantic Ingestion
"""
Semantic Knowledge Ingestion Script
Usage: uv run tools/ingest_knowledge_semantic.py <source>
Examples:
uv run tools/ingest_knowledge_semantic.py https://example.com/doc.pdf
uv run tools/ingest_knowledge_semantic.py /path/to/file.pdf
Uses semantic chunking (splits at topic boundaries) instead of fixed-size chunks.
"""
import sys
from dotenv import load_dotenv
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.chunking.semantic import SemanticChunking
from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.pdf_reader import PDFReader
from agno.vectordb.qdrant import Qdrant
from agno.db.postgres import PostgresDb
load_dotenv()
def main():
    source = sys.argv[1]
    print(f"Ingesting: {source}")

    embedder = OpenAIEmbedder(id="text-embedding-3-small")
    vector_db = Qdrant(collection="knowledge-semantic", url="http://localhost:6333", embedder=embedder)
    contents_db = PostgresDb(
        db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
        knowledge_table="knowledge_semantic",
    )
    knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)

    reader = PDFReader(
        chunking_strategy=SemanticChunking(
            embedder=embedder,
            chunk_size=500,
            similarity_threshold=0.5,
        )
    )

    # skip_if_exists=True is default
    if source.startswith("http"):
        knowledge.add_content(url=source, reader=reader)
    else:
        knowledge.add_content(path=source, reader=reader)
    print("Done.")
main()
How it works:
- PDFReader handles document parsing
- SemanticChunking analyzes sentence embeddings to find topic boundaries
- Splits occur where consecutive sentences have similarity below 0.5
- Chunks respect chunk_size as a soft maximum
Note: This script uses Agno's built-in PDFReader. For other formats with semantic chunking, you'd need to adapt the approach.
The Agent
Once documents are ingested, connect the knowledge base to your agent:
"""
Lesson 13: Knowledge (RAG)
Agent searches vector database for relevant context before responding. Documents
are chunked, embedded, and stored in Qdrant. On each query, similar chunks are
retrieved and injected into the prompt as context.
Run: uv run 13-knowledge.py
Try: "Ingredients for Massaman curry" | "How to make Tom Yum"
Observe in Phoenix (http://localhost:6006):
- Vector search span before LLM call
- Retrieved chunks in context
- Embedding calls for query
Ingest: uv run tools/ingest_knowledge.py <source>
Examples: ./recipe.pdf | https://example.com | "https://youtube.com/watch?v=xxx"
Reset: uv run tools/reset_data.py
"""
import os
from dotenv import load_dotenv
from phoenix.otel import register
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.postgres import PostgresDb
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.qdrant import Qdrant
load_dotenv()
register(project_name="13-knowledge", auto_instrument=True, batch=True, verbose=True)
db = PostgresDb(db_url="postgresql+psycopg://ai:ai@localhost:5532/ai")
embedder = OpenAIEmbedder(id="text-embedding-3-small")
vector_db = Qdrant(collection="knowledge-demo", url="http://localhost:6333", embedder=embedder)
contents_db = PostgresDb(
    db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
    knowledge_table="knowledge_contents",
)
knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)
agent = Agent(
    name="Knowledge Assistant",
    model=OpenAIChat(id=os.getenv("OPENAI_MODEL_ID")),
    instructions="You are a helpful assistant. Answer questions using the knowledge base. Be concise.",
    knowledge=knowledge,
    search_knowledge=True,
    db=db,
    user_id="demo-user",
    enable_user_memories=True,
    add_history_to_context=True,
    num_history_runs=5,
    markdown=True,
)
agent.cli_app(stream=True)
What's New
Knowledge setup:
embedder = OpenAIEmbedder(id="text-embedding-3-small")
vector_db = Qdrant(collection="knowledge-demo", url="http://localhost:6333", embedder=embedder)
contents_db = PostgresDb(...)
knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)
- embedder: Converts text to vectors (same model used for ingestion and queries)
- vector_db: Qdrant stores and searches vectors
- contents_db: Postgres stores the actual text content
- Knowledge: Coordinates both
Agent integration:
knowledge=knowledge,
search_knowledge=True,
- knowledge: The knowledge base to search
- search_knowledge: Automatically search before each response
Semantic Variant
Same agent, different collection:
"""
Lesson 13b: Knowledge with Semantic Chunking
Unlike 13a's fixed-size chunks, semantic chunking uses embeddings to find
natural topic boundaries. Splits occur where meaning changes significantly,
keeping related content together.
Run: uv run 13-knowledge-semantic.py
Try: "Ingredients for Massaman curry" | "How to make Tom Yum"
Observe in Phoenix (http://localhost:6006):
- Chunks aligned to topic boundaries
- Compare retrieval quality vs 13a
Ingest: uv run tools/ingest_knowledge_semantic.py <pdf-url-or-path>
Reset: uv run tools/reset_data.py
"""
import os
from dotenv import load_dotenv
from phoenix.otel import register
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.postgres import PostgresDb
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.qdrant import Qdrant
load_dotenv()
register(project_name="13-knowledge-semantic", auto_instrument=True, batch=True, verbose=True)
db = PostgresDb(db_url="postgresql+psycopg://ai:ai@localhost:5532/ai")
embedder = OpenAIEmbedder(id="text-embedding-3-small")
vector_db = Qdrant(collection="knowledge-semantic", url="http://localhost:6333", embedder=embedder)
contents_db = PostgresDb(
    db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
    knowledge_table="knowledge_semantic",
)
knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)
agent = Agent(
    name="Knowledge Assistant",
    model=OpenAIChat(id=os.getenv("OPENAI_MODEL_ID")),
    instructions="You are a helpful assistant. Answer questions using the knowledge base. Be concise.",
    knowledge=knowledge,
    search_knowledge=True,
    db=db,
    user_id="demo-user",
    enable_user_memories=True,
    add_history_to_context=True,
    num_history_runs=5,
    markdown=True,
)
agent.cli_app(stream=True)
The only differences: collection="knowledge-semantic" and knowledge_table="knowledge_semantic". Different storage, same agent pattern.
Try It
First, ingest some content:
# A PDF from the web
uv run tools/ingest_knowledge.py https://example.com/cookbook.pdf
# A local file
uv run tools/ingest_knowledge.py ./recipes.docx
# A YouTube video (extracts transcript)
uv run tools/ingest_knowledge.py "https://youtube.com/watch?v=dQw4w9WgXcQ"
Then query:
uv run 13-knowledge.py
> What ingredients do I need for Massaman curry?
Based on the knowledge base, Massaman curry requires:
- Chicken or beef
- Massaman curry paste
- Coconut milk
- Potatoes
- Peanuts
- Fish sauce, palm sugar
...
> How long does it take to cook?
According to the recipe, total cook time is about 45 minutes...
The agent retrieves relevant chunks from your documents and uses them to answer.
Observe in Phoenix
Open http://localhost:6006 and look at traces for 13-knowledge.
You'll see new spans:
- Embedding call: Your question gets converted to a vector
- Vector search: Qdrant finds similar chunks
- LLM call: Retrieved chunks appear in the context
Look at the LLM input—you'll see your instructions plus the retrieved document chunks, then the user's question. The LLM answers based on that context.
How Retrieval Works
When you ask "What ingredients do I need?":
- Question is embedded using text-embedding-3-small
- Qdrant finds chunks with similar embeddings (cosine similarity)
- Top-k chunks (default: 5) are retrieved
- Chunks are injected into the system prompt as context
- LLM generates a response grounded in that context
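To see steps 1-3 concretely, here is a minimal sketch of doing the retrieval by hand, outside the agent. It assumes the openai and qdrant-client packages; which payload fields come back depends on what was stored at ingestion time:
from openai import OpenAI
from qdrant_client import QdrantClient

question = "What ingredients do I need for Massaman curry?"

# 1. Embed the question with the same model used during ingestion
vector = OpenAI().embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# 2-3. Ask Qdrant for the five most similar chunks
qdrant = QdrantClient(url="http://localhost:6333")
hits = qdrant.search(collection_name="knowledge-demo", query_vector=vector, limit=5)
for hit in hits:
    print(round(hit.score, 3), hit.payload)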
The quality depends on:
- Chunk quality: Do chunks contain coherent, complete information?
- Embedding model: Does it capture semantic meaning well?
- Top-k setting: Too few misses relevant content, too many adds noise
Key Concepts
| Concept | This Lesson |
|---|---|
| RAG | Retrieve relevant docs, augment the prompt, generate |
| Chunking | Splitting documents into embeddable pieces |
| Embedding | Converting text to vectors for similarity search |
| Vector DB | Qdrant stores and searches embeddings |
| Retrieval | Finding relevant chunks by semantic similarity |
What's Next
Your agent now has memory, tools, and knowledge. In Lesson 14, we combine multiple specialized agents into a team—a leader that coordinates specialists to handle complex tasks.