
Lesson 13: Knowledge

Ground your agent's responses in your own documents using Retrieval-Augmented Generation (RAG).

Topics Covered
  • RAG Pipeline: How documents become searchable context.
  • Document Conversion: Turning PDFs, Word docs, and URLs into text.
  • Chunking Strategies: Fixed-size vs semantic chunking.
  • Vector Search: Finding relevant content by meaning.
  • Knowledge Integration: Connecting it all to your agent.

Why RAG?

LLMs know what they were trained on. They don't know:

  • Your company's internal docs
  • That PDF you downloaded yesterday
  • Content behind a login wall
  • Anything after their training cutoff

RAG solves this by retrieving relevant documents and injecting them into the prompt. The LLM generates responses grounded in your data, not just its training.

The RAG Pipeline

Ingestion (once per document):
Document → Convert to text → Chunk → Embed → Store in Qdrant

Query (every user message):
User question → Embed → Search Qdrant → Retrieve chunks → Inject into prompt → LLM responds

Let's break down each step.

Document Conversion

Before chunking, you need plain text. But documents come in many formats: PDF, Word, PowerPoint, HTML, even YouTube videos.

Tools like MarkItDown and Docling handle this conversion:

  • MarkItDown: Microsoft's tool. Handles Office formats, PDFs, HTML, YouTube transcripts, images (via OCR), and audio (via transcription). Outputs clean Markdown.
  • Docling: IBM's tool. Strong on complex PDFs with tables, figures, and multi-column layouts. Preserves document structure.

Both solve the same problem: take messy document formats and produce clean, parseable text. For most use cases, MarkItDown is simpler. For complex technical PDFs, Docling often does better.
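
As a quick sketch of both APIs (assuming the markitdown and docling packages are installed; Docling's exact method names may differ across versions):

from markitdown import MarkItDown
from docling.document_converter import DocumentConverter

source = "manual.pdf"  # hypothetical input file

# MarkItDown: one call, returns Markdown-like plain text
text = MarkItDown().convert(source).text_content

# Docling: structure-aware conversion, then export to Markdown
doc = DocumentConverter().convert(source).document
markdown = doc.export_to_markdown()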

Chunking: Fixed-Size vs Semantic

Once you have text, you need to split it into chunks small enough to embed and retrieve efficiently. Two approaches:

Fixed-Size Chunking

Split text every N characters with some overlap:

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
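
For example, with the defaults above a 1,200-character input produces three overlapping chunks:

chunks = chunk_text("x" * 1200)
print([len(c) for c in chunks])  # [500, 500, 300]; consecutive chunks share 50 characters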

Simple and predictable. But it might split mid-sentence or separate related content.

Semantic Chunking

Uses embeddings to find natural topic boundaries. Compares consecutive sentences—when similarity drops below a threshold, that's a good place to split:

SemanticChunking(
    embedder=embedder,
    chunk_size=500,
    similarity_threshold=0.5,
)

Chunks stay coherent because splits happen where meaning changes.
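
To make the boundary detection concrete, here is an illustrative sketch of the idea (not Agno's actual implementation). It assumes an embed(sentences) helper that returns one vector per sentence, for example backed by OpenAIEmbedder:

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_split(sentences, embed, threshold=0.5):
    """Start a new chunk wherever consecutive sentences stop resembling each other."""
    vectors = embed(sentences)  # one embedding per sentence (assumed helper)
    chunks, current = [], [sentences[0]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: close the chunk here
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

Agno's SemanticChunking also respects chunk_size as a soft maximum, which this sketch omits.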

Which to Use?

For short, well-structured documents, the difference is often negligible. Fixed-size is faster and simpler.

For long documents with distinct sections (manuals, reports, books), semantic chunking tends to produce better retrieval because related content stays together.

In practice: start with fixed-size. If retrieval quality disappoints, try semantic.

Ingestion Scripts

Fixed-Size Ingestion

tools/ingest_knowledge.py
"""
Knowledge Ingestion Script

Usage: uv run tools/ingest_knowledge.py <source>

Examples:
uv run tools/ingest_knowledge.py https://example.com/doc.pdf
uv run tools/ingest_knowledge.py /path/to/file.docx
uv run tools/ingest_knowledge.py "https://youtube.com/watch?v=xxx"

Supported: PDF, Word, PowerPoint, Excel, HTML, YouTube, images, audio
"""

import sys
from dotenv import load_dotenv
from markitdown import MarkItDown
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.qdrant import Qdrant
from agno.db.postgres import PostgresDb

load_dotenv()

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks


def main():
    source = sys.argv[1]
    print(f"Ingesting: {source}")

    # Parse document
    md = MarkItDown()
    result = md.convert(source)
    text = result.text_content
    print(f"Parsed: {len(text)} chars")

    # Chunk
    chunks = chunk_text(text)
    print(f"Chunks: {len(chunks)}")

    # Setup knowledge base
    embedder = OpenAIEmbedder(id="text-embedding-3-small")
    vector_db = Qdrant(collection="knowledge-demo", url="http://localhost:6333", embedder=embedder)
    contents_db = PostgresDb(
        db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
        knowledge_table="knowledge_contents",
    )
    knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)

    # Add chunks
    for i, chunk in enumerate(chunks):
        knowledge.add_content(
            name=f"{source}:chunk:{i}",
            text_content=chunk,
            metadata={"source": source, "chunk_index": i},
        )

    print(f"Done. Added {len(chunks)} chunks to knowledge base.")


main()

How it works:

  1. MarkItDown converts the source (PDF, URL, YouTube, etc.) to text
  2. Text is split into 500-character chunks with 50-char overlap
  3. Each chunk is embedded and stored in Qdrant (you can verify this with the quick check below)
  4. Metadata tracks source and position for debugging
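
To confirm the chunks actually landed in Qdrant (step 3), you can query the collection directly with the qdrant-client package. This quick check is not part of the lesson scripts:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
# The stored vector count should match the chunk count printed by the script
print(client.count(collection_name="knowledge-demo"))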

Semantic Ingestion

tools/ingest_knowledge_semantic.py
"""
Semantic Knowledge Ingestion Script

Usage: uv run tools/ingest_knowledge_semantic.py <source>

Examples:
uv run tools/ingest_knowledge_semantic.py https://example.com/doc.pdf
uv run tools/ingest_knowledge_semantic.py /path/to/file.pdf

Uses semantic chunking (splits at topic boundaries) instead of fixed-size chunks.
"""

import sys
from dotenv import load_dotenv
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.chunking.semantic import SemanticChunking
from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.pdf_reader import PDFReader
from agno.vectordb.qdrant import Qdrant
from agno.db.postgres import PostgresDb

load_dotenv()


def main():
    source = sys.argv[1]
    print(f"Ingesting: {source}")

    embedder = OpenAIEmbedder(id="text-embedding-3-small")
    vector_db = Qdrant(collection="knowledge-semantic", url="http://localhost:6333", embedder=embedder)
    contents_db = PostgresDb(
        db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
        knowledge_table="knowledge_semantic",
    )
    knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)

    reader = PDFReader(
        chunking_strategy=SemanticChunking(
            embedder=embedder,
            chunk_size=500,
            similarity_threshold=0.5,
        )
    )

    # skip_if_exists=True is default
    if source.startswith("http"):
        knowledge.add_content(url=source, reader=reader)
    else:
        knowledge.add_content(path=source, reader=reader)

    print("Done.")


main()

How it works:

  1. PDFReader handles document parsing
  2. SemanticChunking analyzes sentence embeddings to find topic boundaries
  3. Splits occur where consecutive sentences have similarity below 0.5
  4. Chunks respect chunk_size as a soft maximum

Note: This script uses Agno's built-in PDFReader. For other formats with semantic chunking, you'd need to adapt the approach.

The Agent

Once documents are ingested, connect the knowledge base to your agent:

13-knowledge.py
"""
Lesson 13: Knowledge (RAG)

Agent searches vector database for relevant context before responding. Documents
are chunked, embedded, and stored in Qdrant. On each query, similar chunks are
retrieved and injected into the prompt as context.

Run: uv run 13-knowledge.py
Try: "Ingredients for Massaman curry" | "How to make Tom Yum"

Observe in Phoenix (http://localhost:6006):
- Vector search span before LLM call
- Retrieved chunks in context
- Embedding calls for query

Ingest: uv run tools/ingest_knowledge.py <source>
Examples: ./recipe.pdf | https://example.com | "https://youtube.com/watch?v=xxx"
Reset: uv run tools/reset_data.py
"""

import os
from dotenv import load_dotenv
from phoenix.otel import register
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.postgres import PostgresDb
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.qdrant import Qdrant

load_dotenv()

register(project_name="13-knowledge", auto_instrument=True, batch=True, verbose=True)

db = PostgresDb(db_url="postgresql+psycopg://ai:ai@localhost:5532/ai")
embedder = OpenAIEmbedder(id="text-embedding-3-small")

vector_db = Qdrant(collection="knowledge-demo", url="http://localhost:6333", embedder=embedder)
contents_db = PostgresDb(
    db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
    knowledge_table="knowledge_contents",
)
knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)

agent = Agent(
    name="Knowledge Assistant",
    model=OpenAIChat(id=os.getenv("OPENAI_MODEL_ID")),
    instructions="You are a helpful assistant. Answer questions using the knowledge base. Be concise.",
    knowledge=knowledge,
    search_knowledge=True,
    db=db,
    user_id="demo-user",
    enable_user_memories=True,
    add_history_to_context=True,
    num_history_runs=5,
    markdown=True,
)

agent.cli_app(stream=True)

What's New

Knowledge setup:

embedder = OpenAIEmbedder(id="text-embedding-3-small")
vector_db = Qdrant(collection="knowledge-demo", url="http://localhost:6333", embedder=embedder)
contents_db = PostgresDb(...)
knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)
  • embedder: Converts text to vectors (same model used for ingestion and queries)
  • vector_db: Qdrant stores and searches vectors
  • contents_db: Postgres stores the actual text content
  • Knowledge: Coordinates both

Agent integration:

knowledge=knowledge,
search_knowledge=True,
  • knowledge: The knowledge base to search
  • search_knowledge: Automatically search before each response

Semantic Variant

Same agent, different collection:

13-knowledge-semantic.py
"""
Lesson 13b: Knowledge with Semantic Chunking

Unlike 13a's fixed-size chunks, semantic chunking uses embeddings to find
natural topic boundaries. Splits occur where meaning changes significantly,
keeping related content together.

Run: uv run 13-knowledge-semantic.py
Try: "Ingredients for Massaman curry" | "How to make Tom Yum"

Observe in Phoenix (http://localhost:6006):
- Chunks aligned to topic boundaries
- Compare retrieval quality vs 13a

Ingest: uv run tools/ingest_knowledge_semantic.py <pdf-url-or-path>
Reset: uv run tools/reset_data.py
"""

import os
from dotenv import load_dotenv
from phoenix.otel import register
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.postgres import PostgresDb
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.qdrant import Qdrant

load_dotenv()

register(project_name="13-knowledge-semantic", auto_instrument=True, batch=True, verbose=True)

db = PostgresDb(db_url="postgresql+psycopg://ai:ai@localhost:5532/ai")
embedder = OpenAIEmbedder(id="text-embedding-3-small")

vector_db = Qdrant(collection="knowledge-semantic", url="http://localhost:6333", embedder=embedder)
contents_db = PostgresDb(
    db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
    knowledge_table="knowledge_semantic",
)

knowledge = Knowledge(vector_db=vector_db, contents_db=contents_db)

agent = Agent(
    name="Knowledge Assistant",
    model=OpenAIChat(id=os.getenv("OPENAI_MODEL_ID")),
    instructions="You are a helpful assistant. Answer questions using the knowledge base. Be concise.",
    knowledge=knowledge,
    search_knowledge=True,
    db=db,
    user_id="demo-user",
    enable_user_memories=True,
    add_history_to_context=True,
    num_history_runs=5,
    markdown=True,
)

agent.cli_app(stream=True)

The only differences: collection="knowledge-semantic" and knowledge_table="knowledge_semantic". Different storage, same agent pattern.

Try It

First, ingest some content:

# A PDF from the web
uv run tools/ingest_knowledge.py https://example.com/cookbook.pdf

# A local file
uv run tools/ingest_knowledge.py ./recipes.docx

# A YouTube video (extracts transcript)
uv run tools/ingest_knowledge.py "https://youtube.com/watch?v=dQw4w9WgXcQ"

Then query:

uv run 13-knowledge.py
> What ingredients do I need for Massaman curry?
Based on the knowledge base, Massaman curry requires:
- Chicken or beef
- Massaman curry paste
- Coconut milk
- Potatoes
- Peanuts
- Fish sauce, palm sugar
...

> How long does it take to cook?
According to the recipe, total cook time is about 45 minutes...

The agent retrieves relevant chunks from your documents and uses them to answer.

Observe in Phoenix

Open http://localhost:6006 and look at traces for 13-knowledge.

You'll see new spans:

  1. Embedding call: Your question gets converted to a vector
  2. Vector search: Qdrant finds similar chunks
  3. LLM call: Retrieved chunks appear in the context

Look at the LLM input—you'll see your instructions plus the retrieved document chunks, then the user's question. The LLM answers based on that context.

How Retrieval Works

When you ask "What ingredients do I need?":

  1. Question is embedded using text-embedding-3-small
  2. Qdrant finds chunks with similar embeddings (cosine similarity; see the sketch below)
  3. Top-k chunks (default: 5) are retrieved
  4. Chunks are injected into the system prompt as context
  5. LLM generates a response grounded in that context
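
Step 2 boils down to cosine similarity between the query embedding and each stored chunk embedding. A minimal sketch of that comparison, assuming OPENAI_API_KEY is set and the openai and numpy packages are available:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

query = embed("What ingredients do I need for Massaman curry?")
chunk = embed("Massaman curry uses curry paste, coconut milk, potatoes, and peanuts.")
similarity = float(query @ chunk / (np.linalg.norm(query) * np.linalg.norm(chunk)))
print(f"cosine similarity: {similarity:.3f}")  # higher means a closer semantic match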

The quality depends on:

  • Chunk quality: Do chunks contain coherent, complete information?
  • Embedding model: Does it capture semantic meaning well?
  • Top-k setting: Too few misses relevant content, too many adds noise

Key Concepts

  • RAG: Retrieve relevant docs, augment the prompt, generate
  • Chunking: Splitting documents into embeddable pieces
  • Embedding: Converting text to vectors for similarity search
  • Vector DB: Qdrant stores and searches embeddings
  • Retrieval: Finding relevant chunks by semantic similarity

What's Next

Your agent now has memory, tools, and knowledge. In Lesson 14, we combine multiple specialized agents into a team—a leader that coordinates specialists to handle complex tasks.