
Lesson 5: Chunking & LangChain

Topics Covered
  • Why large documents must be split into smaller chunks before embedding.
  • How LangChain helps load, split, and embed documents.
  • Storing both chunk text and embeddings directly in Qdrant for simplicity.
  • Assigning unique chunk IDs to avoid duplicates in the vector database.

In this lesson, we’ll take PDF documents, break them down into smaller chunks of text, turn those chunks into embeddings, and store them in Qdrant so we can search through them later.

Chunking

When we work with large documents, we can’t just feed the whole thing to an embedding model or LLM. Models have a context size limit: they can only "look at" a certain amount of text at once. If we try to process an entire book or report in one go, it won’t fit, or it will be truncated and we’ll lose detail. Enter chunking.

Chunking means:

  • Splitting big text into smaller, manageable pieces (e.g., 800 characters each).
  • Adding some overlap between chunks (e.g., 80 characters) so that text near a chunk boundary appears in both neighbouring chunks and a sentence cut at the boundary doesn’t lose its meaning.

Later, when a user searches, we can retrieve the most relevant chunks from our vector database, not entire files. This makes search faster and more accurate.
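
To see what chunk size and overlap mean in practice, here is a tiny standalone sketch (not part of the exercise) that splits a plain string using deliberately small numbers, so you can watch the 20-character overlap repeat at each boundary. It only assumes the langchain-text-splitters package from the install step further down.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "LangChain splits long text recursively: first on paragraphs, "
    "then on sentences, then on words, until each piece fits. "
) * 5

# Tiny sizes so the effect is easy to see; the exercise below uses 800 and 80.
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

for i, chunk in enumerate(splitter.split_text(text)):
    print(f"chunk {i} ({len(chunk)} chars): {chunk!r}")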

LangChain

LangChain is a library that helps connect different AI tools without writing all the glue code yourself. Think of it as a Swiss Army knife for AI applications.

It offers:

  • Document loaders – to read PDFs, HTML, CSV, etc.
  • Text splitters – to chunk text in smart ways.
  • Embedding tools – to turn text into vectors.
  • Integrations – with databases like Qdrant, Pinecone, Chroma, etc.

In this exercise, we will use LangChain’s:

  1. PyPDFDirectoryLoader – reads all PDFs from a folder.
  2. RecursiveCharacterTextSplitter – splits documents into chunks.
  3. OllamaEmbeddings – turns text into vectors (embeddings) using a local Ollama model.
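
As a quick preview of how these three pieces fit together (the exercise script below adds Qdrant on top), a minimal sketch might look like this; it assumes a ./data folder with at least one PDF and a local Ollama server with the bge-m3 model pulled:

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings

docs = PyPDFDirectoryLoader("./data").load()  # one Document per PDF page
chunks = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80).split_documents(docs)
vectors = OllamaEmbeddings(model="bge-m3").embed_documents([c.page_content for c in chunks])

print(f"{len(docs)} pages -> {len(chunks)} chunks -> {len(vectors)} vectors")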

In a "real" RAG setup, you normally don’t dump everything into one database. Instead, there’s usually a split between where you store vectors and where you store the actual chunk content.

A typical design pattern includes:

  • Vector database (Pinecone, Qdrant, Weaviate, Chroma, pgvector): holds your chunk embeddings and just enough metadata to find them again (IDs, maybe a source and page number).
  • Separate storage for the full text and rich metadata: could be a document store (MongoDB, DynamoDB), a relational DB (PostgreSQL, MySQL), object storage (S3, GCS), or even a search engine like Elasticsearch if you want hybrid vector + keyword search.

This separation is common because vector databases are built for similarity search, not for storing large blobs of text. It’s often cheaper to keep the big text somewhere else, especially if you’re paying for a hosted vector DB. It also gives you more flexibility when running queries or filtering by metadata, and lets you choose the best tool for each specific job.
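
To make that separation concrete, here is a small illustrative sketch (not what we build in this lesson) in which Qdrant holds only the embedding and a chunk ID, while the full text lives in SQLite. The "chunks.db" file and the "docs" collection are made-up names, and the collection is assumed to already exist:

import sqlite3
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Hypothetical names; the "docs" collection is assumed to exist already.
text_db = sqlite3.connect("chunks.db")
text_db.execute("CREATE TABLE IF NOT EXISTS chunks (chunk_id TEXT PRIMARY KEY, text TEXT)")

def store_chunk(client: QdrantClient, chunk_id: str, vector: list[float], text: str):
    # Vector side: the embedding plus just enough payload to find the chunk again.
    client.upsert(
        collection_name="docs",
        points=[PointStruct(id=str(uuid.uuid4()), vector=vector, payload={"chunk_id": chunk_id})],
    )
    # Text side: the full chunk content lives in an ordinary database.
    text_db.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?)", (chunk_id, text))
    text_db.commit()

At query time you would search Qdrant for the nearest vectors, read the chunk_id from the payload, and fetch the matching text from the other store.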

But for this exercise we will be cutting some corners: we’re putting both the vectors and the chunk text straight into Qdrant. Keeping the moving parts to a minimum lets us focus on the main idea: chunking text and storing it for semantic search. Once you’re comfortable with that, you can move to a proper split architecture.


Exercise 5

Install dependencies

uv add langchain langchain-community langchain-text-splitters langchain-ollama qdrant-client pypdf

Create embeddings

#!/usr/bin/env python3
import argparse
import os
import uuid

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain_ollama import OllamaEmbeddings

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "chunking"
DATA_PATH = "./data"
EMBED_MODEL = "bge-m3"

def load_documents():
    document_loader = PyPDFDirectoryLoader(DATA_PATH)
    return document_loader.load()


def split_documents(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=80,
        length_function=len,
        is_separator_regex=False,
    )
    return text_splitter.split_documents(documents)


def add_to_qdrant(chunks: list[Document], client: QdrantClient, embeddings: OllamaEmbeddings):
    # Ensure collection exists
    ensure_collection(client)

    # Calculate chunk IDs.
    chunks_with_ids = calculate_chunk_ids(chunks)

    # Get existing document IDs from Qdrant
    try:
        collection_info = client.get_collection(COLLECTION_NAME)
        if collection_info.points_count > 0:
            # Get all existing points to check IDs
            existing_points = client.scroll(
                collection_name=COLLECTION_NAME,
                limit=collection_info.points_count,
                with_payload=True,
                with_vectors=False
            )[0]
            existing_ids = {point.payload.get("chunk_id") for point in existing_points if point.payload}
        else:
            existing_ids = set()
    except Exception:
        existing_ids = set()

    print(f"Number of existing documents in DB: {len(existing_ids)}")

    # Only add documents that don't exist in the DB.
    new_chunks = []
    for chunk in chunks_with_ids:
        if chunk.metadata["id"] not in existing_ids:
            new_chunks.append(chunk)

    if len(new_chunks):
        print(f"👉 Adding new documents: {len(new_chunks)}")

        # Generate embeddings for new chunks
        texts = [chunk.page_content for chunk in new_chunks]
        chunk_embeddings = embeddings.embed_documents(texts)

        # Create points for Qdrant
        points = []
        for chunk, embedding in zip(new_chunks, chunk_embeddings):
            point = PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={
                    "chunk_id": chunk.metadata["id"],
                    "text": chunk.page_content,
                    "source": chunk.metadata.get("source", ""),
                    "page": chunk.metadata.get("page", 0),
                    "metadata": chunk.metadata
                }
            )
            points.append(point)

        # Upload to Qdrant
        client.upsert(collection_name=COLLECTION_NAME, points=points)
        print(f"✅ Added {len(points)} new documents to Qdrant")
    else:
        print("✅ No new documents to add")


def calculate_chunk_ids(chunks):

    # This will create IDs like "data/filename.pdf:6:2"
    # Page Source : Page Number : Chunk Index

    last_page_id = None
    current_chunk_index = 0

    for chunk in chunks:
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        current_page_id = f"{source}:{page}"

        # If the page ID is the same as the last one, increment the index.
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        # Calculate the chunk ID.
        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id

        # Add it to the chunk metadata.
        chunk.metadata["id"] = chunk_id

    return chunks


def clear_database(client: QdrantClient):
    try:
        client.delete_collection(collection_name=COLLECTION_NAME)
        print(f"✅ Deleted collection: {COLLECTION_NAME}")
    except Exception as e:
        print(f"Collection {COLLECTION_NAME} doesn't exist or couldn't be deleted: {e}")


def ensure_collection(client: QdrantClient):
    """Ensure the Qdrant collection exists with proper configuration."""
    try:
        # Check if collection exists
        client.get_collection(COLLECTION_NAME)
        print(f"✅ Collection {COLLECTION_NAME} already exists")
    except Exception:
        # Collection doesn't exist, create it
        print(f"Creating collection: {COLLECTION_NAME}")

        # Get vector size by probing with a sample embedding
        embeddings = OllamaEmbeddings(model=EMBED_MODEL)
        sample_embedding = embeddings.embed_query("sample text")
        vector_size = len(sample_embedding)

        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
        )
        print(f"✅ Created collection {COLLECTION_NAME} with vector size {vector_size}")

def main():

    # Check if the database should be cleared (using the --reset flag).
    parser = argparse.ArgumentParser()
    parser.add_argument("--reset", action="store_true", help="Reset the database.")
    args = parser.parse_args()

    # Initialize Qdrant client and embeddings
    client = QdrantClient(url=QDRANT_URL)
    embeddings = OllamaEmbeddings(model=EMBED_MODEL)

    if args.reset:
        print("✨ Clearing Database")
        clear_database(client)

    # Create (or update) the data store.
    documents = load_documents()
    chunks = split_documents(documents)
    add_to_qdrant(chunks, client, embeddings)

if __name__ == "__main__":
    main()
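
Save the script (the filename is up to you; create_embeddings.py is just an example), make sure Qdrant is reachable at localhost:6333 and the bge-m3 model is available in Ollama (ollama pull bge-m3), drop some PDFs into ./data, and run it:

uv run create_embeddings.py

Running it a second time should report no new documents, because every chunk ID already exists in the collection. Pass --reset to delete the collection and rebuild it from scratch:

uv run create_embeddings.py --reset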