Lesson 2: Introducing Ollama
- What Ollama is and why you might want to use it.
- Choosing an embedding model in Ollama (e.g., bge-m3, nomic-embed-text).
- Trade-offs in latency, memory usage, and model size.
- How to replace SentenceTransformers with an Ollama embedding call.
In the previous lesson, you learned what vectors and embeddings are, why vector size matters, and how to create embeddings using SentenceTransformers.
You also saw how to compare two vectors (understanding what it means for them to be close or far apart in meaning) and how to measure that similarity using cosine similarity, dot product, and Euclidean distance.
Now we’ll take this a step further by generating embeddings with Ollama, which runs the model as a separate service on your computer rather than inside your Python process.
What is Ollama?
Ollama is a tool for running AI models locally on your own computer. No internet connection or cloud service is required (apart from the initial download of a model).
If you’ve used Docker to run software in containers or Git to pull code from a repository, Ollama feels a bit similar:
- Like Docker, it pulls models (instead of containers) from a remote registry and runs them locally.
- Like Git, it manages versions so you can switch between models easily.
Ollama can run both LLMs (large language models) for chat and embedding models for turning text (and, depending on the model, images, audio, or video) into vectors.
Ollama manages:
- Downloading and storing models.
- Running them efficiently on your CPU or GPU.
- Providing a simple local API for you to use in Python, JavaScript, or the command line.
This means you can work offline, avoid sending your data to an external server, and have predictable latency because the model runs entirely on your machine.
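For example, once Ollama is installed (we'll do that in the exercise below), a quick way to confirm the local API is reachable is to request its root endpoint. This is a minimal sketch, assuming Ollama is running on its default port, 11434:

from urllib.request import urlopen

# Ollama's local HTTP API listens on port 11434 by default;
# the root endpoint replies with a short status message.
with urlopen("http://localhost:11434") as response:
    print(response.read().decode())  # e.g. "Ollama is running"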
Choosing an Embedding Model
When picking a model, look for embedding support in the model description. Two popular choices are:
- bge-m3: A multilingual embedding model that works well for many languages.
- nomic-embed-text: A compact, fast English-focused embedding model.
Which model you select will depend on the following factors (a quick benchmarking sketch follows this list):
- Language: Do you need multi-language support or only English?
- Speed vs accuracy: Smaller models are faster but may capture less nuance.
- Memory: Large models need more RAM/VRAM.
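If you want to weigh these factors empirically, you can time the candidates on your own hardware. The following is a rough sketch, not a definitive benchmark: it assumes you have already pulled both models with ollama pull (covered in the exercise below), and the numbers you get will depend on your machine and model versions.

import time
import ollama

SAMPLE = ["A quick sentence to compare embedding models."]

# Assumes both models have been pulled locally with `ollama pull <model>`.
for model in ["nomic-embed-text", "bge-m3"]:
    start = time.perf_counter()
    vectors = ollama.embed(model=model, input=SAMPLE)["embeddings"]
    elapsed = time.perf_counter() - start
    print(f"{model}: {len(vectors[0])} dimensions, {elapsed:.3f}s for one sentence")

Larger vectors and longer per-sentence times usually come with better multilingual coverage or accuracy, so this kind of quick check helps you see the trade-off concretely before committing to a model.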
Not every AI model is built to generate embeddings.
- Text generation models (like those used in chatbots) are trained to predict the next word in a sequence, producing human-readable text.
- Embedding models are trained to turn text into numerical vectors that capture meaning, so those vectors can be compared, clustered, or searched.
Because these tasks require different training objectives, a model that’s great at writing text won’t necessarily produce high-quality embeddings, and many simply don’t support it.
When using Ollama, always check the model’s description to confirm it supports embeddings before pulling it.
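If a model is already on your machine, you can also inspect its metadata from Python with the client's show call. Treat this as a sketch: the exact fields in the response vary between Ollama versions, so use it as a starting point for checking what a model is intended for.

import ollama

# Inspect a locally pulled model. Depending on your Ollama version, the
# response may include details you can scan to see whether the model
# is meant for embeddings rather than text generation.
info = ollama.show("nomic-embed-text")
print(info)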
Performance Considerations
Running models locally means:
- Latency depends on your hardware; a GPU can be much faster.
- Memory usage grows with model size; big models might not fit in smaller machines.
- There is a cold-start delay when a model is loaded for the first time.
If you call the model repeatedly, keep it running rather than loading it each time.
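You can observe the cold-start effect by timing a first call against a second one. A rough sketch, assuming nomic-embed-text is already pulled; absolute numbers depend entirely on your hardware:

import time
import ollama

EMBED_MODEL = "nomic-embed-text"

def timed_embed(text):
    start = time.perf_counter()
    ollama.embed(model=EMBED_MODEL, input=[text])
    return time.perf_counter() - start

# The first call may include loading the model into memory (cold start);
# later calls reuse the already-loaded model and should be faster.
print(f"First call:  {timed_embed('warm-up sentence'):.2f}s")
print(f"Second call: {timed_embed('another sentence'):.2f}s")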
Exercise 2
Now that we’ve got the theory sorted, let’s put Ollama to work. Think of it like swapping out your kitchen blender. The recipe (our Python code) stays mostly the same, but we’re changing the machine that does the mixing.
Instead of SentenceTransformers doing the heavy lifting inside Python, we’ll hand the job to Ollama - running outside Python as its own service -
and let it churn out embeddings for us.
1. Install the ollama Python client
To interact with Ollama from Python, we’ll need its client library. Install it (from within your project folder) with:
uv add ollama
2. Pull an embedding model
If you haven’t already installed Ollama on your computer, do it now.
Ollama itself is not a Python library (that was the client we installed in the previous step); it is a standalone program that runs as a background service on your machine.
Your Python code will talk to it locally over an API.
Download and install Ollama for your operating system here:
https://ollama.com/download
Once it’s installed and running, you can pull an embedding model from your terminal (not Python):
ollama pull nomic-embed-text
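After the pull completes, you can confirm the model is available locally. A quick check from Python (the exact structure of the response depends on your client version, so here we simply print it):

import ollama

# List the models stored locally by the Ollama service;
# nomic-embed-text should appear here after the pull above.
print(ollama.list())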
3. Calculate embeddings with Ollama
First, let’s look at which parts of our code need to change when we swap out SentenceTransformer for Ollama.
import ollama
import numpy as np

EMBED_MODEL = "nomic-embed-text"

batch_texts = [
    "A cat sits on the sofa.",
    "The dog sleeps on the rug.",
    "I like pizza with cheese.",
    "Airplanes fly in the sky.",
]

# Ask the local Ollama service to embed the whole batch at once
embeddings = ollama.embed(model=EMBED_MODEL, input=batch_texts)["embeddings"]
print(np.array(embeddings).shape)
4. Run the script
uv run main.py
You should see the number of embeddings and their size, similar to our SentenceTransformers example, but now fully local.
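With nomic-embed-text, the output should look roughly like this: four input sentences, each turned into a 768-dimensional vector.

(4, 768)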
5. Compare sentences using different similarity measures
Now, just as in the previous section on Vectors, we’ll compare:
- Two related sentences ("cat" vs "dog")
- Two unrelated sentences ("cat" vs "airplane")
import ollama
import numpy as np
from scipy.spatial.distance import cosine, euclidean

EMBED_MODEL = "nomic-embed-text"

# Input texts
texts = [
    "A cat sits on the sofa.",
    "The dog sleeps on the rug.",
    "I like pizza with cheese.",
    "Airplanes fly in the sky.",
]

# Generate embeddings (batch)
embeddings = ollama.embed(model=EMBED_MODEL, input=texts)["embeddings"]
E = np.array(embeddings, dtype=np.float32)
print(E.shape)  # (4, 768) → 4 vectors, 768 numbers each

# Helper for dot product
def dot_product(v1, v2):
    return float(np.dot(v1, v2))

# Select vectors
cat, dog, plane = E[0], E[1], E[3]

# Compare Cat vs Dog
print("\n--- Cat vs Dog ---")
print("Cosine Similarity:", 1 - cosine(cat, dog))
print("Dot Product:", dot_product(cat, dog))
print("Euclidean Distance:", euclidean(cat, dog))

# Compare Cat vs Airplane
print("\n--- Cat vs Airplane ---")
print("Cosine Similarity:", 1 - cosine(cat, plane))
print("Dot Product:", dot_product(cat, plane))
print("Euclidean Distance:", euclidean(cat, plane))
In this Ollama example, the model sees "cat" and "dog" as fairly related. Their cosine similarity is about 0.55, which means they share a good amount of semantic direction in the vector space. The dot product matches closely because these embeddings are normalized. Their Euclidean distance (0.94) is relatively small, showing they are close in meaning. In contrast, "cat" and "airplane" have a cosine similarity of only 0.37 and a larger Euclidean distance (1.12), indicating weaker semantic overlap and greater separation in the embedding space.
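If you want to verify the normalization claim, print the length (Euclidean norm) of each vector; values close to 1.0 are what make the dot product and cosine similarity nearly coincide. A small addition, reusing E and np from the script above:

# Norms near 1.0 indicate normalized embeddings, which is why the
# dot product closely tracks cosine similarity in the output above.
print(np.linalg.norm(E, axis=1))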
These numbers differ from the ones we got with SentenceTransformer because the models themselves are different.
all-MiniLM-L6-v2 (SentenceTransformers) and nomic-embed-text (Ollama) are trained on different datasets, have different architectures, and produce embeddings of different sizes (384 vs 768 dimensions).
As a result, their internal "map" of meaning is not identical, so the similarity scores aren’t directly comparable across models - only within results from the same model.
Summary
You’ve now learned how to install Ollama, pull a local embedding model, and run it as a separate service on your computer.
Unlike SentenceTransformers, where the model runs entirely inside your Python process, Ollama keeps the model loaded in its own background service and exposes a local API.
This approach can be useful because:
- You can reuse the same model across multiple scripts without reloading it each time.
- You can switch between models easily, just like with Docker images.
- Your Python process stays lighter, since the heavy lifting happens in the Ollama service.
We also saw that similarity scores differ between Ollama’s nomic-embed-text and SentenceTransformers’ all-MiniLM-L6-v2 due to differences in architecture, training data, and embedding size.
Scores are only comparable within the same model - not across models.
Next: We’ll explore Vector Database Basics with Qdrant, where we’ll store and search our embeddings efficiently.