Lesson 1: Embeddings
- What embeddings are and how they represent meaning as vectors.
- Why vector size (dimensionality) matters for accuracy, memory, and speed.
- Differences between cosine similarity, dot product, and Euclidean distance.
- Using SentenceTransformers to generate and compare embeddings in Python.
Vectors
Humans understand the world through patterns our brains have learned over time. We don’t store every detail of an object or word. Instead, we remember its essence: the way it looks, sounds, feels, or fits into a situation. When we hear "dog", we instantly connect it to images, sounds, and feelings without needing to recall every fact about dogs.
AI models also work with patterns, but in a different way. They don’t "see" words, images, or sounds the way we do. Instead, they turn them into vectors - lists of numbers that capture meaning in a mathematical environment.
You can think of a vector as a coordinate in a special space, where each axis is not "up" or "left" but a hidden property of meaning. If two pieces of text are close in this space, they are probably related in meaning.
Example:
- "cat" and "dog" → vectors close together.
- "cat" and "airplane" → vectors far apart.
Embedding
An embedding is simply the vector form of a piece of data. It’s created by a model that has been trained to position similar things close together and different things far apart in a vector space.
We can create embeddings from:
- Text
- Images
- Audio
For text, the process is:
- Take your text (e.g., "I like pizza").
- Pass it into an embedding model.
- Get back a vector, for example, 384 numbers between roughly -1 and 1.
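In code, the whole process takes only a few lines. A minimal sketch, assuming the all-MiniLM-L6-v2 model (the same one used in the exercise later in this lesson):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
vector = model.encode("I like pizza")

print(len(vector))  # 384 numbers
print(vector[:5])   # the first few values, roughly between -1 and 1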
Vector Sizes
The size of a vector is the number of values it contains; this is also called its dimensionality.
In everyday life, we understand 2 dimensions as width and height, or 3 dimensions as width, height, and depth.
- A 2D vector might look like this: [3.5, 1.2]. This can be drawn as a point on a flat surface.
- A 3D vector might look like this: [3.5, 1.2, 4.7]. This can be placed somewhere inside a cube.
In embeddings, the number of dimensions is much larger: for example 384, 768, or over 1000 numbers.
Each "dimension" is not something we can name like width or height. Instead, it’s a learned feature that the model found useful for placing similar data close together.
These features are usually not human-interpretable. You can’t point to "dimension 37" and say this measures how many animals are mentioned.
The model just finds abstract patterns in the training data and uses hundreds of such patterns to position vectors in a multi-dimensional space.
- In 2D or 3D, you can only separate points in very simple ways. Many unrelated things would still end up close to each other.
- More dimensions = more "space" for the model to place unrelated items far apart and keep related items close.
- Think of each extra dimension as giving the model another way to tell two pieces of data apart, but at a level that’s usually too abstract for us to name.
Larger vectors can capture more detail, but:
- They take more memory to store.
- They are slower to search in a vector database (more on that later).
- All vectors in a collection must have the same size, because the database needs consistent math to compare them.
You generally don’t choose the size. It’s fixed by the embedding model you pick.
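You can, however, ask the model for its output size instead of hard-coding it. A small sketch using sentence-transformers, together with a rough storage estimate (one million float32 vectors of 384 dimensions take about 1,000,000 × 384 × 4 bytes ≈ 1.5 GB):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# The dimensionality is a property of the model, not something you pick
dim = model.get_sentence_embedding_dimension()
print(dim)  # 384

# Rough storage cost of 1 million float32 vectors of this size
print(1_000_000 * dim * 4 / 1e9, "GB")  # ~1.5 GB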
Measuring Similarity Between Vectors
Humans constantly compare things without thinking about it. We decide if two faces look alike, if two songs have a similar vibe, or if two dishes taste almost the same. We don’t compare every detail; instead, we focus on the features that matter most for the situation: shape, tone, flavor, or context.
AI models do something similar with vectors. They use mathematical measures to judge whether two vectors are close together or far apart; in other words, how similar or different they are in meaning.
Think of each vector as a point in space:
- If two points are close together, they probably represent similar content.
- If they are far apart, they probably mean very different things.
In vector space, "closeness" doesn’t always mean physical distance. It can be defined in different ways, depending on what aspects of the vectors we care about.
- Direction (do they point the same way?)
- Length (are they both long or short?)
- Actual straight-line distance between them.
These give us three common measures:
- Cosine similarity: looks only at the angle between vectors.
- Dot product: looks at both the angle and the length.
- Euclidean distance: looks at the straight-line distance.
Cosine Similarity
Imagine two arrows starting from the same point. Cosine similarity asks: are these arrows pointing in the same direction?
It doesn’t care how long the arrows are; just the angle between them.
If the arrows point in the same direction, the score is close to 1; their meanings are very similar.
If they are at a right angle (90°), the score is close to 0; they are unrelated.
If they point in opposite directions, the score is close to -1; their meanings are opposite.
Cosine Similarity is useful when we care about meaning and want to ignore size differences in the vectors.
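In formula terms, cosine similarity is the dot product of the two vectors divided by the product of their lengths: cos(θ) = (a · b) / (|a| · |b|). A minimal NumPy sketch with toy vectors (values invented for illustration):

import numpy as np

def cosine_similarity(a, b):
    # Angle-only comparison: divide out the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

print(cosine_similarity(a, b))  # 1.0, identical direction despite different lengths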
Dot Product
The dot product is like asking: if I shine a light straight along one arrow, how much of the other arrow’s shadow lines up?
It measures how much two vectors overlap when projected onto each other.
It takes into account both the direction and the length of the arrows.
If vectors are long and point in the same direction, the score will be large.
If they’re short or point in very different directions, the score will be small or even negative.
The value can be positive, zero, or negative, and there is no fixed maximum unless the vectors are normalized.
If your vectors are normalized (same length), the dot product and cosine similarity will give the same ordering of results.
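A quick sketch of that last point: since a · b = |a| · |b| · cos(θ), normalizing both vectors to length 1 makes the dot product equal to the cosine similarity (toy values, invented for illustration).

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 1.0, 2.0])

print(np.dot(a, b))  # raw dot product: depends on both angle and length

# Normalize both vectors to length 1
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

print(np.dot(a_n, b_n))  # now equal to the cosine similarity of a and b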
Euclidean Distance
Euclidean distance is the classic "ruler measurement". It asks: if I drew a straight line between these two points, how long would it be?
It is calculated as the square root of the sum of squared differences for each dimension.
Smaller values mean the points are closer together; larger values mean they are farther apart.
This measure is useful when the absolute position in the vector space matters, not just the direction; in other words, when you think of similarity as physical closeness.
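In NumPy this is a one-liner; the sketch below also spells out the square root of the sum of squared differences, to match the definition above (toy values, invented for illustration).

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(np.sqrt(np.sum((a - b) ** 2)))  # definition: sqrt of summed squared differences
print(np.linalg.norm(a - b))          # same value, using NumPy's built-in norm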
- Cosine similarity → compares the angle between vectors, ignoring length. Use when you care about meaning and want to ignore differences in vector size.
- Dot product → compares both angle and length; same as cosine similarity if the vectors are normalized. Use when vector length carries meaning or when working with normalized embeddings for speed.
- Euclidean distance → measures the straight-line distance between points in space. Use when absolute position matters and you think of similarity as physical closeness.
Exercise 1
We will start by setting up a tiny project, then we’ll add code in small steps.
Throughout this course, we will be using uv - a fast Python package and project manager.
It handles creating virtual environments, adding dependencies, and running scripts, all in a single lightweight tool.
If you haven’t used it before, take a minute to check its documentation.
It replaces tools like pip and venv, making Python project setup simpler and faster.
1. Init a new project
uv init ai3c-vectors
cd ai3c-vectors
uv add sentence-transformers numpy scipy
2. Load the SentenceTransformer Model
In this first exercise we will be using SentenceTransformers, a Python library that makes it easy to turn text into numerical vectors (embeddings) using pre-trained transformer models.
It handles downloading models from Hugging Face, running them locally, and returning ready-to-use vectors, so you can focus on using the embeddings rather than building a model from scratch.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # This model produces 384-dimensional embeddings
3. Calculate an embedding
An embedding is the vector form of a piece of text.
Once we have our model loaded, we can pass it a list of sentences and get back a list of vectors — one vector for each sentence.
Each vector is a list of numbers that capture the meaning of the sentence in a way the model understands.
texts = [
    "A cat sits on the sofa.",
    "The dog sleeps on the rug.",
    "I like pizza with cheese.",
    "Airplanes fly in the sky."
]
embeddings = model.encode(texts)
print(embeddings.shape) # (4, 384) → 4 sentences, 384 numbers each
Once you have saved the file, run it with uv (there is no need to activate the virtual environment yourself; uv run takes care of it):
uv run main.py
If the model is not already downloaded, the first run may take longer while it’s being fetched.
4. Compare sentences using different similarity measures
We’ll compare:
- Two related sentences ("cat" vs "dog")
- Two unrelated sentences ("cat" vs "airplane")
import numpy as np
from scipy.spatial.distance import cosine, euclidean
from sentence_transformers import SentenceTransformer
# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2') # 384-dimensional embeddings
# Input texts
texts = [
    "A cat sits on the sofa.",
    "The dog sleeps on the rug.",
    "I like pizza with cheese.",
    "Airplanes fly in the sky."
]
# Generate embeddings
embeddings = model.encode(texts)
print(embeddings.shape)
# Helper for dot product
def dot_product(v1, v2):
    return np.dot(v1, v2)
# Compare Cat vs Dog
cat = embeddings[0]
dog = embeddings[1]
plane = embeddings[3]
print("\n--- Cat vs Dog ---")
print("Cosine Similarity:", 1 - cosine(cat, dog))
print("Dot Product:", dot_product(cat, dog))
print("Euclidean Distance:", euclidean(cat, dog))
# Compare Cat vs Airplane
print("\n--- Cat vs Airplane ---")
print("Cosine Similarity:", 1 - cosine(cat, plane))
print("Dot Product:", dot_product(cat, plane))
print("Euclidean Distance:", euclidean(cat, plane))
Running the example above yields the following results:
uv run main.py
(4, 384)
--- Cat vs Dog ---
Cosine Similarity: 0.34279263
Dot Product: 0.34279266
Euclidean Distance: 1.1464792490005493
--- Cat vs Airplane ---
Cosine Similarity: 0.017234027
Dot Product: 0.017234035
Euclidean Distance: 1.4019743204116821
When we compare the results, the cat-dog pair shows a cosine similarity of about 0.34, meaning the model detects some shared context; both involve animals in a home setting, but the connection is not very strong.
The Euclidean distance of roughly 1.15 confirms that, while they are not identical in meaning, they sit relatively close in the embedding space.
In contrast, the cat-airplane pair has a cosine similarity near zero, showing almost no semantic relationship between the two sentences, and a larger Euclidean distance of about 1.40, placing them farther apart in the space.
The dot product values here are nearly identical to the cosine similarities because the embedding vectors are close to unit length; when both vectors have length 1, the dot product equals the cosine similarity. This aligns with our intuition: "cat" and "dog" are more related in meaning than "cat" and "airplane."
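As a side note, the sentence-transformers package also ships a small helper for cosine similarity (util.cos_sim), so you don't strictly need SciPy for this comparison. A minimal sketch:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([
    "A cat sits on the sofa.",
    "The dog sleeps on the rug.",
    "Airplanes fly in the sky."
])

# Cosine similarity matrix for all three sentences at once (a 3x3 tensor)
scores = util.cos_sim(embeddings, embeddings)
print(scores[0][1])  # cat vs dog
print(scores[0][2])  # cat vs airplane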
Summary
In this section, we learned what vectors and embeddings are, why vector size matters, and how to measure similarity between vectors using cosine similarity, dot product, and Euclidean distance.
We also set up a small Python project with SentenceTransformers, generated embeddings for a list of sentences, inspected their shape and size, and compared them using three similarity measures.
This gives us a foundation for working with text in a way that AI models understand - as numbers in a high-dimensional space - and for comparing their meaning mathematically.
Next: We’ll explore local embeddings with Ollama.
You’ll learn how to run embedding models directly on your machine, choose a model that supports embeddings (like bge-m3 or nomic-embed-text), and understand trade-offs in latency, memory use, and model size.