Lesson 7: Multi-Dimensional Data Visualization
- Why we need dimensionality reduction.
- How to use UMAP to visualize data.
- How to use t-SNE, another popular technique, and compare it with UMAP.
The Challenge of High Dimensions
High-dimensional data is hard to see. In this lesson, you’ll learn how to project it into 2D using dimensionality reduction techniques, compare common methods (PCA, t-SNE, UMAP), and read clusters, distances, and outliers without fooling yourself.
As we've discussed before, in machine learning, we often work with datasets that have hundreds or even thousands of features. For example, a single text sentence can be represented as a 1024-dimensional vector. Since we can’t directly "look" at that many dimensions, we use dimensionality reduction to compress the data into 2 or 3 dimensions while preserving as much of the original structure as possible.
These visualizations help us answer questions like:
- Does my data form natural clusters?
- Are there any outliers in my dataset?
- What is the overall shape or structure of my data?
In this lesson, we’ll focus on two of the most powerful modern techniques: UMAP and t-SNE.
Exercise 7
Setting Up Your Workspace
First, let's get our tools ready. We need a Python environment with a few key libraries.
Just as before, we'll use uv, a fast Python package manager. If you don't have it, you can install it following the official uv installation guide.
Once uv is installed, open your terminal and run this command:
uv add umap-learn ollama matplotlib scikit-learn
Set up Ollama
We need an embedding model to convert text to numbers. Assuming that you already have Ollama installed, pull the mxbai-embed-large model, which we will use to generate our embeddings. It is a good fit when you need a compact English embedding model that performs well and is fast and space-efficient to deploy.
ollama pull mxbai-embed-large
With your environment ready, we can move on to the fun part.
From Text to Numbers - Creating Embeddings
We’ve already seen how text can be converted into numerical vectors called embeddings, where sentences with similar meanings end up close together in vector space.
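To make "close together in vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three 4-dimensional vectors are made up for illustration and are not real model output; the helper name cosine_similarity is likewise just a name chosen for this sketch.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 4-dimensional "embeddings", invented for this example only.
cat = np.array([0.9, 0.8, 0.1, 0.0])
dog = np.array([0.8, 0.9, 0.2, 0.1])
car = np.array([0.1, 0.0, 0.9, 0.8])

sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_car = cosine_similarity(cat, car)

print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With real embeddings the same comparison applies, only with 1024 numbers per vector instead of 4.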
Now, we’ll put that into practice by taking a small set of sentences from three categories - animals, fruits, and vehicles - and see if our visualization methods can separate them.
import ollama
import numpy as np
print("Generating embeddings for our sentences...")
# Sentences from different categories to demonstrate clustering
sentences = [
    # Animals
    "The cat sat on the mat.", "Dogs are known for their loyalty.",
    "The lion is the king of the jungle.", "Elephants have long trunks.",
    # Fruits
    "An apple a day keeps the doctor away.", "Bananas are a great source of potassium.",
    "Oranges are rich in vitamin C.", "Strawberries are a popular summer fruit.",
    # Vehicles
    "The car drove down the street.", "The airplane flew high in the sky.",
    "The train chugged along the tracks.", "The bicycle is an eco-friendly mode of transport."
]
# Generate an embedding for each sentence
embeddings = [
    ollama.embeddings(model='mxbai-embed-large', prompt=s)['embedding']
    for s in sentences
]
# Convert to a NumPy array for high-performance math
embeddings = np.array(embeddings)
# For the rest of the course, we will use this 'embeddings' variable.
print(f"Successfully created embeddings with shape: {embeddings.shape}")
Save the code as main.py, then run it with our trusted uv:
uv run main.py
If the dependencies are installed, the Ollama service is running, and the model has been downloaded, you should see something like this (uv run uses the project's virtual environment for you):
Generating embeddings for our sentences...
Successfully created embeddings with shape: (12, 1024)
The output shape (12, 1024) tells us we have 12 sentences, each represented by a 1024-dimensional vector. Now, let's visualize this!
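One optional convenience before plotting: the later scripts regenerate the same embeddings every time, so you may prefer to save the array once and reload it. The sketch below shows the idea with np.save and np.load. It uses a random stand-in array and a temporary directory so it runs without Ollama; the file name embeddings.npy is an assumption, not something the lesson prescribes.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for the real (12, 1024) array from the embedding step.
# Random values are used only so this sketch runs without Ollama.
rng = np.random.default_rng(seed=42)
embeddings = rng.normal(size=(12, 1024))

# Hypothetical cache location. In a real project you might simply use
# a file such as "embeddings.npy" next to main.py.
cache_dir = Path(tempfile.mkdtemp())
cache_path = cache_dir / "embeddings.npy"

# Write the array once...
np.save(cache_path, embeddings)

# ...and reload it in a later script instead of calling the model again.
restored = np.load(cache_path)
print(f"Restored embeddings with shape: {restored.shape}")
```

Reloading is only safe while the sentences and the embedding model stay the same; if either changes, regenerate the file.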
Visualizing with UMAP
Uniform Manifold Approximation and Projection (UMAP) is a fantastic modern algorithm known for its speed and its ability to preserve both the local and global structure of the data. Think of UMAP as creating a "map" of your high-dimensional data in a low-dimensional space (like 2D). It tries to ensure that points that are close in 1024-D space are also close on the 2D map.
We’ll reuse the embeddings from the previous step and run UMAP with:
- n_neighbors=5: looks at the 5 nearest points for each sample. Smaller values focus on local detail; larger values lean toward global layout.
- min_dist=0.3: controls how tightly points can pack. Lower values make clusters denser; higher values spread them out.
- random_state=42: fixes the random seed so your plot is reproducible.
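To get a feel for what "the 5 nearest points" means, the following sketch finds each point's nearest neighbors with plain NumPy. It illustrates only the neighbor lookup that n_neighbors refers to and is not UMAP's actual implementation; the data is a random stand-in and the helper name nearest_neighbors is hypothetical.

```python
import numpy as np


def nearest_neighbors(points: np.ndarray, k: int) -> np.ndarray:
    """For each row, return the indices of its k nearest other rows (Euclidean)."""
    # Pairwise differences via broadcasting: shape (n, n, d)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Exclude each point from its own neighbor list
    np.fill_diagonal(dists, np.inf)
    # Sort each row by distance and keep the k closest indices
    return np.argsort(dists, axis=1)[:, :k]


# Stand-in data: 12 random points in 16 dimensions (not real embeddings)
rng = np.random.default_rng(seed=42)
points = rng.normal(size=(12, 16))

neighbors = nearest_neighbors(points, k=5)
print(neighbors.shape)   # one row of 5 neighbor indices per point
print(neighbors[0])      # indices of the points closest to point 0
```

With only 12 points, k=5 already covers almost half the dataset, which is why a small value suits this lesson.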
We’ll also color by category (Animal, Fruit, Vehicle) and add short text labels to each point.
import ollama
import numpy as np
import umap.umap_ as umap
import matplotlib.pyplot as plt
print("Generating embeddings for our sentences...")
sentences = [
    # Animals
    "The cat sat on the mat.", "Dogs are known for their loyalty.",
    "The lion is the king of the jungle.", "Elephants have long trunks.",
    # Fruits
    "An apple a day keeps the doctor away.", "Bananas are a great source of potassium.",
    "Oranges are rich in vitamin C.", "Strawberries are a popular summer fruit.",
    # Vehicles
    "The car drove down the street.", "The airplane flew high in the sky.",
    "The train chugged along the tracks.", "The bicycle is an eco-friendly mode of transport."
]
embeddings = [
    ollama.embeddings(model='mxbai-embed-large', prompt=s)['embedding']
    for s in sentences
]
embeddings = np.array(embeddings)
print(f"Successfully created embeddings with shape: {embeddings.shape}")
print("Running UMAP on the embeddings...")
# 1. Initialize UMAP
reducer = umap.UMAP(n_neighbors=5, min_dist=0.3, random_state=42)
# 2. Fit and transform the data
embedding_2d_umap = reducer.fit_transform(embeddings)
# 3. Plot the results
labels = ['Animal'] * 4 + ['Fruit'] * 4 + ['Vehicle'] * 4
categories = sorted(list(set(labels)))
colors = plt.get_cmap('viridis')(np.linspace(0, 1, len(categories)))
plt.figure(figsize=(12, 9))
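for i, category in enumerate(categories):
    idx = [j for j, label in enumerate(labels) if label == category]
    plt.scatter(embedding_2d_umap[idx, 0], embedding_2d_umap[idx, 1], label=category, color=colors[i], s=60)
# Label each point with its subject; txt.split()[1] would tag several points "are"
short_labels = ["cat", "dogs", "lion", "elephants", "apple", "bananas",
                "oranges", "strawberries", "car", "airplane", "train", "bicycle"]
for i, txt in enumerate(short_labels):
    plt.annotate(txt, (embedding_2d_umap[i, 0], embedding_2d_umap[i, 1]), alpha=0.8)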
plt.title('UMAP Projection of Sentence Embeddings')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.legend()
plt.grid(True)
plt.show()
Run the script with uv. With categories this distinct, the plot should show three clearly separated clusters; producing tight, well-separated groups is one of UMAP's key strengths.
An Alternative Approach - t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is another widely-used visualization algorithm. It is particularly good at revealing the fine-grained local structure of data.
t-SNE's main goal is to ensure that points that are close neighbors in high-dimensional space are also close neighbors in the 2D plot. It is less concerned with preserving the global structure.
perplexity: This parameter is loosely related to n_neighbors in UMAP. It helps determine how many neighbors each point considers. A typical range is 5 to 50. For our small dataset of 12 sentences, we'll use a low value (scikit-learn requires perplexity to be smaller than the number of samples).
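One way to read perplexity is as an "effective number of neighbors". In t-SNE it is defined as 2 raised to the entropy (in bits) of a point's neighbor probability distribution. The sketch below computes that quantity for two hand-written distributions; the probabilities are invented for illustration and the function is not scikit-learn's internal code.

```python
import math


def perplexity(probabilities: list[float]) -> float:
    """Perplexity 2**H of a discrete distribution, with H the entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Five neighbors that matter equally: effectively 5 neighbors.
uniform_over_5 = [0.2] * 5

# One dominant neighbor: far fewer than 5 effective neighbors.
peaked = [0.9, 0.025, 0.025, 0.025, 0.025]

print(f"uniform: {perplexity(uniform_over_5):.2f}")
print(f"peaked:  {perplexity(peaked):.2f}")
```

Setting perplexity=5 therefore asks t-SNE to tune each point's neighborhood so that roughly five neighbors carry the weight.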
Let's run t-SNE on the exact same embeddings and compare the results.
import ollama
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
print("Generating embeddings for our sentences...")
sentences = [
    # Animals
    "The cat sat on the mat.", "Dogs are known for their loyalty.",
    "The lion is the king of the jungle.", "Elephants have long trunks.",
    # Fruits
    "An apple a day keeps the doctor away.", "Bananas are a great source of potassium.",
    "Oranges are rich in vitamin C.", "Strawberries are a popular summer fruit.",
    # Vehicles
    "The car drove down the street.", "The airplane flew high in the sky.",
    "The train chugged along the tracks.", "The bicycle is an eco-friendly mode of transport."
]
embeddings = [
    ollama.embeddings(model='mxbai-embed-large', prompt=s)['embedding']
    for s in sentences
]
embeddings = np.array(embeddings)
print(f"Successfully created embeddings with shape: {embeddings.shape}")
print("Running t-SNE on the embeddings...")
# 1. Initialize t-SNE
tsne_reducer = TSNE(n_components=2, perplexity=5, random_state=42, init='pca', learning_rate='auto')
# 2. Fit and transform the data
embedding_2d_tsne = tsne_reducer.fit_transform(embeddings)
# 3. Plot the results
labels = ['Animal'] * 4 + ['Fruit'] * 4 + ['Vehicle'] * 4
categories = sorted(list(set(labels)))
colors = plt.get_cmap('viridis')(np.linspace(0, 1, len(categories)))
plt.figure(figsize=(12, 9))
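for i, category in enumerate(categories):
    idx = [j for j, label in enumerate(labels) if label == category]
    plt.scatter(embedding_2d_tsne[idx, 0], embedding_2d_tsne[idx, 1], label=category, color=colors[i], s=60)
# Label each point with its subject; txt.split()[1] would tag several points "are"
short_labels = ["cat", "dogs", "lion", "elephants", "apple", "bananas",
                "oranges", "strawberries", "car", "airplane", "train", "bicycle"]
for i, txt in enumerate(short_labels):
    plt.annotate(txt, (embedding_2d_tsne[i, 0], embedding_2d_tsne[i, 1]), alpha=0.8)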
plt.title('t-SNE Projection of Sentence Embeddings')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.legend()
plt.grid(True)
plt.show()
The t-SNE plot also shows distinct clusters. However, notice that the distance between the clusters (e.g., the gap between 'Animal' and 'Fruit') doesn't have a specific meaning in t-SNE. UMAP often does a better job of preserving this inter-cluster distance.
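The UMAP and t-SNE scripts above repeat the same plotting code. If you plan to compare more methods, one option is to move it into a small helper. The sketch below is one possible shape for such a function; the name plot_projection, its parameters, and the random stand-in coordinates are assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_projection(coords, labels, point_texts, title):
    """Scatter-plot 2D coordinates colored by label, with a short text per point."""
    categories = sorted(set(labels))
    colors = plt.get_cmap("viridis")(np.linspace(0, 1, len(categories)))
    fig, ax = plt.subplots(figsize=(12, 9))
    # One scatter call per category so each gets its own legend entry
    for color, category in zip(colors, categories):
        idx = [j for j, label in enumerate(labels) if label == category]
        ax.scatter(coords[idx, 0], coords[idx, 1], label=category, color=color, s=60)
    # Annotate every point with its short text
    for (x, y), text in zip(coords, point_texts):
        ax.annotate(text, (x, y), alpha=0.8)
    ax.set_title(title)
    ax.set_xlabel("Dimension 1")
    ax.set_ylabel("Dimension 2")
    ax.legend()
    ax.grid(True)
    return fig, ax


# Stand-in coordinates; in the lesson you would pass embedding_2d_umap
# or embedding_2d_tsne instead.
rng = np.random.default_rng(seed=42)
coords = rng.normal(size=(12, 2))
labels = ["Animal"] * 4 + ["Fruit"] * 4 + ["Vehicle"] * 4
point_texts = ["cat", "dogs", "lion", "elephants", "apple", "bananas",
               "oranges", "strawberries", "car", "airplane", "train", "bicycle"]

fig, ax = plot_projection(coords, labels, point_texts, "Stand-in projection")
# Call plt.show() here to display the figure interactively.
```

Returning the figure and axes rather than calling plt.show() inside the helper leaves it up to the caller whether to display or save the plot.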
UMAP vs. t-SNE: Key Differences
| Feature | UMAP | t-SNE |
|---|---|---|
| Speed | Faster, especially on large datasets. | Slower, can be computationally expensive. |
| Global Structure | Better at preserving global structure. | Focuses mainly on local structure. |
| Cluster Density | Tends to produce more compact clusters. | Clusters can be more varied in shape/size. |
| Use Case | Great for general-purpose visualization. | Excellent for inspecting fine-grained clusters. |
Summary
You've learned how to turn abstract text into visual, explorable maps. We turned a small set of sentences into vectors with an embedding model and then plotted them in 2D.
UMAP gave us three clear groups (animals, fruits, vehicles) and, with a fixed seed, the plot is repeatable. We ran t-SNE on the same vectors for comparison. It also formed groups, but the spacing between groups should not be read as global distance. In practice, use UMAP when you want speed and a layout that keeps some sense of the overall shape; use t-SNE when you care most about fine local structure. Keep in mind the axes have no meaning and layouts can rotate or flip between runs.
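The point about rotations and flips can be checked numerically: rotating or mirroring a 2D layout changes every coordinate but leaves all point-to-point distances intact, so two such plots carry the same information. The sketch below uses a random stand-in layout rather than real UMAP or t-SNE output.

```python
import numpy as np


def pairwise_dist(points: np.ndarray) -> np.ndarray:
    """Matrix of Euclidean distances between all pairs of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diffs, axis=-1)


# Stand-in 2D layout for 12 points (not a real projection)
rng = np.random.default_rng(seed=0)
layout = rng.normal(size=(12, 2))

# Rotate the layout by 90 degrees around the origin
theta = np.deg2rad(90)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = layout @ rotation.T

# Mirror the layout horizontally
flipped = layout * np.array([-1.0, 1.0])

same_after_rotation = np.allclose(pairwise_dist(layout), pairwise_dist(rotated))
same_after_flip = np.allclose(pairwise_dist(layout), pairwise_dist(flipped))

print(f"Distances unchanged after rotation: {same_after_rotation}")
print(f"Distances unchanged after flip: {same_after_flip}")
```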
Feel free to play around: try a quick PCA check, test another embedding model (e.g., bge-m3), and scale up the dataset.
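For the quick PCA check, a minimal sketch with scikit-learn's PCA might look like the following. It runs on synthetic stand-in data (three artificial groups of four points) so it works without Ollama; in your own script you would pass the real embeddings array instead.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: three synthetic groups of 4 points each in 1024 dimensions.
# These are NOT real sentence embeddings.
rng = np.random.default_rng(seed=42)
centers = rng.normal(size=(3, 1024))
embeddings = np.vstack([c + 0.1 * rng.normal(size=(4, 1024)) for c in centers])

# Project onto the two directions of greatest variance
pca = PCA(n_components=2, random_state=42)
embedding_2d_pca = pca.fit_transform(embeddings)

print(f"PCA output shape: {embedding_2d_pca.shape}")
print(f"Variance explained by each component: {pca.explained_variance_ratio_}")
```

Unlike UMAP and t-SNE, PCA is a linear projection, so its axes do have a meaning: each one is a direction of maximum variance, and explained_variance_ratio_ reports how much of the data's spread each axis captures.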