Lesson 6: What is a Vector Database?
- The Semantic Gap: Why traditional databases fail at unstructured data.
- Vector Embeddings: Representing data as arrays of numbers.
- Embedding Models: How data is transformed (CLIP, GloVe, wav2vec).
- Vector Indexing: Efficient search with ANN (HNSW, IVF).
A picture is worth a thousand words, but how do we store it in a database so that a computer understands its meaning?
1. The Semantic Gap
Traditional relational databases store structured data (strings, dates, numbers). If you store an image of a sunset, you might save the binary file along with some tags (sunset, orange).
But how do you query for "images with a similar mood"? Or "mountain landscapes"? Basic tags miss the semantic context. This disconnect between how computers store data and how humans understand it is called the Semantic Gap.
2. Vector Embeddings
Vector databases bridge this gap by representing unstructured data (images, text, audio) as Vector Embeddings. An embedding is essentially a long array of numbers where each position represents a learned feature.
Example (Simplified):
| Feature | Mountain Picture | Beach Picture | Interpretation |
|---|---|---|---|
| Dimension 1 | 0.91 | 0.12 | Elevation changes (High vs Low) |
| Dimension 2 | 0.15 | 0.08 | Urban elements (Few buildings in both) |
| Dimension 3 | 0.83 | 0.89 | Warm colors (Sunset present in both) |
In reality, embeddings have hundreds or thousands of dimensions. Items that are semantically similar are positioned close together in this multi-dimensional vector space.
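"Close together" can be made concrete with cosine similarity, a standard measure of how similar two vectors are. Here is a minimal sketch using the simplified 3-dimensional vectors from the table above, plus a hypothetical third "desert" image invented for illustration:

```python
from math import sqrt

# Simplified 3-D embeddings from the table above
# (real embeddings have hundreds or thousands of dimensions).
mountain = [0.91, 0.15, 0.83]
beach = [0.12, 0.08, 0.89]
desert = [0.20, 0.05, 0.95]  # hypothetical third image: flat terrain, warm colors

def cosine_similarity(a, b):
    """Return how closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The beach is more similar to the desert (both flat and warm)
# than it is to the mountain.
print(cosine_similarity(beach, desert) > cosine_similarity(beach, mountain))  # True
```

This is exactly the comparison a vector database performs at query time: embed the query, then rank stored vectors by similarity.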
3. Embedding Models
Embeddings are created by passing data through specialized neural networks called Embedding Models.
- Images: CLIP
- Text: GloVe
- Audio: wav2vec
As data passes through the model's layers, it extracts progressively more abstract features (edges -> objects -> meaning). The final output is a high-dimensional vector capturing the essential characteristics.
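The layer-by-layer idea can be sketched with a toy two-layer network. The weights below are made up purely for illustration; a real model (CLIP, GloVe, wav2vec) learns millions of weights from training data:

```python
def relu(v):
    """Common activation function: keep positives, zero out negatives."""
    return [max(0.0, x) for x in v]

def dense(v, weights):
    """One layer: each output dimension is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def embed(raw_input, layer1, layer2):
    hidden = relu(dense(raw_input, layer1))  # lower-level features (e.g. edges)
    return relu(dense(hidden, layer2))       # more abstract features (e.g. objects)

# Toy weights: 4 raw input values -> 3 hidden features -> 2-D embedding.
layer1 = [[0.5, -0.2, 0.1, 0.0],
          [0.0, 0.3, -0.4, 0.2],
          [0.1, 0.1, 0.1, 0.1]]
layer2 = [[1.0, 0.0, -0.5],
          [0.2, 0.8, 0.3]]

vector = embed([0.9, 0.1, 0.4, 0.7], layer1, layer2)
print(vector)  # a fixed-size vector, regardless of the raw input's content
```

Whatever the input, the output has a fixed number of dimensions, which is what makes stored vectors directly comparable to each other.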
4. Vector Indexing
Comparing a query vector to every vector in a database of millions is too slow. Vector Indexing uses ANN (Approximate Nearest Neighbor) algorithms to trade a tiny amount of accuracy for huge speed.
- HNSW (Hierarchical Navigable Small World): Creates multi-layered graphs connecting similar vectors.
- IVF (Inverted File Index): Divides vector space into clusters and only searches relevant ones.
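The IVF idea can be illustrated in a few lines. This is a minimal sketch with tiny 2-D vectors and hand-picked centroids; a real index learns centroids with k-means and probes several clusters rather than just one:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# One centroid per cluster (hand-picked here; learned via k-means in practice).
centroids = [[0.0, 0.0], [10.0, 10.0]]

# Indexing: assign each stored vector to its nearest centroid's bucket.
vectors = [[0.5, 0.2], [0.1, 0.9], [9.8, 10.1], [10.4, 9.7]]
buckets = {0: [], 1: []}
for v in vectors:
    nearest = min(range(len(centroids)), key=lambda i: euclidean(v, centroids[i]))
    buckets[nearest].append(v)

def search(query):
    """Probe only the bucket whose centroid is closest to the query."""
    probe = min(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))
    return min(buckets[probe], key=lambda v: euclidean(query, v))

print(search([9.9, 9.9]))  # compares against 2 vectors instead of all 4
```

The speedup comes from skipping entire clusters; the "approximate" part is that the true nearest neighbor could, in rare cases, sit in a bucket that was never probed.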
5. Summary
Vector Databases are the core of RAG (Retrieval Augmented Generation). They provide a place to store unstructured data and—crucially—a way to retrieve it semantically.