Lesson 6: What is a Vector Database?

Topics Covered
  • The Semantic Gap: Why traditional databases fail at unstructured data.
  • Vector Embeddings: Representing data as arrays of numbers.
  • Embedding Models: How data is transformed (CLIP, GloVe, wav2vec).
  • Vector Indexing: Efficient search with ANN (HNSW, IVF).

A picture is worth a thousand words, but how do we store it in a database so that a computer understands its meaning?

1. The Semantic Gap

Traditional relational databases store structured data (text, dates, numbers). If you store an image of a sunset, you might save the binary file and some tags (sunset, orange).

But how do you query for "images with a similar mood"? Or "mountain landscapes"? Basic tags miss the semantic context. This disconnect between how computers store data and how humans understand it is called the Semantic Gap.

2. Vector Embeddings

Vector databases bridge this gap by representing unstructured data (images, text, audio) as Vector Embeddings. An embedding is essentially a long array of numbers where each position represents a learned feature.

Example (Simplified):

Feature       Mountain Picture   Beach Picture   Meaning
Dimension 1   0.91               0.12            Elevation changes (High vs Low)
Dimension 2   0.15               0.08            Urban elements (few buildings in both)
Dimension 3   0.83               0.89            Warm colors (sunset present in both)

In reality, embeddings have hundreds or thousands of dimensions. Items that are semantically similar are positioned close together in this multi-dimensional vector space.
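"Close together" is usually measured with cosine similarity: the cosine of the angle between two vectors, where 1.0 means they point in the same direction. A minimal sketch using the toy 3-dimensional vectors from the table above (the `city` vector and the function name are illustrative, not from a real model):

```python
import math

# Toy 3-dimensional embeddings from the table above.
mountain = [0.91, 0.15, 0.83]
beach = [0.12, 0.08, 0.89]
city = [0.05, 0.97, 0.20]  # hypothetical: flat, very urban, cooler colors

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(mountain, beach))  # warm outdoor scenes: fairly similar
print(cosine_similarity(mountain, city))   # landscape vs urban: less similar
```

Real vector databases apply exactly this idea, just over vectors with hundreds of dimensions and millions of rows.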

3. Embedding Models

Embeddings are created by passing data through specialized neural networks called Embedding Models.

  • Images: CLIP
  • Text: GloVe
  • Audio: wav2vec

As data passes through the model's layers, it extracts progressively more abstract features (edges -> objects -> meaning). The final output is a high-dimensional vector capturing the essential characteristics.

4. Vector Indexing

Comparing a query vector to every vector in a database of millions is too slow. Vector Indexing uses ANN (Approximate Nearest Neighbor) algorithms to trade a tiny amount of accuracy for huge speed.

  • HNSW (Hierarchical Navigable Small World): Creates multi-layered graphs connecting similar vectors.
  • IVF (Inverted File Index): Divides vector space into clusters and only searches relevant ones.
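The IVF idea fits in a few lines. The sketch below assigns toy 2-D vectors to clusters and, at query time, scans only the clusters whose centroids are nearest to the query. It is a simplification: real IVF implementations train centroids with k-means rather than picking random samples, and `ivf_search` and `nprobe` here mirror but do not reproduce any specific library's API.

```python
import random

random.seed(0)

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Toy database: 1000 random 2-D vectors (real embeddings have hundreds of dims).
vectors = [(random.random(), random.random()) for _ in range(1000)]

# Build step: pick centroids (random samples here; real IVF uses k-means),
# then assign every vector to its nearest centroid's cluster.
centroids = random.sample(vectors, 10)
clusters = {i: [] for i in range(len(centroids))}
for v in vectors:
    nearest = min(range(len(centroids)), key=lambda i: euclidean(v, centroids[i]))
    clusters[nearest].append(v)

def ivf_search(query, nprobe=2):
    """Search only the `nprobe` clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: euclidean(query, centroids[i]))[:nprobe]
    candidates = [v for i in probe for v in clusters[i]]
    return min(candidates, key=lambda v: euclidean(query, v))

print(ivf_search((0.5, 0.5)))  # approximate nearest neighbor, scanning ~2 of 10 clusters
```

The trade-off is visible in `nprobe`: probing more clusters raises accuracy toward exact search but costs more comparisons, which is exactly the accuracy-for-speed exchange ANN indexes make.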

5. Summary

Vector Databases are the core of RAG (Retrieval-Augmented Generation). They provide a place to store unstructured data and, crucially, a way to retrieve it semantically.