Word Embeddings
Learn how word embeddings capture semantic relationships in vector space
Introduction
Word embeddings are one of the most important breakthroughs in Natural Language Processing. They transform words from discrete symbols into continuous vector representations that capture semantic meaning. This allows machines to understand that "king" is to "queen" as "man" is to "woman", or that "Paris" and "France" have a relationship similar to "London" and "England".
In this module, you'll learn how word embeddings work, explore the Word2Vec algorithm, and see how these dense vector representations enable machines to understand language semantics.
The Problem with One-Hot Encoding
Traditional approaches represent words as one-hot vectors:
Vocabulary: ["cat", "dog", "king", "queen"]
"cat" = [1, 0, 0, 0]
"dog" = [0, 1, 0, 0]
"king" = [0, 0, 1, 0]
"queen" = [0, 0, 0, 1]
Problems:
- High Dimensionality: Vector size = vocabulary size (often 10,000+)
- No Semantic Information: All words are equally distant from each other
- Sparse Representations: Mostly zeros, inefficient storage
- No Generalization: Can't capture relationships between words
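The equal-distance problem can be checked directly; a minimal NumPy sketch (the tiny vocabulary is illustrative):

```python
import numpy as np

vocab = ["cat", "dog", "king", "queen"]
# One-hot vectors are just rows of the identity matrix
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Euclidean distance between any two distinct one-hot vectors is sqrt(2):
# "cat" is exactly as far from "dog" as it is from "queen".
d_cat_dog = np.linalg.norm(one_hot["cat"] - one_hot["dog"])
d_cat_queen = np.linalg.norm(one_hot["cat"] - one_hot["queen"])
```

No matter how the vocabulary is ordered, one-hot geometry carries zero information about meaning.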
What are Word Embeddings?
Word embeddings are dense, low-dimensional vector representations of words that capture semantic and syntactic relationships:
"cat" = [0.2, -0.4, 0.7, ..., 0.1] (50-300 dimensions)
"dog" = [0.3, -0.3, 0.6, ..., 0.2] (similar to "cat")
"king" = [0.5, 0.8, -0.1, ..., 0.4]
"queen" = [0.4, 0.7, -0.2, ..., 0.3] (similar to "king")
Key Properties:
- Dense: Nearly all dimensions are non-zero and carry information
- Low-Dimensional: Typically 50-300 dimensions vs. 10,000+ for one-hot
- Semantic: Similar words have similar vectors
- Learned: Automatically discovered from text data
The Distributional Hypothesis
Word embeddings are based on the distributional hypothesis:
"You shall know a word by the company it keeps" - J.R. Firth (1957)
Words that appear in similar contexts tend to have similar meanings:
"The cat sat on the mat"
"The dog sat on the mat"
Since "cat" and "dog" appear in similar contexts, their embeddings should be similar.
Word2Vec: Skip-gram Model
Word2Vec is a popular algorithm for learning word embeddings. The Skip-gram model predicts context words given a target word (its counterpart, CBOW, does the reverse and predicts a target word from its context):
Architecture
Input: "cat"
↓
Embedding Layer (word → vector)
↓
Output Layer (predict context words)
↓
Predictions: ["the", "sat", "on"]
Training Process
Given the sentence: "The cat sat on the mat"
For target word "cat" with window size 1:
- Context words: "The", "sat"
- Goal: Maximize probability of context words given "cat"
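Enumerating the (target, context) training pairs is straightforward; a sketch (tokenization is simplified to whitespace splitting):

```python
def skipgram_pairs(tokens, window):
    """Generate (target, context) training pairs for Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = words within `window` positions on either side
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
# Pairs where "cat" (index 1) is the target, with window size 1:
cat_pairs = [p for p in skipgram_pairs(tokens, 1) if p[0] == "cat"]
# → [('cat', 'the'), ('cat', 'sat')]
```

Every such pair becomes one training example for the model.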
Objective Function
The model learns to maximize:
P(context | target) = P("The" | "cat") × P("sat" | "cat")
By adjusting word vectors to make this probability high.
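In the full model each conditional probability is a softmax over the vocabulary, computed from two vector tables (input/target vectors and output/context vectors). A sketch with random toy vectors (all values here are illustrative, not trained):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8
V_in = rng.normal(size=(len(vocab), dim))   # target (input) embeddings
V_out = rng.normal(size=(len(vocab), dim))  # context (output) embeddings

def p_context_given_target(context, target):
    """Softmax over output vectors: exp(u_c · v_t) / Σ_w exp(u_w · v_t)."""
    v_t = V_in[vocab.index(target)]
    scores = V_out @ v_t
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    probs = exp / exp.sum()
    return probs[vocab.index(context)]

p = p_context_given_target("sat", "cat")
```

Training adjusts both tables so that true context words receive high probability.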
Negative Sampling
Training on all words is computationally expensive. Negative sampling makes it efficient:
Positive Sample
- Target: "cat"
- Context: "sat"
- Label: 1 (these words appear together)
Negative Samples
- Target: "cat"
- Random words: "elephant", "computer", "ocean"
- Label: 0 (these words don't appear together)
The model learns to:
- Give high scores to actual word pairs
- Give low scores to random word pairs
This is much faster than computing probabilities over the entire vocabulary.
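The negative-sampling objective replaces that expensive softmax with independent logistic (sigmoid) classifications. A minimal sketch of the loss for one positive pair and a few negatives (the vectors are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_target, u_context, u_negatives):
    """-log σ(u_c·v_t) - Σ log σ(-u_neg·v_t): pushes the real pair's
    score up and the random pairs' scores down."""
    pos = -np.log(sigmoid(u_context @ v_target))
    neg = -np.sum(np.log(sigmoid(-(u_negatives @ v_target))))
    return pos + neg

rng = np.random.default_rng(1)
v_t = rng.normal(size=8)          # vector for target "cat"
u_c = rng.normal(size=8)          # vector for context "sat"
u_neg = rng.normal(size=(3, 8))   # vectors for 3 random negative words
loss = neg_sampling_loss(v_t, u_c, u_neg)
```

Only the positive pair and k sampled negatives contribute gradients, instead of the whole vocabulary.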
Semantic Relationships
Word embeddings capture fascinating semantic relationships through vector arithmetic:
Analogy Relationships
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walking - walk + swim ≈ swimming
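The king − man + woman arithmetic can be checked on a hand-crafted toy embedding (two made-up dimensions, roughly "royalty" and "gender"; real embeddings distribute these concepts across hundreds of dimensions):

```python
import numpy as np

# Toy vectors: dimension 0 ≈ royalty, dimension 1 ≈ gender (+male / -female)
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([0.1,  0.05]),  # unrelated filler word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]   # = [1, -1]
# Nearest word by cosine similarity, excluding the three input words:
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
# → "queen"
```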
Similarity
Words with similar meanings have similar vectors:
cosine_similarity("cat", "dog") = 0.85
cosine_similarity("cat", "computer") = 0.12
Clustering
Related words cluster together in embedding space:
- Animals: cat, dog, elephant, lion
- Countries: France, Italy, Spain, Germany
- Verbs: run, walk, jump, swim
Cosine Similarity
We measure word similarity using cosine similarity:
similarity = (A · B) / (||A|| × ||B||)
Where:
- A · B = dot product of the two vectors
- ||A||, ||B|| = magnitudes (Euclidean norms) of A and B
Range: -1 (opposite directions) to +1 (same direction)
Example:
vec("cat") = [0.2, 0.8, 0.3]
vec("dog") = [0.3, 0.7, 0.4]
similarity = 0.98 (very similar!)
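The example above can be reproduced with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: (A · B) / (||A|| ||B||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.2, 0.8, 0.3])
dog = np.array([0.3, 0.7, 0.4])
sim = cosine_similarity(cat, dog)  # ≈ 0.98
```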
Dimensionality and Visualization
Embedding Dimensions
Typical sizes:
- 50-100: Fast, good for small datasets
- 200-300: Standard, good balance
- 500-1000: Captures more nuance, slower
Visualization
We can't visualize 300-dimensional space, so we use dimensionality reduction:
- PCA (Principal Component Analysis): Linear projection
- t-SNE: Non-linear, preserves local structure
- UMAP: Fast, preserves more global structure than t-SNE
These project high-dimensional embeddings to 2D or 3D for visualization.
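A minimal PCA projection needs only NumPy (SVD on mean-centered vectors); t-SNE and UMAP are available in scikit-learn and umap-learn respectively. The random matrix below stands in for real embeddings:

```python
import numpy as np

def pca_2d(embeddings):
    """Project an n x d embedding matrix to n x 2 using the
    top 2 principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions (right singular vectors)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 300))  # 100 words, 300-dimensional vectors
coords = pca_2d(emb)               # 100 points in 2D, ready to plot
```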
Training Parameters
Window Size
How many words to consider on each side:
Window = 2: "The [cat] sat on"
Window = 5: "... The [cat] sat on the ..."
- Small window (1-2): Captures syntactic relationships
- Large window (5-10): Captures semantic relationships
Embedding Dimension
- Higher: More expressive, captures subtle relationships
- Lower: Faster, less memory, may lose information
Minimum Count
Filter rare words:
- High (5+): Smaller vocabulary, faster training
- Low (1-2): Keeps rare words, larger vocabulary
Learning Rate
- High (0.1): Fast convergence, may overshoot
- Low (0.001): Slow but stable convergence
Epochs
Number of passes through the corpus:
- Few (1-5): Fast, may underfit
- Many (10-20): Better quality, slower
Interactive Demo
Use the controls to experiment with word embeddings:
- Choose a Corpus: Select different text domains
- Adjust Embedding Dimension: See how dimensionality affects representation
- Modify Window Size: Observe impact on semantic vs. syntactic relationships
- Train the Model: Watch embeddings evolve over epochs
- Explore the Space: Click on words to find similar words
- Visualize in 2D/3D: See how words cluster by meaning
Observe:
- How similar words cluster together
- Semantic relationships in the vector space
- Training loss convergence
- Effect of hyperparameters
Use Cases
Text Classification
Use pre-trained embeddings as features:
- Sentiment analysis
- Spam detection
- Topic categorization
Information Retrieval
Find similar documents:
- Search engines
- Recommendation systems
- Document clustering
Machine Translation
Align words across languages:
- Cross-lingual embeddings
- Transfer learning
- Zero-shot translation
Named Entity Recognition
Capture entity relationships:
- Person names
- Locations
- Organizations
Question Answering
Understand semantic similarity:
- Match questions to answers
- Find relevant passages
- Semantic search
Best Practices
1. Use Pre-trained Embeddings
For most tasks, start with pre-trained embeddings:
- Word2Vec: Google News (3M words, 300d)
- GloVe: Wikipedia + Gigaword (6B tokens)
- FastText: Handles out-of-vocabulary words
2. Fine-tune on Your Domain
Adapt embeddings to your specific domain:
- Medical text
- Legal documents
- Social media
- Technical documentation
3. Handle Out-of-Vocabulary Words
Strategies for unknown words:
- Use subword embeddings (FastText)
- Average character embeddings
- Use a special UNK token
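The UNK-token strategy takes only a few lines; a sketch with a made-up vocabulary and random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "dog": 1, "<UNK>": 2}
embeddings = rng.normal(size=(len(vocab), 50))

def lookup(word):
    """Fall back to the <UNK> vector for out-of-vocabulary words."""
    return embeddings[vocab.get(word, vocab["<UNK>"])]

v_known = lookup("cat")          # regular embedding
v_unknown = lookup("xylophone")  # not in vocab → <UNK> vector
```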
4. Normalize Embeddings
Normalize vectors to unit length:
- Makes cosine similarity = dot product
- Faster computation
- More stable training
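After unit-length normalization, cosine similarity reduces to a plain dot product:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a = normalize(np.array([0.2, 0.8, 0.3]))
b = normalize(np.array([0.3, 0.7, 0.4]))

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
# For unit vectors the two are identical, so similarity search
# becomes a single matrix-vector product.
```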
5. Consider Context
Modern approaches use contextual embeddings:
- BERT: Different embeddings for different contexts
- ELMo: Deep contextualized representations
- GPT: Transformer-based embeddings
Limitations
Polysemy
Same word, different meanings:
- "bank" (financial) vs "bank" (river)
- "apple" (fruit) vs "Apple" (company)
Solution: Contextual embeddings (BERT, ELMo)
Bias
Embeddings can reflect societal biases:
- Gender bias: "doctor" → male, "nurse" → female
- Racial bias: Associations with names
- Cultural bias: Western-centric relationships
Solution: Debiasing techniques, careful evaluation
Static Representations
Word2Vec gives one vector per word:
- Can't capture multiple meanings
- No context awareness
Solution: Contextual embeddings
Rare Words
Words with few occurrences have poor embeddings:
- Noisy representations
- Unreliable similarities
Solution: Increase minimum count, use subword embeddings
Advanced Topics
GloVe (Global Vectors)
Alternative to Word2Vec:
- Uses global word co-occurrence statistics
- Matrix factorization approach
- Often performs similarly to Word2Vec
FastText
Extension of Word2Vec:
- Represents words as bags of character n-grams
- Handles out-of-vocabulary words
- Better for morphologically rich languages
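FastText's character n-grams for a word can be enumerated in a few lines, following the paper's convention of `<` and `>` boundary markers (the n-gram range below is illustrative; FastText defaults to 3-6):

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams with boundary markers, as used by FastText."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

grams = char_ngrams("cat", 3, 3)
# → ['<ca', 'cat', 'at>']
```

A word's vector is the sum of its n-gram vectors, so an unseen word still gets a representation from the n-grams it shares with known words.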
Contextual Embeddings
Modern approaches:
- ELMo: Bidirectional LSTM
- BERT: Transformer encoder
- GPT: Transformer decoder
These generate different embeddings based on context.
Cross-lingual Embeddings
Align embeddings across languages:
- Shared vector space
- Zero-shot translation
- Cross-lingual transfer learning
Further Reading
Seminal Papers
- "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013) - Original Word2Vec paper
- "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013) - Negative sampling
- "GloVe: Global Vectors for Word Representation" (Pennington et al., 2014)
- "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017) - FastText
Tools and Libraries
- Gensim: Python library for Word2Vec and Doc2Vec
- FastText: Facebook's library for efficient text classification and embeddings
- Hugging Face Transformers: Modern contextual embeddings
- spaCy: Industrial-strength NLP with pre-trained embeddings
Summary
Word embeddings revolutionized NLP by providing dense, semantic representations of words:
- Dense Vectors: Low-dimensional representations (50-300d) vs. sparse one-hot encoding
- Semantic Meaning: Similar words have similar vectors
- Vector Arithmetic: Captures analogies and relationships
- Learned from Data: Automatically discovered from text corpora
- Transfer Learning: Pre-trained embeddings work across tasks
Key algorithms:
- Word2Vec: Skip-gram and CBOW models
- GloVe: Global co-occurrence statistics
- FastText: Subword information
Modern evolution:
- Contextual Embeddings: BERT, ELMo, GPT
- Different vectors for different contexts
- State-of-the-art performance
Word embeddings are the foundation of modern NLP, enabling machines to understand and process human language with unprecedented accuracy!