Word Embeddings
Learn how word embeddings capture semantic relationships in vector space
Introduction
Word embeddings are one of the most important breakthroughs in Natural Language Processing. They transform words from discrete symbols into continuous vector representations that capture semantic meaning. This allows machines to understand that "king" is to "queen" as "man" is to "woman", or that "Paris" and "France" have a relationship similar to "London" and "England".
In this module, you'll learn how word embeddings work, explore the Word2Vec algorithm, and see how these dense vector representations enable machines to understand language semantics.
The Problem with One-Hot Encoding
Traditional approaches represent words as one-hot vectors:
Vocabulary: ["cat", "dog", "king", "queen"]
"cat" = [1, 0, 0, 0]
"dog" = [0, 1, 0, 0]
"king" = [0, 0, 1, 0]
"queen" = [0, 0, 0, 1]
Problems:
- High Dimensionality: Vector size = vocabulary size (often 10,000+)
- No Semantic Information: All words are equally distant from each other
- Sparse Representations: Mostly zeros, inefficient storage
- No Generalization: Can't capture relationships between words
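The equal-distance problem can be checked directly; a minimal NumPy sketch (the tiny vocabulary is illustrative):

```python
import numpy as np

vocab = ["cat", "dog", "king", "queen"]
# One-hot vectors are just rows of the identity matrix
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Euclidean distance between any two distinct one-hot vectors is sqrt(2):
# "cat" is exactly as far from "dog" as it is from "queen".
d_cat_dog = np.linalg.norm(one_hot["cat"] - one_hot["dog"])
d_cat_queen = np.linalg.norm(one_hot["cat"] - one_hot["queen"])
```

No matter how the vocabulary is ordered, one-hot geometry carries zero information about meaning.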
What are Word Embeddings?
Word embeddings are dense, low-dimensional vector representations of words that capture semantic and syntactic relationships:
"cat" = [0.2, -0.4, 0.7, ..., 0.1] (50-300 dimensions)
"dog" = [0.3, -0.3, 0.6, ..., 0.2] (similar to "cat")
"king" = [0.5, 0.8, -0.1, ..., 0.4]
"queen" = [0.4, 0.7, -0.2, ..., 0.3] (similar to "king")
Key Properties:
- Dense: Nearly all dimensions are non-zero and carry information
- Low-Dimensional: Typically 50-300 dimensions vs. 10,000+ for one-hot
- Semantic: Similar words have similar vectors
- Learned: Automatically discovered from text data
The Distributional Hypothesis
Word embeddings are based on the distributional hypothesis:
"You shall know a word by the company it keeps" - J.R. Firth (1957)
Words that appear in similar contexts tend to have similar meanings:
"The cat sat on the mat"
"The dog sat on the mat"
Since "cat" and "dog" appear in similar contexts, their embeddings should be similar.
Word2Vec: Skip-gram Model
Word2Vec is a popular algorithm for learning word embeddings. The Skip-gram model predicts context words given a target word (its counterpart, CBOW, does the reverse and predicts a target word from its context):
Architecture
Input: "cat"
↓
Embedding Layer (word → vector)
↓
Output Layer (predict context words)
↓
Predictions: ["the", "sat", "on"]
Training Process
Given the sentence: "The cat sat on the mat"
For target word "cat" with window size 1:
- Context words: "The", "sat"
- Goal: Maximize probability of context words given "cat"
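Enumerating the (target, context) training pairs is straightforward; a sketch (tokenization is simplified to whitespace splitting):

```python
def skipgram_pairs(tokens, window):
    """Generate (target, context) training pairs for Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = words within `window` positions on either side
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
# Pairs where "cat" (index 1) is the target, with window size 1:
cat_pairs = [p for p in skipgram_pairs(tokens, 1) if p[0] == "cat"]
# → [('cat', 'the'), ('cat', 'sat')]
```

Every such pair becomes one training example for the model.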
Objective Function
The model learns to maximize:
P(context | target) = P("The" | "cat") × P("sat" | "cat")
By adjusting word vectors to make this probability high.
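In the full model each conditional probability is a softmax over the vocabulary, computed from two vector tables (input/target vectors and output/context vectors). A sketch with random toy vectors (all values here are illustrative, not trained):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8
V_in = rng.normal(size=(len(vocab), dim))   # target (input) embeddings
V_out = rng.normal(size=(len(vocab), dim))  # context (output) embeddings

def p_context_given_target(context, target):
    """Softmax over output vectors: exp(u_c · v_t) / Σ_w exp(u_w · v_t)."""
    v_t = V_in[vocab.index(target)]
    scores = V_out @ v_t
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    probs = exp / exp.sum()
    return probs[vocab.index(context)]

p = p_context_given_target("sat", "cat")
```

Training adjusts both tables so that true context words receive high probability.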
Negative Sampling
Training on all words is computationally expensive. Negative sampling makes it efficient:
Positive Sample
- Target: "cat"
- Context: "sat"
- Label: 1 (these words appear together)
Negative Samples
- Target: "cat"
- Random words: "elephant", "computer", "ocean"
- Label: 0 (these words don't appear together)
The model learns to:
- Give high scores to actual word pairs
- Give low scores to random word pairs
This is much faster than computing probabilities over the entire vocabulary.
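The negative-sampling objective replaces that expensive softmax with independent logistic (sigmoid) classifications. A minimal sketch of the loss for one positive pair and a few negatives (the vectors are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_target, u_context, u_negatives):
    """-log σ(u_c·v_t) - Σ log σ(-u_neg·v_t): pushes the real pair's
    score up and the random pairs' scores down."""
    pos = -np.log(sigmoid(u_context @ v_target))
    neg = -np.sum(np.log(sigmoid(-(u_negatives @ v_target))))
    return pos + neg

rng = np.random.default_rng(1)
v_t = rng.normal(size=8)          # vector for target "cat"
u_c = rng.normal(size=8)          # vector for context "sat"
u_neg = rng.normal(size=(3, 8))   # vectors for 3 random negative words
loss = neg_sampling_loss(v_t, u_c, u_neg)
```

Only the positive pair and k sampled negatives contribute gradients, instead of the whole vocabulary.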
Semantic Relationships
Word embeddings capture fascinating semantic relationships through vector arithmetic:
Analogy Relationships
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walking - walk + swim ≈ swimming
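The king − man + woman arithmetic can be checked on a hand-crafted toy embedding (two made-up dimensions, roughly "royalty" and "gender"; real embeddings distribute these concepts across hundreds of dimensions):

```python
import numpy as np

# Toy vectors: dimension 0 ≈ royalty, dimension 1 ≈ gender (+male / -female)
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([0.1,  0.05]),  # unrelated filler word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]   # = [1, -1]
# Nearest word by cosine similarity, excluding the three input words:
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
# → "queen"
```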
Similarity
Words with similar meanings have similar vectors:
cosine_similarity("cat", "dog") = 0.85
cosine_similarity("cat", "computer") = 0.12
Clustering
Related words cluster together in embedding space:
- Animals: cat, dog, elephant, lion
- Countries: France, Italy, Spain, Germany
- Verbs: run, walk, jump, swim
Cosine Similarity
We measure word similarity using cosine similarity:
similarity = (A · B) / (||A|| × ||B||)
Where:
- A · B = dot product of the two vectors
- ||A||, ||B|| = magnitudes (Euclidean norms) of A and B
Range: -1 (opposite directions) to +1 (same direction)
Example:
vec("cat") = [0.2, 0.8, 0.3]
vec("dog") = [0.3, 0.7, 0.4]
similarity = 0.98 (very similar!)
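The example above can be reproduced with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: (A · B) / (||A|| ||B||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.2, 0.8, 0.3])
dog = np.array([0.3, 0.7, 0.4])
sim = cosine_similarity(cat, dog)  # ≈ 0.98
```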
Dimensionality and Visualization
Embedding Dimensions
Typical sizes:
- 50-100: Fast, good for small datasets
- 200-300: Standard, good balance
- 500-1000: Captures more nuance, slower
Visualization
We can't visualize 300-dimensional space, so we use dimensionality reduction:
- PCA (Principal Component Analysis): Linear projection
- t-SNE: Non-linear, preserves local structure
- UMAP: Fast, preserves more global structure than t-SNE
These project high-dimensional embeddings to 2D or 3D for visualization.
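A minimal PCA projection needs only NumPy (SVD on mean-centered vectors); t-SNE and UMAP are available in scikit-learn and umap-learn respectively. The random matrix below stands in for real embeddings:

```python
import numpy as np

def pca_2d(embeddings):
    """Project an n x d embedding matrix to n x 2 using the
    top 2 principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions (right singular vectors)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 300))  # 100 words, 300-dimensional vectors
coords = pca_2d(emb)               # 100 points in 2D, ready to plot
```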
Training Parameters
Window Size
How many words to consider on each side:
Window = 2: "The [cat] sat on"
Window = 5: "... The [cat] sat on the ..."
- Small window (1-2): Captures syntactic relationships
- Large window (5-10): Captures semantic relationships
Embedding Dimension
- Higher: More expressive, captures subtle relationships
- Lower: Faster, less memory, may lose information
Minimum Count
Filter rare words:
- High (5+): Smaller vocabulary, faster training
- Low (1-2): Keeps rare words, larger vocabulary
Learning Rate
- High (0.1): Fast convergence, may overshoot
- Low (0.001): Slow but stable convergence
Epochs
Number of passes through the corpus:
- Few (1-5): Fast, may underfit
- Many (10-20): Better quality, slower
Interactive Demo
Use the controls to experiment with word embeddings:
- Choose a Corpus: Select different text domains
- Adjust Embedding Dimension: See how dimensionality affects representation
- Modify Window Size: Observe impact on semantic vs. syntactic relationships
- Train the Model: Watch embeddings evolve over epochs
- Explore the Space: Click on words to find similar words
- Visualize in 2D/3D: See how words cluster by meaning
Observe:
- How similar words cluster together
- Semantic relationships in the vector space
- Training loss convergence
- Effect of hyperparameters
Use Cases
Text Classification
Use pre-trained embeddings as features:
- Sentiment analysis
- Spam detection
- Topic categorization
Information Retrieval
Find similar documents:
- Search engines
- Recommendation systems
- Document clustering
Machine Translation
Align words across languages:
- Cross-lingual embeddings
- Transfer learning
- Zero-shot translation
Named Entity Recognition
Capture entity relationships:
- Person names
- Locations
- Organizations
Question Answering
Understand semantic similarity:
- Match questions to answers
- Find relevant passages
- Semantic search
Best Practices
1. Use Pre-trained Embeddings
For most tasks, start with pre-trained embeddings:
- Word2Vec: Google News (3M words, 300d)
- GloVe: Wikipedia + Gigaword (6B tokens)
- FastText: Handles out-of-vocabulary words
2. Fine-tune on Your Domain
Adapt embeddings to your specific domain:
- Medical text
- Legal documents
- Social media
- Technical documentation
3. Handle Out-of-Vocabulary Words
Strategies for unknown words:
- Use subword embeddings (FastText)
- Average character embeddings
- Use a special UNK token
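The UNK-token strategy takes only a few lines; a sketch with a made-up vocabulary and random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "dog": 1, "<UNK>": 2}
embeddings = rng.normal(size=(len(vocab), 50))

def lookup(word):
    """Fall back to the <UNK> vector for out-of-vocabulary words."""
    return embeddings[vocab.get(word, vocab["<UNK>"])]

v_known = lookup("cat")          # regular embedding
v_unknown = lookup("xylophone")  # not in vocab → <UNK> vector
```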
4. Normalize Embeddings
Normalize vectors to unit length:
- Makes cosine similarity = dot product
- Faster computation
- More stable training
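After unit-length normalization, cosine similarity reduces to a plain dot product:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a = normalize(np.array([0.2, 0.8, 0.3]))
b = normalize(np.array([0.3, 0.7, 0.4]))

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
# For unit vectors the two are identical, so similarity search
# becomes a single matrix-vector product.
```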
5. Consider Context
Modern approaches use contextual embeddings:
- BERT: Different embeddings for different contexts
- ELMo: Deep contextualized representations
- GPT: Transformer-based embeddings
Limitations
Polysemy
Same word, different meanings:
- "bank" (financial) vs "bank" (river)
- "apple" (fruit) vs "Apple" (company)
Solution: Contextual embeddings (BERT, ELMo)
Bias
Embeddings can reflect societal biases:
- Gender bias: "doctor" → male, "nurse" → female
- Racial bias: Associations with names
- Cultural bias: Western-centric relationships
Solution: Debiasing techniques, careful evaluation
Static Representations
Word2Vec gives one vector per word:
- Can't capture multiple meanings
- No context awareness
Solution: Contextual embeddings
Rare Words
Words with few occurrences have poor embeddings:
- Noisy representations
- Unreliable similarities
Solution: Increase minimum count, use subword embeddings
Advanced Topics
GloVe (Global Vectors)
Alternative to Word2Vec:
- Uses global word co-occurrence statistics
- Matrix factorization approach
- Often performs similarly to Word2Vec
FastText
Extension of Word2Vec:
- Represents words as bags of character n-grams
- Handles out-of-vocabulary words
- Better for morphologically rich languages
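FastText's character n-grams for a word can be enumerated in a few lines, following the paper's convention of `<` and `>` boundary markers (the n-gram range below is illustrative; FastText defaults to 3-6):

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams with boundary markers, as used by FastText."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

grams = char_ngrams("cat", 3, 3)
# → ['<ca', 'cat', 'at>']
```

A word's vector is the sum of its n-gram vectors, so an unseen word still gets a representation from the n-grams it shares with known words.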
Contextual Embeddings
Modern approaches:
- ELMo: Bidirectional LSTM
- BERT: Transformer encoder
- GPT: Transformer decoder
These generate different embeddings based on context.
Cross-lingual Embeddings
Align embeddings across languages:
- Shared vector space
- Zero-shot translation
- Cross-lingual transfer learning
Further Reading
Seminal Papers
- "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013) - Original Word2Vec paper
- "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013) - Negative sampling
- "GloVe: Global Vectors for Word Representation" (Pennington et al., 2014)
- "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017) - FastText
Tools and Libraries
- Gensim: Python library for Word2Vec and Doc2Vec
- FastText: Facebook's library for efficient text classification and embeddings
- Hugging Face Transformers: Modern contextual embeddings
- spaCy: Industrial-strength NLP with pre-trained embeddings
Summary
Word embeddings revolutionized NLP by providing dense, semantic representations of words:
- Dense Vectors: Low-dimensional representations (50-300d) vs. sparse one-hot encoding
- Semantic Meaning: Similar words have similar vectors
- Vector Arithmetic: Captures analogies and relationships
- Learned from Data: Automatically discovered from text corpora
- Transfer Learning: Pre-trained embeddings work across tasks
Key algorithms:
- Word2Vec: Skip-gram and CBOW models
- GloVe: Global co-occurrence statistics
- FastText: Subword information
Modern evolution:
- Contextual Embeddings: BERT, ELMo, GPT
- Different vectors for different contexts
- State-of-the-art performance
Word embeddings are the foundation of modern NLP, enabling machines to understand and process human language with unprecedented accuracy!