Hierarchical Clustering
Learn how hierarchical clustering builds a tree of clusters using agglomerative or divisive approaches
Introduction
Hierarchical clustering is a powerful unsupervised learning technique that builds a hierarchy of clusters, creating a tree-like structure called a dendrogram. Unlike K-means, which requires you to specify the number of clusters upfront, hierarchical clustering reveals the natural grouping structure at multiple levels of granularity.
Think of it like a family tree or organizational chart - you can see relationships at different levels, from individual members to large groups, all in one visualization.
What You'll Learn
By the end of this module, you will:
- Understand agglomerative (bottom-up) and divisive (top-down) approaches
- Learn about different linkage criteria and when to use them
- Explore dendrograms for visualizing cluster hierarchies
- Recognize how to determine the optimal number of clusters
- Understand the computational complexity and scalability
Two Approaches to Hierarchical Clustering
A dendrogram showing the hierarchical structure of clusters
Agglomerative (Bottom-Up)
Most common approach:
- Start with each point as its own cluster (n clusters)
- Repeatedly merge the two closest clusters
- Continue until all points are in one cluster
- Cut the dendrogram at desired level to get k clusters
Analogy: Building a pyramid from individual blocks
Divisive (Top-Down)
Less common approach:
- Start with all points in one cluster
- Repeatedly split a cluster into two sub-clusters
- Continue until each point is its own cluster
- Cut the dendrogram at desired level to get k clusters
Analogy: Breaking down a large organization into departments
Note: Agglomerative clustering is far more common in practice; a naive divisive algorithm must consider exponentially many ways to split a cluster at each step, which makes it much more expensive.
The Agglomerative Algorithm
Step-by-Step Process
Step 1: Initialize
- Treat each data point as a single cluster
- Calculate pairwise distances between all points
- Create a distance matrix
Step 2: Find Closest Pair
- Identify the two clusters with minimum distance
- Use the chosen linkage criterion
Step 3: Merge
- Combine the two closest clusters into one
- Update the distance matrix
Step 4: Repeat
- Continue steps 2-3 until only one cluster remains
- Record the merge history
Step 5: Create Dendrogram
- Visualize the merge history as a tree
- Height represents distance at which clusters merged
Step 6: Cut the Tree
- Choose a height or number of clusters
- Extract final cluster assignments
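The six steps above map almost one-to-one onto SciPy's hierarchical clustering API. The following is a minimal sketch on made-up 2-D data, assuming NumPy and SciPy are available; the choice of average linkage and two clusters is purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated blobs of synthetic 2-D points
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Step 1: pairwise distances between all points (condensed distance matrix)
D = pdist(X, metric="euclidean")

# Steps 2-5: repeatedly merge the closest pair of clusters; Z records the merge history
Z = linkage(D, method="average")

# Step 6: cut the tree to obtain a fixed number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster id (1 or 2) for each point
```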
Linkage Criteria
Different linkage methods measure cluster distance differently
The linkage criterion determines how to measure distance between clusters. This is the most important choice in hierarchical clustering.
Single Linkage (Minimum)
Definition: Distance between closest points in two clusters
d(A, B) = min{d(a, b) : a ∈ A, b ∈ B}
Characteristics:
- Creates long, chain-like clusters
- Sensitive to noise and outliers
- Can connect distant clusters through intermediate points
- Good for non-convex shapes
When to use:
- Data has elongated or irregular shapes
- You want to detect connected components
Drawback: "Chaining effect" - can create unnatural long chains
Complete Linkage (Maximum)
Definition: Distance between farthest points in two clusters
d(A, B) = max{d(a, b) : a ∈ A, b ∈ B}
Characteristics:
- Creates compact, spherical clusters
- Less sensitive to outliers than single linkage
- Tends to break large clusters
- Prefers clusters of similar size
When to use:
- You want compact, well-separated clusters
- Data has spherical cluster shapes
Drawback: Can split natural clusters if they're not compact
Average Linkage (Mean)
Definition: Average distance between all pairs of points
d(A, B) = (1 / (|A|·|B|)) Σ_{a ∈ A} Σ_{b ∈ B} d(a, b)
Characteristics:
- Compromise between single and complete
- More robust than single linkage
- Less biased toward compact clusters than complete
- Generally good default choice
When to use:
- You want balanced clusters
- You're unsure which linkage to use
- Data has moderate noise
Advantage: Balanced approach, works well in many scenarios
Ward's Method (Minimum Variance)
Definition: Merge the pair of clusters that yields the smallest increase in within-cluster variance
At each step, merge the two clusters whose union minimizes the increase in: Σ(within-cluster sum of squares)
Characteristics:
- Creates compact, evenly-sized clusters
- Minimizes variance within clusters
- Most popular choice in practice
- Similar to K-means objective
When to use:
- You want compact, spherical clusters
- You want clusters of similar size
- You're familiar with K-means
Advantage: Often produces the best results for well-separated, spherical clusters
Note: Only works with Euclidean distance
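To see how much the linkage choice matters, it can help to run all four methods on the same data and compare the resulting cluster sizes. The sketch below uses SciPy on synthetic blobs; the data, the list of methods, and the cut at two clusters are illustrative assumptions, not a prescription.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)),
               rng.normal(6, 1, (30, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                    # Euclidean distance by default
    labels = fcluster(Z, t=2, criterion="maxclust")
    sizes = np.bincount(labels)[1:]                  # fcluster labels start at 1
    print(f"{method:>8}: cluster sizes = {sizes.tolist()}")
```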
The Dendrogram
A dendrogram showing hierarchical relationships and merge distances
A dendrogram is a tree diagram that shows the hierarchical relationship between clusters.
Reading a Dendrogram
Vertical Axis (Height):
- Represents the distance at which clusters merge
- Higher merges = more dissimilar clusters
- Lower merges = more similar clusters
Horizontal Axis:
- Individual data points (leaves)
- Order is chosen for visualization clarity
Branches:
- Show which clusters merged
- Length indicates dissimilarity
Cutting the Tree:
- Horizontal line across dendrogram
- Height determines number of clusters
- Points whose branches join below the line belong to the same cluster
Interpreting Dendrograms
Long vertical lines:
- Large distance between merges
- Suggests natural separation
- Good place to cut the tree
Short vertical lines:
- Small distance between merges
- Clusters are similar
- Less clear separation
Balanced tree:
- Clusters of similar size
- Hierarchical structure is clear
Unbalanced tree:
- Some clusters much larger than others
- May indicate outliers or noise
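A dendrogram like the one described above can be drawn directly from the merge history. The sketch below assumes SciPy and Matplotlib are installed; the three synthetic blobs and the cut height of 10 are arbitrary illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (15, 2)),
               rng.normal(8, 1, (15, 2)),
               rng.normal([0, 8], 1, (15, 2))])

Z = linkage(X, method="ward")

plt.figure(figsize=(8, 4))
dendrogram(Z)                        # leaves are points; merge height is dissimilarity
plt.axhline(y=10, linestyle="--")    # example cut: branches below the line form clusters
plt.ylabel("Merge distance")
plt.title("Ward linkage dendrogram")
plt.show()
```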
Determining the Number of Clusters
Unlike K-means, hierarchical clustering doesn't require specifying k upfront. You can use the dendrogram to decide:
Visual Inspection
Look for:
- Long vertical lines (large gaps)
- Natural breaks in the tree
- Consistent cluster sizes
Cut at:
- Height with largest gap
- Level that makes domain sense
Elbow Method
Plot within-cluster variance vs. number of clusters:
- Look for "elbow" where improvement slows
- Similar to K-means elbow method
Silhouette Analysis
Calculate silhouette score for different k:
- Measures how well points fit their clusters
- Range: -1 (poor) to +1 (excellent)
- Choose k with highest average silhouette
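One way to automate this is to fit agglomerative clustering for several values of k and compare silhouette scores. The sketch below assumes scikit-learn is available and uses made-up data; the scanned range of k is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 2)),
               rng.normal(6, 1, (40, 2)),
               rng.normal([0, 6], 1, (40, 2))])

for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {score:.3f}")   # pick the k with the highest score
```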
Gap Statistic
Compare within-cluster variance to random data:
- Measures how much better clustering is than random
- Choose k where gap is largest
Domain Knowledge
Consider:
- What makes sense for your problem?
- Are there natural groupings?
- What's actionable?
Distance Metrics
Hierarchical clustering works with any distance metric:
Euclidean Distance (Most Common)
d = √[Σ(xᵢ - yᵢ)²]
When to use: Continuous features, spherical clusters
Manhattan Distance
d = Σ|xᵢ - yᵢ|
When to use: Grid-like data, less sensitive to outliers
Cosine Distance
d = 1 - (x·y)/(||x|| ||y||)
When to use: Text data, high-dimensional sparse data
Correlation Distance
d = 1 - correlation(x, y)
When to use: Time series, gene expression data
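With SciPy, switching metrics is a one-word change in `pdist`. The sketch below is illustrative only (random data, average linkage, and a cut at three clusters are assumptions); note that SciPy names Manhattan distance `cityblock`.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))

for metric in ["euclidean", "cityblock", "cosine", "correlation"]:
    D = pdist(X, metric=metric)                     # condensed pairwise distances
    Z = linkage(D, method="average")                # average linkage works with any metric
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(metric, np.bincount(labels)[1:].tolist())  # cluster sizes under this metric
```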
Advantages of Hierarchical Clustering
- No Need to Specify k: Explore different numbers of clusters
- Dendrogram Visualization: See relationships at all levels
- Deterministic: Same result every time (unlike K-means)
- Any Distance Metric: Works with custom similarity measures
- Hierarchical Structure: Reveals nested relationships
- No Random Initialization: No need to run multiple times
- Flexible: Can cut tree at any level
Limitations of Hierarchical Clustering
- Computational Complexity: O(n²) space, O(n³) time for naive implementation
- Not Scalable: Difficult for large datasets (>10,000 points)
- Sensitive to Noise: Outliers can distort the hierarchy
- Irreversible Decisions: Once merged, clusters can't be split
- Linkage Choice: Results depend heavily on linkage method
- Memory Intensive: Must store distance matrix
- No Probabilistic Model: No uncertainty estimates
Computational Complexity
Time Complexity
Naive algorithm:
- O(n³): For each of n-1 merges, search n² distances
Optimized algorithms:
- O(n² log n): Using priority queues and efficient data structures
Space Complexity
- O(n²): Must store distance matrix
- Can be prohibitive for large datasets
Scalability
Practical limits:
- Works well: < 1,000 points
- Manageable: 1,000 - 10,000 points
- Challenging: > 10,000 points
For large datasets:
- Use K-means or DBSCAN instead
- Or sample data first, then cluster
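A common workaround for large datasets is the sample-then-assign pattern sketched below: cluster a random sample hierarchically, then assign every remaining point to the nearest sample-cluster centroid. The sample size and the nearest-centroid assignment rule are illustrative choices, not the only options.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50_000, 2)),
               rng.normal(10, 1, (50_000, 2))])

# Cluster a manageable random sample hierarchically
sample_idx = rng.choice(len(X), size=2_000, replace=False)
Z = linkage(X[sample_idx], method="ward")
sample_labels = fcluster(Z, t=2, criterion="maxclust")

# Assign all points to the nearest sample-cluster centroid (one simple rule of many)
centroids = np.vstack([X[sample_idx][sample_labels == c].mean(axis=0)
                       for c in np.unique(sample_labels)])
full_labels = cdist(X, centroids).argmin(axis=1) + 1   # +1 to match fcluster's 1-based ids
```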
Handling Different Data Types
Numerical Features
- Use Euclidean or Manhattan distance
- Scale features if needed
- Most common scenario
Categorical Features
- Use Hamming distance or Jaccard similarity
- Or encode as binary features
- Consider Gower distance for mixed types
Mixed Features
- Use Gower distance (handles mixed types)
- Or normalize and use weighted distance
- Or encode categoricals as numerical
Text Data
- Use cosine similarity
- Or Jaccard similarity for sets
- TF-IDF vectors work well
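For text, a typical pipeline is TF-IDF vectors combined with cosine distance and average linkage. The toy documents below are made up, and scikit-learn is assumed for the vectorizer; everything else uses SciPy as before.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

X = TfidfVectorizer().fit_transform(docs).toarray()   # dense array so pdist can be used
D = pdist(X, metric="cosine")
Z = linkage(D, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # cluster id per document
```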
Comparison with K-Means
Hierarchical Clustering
- Pros: No need to specify k, deterministic, reveals structure
- Cons: Slow, not scalable, sensitive to noise
- Best for: Small datasets, exploratory analysis, hierarchical data
K-Means
- Pros: Fast, scalable, simple
- Cons: Need to specify k, random initialization, assumes spherical clusters
- Best for: Large datasets, known number of clusters, spherical clusters
When to Choose Which
Use Hierarchical when:
- Dataset is small (< 10,000 points)
- You don't know the number of clusters
- You want to explore the hierarchy
- You need deterministic results
Use K-Means when:
- Dataset is large
- You know the number of clusters
- Speed is important
- Clusters are roughly spherical
Real-World Applications
Hierarchical clustering is used in many domains:
- Biology: Phylogenetic trees, gene expression analysis
- Marketing: Customer segmentation with hierarchy
- Document Organization: Topic hierarchies, document trees
- Image Segmentation: Hierarchical image regions
- Social Networks: Community detection with levels
- Taxonomy: Creating classification hierarchies
Tips for Better Results
- Choose Linkage Carefully: Ward's is often best for spherical clusters
- Scale Features: Especially important for Euclidean distance
- Remove Outliers: They can distort the hierarchy
- Use Appropriate Distance: Match metric to data type
- Visualize Dendrogram: Helps understand structure
- Try Multiple Linkages: Compare results
- Consider Sampling: For large datasets, cluster a sample first
- Validate Results: Use silhouette or other metrics
Advanced Techniques
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Scalable hierarchical clustering
- Uses tree structure for efficiency
- Can handle large datasets
- O(n) time complexity
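scikit-learn ships a BIRCH implementation; the sketch below is a minimal illustration on synthetic blobs, with the threshold value chosen arbitrarily.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50_000, 2)),
               rng.normal(10, 1, (50_000, 2))])

# BIRCH builds a compact CF-tree in one pass, then clusters the tree's leaf entries
labels = Birch(threshold=0.5, n_clusters=2).fit_predict(X)
print(np.bincount(labels))   # points per cluster
```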
Agglomerative Clustering with Connectivity Constraints
- Add constraints on which clusters can merge
- Useful for spatial data
- Ensures clusters are spatially connected
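In scikit-learn, connectivity constraints are passed as a sparse graph. The sketch below builds a k-nearest-neighbour graph on made-up data; the choice of 10 neighbours is an assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2))])

# Only clusters connected in this graph are allowed to merge
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
labels = AgglomerativeClustering(n_clusters=2, connectivity=connectivity,
                                 linkage="ward").fit_predict(X)
```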
Hierarchical DBSCAN (HDBSCAN)
- Combines hierarchical and density-based approaches
- Handles varying density clusters
- More robust to noise
Interpreting Results
Cluster Profiles
For each cluster, examine:
- Size: How many points?
- Centroid: Average feature values
- Spread: Within-cluster variance
- Characteristics: What makes it unique?
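A quick way to build such profiles is a pandas group-by on the cluster label. The feature names and data below are hypothetical, chosen only to show the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "income": np.concatenate([rng.normal(40, 5, 100), rng.normal(90, 10, 100)]),
    "age":    np.concatenate([rng.normal(30, 4, 100), rng.normal(55, 6, 100)]),
})

df["cluster"] = AgglomerativeClustering(
    n_clusters=2, linkage="ward").fit_predict(df[["income", "age"]])

# Size, centroid (mean), and spread (std) per cluster
print(df.groupby("cluster").agg(["count", "mean", "std"]))
```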
Cluster Validation
Internal metrics:
- Silhouette score
- Davies-Bouldin index
- Calinski-Harabasz index
External metrics (if labels available):
- Adjusted Rand Index
- Normalized Mutual Information
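All of these metrics are available in scikit-learn. The sketch below computes them on synthetic data with a pretend ground-truth labelling; the external metrics are only meaningful when real labels exist.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             normalized_mutual_info_score)

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])
true_labels = np.repeat([0, 1], 50)   # pretend ground truth for the external metrics

labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

print("silhouette:        ", silhouette_score(X, labels))
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))
print("Adjusted Rand:     ", adjusted_rand_score(true_labels, labels))
print("NMI:               ", normalized_mutual_info_score(true_labels, labels))
```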
Actionability
Ask:
- Do clusters make domain sense?
- Are they actionable?
- Do they reveal insights?
- Can you explain them to stakeholders?
Summary
Hierarchical clustering is a powerful technique that:
- Builds a hierarchy of clusters from bottom-up or top-down
- Uses linkage criteria to determine cluster distances
- Creates dendrograms for visualization
- Doesn't require specifying the number of clusters upfront
- Works with any distance metric
- Is deterministic but computationally expensive
Understanding hierarchical clustering provides insights into data structure and relationships at multiple levels of granularity.
Next Steps
After mastering hierarchical clustering, explore:
- DBSCAN: Density-based clustering for arbitrary shapes
- Gaussian Mixture Models: Probabilistic clustering
- Spectral Clustering: Graph-based clustering
- Ensemble Clustering: Combining multiple clustering results
- Cluster Validation: Advanced metrics and techniques