Support Vector Machines (SVM)
Learn how SVM finds the maximum margin hyperplane and uses kernel functions for non-linear classification
Introduction
Support Vector Machines (SVM) are powerful supervised learning algorithms used for classification and regression. SVMs find the optimal boundary between classes by maximizing the margin - the distance between the decision boundary and the nearest data points from each class.
The key insight of SVM is elegant: among all possible boundaries that separate the classes, choose the one that is furthest from the nearest points of both classes. This "maximum margin" approach often leads to better generalization on unseen data.
What You'll Learn
By the end of this module, you will:
- Understand how SVM finds the maximum margin hyperplane
- Learn about support vectors and their critical role
- Explore kernel functions for handling non-linear data
- Recognize the impact of the C parameter on the margin
- Understand the kernel trick and its computational magic
The Core Concept: Maximum Margin
SVM finds the hyperplane that maximizes the margin between classes
Linear Separability
For linearly separable data, many lines could separate the two classes. Which one is best?
SVM's answer: The line that maximizes the margin.
The Margin
The margin is the distance between the decision boundary and the nearest data points from each class.
- Decision Boundary (Hyperplane): The line (or plane in higher dimensions) that separates classes
- Margin: The "buffer zone" on either side of the boundary
- Support Vectors: The data points closest to the boundary that define the margin
Why Maximum Margin?
Maximizing the margin provides:
- Better Generalization: More room for error on new data
- Robustness: Less sensitive to small perturbations
- Unique Solution: Only one maximum margin hyperplane exists
- Theoretical Guarantees: Strong mathematical foundations
How SVM Works
For Linearly Separable Data
Step 1: Define the Hyperplane
In 2D, the decision boundary is a line:
w₁x₁ + w₂x₂ + b = 0
In n dimensions, it's a hyperplane:
w·x + b = 0
Where:
- w: Weight vector (perpendicular to the hyperplane)
- b: Bias term (shifts the hyperplane)
- x: Feature vector
Step 2: Define the Margin
Points on the margin boundaries satisfy:
w·x + b = +1 (positive class margin)
w·x + b = -1 (negative class margin)
The margin width is:
margin = 2 / ||w||
Step 3: Maximize the Margin
To maximize the margin, minimize ||w||:
minimize: (1/2)||w||²
subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i
Where yᵢ ∈ {-1, +1} is the class label.
Step 4: Solve the Optimization Problem
This is a convex quadratic programming problem with a unique solution. The solution depends only on the support vectors - the points on the margin boundaries.
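To make these steps concrete, here is a minimal sketch (assuming scikit-learn and NumPy; the tiny dataset is made up for illustration) that fits an approximately hard-margin linear SVM and recovers the margin width 2 / ||w|| from the learned weight vector:

```python
# Minimal sketch: fit a linear SVM on a toy separable dataset and compute
# the margin width 2 / ||w|| from the learned coefficients.
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (two clusters in 2D)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM on separable data
model = SVC(kernel="linear", C=1e6).fit(X, y)

w = model.coef_[0]          # weight vector w
b = model.intercept_[0]     # bias term b
margin = 2 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin)
print("support vectors:\n", model.support_vectors_)
```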
Making Predictions
To classify a new point x:
f(x) = sign(w·x + b)
- If f(x) > 0: Predict positive class
- If f(x) < 0: Predict negative class
- If f(x) = 0: On the decision boundary
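As a small illustration of this rule, the sketch below hard-codes hypothetical values of w and b (as if read off a fitted linear SVM) and classifies new points with sign(w·x + b):

```python
# Minimal sketch of the prediction rule f(x) = sign(w·x + b),
# using assumed (hypothetical) values for w and b.
import numpy as np

w = np.array([0.5, 0.5])   # assumed weight vector
b = -3.0                   # assumed bias term

def predict(x):
    score = np.dot(w, x) + b           # signed score: which side of the boundary
    return 1 if score > 0 else -1      # sign of the score gives the class

print(predict(np.array([5.0, 4.0])))   # score = 1.5 > 0  -> +1
print(predict(np.array([1.0, 1.0])))   # score = -2.0 < 0 -> -1
```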
Support Vectors
Support vectors are the critical points that define the decision boundary
Support vectors are the training points that lie on the margin boundaries. They are the most important points because:
- They Define the Boundary: Only support vectors determine the hyperplane
- Other Points Don't Matter: Points far from the margin don't affect the solution
- Sparse Representation: Usually only a small fraction of training points are support vectors
- Memory Efficiency: Only need to store support vectors for predictions
Properties
- Removing non-support vectors doesn't change the model
- Adding points far from the margin doesn't change the model
- Support vectors are the "critical" examples
- Fewer support vectors = simpler model
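Assuming scikit-learn, the sketch below fits a linear SVM on a synthetic blob dataset and inspects which training points ended up as support vectors:

```python
# Sketch: fit an SVM and inspect its support vectors (synthetic data for illustration).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
model = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors per class:", model.n_support_)
print("indices of first support vectors:", model.support_[:10])
print("fraction of training points that are support vectors:",
      len(model.support_) / len(X))
```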
Soft Margin SVM: The C Parameter
Real-world data is rarely perfectly separable. Soft margin SVM allows some misclassifications.
The C Parameter
C controls the tradeoff between:
- Maximizing the margin (simpler model)
- Minimizing classification errors (better training accuracy)
minimize: (1/2)||w||² + C × Σξᵢ
subject to: yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ and ξᵢ ≥ 0 for all i
Where ξᵢ are slack variables measuring how far each point violates the margin.
Impact of C
Small C (e.g., C = 0.01)
- Wide margin: More regularization
- More violations allowed: Some points can be misclassified or inside the margin
- Simpler model: Less overfitting
- Lower training accuracy: May underfit
- Better generalization: Often better on test data
Large C (e.g., C = 100)
- Narrow margin: Less regularization
- Fewer violations: Tries to classify all training points correctly
- Complex model: More overfitting risk
- Higher training accuracy: May overfit
- Worse generalization: May perform poorly on test data
Optimal C
- Found through cross-validation
- Balances bias and variance
- Depends on the dataset
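A short sketch of this tradeoff (assuming scikit-learn; the C values and the noisy synthetic dataset are illustrative) showing how the margin narrows and the number of support vectors changes as C grows:

```python
# Sketch: compare small vs. large C on noisy data; smaller C typically keeps
# a wider margin and more support vectors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=42)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(model.coef_[0])
    print(f"C={C:>6}: support vectors={len(model.support_)}, "
          f"margin width={width:.3f}, train acc={model.score(X, y):.3f}")
```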
Kernel Functions: Handling Non-linear Data
The kernel trick maps data to higher dimensions where it becomes linearly separable
For non-linearly separable data, SVM uses the kernel trick to implicitly map data to a higher-dimensional space where it becomes linearly separable.
The Kernel Trick
Instead of explicitly computing the high-dimensional mapping φ(x), kernels compute the inner product directly:
K(x, x') = φ(x)·φ(x')
This is computationally efficient and even allows implicitly infinite-dimensional feature spaces, as with the RBF kernel.
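A minimal NumPy sketch of this idea for a degree-2 polynomial kernel: the kernel value (x·x')² computed directly in the original 2-D space matches the inner product of an explicit 3-D feature map, so the mapping never has to be constructed:

```python
# Kernel trick demonstration: K(x, z) = (x·z)^2 equals phi(x)·phi(z)
# for the explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(x):
    # explicit 3-dimensional feature map for 2-dimensional inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    # the same quantity computed directly in the original space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))   # 121.0
print(poly_kernel(x, z))        # 121.0 -- identical, without ever building phi
```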
Common Kernels
Linear Kernel
K(x, x') = x·x'
When to use:
- Data is linearly separable
- High-dimensional data (text, images)
- Fast training and prediction
- Interpretable model
RBF (Radial Basis Function) Kernel
K(x, x') = exp(-γ||x - x'||²)
When to use:
- Non-linear relationships
- No prior knowledge about data structure
- Most popular choice
- Works well in many scenarios
Gamma parameter:
- Small γ: Smooth, simple boundary (high bias)
- Large γ: Complex, wiggly boundary (high variance)
- Default ('scale' in scikit-learn): 1 / (n_features × X.var())
Polynomial Kernel
K(x, x') = (x·x' + c)^d
Where d is the degree.
When to use:
- Polynomial relationships
- Image processing
- Specific domain knowledge suggests polynomial features
Degree parameter:
- d = 1: Linear kernel
- d = 2: Quadratic features
- d = 3+: Higher-order interactions
Choosing a Kernel
- Start with RBF: Works well for most problems
- Try Linear: If you have many features or need speed
- Try Polynomial: If you suspect polynomial relationships
- Use Cross-Validation: Compare performance
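As a rough illustration of this workflow (assuming scikit-learn; make_moons is just a convenient non-linear toy dataset), the sketch below compares kernels with 5-fold cross-validation:

```python
# Sketch: compare kernels by cross-validation on a non-linear toy dataset;
# RBF is expected to do well here, linear to struggle.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:>6}: mean CV accuracy = {scores.mean():.3f}")
```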
Hyperparameters
C (Regularization)
- Range: 0.01 to 100 (typically)
- Small C: More regularization, wider margin
- Large C: Less regularization, narrower margin
- Tuning: Use cross-validation
Gamma (RBF/Polynomial)
- Range: 0.001 to 10 (typically)
- Small γ: Smooth boundary, far-reaching influence
- Large γ: Complex boundary, local influence
- Default: 'scale' = 1 / (n_features × X.var())
- Tuning: Use cross-validation
Degree (Polynomial)
- Range: 2 to 5 (typically)
- Higher degree: More complex, slower
- Tuning: Start with 2 or 3
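A sketch of this tuning process with scikit-learn's GridSearchCV; the grid values below are illustrative starting points, not recommendations:

```python
# Sketch: tune C and gamma with grid search + cross-validation,
# scaling features inside the pipeline so each fold is scaled correctly.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.25, random_state=1)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```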
Advantages of SVM
- Effective in High Dimensions: Works well even when n_features > n_samples
- Memory Efficient: Only uses support vectors
- Versatile: Different kernels for different data
- Robust: Maximum margin provides good generalization
- Theoretically Sound: Strong mathematical foundations
- Works with Small Datasets: Doesn't require massive amounts of data
Limitations of SVM
- Slow Training: roughly O(n²) to O(n³) in the number of samples, so large datasets are expensive
- Kernel Choice: Requires selecting appropriate kernel
- Hyperparameter Tuning: C and gamma need careful tuning
- No Probability Estimates: Doesn't naturally output probabilities (can be added)
- Sensitive to Feature Scaling: Requires normalization/standardization
- Black Box with Kernels: Non-linear kernels are less interpretable
- Memory for Large Datasets: Kernel matrix can be huge
Feature Scaling: Critical for SVM
SVM is extremely sensitive to feature scales because the margin and the kernel values depend on distances and inner products between feature vectors.
Why Scaling Matters
Without scaling:
- Features with larger ranges dominate
- The margin and support vectors are distorted
- Performance degrades significantly
Scaling Methods
Standardization (Recommended):
x_scaled = (x - mean) / std
Normalization:
x_scaled = (x - min) / (max - min)
Always scale features before training SVM!
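A common way to do this with scikit-learn is to put the scaler and the SVM in a Pipeline, so the scaler is fit on the training data only and applied consistently at prediction time; the built-in breast cancer dataset is used here purely because its features have very different scales:

```python
# Sketch: compare an unscaled SVM with a StandardScaler + SVM pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unscaled = SVC().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

print("test accuracy without scaling:", round(unscaled.score(X_test, y_test), 3))
print("test accuracy with scaling:   ", round(scaled.score(X_test, y_test), 3))
```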
Multi-class Classification
SVM is inherently binary. For multi-class problems, use:
One-vs-Rest (OvR)
- Train n binary classifiers (one per class)
- Each classifier: "class i vs. all others"
- Predict: Choose class with highest confidence
One-vs-One (OvO)
- Train n(n-1)/2 binary classifiers (one per pair)
- Each classifier: "class i vs. class j"
- Predict: Majority voting
Most libraries handle this automatically.
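For example, scikit-learn's SVC applies one-vs-one internally, while OneVsRestClassifier makes the one-vs-rest strategy explicit; the sketch below (using the Iris dataset purely for illustration) shows both:

```python
# Sketch: multi-class SVM via built-in OvO and an explicit OvR wrapper.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = make_pipeline(StandardScaler(), SVC())                       # built-in one-vs-one
ovr = make_pipeline(StandardScaler(), OneVsRestClassifier(SVC()))  # explicit one-vs-rest

print("OvO training accuracy:", round(ovo.fit(X, y).score(X, y), 3))
print("OvR training accuracy:", round(ovr.fit(X, y).score(X, y), 3))
```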
SVM for Regression (SVR)
SVM can also be used for regression:
- Goal: Fit data within an ε-tube
- Support Vectors: Points on or outside the tube
- Applications: Time series, function approximation
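A minimal SVR sketch (assuming scikit-learn and NumPy, fitting a noisy sine curve made up for illustration), where epsilon sets the width of the tube around the prediction:

```python
# Sketch: Support Vector Regression on a noisy 1-D function.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

print("support vectors used:", len(model.support_))
print("prediction at x=3.0:", model.predict([[3.0]])[0])
```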
Real-World Applications
SVM is used in many domains:
- Text Classification: Spam detection, sentiment analysis
- Image Recognition: Face detection, object classification
- Bioinformatics: Protein classification, gene expression
- Handwriting Recognition: Digit and character recognition
- Medical Diagnosis: Disease classification from symptoms
- Financial Forecasting: Stock market prediction
When to Use SVM
SVM works best when:
- You have small to medium-sized datasets
- You need high accuracy
- Data is high-dimensional (many features)
- You have non-linear relationships (use kernels)
- You want a theoretically sound approach
- You can afford training time
When NOT to Use SVM
Avoid SVM when:
- You have very large datasets (millions of samples)
- Training time is critical
- You need well-calibrated probability estimates out of the box (logistic regression is more natural)
- You need an interpretable model (use decision trees)
- Features are on very different scales and can't be normalized
Comparison with Other Algorithms
SVM vs. Logistic Regression
- SVM: Maximum margin, works with kernels, slower
- Logistic Regression: Probabilistic outputs, faster, linear decision boundary (unless features are engineered)
SVM vs. KNN
- SVM: Learns decision boundary, fast prediction, slow training
- KNN: Instance-based, slow prediction, no training phase
SVM vs. Decision Trees
- SVM: Smooth boundaries, requires scaling, powerful with kernels
- Decision Trees: Axis-aligned, no scaling needed, interpretable
SVM vs. Neural Networks
- SVM: Convex optimization, fewer hyperparameters, smaller data
- Neural Networks: Non-convex, many hyperparameters, needs more data
Tips for Better Results
- Scale Features: Always standardize or normalize
- Start with RBF: Good default kernel choice
- Tune C and Gamma: Use grid search with cross-validation
- Try Linear First: Faster and may work well
- Check for Overfitting: Compare training and validation accuracy
- Use Cross-Validation: For hyperparameter selection
- Consider Class Imbalance: Use class weights if needed
- Reduce Features: Remove irrelevant features for speed
Understanding the Decision Function
The decision function value indicates:
- |f(x)| large: High confidence (far from boundary)
- |f(x)| small: Low confidence (near boundary)
- f(x) = 0: On the decision boundary
This can be used for:
- Ranking predictions by confidence
- Identifying uncertain predictions
- Calibrating probabilities
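A sketch of these uses (assuming scikit-learn and a synthetic dataset): rank test points by |decision_function| to find uncertain predictions, and optionally calibrate probabilities with CalibratedClassifierCV:

```python
# Sketch: use decision_function values as confidence scores, and optionally
# wrap the SVM in CalibratedClassifierCV to obtain probability estimates.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(kernel="rbf").fit(X_train, y_train)
scores = svc.decision_function(X_test)           # signed confidence scores

most_uncertain = np.argsort(np.abs(scores))[:5]  # smallest |f(x)| = least confident
print("least confident test indices:", most_uncertain)

# Optional: Platt-style calibration for probability estimates
calibrated = CalibratedClassifierCV(SVC(kernel="rbf"), cv=5).fit(X_train, y_train)
print("calibrated P(class=1) for first test point:",
      round(calibrated.predict_proba(X_test[:1])[0, 1], 3))
```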
Summary
Support Vector Machines are powerful algorithms that:
- Find the maximum margin hyperplane for classification
- Use support vectors to define the decision boundary
- Handle non-linear data through kernel functions
- Balance margin width and classification errors via C
- Provide strong theoretical guarantees
- Work well in high-dimensional spaces
Understanding SVM provides deep insights into optimization, kernel methods, and the bias-variance tradeoff in machine learning.
Next Steps
After mastering SVM, explore:
- Kernel Methods: Custom kernels for specific domains
- Ensemble Methods: Combining multiple SVMs
- Deep Learning: Neural networks for more complex patterns
- Support Vector Regression: SVM for continuous targets
- Advanced Optimization: Understanding the dual problem