Decision Trees
Learn how decision trees make predictions using if-then-else rules and splitting criteria
Introduction
Decision trees are one of the most intuitive and interpretable machine learning algorithms. They make predictions by learning a series of if-then-else decision rules from the training data, creating a tree-like structure that mimics human decision-making.
Imagine diagnosing a patient: "If temperature > 38°C, then check for cough. If cough is present, then likely flu." This is exactly how decision trees work - they ask a series of questions about the features to arrive at a prediction.
What You'll Learn
By the end of this module, you will:
- Understand how decision trees learn rules from data
- Learn about splitting criteria (Gini impurity and entropy)
- Explore how tree depth affects model complexity
- Recognize the importance of pruning to prevent overfitting
- Understand how to interpret feature importance
How Decision Trees Work
Example of a decision tree structure with nodes and branches
The Tree Structure
A decision tree consists of:
- Root Node: The top node representing the entire dataset
- Internal Nodes: Decision points that split the data based on feature values
- Branches: Connections representing the outcome of a decision
- Leaf Nodes: Terminal nodes that contain the final prediction
Building the Tree
The algorithm builds the tree top-down using a greedy approach:
Step 1: Start with All Data
- Begin at the root with all training examples
Step 2: Find Best Split
- For each feature, evaluate all possible split points
- Choose the split that best separates the classes
- Use a criterion like Gini impurity or entropy
Step 3: Create Child Nodes
- Split the data into two groups based on the chosen feature and threshold
- Create left and right child nodes
Step 4: Recurse
- Repeat steps 2-3 for each child node
- Continue until a stopping criterion is met:
- Maximum depth reached
- Minimum samples in node
- All samples in node have same class
- No further improvement possible
Step 5: Assign Class Labels
- At each leaf node, assign the majority class of samples in that node
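The greedy search in Step 2 can be sketched in plain Python. This is an illustrative sketch, not a library API: `gini` and `best_split` are hypothetical helper names, and the exhaustive scan over every feature and threshold mirrors Steps 2-3 above.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Exhaustively evaluate every feature and threshold, and return the
    split minimizing the weighted Gini impurity of the two children."""
    n = len(y)
    best = None  # (weighted_gini, feature_index, threshold)
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [y[i] for i in range(n) if X[i][f] <= threshold]
            right = [y[i] for i in range(n) if X[i][f] > threshold]
            if not left or not right:
                continue  # skip splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Tiny dataset where feature 0 separates the classes perfectly:
X = [[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]]
y = ["A", "A", "B", "B"]
print(best_split(X, y))  # (0.0, 0, 2.0): "feature 0 <= 2.0" yields pure children
```

A real implementation would then recurse on the left and right subsets (Step 4) until a stopping criterion fires.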
Making Predictions
To classify a new example:
- Start at the root node
- Follow the decision rules down the tree
- At each node, go left or right based on the feature value
- Reach a leaf node and return its class label
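The traversal above can be shown with a tiny hand-built tree for the flu example from the introduction. The nested-dict representation and the `predict` helper are illustrative choices, not a standard format:

```python
# Internal nodes hold a feature, a threshold, and two children; leaves hold a label.
tree = {
    "feature": "temperature", "threshold": 38.0,
    "left": {"label": "healthy"},            # temperature <= 38
    "right": {                               # temperature > 38
        "feature": "cough", "threshold": 0.5,
        "left": {"label": "other"},          # no cough
        "right": {"label": "flu"},           # cough present
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when the feature value
    is <= the node's threshold and right otherwise."""
    while "label" not in node:
        value = sample[node["feature"]]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["label"]

print(predict(tree, {"temperature": 39.2, "cough": 1}))  # flu
print(predict(tree, {"temperature": 36.8, "cough": 1}))  # healthy
```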
Splitting Criteria
Comparison of Gini impurity and entropy splitting criteria
Gini Impurity
Measures the probability of incorrectly classifying a randomly chosen element:
Gini = 1 - Σ(pᵢ)²
Where pᵢ is the proportion of samples belonging to class i.
Interpretation:
- Gini = 0: Pure node (all samples same class)
- Gini = 0.5: Maximum impurity (binary classification, 50-50 split)
Example:
- Node with 100 samples: 70 class A, 30 class B
- Gini = 1 - (0.7² + 0.3²) = 1 - (0.49 + 0.09) = 0.42
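The example above can be checked with a few lines of Python (the function name `gini_impurity` is illustrative):

```python
def gini_impurity(counts):
    """Gini impurity computed from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# 70 class A vs 30 class B, as in the example above:
print(round(gini_impurity([70, 30]), 2))                 # 0.42
# Pure node and a 50-50 split, matching the interpretations above:
print(gini_impurity([100, 0]), gini_impurity([50, 50]))  # 0.0 0.5
```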
Advantages:
- Computationally efficient
- Favors larger partitions
- Works well in practice
Entropy (Information Gain)
Measures the amount of disorder or uncertainty:
Entropy = -Σ(pᵢ × log₂(pᵢ))
Information Gain is the reduction in entropy after a split:
Information Gain = Entropy(parent) - Weighted Average Entropy(children)
Interpretation:
- Entropy = 0: Pure node (no uncertainty)
- Entropy = 1: Maximum uncertainty (binary, 50-50 split)
Example:
- Node with 100 samples: 70 class A, 30 class B
- Entropy = -(0.7 × log₂(0.7) + 0.3 × log₂(0.3)) ≈ 0.88
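The entropy example, plus an information gain calculation, can be verified in Python. The child counts `[60, 10]` and `[10, 20]` are an assumed hypothetical split of the 100-sample node, chosen only to illustrate the weighted-average formula:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) computed from per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# 70 class A vs 30 class B, as in the example above:
print(round(entropy([70, 30]), 2))  # 0.88

# Information gain for a hypothetical split into [60, 10] and [10, 20]:
parent = entropy([70, 30])
children = (70 / 100) * entropy([60, 10]) + (30 / 100) * entropy([10, 20])
print(round(parent - children, 2))  # 0.19
```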
Advantages:
- Theoretically grounded in information theory
- Can produce slightly different trees than Gini
- Often performs similarly to Gini
Gini vs. Entropy
In practice, both criteria usually produce similar trees:
- Gini: Slightly faster to compute (no logarithm)
- Entropy: More sensitive to changes in node probabilities
- Choice: Often doesn't matter much; try both
Controlling Tree Complexity
Overfitting occurs when trees become too complex
Maximum Depth
Limits how deep the tree can grow:
- Shallow trees (depth 1-3): Simple rules, may underfit
- Medium trees (depth 4-8): Balanced complexity
- Deep trees (depth 10+): Complex rules, may overfit
Impact:
- Deeper trees can model more complex patterns
- But they're more likely to memorize training data
- Find the sweet spot through cross-validation
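A minimal sketch of the depth trade-off, assuming scikit-learn is installed. The tiny AND-style dataset (class 1 only when both features are large) is a hypothetical example where a single split cannot separate the classes:

```python
from sklearn.tree import DecisionTreeClassifier

# Class 1 only when BOTH features are large: one split cannot capture this.
X = [[2, 2], [2, 8], [8, 2], [8, 8]]
y = [0, 0, 0, 1]

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(stump.score(X, y))  # 0.75 -- one split underfits the interaction
print(deep.score(X, y))   # 1.0  -- two levels separate the classes
```

On real data the comparison should of course use held-out accuracy, typically via cross-validation.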
Minimum Samples to Split
Minimum number of samples required to split a node:
- Low values (2-5): More splits, complex tree
- High values (20-50): Fewer splits, simpler tree
Purpose:
- Prevents creating nodes with very few samples
- Reduces overfitting
- Improves generalization
Minimum Samples per Leaf
Minimum number of samples required in a leaf node:
- Ensures each prediction is based on sufficient data
- Prevents overly specific rules
- Typically set to 1-5 for classification
Maximum Features
Number of features to consider when looking for the best split:
- All features: Considers every feature (default)
- Square root: √(number of features) - common for classification
- Log2: log₂(number of features)
Purpose:
- Adds randomness to reduce overfitting
- Useful in ensemble methods (Random Forests)
Pruning
Pruning removes parts of the tree that don't improve performance:
Pre-pruning (Early Stopping)
Stop growing the tree early using stopping criteria:
- Maximum depth
- Minimum samples to split
- Minimum impurity decrease
Advantages:
- Computationally efficient
- Prevents overfitting during training
Post-pruning
Grow a full tree, then remove branches:
- Cost-complexity pruning
- Reduced error pruning
Advantages:
- Can find better trees
- More flexible
Process:
- Grow a full tree
- Evaluate each subtree
- Remove branches that don't improve validation performance
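Cost-complexity pruning is available directly in scikit-learn (assumed installed here): grow a full tree, inspect the candidate pruning strengths, then refit with a nonzero `ccp_alpha`. The value 0.02 below is an arbitrary illustration, not a recommended setting:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Candidate alphas at which subtrees become worth pruning:
path = full.cost_complexity_pruning_path(X, y)
print(len(path.ccp_alphas), "candidate alphas")

# A larger alpha penalizes tree size more heavily, so the tree shrinks:
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(full.get_n_leaves(), "->", pruned.get_n_leaves(), "leaves")
```

In practice `ccp_alpha` is chosen by cross-validating over the alphas returned by `cost_complexity_pruning_path`.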
Feature Importance
Decision trees naturally provide feature importance scores:
Importance = Σ over all nodes that split on the feature of (samples at node / total samples) × impurity decrease, then normalized so the importances across features sum to 1
Interpretation:
- Higher values = more important features
- Sum of all importances = 1.0
- Features used near the root are typically more important
Uses:
- Feature selection
- Understanding model behavior
- Explaining predictions
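With scikit-learn (assumed installed), these scores are exposed as `feature_importances_` after fitting; on the built-in iris dataset they sum to 1 as described above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# One normalized importance score per input feature:
for name, score in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")
print(round(sum(clf.feature_importances_), 6))  # 1.0
```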
Advantages of Decision Trees
- Interpretability: Easy to understand and visualize
- No Feature Scaling: Works with features on different scales
- Handles Mixed Data: Works with numerical and categorical features
- Non-linear Relationships: Can model complex, non-linear patterns
- Feature Importance: Automatically ranks feature importance
- Fast Predictions: Prediction time proportional to tree depth (O(log n) for a balanced tree)
- Few Hyperparameters: Relatively easy to tune
Limitations of Decision Trees
- Overfitting: Prone to creating overly complex trees
- Instability: Small changes in data can create very different trees
- Bias Toward Dominant Classes: Can be biased in imbalanced datasets
- Axis-Aligned Splits: Can only create rectangular decision boundaries
- Greedy Algorithm: May not find the globally optimal tree
- High Variance: Single trees can be unstable
Overfitting in Decision Trees
Decision trees are prone to overfitting because they can:
- Create a leaf for every training example
- Memorize noise in the training data
- Build overly complex rules
Signs of Overfitting:
- Perfect or near-perfect training accuracy
- Much lower test accuracy
- Very deep tree with many leaves
- Leaf nodes with very few samples
Solutions:
- Limit tree depth
- Increase minimum samples per split/leaf
- Use pruning
- Use ensemble methods (Random Forests, Gradient Boosting)
Decision Boundaries
Decision trees create axis-aligned decision boundaries:
- Splits are always parallel to feature axes
- Creates rectangular regions in feature space
- Can approximate any boundary with enough splits
- But may need many splits for diagonal boundaries
Example:
- To separate classes with a diagonal line
- Tree needs many horizontal and vertical splits
- Creates a "staircase" approximation
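The staircase effect can be seen by fitting a tree to a diagonal boundary. This sketch assumes scikit-learn is installed; the synthetic dataset (label 1 when x₁ > x₀) is a hypothetical example:

```python
import random
from sklearn.tree import DecisionTreeClassifier

random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [int(x1 > x0) for x0, x1 in X]  # diagonal decision boundary

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# A single oblique split would suffice, but axis-aligned splits force a
# staircase: the fitted tree needs many leaves to reach 100% training accuracy.
print(clf.get_n_leaves(), "leaves,", clf.score(X, y), "training accuracy")
```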
Handling Categorical Features
Decision trees naturally handle categorical features:
Binary Features:
- Split: "Is feature = value?"
Multi-class Features:
- Split: "Is feature in {value1, value2, ...}?"
Ordinal Features:
- Treat as numerical with natural ordering
Nominal Features:
- Consider all possible subsets (computationally expensive)
- Or use one-vs-rest splits
Handling Missing Values
Decision trees can handle missing values:
Surrogate Splits:
- Find alternative features that create similar splits
- Use when primary feature is missing
Separate Category:
- Treat missing as a separate category
- Let the tree learn how to handle it
Imputation:
- Fill missing values before training
- Use mean, median, or mode
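A minimal imputation sketch in plain Python, using `None` to mark missing entries (the function name `impute_median` is illustrative):

```python
def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    median = (observed[n // 2] if n % 2 == 1
              else (observed[n // 2 - 1] + observed[n // 2]) / 2)
    return [median if v is None else v for v in values]

print(impute_median([1.0, None, 3.0, 5.0]))  # [1.0, 3.0, 3.0, 5.0]
```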
Real-World Applications
Decision trees are used in many domains:
- Medical Diagnosis: Symptom-based disease classification
- Credit Scoring: Loan approval decisions
- Customer Segmentation: Marketing target identification
- Fraud Detection: Identifying suspicious transactions
- Manufacturing: Quality control and defect detection
- Recommendation Systems: Content filtering rules
When to Use Decision Trees
Decision trees work best when:
- You need an interpretable model
- You have mixed feature types (numerical and categorical)
- You don't want to spend time on feature scaling
- You need to explain predictions to non-technical stakeholders
- You have non-linear relationships
- You want to identify important features
When NOT to Use Decision Trees
Avoid single decision trees when:
- You need the highest possible accuracy (use ensembles instead)
- Your data has many irrelevant features
- You have very high-dimensional data
- You need stable predictions (small data changes shouldn't affect model)
- You have limited training data (prone to overfitting)
Comparison with Other Algorithms
Decision Trees vs. KNN
- Trees: Rule-based, interpretable, fast prediction
- KNN: Distance-based, no training phase (lazy learning), slow prediction
Decision Trees vs. Logistic Regression
- Trees: Non-linear, no assumptions, interpretable rules
- Logistic Regression: Linear, probabilistic, smooth boundaries
Decision Trees vs. Neural Networks
- Trees: Simple, interpretable, limited capacity
- Neural Networks: Complex, powerful, black box
Ensemble Methods
Single decision trees are often combined into ensembles:
Random Forests
- Train many trees on random subsets of data and features
- Average predictions for better performance
- Reduces overfitting and variance
Gradient Boosting
- Train trees sequentially, each correcting previous errors
- Creates powerful models
- Used in XGBoost, LightGBM, CatBoost
AdaBoost
- Weighted combination of weak trees
- Focuses on hard-to-classify examples
Tips for Better Results
- Start Simple: Begin with shallow trees (depth 3-5)
- Use Cross-Validation: Find optimal depth and parameters
- Prune Aggressively: Simpler trees generalize better
- Check Feature Importance: Remove irrelevant features
- Handle Imbalance: Use class weights for imbalanced data
- Visualize the Tree: Understand what rules it learned
- Consider Ensembles: Random Forests often outperform single trees
- Monitor Training vs. Test: Watch for overfitting
Interpreting Decision Trees
Reading the Tree
- Root: Most important feature for initial split
- Depth: Complexity of decision rules
- Leaf Values: Final predictions and confidence
Understanding Predictions
- Trace the path from root to leaf
- See which features and thresholds were used
- Understand why a prediction was made
Feature Importance
- Which features are used most?
- Which features appear near the root?
- Which features are never used?
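One practical way to read a fitted tree, assuming scikit-learn is installed, is `export_text`, which prints every root-to-leaf rule path:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Text rendering of the learned if-then-else rules, one branch per line:
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)
```

Each indented block corresponds to one decision node, so tracing a sample's prediction is a matter of following the matching conditions downward.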
Summary
Decision trees are powerful, interpretable algorithms that:
- Learn if-then-else rules from data
- Create tree structures for classification
- Use splitting criteria like Gini impurity or entropy
- Require careful tuning to prevent overfitting
- Provide natural feature importance rankings
- Form the basis for powerful ensemble methods
Understanding decision trees is essential for machine learning, as they're widely used both alone and as building blocks in ensemble methods like Random Forests and Gradient Boosting.
Next Steps
After mastering decision trees, explore:
- Random Forests: Ensemble of decision trees
- Gradient Boosting: Sequential tree ensembles (XGBoost, LightGBM)
- Pruning Techniques: Advanced tree optimization
- Feature Engineering: Creating better features for trees
- Interpretability: SHAP values and tree explanations