Feature Encoding

Introduction

Feature encoding is the process of converting categorical variables into a numerical format that machine learning algorithms can work with. Most ML algorithms require numerical data, so categorical features like "Red", "Green", "Blue" or "Small", "Medium", "Large" need to be transformed into numbers.

Why Feature Encoding Matters

Machine learning algorithms require numerical input, but real-world data often contains categorical information:

  • Product categories: Electronics, Clothing, Books
  • Geographic regions: North, South, East, West
  • Education levels: High School, Bachelor's, Master's, PhD
  • Yes/No responses: True, False

The encoding method you choose can significantly impact model performance and interpretability.

One-Hot Encoding

[Figure: One-Hot Encoding Example. One-hot encoding creates binary features for each category.]

One-hot encoding creates a binary feature for each unique category value.

Example:

  • Color: Red, Green, Blue → Color_Red, Color_Green, Color_Blue
  • Red → 1, 0, 0
  • Green → 0, 1, 0
  • Blue → 0, 0, 1
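
A minimal sketch with pandas (the frame and column names here are illustrative):

```python
import pandas as pd

# Toy frame with one nominal categorical column
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# get_dummies creates one 0/1 column per unique category;
# drop_first=True would drop one column to avoid the dummy variable trap
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```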

When to use:

  • Nominal categorical variables (no inherent order)
  • When you want to avoid imposing artificial ordering
  • With linear or distance-based algorithms, which would otherwise misread integer codes as magnitudes

Pros:

  • No artificial ordering imposed
  • Works well with linear models
  • Interpretable results

Cons:

  • Creates many features (curse of dimensionality)
  • Sparse representation
  • Can cause multicollinearity (the dummy variable trap; dropping one column per feature avoids it)

Label Encoding

Label encoding assigns a unique integer to each category.

Example:

  • Size: Small, Medium, Large → 0, 1, 2
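
A minimal scikit-learn sketch (note that LabelEncoder is documented for encoding target labels; for feature columns, the OrdinalEncoder shown in the next section is the usual choice):

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["Small", "Medium", "Large", "Medium"]
le = LabelEncoder()
print(le.fit_transform(sizes))  # [2 1 0 1]

# Integers are assigned alphabetically (Large=0, Medium=1, Small=2),
# so the mapping does not follow the natural size order; ordinal
# variables need an explicitly specified ordering (see below).
```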

When to use:

  • Ordinal categorical variables (natural ordering exists)
  • When memory/computation is limited
  • With tree-based algorithms

Pros:

  • Memory efficient
  • Simple and fast
  • Preserves ordinal relationships (only if the integer assignment matches the true order)

Cons:

  • Imposes artificial ordering on nominal variables
  • Can mislead distance-based algorithms
  • May suggest mathematical relationships that don't exist

Ordinal Encoding

[Figure: Ordinal vs Nominal. Different levels of measurement require different encoding approaches.]

Similar to label encoding, but the order of the categories is specified explicitly rather than assigned arbitrarily.

Example:

  • Education: High School, Bachelor's, Master's, PhD → 0, 1, 2, 3
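
A minimal sketch with scikit-learn's OrdinalEncoder, passing the desired ordering explicitly:

```python
from sklearn.preprocessing import OrdinalEncoder

# One list of categories per feature, in the desired order
order = [["High School", "Bachelor's", "Master's", "PhD"]]
enc = OrdinalEncoder(categories=order)

X = [["Bachelor's"], ["PhD"], ["High School"]]
print(enc.fit_transform(X))  # [[1.] [3.] [0.]]
```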

When to use:

  • Variables with clear hierarchical order
  • When the order matters for the prediction
  • To preserve meaningful relationships

Binary Encoding

Binary encoding combines label encoding with binary representation.

Process:

  1. Apply label encoding
  2. Convert integers to binary
  3. Split binary digits into separate features

Example:

  • 8 categories → 3 binary features (2³ = 8)
  • Category 5 → Binary 101 → 1, 0, 1
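
A hand-rolled sketch of the three steps using pandas (libraries such as category_encoders ship a ready-made BinaryEncoder):

```python
import pandas as pd

def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Label-encode a categorical Series, then split the bits into columns."""
    codes = series.astype("category").cat.codes       # step 1: label encode
    n_bits = max(int(codes.max()).bit_length(), 1)    # bits needed
    return pd.DataFrame({                             # steps 2-3: split bits
        f"{series.name}_bit{i}": (codes >> i) & 1
        for i in reversed(range(n_bits))
    })

colors = pd.Series(["Red", "Green", "Blue", "Green", "Red"], name="Color")
print(binary_encode(colors))
```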

When to use:

  • High cardinality categorical variables
  • When one-hot creates too many features
  • Memory-constrained environments

Pros:

  • More compact than one-hot
  • Handles high cardinality well
  • Reduces dimensionality

Cons:

  • Less interpretable
  • May not work well with all algorithms
  • Can create artificial relationships

Target Encoding (Mean Encoding)

[Figure: Target Encoding Process. Target encoding uses statistical relationships with the target variable.]

Target encoding replaces categories with the mean target value for that category.

Example:

  • Category A: Average target = 0.8 → A becomes 0.8
  • Category B: Average target = 0.3 → B becomes 0.3
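A minimal pandas sketch reproducing the numbers above. Fit the means on the training split only and map them onto validation/test data; production code usually adds smoothing or out-of-fold estimates to limit leakage:

```python
import pandas as pd

train = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "target":   [1.0, 0.6, 0.3, 0.4, 0.2],
})

# Mean target per category, computed on the training split only
means = train.groupby("category")["target"].mean()   # A -> 0.8, B -> 0.3
train["category_encoded"] = train["category"].map(means)
print(train)
```
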

When to use:

  • High cardinality categorical variables
  • When categories have strong relationship with target
  • In supervised learning problems

Pros:

  • Captures relationship with target
  • Handles high cardinality efficiently
  • Often improves model performance

Cons:

  • Risk of overfitting
  • Requires target variable
  • Can cause data leakage if not done carefully

Choosing the Right Encoding Method

[Figure: Decision Tree for Encoding. Decision process for choosing encoding methods.]

Decision Framework:

  1. Is there a natural ordering?
    • Yes → Ordinal or Label Encoding
    • No → Continue to step 2
  2. How many unique categories?
    • Few (< 10) → One-Hot Encoding
    • Many (> 50) → Target or Binary Encoding
    • Medium → Consider algorithm and memory constraints
  3. What algorithm are you using?
    • Tree-based → Label or Target Encoding
    • Linear models → One-Hot Encoding
    • Neural networks → One-Hot or Embedding
  4. Do you have a target variable?
    • Yes → Consider Target Encoding
    • No → Use unsupervised methods
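
The steps above can be condensed into a small helper; the thresholds and labels mirror the list and are heuristics, not hard rules:

```python
def suggest_encoding(is_ordinal: bool, n_categories: int,
                     model_family: str, has_target: bool) -> str:
    """Heuristic encoder choice mirroring the decision framework above."""
    if is_ordinal:                         # step 1: natural ordering exists
        return "ordinal"
    if n_categories < 10:                  # step 2: few categories
        return "one-hot"
    if n_categories > 50:                  # step 2: high cardinality
        return "target" if has_target else "binary"
    if model_family == "tree":             # step 3: medium cardinality
        return "target" if has_target else "label"
    return "one-hot"                       # linear models, neural networks

print(suggest_encoding(False, 100, "tree", True))  # -> target
```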

Key Takeaways

  1. Match encoding to data type: Nominal vs ordinal variables need different approaches
  2. Consider algorithm requirements: Some algorithms handle categorical features better than others
  3. Watch for overfitting: Target encoding can leak information if not done properly
  4. Balance interpretability vs performance: Simple encodings are more interpretable
  5. Handle high cardinality carefully: Too many categories can create problems
