Feature Selection


Introduction

Feature selection is the process of selecting a subset of relevant features from the original feature set. It's a crucial preprocessing step that can improve model performance, reduce overfitting, decrease training time, and enhance model interpretability by removing irrelevant, redundant, or noisy features.

Why Feature Selection Matters

In real-world datasets, not all features contribute equally to the prediction task:

  • Irrelevant features add noise and can hurt model performance
  • Redundant features provide the same information as other features
  • High dimensionality can lead to the curse of dimensionality
  • Computational cost increases with more features

Feature selection helps address these issues by identifying the most informative features.

Types of Feature Selection Methods

Figure: Overview of feature selection method categories.

1. Filter Methods

Evaluate features independently of the learning algorithm using statistical measures.

Characteristics:

  • Fast and computationally efficient
  • Independent of the learning algorithm
  • Good for initial feature screening
  • May miss feature interactions

2. Wrapper Methods

Use the learning algorithm itself to evaluate feature subsets.

Characteristics:

  • Consider feature interactions
  • Algorithm-specific
  • Computationally expensive
  • Risk of overfitting

3. Embedded Methods

Perform feature selection as part of the model training process.

Characteristics:

  • Balance between filter and wrapper methods
  • Algorithm-specific
  • Efficient
  • Consider feature interactions

Filter Methods

Variance Threshold

Figure: Features with low variance provide little information.

Removes features with variance below a threshold.

When to use:

  • Remove constant or quasi-constant features
  • Initial feature screening
  • When you have many features with little variation

Formula: Var(X) = E[(X - μ)²]

Pros:

  • Very fast
  • Simple to understand
  • Good for removing obviously useless features

Cons:

  • Doesn't consider relationship with target
  • May remove useful low-variance features
  • Threshold selection can be arbitrary
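
A minimal sketch of variance-based screening with scikit-learn's `VarianceThreshold` (the toy matrix and the 0.1 cutoff are illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the third column is quasi-constant (near-zero variance).
X = np.array([
    [1.0, 5.0, 0.0],
    [2.0, 3.0, 0.0],
    [3.0, 8.0, 0.1],
    [4.0, 1.0, 0.0],
])

# Drop features whose training-set variance is at or below the threshold.
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature Var(X) = E[(X - mu)^2]
print(selector.get_support())  # boolean mask of kept features
print(X_reduced.shape)         # (4, 2): the quasi-constant column is gone
```

Note that the threshold is in the units of each feature's variance, so unscaled features with different ranges are not directly comparable.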

Correlation-Based Selection

Selects features based on their correlation with the target variable.

When to use:

  • Linear relationships between features and target
  • Continuous target variables
  • Quick feature ranking

Formula: r = Σ[(xi - x̄)(yi - ȳ)] / √[Σ(xi - x̄)²Σ(yi - ȳ)²]

Pros:

  • Easy to interpret
  • Fast computation
  • Good for linear relationships

Cons:

  • Only captures linear relationships
  • Doesn't consider feature interactions
  • May select redundant features
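
A quick correlation-based ranking with pandas (the synthetic columns, the 3.0 coefficient, and the 0.3 cutoff are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "useful": rng.normal(size=n),
    "noise":  rng.normal(size=n),
})
# The target depends linearly on "useful" only.
df["target"] = 3.0 * df["useful"] + rng.normal(scale=0.5, size=n)

# Absolute Pearson correlation of each feature with the target.
corr = df.drop(columns="target").corrwith(df["target"]).abs()
ranked = corr.sort_values(ascending=False)
print(ranked)

# Keep features whose |r| exceeds a (somewhat arbitrary) cutoff.
selected = ranked[ranked > 0.3].index.tolist()
print(selected)
```

The cutoff illustrates the "threshold selection can be arbitrary" caveat: in practice it is often tuned by cross-validation.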

Mutual Information

Figure: Mutual information measures statistical dependence.

Measures the mutual dependence between features and target.

When to use:

  • Non-linear relationships
  • Both continuous and categorical variables
  • When correlation is insufficient

Formula: I(X;Y) = Σ p(x,y) log(p(x,y)/(p(x)p(y)))

Pros:

  • Captures non-linear relationships
  • Works with categorical variables
  • Information-theoretic foundation

Cons:

  • Continuous variables require discretization or nearest-neighbor estimation
  • Computationally more expensive
  • Sensitive to binning strategy
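
A sketch contrasting mutual information with correlation, using scikit-learn's `mutual_info_regression` (the synthetic features and noise level are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
n = 500
x_linear = rng.uniform(-2, 2, n)
x_quad = rng.uniform(-2, 2, n)
x_noise = rng.uniform(-2, 2, n)

# Target mixes a linear and a purely quadratic (non-linear) signal.
y = x_linear + x_quad ** 2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x_linear, x_quad, x_noise])
mi = mutual_info_regression(X, y, random_state=0)
print(dict(zip(["linear", "quadratic", "noise"], mi.round(2))))
```

The quadratic feature scores high on mutual information even though its Pearson correlation with the target is near zero, which is exactly the case where correlation-based filters fail.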

Statistical Tests (F-score, Chi-square)

Use statistical tests to measure feature-target relationships.

F-score (ANOVA):

  • For continuous features and categorical targets
  • Measures variance between groups vs within groups

Chi-square:

  • For categorical features and categorical targets
  • Tests independence between variables

When to use:

  • When you need statistical significance
  • Appropriate data types (continuous/categorical)
  • Formal hypothesis testing
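
The F-score test can be sketched with `SelectKBest` on the Iris dataset (continuous features, categorical target; `k=2` is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# ANOVA F-test: variance between classes vs. variance within classes.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_.round(1))  # F-statistic per feature
print(selector.pvalues_)          # significance of each feature
print(X_new.shape)                # (150, 2)
```

For categorical, non-negative features, swap `f_classif` for `sklearn.feature_selection.chi2` in the same pattern.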

Wrapper Methods

Recursive Feature Elimination (RFE)

Figure: RFE iteratively removes the least important features.

Recursively eliminates features by training models and removing the least important features.

Process:

  1. Train model on all features
  2. Rank features by importance
  3. Remove least important feature
  4. Repeat until desired number of features

When to use:

  • When you have a specific target number of features
  • With algorithms that provide feature importance
  • When computational cost is acceptable

Pros:

  • Considers feature interactions
  • Algorithm-specific selection
  • Systematic approach

Cons:

  • Computationally expensive
  • Risk of overfitting
  • Requires multiple model training
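
The four-step RFE loop above maps directly onto scikit-learn's `RFE` wrapper (the synthetic dataset and the target of 4 features are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 10 features, only 4 of which are informative.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, n_redundant=0,
                           random_state=0)

# Repeatedly fit the model and drop the weakest feature (step=1)
# until the desired number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4, step=1)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of kept features
print(rfe.ranking_)  # 1 = selected; higher = eliminated earlier
```

The estimator must expose `coef_` or `feature_importances_` so RFE can rank features at each round.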

Forward/Backward Selection

Forward Selection:

  • Start with no features
  • Add features one by one
  • Select feature that improves performance most

Backward Selection:

  • Start with all features
  • Remove features one by one
  • Remove the feature whose removal hurts performance least

When to use:

  • When you want to optimize for specific performance metric
  • With smaller feature sets
  • When you need interpretable selection process
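
Both directions are available through scikit-learn's `SequentialFeatureSelector`; a forward-selection sketch (the diabetes dataset and the choice of 4 features are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Forward selection: start empty, greedily add the feature that
# most improves cross-validated score until 4 features are chosen.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=4,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # mask of the 4 selected features
```

Backward selection is the same call with `direction="backward"`, starting from the full feature set.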

Embedded Methods

L1 Regularization (Lasso)

Figure: L1 vs. L2 regularization; the L1 penalty promotes sparsity.

Uses L1 penalty to drive feature weights to zero.

Formula: Loss = MSE + λΣ|wi|

When to use:

  • With linear models
  • When you want automatic feature selection
  • When interpretability is important

Pros:

  • Automatic feature selection
  • Built into model training
  • Handles multicollinearity

Cons:

  • Limited to L1-compatible algorithms
  • May select arbitrary features from correlated groups
  • Requires hyperparameter tuning
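
A sketch of Lasso-driven selection (the synthetic regression problem and `alpha=1.0`, i.e. λ, are illustrative; in practice λ is tuned, e.g. with `LassoCV`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 20 features, only 5 of which carry signal.
X, y = make_regression(n_samples=200, n_features=20,
                       n_informative=5, noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties need comparable scales

# Larger alpha (lambda) -> stronger penalty -> more weights driven to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
n_kept = int(np.sum(lasso.coef_ != 0))
print(f"non-zero coefficients: {n_kept} of {X.shape[1]}")
```

Features with exactly-zero coefficients are effectively deselected, which is the "automatic feature selection" property described above.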

Tree-based Feature Importance

Uses feature importance from tree-based models.

Methods:

  • Gini importance (Random Forest)
  • Permutation importance
  • SHAP values

When to use:

  • With tree-based algorithms
  • Non-linear relationships
  • Feature interactions matter

Pros:

  • Captures non-linear relationships
  • Considers feature interactions
  • Built into many algorithms

Cons:

  • Algorithm-specific
  • Can be biased toward high-cardinality features
  • May not generalize to other algorithms
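
A sketch comparing impurity-based (Gini) importance with permutation importance on a random forest (the synthetic dataset and forest size are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=8,
                           n_informative=3, n_redundant=0,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Gini (impurity-based) importance: free with training, but biased
# toward high-cardinality features.
print(forest.feature_importances_.round(3))

# Permutation importance: shuffle one feature at a time and measure
# the score drop; slower, but less biased.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean.round(3))
```

Either ranking can feed a selector such as `SelectFromModel` to keep only features above an importance threshold.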

Choosing the Right Method

Figure: Decision framework for choosing a feature selection method.

Decision Framework:

  1. Dataset size:
    • Small → Wrapper methods
    • Large → Filter methods
    • Medium → Embedded methods
  2. Computational resources:
    • Limited → Filter methods
    • Abundant → Wrapper methods
  3. Relationship type:
    • Linear → Correlation, F-score
    • Non-linear → Mutual information, tree-based
  4. Target variable:
    • Continuous → Correlation, F-score, mutual information
    • Categorical → Chi-square, mutual information
  5. Algorithm choice:
    • Linear models → L1 regularization
    • Tree-based → Built-in importance
    • Any → Filter methods first

Best Practices

Figure: Proper cross-validation with feature selection inside the pipeline.

  1. Apply after data splitting: Avoid data leakage
  2. Use cross-validation: Get robust feature selection
  3. Combine methods: Use filter → wrapper/embedded
  4. Consider domain knowledge: Don't ignore expert insights
  5. Monitor performance: Ensure selection improves results
  6. Handle multicollinearity: Remove redundant features
  7. Validate on test set: Confirm generalization

Common Pitfalls

  1. Data leakage: Selecting features using entire dataset
  2. Overfitting: Too aggressive selection on small datasets
  3. Ignoring interactions: Using only univariate methods
  4. Threshold sensitivity: Arbitrary cutoff selection
  5. Algorithm mismatch: Using inappropriate selection method

Key Takeaways

  1. Feature selection improves models by removing noise and reducing dimensionality
  2. Choose method based on data characteristics and computational constraints
  3. Combine multiple approaches for robust selection
  4. Validate properly to avoid overfitting and data leakage
  5. Consider domain knowledge alongside statistical measures
  6. Monitor performance to ensure selection helps your specific task

Interactive Exploration

Use the controls to:

  • Switch between different selection methods
  • Adjust thresholds and parameters
  • Compare original vs selected feature sets
  • Observe how different methods rank features
  • Explore the trade-off between feature count and information retention
