Feature selection is an important step in the machine learning pipeline. It can improve model accuracy, reduce overfitting, and shorten training time by keeping only the most relevant features (columns) in your dataset.

The What and Why of Feature Selection in Machine Learning

What is Feature Selection?

Feature selection is the process of choosing a subset of relevant features (predictors) to use when building a model.

Why is Feature Selection Important?

Benefit                   Description
Improved Accuracy         Removing irrelevant or noisy features can improve prediction performance.
Faster Training           Fewer features mean faster computation and reduced model complexity.
Less Overfitting          Reduces the chance of the model learning noise instead of the pattern.
Better Interpretability   Simpler models with fewer features are easier to understand and explain.

Common Feature Selection Techniques for Machine Learning

Feature selection techniques fall into three major categories:

Filter Methods

  • Based on statistical tests.
  • Independent of any ML model.
  • Examples: Correlation, Chi-squared test, ANOVA F-test
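As a minimal sketch of a filter method, the snippet below scores each feature by its absolute Pearson correlation with the target and keeps the five strongest. The variable names (scores, top_idx) and the choice of five features are assumptions made for illustration; no model is trained at any point.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Absolute Pearson correlation between each feature column and the 0/1 target.
# No model is trained: the score depends only on the data.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Indices of the 5 features with the strongest correlation (illustrative cutoff)
top_idx = np.argsort(scores)[::-1][:5]
for j in top_idx:
    print(f"{feature_names[j]}: {scores[j]:.3f}")
```

Because each feature is scored on its own, this approach is fast but cannot detect features that are only useful in combination with others.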

Wrapper Methods

  • Use a predictive model to score feature subsets.
  • Example: Recursive Feature Elimination (RFE)
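A wrapper method can be sketched with scikit-learn's RFE class, which repeatedly fits a model and drops the weakest feature until the requested number remains. The choice of logistic regression as the scoring model, the standardization step, and the target of five features are assumptions for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Standardize so the coefficient magnitudes RFE relies on are comparable
X_scaled = StandardScaler().fit_transform(X)

# Refit the model repeatedly, removing the lowest-ranked feature each round
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_scaled, y)

selected = data.feature_names[rfe.support_]  # boolean mask of kept features
print("Selected Features (Wrapper method):", selected)
```

Wrapper methods tend to cost more than filter methods because the model is retrained many times, once per elimination round here.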

Embedded Methods

  • Feature selection is done during model training.
  • Examples: Lasso (L1 Regularization), Tree-based methods (like Random Forest)
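The L1 idea can be sketched as follows. Since the target here is categorical, this example swaps in an L1-penalized linear classifier (LinearSVC) in place of Lasso regression; that substitution, the value C=0.01, and the variable names are assumptions for illustration. The L1 penalty drives some coefficients to zero during training, and SelectFromModel keeps the features whose coefficients survive.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

data = load_breast_cancer()
X, y = data.data, data.target

# L1 penalties are scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# Smaller C means stronger regularization and fewer surviving features
l1_model = LinearSVC(penalty="l1", dual=False, C=0.01, max_iter=5000)
selector = SelectFromModel(l1_model)
X_reduced = selector.fit_transform(X_scaled, y)

kept = data.feature_names[selector.get_support()]
print(f"Kept {len(kept)} of {X.shape[1]} features:", kept)
```

Unlike the filter and wrapper approaches, the number of selected features is not set directly here; it follows from the regularization strength.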

Code Example: Using Filter and Embedded Methods

Use Case: Selecting Features from the Breast Cancer Dataset

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
features = data.feature_names  # NumPy array of the 30 column names

# Filter Method: Select top 5 features using ANOVA F-test
# Note: for simplicity the selector is fit on the full dataset; in a real
# evaluation, fit it on the training split only so test data cannot influence the selection.
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
selected_features = features[selector.get_support()]  # boolean mask over the columns
print("Selected Features (Filter method):", selected_features)

# Train model with selected features
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"Accuracy with selected features: {accuracy_score(y_test, y_pred):.4f}")

Sample Output:

Selected Features (Filter method): ['mean concave points' 'worst perimeter' 'worst concave points' 'worst radius' 'worst area']

Accuracy with selected features: 0.9561
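The example above selects features using the full dataset before splitting, which lets the test rows influence which columns are chosen. One way to avoid that, sketched here under the assumption that a scikit-learn pipeline fits your workflow, is to split first and put the selector inside the pipeline so it is fit on the training data only. The StandardScaler step is an addition not present in the original example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first so the test rows never reach the selector
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Selection, scaling, and the classifier are all fit on the training split only
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=5),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train)

acc = pipe.score(X_test, y_test)  # mean accuracy on the held-out rows
print(f"Test accuracy: {acc:.4f}")
```

The same pipeline object can be passed to cross-validation utilities, in which case the selection step is refit inside every fold.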

Embedded Method Example: Using Feature Importances from Random Forest

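# Reuses X, y, features, np, and RandomForestClassifier from the example above
model = RandomForestClassifier(random_state=42)  # fixed seed so the ranking is reproducible
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_
indices = np.argsort(importances)[::-1][:5]  # indices of the 5 largest importances
top_features = features[indices]
print("Top 5 Features (Embedded method):", top_features)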

Conclusion

Feature selection helps you build efficient and accurate machine learning models. By keeping only the most relevant features, you can:

  • Reduce the time it takes to train models
  • Boost the model’s performance
  • Reduce the risk of overfitting
  • Make the results easier to interpret