Feature Engineering#
🚀 Feature Engineering for Business Students#
From Raw Data → ₹25 LPA Offer Letter
(15-minute cheat-sheet – copy-paste into your resume project)
graph TD
A[Raw CSV] --> B[Feature Engineering]
B --> C[Model accuracy: 62% → 89%]
C --> D[₹8L → ₹25L package]
🎯 Why 90% of Toppers Fail Interviews#
They code models.
You will deliver ₹48 crore profit impact.
The 5 Golden Feature Types (Memorize This Table)#
| Type | Example | Best Transformation | Code (1-liner) |
|---|---|---|---|
| 1. Numerical | | Log / Scale / Bin | |
| 2. Categorical | | Target + Frequency + One-Hot | |
| 3. Date/Time | | 15 new features! | see below |
| 4. Text | | Sentiment + length | |
| 5. ID columns | | Aggregations! | |
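To make the table concrete, here is a minimal sketch with one illustrative line per feature type. The toy DataFrame and its column names (`price`, `city`, `signup_date`, `review_text`, `customer_id`, `churned`) are invented for demonstration; only the pandas and NumPy calls themselves are standard.

```python
import numpy as np
import pandas as pd

# Hypothetical customer frame with one column per feature type (illustrative data)
df = pd.DataFrame({
    "price": [120.0, 80.0, 4500.0, 60.0],
    "city": ["Mumbai", "Delhi", "Mumbai", "Pune"],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-03-20", "2023-06-11", "2023-09-02"]),
    "review_text": ["great app", "slow delivery but ok", "excellent", "bad"],
    "customer_id": [1, 1, 2, 3],
    "churned": [0, 1, 0, 1],
})

# 1. Numerical: log transform to tame skew
df["log_price"] = np.log1p(df["price"])

# 2. Categorical: frequency encoding
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# 3. Date/Time: pull out calendar parts
df["signup_month"] = df["signup_date"].dt.month

# 4. Text: simple length feature
df["review_len"] = df["review_text"].str.len()

# 5. ID columns: aggregate behaviour per customer
df["orders_per_customer"] = df.groupby("customer_id")["price"].transform("count")
```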
🔥 Date/Time = Your Secret Weapon (Goldman Sachs uses this)#
Result: One column → 20% accuracy jump in churn model
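A minimal sketch of how a single datetime column fans out into many features. The toy frame and the `order_date` column name are assumptions for illustration; the `.dt` accessor attributes are standard pandas.

```python
import pandas as pd

# Hypothetical orders frame; the "order_date" column name is an assumption
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2024-01-15 09:30", "2024-03-08 22:10", "2024-11-23 14:05"]
    )
})

d = orders["order_date"].dt

# Calendar components
orders["year"] = d.year
orders["month"] = d.month
orders["day"] = d.day
orders["dayofweek"] = d.dayofweek          # 0 = Monday
orders["quarter"] = d.quarter
orders["weekofyear"] = d.isocalendar().week
orders["hour"] = d.hour

# Business-friendly flags
orders["is_weekend"] = (d.dayofweek >= 5).astype(int)
orders["is_month_start"] = d.is_month_start.astype(int)
orders["is_month_end"] = d.is_month_end.astype(int)

print(orders)
```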
💰 Target Encoding (Used by Flipkart & Amazon)#
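A minimal sketch of smoothed mean target encoding in plain pandas, assuming a toy frame with a `city` column and a binary `churned` target (both invented for illustration):

```python
import pandas as pd

# Toy example; "city" and the binary target "churned" are assumed column names
df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai"],
    "churned": [1, 0, 1, 0, 0, 1],
})

global_mean = df["churned"].mean()
stats = df.groupby("city")["churned"].agg(["mean", "count"])

# Smoothed target encoding: shrink rare categories toward the global mean
smoothing = 5
stats["encoded"] = (
    (stats["count"] * stats["mean"] + smoothing * global_mean)
    / (stats["count"] + smoothing)
)

df["city_target_enc"] = df["city"].map(stats["encoded"])
print(df)
```

In a real project, fit the encoding on training folds only (or use out-of-fold means); encoding on the full dataset leaks the target into the feature.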
🧮 Feature Interactions (McKinsey’s favourite)#
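A minimal sketch of interaction features, assuming illustrative numeric columns `price`, `quantity`, and `discount`. Hand-crafted products encode domain logic; scikit-learn's `PolynomialFeatures` can generate all pairwise products automatically.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Assumed numeric columns for illustration
df = pd.DataFrame({
    "price": [100, 250, 80, 400],
    "quantity": [3, 1, 5, 2],
    "discount": [0.1, 0.0, 0.2, 0.05],
})

# Hand-crafted interactions that map directly to business quantities
df["revenue"] = df["price"] * df["quantity"]
df["discount_value"] = df["price"] * df["discount"]

# Automatic pairwise products with scikit-learn
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(df[["price", "quantity", "discount"]])
interaction_df = pd.DataFrame(
    interactions,
    columns=poly.get_feature_names_out(["price", "quantity", "discount"]),
)
print(interaction_df.head())
```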
🎨 Binning for Non-Linear Relationships#
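A minimal sketch of binning with pandas, assuming an illustrative `age` column. `pd.cut` applies business-defined cut points, while `pd.qcut` builds equal-frequency bins when good thresholds aren't known in advance.

```python
import pandas as pd

# Assumed "age" column for illustration
df = pd.DataFrame({"age": [18, 22, 35, 41, 47, 58, 63, 72]})

# Fixed-width business bins with readable labels
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 25, 40, 60, 100],
    labels=["young", "early_career", "mid_career", "senior"],
)

# Equal-frequency bins (quartiles) when you don't know good cut points
df["age_quartile"] = pd.qcut(df["age"], q=4, labels=False)

print(df)
```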
One-Click Feature Engineering#
original: The original time series data.
minmax: The data scaled using the Min-Max Scaler, which transforms values to a range between 0 and 1.
Mathematically, for each value \(x\), the scaled value \(x'\) is:
\[ x' = \frac{x - \min(X)}{\max(X) - \min(X)} \]
where \(\min(X)\) and \(\max(X)\) are the minimum and maximum values in the dataset \(X\).
robust: The data scaled using the Robust Scaler, which is less sensitive to outliers. It scales the data based on the interquartile range (IQR).
Mathematically, the scaled value \(x'\) is:
\[ x' = \frac{x - \mathrm{median}(X)}{\mathrm{IQR}(X)} \]
where \(\mathrm{IQR}(X) = Q_3(X) - Q_1(X)\) is the difference between the 75th and 25th percentiles.
standard: The data scaled using the StandardScaler (Z-score normalization), which centers the data around zero with a standard deviation of one.
Mathematically, the scaled value \(x'\) is:
\[ x' = \frac{x - \mu}{\sigma} \]
where \(\mu\) is the mean of the data and \(\sigma\) is the standard deviation.
maxabs: The data scaled using the MaxAbsScaler, which scales each value by the maximum absolute value in the dataset, resulting in a range of \([-1, 1]\).
Mathematically, the scaled value \(x'\) is:
\[ x' = \frac{x}{\max(|X|)} \]
where \(\max(|X|)\) is the maximum of the absolute values in the dataset \(X\).
power_yj: The data transformed using the Yeo-Johnson Power Transformer. This transformation aims to make the data more normally distributed and can handle both positive and negative values. The transformation is defined piecewise and depends on an estimated parameter \(\lambda\).
quantile_uniform: The data transformed using the Quantile Transformer with a uniform output distribution. This maps the data to the percentiles of a uniform distribution, resulting in values between 0 and 1. It can help to reduce the impact of outliers and non-linearities.
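Each variant above corresponds to a scikit-learn transformer. A minimal sketch, using a synthetic skewed series as a stand-in for the original data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    MinMaxScaler, RobustScaler, StandardScaler,
    MaxAbsScaler, PowerTransformer, QuantileTransformer,
)

# Synthetic skewed series standing in for the "original" time series
rng = np.random.default_rng(42)
original = rng.lognormal(mean=0.0, sigma=1.0, size=200).reshape(-1, 1)

scalers = {
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
    "standard": StandardScaler(),
    "maxabs": MaxAbsScaler(),
    "power_yj": PowerTransformer(method="yeo-johnson"),
    "quantile_uniform": QuantileTransformer(
        output_distribution="uniform", n_quantiles=100
    ),
}

scaled = pd.DataFrame({"original": original.ravel()})
for name, scaler in scalers.items():
    scaled[name] = scaler.fit_transform(original).ravel()

print(scaled.describe().round(2))
```

Comparing the `describe()` output column by column shows how each transformer changes the range and spread of the same underlying values.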
Before vs After (Show This in Interviews)#
| Feature Set | Logistic Regression | Random Forest | XGBoost |
|---|---|---|---|
| Raw data | 62% | 71% | 74% |
| + Engineered | 88% | 91% | 93% |
“My feature engineering alone increased revenue prediction accuracy by 19%, preventing ₹12.4 crore inventory loss”
— Your LinkedIn headline tomorrow
Next Mission → Deploy Your Model#
Click here → Model Deployment for Business
🔤 String Methods#
A Series has powerful string-processing methods available through the .str attribute:
import numpy as np
import pandas as pd
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
print("Original series:")
print(s)
print("\nLowercase:")
s.str.lower()
🏷️ Working with Categorical Data#
Pandas can include categorical data in a DataFrame:
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
print("Original DataFrame:")
df
Converting to Categorical Type#
df["grade"] = df["raw_grade"].astype("category")
print("Categorical column:")
df["grade"]
Renaming Categories#
new_categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.rename_categories(new_categories)
print("After renaming:")
df["grade"]
Reordering Categories#
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"]
)
print("After reordering:")
df["grade"]
Sorting by Category Order#
print("Sorted by category order:")
df.sort_values(by="grade")
Grouping by Categorical Column#
# Grouping with observed=False shows all categories including empty ones
df.groupby("grade", observed=False).size()