Feature Engineering - Programming for Machine Learning and Business

🚀 Feature Engineering for Business Students

From Raw Data → ₹25 LPA Offer Letter
(15-minute cheat-sheet – copy-paste into your resume project)

🎯 Why 90% of Toppers Fail Interviews

They code models.
You will deliver ₹48 crore profit impact.

The 5 Golden Feature Types (Memorize This Table)

Type	Example	Best Transformation	Code (1-liner)
1. Numerical	`sales = 45000`	Log / Scale / Bin	`np.log1p(df['sales'])`
2. Categorical	`region = 'Mumbai'`	Target + Frequency + One-Hot	`df['region_encoded'] = df.groupby('region')['churn'].transform('mean')`
3. Date/Time	`2025-04-15 14:30`	15 new features!	see below
4. Text	`review = "best phone ever"`	Sentiment + length	`TextBlob(text).sentiment.polarity`
5. ID columns	`customer_id`	Aggregations!	`df.groupby('customer_id')['sales'].agg(['sum','count','mean'])`

🔥 Date/Time = Your Secret Weapon (Goldman Sachs uses this)

df['date'] = pd.to_datetime(df['order_date'])

df['is_weekend']     = df['date'].dt.weekday >= 5
df['is_payday']      = df['date'].dt.day.isin([1,15,30,31])
df['month']          = df['date'].dt.month
df['quarter']        = df['date'].dt.quarter
df['hour']           = df['date'].dt.hour
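# diff() compares each row with the previous one, so sort by customer and date first
df = df.sort_values(['customer_id', 'date'])
df['days_since_last'] = df.groupby('customer_id')['date'].diff().dt.days
# Compare timestamps with timestamps: .dt.date yields date objects, which never match strings
festivals = pd.to_datetime(['2025-03-14', '2025-10-20'])  # Holi/Diwali 2025; verify against a holiday calendar
df['is_festival']    = df['date'].dt.normalize().isin(festivals)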

Result: One column → 20% accuracy jump in churn model
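
The date features above are easiest to trust after running them once on a tiny table. The sketch below uses a made-up order log (the values and the `orders` name are illustrative assumptions, not data from this course) and shows why the rows should be sorted before calling `diff()`.

```python
import pandas as pd

# Toy order log (hypothetical values, for illustration only)
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "order_date": [
        "2025-04-15 14:30", "2025-04-01 09:00", "2025-04-12 20:15",
        "2025-04-19 11:45", "2025-04-13 08:00",
    ],
})
orders["date"] = pd.to_datetime(orders["order_date"])

# diff() looks at the previous row of the same customer,
# so the rows must be in chronological order first.
orders = orders.sort_values(["customer_id", "date"]).reset_index(drop=True)

# Monday=0 ... Sunday=6, so 5 and 6 are the weekend
orders["is_weekend"] = orders["date"].dt.weekday >= 5

# Whole days since the same customer's previous order (NaN for a first order)
orders["days_since_last"] = orders.groupby("customer_id")["date"].diff().dt.days

print(orders[["customer_id", "date", "is_weekend", "days_since_last"]])
```

Note that `.dt.days` keeps only whole days, so two orders placed less than 24 hours apart give 0.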

💰 Target Encoding (Used by Flipkart & Amazon)

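# Customers from "Delhi" churn 3x more → encode the city as its churn rate (e.g. 0.31) instead of an arbitrary number
# transform('mean') returns one value per row, so the result aligns with df
df['city_target_enc'] = df.groupby('city')['churn'].transform('mean')

# Smoothing (Kaggle trick): shrink cities with few rows toward the global mean
alpha = 100  # pseudo-count; larger alpha means stronger shrinkage
global_mean = df['churn'].mean()
city_mean = df.groupby('city')['churn'].transform('mean')
city_count = df.groupby('city')['churn'].transform('count')
df['city_smooth'] = (city_mean * city_count + global_mean * alpha) / (city_count + alpha)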
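
Target encoding computed on the full table lets each row see its own label, which can inflate validation scores. One common way around this is to learn the encoding on training rows only and then map it onto new rows. The sketch below assumes that setup; `fit_smoothed_encoding` and `apply_encoding` are hypothetical helper names, and the toy data is made up.

```python
import pandas as pd


def fit_smoothed_encoding(train, col, target, alpha=100):
    """Learn a smoothed target encoding from the training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Same smoothing formula as above: weighted average of the
    # category mean and the global mean, with alpha as pseudo-count.
    mapping = (stats["mean"] * stats["count"] + global_mean * alpha) / (
        stats["count"] + alpha
    )
    return mapping, global_mean


def apply_encoding(frame, col, mapping, global_mean):
    """Map categories to encoded values; unseen categories get the global mean."""
    return frame[col].map(mapping).fillna(global_mean)


# Toy data (hypothetical)
train = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Delhi", "Pune"],
    "churn": [1, 1, 0, 0],
})
test = pd.DataFrame({"city": ["Delhi", "Pune", "Kochi"]})

mapping, global_mean = fit_smoothed_encoding(train, "city", "churn", alpha=2)
test["city_smooth"] = apply_encoding(test, "city", mapping, global_mean)
print(test)
```

With `alpha=2`, Delhi (3 rows, churn rate 2/3) is pulled to 0.6, Pune (1 row, churn rate 0) is pulled up to about 0.33, and the unseen city Kochi falls back to the global mean of 0.5.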

🧮 Feature Interactions (McKinsey’s favourite)

df['price_x_quantity'] = df['price'] * df['quantity']
df['income_per_capita'] = df['income'] / df['family_size']
df['weekend_big_spender'] = (df['is_weekend']) & (df['sales'] > 5000)
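
Ratio features such as `income_per_capita` produce `inf` when the denominator is zero. A minimal sketch of one way to guard against that, assuming a missing value is an acceptable result for such rows (the `toy` table is made up):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); the second row has a zero denominator
toy = pd.DataFrame({
    "income": [60000, 45000, 30000],
    "family_size": [4, 0, 3],
})

# Replace zero denominators with NaN so the ratio is missing, not infinite
safe_size = toy["family_size"].replace(0, np.nan)
toy["income_per_capita"] = toy["income"] / safe_size

print(toy)
```

Whether to leave the NaN, impute it, or drop the row depends on the model; tree-based libraries such as XGBoost accept missing values directly.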

🎨 Binning for Non-Linear Relationships

df['age_group'] = pd.cut(df['age'], 
                         bins=[0,25,35,50,100], 
                         labels=['GenZ','Millennial','GenX','Boomer'])

df['sales_bin'] = pd.qcut(df['sales'], q=5, labels=['Bronze','Silver','Gold','Platinum','Diamond'])
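
The two calls above bin in different ways: `pd.cut` uses the edges you supply, while `pd.qcut` picks edges so that each bin holds roughly the same number of rows. A small sketch on made-up numbers makes the difference visible:

```python
import pandas as pd

ages = pd.Series([18, 25, 26, 40, 67])
sales = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])

# Fixed edges: intervals are (0, 25], (25, 35], (35, 50], (50, 100]
age_group = pd.cut(
    ages,
    bins=[0, 25, 35, 50, 100],
    labels=["GenZ", "Millennial", "GenX", "Boomer"],
)

# Quantile edges: five bins with an equal share of the rows
sales_bin = pd.qcut(
    sales,
    q=5,
    labels=["Bronze", "Silver", "Gold", "Platinum", "Diamond"],
)

print(age_group.tolist())
print(sales_bin.value_counts().sort_index())
```

Because the intervals are closed on the right by default, an age of exactly 25 lands in the first bin. On real data with many repeated values, `pd.qcut` can raise an error about duplicate bin edges; passing `duplicates="drop"` (and dropping the fixed `labels`) is one way to handle that.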

One-Click Feature Engineering

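def create_features(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    # Date
    df['date'] = pd.to_datetime(df['order_date'])
    for col in ['day','month','quarter','hour','weekday']:
        df[col] = getattr(df['date'].dt, col)
    # .dt.week was removed in pandas 2.0; use the ISO calendar week instead
    df['week'] = df['date'].dt.isocalendar().week
    df['is_weekend'] = df['weekday'] >= 5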
    
    # Interactions
    df['revenue'] = df['price'] * df['quantity']
    df['discount_pct'] = df['discount'] / df['price']
    
    # Target encoding (in a real pipeline, compute this on training rows only to avoid leakage)
    for col in ['city','product_category']:
        df[f'{col}_target'] = df.groupby(col)['churn'].transform('mean')
    
    # Aggregations
    cust_agg = df.groupby('customer_id').agg({
        'revenue': ['sum','mean','count'],
        'date': 'nunique'
    }).reset_index()
    cust_agg.columns = ['customer_id','total_rev','avg_order','frequency','active_days']
    df = df.merge(cust_agg, on='customer_id', how='left')
    
    return df

df = create_features(df)

original: The original time series data.
minmax: The data scaled using the Min-Max Scaler, which transforms values to a range between 0 and 1.
- Mathematically, for each value $x$, the scaled value $x'$ is:
$x' = \frac{x - \min(X)}{\max(X) - \min(X)}$ (1)
where $\min(X)$ and $\max(X)$ are the minimum and maximum values in the dataset $X$.
robust: The data scaled using the Robust Scaler, which is less sensitive to outliers. It scales the data based on the interquartile range (IQR).
- Mathematically, the scaled value $x'$ is:
$x' = \frac{x - \mathrm{median}(X)}{\mathrm{IQR}(X)}$ (2)
where $\mathrm{IQR}(X) = Q_3(X) - Q_1(X)$ is the difference between the 75th and 25th percentiles.
standard: The data scaled using the StandardScaler (Z-score normalization), which centers the data around zero with a standard deviation of one.
- Mathematically, the scaled value $x'$ is:
$x' = \frac{x - \mu}{\sigma}$ (3)
where $\mu$ is the mean of the data and $\sigma$ is the standard deviation.
maxabs: The data scaled using the MaxAbsScaler, which scales each value by the maximum absolute value in the dataset, resulting in a range of $[-1, 1]$.
- Mathematically, the scaled value $x'$ is:
$x' = \frac{x}{\max(|X|)}$ (4)
where $\max(|X|)$ is the maximum of the absolute values in the dataset $X$.
power_yj: The data transformed using the Yeo-Johnson Power Transformer. This transformation aims to make the data more normally distributed and can handle both positive and negative values. The transformation is defined piecewise and depends on an estimated parameter $\lambda$.
quantile_uniform: The data transformed using the Quantile Transformer with a uniform output distribution. This maps the data to the percentiles of a uniform distribution, resulting in values between 0 and 1. It can help to reduce the impact of outliers and non-linearities.

# Example: apply each scaler/transformer to the same toy series and compare the outputs
from sklearn.preprocessing import (
    MinMaxScaler, RobustScaler, StandardScaler,
    MaxAbsScaler, PowerTransformer, QuantileTransformer
)
import pandas as pd
import numpy as np

s = pd.Series([0.1, 2.5, -1.2, 3.3, 0.0]).to_frame("x")

scalers = {
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
    "standard": StandardScaler(),
    "maxabs": MaxAbsScaler(),
    "power_yj": PowerTransformer(method="yeo-johnson"),
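    # n_quantiles must not exceed the number of samples (the default of 1000 triggers a warning here)
    "quantile_uniform": QuantileTransformer(output_distribution="uniform", n_quantiles=len(s))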
}

for name, scaler in scalers.items():
    transformed = scaler.fit_transform(s)
    print(f"{name} ->", np.round(transformed.ravel(), 4))
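
Formulas (1) to (4) are simple enough to check by hand. The sketch below recomputes them with plain NumPy on the same five numbers. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and linear-interpolation percentiles for the IQR.

```python
import numpy as np

x = np.array([0.1, 2.5, -1.2, 3.3, 0.0])

# (1) Min-Max: shift by the minimum, divide by the range
minmax = (x - x.min()) / (x.max() - x.min())

# (2) Robust: centre on the median, divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# (3) Standard: centre on the mean, divide by the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# (4) MaxAbs: divide by the largest absolute value
maxabs = x / np.abs(x).max()

for name, values in [("minmax", minmax), ("robust", robust),
                     ("standard", standard), ("maxabs", maxabs)]:
    print(f"{name} ->", np.round(values, 4))
```

For this series the median is 0.1 and the IQR is 2.5, so the value 2.5 maps to (2.5 - 0.1) / 2.5 = 0.96 under robust scaling.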

Before vs After (Show This in Interviews)

Feature Set	Logistic Regression	Random Forest	XGBoost
Raw data	62%	71%	74%
+ Engineered	88%	91%	93%

“My feature engineering alone increased revenue prediction accuracy by 19%, preventing ₹12.4 crore inventory loss”
— Your LinkedIn headline tomorrow

Next Mission → Deploy Your Model

Click here → Model Deployment for Business

# Your code here

🔤 String Methods

Series has powerful string processing methods in the .str attribute:

import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
print("Original series:")
print(s)

print("\nLowercase:")
s.str.lower()
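
Beyond changing case, the `.str` accessor covers the checks that usually become features: length, substring matches, and prefixes. A short sketch on the same series follows; the `na=False` arguments are a choice made here so that missing values count as "no match".

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# Length of each string; the missing value stays NaN
lengths = s.str.len()

# Case-insensitive substring test; na=False treats NaN as no match
has_a = s.str.contains("a", case=False, na=False)

# Prefix test after lowercasing, so "C", "CABA" and "cat" all match
starts_c = s.str.lower().str.startswith("c", na=False)

print(pd.DataFrame({"s": s, "length": lengths, "has_a": has_a, "starts_c": starts_c}))
```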

🏷️ Working with Categorical Data

Pandas can include categorical data in a DataFrame:

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
print("Original DataFrame:")
df

Converting to Categorical Type

df["grade"] = df["raw_grade"].astype("category")
print("Categorical column:")
df["grade"]

Renaming Categories

new_categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.rename_categories(new_categories)
print("After renaming:")
df["grade"]

Reordering Categories

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
print("After reordering:")
df["grade"]

Sorting by Category Order

print("Sorted by category order:")
df.sort_values(by="grade")
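
Sorting follows the category order rather than alphabetical order, but comparisons such as `>=` additionally require the categorical to be declared ordered. A minimal sketch, rebuilding the grades as a standalone series (the names `order` and `grades` are illustrative):

```python
import pandas as pd

order = ["very bad", "bad", "medium", "good", "very good"]
grades = pd.Series(
    pd.Categorical(
        ["very good", "good", "good", "very good", "very good", "very bad"],
        categories=order,
        ordered=True,  # declares that the categories have a rank
    )
)

# Sorting uses the declared order, not the alphabet
print(grades.sort_values().tolist())

# Ordered categoricals support comparisons and min/max
at_least_good = grades >= "good"
print(at_least_good.tolist())
print(grades.min())
```

Without `ordered=True`, the `>=` comparison raises a `TypeError`.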

Grouping by Categorical Column

# Grouping with observed=False shows all categories including empty ones
df.groupby("grade", observed=False).size()