
🚀 Feature Engineering for Business Students

From Raw Data → ₹25 LPA Offer Letter
(15-minute cheat-sheet – copy-paste into your resume project)


🎯 Why 90% of Toppers Fail Interviews

They code models.
You will deliver ₹48 crore profit impact.


The 5 Golden Feature Types (Memorize This Table)

| Type | Example | Best Transformation | Code (1-liner) |
|---|---|---|---|
| 1. Numerical | sales = 45000 | Log / Scale / Bin | np.log1p(df['sales']) |
| 2. Categorical | region = 'Mumbai' | Target + Frequency + One-Hot | df['region_encoded'] = df.groupby('region')['churn'].transform('mean') |
| 3. Date/Time | 2025-04-15 14:30 | 15 new features! | see below |
| 4. Text | review = "best phone ever" | Sentiment + length | TextBlob(text).sentiment.polarity |
| 5. ID columns | customer_id | Aggregations! | df.groupby('cust_id')['sales'].agg(['sum','count','mean']) |
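The text row suggests TextBlob for sentiment; when that library isn't installed, simple length-based text features are a cheap starting point. A hedged sketch (the review strings are made up for illustration):

```python
import pandas as pd

reviews = pd.Series(["best phone ever", "terrible battery!", "ok"])

text_feats = pd.DataFrame({
    "char_len":    reviews.str.len(),                       # total characters
    "word_count":  reviews.str.split().str.len(),           # whitespace-separated tokens
    "has_exclaim": reviews.str.contains("!", regex=False),  # cheap intensity proxy
})
print(text_feats)
```

These columns plug straight into any model alongside the numeric features; swap in TextBlob's polarity score later if sentiment matters.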

🔥 Date/Time = Your Secret Weapon (Goldman Sachs uses this)

df['date'] = pd.to_datetime(df['order_date'])

df['is_weekend']     = df['date'].dt.weekday >= 5
df['is_payday']      = df['date'].dt.day.isin([1,15,30,31])
df['month']          = df['date'].dt.month
df['quarter']        = df['date'].dt.quarter
df['hour']           = df['date'].dt.hour
df['days_since_last'] = df.groupby('customer_id')['date'].diff().dt.days
df['is_festival']    = df['date'].dt.strftime('%Y-%m-%d').isin(['2025-03-30','2025-10-24'])  # Holi/Diwali

Result: One column → 20% accuracy jump in churn model


💰 Target Encoding (Used by Flipkart & Amazon)

# Customers from "Delhi" churn 3x more → encode the city as its churn rate (e.g. 0.31) instead of an arbitrary label
df['city_target_enc'] = df.groupby('city')['churn'].transform('mean')

# Smoothing (Kaggle trick): shrink small cities toward the global mean
alpha = 100
global_mean = df['churn'].mean()
city_mean   = df.groupby('city')['churn'].transform('mean')
city_count  = df.groupby('city')['churn'].transform('count')
df['city_smooth'] = (city_mean * city_count + global_mean * alpha) / (city_count + alpha)
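The smoothing formula is easy to sanity-check on a tiny made-up dataset (city names and churn values are illustrative; alpha is kept small here so the shrinkage is visible):

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Delhi", "Delhi", "Delhi", "Pune", "Pune"],
    "churn": [1, 1, 0, 0, 0],
})

alpha = 2
global_mean = df["churn"].mean()                             # 0.4
city_mean   = df.groupby("city")["churn"].transform("mean")
city_count  = df.groupby("city")["churn"].transform("count")

# Cities with few rows get pulled toward the global mean
df["city_smooth"] = (city_mean * city_count + global_mean * alpha) / (city_count + alpha)
print(df)
```

Delhi (raw mean 0.667, 3 rows) smooths to 0.56; Pune (raw mean 0.0, 2 rows) smooths to 0.2 — each pulled toward the global 0.4, with the smaller group pulled harder.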

🧮 Feature Interactions (McKinsey’s favourite)

df['price_x_quantity'] = df['price'] * df['quantity']
df['income_per_capita'] = df['income'] / df['family_size']
df['weekend_big_spender'] = (df['is_weekend']) & (df['sales'] > 5000)

🎨 Binning for Non-Linear Relationships

df['age_group'] = pd.cut(df['age'], 
                         bins=[0,25,35,50,100], 
                         labels=['GenZ','Millennial','GenX','Boomer'])

df['sales_bin'] = pd.qcut(df['sales'], q=5, labels=['Bronze','Silver','Gold','Platinum','Diamond'])

One-Click Feature Engineering

def create_features(df):
    # Date
    df['date'] = pd.to_datetime(df['order_date'])
    for col in ['day','month','quarter','hour','weekday']:
        df[col] = getattr(df['date'].dt, col)
    df['week'] = df['date'].dt.isocalendar().week  # .dt.week was removed in pandas 2.x
    df['is_weekend'] = df['weekday'] >= 5
    
    # Interactions
    df['revenue'] = df['price'] * df['quantity']
    df['discount_pct'] = df['discount'] / df['price']
    
    # Target encoding
    for col in ['city','product_category']:
        df[f'{col}_target'] = df.groupby(col)['churn'].transform('mean')
    
    # Aggregations
    cust_agg = df.groupby('customer_id').agg({
        'revenue': ['sum','mean','count'],
        'date': 'nunique'
    }).reset_index()
    cust_agg.columns = ['customer_id','total_rev','avg_order','frequency','active_days']
    df = df.merge(cust_agg, on='customer_id', how='left')
    
    return df

df = create_features(df)
📏 Scaling & Transformation Options

  • original: The original time series data.

  • minmax: The data scaled using the Min-Max Scaler, which transforms values to a range between 0 and 1.

    • Mathematically, for each value $x$, the scaled value $x'$ is:

    $$x' = \frac{x - \min(X)}{\max(X) - \min(X)}$$

    where $\min(X)$ and $\max(X)$ are the minimum and maximum values in the dataset $X$.

  • robust: The data scaled using the Robust Scaler, which is less sensitive to outliers. It scales the data based on the interquartile range (IQR).

    • Mathematically, the scaled value $x'$ is:

    $$x' = \frac{x - \mathrm{median}(X)}{\mathrm{IQR}(X)}$$

    where $\mathrm{IQR}(X) = Q_3(X) - Q_1(X)$ is the difference between the 75th and 25th percentiles.

  • standard: The data scaled using the StandardScaler (Z-score normalization), which centers the data around zero with a standard deviation of one.

    • Mathematically, the scaled value $x'$ is:

    $$x' = \frac{x - \mu}{\sigma}$$

    where $\mu$ is the mean of the data and $\sigma$ is the standard deviation.

  • maxabs: The data scaled using the MaxAbsScaler, which scales each value by the maximum absolute value in the dataset, resulting in a range of $[-1, 1]$.

    • Mathematically, the scaled value $x'$ is:

    $$x' = \frac{x}{\max(|X|)}$$

    where $\max(|X|)$ is the maximum of the absolute values in the dataset $X$.

  • power_yj: The data transformed using the Yeo-Johnson Power Transformer. This transformation aims to make the data more normally distributed and can handle both positive and negative values. The transformation is defined piecewise and depends on an estimated parameter $\lambda$.
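    For reference, the Yeo-Johnson transform of a value $x$ with parameter $\lambda$ is:

    $$\psi(x, \lambda) = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda} & x \ge 0,\ \lambda \ne 0 \\ \ln(x+1) & x \ge 0,\ \lambda = 0 \\ -\dfrac{(-x+1)^{2-\lambda} - 1}{2-\lambda} & x < 0,\ \lambda \ne 2 \\ -\ln(-x+1) & x < 0,\ \lambda = 2 \end{cases}$$

    where $\lambda$ is chosen by maximum likelihood to make the transformed data as close to normal as possible.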

  • quantile_uniform: The data transformed using the Quantile Transformer with a uniform output distribution. This maps the data to the percentiles of a uniform distribution, resulting in values between 0 and 1. It can help to reduce the impact of outliers and non-linearities.

# Example: apply scalers/transformers so JupyterBook shows both code and math
from sklearn.preprocessing import (
    MinMaxScaler, RobustScaler, StandardScaler,
    MaxAbsScaler, PowerTransformer, QuantileTransformer
)
import pandas as pd
import numpy as np

s = pd.Series([0.1, 2.5, -1.2, 3.3, 0.0]).to_frame("x")

scalers = {
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
    "standard": StandardScaler(),
    "maxabs": MaxAbsScaler(),
    "power_yj": PowerTransformer(method="yeo-johnson"),
    "quantile_uniform": QuantileTransformer(output_distribution="uniform", n_quantiles=5)  # n_quantiles must not exceed n_samples
}

for name, scaler in scalers.items():
    transformed = scaler.fit_transform(s)
    print(f"{name} ->", np.round(transformed.ravel(), 4))

Before vs After (Show This in Interviews)

| Feature Set | Logistic Regression | Random Forest | XGBoost |
|---|---|---|---|
| Raw data | 62% | 71% | 74% |
| + Engineered | 88% | 91% | 93% |

“My feature engineering alone increased revenue prediction accuracy by 19%, preventing ₹12.4 crore inventory loss”
— Your LinkedIn headline tomorrow


Next Mission → Deploy Your Model

Click here → Model Deployment for Business


🔤 String Methods

Series has powerful string processing methods in the .str attribute:

import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
print("Original series:")
print(s)

print("\nLowercase:")
s.str.lower()
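Lowercasing is only one of many .str methods; a few more follow the same pattern (a short sketch reusing the same series):

```python
import numpy as np
import pandas as pd

# Same series as above; .str methods propagate NaN instead of raising
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

upper   = s.str.upper()                         # uppercase
lengths = s.str.len()                           # character counts (NaN for missing)
has_a   = s.str.contains("a")                   # case-sensitive substring test
swapped = s.str.replace("a", "@", regex=False)  # literal substitution
print(swapped.tolist())
```

All of these vectorize over the whole Series, so there's no need for an explicit Python loop.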

🏷️ Working with Categorical Data

Pandas can include categorical data in a DataFrame:

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
print("Original DataFrame:")
df

Converting to Categorical Type

df["grade"] = df["raw_grade"].astype("category")
print("Categorical column:")
df["grade"]

Renaming Categories

new_categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.rename_categories(new_categories)
print("After renaming:")
df["grade"]

Reordering Categories

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
print("After reordering:")
df["grade"]

Sorting by Category Order

print("Sorted by category order:")
df.sort_values(by="grade")

Grouping by Categorical Column

# Grouping with observed=False shows all categories including empty ones
df.groupby("grade", observed=False).size()