Feature Types & Encoding#
Teaching Your Model to Speak Business
Welcome to the chapter where we teach your model to understand human language — or at least pretend to.
💬 “Your model doesn’t know what ‘East’ means. But give it a 0 or 1, and suddenly it’s Einstein.”
🧠 Why Encoding Matters#
Machine Learning models are like accountants — they only understand numbers.
So when you show them text like “Product Type = Luxury”, they panic.
Our job: translate business data into machine-friendly numbers without losing meaning.
🧱 Step 1. Identify Feature Types#
Before encoding, we need to know what kind of data we’re dealing with.
| Type | Example | Encoding Method |
|---|---|---|
| Numeric | Sales = 1200 | None (already numeric) |
| Categorical | Region = East, West, North | One-Hot or Label |
| Ordinal | Size = Small, Medium, Large | Ordinal Encoding |
| Boolean | IsActive = True/False | Binary (0/1) |
| Date/Time | 2025-10-23 | Date Components |
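Before reaching for any encoder, let pandas tell you what it sees. A minimal sketch (toy data, illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [1200, 800, 900],
    'region': ['East', 'West', 'East'],
    'is_active': [True, False, True],
    'date': pd.to_datetime(['2025-01-01', '2025-02-01', '2025-03-01'])
})

print(df.dtypes)                                             # dtype of every column
print(df.select_dtypes(include='number').columns.tolist())   # numeric features
print(df.select_dtypes(include='object').columns.tolist())   # string features
```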
🧩 Step 2. Dealing with Categorical Variables#
Let’s start with the fun ones — the regions, products, and departments that give your boss headaches.
```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'North', 'South'],
    'sales': [1200, 800, 900, 950, 1100]
})
df
```
🧃 One-Hot Encoding — “Everyone Gets a Column!”#
```python
# dtype=int gives 1/0 columns (newer pandas defaults to True/False)
encoded_df = pd.get_dummies(df, columns=['region'], dtype=int)
encoded_df
```
| sales | region_East | region_North | region_South | region_West |
|---|---|---|---|---|
| 1200 | 1 | 0 | 0 | 0 |
| 800 | 0 | 0 | 0 | 1 |
| 900 | 1 | 0 | 0 | 0 |
| 950 | 0 | 1 | 0 | 0 |
| 1100 | 0 | 0 | 1 | 0 |

(Note that `get_dummies` sorts the new columns alphabetically and keeps the untouched `sales` column.)
💬 “It’s like handing out badges — everyone belongs somewhere, and nobody gets left out.”
Pros: simple, no fake ordering, works with most models.
Cons: high-cardinality features can explode into hundreds of columns; keeping every dummy also adds a redundant column (the classic dummy variable trap), which hurts linear models in particular.
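If the redundancy bothers you (it should, for linear models), `drop_first=True` keeps one fewer dummy per feature. A minimal sketch:

```python
# Dropping the first dummy per feature removes the redundant column
# that causes the dummy variable trap (perfect multicollinearity)
encoded_df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
encoded_df
```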
🧃 Label Encoding — “Rank ‘Em All!”#
When the category count is huge, use label encoding — assign numbers instead of new columns.
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['region_encoded'] = le.fit_transform(df['region'])
df
```
| region | region_encoded |
|---|---|
| East | 0 |
| West | 3 |
| East | 0 |
| North | 1 |
| South | 2 |

(`LabelEncoder` assigns codes alphabetically: East=0, North=1, South=2, West=3.)
💬 “The model doesn’t care that ‘West’ is 3 — but you should remember it means nothing about order.”
Pros: compact, one column no matter how many categories.
Cons: models may read the codes as ordered (3 > 0), which means nothing for nominal categories.
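One safeguard: keep the mapping around so the codes stay interpretable. A short sketch using the `le` fitted above:

```python
# The fitted encoder remembers the label-to-code mapping
print(dict(zip(le.classes_, range(len(le.classes_)))))  # {'East': 0, 'North': 1, ...}
print(le.inverse_transform([0, 3]))                     # ['East' 'West']
```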
🎯 Step 3. Ordinal Encoding — “Respect the Order”#
Sometimes, categories do have order:
Small < Medium < Large
```python
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
df = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Small']})
df['size_encoded'] = df['size'].map(size_order)
df
```
💬 “Now your model knows that ‘Large’ really is larger — not just alphabetically ahead.”
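If you prefer to stay inside scikit-learn pipelines, `OrdinalEncoder` does the same job with an explicit category order. A minimal sketch (note it starts counting at 0, unlike the manual map above):

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small=0, Medium=1, Large=2
enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['size_encoded_sk'] = enc.fit_transform(df[['size']]).ravel()
df
```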
⚡ Step 4. Boolean Encoding — “Yes or No, Baby”#
Booleans are easy. True = 1, False = 0. Sometimes it’s literally that simple (enjoy it while it lasts).
```python
df = pd.DataFrame({'is_active': [True, False, True]})
df['is_active_encoded'] = df['is_active'].astype(int)
df
```
| is_active | is_active_encoded |
|---|---|
| True | 1 |
| False | 0 |
| True | 1 |
💬 “Binary encoding — because your model doesn’t do maybe.”
📅 Step 5. Date Features — “Teaching Models About Time”#
Dates aren’t just strings — they’re treasure chests of useful signals.
```python
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-06-15', '2025-01-01'])})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df
```
💡 Pro Tip: Always extract year, month, and weekday — business patterns love calendars.
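A few more calendar signals tend to pay off in business data. A sketch (illustrative; pick what your use case needs):

```python
# Extra date components that often correlate with business activity
df['quarter'] = df['date'].dt.quarter                          # 1-4
df['weekday_num'] = df['date'].dt.dayofweek                    # Monday=0 ... Sunday=6
df['is_weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)  # Saturday/Sunday flag
df
```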
🧪 Practice Challenge — “Business Translator”#
Given this DataFrame:
```python
data = {
    'region': ['North', 'East', 'South', 'West'],
    'customer_tier': ['Gold', 'Silver', 'Platinum', 'Gold'],
    'is_active': [True, False, True, True]
}
df = pd.DataFrame(data)
```
Perform:
1. One-hot encode `region`
2. Ordinal encode `customer_tier` (Silver=1, Gold=2, Platinum=3)
3. Convert `is_active` to integers
4. Combine everything into one clean, ML-ready dataset
✅ Bonus: Count how many features you just created.
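Try it yourself first. One possible solution sketch:

```python
import pandas as pd

data = {
    'region': ['North', 'East', 'South', 'West'],
    'customer_tier': ['Gold', 'Silver', 'Platinum', 'Gold'],
    'is_active': [True, False, True, True]
}
df = pd.DataFrame(data)

df = pd.get_dummies(df, columns=['region'], dtype=int)           # 1. one-hot
df['customer_tier'] = df['customer_tier'].map(
    {'Silver': 1, 'Gold': 2, 'Platinum': 3})                     # 2. ordinal
df['is_active'] = df['is_active'].astype(int)                    # 3. binary

print(df)                        # 4. one clean, ML-ready dataset
print(df.shape[1], "features")   # bonus: count them
```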
💬 “If your dataframe just got wider — congrats, your model just learned a new language.”
🧭 Recap#
| Feature Type | Encoding Method | Example |
|---|---|---|
| Categorical | One-Hot / Label | Region |
| Ordinal | Ordinal Map | Customer Tier |
| Boolean | Binary | IsActive |
| Date | Component Split | Year, Month, Weekday |
🎯 You’ve now taught your model to read business spreadsheets — without complaining about merged cells.
🔜 Next Stop#
👉 Head to Exploratory Data Analysis (EDA) where we’ll turn data into visual stories — spotting patterns, trends, and those delightful surprises that make dashboards go “Ooooh!” 📊
Memory Optimization Techniques#
🎯 Learning Objectives#
By the end of this section, students should be able to:
Understand how pandas stores data internally.
Measure the memory footprint of a DataFrame.
Optimize memory usage using categorical variables and numeric downcasting.
Compare memory usage before and after optimization.
1. Why Memory Optimization Matters#
When working with large datasets, pandas can quickly consume gigabytes of memory because:
It uses NumPy arrays internally (which have fixed data types).
Columns with mixed data types or unnecessary precision waste memory.
Objects and strings are stored inefficiently.
Goal: Reduce memory usage without losing accuracy or data integrity.
2. Checking Memory Usage#
You can view the memory usage of a DataFrame using:
```python
import pandas as pd

df = pd.read_csv("sales_data.csv")
df.info(memory_usage="deep")
```
- The `memory_usage="deep"` flag gives an accurate estimate by including string/object columns.
- The result tells you how much memory each column consumes.
Example:#
print("Memory usage before optimization: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))
3. Changing Data Types#
Many columns use unnecessarily large data types. For example:
- Integers (`int64`) can often be stored as `int8`, `int16`, or `int32`.
- Floats (`float64`) can often be stored as `float32`.
Example:#
```python
df['quantity'] = df['quantity'].astype('int16')
df['price'] = df['price'].astype('float32')
```
You can check the new memory usage:
print("Memory usage after type conversion: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))
4. Using Categorical Data Types#
Columns with repeated string values (like “City”, “Category”, or “Department”) can be converted to the category type.
This stores unique labels once and uses integer codes internally — drastically reducing memory usage.
Example:#
```python
df['city'] = df['city'].astype('category')
df['product_category'] = df['product_category'].astype('category')
```

You can verify:

```python
df['city'].memory_usage(deep=True)
```
Benefits:#
Less memory.
Faster filtering and grouping.
Clearer metadata for columns with limited categories.
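To make the saving concrete, here is a quick synthetic comparison (three distinct city names across roughly a million rows):

```python
import pandas as pd

s = pd.Series(['London', 'Paris', 'Tokyo'] * 333_333)

as_object = s.memory_usage(deep=True) / 1024**2
as_category = s.astype('category').memory_usage(deep=True) / 1024**2
print(f"object:   {as_object:.2f} MB")
print(f"category: {as_category:.2f} MB")
```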
5. Downcasting Numeric Columns#
Instead of manually converting data types, you can automatically downcast:
```python
df['sales'] = pd.to_numeric(df['sales'], downcast='float')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
```
This tells pandas to store the smallest possible numeric type that still fits all values.
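For instance, on a tiny series:

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype='int64')
print(pd.to_numeric(s, downcast='integer').dtype)  # int8, the smallest type that fits
```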
6. Measuring Improvement#
Let’s compare memory usage before and after optimization:
```python
before = df.memory_usage(deep=True).sum() / 1024**2

# Optimization
df['city'] = df['city'].astype('category')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
df['sales'] = pd.to_numeric(df['sales'], downcast='float')

after = df.memory_usage(deep=True).sum() / 1024**2

print(f"Memory before: {before:.2f} MB")
print(f"Memory after: {after:.2f} MB")
print(f"Reduction: {100 * (before - after) / before:.2f}%")
```
7. General Optimization Function#
You can create a utility function to optimize all columns automatically:
```python
def optimize_dataframe(df):
    """Downcast numeric columns and convert low-cardinality strings to category."""
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Initial memory usage: {start_mem:.2f} MB")

    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object':
            # Convert to category only when values repeat a lot
            num_unique = df[col].nunique()
            num_total = len(df[col])
            if num_unique / num_total < 0.5:
                df[col] = df[col].astype('category')
        elif 'int' in str(col_type):
            # Smallest integer type that still fits all values
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif 'float' in str(col_type):
            # float32 where the values allow it
            df[col] = pd.to_numeric(df[col], downcast='float')

    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Optimized memory usage: {end_mem:.2f} MB")
    print(f"Reduced by {(100 * (start_mem - end_mem) / start_mem):.1f}%")
    return df
```
Usage:
```python
df = optimize_dataframe(df)
```
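Even better, declare compact dtypes at load time so the oversized defaults never hit memory. A sketch, assuming the column names used in this section:

```python
# dtypes applied while parsing, so the int64/object versions are never built
df = pd.read_csv(
    "sales_data.csv",
    dtype={'city': 'category', 'quantity': 'int16', 'sales': 'float32'}
)
```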
8. Key Takeaways#
| Technique | Description | Benefit |
|---|---|---|
| Change data types | Convert large integer/float types to smaller ones | Reduces memory |
| Categorical type | Replace repeated strings with integer codes | Huge savings for string-heavy data |
| Downcast numerics | Auto-reduce precision where possible | Saves space, keeps accuracy |
| Measure before/after | Always check memory savings | Avoids unexpected precision loss |
9. Summary#
Memory optimization ensures:
Faster computations.
Less RAM usage.
Ability to handle larger datasets smoothly.
In large-scale business analytics, these techniques make the difference between a slow, crashing notebook and a fast, scalable analysis pipeline.