Feature Types & Encoding#
Teaching Your Model to Speak Business
Welcome to the chapter where we teach your model to understand human language — or at least pretend to.
💬 “Your model doesn’t know what ‘East’ means. But give it a 0 or 1, and suddenly it’s Einstein.”
🧠 Why Encoding Matters#
Machine Learning models are like accountants — they only understand numbers.
So when you show them text like “Product Type = Luxury”, they panic.
Our job: translate business data into machine-friendly numbers without losing meaning.
🧱 Step 1. Identify Feature Types#
Before encoding, we need to know what kind of data we’re dealing with.
| Type | Example | Encoding Method |
|---|---|---|
| Numeric | Sales = 1200 | None (already numeric) |
| Categorical | Region = East, West, North | One-Hot or Label |
| Ordinal | Size = Small, Medium, Large | Ordinal Encoding |
| Boolean | IsActive = True/False | Binary (0/1) |
| Date/Time | 2025-10-23 | Date Components |
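Before reaching for any encoder, let pandas tell you what it sees. A minimal sketch (toy data, illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [1200, 800, 900],
    'region': ['East', 'West', 'East'],
    'is_active': [True, False, True],
    'date': pd.to_datetime(['2025-01-01', '2025-02-01', '2025-03-01'])
})

print(df.dtypes)                                             # dtype of every column
print(df.select_dtypes(include='number').columns.tolist())   # numeric features
print(df.select_dtypes(include='object').columns.tolist())   # string features
```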
🧩 Step 2. Dealing with Categorical Variables#
Let’s start with the fun ones — the regions, products, and departments that give your boss headaches.
```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'North', 'South'],
    'sales': [1200, 800, 900, 950, 1100]
})
df
```
🧃 One-Hot Encoding — “Everyone Gets a Column!”#
```python
# dtype=int gives 1/0 columns (newer pandas defaults to True/False)
encoded_df = pd.get_dummies(df, columns=['region'], dtype=int)
encoded_df
```
| sales | region_East | region_North | region_South | region_West |
|---|---|---|---|---|
| 1200 | 1 | 0 | 0 | 0 |
| 800 | 0 | 0 | 0 | 1 |
| 900 | 1 | 0 | 0 | 0 |
| 950 | 0 | 1 | 0 | 0 |
| 1100 | 0 | 0 | 1 | 0 |

(Note that `get_dummies` sorts the new columns alphabetically and keeps the untouched `sales` column.)
💬 “It’s like handing out badges — everyone belongs somewhere, and nobody gets left out.”
Pros: simple, no fake ordering, works with most models.
Cons: high-cardinality features can explode into hundreds of columns; keeping every dummy also adds a redundant column (the classic dummy variable trap), which hurts linear models in particular.
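If the redundancy bothers you (it should, for linear models), `drop_first=True` keeps one fewer dummy per feature. A minimal sketch:

```python
# Dropping the first dummy per feature removes the redundant column
# that causes the dummy variable trap (perfect multicollinearity)
encoded_df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
encoded_df
```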
🧃 Label Encoding — “Rank ‘Em All!”#
When the category count is huge, use label encoding — assign numbers instead of new columns.
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['region_encoded'] = le.fit_transform(df['region'])
df
```
| region | region_encoded |
|---|---|
| East | 0 |
| West | 3 |
| East | 0 |
| North | 1 |
| South | 2 |

(`LabelEncoder` assigns codes alphabetically: East=0, North=1, South=2, West=3.)
💬 “The model doesn’t care that ‘West’ is 3 — but you should remember it means nothing about order.”
Pros: compact, one column no matter how many categories.
Cons: models may read the codes as ordered (3 > 0), which means nothing for nominal categories.
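One safeguard: keep the mapping around so the codes stay interpretable. A short sketch using the `le` fitted above:

```python
# The fitted encoder remembers the label-to-code mapping
print(dict(zip(le.classes_, range(len(le.classes_)))))  # {'East': 0, 'North': 1, ...}
print(le.inverse_transform([0, 3]))                     # ['East' 'West']
```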
🎯 Step 3. Ordinal Encoding — “Respect the Order”#
Sometimes, categories do have order:
Small < Medium < Large
```python
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
df = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Small']})
df['size_encoded'] = df['size'].map(size_order)
df
```
💬 “Now your model knows that ‘Large’ really is larger — not just alphabetically ahead.”
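If you prefer to stay inside scikit-learn pipelines, `OrdinalEncoder` does the same job with an explicit category order. A minimal sketch (note it starts counting at 0, unlike the manual map above):

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small=0, Medium=1, Large=2
enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['size_encoded_sk'] = enc.fit_transform(df[['size']]).ravel()
df
```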
⚡ Step 4. Boolean Encoding — “Yes or No, Baby”#
Booleans are easy. True = 1, False = 0. Sometimes it’s literally that simple (enjoy it while it lasts).
```python
df = pd.DataFrame({'is_active': [True, False, True]})
df['is_active_encoded'] = df['is_active'].astype(int)
df
```
| is_active | is_active_encoded |
|---|---|
| True | 1 |
| False | 0 |
| True | 1 |
💬 “Binary encoding — because your model doesn’t do maybe.”
📅 Step 5. Date Features — “Teaching Models About Time”#
Dates aren’t just strings — they’re treasure chests of useful signals.
```python
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-06-15', '2025-01-01'])})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df
```
💡 Pro Tip: Always extract year, month, and weekday — business patterns love calendars.
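A few more calendar signals tend to pay off in business data. A sketch (illustrative; pick what your use case needs):

```python
# Extra date components that often correlate with business activity
df['quarter'] = df['date'].dt.quarter                          # 1-4
df['weekday_num'] = df['date'].dt.dayofweek                    # Monday=0 ... Sunday=6
df['is_weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)  # Saturday/Sunday flag
df
```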
🧪 Practice Challenge — “Business Translator”#
Given this DataFrame:
```python
data = {
    'region': ['North', 'East', 'South', 'West'],
    'customer_tier': ['Gold', 'Silver', 'Platinum', 'Gold'],
    'is_active': [True, False, True, True]
}
df = pd.DataFrame(data)
```
Perform:
1. One-hot encode `region`
2. Ordinal encode `customer_tier` (Silver=1, Gold=2, Platinum=3)
3. Convert `is_active` to integers
4. Combine everything into one clean, ML-ready dataset
✅ Bonus: Count how many features you just created.
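Try it yourself first. One possible solution sketch:

```python
import pandas as pd

data = {
    'region': ['North', 'East', 'South', 'West'],
    'customer_tier': ['Gold', 'Silver', 'Platinum', 'Gold'],
    'is_active': [True, False, True, True]
}
df = pd.DataFrame(data)

df = pd.get_dummies(df, columns=['region'], dtype=int)           # 1. one-hot
df['customer_tier'] = df['customer_tier'].map(
    {'Silver': 1, 'Gold': 2, 'Platinum': 3})                     # 2. ordinal
df['is_active'] = df['is_active'].astype(int)                    # 3. binary

print(df)                        # 4. one clean, ML-ready dataset
print(df.shape[1], "features")   # bonus: count them
```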
💬 “If your dataframe just got wider — congrats, your model just learned a new language.”
🧭 Recap#
| Feature Type | Encoding Method | Example |
|---|---|---|
| Categorical | One-Hot / Label | Region |
| Ordinal | Ordinal Map | Customer Tier |
| Boolean | Binary | IsActive |
| Date | Component Split | Year, Month, Weekday |
🎯 You’ve now taught your model to read business spreadsheets — without complaining about merged cells.
🔜 Next Stop#
👉 Head to Exploratory Data Analysis (EDA) where we’ll turn data into visual stories — spotting patterns, trends, and those delightful surprises that make dashboards go “Ooooh!” 📊
Memory Optimization Techniques#
🎯 Learning Objectives#
By the end of this section, students should be able to:
Understand how pandas stores data internally.
Measure the memory footprint of a DataFrame.
Optimize memory usage using categorical variables and numeric downcasting.
Compare memory usage before and after optimization.
1. Why Memory Optimization Matters#
When working with large datasets, pandas can quickly consume gigabytes of memory because:
It uses NumPy arrays internally (which have fixed data types).
Columns with mixed data types or unnecessary precision waste memory.
Objects and strings are stored inefficiently.
Goal: Reduce memory usage without losing accuracy or data integrity.
2. Checking Memory Usage#
You can view the memory usage of a DataFrame using:
```python
import pandas as pd

df = pd.read_csv("sales_data.csv")
df.info(memory_usage="deep")
```
- The `memory_usage="deep"` flag gives an accurate estimate by including string/object columns.
- The result tells you how much memory each column consumes.
Example:#
print("Memory usage before optimization: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))
3. Changing Data Types#
Many columns use unnecessarily large data types. For example:
- Integers (`int64`) can often be stored as `int8`, `int16`, or `int32`.
- Floats (`float64`) can often be stored as `float32`.
Example:#
```python
df['quantity'] = df['quantity'].astype('int16')
df['price'] = df['price'].astype('float32')
```
You can check the new memory usage:
print("Memory usage after type conversion: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))
4. Using Categorical Data Types#
Columns with repeated string values (like “City”, “Category”, or “Department”) can be converted to the category type.
This stores unique labels once and uses integer codes internally — drastically reducing memory usage.
Example:#
```python
df['city'] = df['city'].astype('category')
df['product_category'] = df['product_category'].astype('category')
```

You can verify:

```python
df['city'].memory_usage(deep=True)
```
Benefits:#
Less memory.
Faster filtering and grouping.
Clearer metadata for columns with limited categories.
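To make the saving concrete, here is a quick synthetic comparison (three distinct city names across roughly a million rows):

```python
import pandas as pd

s = pd.Series(['London', 'Paris', 'Tokyo'] * 333_333)

as_object = s.memory_usage(deep=True) / 1024**2
as_category = s.astype('category').memory_usage(deep=True) / 1024**2
print(f"object:   {as_object:.2f} MB")
print(f"category: {as_category:.2f} MB")
```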
5. Downcasting Numeric Columns#
Instead of manually converting data types, you can automatically downcast:
```python
df['sales'] = pd.to_numeric(df['sales'], downcast='float')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
```
This tells pandas to store the smallest possible numeric type that still fits all values.
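For instance, on a tiny series:

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype='int64')
print(pd.to_numeric(s, downcast='integer').dtype)  # int8, the smallest type that fits
```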
6. Measuring Improvement#
Let’s compare memory usage before and after optimization:
```python
before = df.memory_usage(deep=True).sum() / 1024**2

# Optimization
df['city'] = df['city'].astype('category')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
df['sales'] = pd.to_numeric(df['sales'], downcast='float')

after = df.memory_usage(deep=True).sum() / 1024**2

print(f"Memory before: {before:.2f} MB")
print(f"Memory after: {after:.2f} MB")
print(f"Reduction: {100 * (before - after) / before:.2f}%")
```
7. General Optimization Function#
You can create a utility function to optimize all columns automatically:
```python
def optimize_dataframe(df):
    """Downcast numeric columns and convert low-cardinality strings to category."""
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Initial memory usage: {start_mem:.2f} MB")

    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object':
            # Convert to category only when values repeat a lot
            num_unique = df[col].nunique()
            num_total = len(df[col])
            if num_unique / num_total < 0.5:
                df[col] = df[col].astype('category')
        elif 'int' in str(col_type):
            # Smallest integer type that still fits all values
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif 'float' in str(col_type):
            # float32 where the values allow it
            df[col] = pd.to_numeric(df[col], downcast='float')

    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Optimized memory usage: {end_mem:.2f} MB")
    print(f"Reduced by {(100 * (start_mem - end_mem) / start_mem):.1f}%")
    return df
```
Usage:
```python
df = optimize_dataframe(df)
```
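Even better, declare compact dtypes at load time so the oversized defaults never hit memory. A sketch, assuming the column names used in this section:

```python
# dtypes applied while parsing, so the int64/object versions are never built
df = pd.read_csv(
    "sales_data.csv",
    dtype={'city': 'category', 'quantity': 'int16', 'sales': 'float32'}
)
```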
8. Key Takeaways#
| Technique | Description | Benefit |
|---|---|---|
| Change data types | Convert large integer/float types to smaller ones | Reduces memory |
| Categorical type | Replace repeated strings with integer codes | Huge savings for string-heavy data |
| Downcast numerics | Auto-reduce precision where possible | Saves space, keeps accuracy |
| Measure before/after | Always check memory savings | Avoids unexpected precision loss |
9. Summary#
Memory optimization ensures:
Faster computations.
Less RAM usage.
Ability to handle larger datasets smoothly.
In large-scale business analytics, these techniques make the difference between a slow, crashing notebook and a fast, scalable analysis pipeline.