Feature Types & Encoding#

Teaching Your Model to Speak Business

Welcome to the chapter where we teach your model to understand human language — or at least pretend to.

💬 “Your model doesn’t know what ‘East’ means. But give it a 0 or 1, and suddenly it’s Einstein.”


🧠 Why Encoding Matters#

Machine Learning models are like accountants — they only understand numbers. So when you show them text like “Product Type = Luxury”, they panic.

Our job: translate business data into machine-friendly numbers without losing meaning.


🧱 Step 1. Identify Feature Types#

Before encoding, we need to know what kind of data we’re dealing with.

| Type | Example | Encoding Method |
|---|---|---|
| Numeric | Sales = 1200 | None (already fine) |
| Categorical | Region = East, West, North | One-Hot or Label |
| Ordinal | Size = Small, Medium, Large | Ordinal Encoding |
| Boolean | IsActive = True/False | Binary (0/1) |
| Date/Time | 2025-10-23 | Date Components |


🧩 Step 2. Dealing with Categorical Variables#

Let’s start with the fun ones — the regions, products, and departments that give your boss headaches.

import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'North', 'South'],
    'sales': [1200, 800, 900, 950, 1100]
})
df

🧃 One-Hot Encoding — “Everyone Gets a Column!”#

encoded_df = pd.get_dummies(df, columns=['region'])
encoded_df

| region_East | region_West | region_North | region_South |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |

💬 “It’s like handing out badges — everyone belongs somewhere, and nobody gets left out.”

Pros: Simple, keeps every category distinct without implying any order, and works with most models. Cons: High-cardinality features can explode into hundreds of columns, and keeping every dummy column adds perfectly redundant features (the dummy variable trap), which matters for linear models.
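If the redundancy worries you, get_dummies can drop one dummy per feature so the remaining columns still fully describe the category. A quick sketch, reusing the df from above:

# drop_first=True removes one dummy per feature; the dropped category becomes the implicit baseline
encoded_df = pd.get_dummies(df, columns=['region'], drop_first=True)
encoded_df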


🧃 Label Encoding — “Rank ‘Em All!”#

When the category count is huge, use label encoding — assign numbers instead of new columns.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['region_encoded'] = le.fit_transform(df['region'])
df

| region | region_encoded |
|---|---|
| East | 0 |
| West | 3 |
| East | 0 |
| North | 1 |
| South | 2 |

(LabelEncoder assigns codes in alphabetical order of the labels: East=0, North=1, South=2, West=3.)

💬 “The model doesn’t care that ‘West’ is 3 — but you should remember it means nothing about order.”

Pros: Compact, adds only a single column. Cons: Models may treat the codes as ordered (3 > 0), which is meaningless for nominal categories, so prefer one-hot for linear models.
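If you ever need to check which number went to which region, the fitted encoder keeps the mapping in its classes_ attribute. A quick sketch using the le and df from above:

# The position of each label in classes_ is its assigned code (alphabetical order)
print(le.classes_)                    # ['East' 'North' 'South' 'West']
print(le.transform(['West']))         # [3]
print(le.inverse_transform([0, 3]))   # ['East' 'West']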


🎯 Step 3. Ordinal Encoding — “Respect the Order”#

Sometimes, categories do have order: Small < Medium < Large

size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
df = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Small']})
df['size_encoded'] = df['size'].map(size_order)
df

💬 “Now your model knows that ‘Large’ really is larger — not just alphabetically ahead.”
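If you are building scikit-learn pipelines, OrdinalEncoder can do the same thing with the order spelled out explicitly. A minimal sketch reusing the size DataFrame (the size_encoded_sklearn column name is just for illustration):

from sklearn.preprocessing import OrdinalEncoder

# Spell out the categories in their meaningful order so the codes really mean Small < Medium < Large
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['size_encoded_sklearn'] = encoder.fit_transform(df[['size']]).ravel()
df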


⚡ Step 4. Boolean Encoding — “Yes or No, Baby”#

Booleans are easy. True = 1, False = 0. Sometimes it’s literally that simple (enjoy it while it lasts).

df = pd.DataFrame({'is_active': [True, False, True]})
df['is_active_encoded'] = df['is_active'].astype(int)
df

| is_active | is_active_encoded |
|---|---|
| True | 1 |
| False | 0 |
| True | 1 |

💬 “Binary encoding — because your model doesn’t do maybe.”


📅 Step 5. Date Features — “Teaching Models About Time”#

Dates aren’t just strings — they’re treasure chests of useful signals.

df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-06-15', '2025-01-01'])})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df

💡 Pro Tip: Always extract year, month, and weekday — business patterns love calendars.
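A few more components that often earn their keep, sketched here on the same df (the extra column names like is_weekend are illustrative choices, not a fixed recipe):

# Numeric weekday (0 = Monday) is usually friendlier to models than the name string
df['day_of_week_num'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)
df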


🧪 Practice Challenge — “Business Translator”#

Given this DataFrame:

data = {
    'region': ['North', 'East', 'South', 'West'],
    'customer_tier': ['Gold', 'Silver', 'Platinum', 'Gold'],
    'is_active': [True, False, True, True]
}
df = pd.DataFrame(data)

Perform:

  1. One-hot encode region

  2. Ordinal encode customer_tier (Silver=1, Gold=2, Platinum=3)

  3. Convert is_active to integers

  4. Combine into one clean ML-ready dataset

✅ Bonus: Count how many features you just created.
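Try it yourself first; if you get stuck, here is one possible solution sketch (the tier_order mapping follows the hint above):

import pandas as pd

data = {
    'region': ['North', 'East', 'South', 'West'],
    'customer_tier': ['Gold', 'Silver', 'Platinum', 'Gold'],
    'is_active': [True, False, True, True]
}
df = pd.DataFrame(data)

# 1. One-hot encode region (dtype=int keeps the dummies as 0/1)
df = pd.get_dummies(df, columns=['region'], dtype=int)

# 2. Ordinal encode customer_tier
tier_order = {'Silver': 1, 'Gold': 2, 'Platinum': 3}
df['customer_tier'] = df['customer_tier'].map(tier_order)

# 3. Convert is_active to integers
df['is_active'] = df['is_active'].astype(int)

# 4. One clean, ML-ready DataFrame
print(df)
print("Feature count:", df.shape[1])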

💬 “If your dataframe just got wider — congrats, your model just learned a new language.”


🧭 Recap#

| Feature Type | Encoding Method | Example |
|---|---|---|
| Categorical | One-Hot / Label | Region |
| Ordinal | Ordinal Map | Customer Tier |
| Boolean | Binary | IsActive |
| Date | Component Split | Year, Month, Weekday |

🎯 You’ve now taught your model to read business spreadsheets — without complaining about merged cells.


🔜 Next Stop#

👉 Head to Exploratory Data Analysis (EDA) where we’ll turn data into visual stories — spotting patterns, trends, and those delightful surprises that make dashboards go “Ooooh!” 📊



Memory Optimization Techniques#

🎯 Learning Objectives#

By the end of this section, students should be able to:

  • Understand how pandas stores data internally.

  • Measure the memory footprint of a DataFrame.

  • Optimize memory usage using categorical variables and numeric downcasting.

  • Compare memory usage before and after optimization.


1. Why Memory Optimization Matters#

When working with large datasets, pandas can quickly consume gigabytes of memory because:

  • It uses NumPy arrays internally (which have fixed data types).

  • Columns with mixed data types or unnecessary precision waste memory.

  • Objects and strings are stored inefficiently.

Goal: Reduce memory usage without losing accuracy or data integrity.
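To see why data types matter so much, compare how many bytes a single value needs in each dtype (a tiny illustrative check):

import numpy as np

# Bytes needed to store ONE value of each dtype
for dtype in ['int8', 'int16', 'int32', 'int64', 'float32', 'float64']:
    print(dtype, '->', np.dtype(dtype).itemsize, 'bytes per value')

# Python string objects in an 'object' column are far heavier:
# roughly 50+ bytes each in CPython, plus an 8-byte pointer per row.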


2. Checking Memory Usage#

You can view the memory usage of a DataFrame using:

import pandas as pd

df = pd.read_csv("sales_data.csv")
df.info(memory_usage="deep")
  • The memory_usage='deep' flag provides an accurate estimate by including string/object columns.

  • The result tells you how much memory each column consumes.

Example:#

print("Memory usage before optimization: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))

3. Changing Data Types#

Many columns use unnecessarily large data types. For example:

  • Integers (int64) can often be stored as int8, int16, or int32.

  • Floats (float64) can often be stored as float32.

Example:#

df['quantity'] = df['quantity'].astype('int16')
df['price'] = df['price'].astype('float32')

You can check the new memory usage:

print("Memory usage after type conversion: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))

4. Using Categorical Data Types#

Columns with repeated string values (like “City”, “Category”, or “Department”) can be converted to the category type.

This stores unique labels once and uses integer codes internally — drastically reducing memory usage.

Example:#

df['city'] = df['city'].astype('category')
df['product_category'] = df['product_category'].astype('category')

You can verify:

df['city'].memory_usage(deep=True)
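Under the hood, a category column stores the unique labels once and keeps a small integer code per row; you can peek at both pieces (assuming the city column converted above):

# Unique labels are stored only once...
print(df['city'].cat.categories)

# ...and each row just holds a small integer code pointing into that list
print(df['city'].cat.codes.head())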

Benefits:#

  • Less memory.

  • Faster filtering and grouping.

  • Clearer metadata for columns with limited categories.


5. Downcasting Numeric Columns#

Instead of manually converting data types, you can automatically downcast:

df['sales'] = pd.to_numeric(df['sales'], downcast='float')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')

This tells pandas to store the smallest possible numeric type that still fits all values.
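As a tiny illustration with made-up numbers, pandas picks the narrowest type that still fits every value:

s = pd.Series([1, 2, 300])                          # defaults to int64
print(pd.to_numeric(s, downcast='integer').dtype)   # int16, since 300 doesn't fit in int8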


6. Measuring Improvement#

Let’s compare memory usage before and after optimization:

before = df.memory_usage(deep=True).sum() / 1024**2

# Optimization
df['city'] = df['city'].astype('category')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
df['sales'] = pd.to_numeric(df['sales'], downcast='float')

after = df.memory_usage(deep=True).sum() / 1024**2

print(f"Memory before: {before:.2f} MB")
print(f"Memory after: {after:.2f} MB")
print(f"Reduction: {100 * (before - after) / before:.2f}%")

7. General Optimization Function#

You can create a utility function to optimize all columns automatically:

def optimize_dataframe(df):
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Initial memory usage: {start_mem:.2f} MB")

    for col in df.columns:
        col_type = df[col].dtype

        if col_type == 'object':
            # Convert string columns to category only when values repeat a lot;
            # otherwise the categorical overhead can outweigh the savings
            num_unique = df[col].nunique()
            num_total = len(df[col])
            if num_unique / num_total < 0.5:
                df[col] = df[col].astype('category')

        elif 'int' in str(col_type):
            # Shrink integers to the smallest type that still fits all values
            df[col] = pd.to_numeric(df[col], downcast='integer')

        elif 'float' in str(col_type):
            # Shrink floats to float32 where the values allow it
            df[col] = pd.to_numeric(df[col], downcast='float')

    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Optimized memory usage: {end_mem:.2f} MB")
    print(f"Reduced by {(100 * (start_mem - end_mem) / start_mem):.1f}%")

    return df

Usage:

df = optimize_dataframe(df)
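To see the function in action, you can run it on a small synthetic DataFrame (the data below is entirely made up, just to trigger each branch of the function):

import numpy as np

demo = pd.DataFrame({
    'city': np.random.choice(['Pune', 'Mumbai', 'Delhi'], size=100_000),   # repeated strings
    'quantity': np.random.randint(1, 50, size=100_000),                    # small integers
    'sales': np.random.rand(100_000) * 1000,                               # float64 by default
})

demo = optimize_dataframe(demo)
print(demo.dtypes)   # expect: category, int8, float32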

8. Key Takeaways#

| Technique | Description | Benefit |
|---|---|---|
| Change data types | Convert large integer/float types to smaller ones | Reduces memory |
| Categorical type | Replace repeated strings with integer codes | Huge savings for string-heavy data |
| Downcast numerics | Auto-reduce precision where possible | Saves space, keeps accuracy |
| Measure before/after | Always check memory savings | Avoids unexpected precision loss |


9. Summary#

Memory optimization ensures:

  • Faster computations.

  • Less RAM usage.

  • Ability to handle larger datasets smoothly.

In large-scale business analytics, these techniques make the difference between a slow, crashing notebook and a fast, scalable analysis pipeline.

