# Feature Types & Encoding



*Teaching Your Model to Speak Business*

Welcome to the chapter where we teach your model to **understand human language**, or at least *pretend to*.

> 💬 “Your model doesn’t know what ‘East’ means. But give it a 0 or 1, and suddenly it’s Einstein.”

---

## 🧠 Why Encoding Matters

Machine learning models are like accountants: they only understand **numbers**.
So when you show them text like `"Product Type = Luxury"`, they panic.

Our job: **translate business data into machine-friendly numbers** without losing meaning.

---

## 🧱 Step 1. Identify Feature Types

Before encoding, we need to know *what kind* of data we’re dealing with.

| Type        | Example                     | Encoding Method     |
| ----------- | --------------------------- | ------------------- |
| Numeric     | Sales = 1200                | None (already fine) |
| Categorical | Region = East, West, North  | One-Hot or Label    |
| Ordinal     | Size = Small, Medium, Large | Ordinal Encoding    |
| Boolean     | IsActive = True/False       | Binary (0/1)        |
| Date/Time   | 2025-10-23                  | Date Components     |
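A quick first guess at each column's type can come from its pandas dtype. The sketch below uses a small made-up DataFrame (the column names and values are illustrative assumptions, not a real dataset). Dtypes cannot tell a nominal category from an ordinal one, so that distinction still takes business knowledge.

```python
import pandas as pd

# Hypothetical sample data, one column per feature type from the table above
df = pd.DataFrame({
    'sales': [1200, 800, 950],
    'region': ['East', 'West', 'North'],
    'size': ['Small', 'Large', 'Medium'],
    'is_active': [True, False, True],
    'order_date': pd.to_datetime(['2025-10-23', '2025-10-24', '2025-10-25']),
})

numeric_cols = df.select_dtypes(include='number').columns.tolist()
boolean_cols = df.select_dtypes(include='bool').columns.tolist()
datetime_cols = df.select_dtypes(include='datetime').columns.tolist()

# Whatever is left is text: candidates for categorical or ordinal encoding
known = numeric_cols + boolean_cols + datetime_cols
text_cols = [col for col in df.columns if col not in known]

print(numeric_cols, boolean_cols, datetime_cols, text_cols)
```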

---

## 🧩 Step 2. Dealing with Categorical Variables

Let’s start with the fun ones: *the regions, products, and departments that give your boss headaches*.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'North', 'South'],
    'sales': [1200, 800, 900, 950, 1100]
})
df
```

---

### 🧃 One-Hot Encoding: “Everyone Gets a Column!”

```python
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
encoded_df = pd.get_dummies(df, columns=['region'], dtype=int)
encoded_df
```

| sales | region_East | region_North | region_South | region_West |
| ----- | ----------- | ------------ | ------------ | ----------- |
| 1200  | 1           | 0            | 0            | 0           |
| 800   | 0           | 0            | 0            | 1           |
| 900   | 1           | 0            | 0            | 0           |
| 950   | 0           | 1            | 0            | 0           |
| 1100  | 0           | 0            | 1            | 0           |

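> 💬 “It’s like handing out badges: everyone belongs somewhere, and nobody gets left out.”

**Pros:** Simple, and it imposes no false order on the categories.
**Cons:** Can explode into hundreds of columns when a feature has many categories. (The *dummy variable trap* is a separate issue: the full set of indicator columns is perfectly collinear, which matters for linear models.)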
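One common way to sidestep the collinearity among indicator columns is to drop one of them with `drop_first=True`. A minimal sketch, reusing the sample `region` data from above:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'North', 'South'],
    'sales': [1200, 800, 900, 950, 1100]
})

# drop_first=True drops the first category (alphabetically 'East'),
# leaving k-1 indicator columns for k categories
encoded_df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
print(encoded_df.columns.tolist())
```

Rows for the dropped category (`East` here) are represented by zeros in all remaining `region_*` columns, so no information is lost.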

---

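### 🧃 Label Encoding: “Rank ‘Em All!”

When the category count is huge, use **label encoding**, which assigns one integer per category instead of one column per category. Note that scikit-learn's `LabelEncoder` is designed for target labels; for input features, `OrdinalEncoder` does the same job on whole columns.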

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['region_encoded'] = le.fit_transform(df['region'])
df
```

| region | region_encoded |
| ------ | -------------- |
| East   | 0              |
| West   | 3              |
| East   | 0              |
| North  | 1              |
| South  | 2              |

> 💬 “The model doesn’t care that ‘West’ is 3, but you should remember it means *nothing about order*.”

**Pros:** Compact: one column no matter how many categories.
**Cons:** Models may treat the codes as ordered (3 > 0), which is meaningless for nominal categories.
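Because the codes are arbitrary, it helps to keep the mapping around. The sketch below shows one way to inspect it: `LabelEncoder` sorts the categories and stores them in `classes_`, so each label's position is its code.

```python
from sklearn.preprocessing import LabelEncoder

regions = ['East', 'West', 'East', 'North', 'South']

le = LabelEncoder()
codes = le.fit_transform(regions)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
restored = le.inverse_transform(codes).tolist()
print(restored)
```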

---

## 🎯 Step 3. Ordinal Encoding: “Respect the Order”

Sometimes, categories *do* have order:
`Small < Medium < Large`

```python
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
df = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Small']})
df['size_encoded'] = df['size'].map(size_order)
df
```

> 💬 “Now your model knows that ‘Large’ really is larger, not just alphabetically ahead.”
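As an alternative to a hand-written dictionary, pandas can carry the order itself through an ordered categorical dtype. This is a sketch of that approach, not the only way to do it; labels missing from `categories` would get the code `-1`.

```python
import pandas as pd

df = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Small']})

# Declare the order explicitly; codes follow the position in `categories`
size_dtype = pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
df['size'] = df['size'].astype(size_dtype)

# Codes start at 0, so add 1 to match the Small=1, Medium=2, Large=3 mapping
df['size_encoded'] = df['size'].cat.codes + 1

# Ordered categoricals also support comparisons against a label
bigger_than_small = (df['size'] > 'Small').tolist()
print(df['size_encoded'].tolist(), bigger_than_small)
```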

---

## ⚡ Step 4. Boolean Encoding: “Yes or No, Baby”

Booleans are easy. True = 1, False = 0.
Sometimes it’s literally that simple (enjoy it while it lasts).

```python
df = pd.DataFrame({'is_active': [True, False, True]})
df['is_active_encoded'] = df['is_active'].astype(int)
df
```

| is_active | is_active_encoded |
| --------- | ----------------- |
| True      | 1                 |
| False     | 0                 |
| True      | 1                 |

> 💬 “Binary encoding, because your model doesn’t do maybe.”
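Real data is not always that tidy: flags often arrive as text, sometimes with gaps. The sketch below assumes a hypothetical `subscribed` column holding `'Yes'`/`'No'` strings and one missing value, and uses the nullable `Int8` dtype so the gap stays visible instead of being silently turned into 0.

```python
import pandas as pd

# Hypothetical column where the flag arrives as text, with one missing value
df = pd.DataFrame({'subscribed': ['Yes', 'No', None, 'Yes']})

# Map text to 0/1; missing or unmapped values become NaN
mapped = df['subscribed'].map({'Yes': 1, 'No': 0})

# The nullable 'Int8' dtype keeps 0/1 as integers and the gap as <NA>
df['subscribed_encoded'] = mapped.astype('Int8')
print(df)
```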

---

## 📅 Step 5. Date Features: “Teaching Models About Time”

Dates aren’t just strings; they’re treasure chests of useful signals.

```python
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-06-15', '2025-01-01'])})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df
```

> 💡 Pro Tip: Always extract year, month, and weekday; business patterns love calendars.
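Note that `dt.day_name()` returns text such as `'Monday'`, which would itself need encoding before modeling. The sketch below shows a few numeric alternatives on the same dates; the new column names are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-06-15', '2025-01-01'])})

df['quarter'] = df['date'].dt.quarter
df['day_of_week_num'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
df['is_weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)
print(df)
```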

---

## 🧪 Practice Challenge: “Business Translator”

Given this DataFrame:

```python
data = {
    'region': ['North', 'East', 'South', 'West'],
    'customer_tier': ['Gold', 'Silver', 'Platinum', 'Gold'],
    'is_active': [True, False, True, True]
}
df = pd.DataFrame(data)
```

Perform:

1. One-hot encode `region`
2. Ordinal encode `customer_tier` (`Silver=1, Gold=2, Platinum=3`)
3. Convert `is_active` to integers
4. Combine into one clean ML-ready dataset

✅ Bonus: Count how many features you just created.

> 💬 “If your dataframe just got wider, congrats: your model just learned a new language.”

---

## 🧭 Recap

| Feature Type | Encoding Method | Example              |
| ------------ | --------------- | -------------------- |
| Categorical  | One-Hot / Label | Region               |
| Ordinal      | Ordinal Map     | Customer Tier        |
| Boolean      | Binary          | IsActive             |
| Date         | Component Split | Year, Month, Weekday |

> 🎯 You’ve now taught your model to read business spreadsheets, without complaining about merged cells.

---

## 🔜 Next Stop

👉 Head to **[Exploratory Data Analysis (EDA)](eda)**
where we’ll turn data into visual stories: spotting patterns, trends, and those delightful surprises that make dashboards go *“Ooooh!”* 📊



This section explains how to make pandas DataFrames more memory-efficient, which is crucial when handling large datasets.

---

# **Memory Optimization Techniques**

### 🎯 Learning Objectives

By the end of this section, students should be able to:

* Understand how pandas stores data internally.
* Measure the memory footprint of a DataFrame.
* Optimize memory usage using categorical variables and numeric downcasting.
* Compare memory usage before and after optimization.

---

## **1. Why Memory Optimization Matters**

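When working with large datasets, pandas can quickly consume gigabytes of memory because:

* It stores numeric columns as fixed-type NumPy arrays and defaults to 64-bit types (`int64`, `float64`), even when the values would fit in something smaller.
* Columns with mixed data types or unnecessary precision waste memory.
* Strings in `object` columns are stored as individual Python objects, each with its own overhead.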
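To make the cost of the 64-bit default concrete, here is a small sketch comparing the same 1,000 integers stored as `int64` and as `int16` (the values are made up for illustration):

```python
import pandas as pd

# 1,000 small integers stored with the 64-bit default vs. a 16-bit type
s64 = pd.Series(range(1000), dtype='int64')
s16 = s64.astype('int16')

bytes_64 = s64.memory_usage(index=False)  # 8 bytes per value
bytes_16 = s16.memory_usage(index=False)  # 2 bytes per value
print(bytes_64, bytes_16)
```

The values are identical in both Series; only the storage width changes.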

**Goal:**
Reduce memory usage **without losing accuracy** or data integrity.

---

## **2. Checking Memory Usage**

You can view the memory usage of a DataFrame using:

```python
import pandas as pd

df = pd.read_csv("sales_data.csv")
df.info(memory_usage="deep")
```

* `memory_usage="deep"` makes pandas inspect the actual contents of string/object columns instead of estimating, so the reported total is accurate.
* `df.info()` reports the total footprint; for a per-column breakdown, call `df.memory_usage(deep=True)`.

### Example:

```python
print("Memory usage before optimization: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))
```
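For a per-column view, `DataFrame.memory_usage(deep=True)` returns one value per column (plus the index), in bytes. The sketch below uses a small in-memory stand-in for `sales_data.csv`, with made-up values, so it runs without the file.

```python
import pandas as pd

# Small in-memory stand-in for sales_data.csv (made-up values)
df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Pune'] * 250,
    'quantity': [1, 2, 3, 4] * 250,
})

# deep=True measures the actual string contents; deep=False may undercount them
per_column = df.memory_usage(deep=True)
shallow = df.memory_usage(deep=False)

print(per_column)
print(per_column.sort_values(ascending=False).head())  # biggest columns first
```

Sorting the result is a quick way to find which columns are worth optimizing first.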

---

## **3. Changing Data Types**

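Many columns use unnecessarily large data types.
For example:

* Integers (`int64`) can often be stored as `int8`, `int16`, or `int32`, provided every value fits in the smaller type's range (for example, `int16` holds -32,768 to 32,767).
* Floats (`float64`) can often be stored as `float32`, which keeps roughly 7 significant digits instead of about 15.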

### Example:

```python
df['quantity'] = df['quantity'].astype('int16')
df['price'] = df['price'].astype('float32')
```

You can check the new memory usage:

```python
print("Memory usage after type conversion: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1024**2))
```
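Before converting with `astype`, it can be worth confirming that the values fit the target type, since out-of-range values may overflow rather than raise an error. The helper `fits_in` below is a hypothetical name introduced for this sketch, not a pandas function.

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Return True if every value of an integer Series lies within dtype's range."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

quantity = pd.Series([5, 120, 40_000], dtype='int64')  # made-up values

print(fits_in(quantity, 'int16'))  # 40,000 exceeds the int16 maximum of 32,767
print(fits_in(quantity, 'int32'))

# Only convert to the smaller type when the check passes
target = 'int16' if fits_in(quantity, 'int16') else 'int32'
quantity = quantity.astype(target)
```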

---

## **4. Using Categorical Data Types**

Columns with **repeated string values** (like “City”, “Category”, or “Department”) can be converted to the `category` type.

This stores each unique label once and uses integer codes internally, which drastically reduces memory usage when the number of unique values is small relative to the number of rows.

### Example:

```python
df['city'] = df['city'].astype('category')
df['product_category'] = df['product_category'].astype('category')
```

You can verify:

```python
df['city'].memory_usage(deep=True)
```

### Benefits:

* Less memory.
* Faster filtering and grouping.
* Clearer metadata for columns with limited categories.
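To see what the `category` type stores, the sketch below converts a made-up city column and inspects its labels, its integer codes, and the memory before and after.

```python
import pandas as pd

city = pd.Series(['Delhi', 'Mumbai', 'Delhi', 'Pune'] * 250)  # made-up values
city_cat = city.astype('category')

print(city_cat.cat.categories.tolist())     # each unique label, stored once
print(city_cat.cat.codes.head(4).tolist())  # one small integer code per row

before = city.memory_usage(deep=True)
after = city_cat.memory_usage(deep=True)
print(f"{before} bytes -> {after} bytes")
```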

---

## **5. Downcasting Numeric Columns**

Instead of manually converting data types, you can **automatically downcast**:

```python
df['sales'] = pd.to_numeric(df['sales'], downcast='float')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
```

This tells pandas to pick the smallest integer type that still fits all values. For floats, the smallest type it will choose is `float32`.
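A short sketch of which types `downcast` picks for a few made-up Series:

```python
import pandas as pd

small_ints = pd.to_numeric(pd.Series([1, 2, 3]), downcast='integer')
bigger_ints = pd.to_numeric(pd.Series([1, 2, 300]), downcast='integer')
floats = pd.to_numeric(pd.Series([1.5, 2.25, 3.0]), downcast='float')

# 300 does not fit in int8 (max 127), so pandas moves up to int16
print(small_ints.dtype, bigger_ints.dtype, floats.dtype)
```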

---

## **6. Measuring Improvement**

Let’s compare memory usage before and after optimization:

```python
before = df.memory_usage(deep=True).sum() / 1024**2

# Optimization
df['city'] = df['city'].astype('category')
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
df['sales'] = pd.to_numeric(df['sales'], downcast='float')

after = df.memory_usage(deep=True).sum() / 1024**2

print(f"Memory before: {before:.2f} MB")
print(f"Memory after: {after:.2f} MB")
print(f"Reduction: {100 * (before - after) / before:.2f}%")
```

---

## **7. General Optimization Function**

You can create a **utility function** to optimize all columns automatically:

```python
def optimize_dataframe(df):
    """Shrink column dtypes in place and return the same DataFrame."""
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Initial memory usage: {start_mem:.2f} MB")

    for col in df.columns:
        col_type = df[col].dtype

        if pd.api.types.is_object_dtype(col_type):
            # Convert to category only when values repeat often
            # (fewer than 50% unique); skip empty frames.
            num_unique = df[col].nunique()
            num_total = len(df[col])
            if num_total > 0 and num_unique / num_total < 0.5:
                df[col] = df[col].astype('category')

        elif pd.api.types.is_integer_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='integer')

        elif pd.api.types.is_float_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='float')

    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Optimized memory usage: {end_mem:.2f} MB")
    if start_mem > 0:
        print(f"Reduced by {(100 * (start_mem - end_mem) / start_mem):.1f}%")

    return df
```

Usage:

```python
df = optimize_dataframe(df)
```
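After any automatic optimization, it is worth spot-checking that the values survived. The sketch below is one possible check on made-up data: integers should match exactly, while downcast floats are compared within a tolerance (the `rtol` value is an assumption to adjust for your data).

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({
    'quantity': [1, 2, 300],
    'sales': [19.99, 5.5, 1200.75],
})  # made-up values

optimized = original.copy()
optimized['quantity'] = pd.to_numeric(optimized['quantity'], downcast='integer')
optimized['sales'] = pd.to_numeric(optimized['sales'], downcast='float')

# Integers should match exactly; floats only within a tolerance
ints_match = bool((original['quantity'] == optimized['quantity']).all())
floats_close = bool(np.allclose(original['sales'], optimized['sales'], rtol=1e-6))
print(ints_match, floats_close)
```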

---

## **8. Key Takeaways**

| Technique                | Description                                       | Benefit                            |
| ------------------------ | ------------------------------------------------- | ---------------------------------- |
| **Change data types**    | Convert large integer/float types to smaller ones | Reduces memory                     |
| **Categorical type**     | Replace repeated strings with integer codes       | Huge savings for string-heavy data |
| **Downcast numerics**    | Let pandas pick the smallest type that fits       | Saves space with minimal effort    |
| **Measure before/after** | Compare memory usage and spot-check values        | Verifies savings and accuracy      |

---

## **9. Summary**

Memory optimization typically brings:

* Faster computations.
* Less RAM usage.
* Ability to handle larger datasets smoothly.

In large-scale business analytics, these techniques make the difference between a **slow, crashing notebook** and a **fast, scalable analysis pipeline**.

---

