Where data meets drama, and revenue meets regression! 😎


🎬 The Business Scenario

Congratulations — you’ve just been promoted to Data Science Intern of the Year at Acme Retail Co. 🎉

Your boss (who once said “AI is just fancy Excel”) wants you to predict monthly sales using marketing spend, pricing, and seasonal factors.

Your mission:

Build a simple but effective regression model to forecast sales, explain it clearly, and make it look fancy on a dashboard so everyone thinks it’s magic. ✨


🧾 Step 1. Load the Data

You’ve been handed a “beautifully messy” CSV export by the finance team (of course 🙃).

import pandas as pd

url = "https://raw.githubusercontent.com/chandraveshchaudhari/datasets/main/retail_sales.csv"
data = pd.read_csv(url)

data.head()

📊 Expected columns:

  • Month

  • TV_Spend

  • Social_Media_Spend

  • Discount_Percent

  • Season

  • Sales


🧹 Step 2. Clean the Data

Finance swore the data was “clean.” You’ll find out soon enough.

data.info()
data.describe()
data.isnull().sum()

🧽 Handle missing values or weird outliers:

data = data.dropna()
data = data[data["Sales"] > 0]

Tip: Don’t delete too much — remember, data is like gossip; even the noisy parts tell a story. 😏
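Dropping every incomplete row is the bluntest tool in the box. As a minimal sketch (on a made-up toy frame, not the real retail_sales.csv), here is how filling a gap with the column median keeps a row that dropna() would throw away. The toy values and the choice of the median are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data (values are invented for illustration).
toy = pd.DataFrame({
    "TV_Spend": [1000.0, 1200.0, np.nan, 900.0, 1100.0],
    "Sales": [5000.0, 5600.0, 5300.0, -10.0, 5400.0],
})

# Option A: drop every row that has a missing value.
dropped = toy.dropna()

# Option B: keep the row and fill the gap with the column median.
filled = toy.fillna({"TV_Spend": toy["TV_Spend"].median()})

# Either way, remove rows with non-positive sales, as in the step above.
dropped = dropped[dropped["Sales"] > 0]
filled = filled[filled["Sales"] > 0]

print(f"rows: original={len(toy)}, dropna={len(dropped)}, median fill={len(filled)}")
```

Whether imputing is appropriate depends on why the value is missing, so treat this as one option to compare against dropna(), not a default.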


🔍 Step 3. Feature Engineering

Convert categorical Season into numerical features:

data = pd.get_dummies(data, columns=["Season"], drop_first=True)
data.head()
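To see what drop_first=True actually does, here is a small sketch on a toy column. The season names are assumptions; the real file may label them differently.

```python
import pandas as pd

# Toy Season column (labels are assumed, not taken from the real dataset).
toy = pd.DataFrame({"Season": ["Winter", "Spring", "Summer", "Autumn", "Summer"]})

# One dummy column per category, minus the first one in sorted order.
encoded = pd.get_dummies(toy, columns=["Season"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

The dropped category (here "Autumn", the first alphabetically) becomes the baseline: a row with all dummy columns set to False belongs to it, and each Season_* coefficient is read relative to that baseline.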

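Standardize the features (optional for plain linear regression, but it puts coefficients on a comparable scale and matters for Ridge or Lasso):

from sklearn.preprocessing import StandardScaler

# Drop the target and the Month column; Month is assumed to be a label, not a numeric feature
X = data.drop(columns=["Sales", "Month"])
y = data["Sales"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # returns a NumPy array; column names stay in X.columns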

⚙️ Step 4. Split Data

Time to create our train-test split — because life is all about testing your assumptions.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
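One caveat: in Step 3 the scaler was fitted on all rows before the split, so the test set's mean and standard deviation quietly leak into training. A minimal sketch of one way around this, using a scikit-learn Pipeline so the scaler only ever sees the training rows. The data here is synthetic and the coefficients used to generate it are invented; only the column names are borrowed from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the retail data (all numbers are made up).
rng = np.random.default_rng(42)
n = 60
X_demo = pd.DataFrame({
    "TV_Spend": rng.uniform(1000, 5000, n),
    "Social_Media_Spend": rng.uniform(500, 2000, n),
    "Discount_Percent": rng.uniform(0, 30, n),
})
y_demo = (
    2.0 * X_demo["TV_Spend"]
    + 3.0 * X_demo["Social_Media_Spend"]
    - 50.0 * X_demo["Discount_Percent"]
    + rng.normal(0, 200, n)
)

# Split the raw (unscaled) features first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses it for X_te.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)
print(f"Test R²: {test_r2:.2f}")
```

A side benefit: pipe.predict() accepts raw, unscaled inputs, so there is no separate scaling step to forget later.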

🧠 Step 5. Train the Model

Let’s start simple — just a Linear Regression:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

Check the coefficients (aka “how much each feature drives sales”):

coeff_df = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_.round(2)
})
coeff_df

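📈 Translation: Because the features were standardized, each coefficient is the change in sales for a one-standard-deviation increase in that feature, holding the others fixed. To get “sales per $1 of TV spend,” divide the TV_Spend coefficient by that feature’s standard deviation (stored in scaler.scale_). If your CFO is reading this, round the number dramatically. 😅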
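Since the model was trained on standardized features, its coefficients are per standard deviation, not per dollar. A minimal sketch of converting one back to original units, on synthetic data where the true effect is known to be 4 sales units per $1 (that number, and the names ending in _demo, are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: sales rise by exactly 4 units per $1 of TV spend.
rng = np.random.default_rng(0)
tv = rng.uniform(1000, 5000, 200)
sales = 4.0 * tv + 1000.0

X_raw = tv.reshape(-1, 1)
scaler_demo = StandardScaler()
X_std = scaler_demo.fit_transform(X_raw)

model_demo = LinearRegression().fit(X_std, sales)

# Coefficient on the standardized feature: effect of +1 standard deviation.
per_std = model_demo.coef_[0]

# Divide by the feature's standard deviation to get the effect of +$1.
per_dollar = per_std / scaler_demo.scale_[0]

print(f"per std dev: {per_std:.1f}, per dollar: {per_dollar:.2f}")
```

In the lab itself, the same idea would be model.coef_ / scaler.scale_, applied element-wise across all features.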


📉 Step 6. Evaluate Performance

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # square root of MSE; squared=False is deprecated in recent scikit-learn
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

Interpretation:

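  • MAE = average absolute error, in the same units as Sales

  • RMSE = like MAE, but punishes big mistakes more

  • R² = model’s bragging rights (1.0 = perfection, 0.0 = no better than always predicting the average; it can even go negative on a bad day)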
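If the metric names feel like alphabet soup, it can help to compute them by hand once. A small sketch on four made-up predictions, checked against scikit-learn's own functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Four invented actual/predicted pairs, small enough to verify by hand.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_hat  # [-10, 10, -30, 0]

# MAE: mean of absolute errors -> (10 + 10 + 30 + 0) / 4 = 12.5
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error -> sqrt(1100 / 4) = sqrt(275)
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R²: 1 - (residual sum of squares / total sum of squares) -> 1 - 1100 / 50000
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE={mae_manual:.2f}  RMSE={rmse_manual:.2f}  R²={r2_manual:.3f}")

# The library versions should agree with the hand-rolled ones.
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse_manual, mean_squared_error(y_true, y_hat) ** 0.5)
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

Note how the single 30-unit miss pulls RMSE well above MAE; that gap is the "punishes big mistakes" effect in action.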


📊 Step 7. Visualize Predictions

Let’s make the boss happy with a plot that looks “AI-powered.” 🤖

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred, alpha=0.7, color='royalblue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', color='orange')
plt.title("Actual vs Predicted Sales")
plt.xlabel("Actual Sales")
plt.ylabel("Predicted Sales")
plt.show()

🎨 If points hug the orange line → your model is doing great!


💬 Step 8. Business Insights

| Feature | Effect | Business Advice |
| --- | --- | --- |
| TV_Spend | +ve | Ads still work! Old-school isn’t dead 📺 |
| Social_Media_Spend | +ve | Keep feeding the algorithm 💰 |
| Discount_Percent | -ve (sometimes) | Too many discounts hurt profit margin 💸 |
| Season_Summer | +ve | Ice creams sell better in heat 😎 |

💡 “A regression model is just a data-driven way to say ‘I told you so’ in a meeting.”


📈 Step 9. Bonus: Predict Future Sales

You can predict next month’s sales like a forecasting wizard:

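# Raw values in the same order as X.columns (example: custom input)
next_month = pd.DataFrame([[3000, 1000, 10, 0, 1, 0]], columns=X.columns)
next_month_scaled = scaler.transform(next_month)  # apply the same scaling used in training
predicted_sales = model.predict(next_month_scaled)
print(f"Predicted Sales: ${predicted_sales[0]:.2f}")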

Now go make that dashboard and act mysterious when someone asks,

“So… how does it work?” 😏


💡 Optional Extension

Try the same task using:

  • Ridge or Lasso Regression

  • PolynomialFeatures

  • or even a Random Forest (coming soon in later chapters 🌲)

Then compare R² — because competition keeps models humble.
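A rough sketch of what such a comparison could look like, using 5-fold cross-validation so that no single lucky split decides the winner. The data is synthetic and the alpha values are arbitrary placeholders, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative features, two pure-noise features.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(120, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# Candidate models (alpha values are placeholders, not tuned).
candidates = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

scores = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name}: mean CV R² = {score:.3f}")
```

On clean, low-dimensional data like this, the three models tend to score almost identically; regularization usually earns its keep when features are many, noisy, or correlated.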


🧩 Practice Exercises

| Challenge | Hint |
| --- | --- |
| 1️⃣ Train a Ridge Regression model and compare its R² | from sklearn.linear_model import Ridge |
| 2️⃣ Add a new feature Price_Per_Unit = Sales / Discount_Percent | Think creatively! (Watch for division by zero, and note that a feature built from Sales leaks the target.) |
| 3️⃣ Create a seaborn plot of coefficients | Use sns.barplot() |
| 4️⃣ Try predicting sales when Discount_Percent = 0 | Does the model behave logically? |
| 5️⃣ Deploy your model to Colab and make it interactive | Add widgets or sliders! |

🧠 Recap

  • You loaded, cleaned, and prepped real data

  • Built a regression model

  • Evaluated it with MAE, RMSE, and R²

  • Made it explainable and visually appealing

You’re now a certified Sales Forecast Whisperer 📊🧙‍♂️


🐍 Python Help

If you’re unsure about any pandas or sklearn syntax, 🪄 explore Programming for Business — it’s your friendly guide to all the Python basics behind the magic!


🚀 Next Up

➡️ Chapter 6: Classification Models. Because predicting “how much” is cool, but predicting “who buys” is where real marketing power begins. 🎯

Knowledge Check

1. Why do we separate training and test data in a regression lab?

  • To evaluate how well the model generalizes to unseen data. (Correct. A held-out test set gives a better estimate of real-world performance.)

  • To make the regression line look steeper. (Splitting data does not control the slope for cosmetic reasons.)

  • To remove all business interpretation from the model. (The purpose is evaluation, not removing interpretability.)

  • To avoid computing metrics. (A train-test split is useful because metrics will be computed on unseen data.)

2. What is the main value of plotting predictions against actual values?

  • It guarantees the model is causal. (Plots do not establish causality by themselves.)

  • It helps reveal whether the fitted relationship is reasonable and where errors occur. (Correct. Visual checks make model fit and error patterns easier to interpret.)

  • It replaces the need for numerical metrics entirely. (Visualization complements metrics rather than replacing them.)

  • It forces coefficients to become significant. (Plots do not change fitted coefficients.)