Lab – Sales Forecasting - Machine Learning for Business

Where data meets drama, and revenue meets regression! 😎

🎬 The Business Scenario

Congratulations — you’ve just been promoted to Data Science Intern of the Year at Acme Retail Co. 🎉

Your boss (who once said “AI is just fancy Excel”) wants you to predict monthly sales using marketing spend, pricing, and seasonal factors.

Your mission:

Build a simple but effective regression model to forecast sales, explain it clearly, and make it look fancy on a dashboard so everyone thinks it’s magic. ✨

🧾 Step 1. Load the Data

You’ve been handed a “beautifully messy” CSV file by the finance team (of course 🙃).

import pandas as pd

url = "https://raw.githubusercontent.com/chandraveshchaudhari/datasets/main/retail_sales.csv"
data = pd.read_csv(url)

data.head()

📊 Expected columns:

Month
TV_Spend
Social_Media_Spend
Discount_Percent
Season
Sales
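Before moving on, it can help to confirm that the file really contains these columns. A minimal sketch, using a small made-up DataFrame in place of the real file; the values and the `missing_columns` helper are illustrative assumptions, not part of the lab's dataset:

```python
import pandas as pd

EXPECTED_COLUMNS = ["Month", "TV_Spend", "Social_Media_Spend",
                    "Discount_Percent", "Season", "Sales"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Made-up stand-in for the real file; note that "Season" is deliberately missing.
sample = pd.DataFrame({
    "Month": ["2023-01", "2023-02"],
    "TV_Spend": [3000, 2500],
    "Social_Media_Spend": [1000, 1200],
    "Discount_Percent": [10, 5],
    "Sales": [52000, 48000],
})

missing = missing_columns(sample)
print(missing)
```

On the real data you would call `missing_columns(data)` right after `pd.read_csv` and stop if the list is not empty.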

🧹 Step 2. Clean the Data

Finance swore the data was “clean.” You’ll find out soon enough.

data.info()
data.describe()
data.isnull().sum()

🧽 Handle missing values or weird outliers:

data = data.dropna()
data = data[data["Sales"] > 0]

✅ Tip: Don’t delete too much — remember, data is like gossip; even the noisy parts tell a story. 😏
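To see what "deleting too much" can cost, here is a small sketch comparing `dropna()` with filling a gap using the column median. The numbers are made up for illustration, and median imputation is just one possible choice, not necessarily the right one for the lab's data:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing spend value and one impossible sales value.
raw = pd.DataFrame({
    "TV_Spend": [3000.0, np.nan, 2800.0, 3100.0],
    "Sales": [52000.0, 48000.0, -10.0, 55000.0],
})

# Option 1: drop every row that has a missing value (what the lab does).
dropped = raw.dropna()

# Option 2: keep the row and fill the gap with the column median.
filled = raw.fillna({"TV_Spend": raw["TV_Spend"].median()})

# Either way, rows with non-positive sales are removed.
dropped = dropped[dropped["Sales"] > 0]
filled = filled[filled["Sales"] > 0]

print(len(raw), len(dropped), len(filled))
```

With only four rows the difference is one observation, but on a short monthly series every row you keep matters.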

🔍 Step 3. Feature Engineering

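Convert the categorical Season column into numeric dummy columns (drop_first=True drops one season, which becomes the baseline the others are compared against):

# dtype=int gives 0/1 columns instead of True/False
data = pd.get_dummies(data, columns=["Season"], drop_first=True, dtype=int)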
data.head()
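A tiny example makes the effect of `drop_first=True` visible. The season names below are an assumption for illustration; the real file may label them differently:

```python
import pandas as pd

# Made-up rows covering four seasons.
demo = pd.DataFrame({"Season": ["Autumn", "Spring", "Summer", "Winter"],
                     "Sales": [100, 120, 150, 90]})

encoded = pd.get_dummies(demo, columns=["Season"], drop_first=True, dtype=int)

# The alphabetically first category ("Autumn") gets no column of its own:
# an Autumn row is the one where all remaining Season_* columns are 0.
print(encoded.columns.tolist())
print(encoded.iloc[0].to_dict())
```

Four seasons therefore become three dummy columns, and each dummy coefficient is read relative to the dropped season.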

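Standardize the features (optional for plain linear regression, but it puts the coefficients on a comparable scale and matters for regularized models). Month is dropped because it labels the row rather than acting as a numeric predictor. Note that fitting the scaler on every row lets test-set statistics leak into training; for a stricter evaluation, fit it on the training split only.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Assumes Month is a non-numeric label column; keep only the numeric predictors
X = data.drop(columns=["Sales", "Month"])
X_scaled = scaler.fit_transform(X)
y = data["Sales"]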

⚙️ Step 4. Split Data

Time to create our train-test split — because life is all about testing your assumptions.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
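Scaling before the split lets the scaler see the test rows. One common way to avoid that is to split the raw features first and let a scikit-learn pipeline fit the scaler on the training rows only. The sketch below uses synthetic data (the coefficients 5 and 8 and the noise level are invented), so treat it as an illustration rather than the lab's official solution:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab's features and target.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "TV_Spend": rng.uniform(1000, 5000, 60),
    "Social_Media_Spend": rng.uniform(500, 2000, 60),
})
y_demo = (5 * X_demo["TV_Spend"] + 8 * X_demo["Social_Media_Spend"]
          + rng.normal(0, 500, 60))

# Split the raw (unscaled) features first.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics whenever it predicts.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xd_train, yd_train)

print(f"Test R²: {pipe.score(Xd_test, yd_test):.2f}")
```

For monthly data you may also want `shuffle=False` in `train_test_split`, so the most recent months form the test set instead of a random sample.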

🧠 Step 5. Train the Model

Let’s start simple — just a Linear Regression:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

Check the coefficients (aka “how much each feature drives sales”):

coeff_df = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_.round(2)
})
coeff_df

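📈 Translation: Because the features were standardized, each coefficient is the change in predicted sales for a one-standard-deviation increase in that feature, holding the others fixed (if your CFO is reading this, round the number dramatically). 😅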
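If the boss insists on "sales per extra dollar of spend", coefficients fitted on standardized features can be converted back by dividing by each feature's standard deviation, which `StandardScaler` stores in `scale_`. A minimal sketch on synthetic data with a known relationship (the variable names are chosen for this example only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known relationship: sales = 2*tv + 3*social + 100.
rng = np.random.default_rng(0)
tv = rng.uniform(1000, 5000, 50)
social = rng.uniform(500, 2000, 50)
X_raw = np.column_stack([tv, social])
sales = 2 * tv + 3 * social + 100

demo_scaler = StandardScaler()
demo_model = LinearRegression().fit(demo_scaler.fit_transform(X_raw), sales)

# Coefficients on standardized features: change in sales per 1 standard deviation.
per_std = demo_model.coef_

# Divide by each feature's standard deviation to get change per original unit.
per_unit = per_std / demo_scaler.scale_

print(per_unit.round(2))
```

The recovered values match the 2 and 3 used to build the data, which is a quick sanity check that the conversion is right.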

📉 Step 6. Evaluate Performance

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
# RMSE is the square root of MSE (the squared=False argument is deprecated in newer scikit-learn releases)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

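✅ Interpretation:

MAE = average absolute error, in the same units as Sales
RMSE = like MAE, but punishes big mistakes more
R² = model’s bragging rights (1.0 = perfection, 0.0 = no better than always predicting the mean; it can go negative on test data)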
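The three metrics are easy to compute by hand, which helps when explaining them in a meeting. A worked example on three made-up months:

```python
import numpy as np

actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

errors = actual - predicted                    # -10, 10, -30
mae_manual = np.mean(np.abs(errors))           # average absolute miss
rmse_manual = np.sqrt(np.mean(errors ** 2))    # squaring weights the 30-unit miss more
ss_res = np.sum(errors ** 2)                   # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2) # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# A "model" that always predicts the mean scores exactly 0.
baseline = np.full_like(actual, actual.mean())
r2_baseline = 1 - np.sum((actual - baseline) ** 2) / ss_tot

print(f"MAE: {mae_manual:.2f}, RMSE: {rmse_manual:.2f}, R²: {r2_manual:.3f}")
# MAE: 16.67, RMSE: 19.15, R²: 0.945
```

RMSE comes out larger than MAE because the single 30-unit miss dominates once errors are squared.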

📊 Step 7. Visualize Predictions

Let’s make the boss happy with a plot that looks “AI-powered.” 🤖

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred, alpha=0.7, color='royalblue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', color='orange')
plt.title("Actual vs Predicted Sales")
plt.xlabel("Actual Sales")
plt.ylabel("Predicted Sales")
plt.show()

🎨 If points hug the orange line → your model is doing great!

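💬 Step 8. Business Insights

An illustrative reading of the coefficients (your signs and sizes may differ, so check them against coeff_df):

Feature	Effect	Business Advice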
TV_Spend	+ve	Ads still work! Old-school isn’t dead 📺
Social_Media_Spend	+ve	Keep feeding the algorithm 💰
Discount_Percent	-ve (sometimes)	Too many discounts hurt profit margin 💸
Season_Summer	+ve	Ice creams sell better in heat 😎

💡 “A regression model is just a data-driven way to say ‘I told you so’ in a meeting.”

📈 Step 9. Bonus: Predict Future Sales

You can predict next month’s sales like a forecasting wizard:

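# Example input; values must follow the order of X.columns
next_month = pd.DataFrame([[3000, 1000, 10, 0, 1, 0]], columns=X.columns)
# The model was trained on scaled features, so apply the same scaler before predicting
predicted_sales = model.predict(scaler.transform(next_month))
print(f"Predicted Sales: ${predicted_sales[0]:.2f}")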
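A bare list of numbers is easy to get out of order. One way to make the input harder to misuse is to describe the new month with named fields, one-hot encode it, and align it to the training columns with `reindex`. The sketch below is self-contained and uses made-up training rows and season names; only the column names come from the lab:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up training data in the same shape as the lab's table (minus Month).
train = pd.DataFrame({
    "TV_Spend": [3000, 2500, 4000, 3500, 2000, 4500, 3200, 2800],
    "Social_Media_Spend": [1000, 1200, 800, 1500, 900, 1100, 1300, 700],
    "Discount_Percent": [10, 5, 15, 0, 20, 10, 5, 15],
    "Season": ["Winter", "Spring", "Summer", "Autumn",
               "Winter", "Summer", "Spring", "Autumn"],
    "Sales": [52000, 48000, 70000, 60000, 41000, 75000, 56000, 47000],
})

encoded_train = pd.get_dummies(train, columns=["Season"], drop_first=True, dtype=int)
feature_cols = encoded_train.columns.drop("Sales")

sk = StandardScaler().fit(encoded_train[feature_cols])
reg = LinearRegression().fit(sk.transform(encoded_train[feature_cols]),
                             encoded_train["Sales"])

# New month described with readable names instead of a bare list of numbers.
new_month = pd.DataFrame([{
    "TV_Spend": 3000, "Social_Media_Spend": 1000,
    "Discount_Percent": 10, "Season": "Summer",
}])

# One-hot encode, then align to the training columns; dummies absent here become 0.
new_encoded = pd.get_dummies(new_month, columns=["Season"], dtype=int)
new_encoded = new_encoded.reindex(columns=feature_cols, fill_value=0)

forecast = reg.predict(sk.transform(new_encoded))
print(new_encoded.iloc[0].to_dict())
```

The aligned row ends up with the same six values as the hand-written example, but the column order is now enforced by the code rather than by memory.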

Now go make that dashboard and act mysterious when someone asks,

“So… how does it work?” 😏

💡 Optional Extension

Try the same task using:

Ridge or Lasso Regression
PolynomialFeatures
or even a Random Forest (coming soon in later chapters 🌲)

Then compare R² — because competition keeps models humble.
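A single train-test split can flatter one model by luck, so a fairer comparison averages R² over several folds. A rough sketch on synthetic data; the `alpha` values are arbitrary starting points, not tuned recommendations:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative features, two pure-noise features.
rng = np.random.default_rng(7)
X_syn = rng.normal(size=(80, 4))
y_syn = 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + rng.normal(scale=0.5, size=80)

candidates = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

cv_r2 = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so each fold scales on its own training part.
    candidate_pipe = make_pipeline(StandardScaler(), estimator)
    cv_r2[name] = cross_val_score(candidate_pipe, X_syn, y_syn, cv=5, scoring="r2").mean()

for name, score in cv_r2.items():
    print(f"{name}: mean CV R² = {score:.3f}")
```

On the lab's data you would pass the unscaled `X` and `y` instead of the synthetic arrays.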

🧩 Practice Exercises

Challenge	Hint
1️⃣ Train a Ridge Regression model and compare its R²	`from sklearn.linear_model import Ridge`
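2️⃣ Consider a new feature `Price_Per_Unit` = `Sales / Discount_Percent`	Trick question: it is built from the target `Sales` and divides by zero when the discount is 0. What would you engineer instead?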
3️⃣ Create a seaborn plot of coefficients	Use `sns.barplot()`
4️⃣ Try predicting sales when `Discount_Percent = 0`	Does the model behave logically?
5️⃣ Run your model in Colab and make it interactive	Add widgets or sliders!

🧠 Recap

You loaded, cleaned, and prepped real data
Built a regression model
Evaluated forecast accuracy with MAE, RMSE, and R²
Made it explainable and visually appealing

You’re now a certified Sales Forecast Whisperer 📊🧙‍♂️

🐍 Python Help

If you’re unsure about any pandas or sklearn syntax, 🪄 explore Programming for Business — it’s your friendly guide to all the Python basics behind the magic!

🚀 Next Up

➡️ Chapter 6: Classification Models

Because predicting “how much” is cool, but predicting “who buys” is where real marketing power begins. 🎯

Knowledge Check

Why do we separate training and test data in a regression lab?

To evaluate how well the model generalizes to unseen data
Correct. A held-out test set gives a better estimate of real-world performance.

To make the regression line look steeper
Splitting data does not control the slope for cosmetic reasons.

To remove all business interpretation from the model
The purpose is evaluation, not removing interpretability.

To avoid computing metrics
A train-test split is useful because metrics will be computed on unseen data.

What is the main value of plotting predictions against actual values?

It guarantees the model is causal
Plots do not establish causality by themselves.

It helps reveal whether the fitted relationship is reasonable and where errors occur
Correct. Visual checks make model fit and error patterns easier to interpret.

It replaces the need for numerical metrics entirely
Visualization complements metrics rather than replacing them.

It forces coefficients to become significant
Plots do not change fitted coefficients.