Linear Model Family
One linear engine, many business prediction shapes

Why This Family Matters
Business teams rarely ask for “a model family.” They ask questions like:
| Business question | Useful family member | Why |
|---|---|---|
| How does house size affect price? | Simple linear regression | One continuous target, one main feature |
| How do TV, radio, and social spend combine to predict sales? | Multiple linear regression | One continuous target, several features |
| Will this customer churn next month? | Logistic regression as a GLM | Binary target, probability output |
| How many support tickets will arrive tomorrow? | Poisson regression as a GLM | Count target, non-negative output |
| Does radio work better when TV spend is high? | Linear model with interaction terms | The effect of one feature depends on another |
Family Map
The previous notebook introduced the baseline equation:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$
This notebook organizes the variants around three questions: how many features, what kind of target, and whether feature effects combine.
Alt: A business prediction task branches by target type, then by number of features and whether effects interact.
Meet the Main Members
Use simple linear regression when one feature is the main driver of a continuous target:
$$\hat{y} = \beta_0 + \beta_1 x$$
Example: predict sales from ad spend, or predict house price from square footage.
The coefficient $\beta_1$ answers: “How much does the prediction change for one additional unit of $x$?”
Use multiple linear regression when several features jointly predict one continuous target:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$
Example: predict sales using TV spend, radio spend, social spend, seasonality, and price discount.
Each coefficient is interpreted while holding the other features fixed.
Use a generalized linear model when the target is not a plain continuous number:
$$\color{#7c3aed}{g(\mu)} = \color{#1d4ed8}{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}$$
Color legend: left-hand side, transformed target expectation; right-hand side, linear score.
Examples:
Logistic regression: binary outcomes such as churn or purchase
Poisson regression: count outcomes such as tickets, visits, or claims
Gamma regression: positive skewed outcomes such as claim size or time-to-resolution
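To make the link idea concrete, the sketch below pushes one linear score through the inverse of two common links. The coefficients and feature values are made up for illustration; they do not come from any fitted model in this notebook.

```python
import numpy as np

# Hypothetical coefficients and one feature row (illustrative values only).
beta = np.array([-1.0, 0.8, 0.5])   # [beta_0, beta_1, beta_2]
x = np.array([1.0, 2.0, 1.0])       # the leading 1.0 multiplies the intercept

# Linear score: beta_0 + beta_1 * x_1 + beta_2 * x_2
score = float(beta @ x)

# Inverse links map the unbounded score back to the scale of the target mean.
probability = 1.0 / (1.0 + np.exp(-score))   # inverse logit: always in (0, 1)
expected_count = np.exp(score)               # inverse log: always positive

print(f"linear score:    {score:.3f}")
print(f"probability:     {probability:.3f}")
print(f"expected count:  {expected_count:.3f}")
```

The score itself can be any real number; only the inverse link decides whether it is read as a probability or as a positive rate.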
Use interaction terms when one feature changes the effect of another:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 x_2)$$
Example: radio spend may work better when TV spend is also high. The interaction term lets the model represent that combined lift while still staying in the linear model family.
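A small numeric sketch can show what the interaction coefficient does. The coefficients below are hypothetical, chosen only to illustrate that the effect of one extra unit of radio spend equals $\beta_1 + \beta_3 \cdot \text{TV}$, so it grows with TV spend when $\beta_3$ is positive.

```python
# Hypothetical coefficients for:
#   y_hat = b0 + b1 * radio + b2 * tv + b3 * (radio * tv)
b0, b1, b2, b3 = 5.0, 0.10, 0.05, 0.002

def predict(radio, tv):
    """Linear model with an engineered interaction feature radio * tv."""
    return b0 + b1 * radio + b2 * tv + b3 * (radio * tv)

def effect_of_one_more_radio(radio, tv):
    """Change in the prediction when radio rises by one unit and tv stays fixed."""
    return predict(radio + 1, tv) - predict(radio, tv)

low_tv_effect = effect_of_one_more_radio(radio=50, tv=100)    # b1 + b3 * 100
high_tv_effect = effect_of_one_more_radio(radio=50, tv=400)   # b1 + b3 * 400

print(f"Effect of +1 radio when TV = 100: {low_tv_effect:.3f}")
print(f"Effect of +1 radio when TV = 400: {high_tv_effect:.3f}")
```

The model is still linear in its coefficients; the product `radio * tv` is just one more column in the feature matrix.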
Worked Example - House Size and Price
The original notebook used a tiny housing table to connect the familiar line equation $y = mx + c$ with the machine-learning notation $h_\theta(x)$. We will keep that example because it teaches an important bridge:
| Size of house in square feet, $x$ | Price in \$1000, $y$ |
|---|---|
| 450 | 100 |
| 324 | 78 |
| 844 | 123 |
The one-feature hypothesis is:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
This is the same line as $y = mx + c$, where $c = \theta_0$ and $m = \theta_1$. When we say the model learns, we mean it adjusts $\theta_0$ and $\theta_1$ until the line fits the observed data as well as possible under the chosen loss function.
import numpy as np
import matplotlib.pyplot as plt
# Data points from the table
x = np.array([450, 324, 844]) # Size of house in square feet
y = np.array([100, 78, 123]) # Price in $1000
# Design matrix with a bias column x_0 = 1.
# Each row is [1, x_i], so theta = [theta_0, theta_1].
X = np.vstack([np.ones(len(x)), x]).T
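# Normal equation: theta = (X^T X)^(-1) X^T y
# This is taught fully in the OLS notebook; here it lets us fit a line directly.
# Solving the system (X^T X) theta = X^T y gives the same theta without forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)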
theta_0, theta_1 = theta
print(f"theta_0 / intercept / c: {theta_0:.2f}")
print(f"theta_1 / slope / m: {theta_1:.4f}")
print(f"h_theta(x) = {theta_0:.2f} + {theta_1:.4f} * x")
example_size = 600
example_prediction = theta_0 + theta_1 * example_size
print(f"Predicted price for {example_size} sq. ft: ${example_prediction:.1f}k")
x_line = np.linspace(300, 900, 100)
y_line = theta_0 + theta_1 * x_line
fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(x, y, color="#1d4ed8", s=70, label="Observed houses", zorder=3)
ax.plot(x_line, y_line, color="#b91c1c", lw=2, label="Best-fit line")
ax.scatter([example_size], [example_prediction], color="#047857", s=80, marker="D", label="600 sq. ft prediction", zorder=4)
ax.set_xlabel("House size (sq. ft)")
ax.set_ylabel("Price ($1000)")
ax.set_title("Simple Linear Regression: House Size -> Price")
ax.grid(True, alpha=0.25)
ax.legend()
plt.tight_layout()
plt.show()
theta_0 / intercept / c: 57.29
theta_1 / slope / m: 0.0798
h_theta(x) = 57.29 + 0.0798 * x
Predicted price for 600 sq. ft: $105.2k
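As a quick sanity check, the same line can be recovered with `np.polyfit`, which fits a degree-1 polynomial by least squares. This is a cross-check on the same three data points, not the method this notebook teaches.

```python
import numpy as np

x = np.array([450, 324, 844], dtype=float)   # house size in sq. ft
y = np.array([100, 78, 123], dtype=float)    # price in $1000

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

print(f"intercept: {intercept:.2f}")
print(f"slope:     {slope:.4f}")
```

Both values should agree with the normal-equation result printed above, up to rounding.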

How the notation scales from one feature to many
For one feature, the model is easy to read:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
By adding a constant feature $x_0 = 1$, we can write the same prediction as a dot product:
$$h_\theta(x) = \boldsymbol{\theta}^\top\mathbf{x} = \theta_0 \cdot 1 + \theta_1 \cdot x$$
For many features, the idea does not change; the vectors just get longer:
$$h_\theta(\mathbf{x}) = \boldsymbol{\theta}^\top\mathbf{x} = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$
Across a whole dataset, all rows stack into the feature matrix $\mathbf{X}$:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$$
The OLS notebook later derives why the normal equation gives the best-fitting coefficients for ordinary linear regression.
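The row-wise and matrix forms can be compared directly in NumPy. The sketch below reuses the rounded coefficients printed in the house example and checks that a dot product per row and a single matrix product give the same predictions.

```python
import numpy as np

# Rounded coefficients from the house example: [theta_0, theta_1].
theta = np.array([57.29, 0.0798])

sizes = np.array([450.0, 324.0, 844.0])

# Feature matrix X: one row per house, with the constant feature x_0 = 1 first.
X = np.column_stack([np.ones(len(sizes)), sizes])

# One row at a time: h_theta(x) = theta^T x
row_predictions = np.array([theta @ row for row in X])

# Whole dataset at once: y_hat = X theta
matrix_predictions = X @ theta

print(matrix_predictions)
print(np.allclose(row_predictions, matrix_predictions))
```

Adding more features only adds columns to `X` and entries to `theta`; the last two expressions stay the same.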
Worked Example - Multiple Channels
A marketing team may start with one feature, then add more when the business question becomes richer.
| TV spend | Radio spend | Social spend | Sales |
|---|---|---|---|
| 100 | 50 | 20 | 15 |
| 200 | 60 | 25 | 25 |
| 300 | 80 | 30 | 35 |
| 400 | 90 | 42 | 46 |
| 500 | 95 | 55 | 52 |
In a multiple linear model, the equation becomes:
$$\hat{\text{Sales}} = \beta_0 + \beta_1\,\text{TV} + \beta_2\,\text{Radio} + \beta_3\,\text{Social}$$
The important interpretation rule is “holding the other channels fixed.” For example, the TV coefficient estimates the sales change associated with one extra unit of TV spend while radio and social spend stay unchanged.
import numpy as np
import pandas as pd
channels = pd.DataFrame({
"TV": [100, 200, 300, 400, 500],
"Radio": [50, 60, 80, 90, 95],
"Social": [20, 25, 30, 42, 55],
"Sales": [15, 25, 35, 46, 52],
})
feature_columns = ["TV", "Radio", "Social"]
X_features = channels[feature_columns].to_numpy(dtype=float)
y_sales = channels["Sales"].to_numpy(dtype=float)
# Add the intercept column and solve with least squares.
X_design = np.column_stack([np.ones(len(channels)), X_features])
theta, residuals, rank, singular_values = np.linalg.lstsq(X_design, y_sales, rcond=None)
coefficient_table = pd.DataFrame({
"term": ["intercept"] + feature_columns,
"coefficient": theta,
})
print(coefficient_table.to_string(index=False))
print()
print("Prediction for TV=350, Radio=85, Social=35:")
new_campaign = np.array([1, 350, 85, 35], dtype=float)
print(f"${new_campaign @ theta:.1f}k predicted sales")
term coefficient
intercept -1.511706
TV 0.075485
Radio 0.208696
Social -0.063545
Prediction for TV=350, Radio=85, Social=35:
$40.4k predicted sales
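To see what “holding the other channels fixed” means numerically, the sketch below copies the rounded coefficients from the printed table and raises TV spend by one unit while radio and social spend stay the same. The difference between the two predictions is the TV coefficient.

```python
import numpy as np

# Coefficients copied from the fitted table: [intercept, TV, Radio, Social].
theta = np.array([-1.511706, 0.075485, 0.208696, -0.063545])

base_campaign = np.array([1.0, 350.0, 85.0, 35.0])      # [1, TV, Radio, Social]
more_tv_campaign = np.array([1.0, 351.0, 85.0, 35.0])   # TV + 1, others unchanged

base_sales = base_campaign @ theta
more_tv_sales = more_tv_campaign @ theta
tv_effect = more_tv_sales - base_sales

print(f"Base prediction:        {base_sales:.3f}")
print(f"With one more TV unit:  {more_tv_sales:.3f}")
print(f"Difference:             {tv_effect:.6f}")
```

With only five rows and four coefficients, the individual values are not stable estimates; the point here is the arithmetic of the interpretation rule.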
Generalized Linear Models: Same Score, Different Scale
A GLM keeps a linear score but uses a link function to connect that score to a target that is not an ordinary continuous quantity, such as a probability, a count, or a strictly positive amount.
Alt: Features create a linear score; a link function maps that score to a probability, count rate, or positive value.
Examples:
| Target | GLM choice | Typical link | Business output |
|---|---|---|---|
| Purchase: yes/no | Logistic regression | Logit | Probability of purchase |
| Tickets per day | Poisson regression | Log | Expected count |
| Claim size | Gamma regression | Log or inverse | Positive cost estimate |
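One practical consequence of the log link is that coefficients act multiplicatively on the expected count. The sketch below uses hypothetical Poisson coefficients for tickets per day, with a made-up feature (number of product releases), and shows that a one-unit increase in the feature multiplies the expected count by $e^{\beta_1}$.

```python
import math

# Hypothetical Poisson-regression coefficients (illustrative values only).
beta_0 = 2.0   # intercept on the log scale
beta_1 = 0.3   # effect of one extra product release, on the log scale

def expected_tickets(releases):
    """Inverse log link: expected count = exp(linear score)."""
    return math.exp(beta_0 + beta_1 * releases)

tickets_at_1 = expected_tickets(1)
tickets_at_2 = expected_tickets(2)
ratio = tickets_at_2 / tickets_at_1

print(f"Expected tickets with 1 release:  {tickets_at_1:.2f}")
print(f"Expected tickets with 2 releases: {tickets_at_2:.2f}")
print(f"Ratio: {ratio:.4f}  vs  exp(beta_1): {math.exp(beta_1):.4f}")
```

This is why log-link coefficients are usually read as percentage changes in the expected count, not as additive changes.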
Interactive - Choose the Family Member
The Pyodide cell below is lightweight on purpose: it uses plain Python lists and rules, not heavy ML libraries. Edit the task descriptions to practice matching business problems to model families.
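A minimal sketch of such a rule-based chooser, in plain Python, might look like the following. The task fields (`target`, `n_features`, `interaction`), the feature counts, and the function name `choose_family_member` are hypothetical names and values chosen for this illustration, not necessarily those of the original cell.

```python
# Each task is described by three decisions: target type, feature count, interaction.
# The feature counts below are illustrative assumptions.
tasks = [
    {"name": "House price from size", "target": "continuous", "n_features": 1, "interaction": False},
    {"name": "Sales from TV, radio, social", "target": "continuous", "n_features": 3, "interaction": False},
    {"name": "Customer churn next month", "target": "binary", "n_features": 5, "interaction": False},
    {"name": "Support tickets tomorrow", "target": "count", "n_features": 4, "interaction": False},
    {"name": "Radio lift when TV is high", "target": "continuous", "n_features": 2, "interaction": True},
]

def choose_family_member(task):
    """Apply the three questions in order: target type, interaction, feature count."""
    if task["target"] == "binary":
        return "Logistic regression (GLM)"
    if task["target"] == "count":
        return "Poisson regression (GLM)"
    if task["target"] == "positive_skewed":
        return "Gamma regression (GLM)"
    if task["interaction"]:
        return "Linear model with interaction terms"
    if task["n_features"] == 1:
        return "Simple linear regression"
    return "Multiple linear regression"

choices = [choose_family_member(task) for task in tasks]
for task, choice in zip(tasks, choices):
    print(f"{task['name']:32s} -> {choice}")
```

Editing the `tasks` list, for example by changing a target to `"positive_skewed"`, is a quick way to practice the mapping from the table at the top of this notebook.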
Knowledge Check
When do you move from simple linear regression to multiple linear regression?
What is the main benefit of generalized linear models?
What does an interaction term represent?
Practice Exercises
1. A product team wants to predict delivery time in minutes from distance only. Name the model family member and write the equation.
2. A finance team wants to predict loan default: `default = 1`, `paid = 0`. Which GLM example fits best, and why is plain linear regression not ideal?
3. A retailer suspects discount effectiveness depends on customer loyalty tier. Write a linear-model equation that includes an interaction term.
4. In the house-price example, change `example_size` to 1000 and rerun the code. Does the prediction feel reasonable given only three training observations?
Hints
1. Continuous target, one feature: start with $\hat{y} = \beta_0 + \beta_1 x$.
2. Binary target: think logistic regression and probability outputs between 0 and 1.
3. Interaction term pattern: include $x_1 x_2$ as an extra engineered feature.
4. Extrapolation beyond the observed range is risky, especially with tiny datasets.
Solutions
1. Simple linear regression: $\hat{\text{minutes}} = \beta_0 + \beta_1\,\text{distance}$.
2. Logistic regression as a GLM, because the target is binary and the business needs a probability. Plain linear regression is not ideal because its predictions can fall below 0 or above 1.
3. $\hat{\text{sales}} = \beta_0 + \beta_1\,\text{discount} + \beta_2\,\text{loyalty} + \beta_3(\text{discount} \times \text{loyalty})$.
4. A 1000 sq. ft prediction is extrapolation beyond the observed examples. Treat it as a rough guess, not a reliable valuation.
The exercises separate three decisions: target type, feature count, and whether one feature changes another feature’s effect. That decision tree is the main skill for this notebook.
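For the last exercise, a simple range check makes the extrapolation risk explicit. The sketch below refits the three-house line with least squares and flags any query size outside the observed range; the helper name `is_extrapolation` is hypothetical.

```python
import numpy as np

x = np.array([450, 324, 844], dtype=float)   # observed house sizes (sq. ft)
y = np.array([100, 78, 123], dtype=float)    # observed prices ($1000)

# Fit the line by least squares on the design matrix [1, x].
X = np.column_stack([np.ones(len(x)), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

def is_extrapolation(size):
    """True when the query lies outside the range of observed sizes."""
    return bool(size < x.min() or size > x.max())

results = {}
for size in (600, 1000):
    prediction = theta[0] + theta[1] * size
    results[size] = (prediction, is_extrapolation(size))
    label = "extrapolation" if results[size][1] else "within observed range"
    print(f"{size} sq. ft -> ${prediction:.1f}k ({label})")
```

The line produces a number for any input, so a check like this has to come from outside the model.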
Summary and Next Bridge
| Concept | Takeaway |
|---|---|
| Simple linear regression | One continuous target, one feature |
| Multiple linear regression | One continuous target, many features |
| GLM | Linear score adapted to non-standard targets through a link function and distribution |
| Interaction term | An engineered feature that lets one feature modify another feature’s effect |
| Vectorized notation | The same line equation scales to matrices: $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$ |
Linear models become useful only after we can measure how wrong their predictions are. Next, move to Mean Squared Error, where the book turns prediction mistakes into the objective that linear regression tries to minimize.