Linear Model Family

One linear engine, many business prediction shapes


You already know the core regression equation and how coefficients translate business change into predictions. This notebook shows how that same idea expands from one-feature regression to multiple-feature regression, generalized linear models, and interaction terms.

Why This Family Matters

Business teams rarely ask for “a model family.” They ask questions like:

| Business question | Useful family member | Why |
| --- | --- | --- |
| How does house size affect price? | Simple linear regression | One continuous target, one main feature |
| How do TV, radio, and social spend combine to predict sales? | Multiple linear regression | One continuous target, several features |
| Will this customer churn next month? | Logistic regression as a GLM | Binary target, probability output |
| How many support tickets will arrive tomorrow? | Poisson regression as a GLM | Count target, non-negative output |
| Does radio work better when TV spend is high? | Linear model with interaction terms | The effect of one feature depends on another |

Family Map

The previous notebook introduced the baseline equation:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

This notebook organizes the variants around three questions: how many features, what kind of target, and whether feature effects combine.

Alt: A business prediction task branches by target type, then by number of features and whether effects interact.

Meet the Main Members

Simple Linear
Multiple Linear
GLM
Interactions

Use simple linear regression when one feature is the main driver of a continuous target:

$$\hat{y} = \beta_0 + \beta_1 x$$

Example: predict sales from ad spend, or predict house price from square footage.

The coefficient $\beta_1$ answers: “How much does the prediction change for one additional unit of $x$?”

Worked Example - House Size and Price

The original notebook used a tiny housing table to connect $y = mx + c$ with the machine-learning notation $h_\theta(x)$. We will keep that example because it teaches an important bridge:

| Size of house in square feet, $x$ | Price in \$1000, $y$ |
| --- | --- |
| 450 | 100 |
| 324 | 78 |
| 844 | 123 |

The one-feature hypothesis is:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

This is the same line as $y = mx + c$, where $c = \theta_0$ and $m = \theta_1$. When we say the model learns, we mean it adjusts $\theta_0$ and $\theta_1$ until the line fits the observed data as well as possible under the chosen loss function.

import numpy as np
import matplotlib.pyplot as plt

# Data points from the table
x = np.array([450, 324, 844])  # Size of house in square feet
y = np.array([100, 78, 123])   # Price in $1000

# Design matrix with a bias column x_0 = 1.
# Each row is [1, x_i], so theta = [theta_0, theta_1].
X = np.vstack([np.ones(len(x)), x]).T

# Normal equation: theta = (X^T X)^(-1) X^T y
# This is taught fully in the OLS notebook; here it lets us fit a line directly.
theta = np.linalg.inv(X.T @ X) @ X.T @ y
theta_0, theta_1 = theta

print(f"theta_0 / intercept / c: {theta_0:.2f}")
print(f"theta_1 / slope / m:      {theta_1:.4f}")
print(f"h_theta(x) = {theta_0:.2f} + {theta_1:.4f} * x")

example_size = 600
example_prediction = theta_0 + theta_1 * example_size
print(f"Predicted price for {example_size} sq. ft: ${example_prediction:.1f}k")

x_line = np.linspace(300, 900, 100)
y_line = theta_0 + theta_1 * x_line

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(x, y, color="#1d4ed8", s=70, label="Observed houses", zorder=3)
ax.plot(x_line, y_line, color="#b91c1c", lw=2, label="Best-fit line")
ax.scatter([example_size], [example_prediction], color="#047857", s=80, marker="D", label="600 sq. ft prediction", zorder=4)
ax.set_xlabel("House size (sq. ft)")
ax.set_ylabel("Price ($1000)")
ax.set_title("Simple Linear Regression: House Size -> Price")
ax.grid(True, alpha=0.25)
ax.legend()
plt.tight_layout()
plt.show()
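theta_0 / intercept / c: 57.29
theta_1 / slope / m:      0.0798
h_theta(x) = 57.29 + 0.0798 * x
Predicted price for 600 sq. ft: $105.2k

[Figure: "Simple Linear Regression: House Size -> Price". Scatter of the observed houses with the best-fit line and the 600 sq. ft prediction; x-axis: house size (sq. ft), y-axis: price ($1000).]
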
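
As a sanity check on the fit above, the same line can be recovered with a different solver. This is a minimal sketch that assumes `np.polyfit` as the alternative; it is not part of the original notebook. The residuals it prints show how far each of the three observed houses sits from the line.

```python
import numpy as np

# Same three houses as in the table above.
x = np.array([450, 324, 844], dtype=float)  # size in sq. ft
y = np.array([100, 78, 123], dtype=float)   # price in $1000

# np.polyfit with deg=1 returns [slope, intercept], highest power first.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope:     {slope:.4f}")
print(f"intercept: {intercept:.2f}")

# Residual = observed price minus the price the fitted line predicts.
residuals = y - (intercept + slope * x)
for size, r in zip(x, residuals):
    print(f"{size:.0f} sq. ft -> residual {r:+.2f}")
```

The slope and intercept should agree with the normal-equation result printed earlier (about 0.0798 and 57.29). For a least-squares line with an intercept, the residuals sum to zero up to floating-point error.
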
How the notation scales from one feature to many

For one feature, the model is easy to read:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

By adding a constant feature $x_0 = 1$, we can write the same prediction as a dot product:

$$h_\theta(x) = \boldsymbol{\theta}^\top \mathbf{x} = \theta_0 \cdot 1 + \theta_1 \cdot x$$

For many features, the idea does not change; the vectors just get longer:

$$h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \cdots + \theta_d x_d$$

Across a whole dataset, all rows stack into the feature matrix $\mathbf{X}$:

$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$$

The OLS notebook later derives why the normal equation gives the best-fitting coefficients for ordinary linear regression.
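
The step from the scalar formula to the matrix product can be checked numerically. The sketch below reuses the house sizes and takes the coefficients rounded from the printed fit (57.29 and 0.0798); the variable names are illustrative only.

```python
import numpy as np

# Coefficients rounded from the house-price fit: [theta_0, theta_1].
theta = np.array([57.29, 0.0798])
sizes = np.array([450.0, 324.0, 844.0])

# One house: prepend x_0 = 1 so the intercept joins the dot product.
x_row = np.array([1.0, sizes[0]])
single_prediction = theta @ x_row

# Whole dataset: stack one [1, x_i] row per house into X.
X = np.column_stack([np.ones(len(sizes)), sizes])
all_predictions = X @ theta

# The scalar formula gives the same numbers as the matrix product.
scalar_predictions = theta[0] + theta[1] * sizes
print(single_prediction)
print(all_predictions)
print(np.allclose(all_predictions, scalar_predictions))
```

Adding more features only widens `X` and lengthens `theta`; the expression `X @ theta` stays the same.
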

Worked Example - Multiple Channels

A marketing team may start with one feature, then add more when the business question becomes richer.

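| TV spend | Radio spend | Social spend | Sales |
| --- | --- | --- | --- |
| 100 | 50 | 20 | 15 |
| 200 | 60 | 25 | 25 |
| 300 | 80 | 30 | 35 |
| 400 | 90 | 42 | 46 |
| 500 | 95 | 55 | 52 |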

In a multiple linear model, the equation becomes:

$$\hat{\text{Sales}} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \beta_3 \text{Social}$$

The important interpretation rule is “holding the other channels fixed.” For example, the TV coefficient estimates the sales change associated with one extra unit of TV spend while radio and social spend stay unchanged.

import numpy as np
import pandas as pd

channels = pd.DataFrame({
    "TV": [100, 200, 300, 400, 500],
    "Radio": [50, 60, 80, 90, 95],
    "Social": [20, 25, 30, 42, 55],
    "Sales": [15, 25, 35, 46, 52],
})

feature_columns = ["TV", "Radio", "Social"]
X_features = channels[feature_columns].to_numpy(dtype=float)
y_sales = channels["Sales"].to_numpy(dtype=float)

# Add the intercept column and solve with least squares.
X_design = np.column_stack([np.ones(len(channels)), X_features])
theta, residuals, rank, singular_values = np.linalg.lstsq(X_design, y_sales, rcond=None)

coefficient_table = pd.DataFrame({
    "term": ["intercept"] + feature_columns,
    "coefficient": theta,
})

print(coefficient_table.to_string(index=False))
print()
print("Prediction for TV=350, Radio=85, Social=35:")
new_campaign = np.array([1, 350, 85, 35], dtype=float)
print(f"${new_campaign @ theta:.1f}k predicted sales")
     term  coefficient
intercept    -1.511706
       TV     0.075485
    Radio     0.208696
   Social    -0.063545

Prediction for TV=350, Radio=85, Social=35:
$40.4k predicted sales
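
The “holding the other channels fixed” rule can be verified directly: raise TV spend by one unit, leave radio and social untouched, and compare the two predictions. This is a minimal sketch that refits the same table; the `base` campaign values are the ones used in the prediction above, and the remaining names are illustrative.

```python
import numpy as np

# Channel data from the table above.
tv = np.array([100, 200, 300, 400, 500], dtype=float)
radio = np.array([50, 60, 80, 90, 95], dtype=float)
social = np.array([20, 25, 30, 42, 55], dtype=float)
sales = np.array([15, 25, 35, 46, 52], dtype=float)

X = np.column_stack([np.ones(len(tv)), tv, radio, social])
theta, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Same campaign twice: once as given, once with one extra unit of TV.
base = np.array([1.0, 350.0, 85.0, 35.0])
more_tv = base.copy()
more_tv[1] += 1.0  # only the TV entry changes

change = more_tv @ theta - base @ theta
print(f"TV coefficient:       {theta[1]:.6f}")
print(f"Change in prediction: {change:.6f}")
```

The two printed numbers match because the model is linear in each feature. With only five rows and four coefficients, the individual estimates are fragile, so treat them as an illustration of the interpretation rule rather than as reliable channel effects.
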

Generalized Linear Models: Same Score, Different Scale

A GLM keeps a linear score but uses a link function to connect that score to the expected value of a target that is not an ordinary unbounded continuous quantity, such as a probability, a count, or a strictly positive amount.

Alt: Features create a linear score; a link function maps that score to a probability, count rate, or positive value.

Examples:

| Target | GLM choice | Typical link | Business output |
| --- | --- | --- | --- |
| Purchase: yes/no | Logistic regression | Logit | Probability of purchase |
| Tickets per day | Poisson regression | Log | Expected count |
| Claim size | Gamma regression | Log or inverse | Positive cost estimate |
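
To make “same score, different scale” concrete, the sketch below pushes a handful of hypothetical linear scores through the inverse logit link and the inverse log link. No model is fitted; the scores are made-up values chosen only to show how each link reshapes the output.

```python
import numpy as np

# Hypothetical linear scores z = theta^T x for five cases.
scores = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

# Logistic regression: the inverse of the logit link (the sigmoid)
# squeezes any score into a probability strictly between 0 and 1.
probabilities = 1.0 / (1.0 + np.exp(-scores))

# Poisson regression: the inverse of the log link (exp)
# turns any score into a positive expected count.
expected_counts = np.exp(scores)

for z, p, mu in zip(scores, probabilities, expected_counts):
    print(f"score {z:+.1f} -> probability {p:.3f}, expected count {mu:.3f}")
```

A score of 0 maps to a probability of 0.5 under the logit link and to an expected count of 1 under the log link, which is a convenient reference point when reading GLM coefficients.
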

Interactive - Choose the Family Member

The Pyodide cell below is lightweight on purpose: it uses plain Python lists and rules, not heavy ML libraries. Edit the task descriptions to practice matching business problems to model families.
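
The interactive cell is not reproduced in this text. As a stand-in, here is a minimal sketch of what such a rule-based matcher might look like, using only plain Python lists and rules. The task entries, field names, feature counts, and the `choose_family_member` function are hypothetical; the rules follow the business-question table at the top of this notebook.

```python
# Each task is described by its target type, its number of features, and
# whether one feature's effect is suspected to depend on another.
# Feature counts for the churn and ticket tasks are made up for illustration.
tasks = [
    {"name": "House price from size", "target": "continuous", "n_features": 1, "interaction": False},
    {"name": "Sales from TV, radio, social", "target": "continuous", "n_features": 3, "interaction": False},
    {"name": "Customer churn next month", "target": "binary", "n_features": 5, "interaction": False},
    {"name": "Support tickets tomorrow", "target": "count", "n_features": 4, "interaction": False},
    {"name": "Radio effect at high TV spend", "target": "continuous", "n_features": 2, "interaction": True},
]


def choose_family_member(task):
    """Map a task description to a family member with simple rules."""
    if task["target"] == "binary":
        return "Logistic regression as a GLM"
    if task["target"] == "count":
        return "Poisson regression as a GLM"
    if task["interaction"]:
        return "Linear model with interaction terms"
    if task["n_features"] == 1:
        return "Simple linear regression"
    return "Multiple linear regression"


choices = [choose_family_member(task) for task in tasks]
for task, choice in zip(tasks, choices):
    print(f"{task['name']}: {choice}")
```

Editing the entries in `tasks` and rerunning is a quick way to practice the mapping. The target type is checked first because it decides between ordinary linear regression and a GLM before feature count matters.
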

Knowledge Check

When do you move from simple linear regression to multiple linear regression?

  • When there are no predictors at all
    Feedback: A regression model needs predictors to estimate outcomes.
  • When more than one feature helps explain a continuous target
    Feedback: Correct. Multiple linear regression uses several predictors together.
  • When the model must stop using coefficients
    Feedback: Multiple regression still uses coefficients; it just has more of them.
  • When the target becomes a sentence
    Feedback: Text generation is outside ordinary linear regression.

What is the main benefit of generalized linear models?

  • They remove the need for data
    Feedback: No learning model can estimate relationships without data.
  • They extend linear-model ideas to different target distributions and link functions
    Feedback: Correct. GLMs keep a linear score but adapt the target scale.
  • They always outperform every nonlinear model
    Feedback: That is too broad; performance depends on the data and task.
  • They automatically fix data leakage
    Feedback: Data leakage is a workflow problem, not something a model family fixes automatically.

What does an interaction term represent?

  • A feature that should be deleted before training
    Feedback: Interaction terms are intentionally engineered when feature effects combine.
  • A way to make the target binary
    Feedback: Binary targets point toward logistic regression, not interaction terms by themselves.
  • A feature whose effect depends on another feature
    Feedback: Correct. An interaction such as TV x Radio lets one channel change the effect of another.
  • The MSE value after training
    Feedback: MSE measures prediction error; it is not a feature interaction.

Practice Exercises

  1. A product team wants to predict delivery time in minutes from distance only. Name the model family member and write the equation.

  2. A finance team wants to predict loan default: default = 1, paid = 0. Which GLM example fits best, and why is plain linear regression not ideal?

  3. A retailer suspects discount effectiveness depends on customer loyalty tier. Write a linear-model equation that includes an interaction term.

  4. In the house-price example, change example_size to 1000 and rerun the code. Does the prediction feel reasonable given only three training observations?

Hints
  • Continuous one-feature target: start with $\hat{y} = \beta_0 + \beta_1 x$.

  • Binary target: think logistic regression and probability outputs between 0 and 1.

  • Interaction term pattern: include $x_1 x_2$ as an extra engineered feature.

  • Extrapolation beyond the observed range is risky, especially with tiny datasets.

One Possible Solution
Why This Works
  1. Simple linear regression: $\hat{\text{minutes}} = \beta_0 + \beta_1 \text{distance}$.

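  2. Logistic regression as a GLM, because the target is binary and the business needs a probability. Plain linear regression is not ideal because its predictions can fall below 0 or above 1, so they cannot be read as probabilities.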

  3. $\hat{\text{sales}} = \beta_0 + \beta_1 \text{discount} + \beta_2 \text{loyalty} + \beta_3 (\text{discount} \times \text{loyalty})$.

  4. A 1000 sq. ft prediction is extrapolation beyond the observed examples. Treat it as a rough guess, not a reliable valuation.
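
The interaction equation in solution 3 can be tried out by adding the product of the two features as one more column. The sketch below uses hypothetical, noise-free data generated from made-up coefficients, so the fit recovers them exactly. It assumes loyalty tier is coded as 0, 1, 2, which is an illustration choice and not something the exercise specifies.

```python
import numpy as np

# Hypothetical data: discount in percent, loyalty tier coded 0/1/2.
discount = np.array([0, 10, 20, 0, 10, 20, 0, 10, 20], dtype=float)
loyalty = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=float)

# Made-up coefficients, used only to generate the illustration:
# intercept, discount, loyalty, discount x loyalty.
true_beta = np.array([50.0, 1.0, 5.0, 0.5])

# The interaction is just an engineered column: the elementwise product.
interaction = discount * loyalty
X = np.column_stack([np.ones(len(discount)), discount, loyalty, interaction])
sales = X @ true_beta  # noise-free target

beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

# With an interaction, the discount effect is beta_1 + beta_3 * loyalty.
for tier in (0, 1, 2):
    effect = beta[1] + beta[3] * tier
    print(f"loyalty tier {tier}: sales change per discount point = {effect:.2f}")
```

With these made-up coefficients the discount effect grows from 1.00 at tier 0 to 1.50 at tier 1 and 2.00 at tier 2. That is what the interaction term expresses: the slope of one feature depends on the value of another. Real data would add noise, and the recovered coefficients would only approximate the underlying ones.
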

Summary and Next Bridge

| Concept | Takeaway |
| --- | --- |
| Simple linear regression | One continuous target, one feature |
| Multiple linear regression | One continuous target, many features |
| GLM | Linear score adapted to non-standard targets through a link function and distribution |
| Interaction term | An engineered feature that lets one feature modify another feature’s effect |
| Vectorized notation | The same line equation scales to matrices: $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$ |

Linear models become useful only after we can measure how wrong their predictions are. Next, move to Mean Squared Error, where the book turns prediction mistakes into the objective that linear regression tries to minimize.