
Welcome to Calculus Essentials — the chapter where your ML models finally learn how to learn. Don’t worry — we’re not here to prove theorems. We’re here to explain why your model behaves like an over-caffeinated intern: it keeps making mistakes, learning from them, and slowly improving. ☕🤖


🧠 Why Calculus Matters in ML

In business terms:

Calculus helps your model minimize regret (also known as “loss”).

It’s the math of change and improvement — used by machine learning algorithms to adjust parameters, reduce errors, and get better at predictions.

Think of it as the “performance review” process for algorithms.

| Concept | What It Means in ML | Business Analogy |
| --- | --- | --- |
| Derivative | Measures how much something changes | “If ad spend goes up slightly, how does revenue change?” |
| Gradient | Multi-dimensional derivative | “What’s the direction to move to improve profits?” |
| Optimization | Using calculus to find the best outcome | “Find the marketing budget that maximizes ROI.” |

💡 The Idea in One Sentence

Machine Learning = Data + Calculus + Patience.

Every training loop goes something like this:

  1. Make a prediction (it’s probably wrong 🙃)

  2. Measure how wrong (loss function)

  3. Use calculus (gradient) to adjust parameters

  4. Try again, but smarter

And that’s gradient descent — the backbone of all modern ML.


⚙️ Meet the Star: The Derivative

The derivative tells us how fast something changes.

$$\frac{d}{dx}f(x)$$

means “how much does $f(x)$ change if we nudge $x$ a little?”

If $f(x)$ = revenue and $x$ = ad spend:

$$\frac{d}{dx}f(x) \Rightarrow \text{“How much does revenue change if we increase ads slightly?”}$$

That’s business calculus — not rocket science. 🚀


🏔️ Finding the Sweet Spot (Minima)

Imagine your model’s “error” as a landscape of hills and valleys. Your goal? Find the lowest point — the minimum loss.

  • 🏔️ Too high → bad predictions

  • 🕳️ Lowest valley → best model parameters

Gradient descent is your model walking downhill — step by step — until it reaches that sweet spot.

$$\theta = \theta - \eta \cdot \frac{d}{d\theta}L(\theta)$$

Where:

  • $\theta$ = model parameters

  • $L(\theta)$ = loss (sadness level)

  • $\eta$ = learning rate (coffee intake per iteration ☕)
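The update rule can be sketched in a few lines of Python; the quadratic loss and the starting values below are illustrative, not from any real model:

```python
# A minimal sketch of the update theta <- theta - eta * dL/dtheta.
# The toy loss is L(theta) = (theta - 2)**2, with minimum at theta = 2.

def grad(theta):
    """Derivative of the toy loss L(theta) = (theta - 2)**2."""
    return 2 * (theta - 2)

theta = 10.0  # arbitrary starting parameter
eta = 0.1     # learning rate

for _ in range(100):
    theta -= eta * grad(theta)  # step against the slope

print(round(theta, 4))  # converges toward 2, the minimum of the toy loss
```

Each iteration shrinks the distance to the minimum by a constant factor, which is exactly the “try again, but smarter” loop described above.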


🧩 Practice Corner #1: “Find the Direction”

Suppose your model’s loss is shaped like a valley:

$$L(w) = (w - 3)^2$$

You start at $w = 0$. The slope is:

$$\frac{dL}{dw} = 2(w - 3)$$

At $w = 0$: slope = $-6$ → negative means “go right.” The model learns to move toward $w = 3$ — where loss is minimum.

Key Idea: The sign of the derivative tells your model which direction to move. That’s literally “learning” in one line of math.
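The exercise can be checked numerically; this short sketch (the 0.1 step size is an arbitrary choice) follows the slope from $w = 0$:

```python
# Following the slope of L(w) = (w - 3)**2, starting from w = 0.

def dL_dw(w):
    return 2 * (w - 3)

w = 0.0
print(dL_dw(w))  # -6: the negative sign says "go right"

eta = 0.1  # illustrative step size
for _ in range(50):
    w -= eta * dL_dw(w)  # subtracting a negative slope increases w

print(round(w, 3))  # approaches 3, where the loss is minimal
```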


💬 Common Calculus Concepts in ML

| Symbol | Name | ML Role | Business Analogy |
| --- | --- | --- | --- |
| $\frac{d}{dx}$ | Derivative | Change in one variable | How revenue reacts to marketing spend |
| $\nabla L$ | Gradient | Multi-variable slope | Direction of steepest improvement |
| $\eta$ | Learning Rate | Step size | “How aggressively should we change strategy?” |
| $L(\theta)$ | Loss Function | Model’s total error | “How wrong are we?” |
| $\min L(\theta)$ | Optimization | Goal of training | “Find the best possible outcome.” |

🧩 Practice Corner #2: “Business Gradient Descent”

Your model predicts sales with one parameter — price. It’s currently too high, and customers are leaving.

What should the model do?

| Observation | Derivative Sign | Adjustment |
| --- | --- | --- |
| Price ↑ → Sales ↓ | Negative | Move price down |
| Price ↑ → Sales ↑ | Positive | Price can move up |

💡 Models don’t know “cheap” or “expensive.” They just see gradients — mathematical hints toward better business outcomes.


⚖️ Business Translation: “Gradient Descent in Real Life”

Let’s compare:

| Model Training | Business Analogy |
| --- | --- |
| Model makes bad prediction | Employee makes mistake |
| Calculate loss | You review KPIs |
| Compute gradient | You identify what went wrong |
| Update weights | Employee learns and improves |
| Repeat | Continuous improvement cycle |

Yes — your ML model is basically a self-correcting employee that never sleeps and works for free. 🤖💼


🎓 Quick Recap

✅ Derivatives measure change

✅ Gradients tell direction

✅ Learning rate controls speed

✅ Loss function defines goal

✅ Gradient descent = how models learn from mistakes


🧠 Bonus: The Secret Behind Backpropagation

If you’ve heard “backpropagation” in neural networks — that’s just calculus playing catch-up.

The model:

  1. Makes a forward pass (predicts)

  2. Measures error (loss)

  3. Uses calculus (chain rule) to assign blame

  4. Updates weights accordingly

In other words:

Backprop = Calculus + Accountability.
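The chain-rule bookkeeping can be shown on a toy one-weight model; all the numbers here are made up for illustration:

```python
# Toy one-weight "network": prediction y = w * x, loss = (y - target)**2.
# Backprop here is just the chain rule: dloss/dw = dloss/dy * dy/dw.

x, w, target = 2.0, 0.5, 3.0

# 1. Forward pass (predict)
y = w * x                  # 1.0

# 2. Measure error (loss)
loss = (y - target) ** 2   # 4.0

# 3. Chain rule assigns blame to w
dloss_dy = 2 * (y - target)   # -4.0
dy_dw = x                     # 2.0
dloss_dw = dloss_dy * dy_dw   # -8.0

# 4. Update the weight accordingly
eta = 0.1
w -= eta * dloss_dw

print(w)  # w moved from 0.5 toward target / x = 1.5
```

In a real network the same multiplication of local derivatives is repeated layer by layer, which is all “backpropagation” means.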


🧩 Practice Corner #3: “Check Your Intuition”

| Question | Your Guess |
| --- | --- |
| What does the gradient represent? | |
| Why can’t we use too high a learning rate? | |
| What happens when the gradient = 0? | |

Answers:

  • Gradient = direction of improvement

  • Too high a rate = overshooting the goal

  • Gradient = 0 → you’ve reached (hopefully) the minimum loss


🚀 Up Next

Next stop: Probability Essentials → We’ll swap calculus for chance and uncertainty — because predicting customers is basically rolling dice, but with better spreadsheets. 🎲📈


💡 Quick Intro: Calculus Key Concepts at a Glance

Let’s define a total cost function:

$$C(q) = 5q^2 - 40q + 200$$

Where:

  • $q$ is the quantity of goods produced (e.g., number of items),

  • $C(q)$ is the total cost of producing $q$ items.

This quadratic function helps us demonstrate limits, continuity, differentiation, minima, and gradient descent, and is easy to connect with real business decisions.

1. Limits

A limit tells us what value the cost function is approaching as the quantity $q$ approaches some value.

Example:

$$\lim_{q \to 4} C(q) = \lim_{q \to 4} (5q^2 - 40q + 200)$$

Calculating:

$$C(4) = 5(4)^2 - 40(4) + 200 = 80 - 160 + 200 = 120$$

So, the limit as $q$ approaches 4 is 120.
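You can also watch the limit numerically by evaluating $C(q)$ at points closing in on 4 from both sides; a small sketch:

```python
# Numerically approaching lim_{q -> 4} C(q) for C(q) = 5q^2 - 40q + 200.

def C(q):
    return 5 * q**2 - 40 * q + 200

for q in [3.9, 3.99, 3.999, 4.001, 4.01, 4.1]:
    print(q, C(q))  # values close in on 120 from both sides

print(C(4))  # 120: for this continuous function the limit equals C(4)
```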

2. Continuity

The cost function is a polynomial and therefore continuous for all real values of $q$.

That means:

  • The function has no gaps, jumps, or holes.

  • You can evaluate limits simply by plugging in the number:

$$\lim_{q \to a} C(q) = C(a)$$

Example: At $q = 5$, we get:

$$C(5) = 5(5)^2 - 40(5) + 200 = 125 - 200 + 200 = 125$$

3. Differentiation

The first derivative of the cost function gives us the marginal cost: How much the total cost changes when we produce one more unit.

Given:

$$C(q) = 5q^2 - 40q + 200$$

Then:

$$C'(q) = \frac{d}{dq}C(q) = 10q - 40$$

So at $q = 6$:

$$C'(6) = 10(6) - 40 = 20$$

The marginal cost is 20 currency units per item.

4. Minima and Maxima

To find the optimal quantity $q$ that minimizes cost, we set:

$$C'(q) = 0 \Rightarrow 10q - 40 = 0 \Rightarrow q = 4$$

Then, we check the second derivative:

$$C''(q) = \frac{d^2}{dq^2} C(q) = 10 > 0$$

Since $C''(q) > 0$, this point is a minimum.

Therefore, the minimum total cost occurs at $q = 4$:

$$C(4) = 5(4)^2 - 40(4) + 200 = 120$$

5. Gradient Descent

Gradient descent finds the value of $q$ that minimizes $C(q)$.

Update rule:

$$q_{\text{new}} = q_{\text{old}} - \eta \cdot C'(q_{\text{old}})$$

Where $\eta$ is the learning rate and $C'(q)$ is the marginal cost.

Starting from $q = 8$ and using $\eta = 0.1$, we iteratively reduce cost by following the slope.
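A minimal sketch of those iterations (note that with $\eta = 0.1$ and $C'(q) = 10q - 40$, the very first step happens to land exactly on the minimum):

```python
# Iterating q_new = q_old - eta * C'(q_old) for C(q) = 5q^2 - 40q + 200.

def C_prime(q):
    return 10 * q - 40  # marginal cost

q, eta = 8.0, 0.1
for step in range(1, 6):
    q -= eta * C_prime(q)
    print(step, q)

# The first update is 8 - 0.1 * 40 = 4, landing exactly on the minimum
# (because eta * 10 = 1); a smaller eta would approach q = 4 gradually.
```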

Below is a single, physics-based theme—using position, velocity, and acceleration—to illustrate limits, continuity, differentiation, non-continuous functions, and gradient descent. At the end, you’ll see a Python snippet to visualize gradient descent on a simple cost function derived from velocity.


1. Position, Velocity, Acceleration

Let an object move along a line so that its position at time tt is

s(t)=t2+2t+1,tR.s(t) = t^2 + 2t + 1\,,\quad t\in\mathbb{R}.
  • Its velocity is the derivative of position:

    v(t)=s(t)=ddt(t2+2t+1)=2t+2.v(t) = s'(t) = \frac{d}{dt}(t^2 + 2t + 1) = 2t + 2.
  • Its acceleration is the derivative of velocity:

    a(t)=v(t)=ddt(2t+2)=2.a(t) = v'(t) = \frac{d}{dt}(2t + 2) = 2.

2. Limits

A limit tells us what a function “approaches” as $t$ nears some value.

Example:

$$\lim_{t \to -1} v(t) = \lim_{t \to -1} (2t + 2) = 2(-1) + 2 = 0.$$

Intuitively, as $t$ approaches $-1$, the velocity approaches $0$.


3. Continuity

A function $f(t)$ is continuous at $t = a$ if

  1. $f(a)$ is defined,

  2. $\lim_{t \to a} f(t)$ exists,

  3. $\lim_{t \to a} f(t) = f(a)$.

Since $s(t) = t^2 + 2t + 1$ is a polynomial, it’s continuous for all $t$. Thus

$$\lim_{t \to 3} s(t) = s(3) = 3^2 + 2 \cdot 3 + 1 = 16.$$

4. Differentiation

  • The first derivative $s'(t) = v(t)$ gives the instantaneous velocity.

  • The second derivative $s''(t) = a(t)$ gives the instantaneous acceleration.

At $t = 1$:

$$v(1) = 2 \cdot 1 + 2 = 4, \quad a(1) = 2.$$

5. Non-Continuous Example

Consider the piecewise velocity:

$$v_{\text{disc}}(t) = \begin{cases} 2t + 2, & t < 1,\\ 5, & t = 1,\\ 2t - 1, & t > 1. \end{cases}$$

  • At $t = 1$,

    $$\lim_{t \to 1^-} v_{\text{disc}}(t) = 2 \cdot 1 + 2 = 4, \quad \lim_{t \to 1^+} v_{\text{disc}}(t) = 2 \cdot 1 - 1 = 1,$$

    but $v_{\text{disc}}(1) = 5$.

  • Conclusion: left limit $\neq$ right limit $\neq$ function value → discontinuity.

Gradient-based methods fail at such jumps, since the slope is undefined there.
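A quick numerical sketch of the jump, probing the piecewise velocity just left and right of $t = 1$:

```python
# Probing the piecewise velocity just left and right of the jump at t = 1.

def v_disc(t):
    if t < 1:
        return 2 * t + 2
    elif t == 1:
        return 5
    return 2 * t - 1

left = v_disc(1 - 1e-9)   # ~4: the left-hand limit
right = v_disc(1 + 1e-9)  # ~1: the right-hand limit
print(left, right, v_disc(1))  # 4, 1, and 5 all disagree: a jump discontinuity
```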


6. Gradient Descent Connection

We often want to tune a parameter to make velocity hit a target. Define a cost measuring the squared error from a desired velocity $v^*$:

$$J(t) = \bigl(v(t) - v^*\bigr)^2 = \bigl(2t + 2 - v^*\bigr)^2.$$

  • Its derivative (gradient) is

    $$J'(t) = 2\bigl(2t + 2 - v^*\bigr) \cdot 2 = 4\bigl(2t + 2 - v^*\bigr).$$

  • Gradient descent updates

    $$t_{\text{new}} = t_{\text{old}} - \eta \, J'(t_{\text{old}}),$$

    stepping “downhill” in $J(t)$ until $v(t) \approx v^*$.


7. Derivative vs. Limit as $t \to \infty$

  • A derivative $f'(t)$ is the slope of $f$ at a specific $t$.

  • A limit $\lim_{t \to \infty} f(t)$ describes the behavior far out as $t$ grows without bound.

Example for $s(t) = t^2$:

  • $s'(t) = 2t$ gives the slope at each $t$ (e.g., $s'(3) = 6$).

  • $\lim_{t \to \infty} s(t) = \infty$ tells us the position grows arbitrarily large, but says nothing about the instantaneous slope at any finite $t$.


8. Python Code to Visualize Gradient Descent on $J(t)$

  • What it shows: the quadratic cost $J(t)$ (blue curve) and how successive gradient-descent iterations (markers) march toward the minimum, where the object’s velocity matches $v^*$.
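The script itself is not included above, so here is a sketch of what it could look like; the target velocity $v^* = 10$, the learning rate, and the iteration count are all arbitrary choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

v_star = 10.0  # arbitrary target velocity; v(t) = 2t + 2 hits it at t = 4

def J(t):
    return (2 * t + 2 - v_star) ** 2

def J_prime(t):
    return 4 * (2 * t + 2 - v_star)

# Gradient descent iterations
eta, t = 0.05, 0.0
path = [t]
for _ in range(20):
    t -= eta * J_prime(t)
    path.append(t)

# Plot the cost curve and the descent markers
ts = np.linspace(-1, 6, 400)
plt.plot(ts, J(ts), label="J(t)")
plt.plot(path, [J(p) for p in path], "ro--", label="descent steps")
plt.xlabel("t")
plt.ylabel("J(t)")
plt.legend()
plt.savefig("gradient_descent_J.png")

print(round(path[-1], 3))  # approaches t = 4, where v(t) = v*
```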


Business Calculus with a Profit Function Example

Consider a simple business profit model: the profit $P(x)$ from selling $x$ units is revenue minus cost, i.e., $P(x) = R(x) - C(x)$ (Calculus I - Business Applications). For example, if each item sells at price $p$ and total revenue is $R(x) = p \cdot x$, then $P(x) = p \cdot x - C(x)$. In general $R(x)$ and $C(x)$ could be curved (e.g., price may fall at higher $x$ and costs may rise nonlinearly). Throughout, think of $x$ as quantity and $P(x)$ as profit. We will use this unified profit-function example to introduce limits, continuity, derivatives, discontinuities, integration, and even gradient descent, keeping the math clear but in everyday language.

Limits

The limit of a function describes its behavior as the input $x$ approaches some value. Intuitively, $\lim_{x \to a} P(x) = L$ means that as $x$ gets closer and closer to $a$, the profit $P(x)$ gets arbitrarily close to some number $L$ (Limit of a function - Wikipedia). In business terms, we might ask “what happens to profit as production approaches a certain level?” The limit formalizes this. For example, if there were a huge fixed cost or a piecewise change at $x = a$, the profit might approach different values from the left or right. More concretely, if $R(x)$ and $C(x)$ are smooth, then as $x \to a$ the profit smoothly approaches $P(a)$. The formal definition says: for any target distance around $L$, we can keep $f(x)$ within that target by choosing $x$ close enough to $a$ (Limit of a function - Wikipedia). In practice, one might consider limits like $\lim_{x \to 0} P(x)$ (profit at very low output) or $\lim_{x \to \infty} P(x)$ (profit if output grows without bound). For instance, if costs grow faster than revenue at high $x$, $\lim_{x \to \infty} P(x)$ could be $-\infty$ (huge losses). Limits are the foundation for defining continuity and derivatives, as we see next.

Continuity

A function is continuous if its value doesn’t jump suddenly as $x$ changes. Formally, $P(x)$ is continuous at $x = a$ if $\lim_{x \to a} P(x) = P(a)$ (Limit of a function - Wikipedia). In plain terms, small changes in $x$ (production) cause only small changes in profit: the graph of profit is unbroken. For most simple profit models (like polynomial cost and revenue functions), $P(x)$ will be continuous everywhere in its domain. In business terms, continuity means no sudden surprises in profit: incremental production smoothly increases or decreases profit. However, if there are abrupt changes, such as a step cost (perhaps a new factory upgrade kicks in at $x = 100$) or a sudden tariff, the profit graph could have breaks or jumps, meaning discontinuities. A helpful picture is that a continuous graph can be drawn without lifting a pen (Discontinuous Function - Meaning, Types, Examples). If you must pick up your pen (there’s a gap or jump), the function is discontinuous. We discuss those next.

Non-Continuous (Discontinuous) Functions

A discontinuous profit function has gaps, jumps, or holes. As Cuemath explains, a discontinuous function “has breaks/gaps on its graph” (Discontinuous Function - Meaning, Types, Examples). In business, imagine this: your factory can produce up to 100 units at one cost structure, but producing the 101st unit requires renting additional equipment. Cost suddenly jumps at $x = 100$, so the profit function has a jump at that point. Mathematically, either the profit is not defined at some $x = a$, or $\lim_{x \to a^-} P(x) \neq \lim_{x \to a^+} P(x)$, or the limit doesn’t equal $P(a)$ (Discontinuous Function - Meaning, Types, Examples). For example, if the demand price changes abruptly at a certain quantity or a tax kicks in, profit can drop sharply. Discontinuous profit functions are more complex to analyze, but the key idea is intuitive: if your profit model has built-in jumps (like step costs or price tiers), it’s discontinuous. In such cases, classic calculus tools like setting derivatives to zero may fail at the jump point. We can still study limits on each side or use piecewise analysis. But typically for optimization we assume the profit is continuous (no gaps), so calculus works smoothly.

Differentiation (Marginal Profit)

The derivative $P'(x)$ measures the instantaneous rate of change of profit with respect to quantity. It is the slope of the tangent line to $P(x)$ at $x$ (Derivative - Wikipedia). In practice, $P'(x)$ is called the marginal profit: approximately how much additional profit you get by selling one more unit (or a tiny increment of units). If $P'(x)$ is large and positive, a small increase in production yields a large profit gain; if $P'(x)$ is negative, making one more item reduces profit. We often write the derivative definition as $P'(x) = \lim_{h \to 0} \frac{P(x+h) - P(x)}{h}$, but conceptually it’s “instantaneous” change. The Wikipedia article sums it up: the derivative is a “fundamental tool that quantifies the sensitivity to change”, essentially the rate at which profit changes for a small change in $x$ (Derivative - Wikipedia). Business interpretation:

  • If $P'(x) > 0$, producing one more unit increases profit.

  • If $P'(x) < 0$, producing one more unit decreases profit.

  • If $P'(x) = 0$, profit is at a local maximum or minimum (a critical point).

These rules can be bulleted as key takeaways:

  • $P'(x) > 0$: Profit is increasing in $x$, so it pays to produce more.

  • $P'(x) < 0$: Profit is decreasing in $x$, so reduce production.

  • $P'(x) = 0$: You may have reached an optimum (e.g., maximum profit or minimum loss).

In fact, to maximize profit, we set $P'(x) = 0$ and check the result. This is equivalent to the well-known economic rule “marginal revenue = marginal cost.” Since $P(x) = R(x) - C(x)$, we have $P'(x) = R'(x) - C'(x)$. Paul’s notes explain that if $P'' < 0$ (concave down), then the maximum occurs when $R'(x) = C'(x)$, i.e., when marginal profit is zero (Calculus I - Business Applications). In other words, you find the quantity $x$ where increasing production neither increases nor decreases profit. That will be the peak of the profit curve under normal concave conditions.

Once we solve $P'(x) = 0$, we check $P''(x)$ (the second derivative) to ensure it is a maximum (usually $P'' < 0$ there). In our example, you would compute $P'(x)$ and solve for $x$. For instance, if $P'(x) = 100 + 0.05x - 0.000012x^2$ (as in Paul’s example (Calculus I - Business Applications)), setting this to zero and solving gives candidate maximizers. Then checking concavity confirms which is a maximum. Thus, differentiation turns profit curves into actionable advice: the sign of $P'(x)$ tells you whether to increase or decrease output, and $P'(x) = 0$ finds the optimal production for peak profit (Calculus I - Business Applications).

Integration

Integration is the reverse process of differentiation. It accumulates small quantities. In calculus, the integral of $P'(x)$ gives the total change in profit. The Fundamental Theorem of Calculus tells us that for a continuous marginal-profit function $P'$, we can recover profit from it (Fundamental theorem of calculus - Wikipedia). In intuitive terms, “integrating” means summing up tiny bits of profit: it computes the area under the graph of a function, or the cumulative effect of small contributions. Concretely, if you know the marginal profit $P'(x)$ at every $x$, then $P(x) = \int P'(x) \, dx + \text{constant}$. If we set a reference point (say $P(0) = 0$, since selling nothing yields zero profit), then the definite integral from 0 to $x$ gives total profit: $P(x) = \int_0^x P'(t) \, dt$. In other words, the area under the marginal-profit curve from 0 to $x$ is the profit at $x$. More generally, the Fundamental Theorem states $\int_a^b P'(x) \, dx = P(b) - P(a)$, meaning the integral of marginal profit from $a$ to $b$ equals the change in actual profit (Fundamental theorem of calculus - Wikipedia). Business interpretation: if you know how each additional item contributes to profit (the marginal profit function $P'(x)$), then integrating it tells you overall profit. For example, if a new product’s marginal profit is $P'(x) \approx 150$ at $x = 2500$ (meaning each unit around 2500 adds \$150 profit), then integrating $P'(x)$ up to 2500 shows the total profit (minus any base profit at 0). Integration also applies to costs: the total cost is the integral of marginal cost. In summary, integration lets us find accumulated profit or cost by summing up marginal contributions (Fundamental theorem of calculus - Wikipedia).

Gradient Descent

Gradient descent is an iterative method to optimize functions using derivatives (Gradient descent - Wikipedia). Think of it as algorithmically “following the slope” to find a minimum. In our context, we often want to maximize profit. Gradient descent as defined in calculus actually finds local minima (it moves against the gradient). However, to find a maximum profit, one can simply take the opposite approach (“gradient ascent”) by stepping in the positive gradient direction (Gradient descent - Wikipedia). The basic rule of gradient descent is: start with some initial $x$ (production level) and update it by moving against the gradient of the function. In formula form, one step is $x_{\text{new}} = x_{\text{old}} - \gamma \, P'(x_{\text{old}})$, where $\gamma$ is a small positive step size (learning rate) (Gradient descent - Wikipedia). Here $P'(x)$ is the derivative (slope) of profit. Moving in the negative gradient direction decreases the function value fastest; in our profit context, we would instead use the positive gradient (add $\gamma P'(x)$) to move uphill toward higher profit (or equivalently, minimize $-P$). In practice, one adjusts $x$ iteratively: if $P'(x)$ is positive, increase $x$ (moving up the profit slope); if $P'(x)$ is negative, decrease $x$. The update formula ensures each step moves toward the optimum. For sufficiently small $\gamma$, this process converges toward the local maximum (like climbing a hill one small step at a time). We must choose a suitable $\gamma$ so steps aren’t too large (overshooting) or too small (too slow).

Business tie-in: gradient descent (or ascent) is useful when the profit function is complicated and we can’t easily solve $P' = 0$ algebraically. It suggests a rule: adjust production gradually in the direction that increases profit. If you notice profit rising as you add units, keep adding; if profit falls, cut back. By iterating this idea, you home in on the best production level. Although businesses often solve for $P' = 0$ directly, the gradient approach is analogous to trial-and-error tuning of output to maximize profit or minimize cost.

In summary, limits and continuity ensure our profit models behave sensibly, derivatives (marginal profit) tell us how small changes affect profit, integration sums up those changes to total profit, and gradient descent is a practical algorithm to adjust production toward optimal profit. This suite of calculus ideas provides a powerful toolkit for intuitive business decisions like “should I produce more or less?” and “how do I find the output that maximizes profit?”

Sources: Definitions and key concepts are supported by standard calculus references (Limit of a function - Wikipedia; Derivative - Wikipedia; Fundamental theorem of calculus - Wikipedia; Gradient descent - Wikipedia), and the profit formula $P(x) = R(x) - C(x)$ by Paul’s calculus notes (Calculus I - Business Applications).

One final question gets to the core idea of optimization using derivatives and gradient descent: how does a model know which direction to move?


🌄 The Derivative Is a Local Slope

The derivative of a function $f(x)$ at a point $x$ tells us the slope (or rate of change) of the function right at that point. It’s like standing on a hill and asking:

“If I take one small step forward or backward, will I go uphill or downhill?”

Mathematically:

  • If $f'(x) > 0$, the function is increasing at $x$ → slope tilts upward

  • If $f'(x) < 0$, the function is decreasing at $x$ → slope tilts downward

  • If $f'(x) = 0$, we’re at a flat spot — it could be a minimum, maximum, or inflection point

So:

🧭 How Do We Know Which Direction Leads to the Minimum?

We use the sign of the derivative.

In gradient descent, we always move opposite the direction of the slope, because that’s the way to go downhill (toward the minimum).


🚶 Simple Example: Walk on a Curve

Let’s say we have a cost function:

$$C(x) = (x - 3)^2 + 5$$

This is a simple parabola with its minimum at $x = 3$.

The derivative is:

$$C'(x) = 2(x - 3)$$

So at different points:

  • At $x = 5$: $C'(5) = 2(5 - 3) = 4$ → positive slope → go left to reduce cost

  • At $x = 1$: $C'(1) = 2(1 - 3) = -4$ → negative slope → go right to reduce cost

  • At $x = 3$: $C'(3) = 0$ → flat spot → minimum!

We use the derivative as a compass:

Go opposite the sign of the derivative to reduce the function.


🔍 Why Small Intervals?

Derivatives are local — they only tell you what’s happening right now. So in gradient descent, we take small steps in the negative gradient direction. Over many steps, we spiral toward the minimum.

This works well as long as:

  • The function is smooth

  • The learning rate $\gamma$ is small enough not to overshoot


📌 Summary

  • Derivative tells us the slope — the direction and steepness of the curve at one point

  • We use its sign to decide direction (go opposite to minimize)

  • Gradient descent uses this info to take many small steps toward a minimum

  • The process doesn’t know the whole curve, but follows the slope like walking downhill

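As a minimal sketch, here is gradient descent on the cost $C(x) = (x - 3)^2 + 5$ from the example above (the starting point and step size are arbitrary choices):

```python
# Walking downhill on C(x) = (x - 3)**2 + 5, one small step at a time.

def C_prime(x):
    return 2 * (x - 3)

x, gamma = 5.0, 0.1  # arbitrary start (right of the minimum) and step size
for step in range(50):
    slope = C_prime(x)
    x -= gamma * slope  # move opposite the sign of the slope

print(round(x, 3))  # settles near x = 3, the flat spot where C'(x) = 0
```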

Origin of the Power Rule for Differentiation

The power rule for differentiation, $\frac{d}{dx}(x^n) = nx^{n-1}$, comes directly from the definition of the derivative using limits:

$$\frac{d}{dx}f(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

Let $f(x) = x^n$, where $n$ is a positive integer. Then:

$$\frac{d}{dx}(x^n) = \lim_{h \to 0} \frac{(x+h)^n - x^n}{h}$$

We can expand $(x+h)^n$ using the binomial theorem:

$$(x+h)^n = x^n + nx^{n-1}h + \frac{n(n-1)}{2!}x^{n-2}h^2 + \dots + h^n$$

Substituting this back into the limit expression:

$$\frac{d}{dx}(x^n) = \lim_{h \to 0} \frac{\left(x^n + nx^{n-1}h + \frac{n(n-1)}{2!}x^{n-2}h^2 + \dots + h^n\right) - x^n}{h}$$

$$= \lim_{h \to 0} \frac{nx^{n-1}h + \frac{n(n-1)}{2!}x^{n-2}h^2 + \dots + h^n}{h}$$

Now, we can divide each term in the numerator by $h$:

$$= \lim_{h \to 0} \left( nx^{n-1} + \frac{n(n-1)}{2!}x^{n-2}h + \dots + h^{n-1} \right)$$

As $h$ approaches 0, all terms containing $h$ also approach 0. Therefore, we are left with:

$$\frac{d}{dx}(x^n) = nx^{n-1}$$

This proves the power rule for positive integer values of $n$. The rule can be extended to other real numbers using more advanced techniques like logarithmic differentiation.
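The power rule can also be sanity-checked numerically with a central-difference approximation; a small sketch:

```python
# Sanity-checking d/dx x^n = n * x^(n-1) with a central difference.

def numeric_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.7  # arbitrary test point
for n in [2, 3, 5]:
    approx = numeric_derivative(lambda x, n=n: x**n, x0)
    exact = n * x0 ** (n - 1)
    print(n, round(approx, 6), round(exact, 6))  # the two columns agree
```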

Origin of the Power Rule for Integration

The power rule for integration, $\int x^n \, dx = \frac{x^{n+1}}{n+1} + C$ (for $n \neq -1$), is essentially the reverse of the power rule for differentiation.

If we differentiate $\frac{x^{n+1}}{n+1}$ with respect to $x$:

$$\frac{d}{dx} \left( \frac{x^{n+1}}{n+1} \right) = \frac{1}{n+1} \frac{d}{dx} (x^{n+1})$$

Applying the power rule for differentiation (with the exponent being $n+1$):

$$= \frac{1}{n+1} (n+1)x^{(n+1)-1} = x^n$$

Since the derivative of $\frac{x^{n+1}}{n+1}$ is $x^n$, it follows by the definition of the antiderivative (indefinite integral) that:

$$\int x^n \, dx = \frac{x^{n+1}}{n+1} + C$$

The constant of integration, $C$, arises because the derivative of a constant is always zero.


The indefinite integral of $x+1$ with respect to $x$ is found by applying the power rule of integration and the linearity of integration. Here’s the step-by-step process:

We want to find:

$$\int (x + 1) \, dx$$

Using the linearity of integration, which states that $\int [f(x) + g(x)] \, dx = \int f(x) \, dx + \int g(x) \, dx$, we can split the integral into two parts:

$$\int (x + 1) \, dx = \int x \, dx + \int 1 \, dx$$

Now, let’s integrate each part separately using the power rule for integration, $\int x^n \, dx = \frac{x^{n+1}}{n+1} + C$ (for $n \neq -1$).

For the first part, $\int x \, dx$, we have $n = 1$:

$$\int x^1 \, dx = \frac{x^{1+1}}{1+1} + C_1 = \frac{x^2}{2} + C_1$$

For the second part, $\int 1 \, dx$, we can think of 1 as $x^0$, so $n = 0$:

$$\int x^0 \, dx = \frac{x^{0+1}}{0+1} + C_2 = \frac{x^1}{1} + C_2 = x + C_2$$

Combining the results of the two integrals, we get:

$$\int (x + 1) \, dx = \frac{x^2}{2} + C_1 + x + C_2$$

Since $C_1$ and $C_2$ are arbitrary constants, their sum is also an arbitrary constant, which we can denote as $C$:

$$\int (x + 1) \, dx = \frac{x^2}{2} + x + C$$

Therefore, the indefinite integral of $x+1$ is $\frac{x^2}{2} + x + C$.
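As a quick numerical check, a midpoint Riemann sum for $\int_0^2 (x + 1)\,dx$ should match $F(2) - F(0)$ with $F(x) = \frac{x^2}{2} + x$ (the interval $[0, 2]$ is an arbitrary choice; the constant $C$ cancels in definite integrals):

```python
# Midpoint Riemann sum for the definite integral of (x + 1) over [0, 2],
# compared against the antiderivative F(x) = x**2 / 2 + x found above.

def F(x):
    return x**2 / 2 + x  # the constant C cancels in definite integrals

a, b, n = 0.0, 2.0, 10_000
dx = (b - a) / n
riemann = sum(((a + (i + 0.5) * dx) + 1) * dx for i in range(n))

print(round(riemann, 6), F(b) - F(a))  # both give 4.0
```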

import numpy as np
import matplotlib.pyplot as plt

# Function to plot and show derivative as tangent
def visualize_derivative(func, x, h=0.01):
    y = func(x)
    slope = (func(x + h) - func(x)) / h
    tangent = lambda t: slope * (t - x) + y

    x_vals = np.linspace(x - 2, x + 2, 400)
    y_vals = func(x_vals)
    tangent_vals = tangent(x_vals)

    plt.figure(figsize=(10, 6))
    plt.plot(x_vals, y_vals, label='f(x)')
    plt.scatter(x, y, color='red', label=f'Point (x, f(x)) = ({x:.2f}, {y:.2f})')
    plt.plot(x_vals, tangent_vals, '--g', label=f'Tangent at x={x:.2f}, slope={slope:.2f}')
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.title('Visualizing Derivative as Tangent')
    plt.legend()
    plt.grid(True)
    plt.ylim(min(y_vals) - 1, max(y_vals) + 1)
    plt.show()

# Function to approximate integral using rectangles
def visualize_integral(func, a, b, n=50):
    x = np.linspace(a, b, n+1)
    dx = (b - a) / n
    x_mid = (x[:-1] + x[1:]) / 2
    y_mid = func(x_mid)
    area = np.sum(y_mid * dx)

    x_fine = np.linspace(a, b, 400)
    y_fine = func(x_fine)

    plt.figure(figsize=(10, 6))
    plt.plot(x_fine, y_fine, label='f(x)')
    plt.bar(x[:-1], y_mid, width=dx, alpha=0.7, edgecolor='black', label=f'Approximate Area = {area:.2f} (n={n})')
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.title('Visualizing Integral as Area Under Curve')
    plt.legend()
    plt.grid(True)
    plt.show()

# Example function: f(x) = x^2
def f(x):
    return x**2

# Visualize the derivative at x = 2
visualize_derivative(f, 2)

# Visualize the integral of f(x) from 0 to 3
visualize_integral(f, 0, 3)
# Your code here