Calculus for Machine Learning
Difficulty: Intermediate | Time: 60-90 minutes | Key Concepts: Derivatives, Gradients, Optimization
Why Calculus Matters for ML
Training neural networks is an optimization problem. Calculus (specifically derivatives) tells us how to adjust parameters to improve the model.
1. Derivatives Explained
What is a Derivative?
The derivative of a function measures how much the output changes when the input changes slightly.
Geometric Interpretation
The derivative is the slope of the tangent line to the function at a point.
Mathematical Definition
f'(x) = lim(h→0) [f(x+h) - f(x)] / h
Simpler intuition: How steep is the curve at this point?
Common Derivatives (Power Rule)
f(x) = x^n → f'(x) = n × x^(n-1)
Examples:
f(x) = x² → f'(x) = 2x
f(x) = x³ → f'(x) = 3x²
f(x) = x → f'(x) = 1
f(x) = 5 → f'(x) = 0 (constant has zero derivative)
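As a quick sanity check, a central-difference approximation recovers these results numerically. This is a minimal sketch; numerical_derivative is an illustrative helper, not a library function.

# Central difference: (f(x+h) - f(x-h)) / (2h) ≈ f'(x)
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(lambda x: x**2, 3.0))  # ≈ 6.0, matches f'(x) = 2x
print(numerical_derivative(lambda x: x**3, 2.0))  # ≈ 12.0, matches f'(x) = 3x²
print(numerical_derivative(lambda x: 5.0, 4.0))   # ≈ 0.0, constant has zero derivative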
2. The Chain Rule (Most Important!)
Formula
If y = f(g(x)), then dy/dx = df/dg × dg/dx
Or more simply: derivative of the outside (evaluated at the inside) × derivative of the inside
Real Example
f(x) = (2x + 3)²
Let u = 2x + 3 (inside function)
Then f(x) = u² (outside function)
du/dx = 2
df/du = 2u = 2(2x + 3)
df/dx = df/du × du/dx = 2(2x + 3) × 2 = 4(2x + 3) = 8x + 12
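To confirm the algebra, the same finite-difference idea can check df/dx = 8x + 12 at a sample point. A small illustrative sketch:

def f(x):
    return (2*x + 3)**2

def analytic_derivative(x):
    return 8*x + 12  # result from the chain rule above

h = 1e-6
x = 1.5
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(numeric, analytic_derivative(x))  # both ≈ 24.0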
Why Chain Rule Matters for ML
Neural networks are compositions of functions. Backpropagation is just the chain rule applied repeatedly!
3. Partial Derivatives
What Happens When You Have Multiple Variables?
Partial derivatives measure how a function changes with respect to ONE variable, holding others constant.
Notation & Example
f(x, y) = x² + 3xy + y²
∂f/∂x = 2x + 3y (treat y as constant)
∂f/∂y = 3x + 2y (treat x as constant)
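A quick numerical check of these partials (illustrative sketch): perturb one variable at a time while holding the other fixed.

def f(x, y):
    return x**2 + 3*x*y + y**2

h = 1e-6
x, y = 2.0, 1.0
df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # ≈ 2x + 3y = 7
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # ≈ 3x + 2y = 8
print(df_dx, df_dy)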
4. Gradients (Vector of Derivatives)
What is a Gradient?
The gradient is a vector containing all partial derivatives. It points in the direction of steepest increase.
Notation
∇f = [∂f/∂x₁, ∂f/∂x₂, ∂f/∂x₃, ...]
Example for f(x, y, z) = x² + y² + z²:
∇f = [2x, 2y, 2z]
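In code, the gradient is just those partials stacked into a vector. A minimal sketch for the example above:

import numpy as np

def grad_f(point):
    # Analytic gradient of f(x, y, z) = x² + y² + z²
    x, y, z = point
    return np.array([2*x, 2*y, 2*z])

print(grad_f(np.array([1.0, -2.0, 0.5])))  # [ 2. -4.  1.]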
Why Gradients Matter
The negative gradient points in the direction of steepest decrease, so gradient descent follows -∇L to reduce the loss!
5. Gradient Descent Explained
The Algorithm
Repeat until convergence:
θ_new = θ_old - α × ∇L(θ_old)
Where:
θ = model parameters
α = learning rate (step size)
L = loss function
∇L = gradient of loss
Intuition
Start at a random point. Look at the slope (gradient). Take a step in the opposite direction (downhill). Repeat until you reach the bottom.
Learning Rate Matters
- Too small: Very slow training, takes forever to converge
- Just right: Steady progress toward minimum
- Too large: Might overshoot the minimum or diverge
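The sketch below runs the same 1-D gradient descent on f(x) = x² with three learning rates to illustrate these regimes (the specific values are illustrative):

def minimize(learning_rate, steps=20, x=5.0):
    for _ in range(steps):
        x = x - learning_rate * 2 * x  # gradient of x² is 2x
    return x

print(minimize(0.01))  # too small: still far from 0 after 20 steps
print(minimize(0.1))   # reasonable: close to the minimum at 0
print(minimize(1.1))   # too large: the iterates diverge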
6. Second Derivatives & Curvature
Second Derivative
f(x) = x²
f'(x) = 2x
f''(x) = 2
The second derivative tells us about curvature:
f'' > 0: Curving upward (convex)
f'' = 0: No curvature at that point (a straight line if this holds everywhere; often an inflection point)
f'' < 0: Curving downward (concave)
In Optimization
Second derivatives help determine if we’re at a minimum, maximum, or saddle point.
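A finite-difference estimate of the second derivative (illustrative sketch) recovers the constant curvature of f(x) = x²:

def f(x):
    return x**2

h = 1e-4
x = 3.0
second = (f(x + h) - 2*f(x) + f(x - h)) / h**2  # central second difference
print(second)  # ≈ 2.0, positive, so the function is convex here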
7. Backpropagation (Gradient Computation)
Simple Neural Network
Input (x) → Weight (w) → Bias (b) → Sigmoid → Loss (L)
y = sigmoid(wx + b)
L = (y - target)²
To train: Find ∂L/∂w and ∂L/∂b
By chain rule:
∂L/∂w = ∂L/∂y × ∂y/∂(wx+b) × ∂(wx+b)/∂w
∂L/∂b = ∂L/∂y × ∂y/∂(wx+b) × ∂(wx+b)/∂b
Why It’s Called Backpropagation
We compute gradients by working backward from the loss to each parameter.
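A minimal by-hand sketch of these gradients for the single-neuron network above (the input, target, and parameter values are illustrative):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass
x, target = 1.5, 1.0
w, b = 0.4, 0.1                   # illustrative starting parameters
z = w * x + b
y = sigmoid(z)
L = (y - target)**2

# Backward pass: chain rule, factor by factor
dL_dy = 2 * (y - target)          # ∂L/∂y
dy_dz = y * (1 - y)               # ∂y/∂z for the sigmoid
dL_dz = dL_dy * dy_dz
dL_dw = dL_dz * x                 # ∂z/∂w = x
dL_db = dL_dz * 1.0               # ∂z/∂b = 1
print(dL_dw, dL_db)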
8. Python Examples
Manual Gradient Descent
import numpy as np

# Simple function: f(x) = x²
def f(x):
    return x**2

def f_derivative(x):
    return 2*x

# Start at x = 5
x = 5.0
learning_rate = 0.1
print(f"Starting at x = {x}, f(x) = {f(x)}")

for iteration in range(10):
    gradient = f_derivative(x)
    x = x - learning_rate * gradient
    print(f"Iteration {iteration+1}: x = {x:.4f}, f(x) = {f(x):.4f}")

# Should converge to x = 0 (minimum)
Using Autograd for Automatic Differentiation (requires the third-party autograd package)
import autograd.numpy as np
from autograd import grad

# Define function
def f(x):
    return x**2 + 3*x + 2

# Compute gradient automatically
f_grad = grad(f)

x = 5.0
print(f"f'(5) = {f_grad(5.0)}")  # Should be 13 (= 2*5 + 3)

# Use for gradient descent
x = 5.0
for i in range(5):
    grad_value = f_grad(x)
    x = x - 0.1 * grad_value
    print(f"x = {x:.4f}, f(x) = {f(x):.4f}")
9. Key Rules to Remember
Common Derivatives
| Function | Derivative |
|---|---|
| c (constant) | 0 |
| x^n | n × x^(n-1) |
| e^x | e^x |
| ln(x) | 1/x |
| sin(x) | cos(x) |
| cos(x) | -sin(x) |
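If SymPy is installed, these rules can be checked symbolically. This is an optional sketch, not required for the rest of the material:

import sympy as sp

x = sp.symbols('x')
for expr in (x**3, sp.exp(x), sp.log(x), sp.sin(x), sp.cos(x)):
    print(expr, "->", sp.diff(expr, x))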
10. Optimization Algorithms Beyond Gradient Descent
Momentum
Accumulate gradients to build momentum, like a ball rolling downhill.
Adam (Adaptive Moment Estimation)
One of the most widely used optimizers. Combines momentum with an adaptive learning rate for each parameter individually.
RMSprop
Keeps a running average of squared gradient magnitudes to scale the learning rate per parameter.
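A minimal sketch of the momentum and Adam update rules applied to the earlier f(x) = x² example. The hyperparameter values are common defaults, shown only for illustration:

import math

def grad(x):
    return 2 * x  # gradient of f(x) = x²

# Momentum: accumulate a velocity term
x, v = 5.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(100):
    v = beta * v + grad(x)
    x = x - lr * v
print("momentum:", x)

# Adam: running averages of the gradient and its square, with bias correction
x, m, s = 5.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(x)
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    s_hat = s / (1 - b2**t)
    x = x - lr * m_hat / (math.sqrt(s_hat) + eps)
print("adam:", x)

# Both runs should end near the minimum at x = 0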
Key Takeaways
- Derivatives measure how functions change
- Chain rule connects derivatives through function compositions
- Gradients point in the direction of steepest increase
- Gradient descent follows negative gradient to minimize loss
- Learning rate controls step size
- Backpropagation computes gradients in neural networks
You’ve Completed Math Foundations!
Congratulations! You now have the mathematical foundation needed for AI/ML. Next step: Apply this knowledge in hands-on projects and advanced ML courses.