When Machines Learn to Differentiate
A simple, intuitive look at how computers compute derivatives automatically.
Every deep learning model has parameters, for simplicity think of weights and biases, that help it decide how much importance to give to different features in the data so that it produces outputs close to the ground truth.
These parameters are learned during training, and that learning relies on a neat mathematical trick called differentiation.
During training, we show the model inputs whose correct outputs are already known. The model makes its predictions, and we measure how far those predictions are from the actual answers. This difference is called the loss.
The goal is then to adjust the parameters to reduce this loss. Each parameter is updated based on how much it influences the loss: parameters that have a larger impact are adjusted more, and those with smaller effects are adjusted less.
In mathematical terms, we measure this influence through the derivative, the rate of change of the loss with respect to each parameter. The derivative tells us which direction to move (increase or decrease the parameter) and by how much, guiding the model to learn step by step.
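In symbols, this is the familiar gradient-descent update, where η is a small step size called the learning rate (other update rules exist, but this is the simplest):

```latex
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}
```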
What Is a Derivative?
A derivative is simply a rate of change.
Imagine you’re driving a car, and your speed increases by 10 units every second. That means the rate of change of your speed, or the derivative of your speed, is 10 units per second.
Here’s how to build a practical intuition for derivatives:
If the value of a function is increasing, its rate of change is positive, so the derivative is positive.
If the value of a function is decreasing, its rate of change is negative, so the derivative is negative.
If the function’s value stays constant, the rate of change is zero, meaning the derivative is zero.
Now, imagine two functions: one that rises sharply and another that increases more gently. The function that changes more rapidly has a larger derivative, while the slower one has a smaller derivative.
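Formally, the derivative is this rate of change taken in the limit of infinitesimally small steps, a definition worth keeping in mind for what follows:

```latex
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
```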
Manual Differentiation
In calculus, we learn rules for sums, products, quotients, and, most importantly, compositions (the chain rule), which let us compute the derivative of almost any function step by step.
So, one option is to compute your model’s derivatives by hand using these rules and then code them in. But for a deep neural network, that’s simply not practical.
Every time you tweak the model structure, you’d have to redo all those derivative calculations. That’s manual differentiation, accurate but tedious and unscalable.
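To make the tedium concrete, here is a minimal sketch of manual differentiation for a toy one-neuron model; the model, the data, and the hand-derived formulas are all illustrative assumptions:

```python
# Manual differentiation: derive the gradients on paper, then hard-code them.
# Toy one-neuron model: pred = w * x + b, loss = (pred - y) ** 2

def grads(w, b, x, y):
    pred = w * x + b
    # Derived by hand with the chain rule:
    #   dL/dw = 2 * (pred - y) * x
    #   dL/db = 2 * (pred - y)
    return 2 * (pred - y) * x, 2 * (pred - y)

dw, db = grads(w=1.0, b=0.0, x=2.0, y=5.0)
print(dw, db)  # -12.0 -6.0
```

Add a hidden layer or swap the loss, and both formulas have to be re-derived and re-coded by hand.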
Numeric Differentiation
Since a derivative measures a rate of change, a straightforward way to compute it is to slightly increase the input value and see how much the output changes. The ratio of the output change to the input change gives you the derivative.
This method is called numerical differentiation.
The principle is simple: nudge the input by a small amount and compute how much the function changes per step of that nudge. It is easy to implement but comes with an important question: how small should that step be?
A smaller step gives a more accurate estimate, but computers also have limits on how precisely they can represent small numbers. As functions grow more complex, i.e. as networks get deeper in deep learning, these errors accumulate and become significant.
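A minimal sketch of the idea, and of the step-size dilemma, assuming a simple test function:

```python
# Numerical differentiation by forward differences: (f(x + h) - f(x)) / h

def numeric_derivative(f, x, h):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 3  # true derivative at x = 2 is 3 * 2**2 = 12

for h in (1e-1, 1e-5, 1e-13):
    print(h, numeric_derivative(f, 2.0, h))
# h = 1e-1  -> ~12.61    (step too large: truncation error)
# h = 1e-5  -> ~12.00006 (close to the true value)
# h = 1e-13 -> visibly off (step too small: floating-point round-off dominates)
```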
Symbolic Differentiation
In symbolic differentiation, a computer applies those same calculus rules for you: you pass in a function, and it computes the derivative automatically, with none of the manual work.
For a given function, the computer builds a symbolic computation tree representing the entire function, then applies differentiation rules to that tree.
It’s precise and automatic, but there’s a catch: for large models, the symbolic representation grows very quickly, a problem known as expression swell.
This makes it memory-heavy and inefficient for deep learning models that can have millions of interconnected operations.
In short, symbolic differentiation tries to expand the entire function algebraically before evaluating it. That’s great for math textbooks or computer algebra systems like SymPy or Mathematica, but not for real-world neural networks.
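For instance, SymPy makes this a one-liner, and a few nested compositions are enough to watch the expression swell (the function here is an arbitrary toy choice):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x) * sp.exp(x ** 2)

print(sp.diff(f, x))  # exact result, e.g. 2*x*exp(x**2)*sin(x) + exp(x**2)*cos(x)

# Nest the function a few times and count operations in its derivative:
g = f
for _ in range(3):
    g = sp.sin(g)
print(sp.count_ops(sp.diff(g, x)))  # the operation count balloons: expression swell
```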
Autodiff: The Best of Both Worlds
This is where Automatic Differentiation (autodiff) shines. It combines the ideas of symbolic and numeric differentiation to calculate derivatives exactly and efficiently, no approximations, no symbolic explosions.
Like symbolic differentiation, autodiff represents your model as a computation graph, but it doesn’t expand everything into a giant formula.
Instead, it tracks how each intermediate value is computed as the program runs. You can think of it as the computer taking careful notes during the forward pass, remembering which operations were applied and how they connect.
Then, when it’s time to update the parameters, autodiff moves backward through this graph, applying the chain rule step by step to find how each parameter affects the final loss. This backward pass is what we know as backpropagation and it’s just autodiff in action.
The difference from symbolic differentiation is subtle but powerful:
symbolic methods manipulate algebraic expressions before running the computation, while autodiff works during computation, using real numerical values and local derivatives.
That’s why autodiff can scale to deep models efficiently. It gives exact gradients without ever building or simplifying massive symbolic expressions.
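In practice this is one line in PyTorch, whose autograd is exactly this kind of reverse-mode autodiff; the toy model below mirrors the manual example from earlier:

```python
import torch

# Same toy one-neuron model, but no hand-derived formulas this time.
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(5.0)

pred = w * x + b        # forward pass: each operation is recorded in a graph
loss = (pred - y) ** 2

loss.backward()         # backward pass: chain rule applied through the graph
print(w.grad, b.grad)   # tensor(-12.) tensor(-6.): exact, no step size to tune
```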
Thanks for reading! In the next post, we’ll dive into the technical side of autodiff and even build a simple autograd engine from scratch.