How can you implement autograd from scratch?
Autograd, or automatic differentiation, may sound difficult, not only to implement but even to understand. This blog will help you learn how autograd works behind the scenes.
Autograd, or automatic differentiation, is a mathematical technique for computing derivatives automatically. This is extremely important in AI and ML because it helps determine how much each parameter (weight or bias) influences the final output, also known as the objective or loss function.
If you’re new to derivatives in ML, I recommend checking out our earlier blog on that topic.
Scenario
Consider a simple polynomial-style expression:
Our goal is to compute how much influence does inputs (x1 and x2) has on y.
Mathematically,
This tells us: If we change input slightly, how much will y change?
And importantly, we want to compute this automatically for any function, not manually every time.
This is exactly what autograd helps us do.
Creating a Computational Graph
If we compute the above equation normally (just plugging in numbers), it is easy to get the final numerical value of y.
However, computing the derivative requires more information:
What does each variable depend on?
In what order do these dependencies happen?
What operation links them together?
To capture all this, we break the expression into tiny elemental operations, each doing exactly one thing (addition, multiplication, log, sin, etc.).
For example, above equation
can be broken into;
Now, we represent each of these operations as nodes in a graph.
Each node stores:
its value,
its parents (the nodes it depends on),
the operation used, and,
a place to store its gradient (initialized to zero).
This structure is called a computational graph.
Each node is a variable, and the directed edges (arrows) represent how values flow from one operation to the next.
With this, we now have a blueprint of how the final output was computed. This will be crucial for computing derivatives.
Backward Propagation (Backprop)
Once the computational graph is built, we can compute the derivatives by moving backwards through the graph.
To understand this, imagine a tiny piece of a computational graph:
Suppose we already know the gradient of y with respect to V2 as
We now want, to know the gradient of V1 i.e. derivative of y with respect to V1.
Using the chain rule:
This is exactly how gradients “flow” backward.
The beautiful part is, in the computational graph, each pair of nodes is connected by only one simple, elemental operation.
So, hardcoding the derivative of these simpler operation is very easy.
How Backprop Works in the Graph
From previous section, we are clear that the gradient of child can be computed easily if gradient of parent is known. But, the question is what is the gradient of parent.
For parent node, the gradient is defined as the derivative of the output with respect to itself, which is 1:
So we start backprop by setting, gradient of y as 1.
Then, for each node, we:
Use its stored operation to compute how its gradient distributes to its parents.
Accumulate the gradient in each parent (because multiple paths may contribute).
Call backward() on its parents.
Each node therefore:
updates its children’s gradients
triggers their backward passes recursively
This continues until gradients reach the input variables. At the end, every leaf variable (like x1 and x2) contains its own gradient.
And that’s autograd, built from scratch.











