Introduction
Gradient Descent is a fundamental optimization algorithm in Machine Learning. It minimizes a function by iteratively moving in the direction of steepest descent; in other words, it tweaks a model's parameters step by step until the cost function reaches a minimum.
Suppose you are lost in the mountains in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum!
*Fig 1: Gradient Descent*
How Does It Work?
- Initialization: Start by filling the parameter vector θ with random values (this is called random initialization).
- Calculate the Gradient: Compute the gradient of the cost function (e.g., the MSE) with respect to the parameters. The gradient points in the direction of steepest ascent.
- Update Parameters: Take a step in the opposite direction of the gradient; the size of that step is determined by the learning rate.
- Iterate: Repeat steps 2 and 3 until the algorithm converges to a minimum (see the sketch below).
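To make these four steps concrete, here is a minimal sketch of batch Gradient Descent fitting a linear-regression model by MSE. The toy data, the learning rate of 0.1, and the 1,000-iteration budget are illustrative assumptions, not values from the text; the update applied at each step is θ ← θ − η·∇θMSE(θ).

```python
import numpy as np

# Toy data for y ≈ 4 + 3x (illustrative, not from the text)
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # prepend x0 = 1 so theta[0] is the bias

eta = 0.1            # learning rate (assumed value)
n_iterations = 1000  # iteration budget (assumed value)
m = len(X_b)

theta = rng.normal(size=(2, 1))  # step 1: random initialization

for _ in range(n_iterations):                        # step 4: iterate
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)  # step 2: MSE gradient
    theta = theta - eta * gradients                  # step 3: step downhill

print(theta)  # should land near [[4.], [3.]]
```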
What Is the Learning Rate?
An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time (see Figure 2). On the other hand, if the learning rate is too high, the algorithm might overshoot the valley and end up on the other side, possibly even higher than it started, which can make it diverge (see Figure 3).
*Fig 2: Learning Rate too small*

*Fig 3: Learning Rate too large*
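To see both failure modes numerically, here is a rough sketch on the one-dimensional cost J(x) = x², whose gradient is 2x; the three learning rates below are arbitrary illustrative choices, not values from the text.

```python
def gradient_descent_1d(learning_rate, n_steps=20, x=10.0):
    """Run a few gradient descent steps on J(x) = x^2, whose gradient is 2x."""
    for _ in range(n_steps):
        x = x - learning_rate * 2 * x  # step opposite the gradient
    return x

print(gradient_descent_1d(0.001))  # too small: after 20 steps, x is still ~9.6
print(gradient_descent_1d(0.1))    # reasonable: x ends up close to the minimum at 0
print(gradient_descent_1d(1.1))    # too large: each step overshoots, |x| blows up
```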
Challenges
Not all cost functions are smooth, convex bowls; some have local minima and long plateaus, which creates two pitfalls:
- If the random initialization starts the algorithm on the left of such a curve, it will converge to a local minimum, which is not as good as the global minimum.
- If it starts on the right, it will take a very long time to cross the plateau, and if you stop too early, you will never reach the global minimum.
Types of Gradient Descent
- Batch Gradient Descent: Calculates the gradient over the entire dataset in each iteration; the descent is smooth, but each step is expensive on large datasets.
- Stochastic Gradient Descent (SGD): Calculates the gradient for a single, randomly picked data point in each iteration; steps are cheap, but the cost bounces around instead of decreasing smoothly.
- Mini-batch Gradient Descent: Calculates the gradient for a small batch of data points in each iteration, a compromise between the two (see the sketch below).
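As a rough sketch of the difference, the three helpers below (hypothetical names, not a standard API) each perform one parameter update using the standard MSE gradient for linear regression, (2/m)·Xᵀ(Xθ − y), but on different amounts of data:

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_gradient(X_b, y, theta):
    # Standard MSE gradient for linear regression: (2/m) * X^T (X theta - y)
    m = len(X_b)
    return (2 / m) * X_b.T @ (X_b @ theta - y)

def batch_step(X_b, y, theta, eta):
    # Batch GD: one update computed over the entire dataset
    return theta - eta * mse_gradient(X_b, y, theta)

def sgd_step(X_b, y, theta, eta):
    # SGD: one update computed from a single randomly chosen instance
    i = rng.integers(len(X_b))
    return theta - eta * mse_gradient(X_b[i:i + 1], y[i:i + 1], theta)

def mini_batch_step(X_b, y, theta, eta, batch_size=16):
    # Mini-batch GD: one update computed from a small random subset
    idx = rng.choice(len(X_b), size=batch_size, replace=False)
    return theta - eta * mse_gradient(X_b[idx], y[idx], theta)
```

All three apply the same update rule; they differ only in how much data feeds each gradient estimate, trading per-step cost against the noisiness of the descent.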




