Introduction

Gradient Descent is a fundamental optimization algorithm in Machine Learning: it minimizes a function by iteratively moving in the direction of steepest descent.

The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Suppose you are lost in the mountains in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum!


Fig 1: Gradient Descent


How Does It Work?

  1. Initialization: Start by filling the parameter vector θ with random values (this is called random initialization).
  2. Calculate the Gradient: Compute the gradient of the cost function (e.g., the MSE) with respect to the parameters. The gradient indicates the direction of steepest ascent.
  3. Update Parameters: Take a step in the opposite direction of the gradient. This step size is determined by the learning rate.
  4. Iterate: Repeat steps 2 and 3, taking one small step at a time, until the algorithm converges to a minimum (a minimal sketch of this loop follows below).
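
To make these four steps concrete, here is a minimal Python sketch (NumPy) of batch Gradient Descent for linear regression with the MSE cost. The toy dataset, learning rate eta, and iteration count are illustrative choices, not values from this article.

```python
import numpy as np

# Toy dataset: y ≈ 4 + 3x plus noise (hypothetical data for illustration)
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]   # add bias term x0 = 1 to every instance

eta = 0.1            # learning rate
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)        # step 1: random initialization

for _ in range(n_iterations):
    # step 2: gradient of the MSE cost with respect to theta
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    # step 3: move against the gradient, scaled by the learning rate
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4.], [3.]]
```

Each pass through the loop performs steps 2 and 3 once; the loop itself is step 4.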

What Is the Learning Rate?

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time (see Figure 2).


Fig 2: Learning Rate too small

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution (see Figure 3).

Fig 3: Learning Rate too large
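
You can see both failure modes numerically. The sketch below (illustrative values, not tied to the figures above) applies the same update rule to the one-dimensional cost J(θ) = θ², whose gradient is 2θ, with three different learning rates:

```python
def run_gd(eta, n_steps=20, theta0=5.0):
    """Gradient descent on J(theta) = theta**2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * 2 * theta
    return theta

for eta in (0.01, 0.1, 1.1):   # too small, reasonable, too large
    print(f"eta={eta}: theta after 20 steps = {run_gd(eta):.4f}")
```

With eta = 0.01 the parameter has barely moved after 20 steps, with eta = 0.1 it is essentially at the minimum, and with eta = 1.1 it overshoots the valley on every step and diverges to larger and larger values.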

Challenges

Not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. Figure 4 shows the two main challenges with Gradient Descent: 
  1. If the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum (see the sketch after this list).
  2. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.
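
The first of these challenges is easy to reproduce. The sketch below uses a made-up non-convex cost J(θ) = θ⁴ - 2θ² + 0.5θ (purely illustrative) and runs the same Gradient Descent loop from two different starting points:

```python
def gradient_descent(theta, eta=0.01, n_steps=2000):
    """Minimize J(theta) = theta**4 - 2*theta**2 + 0.5*theta (a non-convex toy cost)."""
    for _ in range(n_steps):
        grad = 4 * theta**3 - 4 * theta + 0.5   # dJ/dtheta
        theta = theta - eta * grad
    return theta

# Two different initializations end up in two different minima.
print(gradient_descent(-2.0))  # converges near -1.06 (the global minimum)
print(gradient_descent(+2.0))  # converges near +0.94 (a worse, local minimum)
```

Both runs follow the gradient downhill, but only one ends at the global minimum; which one you get depends entirely on the random initialization.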

Feature Scaling: Features should be scaled to a similar range before training. When features have very different scales, the cost function is stretched out in some directions, so Gradient Descent takes a long, roundabout path to the minimum instead of heading straight for it.
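
A common way to do this is standardization (subtract the mean and divide by the standard deviation, per feature). A minimal NumPy sketch with a hypothetical feature matrix X:

```python
import numpy as np

X = np.array([[1.0, 2000.0],     # hypothetical features on very different scales
              [2.0, 3000.0],
              [3.0, 1000.0]])

# Standardize each feature: zero mean, unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)   # every column now has mean 0 and standard deviation 1
```

Scikit-Learn's StandardScaler performs the same transformation.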


Types of Gradient Descent

  1. Batch Gradient Descent: Calculates the gradient for the entire dataset in each iteration.
  2. Stochastic Gradient Descent (SGD): Calculates the gradient for a single, randomly chosen data point in each iteration.
  3. Mini-batch Gradient Descent: Calculates the gradient for a small batch of data points in each iteration (a rough sketch of all three follows below).
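
All three variants use the same update rule and differ only in how many training instances go into each gradient computation. A rough sketch (NumPy; the helper names and data shapes are assumptions for illustration, with X_b including the bias column and y shaped (m, 1)):

```python
import numpy as np

def gradient(theta, X_b, y):
    """Gradient of the MSE cost for linear regression on the given instances."""
    return 2 / len(X_b) * X_b.T @ (X_b @ theta - y)

def train(X_b, y, batch_size, eta=0.1, n_epochs=50):
    theta = np.random.randn(X_b.shape[1], 1)
    m = len(X_b)
    for _ in range(n_epochs):
        indices = np.random.permutation(m)          # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            theta -= eta * gradient(theta, X_b[batch], y[batch])
    return theta

# batch_size = m -> Batch GD, batch_size = 1 -> SGD, 1 < batch_size < m -> Mini-batch GD
```

In practice Mini-batch Gradient Descent is the usual compromise: each update is cheaper than in Batch GD and less noisy than in SGD.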

Conclusion

Gradient Descent is a powerful optimization algorithm that forms the backbone of many machine learning models. By understanding its principles and challenges, you can effectively apply it to your own projects.