Understanding Optimization in ML with Gradient Descent & implement SGD Regressor from scratch

19 min readNov 3, 2019

Gradient Descend Visualization. Credit: rasbt.github.io

We’ll solve a machine learning problem with help of differentiation, and that is what optimization is all about basically. With the help of concepts discussed in this blog we’ll implement a deployment level SGD regressor model from scratch.

Why should we care about Optimization:

Optimization is an important concept not only in the context of Machine learning but also in general. At very high-level, optimization is all about making the best use of an available resource or choosing the best solution out of many possible solutions.

On Google, this is the meaning of optimization- “The action of making the best or most effective use of a situation or resource”.

Examples where we use some sort of optimization in our day to day activities like planning a day or a trip, taking the best route while travelling, etc.

If speaking of Machine learning, We’re solving a problem then our end goal would be to have the best solution that maximizes the profit or minimizes the loss. So It makes sense to apply optimization while solving a machine learning problem to get an optimized solution that minimizes the loss.

Differential calculus to understand optimization in Machine learning:

We looked at a very high-level meaning of optimization. And now we’ll look into differentiation, maxima and minima and we’ll try to connect the dots between understand the optimization much better.

(PS — I’m not going to explain you in-depth that what is differential calculus and all. Of course, I’m limited to Optimization so I’ll try to explain only that is necessary to understand the optimization better. )

Well, 99% of Machine learning optimization depends on these two ideas mainly -

Differentiation
Maxima and Minima

Differentiation:

We’ll look at a very simple example of a single variable differentiation. Let’s first consider this graph:

Here, let’s consider a function is y=f(x). It could be any function. (You’ll understand it later please beware with me for now). x is a scaler value.

First, let’s understand what does dy/dx intuitively mean?

This is — How much does y changes as x changes or rate of change of y with respect to x. To understand it better let’s re-draw the above graph:

To understand the change in y w.r to change in x let’s consider x1 and x2 on the x-axis and corresponding y1 and y2 on the y-axis in the above graph. So as you can see we’ll get a right angle triangle as:

In the right-angle triangle from the above graph, hypotenuse (tan 𝛉) tells that how much does y changes as x changes. So we can write this in a formulation as:

Mathematically, we can also write as:

So we have only added the limit( Δx tend to 0) to the above formulation and can be read as differentiation of y w.r to x is — the rate of change of y w.r to x as the change in x tends to zero. So as x2 comes closer and closer to x1 and if the difference is very small between x2 and x1 then we’ll get a tangent(green line) at x1 as (we can see it in the graph below):

Tangent in the above graph is — tangent of f(x) at x1 as Δx tends to zero. That means as x2 moves close to x1 then the difference between x2 and x1 becomes very small then a tangent is formed on the point x1 touching the x-axis and that gives an angle 𝛉. So to compute the derivative of f(x) w.r to x at x1 we take the slope of the tangent at x1. We can write the formulation as:

Considering the tan𝛉, here 𝛉 is the angle formed by tangent and x-axis as Δx tends to 0. We have seen this in the above graph. So let’s see what will happen for different values of 𝛉.

Here 𝛉 is formed by tangent and x-axis so 𝛉 could be any value depending on how does the tangent is formed as we can see in the above plot. Let’s consider different circumstances:

If 90>𝛉>0 between tangent and x-axis then tan𝛉 = +Ve as we can see in the above plot.
If 180>𝛉>90 between tangent and x-axis then tan𝛉 = -Ve as we can see in the above plot.
If 𝛉=0 between tangent and x-axis then tan𝛉 = 0 as we can see in the above plot.

I added this above plot to just explain that if we know the 𝛉 then can show that derivative of f(x) w.r to x at some point (xi) is going to be -Ve, +Ve or 0.

To summarize all — Tangent line = Derivative = Instantaneous rate of change.

Maxima & Minima:

Maxima and Minima are the largest and smallest values of function f(x). This value could be the largest or smallest in a given range(Local) or the entire domain(Global). Let’s look at different plots:

So from the above plot, it might be clear that some function f(x) may have only maxima or minima or maxima and minima both. Some functions may also have multiple maxima and minima i.e, global and locals. So let’s look at the below plot:

So in the above plot function f(x) have global and local maxima and minima. For more in detail, you can follow this — [source]

So we got the idea about maxima and minima of a function f(x). Now let’s connect the dots between maxima minima and derivative that we have discussed previously in the differentiation section in this blog:

Let’s take a function f(x) having only minima. Considering the below plot:

Let’s say x1 as minima and if we take the derivative of function w.r to x at x1 then that would be equal to zero because at minima of the function we’ll get tangent parallel to x-axis hence 𝛉=0 that will result in tan𝛉=0 that is nothing but derivative at minima(x1) of function f(x). So to summarise this we can write as:

An Important conclusion from above we can take is- Derivative of any function at minima/maxima is equal to zero. Or derivative of a function equals to zero tells the minima or maxima.

For example, consider a situation in which you have a loss function f(x) of ML or any problem. Your end goal would be finding the value x for which you have a minimal loss. So we can simply find it by computing derivatives on different xi points. We can do it iteratively by going through one by one point at a time until we did not reach very close to minima. But this whole process is not an easy task if we have some complex loss functions like mean squares error or others.

But for that, we have a gradient descent algorithm. It makes easy to compute minima or maxima for any function. Gradient descent is an iterative algorithm.

Gradient descent Algorithm:

At a very high level, we can say this algorithm randomly selects a value as a minimal or optimal then keep changing the value in an iterative manner and the end, it gives a value that is actually very close to minima. We’ll see how does it solve very complex ML problems with the help of derivatives.

First, let’s understand the behaviour of derivative (or slop) for that let’s consider below plot:

In the above plot, assume at x = 0 we have optimal value for x, as we can also see it visually. And followings are the observation we can take from the above plot that is -

One side of minima slop is +Ve as the angle is 90>𝛉>0 so in that case, tan𝛉 would always be +ve.
Another side of Minima slop is -Ve as the angle is 180>𝛉>90 so in that case, tan𝛉 would always be -ve.
And exactly at Minima slop is 0 because at Minima as the angle 𝛉=0 so in that case, tan𝛉 =0.

So we can take the following conclusion from the above plot as-

We can say that slop changes the sign from +Ve to -Ve or -Ve to +Ve at Minima.
As we converge (move closer) to the Minima from +Ve side then slop reduces as we can see and if we converge to the minima from -Ve side then slop increases.

Geometric Intuition behind this algorithm:

As this is an iterative algorithm so It would be better to understand in iterations or steps. First, let’s take a function f(x) for that we have to compute minima (x*). Following are the steps:

1. In the first iteration: the algorithm randomly selects the value. The first step is also called an Initialization step. And value or point we select randomly is called initialization point. Let’s say Initialization point is x0. In the below plot, initial point x0 is in +Ve side of Minima. As x0 is randomly selected so x0 may lie in +Ve or -Ve side of Minima. In my case, I’m assuming x0 is in +Ve side of minima. Either x0 lies in +Ve or -Ve side of Minima, all the steps would be the same. Below is the plot:

2. In the second iteration: As we can see above x0 (Initial point) is very far from optimal(x*) so we’ll have to find other point x1 such that x1 is closer to optimal (x*) means x1 should be between x* and x0. So important question here is how to compute x1 such that it is closer to optimal one(x*)?

First, for the sake of simplicity let’s consider a point x1 in between x0 and optimal (x*) on the plot. After that, we’ll look at mathematical formulation to compute the value for x1.

So to find value for x1 we solve the following equation:

Here “r” is the learning rate or step size (I’ll explain in detail in a later section for now just assume step size is 1 and is constant) and the step size is always positive.

So on solving the above formulation, we’ll get value for x1 such that x1 is more closer to optimal minima. Okay, let me break it down-

Here step size is constant i.e, 1
And we compute the slope at x0 (or say derivative of f(x) w.r to x at x0) that would be positive as x0 is in +Ve side of Minima (because of 90>𝛉>0)
So on multiplying step-size and slope at x0(+Ve) will result in a +Ve value
Now we’re subtracting the +Ve value from x0 then we’ll get the value for x1 that would be lesser than x0.

3. In the third iteration: So we got the x1 after the second iteration but x1 is still far from optimal value so now we’ll have to find another value x2 closer to x* means between x1 and x*. We’ll find this value for x3 in the same way as we computed in the previous step. First, let’s look at the below graph:

Now again solving the same equation to find x2 as:

So by solving the above equation, we’ll get value for x2 that would be now more closer to optimal x* or say minima. Here we have followed the exact same as the previous step but we are taking x1 instead of x0.

But still, x2 is not that closer to minima now so we can repeat this iteration for multiple times or say k times. And in the end, we’ll have the value very close to minima.

k. After kth iteration: Just assume we have run the algorithm for k-iterations and at the end of the kth iteration we have xk and that is very close to minima. This looks something like this on the plot:

So if xk is very very close to minima then we can terminate the algorithm and take xk as an optimal solution.

But the question here is after how many iterations or when to terminate the algorithm and also in real-world problems we won’t have any idea about minima/maxima so how one would find that xk is closer to minima/maxima or not?

Well, the answer is pretty simple -

In initial iterations, there would be larger updates towards minima but as iteration increases then update become smaller. When we start seeing very very small update then we can terminate the algorithm and consider xk as optimal one(x*). Let’s look at the below example:

In the above figure, we can see that in the initial iterations there are larger updates. So in later iterations updates become smaller and after many iterations, there would be almost negligible updates like very very small so then we can terminate the algorithm and take that xk as the optimal one. Also adding to this — The reason why updates become smaller is that as we move towards minima from +Ve side then slop reduces so that results in smaller updates.

Learning rate or step size:

Previously we took step-size (“r”) of constant value 1. It is not necessary that we take only constant step-size as sometimes there might be a problem if we take the constant step-size.

In the below figure, I have basically tried to explain why constant step-size is not always a good idea. So I took a function x² for which I’m trying to compute minima and step-size is constant=1.

So as you can see when I’m trying to compute xi+1 at xi then it just passes the minima. Again when I tried computing xi+2 to reach minima then again it passes the minima so there is a kind of oscillation due to the constant step size. If there would be a variable step size that reduces accordingly with iteration then we can be avoided such oscillation thing.

We can add variable step-size as- step-size will be reduced by some factor after every 1 or 2 or 4 iterations. We can design this adaptive step size as per our need.

Summary of Gradient descent:

In Gradient descent we saw how we can solve any problem and can find maxima or minima of that problem with the help of this algorithm in an iterative manner. We saw how we can use derivative at a point to reach closer to minima. At a high-level Gradient descent tries to find the minima or maxima of a given problem in an iterative manner.

Gradient descent for linear regression:

In linear regression, our main goal is to find the optimal hyperplane w(w*) that minimizes the mean squared loss. So we can write our optimization problem as:

Above w is variable and (xi, yi) is constant. For the sake of simplicity, I have excluded the intercept(b) and regularizer term. (I’ll add these in the code). So to find the optimal hyper-plane w* in each iteration we’ll have to compute Grad of the Loss function with respect to hyper-plane w(Grad means here we’re dealing with vector, not scaler). We can write it as:

So if we want to solve this Loss function L(w) using gradient descent to find the optimal w that minimizes the mean squared loss. Then we can follow in this iterative manner:

1.First iteration: Pick d-dimensional initialization vector w0 randomly. As [w1_0, w2_0, w3_0,…………]

2. Second iteration: In the second iteration, we update the wo(old) to w1 (new) and to find w1 we will have to solve the equation and that would look something like this:

In the above formulation: n is nos of points in the data set. w is a weight vector that we are trying to optimize.

3. In the third iteration, we again update the w1(old) to w2(new) and to find w2 we solve the equation :

k. In kth iteration: In the same way, we’ll run the algorithm for k iteration and at the end, we’ll have wk:

So if there is a very small update in the loss with wk compared to loss with wk-1 then we can terminate the algorithm and take the wk as optimal w*. Here I’m assuming step-size (‘r’) is reducing with each iteration. We can try both with constant and variable then select the best approach.

To summarize all:

So to summarise all we can find the minimal or optimal solution for any machine learning problem with the help of gradient descent algorithm. The standard way is:

Initialize the parameter we needed (Here weight vector w) and to update the initial value we compute grad of the loss function with respect to that parameter (here weight vector w) in each iteration. Like:

In each iteration computing grad of the loss function with respect to the desired parameter (in this case we need optimal w so w.r to w hyper-plane). Means if we want to find intercept b then we will compute grad of loss function w.r to b to update the value of b and follow the same approach as above to find optimal b. Like:

In below linear regression implementation section I’ll consider intercepting too then you’ll get a proper idea how can tune multiple parameters at once with the help of Gradient descent.

A Problem with gradient descent and solution for that:

If nos of data point(n) in data set very large then each iteration becomes very costly as in each iteration to compute grad of loss function we go through each data points. If nos of data points are like 100k or 400k then that’s still okay but assume if we have like 10 Million data points then that will require a lot of computing power and time too hence become costly. Consider the below figure to understand better :

So in the above figure, we can see if n=10 Million then in each iteration we’ll have to go through 10Million points to compute grad of the loss function for that one iteration.

So the solution to this is: Stochastic gradient descent also know as SGD-

Everything is the same as Gradient descent but only different is instead of taking n points to compute grad of loss function it randomly selects k sample from n data points in each iteration and computes grad of the loss function with that k data points in each iteration.

It selects k such that 1<k<n. Also, one thing to note is — SGD requires more iterations to converge minima or optimal solution compare to GD (Gradient descent) but still SGD would be much faster than Gradient descent and it is shown that the result we get with SGD is very close to the Gradient descent. Well, I would say with Gradient descent we can get better result but to save compute power and time we choose SGD. This is like a trade-off between performance and computation cost.

With SGD each step looks something like this:

We can see in above that now in each iteration to compute grad of loss function It will take only k unit time instead of n unit time.

Also, one thing to remember that It is not necessary to take k as a constant size and take a fixed number of the sample points from the population in each iteration. We can take any random number for ‘k’ in each iteration. Means if in itr_i k=1500 then it is not necessary for itr_1+1 we’ll take same 1500 number samples.

Solving the linear regression problem with SGD from scratch:

[IPYNB of this example is available on my GitHub at [src]]

Let me first explain some utility function that will be called by SGD algorithm in each iteration for the different purpose. Below are the utility functions:

Step_decay_lr: As I mentioned earlier that step-size or learning rate can be constant or variable. So below function reduces the learning rate by a factor after some epochs. The main purpose to add this is to give you an idea about variable learning rate and there are multiple ways in which we can implement that. Well, I found with constant learning rate performance on the test was better. For more, you can follow this — wiki page

Predict_point: It returns the predicted y for given weight vector w, a data point xi and an intercept b.

Compute_error: After every epoch, we compute mean squared error for the weight vector and intercept to check whether weight vector and intercept is optimal.

sgd_regressor:

This function takes the train and test data, initial learning rate/step size value, argument for wether learning rate would be constant or variable and total epochs to run the SGD.

At first, we do initialization of parameters like weight vector w and intercept b. There is a total of 13 features in data so weight vector would be of 13-dim. Intercept would be a single real-valued variable.

Initialization of weight vector and intercept

Now we’ll have to update the initial weight vector and intercept as we saw previously in the mathematical formulation. So to update w and b, we’ll have to compute grad of function w.r to w and b first then use learning rate and existing w and b for an update.

So as this is SGD so first we’ll randomly select k sample points from train data such that k<total nos of data points in train. So for this, we’re doing as:

This randomly selects k sample points from train data

Now once we have the k sample points than with these points we’ll compute grad of the loss function w.r to weight vector w and intercept b. So first let’s look at the mathematical formulation for which we’ll write code:

The mathematical formulation for which we’re going to write the code

So to compute grad of loss function w.r to w and b we’ll have to iterate through k points and keep summing up the result to one variable after every iteration for “k” points (Not taking whole train data for an update. Taking the only k points that’s why called stochastic gradient descendent or SGD). So below code snippet does exactly the same thing.

Iterating through each point to compute grad of loss function w.r to weight vector w and intercept b

w_track and b_track are created, to sum up, the result from each iteration to one final variable.

Once we’ll be done with computing grad of function w.r to weight vector and intercept then we can update the weight vector and intercept as.

After the end of every epoch, we can update the learning rate if the learning rate is not constant. I tired this variable but with constant, I got a better result. So you’ll have to try both. Below is the code snippet.

We also compute loss after each epoch to get a better idea about algorithm convergence towards an optimal solution. Below code computes and append the loss to the list.

Let’s call the sgd_regressor function:

Okay, so now we understand the code structure (If still confuse, please open IPYNB from this link. I have documented every step vert properly in the IPYNB) So we can call the function as:

In the above code snippet, I have declared nos of epoch to run and learning rate. The function returns the optimal weight vector (w) and intercept (b), error on test and train in every epoch.

After running SGD for 200 epochs with a constant learning rate, plot:

So as we can see by the 150 epochs algorithm has already converged to the optimal weight and intercept and with that, we’re having minimal mse on both train and test. And that is what we wanted, optimal w and b that minimizes the mean squared error on the test.

Compare with sklearn SGDRegressor:

Mean squared error on test with our SGDRegressor model is 27 and with sklearn SGDRegressor model is 25.91.

Well, the result is very good from our implementation of SGD. Very close to sklearn implementation of SGD Regressor. We can use this implementation in actual production as the end results are very good.

Extension to this code:

You can experiment with a concept like early stopping to stop SGD algorithm to reduce the computation cost. As we can see that we already have an optimal result by 150 epochs but SGD algorithm is still running for input epochs. Also, you can try various type of approach to reduce the learning rate adaptively. Add regularizer to lose function then compute grad of the loss function. etc, etc.

Just play with code and make a more optimized version of it. I’ll be happy to hear that. Just drop a mail or you can message me on Linkedin too.

Github: https://github.com/rumankhan1/

Linkedin: https://www.linkedin.com/in/rumankhan1/

References:

https://www.appliedaicourse.com/

https://en.wikipedia.org/wiki/Loss_function