Linear Regression
Diogo Nogueira · 9 min read
When learning any new subject there must always be a starting point, and for machine learning (ML) that is usually linear regression (LR). This simple method provides an intuitive predictor using only basic math, while still being able to solve many problems. In this post I intend to provide a basic introduction to LR for any beginner in ML.
What is Regression?
Regression is one of the two main ML problem types. The other, classification, predicts a discrete class from an input. In regression the prediction is a continuous value, usually a scalar or a vector of scalars, which we can interpret as trying to model a function from an input \(x\) to an output \(f(x)\). When we use LR we assume that the output is linear, or affine, with respect to the input.
The Math
We know that any linear function follows the following format:
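\[ f(x) = mx + b \]
Here \(m\) is the slope and \(b\) is the intercept.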
However, this formulation is the one usually used for functions from \(\mathbb{R}\) to \(\mathbb{R}\). So that the model can handle high-dimensional input data, we will use the following formulation instead:
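\[ f(x) = w^{T}x \]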
In this new form the input \(x\) is a column vector in \(\mathbb{R}^{n+1}\), and we call \(w\) the weight vector, also contained in that space. You’ll notice that our \(b\) disappeared. This is a common trick used to hide that term: by adding a \(1\) to each input vector we can train a weight \(w_{0}\) that takes the role of \(b\). We’ll call each entry of the input \(x_{i}\) and each entry of the weight vector \(w_{i}\), where \(x_0\) is always \(1\) and \(w_0\) is \(b\).
Our objective is now to learn the weight vector. For this purpose we will use an error function. These are very common in ML and are used to measure the performance of our model in a quantifiable way. We calculate this error over our dataset, which takes the form \(\mathcal{D}=(X,y)\), where:
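\[ X = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{M} \end{bmatrix} \]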
is the input \((N+1)\times M\) matrix. Each column is an \((N+1)\)-dimensional input (don’t forget \(x_{0}=1\) for all \(x\)), and there are \(M\) input points. You probably noticed that \(x_{i}\) now refers to the \(i\)th point of the dataset instead of the corresponding dimension; this will be our notation moving forward. Finally, \(y\) represents the inputs’ targets. It consists of:
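\[ y = \begin{bmatrix} y_{1} & y_{2} & \cdots & y_{M} \end{bmatrix}^{T} \]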
This vector of scalars contains the targets for our observations, meaning \(f(x_{i})=y_{i}\). With our dataset in hand we can now pick an error function to quantify the quality of our model on it. In the case of LR we will use the sum of squared errors (SSE):
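\[ \mathrm{SSE}(w) = \sum_{i=1}^{M} \left( w^{T}x_{i} - y_{i} \right)^{2} \]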
As you can see, this expression takes the difference between our prediction and the correct value for every point in the dataset, squares it, and then sums all of these errors. We can write this error in matrix form as:
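\[ \mathrm{SSE}(w) = \left( X^{T}w - y \right)^{T} \left( X^{T}w - y \right) \]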
This expression can then be simplified to:
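\[ \mathrm{SSE}(w) = w^{T}XX^{T}w - 2w^{T}Xy + y^{T}y \]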
We are finally ready to apply our optimization. This is the step where the computer actually learns. We want the smallest SSE possible, meaning the value of \(w\) that minimizes the sum of squared errors for our dataset. How can we find this value? We know from calculus that a function’s derivative is \(0\) at its maxima and minima. Intuitively, the SSE has no maxima, since we can always make a worse prediction, so any zeros we find will be minima. In addition, the SSE is convex, meaning any minimum we find is a global minimum. To find it we will solve the following equation:
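\[ \frac{\partial \, \mathrm{SSE}(w)}{\partial w} = 2XX^{T}w - 2Xy = 0 \quad\Longrightarrow\quad w = \left( XX^{T} \right)^{-1} X y \]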
The usual definition of this closed form is \(w=(X^{T}X)^{-1}X^{T}y\). You might notice that the only difference is that \(X\) and \(X^{T}\) have swapped places. This is because, usually, the input matrix represents data points as row vectors. In this math section we used the column-vector representation, as it is the mathematical standard.
Playing with the Model
Now that we understand the math, let’s build some intuition for LR. First, we’ll go over a toy dataset:
Putting this data in our preferred shape, \(\mathcal{D}=(X,y)\), we get:
Notice how we add our \(x_{0}=1\) to every column vector input. Now let’s calculate our weights:
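\[ w = \left( XX^{T} \right)^{-1} X y \approx \begin{bmatrix} 1.03 \\ 0.35 \end{bmatrix} \]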
Don’t forget we can interpret this weight vector as \( w = \begin{bmatrix} b\\ m\\ \end{bmatrix} \), meaning that our final plotted line will be defined as \(y=0.35x+1.03\). Below we plot the result with our training points. Try adding and removing points from the training dataset and see the effect on the regression.
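If you want to try this yourself, below is a minimal NumPy sketch of the closed form derived in the math section. The helper name fit_linear_regression and the toy points are illustrative placeholders, not the exact dataset used for the plot.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Closed-form least squares in the column-vector convention.

    X is the (N+1) x M input matrix whose first row is all ones,
    y is the length-M target vector, and the result is
    w = (X X^T)^{-1} X y.
    """
    # Solving the linear system is more stable than forming the inverse.
    return np.linalg.solve(X @ X.T, X @ y)


# Illustrative toy points (not the exact toy dataset plotted above).
x_raw = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.3, 1.8, 2.1, 2.4])

# Prepend the constant x_0 = 1 to every input, as described earlier.
X = np.vstack([np.ones_like(x_raw), x_raw])

w = fit_linear_regression(X, y)
print(w)  # w[0] plays the role of b, w[1] is the slope m
```

Note that np.linalg.solve is used instead of explicitly inverting \(XX^{T}\); it computes the same \(w\) but is numerically more stable.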
A Simple Problem
So what are the uses of linear regression? As I stated in the introduction, we will look at this new tool as an entry point to machine learning. ML is a wide field, but in this post let’s look at the classic supervised learning problem statement: How can we make the best future prediction for a problem using existing data? Usually an ML specialist will choose the best tool for the job, but since we only have one tool I’ve chosen the best job for the tool.
As the name implies, linear regression is best suited to model linear relationships between independent input variables and dependent output variables. This can take many forms, but so that we can visualize our results I chose a dataset that goes from \(\mathbb{R}\) to \(\mathbb{R}\) (from this kaggle dataset). The input variable will be a worker’s experience in years, while the dependent variable is the salary this worker is able to obtain. Intuitively, this dataset is naive, and we shouldn’t expect good predictions in a realistic setting. However, it is a good example of how, in reality, linear regression can be used to confirm or quantify intuitive relationships, as we would expect salary to grow with experience somewhat linearly.
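As a rough sketch of how such a fit could be reproduced: the file name Salary_Data.csv and the column names YearsExperience and Salary below are assumptions about the Kaggle download, so adjust them to match the actual file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names for the Kaggle salary dataset.
data = pd.read_csv("Salary_Data.csv")
x_raw = data["YearsExperience"].to_numpy()
y = data["Salary"].to_numpy()

# Column-vector convention: prepend x_0 = 1 to every input.
X = np.vstack([np.ones_like(x_raw), x_raw])
w = np.linalg.solve(X @ X.T, X @ y)

# Blue points for the training data, red line for the predictor.
xs = np.linspace(x_raw.min(), x_raw.max(), 100)
plt.scatter(x_raw, y, color="blue", label="training data")
plt.plot(xs, w[0] + w[1] * xs, color="red", label="prediction")
plt.xlabel("Years of experience")
plt.ylabel("Salary")
plt.legend()
plt.show()
```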

We can see the result of our linear regression in the image above. The blue points represent our training data, while the red line is our calculated predictor. The obvious linear relationship present in the points is adeptly modeled by our result. While this result is satisfactory, there are some noteworthy observations.
First, we can see that we don’t predict the correct value for any of our training data points. This makes sense, as a person with \(n\) years of experience does not have a single fixed salary, but it is still important to highlight that we are only predicting approximations.
Then we have to mention a central concept in machine learning: generalization. We only have training data up to 11 years of experience, but a worker can have many more. Therefore the model becomes less trustworthy the further we move past this upper limit, a property that related approaches, such as Bayesian regression, can quantify precisely.
Finally, we must never forget: correlation is not causation! This model makes a naive prediction based on loosely related values. While we can use it to inform our decision making, we must not use it to draw absolute conclusions.
Conclusion
In this post we went over basic linear regression. This simple machine learning model is a powerful tool for showcasing linear relationships in data. Other, more advanced versions of this model exist, in the form of multiple linear regression, or even by employing a feature vector to model some non-linear relationship. However, when solving more complex problems, especially non-linear ones, specialists will usually choose more advanced tools like support vector machines or neural networks.
While these are all valid jumping-off points for further exploration of machine learning, I hope to have provided a good introduction to the field and, if nothing else, a tool that is simple yet powerful.