In this article, I will be explaining linear regression in simple terms, and showing you how it is used to model data with a continuous response variable.
Linear regression is a machine learning model that fits a hyperplane to data points in an m+1 dimensional space for data with m features. A hyperplane is a plane whose number of dimensions is one less than that of its ambient space. For example, a 2 dimensional plane is a hyperplane in a 3 dimensional space, while a 1 dimensional plane (a line) is a hyperplane in a 2 dimensional space.
A linear regression model is a type of supervised machine learning model because it learns from data whose variables can be broadly classified into two categories: independent variables and dependent variables.
Independent variables are the variables we use to predict or model the dependent variable. They are called independent variables because they can assume any values. Dependent variables, on the other hand, are the variables we want to predict from the independent variables; as a result, dependent variables depend on the independent variables. Independent variables are sometimes called features or attributes, while dependent variables are sometimes called target variables or labels (the latter term is often used when the values are categorical). The "+1" in the m+1 dimensions of the space represents the addition of the dependent variable as a dimension in the space. For example, consider a randomly generated dataset with one independent variable x and one dependent variable y, as shown below:
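The article's original dataset is not reproduced here, so as a stand-in, here is a minimal sketch of how such a dataset might be generated with numpy (the slope, intercept, noise level, and sample size below are illustrative assumptions, not the article's actual values):

    import numpy as np

    # Hypothetical stand-in for the article's randomly generated dataset:
    # one feature x and a response y that varies roughly linearly with x.
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, size=50)            # independent variable (feature)
    y = 2 * x + 5 + rng.normal(0, 2, size=50)  # dependent variable (target) with noise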
Since the number of features m is 1, the data can be represented in an m+1 = 2 dimensional space. The scatter plot of the two variables (the dependent and independent variables) in this 2 dimensional space is shown below:
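A sketch of how this scatter plot might be produced with matplotlib, reusing the synthetic stand-in data from the snippet above:

    import numpy as np
    import matplotlib.pyplot as plt

    # Same synthetic data as in the previous snippet
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, size=50)
    y = 2 * x + 5 + rng.normal(0, 2, size=50)

    # Scatter plot of y against x in 2 dimensional space
    plt.scatter(x, y)
    plt.xlabel("x (independent variable)")
    plt.ylabel("y (dependent variable)")
    plt.show()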
From the scatter plot above, we can see a fairly linear relationship between variable x and y i.e. as x increases, y also increases. However, can we really answer the question “by how much do we expect y to increase if x increases by 1 unit?” One way to answer this question is to assume a perfectly linear relationship between x and y, and therefore use a straight line to represent that relationship. This however would lead to another question: “which straight line can we use to represent (or model) the relationship between x and y?”
From plane geometry, we know that a straight line can be represented by the equation:
y=mx+c
Where m is the slope, and c is the intercept.
The slope is the change in y for a unit change in x. We can confirm this by taking the derivative of y with respect to x.
This would give:
dy/dx = m
We then see that m is indeed the derivative of y with respect to x.
The intercept, on the other hand, is the point where the straight line crosses (or intercepts) the y axis. A little consideration would show that the intercept is the value of y when x is 0. We can confirm this by substituting x = 0 into the straight line equation, which gives y = m(0) + c = c.
Note that the convention for representing the equation of a linear model in machine learning is:
y = β₀ + β₁x
Here, β₀ is the intercept, while β₁ is the slope.
Since a straight line can be represented by an equation, the question "which straight line can we use to represent (or model) the relationship between x and y?" can therefore be rephrased as "what values of slope and intercept should the straight line that represents the relationship between x and y have?" Since the slope and intercept of the straight line can take an infinite number of values, we need a criterion for choosing the values of slope and intercept that represent the relationship between x and y. The criterion used by a linear regression model is to pick the values of the slope and intercept that best represent that relationship. The word best is relative, and for linear regression models, it means the straight line with the minimum error.
Considering our sample dataset, let’s assume a slope of 2 and an intercept of 5. The equation of this line would thus be:
y=2x+5
Plotting this equation using the dataset, we have:
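A sketch of that plot, again assuming the synthetic stand-in data from earlier:

    import numpy as np
    import matplotlib.pyplot as plt

    # Same synthetic data as before
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, size=50)
    y = 2 * x + 5 + rng.normal(0, 2, size=50)

    # Overlay the assumed line y = 2x + 5 on the scatter plot
    line_x = np.linspace(x.min(), x.max(), 100)
    plt.scatter(x, y, label="data")
    plt.plot(line_x, 2 * line_x + 5, color="red", label="y = 2x + 5")
    plt.legend()
    plt.show()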
From the plot, we see that this assumed model seems to fit the data points fairly well. However, we don't know how well it actually fits, or whether it is the best fit we can achieve.
To find the best model, we need a metric that tells us how large our model's errors are. This metric is called the loss function. A little consideration would show that the best model is the one that minimizes the loss function. Since an error in prediction can be positive or negative, it is important to use a loss function that is affected by the magnitude of the error and not its sign. A common loss function used in linear regression is the sum of squared errors. You may want to ask: "why not the sum of absolute errors?" Remember that finding the minimum value of the loss function is the same as finding the combination of slope and intercept that results in the smallest value of the loss function. We typically do this by differentiating the loss function with respect to the slope and intercept, and finding the slope and intercept that make this derivative 0 (the turning point of the loss function vs slope/intercept plot).
Minimizing the sum of absolute errors as a loss function is more difficult than minimizing the sum of squared errors: because the sign of an error can be positive or negative, the absolute value function is not differentiable where the error is 0, so the sum of absolute errors is not as easily differentiated as the sum of squared errors. We now see that the sum of squared errors has the twin advantages of taking only positive values and being easy to minimize.
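As a concrete sketch, the sum of squared errors for the assumed line y = 2x + 5 might be computed like this (still assuming the synthetic stand-in data from earlier):

    import numpy as np

    # Same synthetic data as before
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, size=50)
    y = 2 * x + 5 + rng.normal(0, 2, size=50)

    def sum_squared_errors(slope, intercept, x, y):
        # Squaring makes every error contribute a positive amount,
        # whether the prediction is too high or too low.
        predictions = slope * x + intercept
        return np.sum((y - predictions) ** 2)

    print(sum_squared_errors(2, 5, x, y))  # loss for the assumed line y = 2x + 5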
Using the LinearRegression model in Python's scikit-learn library to model our dataset, the slope and intercept that best minimize the loss function are 15.38 and 1.04 respectively. The plot of the fitted straight line is shown below:
The equation of the fitted line is thus:
y=15.38x+1.04
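A minimal sketch of how this fit might be produced with scikit-learn; note that running it on the synthetic stand-in data above would give different coefficients from the article's 15.38 and 1.04, which come from the author's own dataset:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Same synthetic data as before
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, size=50)
    y = 2 * x + 5 + rng.normal(0, 2, size=50)

    # scikit-learn expects a 2-D feature matrix:
    # one row per sample, one column per feature
    model = LinearRegression()
    model.fit(x.reshape(-1, 1), y)

    print(model.coef_[0])    # fitted slope
    print(model.intercept_)  # fitted intercept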
The same intuition applies to data with more than 1 feature. The equation for such data is:
y = β₀ + β₁x₁ + β₂x₂ + … + βₘxₘ
Where y is the dependent variable,
x₁, x₂, …, xₘ are the independent variables (aka features),
β₀ is the y intercept,
β₁, β₂, …, βₘ are the slopes of their corresponding features when every other feature is held constant.
A little consideration will show that the partial derivatives of y with respect to x₁, x₂, …, xₘ are β₁, β₂, …, βₘ respectively.
A linear regression model thus looks for the values of β₀, β₁, …, βₘ that best minimize the loss function.
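As an illustrative sketch with hypothetical multi-feature data (the feature count and coefficients below are made up for the example):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical dataset with m = 3 features
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 3))  # 100 samples, 3 features
    y = 4 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, size=100)

    model = LinearRegression().fit(X, y)
    print(model.intercept_)  # estimate of β₀
    print(model.coef_)       # estimates of β₁, β₂, β₃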
Conclusion
We have seen what linear regression means in simple terms. A linear regression model uses a linear equation to represent the variation of the dependent variable with the independent variables. The equation is obtained by finding the values of the intercept and the coefficients of the features that best minimize the loss function.
See Also: Hypothesis Testing, Importance of Data Visualization, Logistic Regression Explained, Regression Analysis: Interpreting Stata Output, Understanding the Confusion Matrix