In this article, I will explain logistic regression in simple terms and show how it is used to model data with a categorical response variable.
Logistic regression is a machine learning model that uses a hyperplane in an $n$-dimensional space to separate data points with $n$ features into their classes. A hyperplane is a plane whose dimension is one less than that of its ambient space. For example, a 2-dimensional plane is a hyperplane for a 3-dimensional space, while a 1-dimensional plane (a line) is a hyperplane for a 2-dimensional space.
Logistic regression is a type of supervised machine learning model: it learns from data whose labels (the dependent variable) are known. Its variables can be broadly classified into two categories: independent variables and dependent variables.
Independent variables are the variables we use to predict or model the dependent variable. They are called independent variables because they can assume any values. Dependent variables, on the other hand, are the variables we want to predict using the independent variables; as a result, they depend on the independent variables. Independent variables are sometimes called features or attributes, while dependent variables are sometimes called target variables or labels (the latter is often used when the values are categorical).
Logistic regression is analogous to linear regression; the main difference between them is that logistic regression is used when the target variable is categorical, while linear regression is used when the target variable is continuous. As a result, logistic regression is used for classification, whereas linear regression is used for regression. Classification is the term used when a model predicts the class an observation belongs to, while regression is associated with a model predicting the numerical value of a response variable. For example, predicting whether a bank transaction is fraudulent or not is a classification task because we are trying to tag transactions as being fraudulent or not. However, if we want to predict the revenue generated by banks, we need a regression model because what we want to predict (amounts of revenue) is a numerical value. To learn more about the linear regression model, check our article on linear regression.
Logistic regression is mostly used to classify observations into two classes such that each observation is assigned to one and only one class. This type of classification is called binary classification; for example, predicting whether the image in a picture is that of a cat or a dog. Other types of classification problems are multi-class and multi-label classification.
Multi-class classification involves classifying observations into three or more classes such that each observation is assigned to one and only one class, for example, predicting whether the image in a picture is that of a cat, a dog, or a bird. On the other hand, multi-label classification involves classifying observations into two or more classes such that each observation can be assigned to more than one class, for example, predicting all the fruits whose images appear in a particular picture. Through strategies like one-vs-rest and one-vs-one (sketched below), logistic regression can be used for multi-class and, with further adaptation, multi-label classification. However, it is mostly used for binary classification, and we will treat logistic regression as a binary classifier in this article.
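As a quick illustration, here is a minimal sketch of the one-vs-rest and one-vs-one strategies with scikit-learn. The three-class dataset is synthetic and generated purely for illustration; the cat/dog/bird framing is hypothetical.

```python
# A minimal sketch of multi-class classification with logistic regression
# via the one-vs-rest and one-vs-one strategies (synthetic data only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 3-class problem standing in for e.g. cat / dog / bird images.
X, y = make_classification(n_samples=150, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)

# One-vs-rest: one binary logistic regression per class (class vs the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-one: one binary logistic regression per pair of classes.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))
```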
In a binary classification problem, it is common to call one of the classes the “positive class” and the other the “negative class”. The choice of positive or negative class is arbitrary; however, naming the class of interest the “positive class” often makes things simpler. For example, in a fraud detection model, “fraud” would be the positive class and “not fraud” the negative class, because we are mainly interested in flagging fraudulent transactions. In machine learning, the positive class and negative class are commonly tagged as 1 and 0 respectively. So, in the dataset for a fraud detection model, “fraud” (positive class) is often represented as 1, while “not fraud” (negative class) is represented as 0. Consider a randomly generated dataset of 20 data points with two independent variables $x_1$ and $x_2$, and a dependent variable $y$ as shown below:
Since $n$ (the number of features) is 2, the data can be represented in a 2-dimensional space. The scatter plot of the two independent variables in a 2-dimensional space is shown below:
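Since the article's exact random dataset is not reproduced here, the sketch below generates a comparable stand-in (20 points, two features, a roughly linear class boundary) and draws the scatter plot; all values are illustrative.

```python
# Generate a stand-in for the article's dataset: 20 random points with two
# features (x1, x2) and a binary label y that follows a roughly linear
# boundary. The original data points are not reproduced here.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x1 = rng.uniform(4, 12, 20)                   # first feature
x2 = rng.uniform(-10, 10, 20)                 # second feature
# Noise near the boundary makes the classes overlap slightly.
y = (x2 + rng.normal(0, 3, 20) > 2.5 * x1 - 20).astype(int)

plt.scatter(x1[y == 1], x2[y == 1], marker="o", label="positive class (1)")
plt.scatter(x1[y == 0], x2[y == 0], marker="x", label="negative class (0)")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.legend()
plt.show()
```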
From the scatter plot above, we can see that there is a fairly linear boundary separating the data points. For these data points, a logistic regression model would use a straight line as the hyperplane to separate the data into its classes: data points above the hyperplane would be classified as the positive class, while those below would be classified as the negative class. Since there are many lines that can separate this data into its classes, the challenge is finding the line that best separates the data.
From plane geometry, we know that a straight line can be represented by the equation:

$y = mx + c$

where $m$ is the slope and $c$ is the intercept.
The slope is the change in $y$ for a unit change in $x$. We can confirm this by taking the derivative of $y$ with respect to $x$. This would give:

$\frac{dy}{dx} = m$

We then see that $m$ is indeed the derivative of $y$ with respect to $x$.
The intercept, on the other hand, is the point on the $y$ axis where the straight line crosses (or intercepts) the axis. A little consideration will show that the intercept is the value of $y$ when $x$ is 0. We can confirm this by plugging $x = 0$ into the straight line equation. This would give:

$y = m(0) + c = c$
Since the $y$ axis of our data is $x_2$, while the $x$ axis is $x_1$, we can thus write the equation of the line that separates the data points into their classes as:

$x_2 = mx_1 + c$
Since a straight line can be represented by an equation, the question “which straight line can best separate the data into its classes?” can be rephrased as “which values of slope and intercept should the straight line that best separates the data into its classes have?” Since the slope and intercept of the straight line can take an infinite number of values, we need a criterion for choosing the values of slope and intercept that best separate the data into its classes. The word best is relative, and for logistic regression it means the straight line with the minimum error. The function that quantifies errors in a model is called a loss function; a model therefore tries to make the value of the loss function as small as possible. A simple loss function we could try for logistic regression is the number of misclassifications. Let’s see what this would look like.
Considering our sample dataset, let’s assume a slope of 2.5 and an intercept of -20. The equation of this line would thus be:

$x_2 = 2.5x_1 - 20$
Plotting this equation on the dataset, we have:
From the plot above, we can see that our assumed model misclassified 2 positive points and 2 negative points, making a total error of 4. If we wish to minimize the error, we need to adjust the line so that it reduces the value of the loss function (total misclassifications) further. Let’s now assume a slope of 2.4 and an intercept of -18, i.e. $x_2 = 2.4x_1 - 18$. The plot of such a straight line is shown below:
Although there is a little movement in the line, the total misclassification count is still 4 (2 misclassified positive points and 2 misclassified negative points). This is not good news, because the value of a loss function should respond to every change in the hyperplane; this is needed for the loss function to be differentiable. Using total misclassifications as a loss function for logistic regression may not produce the optimum result because its values are discrete, not continuous. We therefore need to use another loss function.
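To make the counting concrete, here is a minimal sketch, continuing from the stand-in dataset generated earlier, that counts misclassifications for the two candidate lines. Because the data is a stand-in, the counts will generally differ from the 4 quoted above.

```python
# Count misclassifications for a candidate line x2 = m*x1 + c:
# points above the line are predicted positive, points below negative.
def misclassifications(m, c, x1, x2, y):
    predicted = (x2 > m * x1 + c).astype(int)  # 1 if above the line
    return int((predicted != y).sum())

# The two candidate lines assumed in the article.
print(misclassifications(2.5, -20, x1, x2, y))
print(misclassifications(2.4, -18, x1, x2, y))
```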
If we rewrite the equation of the separating line as:

$x_2 - mx_1 - c = 0$

and multiply both sides by an arbitrary number $w_2$, we can rewrite the equation as:

$w_2x_2 - w_2mx_1 - w_2c = 0$

We can further rewrite the equation as:

$w_1x_1 + w_2x_2 + b = 0$

where $w_1 = -w_2m$ and $b = -w_2c$. We now see that the arbitrary number $w_2$ is indeed the coefficient of $x_2$. This is the form in which we would often see the equation of a logistic regression model written:

$w_1x_1 + w_2x_2 + b = 0$
Since this equation is the equation of the hyperplane, only points on the hyperplane will satisfy it. Points above the hyperplane make the left-hand side of the equation a positive number, while points below make it a negative number. We can therefore generalize the equation to:

$z = w_1x_1 + w_2x_2 + b$
So, data points with positive and negative values of $z$ are above and below the hyperplane respectively, while those whose value of $z$ is zero are on the hyperplane. A little consideration will show that the values of $z$ range from $-\infty$ to $+\infty$. For better decision making (deciding the class of an observation), it is better to convert the value of $z$ to a probability (a value between 0 and 1). We achieve this by using the Sigmoid function.
The formula for the Sigmoid function is:

$\sigma(z) = \frac{1}{1 + e^{-z}}$
This function transforms $-\infty$, 0, and $+\infty$ to 0, 0.5, and 1 respectively. If the value of $z$ for a data point is close to $+\infty$, the data point is very far above the hyperplane, hence we are very sure that it belongs to the positive class. On the other hand, if the value of $z$ for a data point is close to $-\infty$, the data point is very far below the hyperplane, hence we are very sure that it belongs to the negative class (or very sure that it does not belong to the positive class). If a data point is on the hyperplane, its $z$ value is 0, and its Sigmoid transform is 0.5 (a 50% probability of belonging to the positive class). We can thus interpret the value of the Sigmoid function as the probability of a data point belonging to the positive class.
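A small sketch of the Sigmoid transform makes this concrete: very negative $z$ maps near 0, $z = 0$ maps to exactly 0.5, and very positive $z$ maps near 1.

```python
# The Sigmoid function squashes any real z into the interval (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10, -1, 0, 1, 10):
    print(f"z = {z:>3}: sigmoid(z) = {sigmoid(z):.4f}")
```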
One way to quantify the error (the value of the loss function) is to multiply together the predicted probabilities that the observations belong to their actual classes. This product is called the likelihood, and choosing the parameters that maximize it is known as maximum likelihood estimation. For example,
if our model predicts the probabilities of belonging to the positive class (using the Sigmoid function) for four data points as 0.8, 0.15, 0.95, and 0.1, and the first two data points actually belong to the positive class while the last two belong to the negative class, then for a threshold of 0.5 the model misclassified the second and third data points (they are actually positive and negative respectively). The likelihood can be calculated as:
| Predicted probability | Actual class | Predicted probability of belonging to actual class |
|---|---|---|
| 0.8 | Positive | 0.8 |
| 0.15 | Positive | 0.15 |
| 0.95 | Negative | 1 - 0.95 = 0.05 |
| 0.1 | Negative | 1 - 0.1 = 0.9 |

Likelihood = 0.8 × 0.15 × 0.05 × 0.9 = 0.0054
We see that the likelihood has drawbacks as a loss function: it is a product of many probabilities, so its value shrinks rapidly toward zero as the number of observations grows, and it is a quantity we want to maximize, whereas a loss function is something we want to minimize.
For these reasons, we derive another loss function by modifying the likelihood: the cross entropy loss. It is simply the negative of the natural logarithm of the likelihood, averaged over the observations. For a binary classifier where 1 is the positive class and 0 is the negative class, the cross entropy loss is:

$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i)\right]$
where $N$ is the number of observations, $y_i$ is the actual label of the $i$th observation (1 or 0), and $p_i$ is the predicted probability of the $i$th observation belonging to the positive class.
We see that the cross entropy formula for each observation has two parts: when $y_i = 1$, only the $\ln(p_i)$ term survives, so the loss is small when the predicted probability of the positive class is high; when $y_i = 0$, only the $\ln(1 - p_i)$ term survives, so the loss is small when the predicted probability of the positive class is low.
For our earlier four-data-point example, we can thus calculate the cross entropy as:

| Predicted probability | Actual class | Cross entropy term |
|---|---|---|
| 0.8 | Positive ($y_i = 1$) | $-\ln(0.8) \approx 0.223$ |
| 0.15 | Positive ($y_i = 1$) | $-\ln(0.15) \approx 1.897$ |
| 0.95 | Negative ($y_i = 0$) | $-\ln(1 - 0.95) \approx 2.996$ |
| 0.1 | Negative ($y_i = 0$) | $-\ln(1 - 0.1) \approx 0.105$ |

Therefore,

$L = \frac{0.223 + 1.897 + 2.996 + 0.105}{4} \approx 1.305$

We then see that the cross entropy, $-\ln(0.0054)/4 \approx 1.305$, is indeed the negative natural logarithm of the likelihood averaged over the observations.
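The sketch below reproduces the worked example and confirms that the mean cross entropy equals the negative natural logarithm of the likelihood divided by the number of observations.

```python
# Verify the worked example: likelihood and mean cross entropy.
import numpy as np

p = np.array([0.8, 0.15, 0.95, 0.1])   # predicted P(positive class)
y_true = np.array([1, 1, 0, 0])        # actual classes (1 = positive)

# Likelihood: product of predicted probabilities of each actual class.
likelihood = np.prod(np.where(y_true == 1, p, 1 - p))

# Cross entropy averaged over the four observations.
cross_entropy = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(likelihood)                    # 0.0054
print(cross_entropy)                 # ~1.305
print(-np.log(likelihood) / 4)       # ~1.305, the same value
```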
A logistic regression model represents the hyperplane by a linear equation whose coefficients are those that minimize the cross entropy loss. The general equation is shown below:

$z = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$
Since logistic regression predicts the probability of an observation belonging to the positive class, and it uses the Sigmoid function to transform the linear equation into a probability, we can write the equation of the predicted probabilities of logistic regression as follows:

$P = \frac{1}{1 + e^{-(w_1x_1 + w_2x_2 + \dots + w_nx_n + b)}}$
If we rewrite the Sigmoid function in terms of $z$, we will have:

$z = \ln\left(\frac{P}{1 - P}\right) = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$
$\ln\left(\frac{P}{1 - P}\right)$ is called the logit and is interpreted as the natural logarithm of the odds. The odds is the ratio of the probability $P$ that an event will occur to the probability $(1 - P)$ that the event will not occur. We then see that a logistic regression represents the logit as a linear function of the features of the data.
Fitting a logistic regression on our 20 data points using scikit-learn (a Python machine learning library) and its LogisticRegression class gives the fitted coefficients $w_1$ and $w_2$ and the intercept $b$ (the exact values depend on the data), so we can represent the linear equation of the model as:

$z = w_1x_1 + w_2x_2 + b$
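Since the article's fitted coefficient values are not reproduced here, the sketch below shows how they would be obtained with scikit-learn on the stand-in dataset from earlier; it also checks that predict_proba is the Sigmoid of the decision function.

```python
# Fit scikit-learn's LogisticRegression on the stand-in dataset and read
# off the coefficients w1, w2 and intercept b of the fitted hyperplane.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.column_stack([x1, x2])   # features from the earlier sketch
model = LogisticRegression().fit(X, y)

w1, w2 = model.coef_[0]
b = model.intercept_[0]
print(f"z = {w1:.3f}*x1 + {w2:.3f}*x2 + {b:.3f}")

# predict_proba applies the Sigmoid to z = decision_function(X).
z = model.decision_function(X[:3])
print(model.predict_proba(X[:3])[:, 1])  # P(positive class)
print(1 / (1 + np.exp(-z)))              # identical values
```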
The fitted hyperplane on the dataset is as shown below:
Logistic regression is a machine learning model that uses a hyperplane in an $n$-dimensional space to separate data points with $n$ features into their classes. It does so by finding the equation of the logit in terms of the features such that the coefficients are those that minimize the cross entropy loss.
See Also: Hypothesis Testing, Importance of Data Visualization, Linear Regression Simplified, Regression Analysis: Interpreting Stata Output, Understanding the Confusion Matrix