Regression analysis is a statistical method used by data analysts to estimate the relationship between a dependent variable and one or more independent variables. In our articles on linear regression and logistic regression, we described the independent variables as the variables we use to predict the response (dependent) variable, and the dependent variable as the variable whose variation we wish to explain using the independent variables.
Interpreting the findings of regression analysis is an important skill in data analytics because it can guide data-driven decisions in organizations. In this article, I will explain the regression output of Stata and how to interpret its different results.
Stata is a statistical software package used for data analysis, management, and visualization. Its regression output is highly informative, and it is one of the most widely used tools for estimating the relationship between a dependent variable and independent variables.
In this article, we will consider a randomly generated dataset with 20 observations, 3 independent variables, and 1 dependent variable. One of the independent variables is a categorical variable. The data is as shown below:
Using Stata to fit a regression line to the data, the output is as shown below:
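The output can be reproduced with a command along these lines (a minimal sketch; the variable names y, x1, x2, and x3 are the ones used in the coefficient table discussed later):

* fit a linear regression of y on the three predictors
regress y x1 x2 x3

If x3 is the categorical predictor, it could alternatively be entered with factor-variable notation as i.x3, which would give one coefficient per category level rather than a single coefficient.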
The Stata output has three tables and we will explain them one after the other.
1. ANOVA table: SS is short for “sum of squares” and is used to represent variation. The variance of the target variable comprises that of the model (explainable variance) and that of the residuals (unexplainable variance). The total SS is the total variation of the target variable around its mean. It is given by:

$$SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
where $SS_{total}$ represents the total variation that the target variable has, $y_i$ is the value of the target variable for a given observation, and $\bar{y}$ is the mean of the target variable.
On the other hand, the residual SS represents the unexplainable variation of the target variable (the variation of $y$ around its mean that our model cannot explain or capture). This is the variation of the residuals and is given by:

$$SS_{residual} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i$ is the predicted value of the target variable for a given observation.
The model’s sum of squares (explainable variance) would thus be:

$$SS_{model} = SS_{total} - SS_{residual}$$

which is mathematically equivalent to:

$$SS_{model} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
From the output, we can see that out of a total variation of 5793.43 in the dependent variable, 5649.48 is explained by the model, while the remaining 143.95 is unexplained.
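As a quick check, the two components add back up to the total variation:

$$SS_{model} + SS_{residual} = 5649.48 + 143.95 = 5793.43 = SS_{total}$$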
df is the degrees of freedom associated with a variance. The degrees of freedom is the number of independent values that are free to vary; it is typically the number of values used in a calculation minus the number of parameters estimated from them.
The total degrees of freedom is $df_{total} = n - 1$, where $n$ is the number of observations in the data.
Since the model estimates $k + 1$ parameters (including the intercept), the model degrees of freedom in the ANOVA table is given by:

$$df_{model} = (k + 1) - 1 = k$$

where $k$ is the number of predictors (independent variables) and the $+1$ represents the intercept.
The residual degrees of freedom is the difference between the total degrees of freedom and the model degrees of freedom. It is given by:

$$df_{residual} = df_{total} - df_{model} = (n - 1) - k$$
From the output, we see that the degrees of freedom of the model and residuals are 3 and 16 respectively, while that of the whole data (total) is 19.
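With $n = 20$ observations and $k = 3$ predictors, these values follow directly from the formulas above:

$$df_{total} = 20 - 1 = 19, \qquad df_{model} = 3, \qquad df_{residual} = 19 - 3 = 16$$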
MS is the mean of the sum of squares. It is the sum of squares per unit degree of freedom (the sum of squares divided by its degrees of freedom).
From the output, the mean sums of squares of the model, residual, and total are 1883.16, 8.997, and 304.917 respectively.
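These follow from dividing each sum of squares by its degrees of freedom:

$$MS_{model} = \frac{5649.48}{3} = 1883.16, \qquad MS_{residual} = \frac{143.95}{16} \approx 8.997, \qquad MS_{total} = \frac{5793.43}{19} \approx 304.917$$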
2. Model fit: This table summarizes the overall fit of the model. It answers the question: “how well does the model use the predictors to model the target variable?” The table is as shown below:
Number of obs is simply the number of observations used in the regression. Since the data has 20 observations, Number of obs is equal to 20.
F(3, 16) is the F-statistic of an ANOVA test run on the model. The F-statistic is the ratio of the mean sum of squares (MS) of the model to that of the residual. It tests whether the ratio of the explainable mean variance to the unexplainable mean variance is statistically greater than 1. The 3 and 16 simply represent the model’s and residual degrees of freedom respectively. To determine how well the predictors (taken together as a group) reliably predict the dependent variable, Stata conducts a hypothesis test using the F-statistic. The null hypothesis is that the mean explainable variance is the same as the mean unexplainable variance. From the table, we see that the mean sum of squares of the model is about 209.31 times greater than that of the residual.
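The statistic is simply the ratio of the two mean squares from the ANOVA table:

$$F(3, 16) = \frac{MS_{model}}{MS_{residual}} = \frac{1883.16}{8.997} \approx 209.31$$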
Prob > F is the probability of obtaining the estimated F-statistic or a greater one (the p-value). For a typical alpha level of 0.05, a p-value less than 0.05, as we have in our output, means that we have evidence to reject the null hypothesis and accept the alternative hypothesis that the MS of the model is significantly greater than that of the residual. Hence, the predictors of our model reliably predict the target variable.
R-squared is the coefficient of determination and it represents the goodness of fit. It is numerically the fraction of the variation in the dependent variable that can be accounted for (explained) by the independent variables. It is given by:

$$R^2 = \frac{SS_{model}}{SS_{total}}$$
From the output, 97.52% of the variation in the dependent variable is explained by the model.
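Using the sums of squares from the ANOVA table:

$$R^2 = \frac{5649.48}{5793.43} \approx 0.9752$$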
Adj R-squared: Since adding more and more predictors tends to increase the R-squared, Adj R-squared tells us how much of the variation of the dependent variable the model explains after accounting for the number of independent variables. Adj R-squared is the R-squared controlled for the number of predictors. It is given by:

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
From the output, we can say that after adjusting for the degrees of freedom, the coefficient of determination is 97.05%.
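Plugging in our values:

$$\bar{R}^2 = 1 - \frac{(1 - 0.9752)(20 - 1)}{20 - 3 - 1} \approx 0.9705$$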
Root MSE is simply the standard deviation of the residuals (error term). From the output, the measure of the spread of the residuals is 2.9995.
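It is the square root of the residual mean square from the ANOVA table:

$$\text{Root MSE} = \sqrt{MS_{residual}} = \sqrt{8.997} \approx 2.9995$$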
3. Parameter estimates: The last table shows the estimated parameters of the model. y represents the target variable; x1, x2, and x3 represent the independent variables; and _cons represents the constant term (intercept).
For a linear regression, Coef. is the estimate of the coefficients of the independent variables and of the intercept. The equation of the model can thus be represented as follows:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$

where $b_0$ is the intercept (_cons) and $b_1$, $b_2$, and $b_3$ are the estimated coefficients of x1, x2, and x3 respectively.
Remember that in our linear regression article, we explained that these coefficients are the corresponding partial derivatives of the dependent variable with respect to each independent variable. That is, each coefficient represents the change in the target variable for a unit increase in its independent variable, holding all other factors constant.
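As an illustration (the value 1.5 here is hypothetical, not taken from the output): if the estimated coefficient of x1 were 1.5, a one-unit increase in x1 would be associated with a 1.5-unit increase in y, holding x2 and x3 constant.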
If the dependent variable were categorical, the interpretation would change a little. Remember that in our logistic regression article, we showed that the regression model fits a line to the logit of the target variable. Since the logit is the natural logarithm of the odds, the odds is the exponent of the logit. Recall that the odds is the ratio of the probability of success to the probability of failure, i.e. how many times the chance of success is the chance of failure.
We can represent the linear equation as:

$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$

where $p$ is the probability of success, so that $\frac{p}{1-p}$ is the odds.
Each coefficient is thus the change in the logit for a unit change in its independent variable. To interpret a change in logit, we can write it as:

$$\Delta\,\text{logit} = \ln(\text{odds}_{x+1}) - \ln(\text{odds}_{x}) = \ln\left(\frac{\text{odds}_{x+1}}{\text{odds}_{x}}\right)$$

Therefore, exponentiating a coefficient $b_j$ gives:

$$e^{b_j} = \frac{\text{odds}_{x_j+1}}{\text{odds}_{x_j}}$$
The above, however, is a ratio of odds (an odds ratio). For better interpretability, it is often best to express the change as a fractional change in the odds.
Therefore, the fractional change in the odds is:

$$\frac{\text{odds}_{x+1} - \text{odds}_{x}}{\text{odds}_{x}} = e^{b_j} - 1$$
Therefore, each coefficient $b_j$ of an independent variable means that for a unit change in that variable, the fractional increase in the odds is $e^{b_j} - 1$ (or the percentage increase in the odds is $100(e^{b_j} - 1)\%$).
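As a numeric illustration (the coefficient 0.18 is hypothetical): $e^{0.18} \approx 1.197$, so a one-unit increase in that predictor would multiply the odds by about 1.197, i.e. a roughly 19.7% increase in the odds.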
Assuming our dependent variable were categorical, with 1 meaning success and 0 meaning failure, the coefficients would be interpreted on the odds scale in exactly this way.
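As a minimal sketch (again assuming the same variable names, with y now binary), such a model can be fitted in Stata with the logit command; the or option reports the exponentiated coefficients (odds ratios) directly:

* fit the logistic model (coefficients on the logit scale)
logit y x1 x2 x3
* refit, reporting odds ratios instead of raw coefficients
logit y x1 x2 x3, or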
Std. Err. is the standard error of the estimated parameters. It is a measure of the deviation we would expect in each of the parameter estimates.
t is the t-statistic of each estimated parameter. It is the ratio of Coef. to Std. Err., and it measures how many times larger than the standard error the difference between the estimated coefficient and zero is.
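For example, with a hypothetical coefficient of 1.20 and standard error of 0.30:

$$t = \frac{\text{Coef.}}{\text{Std. Err.}} = \frac{1.20}{0.30} = 4.0$$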
P>|t| is the p-value associated with the t-statistic. It is the probability of obtaining a coefficient as large as the estimate or larger (or smaller, if the coefficient is negative) when the true coefficient is zero. For a threshold of 0.05, we have enough evidence to accept the alternative hypothesis that the estimated coefficients of x1, x2, and the intercept are not equal to 0, because their p-values are all less than 0.05. However, we cannot reject the null hypothesis for x3 because its p-value is greater than the 0.05 significance threshold.
[95% Conf. Interval] is the 95% confidence interval. It gives the lower and upper boundaries between which we would expect the coefficient to lie 95% of the time.
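For this model, the bounds are computed from the coefficient, its standard error, and the critical t-value for 16 residual degrees of freedom:

$$\text{Coef.} \pm t_{0.025,\,16} \times \text{Std. Err.}, \qquad t_{0.025,\,16} \approx 2.120$$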
We have seen how to interpret the regression output of Stata. The output can be categorized into the ANOVA table, the model fit statistics, and the parameter estimates. The interpretation depends on the data type of each variable.
See Also: Hypothesis Testing, Importance of Data Visualization, Linear Regression Simplified, Logistic Regression Explained, Understanding the Confusion Matrix