Regression Analysis: Interpreting Stata Output

Regression Analysis

Regression analysis is a statistical method used by data analysts to estimate the relationship between a dependent variable and one or more independent variables. In our articles on linear regression and logistic regression, we described the independent variables as the variables we use to predict the response (dependent) variable, and the dependent variable as the variable whose variation we wish to explain using the independent variables.

Interpreting the findings of regression analysis is an important skill in data analytics because it can guide data-driven decisions in organizations. In this article, I will explain the regression output of Stata and how to interpret its different results.

Stata Regression Output

Stata is a statistical software package used for data analysis, management, and visualization. Its regression output is highly informative, and it is one of the most widely used tools for estimating the relationship between a dependent variable and independent variables.

In this article, we will consider a randomly generated dataset with 20 observations, 3 independent variables, and 1 dependent variable. One of the independent variables is categorical. The data is as shown below:

Data Table

Using Stata to fit a regression line in the data, the output is as shown below:

Stata Output

The Stata output has three tables and we will explain them one after the other.

  1. ANOVA table: This is the table at the top-left of the Stata output and it is as shown below:

SS is short for “sum of squares” and it is used to represent variation. The variation of the target variable comprises that of the model (explainable variation) and that of the residuals (unexplainable variation). The total SS is the total variation of the target variable around its mean. It is given by:

SS_total = Σᵢ (yᵢ − ȳ)²

Where SS_total represents the total variation that the target variable has,

yᵢ is the value of the target variable for a given observation, and

ȳ is the mean of the target variable.

On the other hand, SS residual represents the unexplainable variation of the target variable y (the variation of y around its mean that our model cannot explain or capture). This is the variation of the residuals and is given by:

SS_residual = Σᵢ (yᵢ − ŷᵢ)²

Where SS_residual represents the variation of the target variable that the model fails to capture,

yᵢ is the value of the target variable for a given observation, and

ŷᵢ is the predicted value of the target variable for that observation.

The model’s sum of squares (explainable variation) would thus be:

SS_model = SS_total − SS_residual

Which is mathematically equivalent to:

SS_model = Σᵢ (ŷᵢ − ȳ)²

From the output, we can see that out of a total variation of 5793.43 in the dependent variable, 5649.48 is explained by the model, while the remaining 143.95 is unexplained.
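The sum-of-squares identities above are easy to verify numerically. A minimal sketch in Python, using a small toy dataset (not the article's randomly generated data) and a hand-rolled one-predictor OLS fit:

```python
# Toy illustration of the sum-of-squares identities (not the article's dataset)
x = [1, 2, 3, 4, 5]
y = [10, 13, 13, 17, 18]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares for a single predictor
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

ss_total    = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
ss_model    = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained

# For an OLS fit with an intercept, the decomposition holds exactly
assert abs(ss_total - (ss_model + ss_residual)) < 1e-9
```

Note that SS_model = Σ(ŷᵢ − ȳ)² equals SS_total − SS_residual only for a least-squares fit that includes an intercept, which is the case for Stata's regress.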

df is the degrees of freedom associated with a variance. Degrees of freedom is the number of independent values that are free to vary. It is often given by the number of values minus the number of quantities estimated from them.

The total degrees of freedom is n − 1, where n is the number of observations in the data.

Since the model estimates k + 1 parameters (including the intercept), the model degrees of freedom in the ANOVA table is given by:

df_model = (k + 1) − 1 = k

Where k is the number of predictors (independent variables), and the +1 represents the intercept.

The residual degrees of freedom is the difference between the total degrees of freedom and the model degrees of freedom. It is given by:

df_residual = (n − 1) − k = n − k − 1

From the output, we see that the degrees of freedom of the model and residuals are 3 and 16 respectively, while that of the whole data (total) is 19.

MS is the mean sum of squares. It is the sum of squares per unit degree of freedom (the sum of squares divided by the degrees of freedom).

From the output, the mean sums of squares of the model, residual, and total are 1883.16, 8.997, and 304.917 respectively.
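These figures can be reproduced directly from the sums of squares and degrees of freedom reported in the output; a quick check in Python:

```python
n, k = 20, 3  # observations and predictors in the article's dataset

df_total    = n - 1      # 19
df_model    = k          # 3
df_residual = n - k - 1  # 16

# Mean squares: sum of squares divided by degrees of freedom
ms_model    = 5649.48 / df_model      # ~1883.16
ms_residual = 143.95 / df_residual    # ~8.997
ms_total    = 5793.43 / df_total      # ~304.917
```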

  2. Model fit: This table summarizes the overall fit of the model. It answers the question “how well does the model use the predictors to model the target variable?” The table is as shown below:

Number of obs is simply the number of observations used in the regression. Since the data has 20 observations, Number of obs is equal to 20.

F(3, 16) is the F-statistic of an ANOVA test run on the model. The F-statistic is the ratio of the mean sum of squares (MS) of the model to that of the residual. It measures how much greater the explainable mean variance is than the unexplainable mean variance. The 3 and 16 simply represent the model and residual degrees of freedom respectively. To determine how well the predictors (taken together as a group) reliably predict the dependent variable, Stata conducts a hypothesis test using the F-statistic. The null hypothesis is that the mean explainable variance is the same as the mean unexplainable variance. From the table, we see that the mean sum of squares of the model is about 209.31 times that of the residual.

The Prob > F is the probability of obtaining the estimated F-statistic or a greater one if the null hypothesis were true (the p-value). For a typical alpha level of 0.05, a p-value less than 0.05, as in our output, means we have evidence to reject the null hypothesis and accept the alternate hypothesis that the MS of the model is significantly greater than that of the residual. Hence, the predictors of our model reliably predict the target variable.
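The F-statistic itself is just the ratio of the two mean squares; checking it against the article's numbers:

```python
ms_model    = 5649.48 / 3   # model mean square
ms_residual = 143.95 / 16   # residual mean square

# Ratio of explainable to unexplainable mean variance
f_stat = ms_model / ms_residual  # ~209.31, matching the F(3, 16) row
```

(The p-value itself requires the F-distribution's tail probability, which Stata computes internally; with df (3, 16) an F of 209.31 is far beyond any conventional critical value.)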

R-squared is the coefficient of determination and it represents the goodness of fit. It is numerically the fraction of the variation in the dependent variable that can be accounted for (explained) by the independent variables. It is given by:

R² = SS_model / SS_total = 1 − SS_residual / SS_total

From the output, 97.52% of the variation in the dependent variable is explained by the model.

Adj R-squared: Since adding more and more predictors tends to increase the R-squared, Adj R-squared tells us how much of the variation of the dependent variable is explained after accounting for the number of independent variables. Adj R-squared is the R-squared controlled for the number of predictors. It is given by:

Adj R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

From the output, we can say that after adjusting for the degrees of freedom, the coefficient of determination is 97.05%.

Root MSE is simply the standard deviation of the residuals (error term), i.e. the square root of the residual mean square. From the output, the measure of the spread of the residuals is 2.9995.
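All three model-fit statistics follow from the ANOVA table's sums of squares and degrees of freedom; reproducing them in Python:

```python
import math

ss_total, ss_residual = 5793.43, 143.95  # from the ANOVA table
n, k = 20, 3                             # observations and predictors

r_squared     = 1 - ss_residual / ss_total                   # ~0.9752
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)  # ~0.9705
root_mse      = math.sqrt(ss_residual / (n - k - 1))         # ~2.9995
```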

  3. Parameter Estimation: This table shows the parameters estimated by the model and their respective statistical significance. In addition to the estimated coefficients, Stata conducts a hypothesis test using the t-test to determine whether each estimated coefficient is significantly different from zero. The null hypothesis for each independent variable is that it has no relationship with the dependent variable and hence has a true coefficient of zero, and that the intercept is zero. The alternate hypothesis is that these coefficients are significantly different from zero.

Parameter Estimation Table

y represents the target variable; x1, x2, and x3 represent the independent variables; and _cons represents the constant term (intercept).

Linear Regression

For a linear regression, Coef. is the estimate of the coefficients of the independent variables and the value of the intercept. The equation of the model can thus be represented as follows:

ŷ = 25.7459 + 0.9286648·x1 − 2.337473·x2 + 2.018029·x3

Remember that in our linear regression article, we explained that these coefficients are the partial derivatives of the dependent variable with respect to each independent variable. That is, each represents the change in the target variable for a unit increase in the corresponding independent variable, holding all other factors constant. We interpret these coefficients as follows:

  1. Holding all other factors constant, the value of y will increase by about 0.9286648 for a unit increase in x1.
  2. Holding all other factors constant, the value of y will decrease by about 2.337473 for a unit increase in x2.
  3. Holding all other factors constant, the value of y will increase by about 2.018029 for a unit increase in x3. Since x3 is a categorical variable, assuming 0 represents female while 1 represents male, a unit increase in x3 is the same as switching from female (0) to male (1). We can thus say that, holding all other factors constant, the value of y for a male is about 2.018029 more than that for a female.
  4. The value of y is 25.7459 when the independent variables each have a value of zero (the intercept).
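The fitted equation can be evaluated for a new observation. A small sketch in Python, using the estimated coefficients and a hypothetical observation (the values x1 = 2, x2 = 1, x3 = 1 are purely illustrative):

```python
# Coefficients from the parameter-estimation table
intercept = 25.7459
b1, b2, b3 = 0.9286648, -2.337473, 2.018029

def predict(x1, x2, x3):
    """Fitted value from the estimated linear regression."""
    return intercept + b1 * x1 + b2 * x2 + b3 * x3

# Hypothetical observation: x1 = 2, x2 = 1, x3 = 1 (male)
y_hat = predict(2, 1, 1)

# Switching x3 from 0 (female) to 1 (male), all else equal,
# changes the prediction by exactly the coefficient b3
assert abs((predict(2, 1, 1) - predict(2, 1, 0)) - b3) < 1e-9
```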

Logistic Regression

If the dependent variable were categorical, the interpretation would change a little. Remember that in our logistic regression article, we showed that the regression model fits a line to the logit of the target variable. Since the logit is the natural logarithm of the odds, the odds are the exponent of the logit. Recall that the odds are the ratio of the probability of success to the probability of failure, i.e. how many times the chance of success is the chance of failure.

We can represent the linear equation as:

logit(p) = ln(p / (1 − p)) = β₀ + β₁·x1 + β₂·x2 + β₃·x3

Each coefficient is thus the change in logit for a unit change in the corresponding independent variable. To interpret a change in logit, we can write it as:

Δlogit = ln(odds_after) − ln(odds_before) = ln(odds_after / odds_before)

Therefore,

odds_after / odds_before = e^β

The above, however, is a ratio of odds (an odds ratio). For better interpretability, it is best to interpret the change in the dependent variable as a fractional change in the odds.

Therefore, the fractional change in the odds is:

(odds_after − odds_before) / odds_before = e^β − 1

Therefore, each coefficient of an independent variable means that for a unit change in that variable, the fractional increase in the odds is e^β − 1 (or the percentage increase in the odds is (e^β − 1) × 100%).

Assuming our dependent variable is categorical and that 1 means success and 0 means failure, the coefficients would have been interpreted as follows:

  1. Holding all other factors constant, the odds of success increase by about 153.11% (e^0.9286648 − 1 ≈ 1.5311) for a unit increase in x1.
  2. Holding all other factors constant, the odds of success decrease by about 90.34% for a unit increase in x2 (because the coefficient is negative, the fractional change is e^−2.337473 − 1 ≈ −0.9034).
  3. Holding all other factors constant, the odds of success increase by about 652.35% (e^2.018029 − 1 ≈ 6.5235) for a unit increase in x3. Since x3 is a categorical variable, assuming 0 represents female while 1 represents male, a unit increase in x3 is the same as switching from female (0) to male (1). We can thus say that, holding all other factors constant, the odds of success for a male are about e^2.018029 ≈ 7.52 times those for a female.
  4. When the independent variables each have a value of zero, the odds of success are e^25.7459 (the exponent of the intercept).
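These percentage changes all follow from e^β − 1; a quick check in Python using the coefficients from the table:

```python
import math

b1, b2, b3 = 0.9286648, -2.337473, 2.018029

pct_x1 = (math.exp(b1) - 1) * 100  # ~153.11% increase in the odds
pct_x2 = (math.exp(b2) - 1) * 100  # ~-90.34% (a decrease, since b2 < 0)
odds_ratio_x3 = math.exp(b3)       # ~7.52: odds for male relative to female
```

Note that for a negative coefficient the odds cannot fall by more than 100%: the fractional change e^β − 1 is bounded below by −1.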

Std. Err. is the standard error of the estimated parameters. It is a measure of the sampling variability we would expect for each of the parameters.

t is the t-statistic of the estimated parameter. It is the ratio of Coef. to Std. Err. It measures how many times greater than the standard error the difference between the estimated coefficient and zero is.

P>|t| is the p-value associated with the t-statistic. It is the probability of obtaining a coefficient at least as extreme as the estimate if the true coefficient were zero. For a threshold of 0.05, we have enough evidence to accept the alternate hypothesis that the estimated coefficients of x1, x2 and the intercept are not equal to 0 because their p-values are all less than 0.05. However, we cannot reject the null hypothesis for x3 because its p-value is greater than the 0.05 significance threshold.

[95% Conf. Interval] is the 95% confidence interval. It gives the lower and upper boundaries within which we would expect the true coefficient to lie with 95% confidence.
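The t-statistic and confidence interval can be computed from a coefficient and its standard error. A sketch with a hypothetical standard error (the actual Std. Err. values appear in the Stata table), using the two-sided 5% critical t-value for the 16 residual degrees of freedom (t₀.₀₂₅,₁₆ ≈ 2.120):

```python
coef   = 0.9286648  # coefficient of x1 from the output
se     = 0.15       # hypothetical standard error, for illustration only
t_crit = 2.120      # two-sided 5% critical t-value for 16 df

t_stat  = coef / se              # how many standard errors from zero
ci_low  = coef - t_crit * se     # lower bound of the 95% CI
ci_high = coef + t_crit * se     # upper bound of the 95% CI
```

Since the interval (ci_low, ci_high) here would exclude zero, this hypothetical coefficient would be significant at the 5% level, consistent with its t-statistic exceeding the critical value.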

Conclusion

We have seen how to interpret the regression output in Stata. The output can be grouped into the ANOVA table, model fit, and parameter estimation. The interpretation of the coefficients depends on the type of the dependent variable.

See Also: Hypothesis Testing, Importance of Data Visualization, Linear Regression Simplified, Logistic Regression Explained, Understanding the Confusion Matrix

