April 10, 2017

INTERPRETING MULTIPLE REGRESSION RESULTS IN EXCEL




INTERPRETING MULTIPLE REGRESSION
RESULTS IN EXCEL
DR MUZAHET MASRURI

This article shows how to use Excel to perform multiple regression analysis. To make it simple and easy to understand, the analysis is referred to a hypothetical case study which provides a set of data representing the variables to be used in the regression model.

After performing the regression analysis in Excel, the result of estimation of the regression model is displayed. The question is how to interpret the result?

In interpreting the results, Correlation Analysis is applied to measure the accuracy of estimated regression coefficients. This analysis is needed because the regression results are based on samples and we need to determine how true that the results are reflective of the population. The correlation analysis of R-Square, F-Statistics (F-Test), t-statistic (or t-test), P-value and Confidence Intervals is used.

In the final part, the advise or policy recommendations are highlighted based on the findings of the analysis of the regression model.

WHAT IS MULTIPLE LINEAR REGRESSION? 
A statistical technique for finding linear relationship between a dependent variable with two or more independent variables. The dependent variable Y is hypothesized to be a function of independent variable X1, X2 …… Xm

WHAT IS THE FORMULA OF MULTIPLE LINEAR REGRESSION? 
Y  =  α + b1X1 +  b2X2 +  b3X3 …… +ɛ
Where,
Y = dependent variable
α = constant value (y-intercept)
b1, b2, b3 = slope (regression coefficients)
X1, X2, X3 = independent variables
ɛ= random error

WHY REGRESSION ANALYSIS?
Whilst giving a lecture on Managerial Economics for MBA, I was asked a good question by a student.
RESPONSE TO THE QUESTION FROM THE STUDENT
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modelling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').

In this lecture, regression analysis is used for demand estimation. But regression is also the technique used to estimate many other observations which involve the causal relationships among variables.

For example, we can use multiple regression to assess the relationship between the number of dengue cases in Malaysia with factors, such as urbanization, rainfall, temperature, age group of population and incidence by states. OR, the relationship between salaries of employees and their years of experience and the level of education in the Ministry of Health..!!


WHEN TO USE MULTIPLE LINEAR REGRESSION ANALYSIS?

To answer this question, we refer to a hypothetical Case Study. In the following example, the study is on the sale of petrol at kiosks in Kuala Lumpur.



BASED ON THE INSTRUCTION, THE TASKS OF THE MARKETING MANAGER ARE SUMMARIZED AS FOLLOWS:
1. Using Microsoft Excel, perform regression analysis from the data in the Table.
2. Develop the function in linear form of demand model for the petrol sale for the kiosks.
3. Test whether the coefficients of the variables used in the model are statistically significant.
4. Determine the percentage of the variation of output that is ‘explained’ by the regression equation.
5. Based on the findings, what advise should the Manager propose to the top management to increase the petrol sale at all kiosks in Kuala Lumpur?


USING EXCEL TO PERFORM MULTIPLE REGRESSION ANALYSIS
The steps are as follows:





HOW TO INTERPRET THE RESULT OF THE ESTIMATION OF REGRESSION MODEL?

To answer this question, let us refer to the 5 tasks a marketing manager normally performs.

TASK 1:  Using Microsoft Excel, perform regression analysis from the data in the Table.


TASK 2: Develop the function in linear form of demand model for the petrol sale for the kiosks.


TASK 3: Test whether the coefficients of the variables used in the model are statistically significant.

This is the most crucial task in regression analysis as measure of the accuracy of estimation is needed to test the statistical significance of the estimated regression coefficients of each variable. In addition, the regression results are based on samples and we need to determine how true that the results are truly reflective of the population.


In conducting the test, Correlation Analysis Techniques is used, namely R-Square, F-Statistics (F-Test), t-statistic (or t-test), P-value and Confidence Intervals.


i. R-SQUARE (COEFFICIENT OF DETERMINATION)
R-Square measures the proportion of variation in dependent variable that is explained in the regression line (independent variables). It is computed as the ratio of the sum of squared errors from the regression (SSRegression) to the total sum of squared errors (SSTotal).

The value of R-Square ranges from 0 to 1.The closer R-Square is to one, the better the regression equation; i.e., the greater the explanatory of the regression equation. Low value of R-Square indicates the absence of some important variables from the model.



ii. F-STATISTICS (F-RATIO)

F-statistic is used to test the hypothesis that all the independence variables   (X1, X2, ….. Xm ) explain the variation in the dependence variable (Y). 

Null hypothesis =  Ho :  All βi = 0  (rejected)
Alternative hypothesis =  Ha : At least one βI ≠ 0 (cannot be rejected) 
    
In other words, we are testing whether at least one of the explanatory variables contributes information for the prediction of Y. The greater the F-value the better the overall fit of the regression line.


iii. T-STATISTIC (OR T-TEST)
t-statistic is also used to test the hypothesis that all independence variables (X1, X2, ….. Xm) explain the variation in the dependence variable (Y). 

Null hypothesis =  Ho :  All βi = 0

Alternative hypothesis =  Ha : At least one βI ≠ 0

In other words, we are testing whether at least one of the explanatory variables contributes to the explanation for the prediction of Y. The rule of thumb: if the absolute value of t-statistic is greater or equal to 2, than the parameter estimate is statistically different from zero.



iv. P-VALUE
P-value is associated with a test statistic, i.e. to test the hypothesis of relationship between dependent and independent variables. 

The smaller the P value, the more strongly the test rejects the null hypothesis, and thus, to accept the alternative hypothesis. A p-value of .05 or 5% confidence level implies that 95% of the observed variables are strongly correlated.



v. CONFIDENCE INTERVALS
A term used in measuring the probability that a parameter will fall between two set values. In other words, a confidence interval is the probability that a value will fall between an upper and lower bound of a probability distribution. The confidence interval can take any number of probabilities, with the most common being 95% or 99%.



SUMMARY TASK 3: RESULT OF CORRELATION ANALYSIS

The summary of Task 3 is summarized as follows:

F valueThe F value is 8.234 So we reject the null hypothesis and conclude that the independence variables are useful in explaining of petrol sale with (1- 0.015) = 98.5 % confidence.

t-valueThe  t-value for variable SHELL CARD is -2.593 (i.e. greater than 2 in absolute value). Hence, we can conclude that SHELL CARD is statistically significance in explaining the sale of petrol.

The inference is that SHELL should focus on SHELL CARD than the advertisement or disposable income to increase the petrol sale.

P-value. From the Table, the P-value for the estimated coefficient of SHELL CARD is 0.04104. This means, there is only 4.1 in 100 chance that the true coefficient of price is actually 0. The coefficient of SHELL CARD  implies that it is statistically significant at (1- 0.04104 = 0.9589) or 96%. Thus the estimated coefficient is highly significant

Confidence intervalsThe Table indicate that the upper and lower bound of the 95% confidence intervals. As the coefficients of advertisement, price and income are lying between the upper and lower boundary of 95% confidence level, these coefficients are significant


TASK 4: DETERMINE THE PERCENTAGE OF THE VARIATION OF OUTPUT THAT IS ‘EXPLAINED’ BY THE REGRESSION EQUATION

The answer to this question is normally referred to the value of R-Square. It measures the proportion of variation in dependent variable that is explained in the regression line (independent variables). From the Table above, the sum of squared errors is 6605.61 and the total sum of squared errors is 8210. Thus, the R-square is: 

R-Square  = 6605.61 / 8210   = 0.8045 

This means the estimated demand equation (the regression line) explains 80% of the total variation in petrol sales across the sample of the 10 kiosks.

TASK 5: BASED ON THE FINDINGS, WHAT ADVISE SHOULD THE MANAGER PROPOSE TO THE TOP MANAGEMENT?