Statistical Analysis: Regression Models

Statistical Analysis for Forecasting: Regression Analysis

The collection, storage, and analysis of information is critical for all businesses: it provides real-time data on consumers, suppliers, and economic activity that supports day-to-day operations. Whether through better analytical tools or improved statistical know-how, data analysis is central to setting strategy and supporting decision-making, and modern software helps increase efficiency and minimize costs. One methodology widely used by business professionals is regression analysis, a set of statistical techniques for estimating the relationship between a dependent variable and one or more independent variables and for showing how strong that dependency is. The aim is to develop and fit a model that can be used for forecasting, especially where one dependent variable is analyzed against one or more independent variables, so that the estimated relationship can serve as a basis for prediction and future decision-making. A fitted regression describes association; on its own it does not establish causation.

Regression & Assumptions. The distribution of the data is key to interpreting the output of the model. Regression rests on assumptions about that distribution, and plots can be used to check these assumptions together with any influential observations. Influence measures show how much a single observation affects the estimate of a regression coefficient or a fitted value. By specifying the dependent variable and the set of independent variables of interest, it is possible to show the relationship between them, for example whether the relationship between “Y” and “X” is linear. More generally, a model can be described by three components: a random component, a systematic component, and a link function.

As observed during testing, regression analysis answers questions about the relationships between variables. An independent variable (an input, driver, or factor) has an impact on a dependent variable. The regression line is a straight line that describes how a response variable “Y” changes, and in which direction, when the explanatory variable “X” changes. The slope of the regression line is the heart of the equation: it tells you how much you can expect Y to change for each one-unit change in X.

The most commonly used techniques for investigating the relationship between two quantitative variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation. The choice of regression model depends on the type of outcome. Linear regression, for example, should be used when the outcome variable is a continuous numeric variable. Apply other types of regression models, such as the logistic and Poisson models described below, if your outcome is binary or a count.
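
To make the distinction concrete, here is a minimal sketch in plain Python that computes both quantities for the same data: the correlation coefficient r (strength of the linear relationship) and the least-squares line (the relationship as an equation). The data values and the function names `pearson_r` and `fit_line` are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Illustrative data only (not from the article)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
intercept, slope = fit_line(x, y)
print(f"r = {r:.4f}")
print(f"Y = {intercept:.2f} + {slope:.2f} X")
```

The slope is the expected change in Y for a one-unit change in X, while r only says how tightly the points follow a straight line.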

Regression Model Building
Setting: Possibly a large set of predictor variables (including interactions).
Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors.
Automated procedures and all possible regressions:
Backward Elimination (top-down approach)
Forward Selection (bottom-up approach)
Stepwise Regression (combines forward and backward)
Cp Statistic: summarizes each possible model, so that the “best” model can be selected based on the statistic
Backward Elimination
1. Select a significance level to stay in the model (e.g. SLS = 0.20; generally 0.05 is too low, causing too many variables to be removed).
2. Fit the full model with all possible predictors.
3. Consider the predictor with the lowest t-statistic in absolute value (highest P-value). If P > SLS, remove the predictor and fit the model without this variable (the model must be re-fit here because the partial regression coefficients change). If P ≤ SLS, stop and keep the current model.
4. Continue until all predictors have P-values at or below SLS.
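
The loop can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a production routine: the P-values use a large-sample normal approximation to the t distribution (to stay within the standard library), the helper names `invert`, `ols_p_values`, and `backward_elimination` are hypothetical, and the data are a synthetic design in which only x1 has a real effect.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_p_values(cols, y):
    """Fit y on an intercept plus the given columns; return one
    two-sided P-value per predictor (normal approximation)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    mse = sse / (n - p)
    pvals = []
    for j in range(1, p):
        t = beta[j] / math.sqrt(mse * inv[j][j])
        pvals.append(math.erfc(abs(t) / math.sqrt(2)))
    return pvals

def backward_elimination(data, y, sls=0.20):
    """Drop the predictor with the highest P-value until all are <= SLS."""
    kept = list(data)
    while kept:
        pvals = ols_p_values([data[name] for name in kept], y)  # re-fit each round
        worst = max(range(len(kept)), key=lambda j: pvals[j])
        if pvals[worst] <= sls:
            break
        kept.pop(worst)
    return kept

# Synthetic data: x1 matters, x2 barely, x3 not at all
x1 = [-1, -1, -1, -1, 1, 1, 1, 1]
x2 = [-1, -1, 1, 1, -1, -1, 1, 1]
x3 = [-1, 1, -1, 1, -1, 1, -1, 1]
y = [10 + 5 * a + 0.1 * b + a * b + a * c for a, b, c in zip(x1, x2, x3)]
data = {"x1": x1, "x2": x2, "x3": x3}

kept = backward_elimination(data, y)
print("Predictors kept:", kept)
```

For small samples an exact t distribution (for example from a statistics package) would give somewhat larger P-values than the normal approximation used here.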
Forward Selection
1. Choose a significance level to enter the model (e.g. SLE = 0.20; generally 0.05 is too low, causing too few variables to be entered).
2. Fit all simple regression models.
3. Consider the predictor with the highest t-statistic in absolute value (lowest P-value). If P ≤ SLE, keep this variable and fit all two-variable models that include this predictor. If P > SLE, stop and keep the previous model.
4. Continue until no new predictors have P ≤ SLE.
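
A minimal plain-Python sketch of this bottom-up procedure is shown here. It rests on several assumptions: P-values come from a large-sample normal approximation to the t distribution, the helper names (`invert`, `ols_p_values`, `forward_selection`) are hypothetical, and the data are synthetic, built so that only x1 has a real effect.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_p_values(cols, y):
    """Two-sided P-value (normal approximation) for each predictor column."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    mse = sse / (n - p)
    return [math.erfc(abs(beta[j] / math.sqrt(mse * inv[j][j])) / math.sqrt(2))
            for j in range(1, p)]

def forward_selection(data, y, sle=0.20):
    """Add the candidate with the lowest P-value while it is <= SLE."""
    selected, remaining = [], list(data)
    while remaining:
        # P-value of each candidate when added last to the current model
        trial = {name: ols_p_values([data[s] for s in selected] + [data[name]], y)[-1]
                 for name in remaining}
        best = min(trial, key=trial.get)
        if trial[best] > sle:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: x1 matters, x2 barely, x3 not at all
x1 = [-1, -1, -1, -1, 1, 1, 1, 1]
x2 = [-1, -1, 1, 1, -1, -1, 1, 1]
x3 = [-1, 1, -1, 1, -1, 1, -1, 1]
y = [10 + 5 * a + 0.1 * b + a * b + a * c for a, b, c in zip(x1, x2, x3)]
data = {"x1": x1, "x2": x2, "x3": x3}

selected = forward_selection(data, y)
print("Predictors entered:", selected)
```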
Stepwise Regression
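Select a significance level to stay (SLS) and a significance level to enter (SLE). Stepwise regression combines the forward and backward procedures.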
All Possible Regressions: Cp
Fits every possible model. If there are K potential predictor variables, there are 2^K - 1 models.
Label the Mean Square Error for the model containing all K predictors as MSE_K.
For each model, compute SSE and Cp, where p is the number of parameters (including the intercept) in the model: Cp = SSE/MSE_K - (n - 2p)
Select the model with the fewest predictors that has Cp ≤ p.
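
As a small worked illustration of the formula, the sketch below computes Cp for a few candidate models and picks the smallest one with Cp ≤ p. The SSE values, n, and K are hypothetical numbers chosen for the example, and the function names are assumptions.

```python
def count_models(k):
    """Number of non-empty subsets of k candidate predictors."""
    return 2 ** k - 1

def mallows_cp(sse, mse_full, n, p):
    """Cp = SSE / MSE_K - (n - 2p), with p parameters including the intercept."""
    return sse / mse_full - (n - 2 * p)

n, K = 20, 3          # hypothetical sample size and number of predictors
mse_full = 2.0        # hypothetical MSE of the model with all K predictors

# (label, p, SSE) for some candidate models; values are made up
candidates = [
    ("X1", 2, 80.0),
    ("X1, X2", 3, 33.0),
    ("X1, X2, X3", 4, 32.0),   # full model: SSE = MSE_K * (n - 4)
]

results = [(label, p, mallows_cp(sse, mse_full, n, p)) for label, p, sse in candidates]
for label, p, cp in results:
    print(f"{label}: p = {p}, Cp = {cp}")

# Fewest predictors among the models with Cp <= p
acceptable = [(p, label) for label, p, cp in results if cp <= p]
best_model = min(acceptable)[1]
print("Models to fit in total:", count_models(K))
print("Selected:", best_model)
```

For the full model Cp always equals p, because its SSE is MSE_K times (n - p); the statistic is therefore only informative for the smaller models.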
Regression Diagnostics
Model assumptions:
Regression function is correctly specified (e.g., linear)
Conditional distribution of Y is normal
Conditional distribution of Y has constant standard deviation
Observations on Y are statistically independent
Residual plots can be used to check the assumptions:
Histogram (or stem-and-leaf plot) of the residuals should be mound-shaped (normal)
Plot of residuals versus each predictor should be a random cloud. A U-shaped (or inverted U) pattern indicates a nonlinear relation; a funnel shape indicates non-constant variance.
Plot of residuals versus time order (time series data) should be a random cloud. If a pattern appears, the observations are not independent.
Detecting Influential Observations
Studentized Residuals – Residuals divided by their estimated standard errors (like t-statistics). Observations with values larger than 3 in absolute value are considered outliers.
Leverage Values (Hat Diag) – Measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations with values larger than 2(k+1)/n are considered to be potentially highly influential, where k is the number of predictors and n is the sample size.
DFFITS – Measure of how much an observation has affected its fitted value from the regression model. Values larger than 2*sqrt((k+1)/n) in absolute value are considered highly influential. Use standardized DFFITS in SPSS.
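
For a single predictor these three measures have simple closed forms, so they can be sketched in plain Python. The example below is restricted to simple regression (k = 1) as a simplifying assumption, uses made-up data with one point far out in X, and computes DFFITS from the externally studentized residual, which is one common convention; the function name is hypothetical.

```python
import math

def simple_regression_diagnostics(x, y):
    """Leverages, studentized residuals, and DFFITS for y = a + b*x."""
    n, p = len(x), 2                      # p = k + 1 parameters, k = 1
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    # Leverage: distance of each x from the mean of x
    lev = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    # Residual divided by its estimated standard error
    stud = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, lev)]
    # Externally studentized residual, then DFFITS
    ext = [r * math.sqrt((n - p - 1) / (n - p - r * r)) for r in stud]
    dffits = [t * math.sqrt(h / (1 - h)) for t, h in zip(ext, lev)]
    return lev, stud, dffits

# Made-up data: the last observation has an unusual X value
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 40.3]
n, k = len(x), 1

lev, stud, dffits = simple_regression_diagnostics(x, y)
high_leverage = [i for i, h in enumerate(lev) if h > 2 * (k + 1) / n]
outliers = [i for i, r in enumerate(stud) if abs(r) > 3]
high_dffits = [i for i, d in enumerate(dffits) if abs(d) > 2 * math.sqrt((k + 1) / n)]

print("High leverage:", high_leverage)
print("Outliers:", outliers)
print("High DFFITS:", high_dffits)
```

Note that a high-leverage point is only potentially influential: here the last observation sits far from the other X values but still lies close to the overall trend.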
Detecting Influential Observations
DFBETAS – Measure of how much an observation has affected the estimate of a regression coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values larger than 2/sqrt(n) in absolute value are considered highly influential.
Cook’s D – Measure of aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than 4/n are considered highly influential.
COVRATIO – Measure of the impact of each observation on the variances (and standard errors) of the regression coefficients and their covariances. Values outside the interval 1 +/- 3(k+1)/n are considered highly influential.
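
Of these, Cook's D is the easiest to compute by hand. The sketch below covers only Cook's D, and only for simple regression (k = 1), using the standard form D = (r^2 / p) * h / (1 - h), where r is the studentized residual, h the leverage, and p = k + 1. The data are made up so that the last observation pulls the line away from the other points; the function name is hypothetical.

```python
def cooks_distance(x, y):
    """Cook's D for each observation in the simple regression y = a + b*x."""
    n, p = len(x), 2                      # p = k + 1 parameters, k = 1
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    lev = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    # r^2 = e^2 / (MSE * (1 - h)); D = (r^2 / p) * h / (1 - h)
    return [(e * e / (mse * (1 - h))) / p * h / (1 - h)
            for e, h in zip(resid, lev)]

# Made-up data: the last point has unusual X and falls well below the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.0]

cooks = cooks_distance(x, y)
threshold = 4 / len(x)
flagged = [i for i, d in enumerate(cooks) if d > threshold]
print("Cook's D:", [round(d, 3) for d in cooks])
print("Flagged (D > 4/n):", flagged)
```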
Obtaining Influence Statistics and Studentized Residuals in SPSS
1. Choose ANALYZE, REGRESSION, LINEAR, and input the Dependent variable and the set of Independent variables from your model of interest (possibly chosen via an automated model selection method).
2. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics, and All Cases, then CONTINUE.
3. Under PLOTS, select Y:*SRESID and X:*ZPRED. Also choose HISTOGRAM. These give a plot of studentized residuals versus standardized predicted values, and a histogram of standardized residuals (residual/sqrt(MSE)). Select CONTINUE.
4. Under SAVE, select Studentized Residuals, Cook’s, Leverage Values, Covariance Ratio, Standardized DFBETAS, Standardized DFFITS. Select CONTINUE. The results will be added to your original data worksheet.
Variance Inflation Factors
Variance Inflation Factor (VIF) – Measure of how highly correlated each independent variable is with the other predictors in the model. Used to identify multicollinearity.
Values larger than 10 for a predictor imply large inflation of the standard errors of the regression coefficients due to this variable being in the model.
Inflated standard errors lead to small t-statistics for partial regression coefficients and wider confidence intervals.
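
The VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. As a minimal sketch, the example below assumes a model with exactly two predictors, in which case R_j^2 is simply their squared correlation and both predictors share the same VIF. The data and the function name are made up for illustration.

```python
def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 * s12 / (s11 * s22)
    return 1 / (1 - r_squared)

# Made-up predictors that move almost in lockstep
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 11]

vif = vif_two_predictors(x1, x2)
print(f"VIF = {vif:.1f}")
print("Multicollinearity concern (VIF > 10):", vif > 10)
```

With more than two predictors, each R_j^2 needs its own multiple regression, which is what the Collinearity Diagnostics option in a statistics package reports.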
Nonlinearity: Polynomial Regression
When the relation between Y and X is not linear, polynomial models can be fit that approximate the relationship within a particular range of X.
General form of the model: E(Y) = α + β1 X + … + βk X^k
Second-order model (most widely used case, allows one “bend”): E(Y) = α + β1 X + β2 X^2
Be very careful not to extrapolate beyond the observed X levels.
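
A second-order model is still linear in its coefficients, so it can be fit by ordinary least squares with X and X^2 as the predictors. The sketch below does this in plain Python by solving the normal equations; the helper names `solve` and `fit_polynomial` are hypothetical, and the data are generated from a known quadratic so the fit can be checked.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_polynomial(xs, ys, degree=2):
    """Least-squares coefficients [alpha, beta1, ..., beta_degree]."""
    powers = range(degree + 1)
    xtx = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in powers]
    return solve(xtx, xty)

# Data generated from E(Y) = 1 + 2X + 3X^2 (illustrative, no noise)
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

alpha, beta1, beta2 = fit_polynomial(xs, ys, degree=2)
print(f"E(Y) = {alpha:.2f} + {beta1:.2f} X + {beta2:.2f} X^2")
```

The fitted curve is only trustworthy between the smallest and largest observed X (here -2 to 2); outside that range the quadratic term can dominate in ways the data never supported.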
Generalized Linear Model (GLM)
General class of linear models that are made up of 3 components: Random, Systematic, and Link Function.
Random component: Identifies the dependent variable (Y) and its probability distribution.
Systematic component: Identifies the set of explanatory variables (X1, ..., Xk).
Link function: Identifies a function of the mean μ that is a linear function of the explanatory variables: g(μ) = α + β1 X1 + … + βk Xk
Random Component
Conditionally normally distributed response with constant standard deviation: the regression models we have fit so far.
Binary outcomes (success or failure): the random component has a Binomial distribution and the model is called Logistic Regression.
Count data (number of events in a fixed area and/or length of time): the random component has a Poisson distribution and the model is called Poisson Regression.
Continuous data with a skewed distribution and variation that increases with the mean can be modeled with a Gamma distribution.
Common Link Functions
Identity link (form used in normal and gamma regression models): g(μ) = μ
Log link (used when μ cannot be negative, as when data are Poisson counts): g(μ) = log(μ)
Logit link (used when μ is bounded between 0 and 1, as when data are binary): g(μ) = log(μ/(1 - μ))
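
The three links and their inverses are short enough to write out directly. The sketch below uses hypothetical function names; the inverse functions map a linear predictor back to the scale of the mean, which is how a fitted model produces predictions.

```python
import math

def identity_link(mu):
    return mu

def log_link(mu):
    """Requires mu > 0 (e.g. the mean of a Poisson count)."""
    return math.log(mu)

def logit_link(mu):
    """Requires 0 < mu < 1 (e.g. a probability of success)."""
    return math.log(mu / (1 - mu))

def inverse_log(eta):
    """Always positive, whatever the linear predictor eta is."""
    return math.exp(eta)

def inverse_logit(eta):
    """Always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-eta))

# A linear predictor of 0 corresponds to a mean count of 1 and a probability of 0.5
print(inverse_log(0.0), inverse_logit(0.0))

# Round trips: applying a link and then its inverse returns the original mean
print(inverse_log(log_link(3.0)))
print(inverse_logit(logit_link(0.8)))
```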
Exponential Regression Models
Often when modeling growth of a population, the relationship between population and time is exponential: E(Y) = μ = αβ^X
Taking the logarithm of each side leads to the linear relation: log(μ) = log(α) + X log(β) = α' + β'X
Procedure: Fit a simple regression relating log(Y) to X, then transform back.
Fitted line on the log scale: predicted log(Y) = a + bX
Back-transformed estimates: α-hat = e^a, β-hat = e^b
Fitted curve: Y-hat = α-hat × (β-hat)^X
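
The procedure can be sketched in a few lines of plain Python: regress log(Y) on X, then exponentiate the intercept and slope. The function name `fit_exponential` is hypothetical, and the data are generated from a known curve (starting value 2, growth factor 1.5) so the back-transformed estimates can be checked.

```python
import math

def fit_exponential(xs, ys):
    """Fit log(Y) = a + b*X by least squares; return (e^a, e^b)."""
    logs = [math.log(y) for y in ys]       # requires all Y > 0
    n = len(xs)
    mx, ml = sum(xs) / n, sum(logs) / n
    b = (sum((x - mx) * (l - ml) for x, l in zip(xs, logs))
         / sum((x - mx) ** 2 for x in xs))
    a = ml - b * mx
    return math.exp(a), math.exp(b)

# Illustrative data from Y = 2 * 1.5^X (no noise)
xs = [0, 1, 2, 3, 4, 5]
ys = [2.0 * 1.5 ** x for x in xs]

alpha_hat, beta_hat = fit_exponential(xs, ys)
print(f"Y-hat = {alpha_hat:.3f} * {beta_hat:.3f}^X")

# Predicted value at X = 6 from the back-transformed curve
prediction = alpha_hat * beta_hat ** 6
print(f"Prediction at X = 6: {prediction:.3f}")
```

Here β-hat is the estimated growth factor per unit of X, so a value of 1.5 means the expected population grows by 50% each period.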