# Regression analysis

## What is regression analysis?

Regression analysis is a statistical method used to examine the relationship between one dependent variable and one or more independent variables. It helps to understand how the value of the dependent variable changes when any of the independent variables are varied, while keeping the other independent variables constant.

The primary goal of regression analysis is to model the relationship between the variables, make predictions based on this relationship, and determine the strength and direction of the relationship. There are several types of regression analysis, including linear regression, multiple regression, logistic regression, and others. Each type serves different purposes and is suitable for different situations.

In a linear regression, for example, a straight line is used to represent the relationship between a dependent variable (y) and an independent variable (x). The line is defined by an equation that describes how changes in the independent variable affect the dependent variable. This equation can then be used to make predictions about the dependent variable’s values based on new independent variable data.
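To make the idea of "fitting a straight line" concrete, here is a minimal sketch in plain Python. The function name `fit_line` and the data points are made up for illustration and are not from this article; the slope and intercept formulas are the standard least-squares ones.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data: y is roughly 2x + 1 with a little noise.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for a new x value
print(f"y = {slope:.3f}x + {intercept:.3f}; prediction at x = 6: {prediction:.2f}")
```

Once the slope and intercept are known, predicting for new data is just a matter of plugging the new x value into the equation.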

Regression analysis is used across various fields such as business, economics, medicine, and social science. Uses include:

1. Predicting the value of a dependent variable based on the values of independent variables.
2. Estimating the strength of the relationship between two variables.
3. Testing hypotheses concerning the relationship between two variables.
4. Controlling for the impact of confounding variables.

In sum, regression analysis is an invaluable tool for anyone working with data.

## Regression analysis example

In statistics, it can be challenging to make sense of a table of raw numbers. Suppose you’re asked to predict this year’s snowfall in your town, knowing that global warming may be reducing average snowfall.

You might estimate around 10-20 inches. While that’s a reasonable guess, you could improve it by using regression analysis.

Essentially, regression analysis is the “best guess” approach for making predictions based on a given dataset. It involves fitting a line (or curve) to data points on a graph. There are numerous tools available for running regression, including Excel, which can help clarify the snowfall data. The above graph was created in Excel.

By examining the regression line running through the data, you can refine your initial guess. You’ll notice that the upper end of the original estimate (about 20 inches) was significantly off. For 2015, the line seems to fall between 5 and 10 inches. While this might be “good enough,” regression also provides a helpful equation, such as:

y = -2.2923x + 4624.4

This equation allows you to input an x value (the year) and obtain a reasonably accurate snowfall estimate for any given year. For example, for 2005:

y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is quite close to the actual figure of 30 inches for that year.

Best of all, you can use the equation to make predictions, like estimating snowfall for 2017:

y = -2.2923(2017) + 4624.4 = 0.8 inches.
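The arithmetic is easy to check in a few lines of Python. This is only a sketch: the function name `predict_snowfall` is hypothetical, and the coefficients are the ones from the fitted equation quoted above (note that the slope is negative).

```python
# Coefficients of the regression line y = -2.2923x + 4624.4 quoted in the text.
SLOPE = -2.2923
INTERCEPT = 4624.4


def predict_snowfall(year):
    """Return the snowfall (in inches) that the fitted line gives for a year."""
    return SLOPE * year + INTERCEPT


for year in (2005, 2017):
    print(year, round(predict_snowfall(year), 4))
```

Keep in mind that a straight line with a negative slope eventually drops below zero (this one does so for 2018), so predictions far beyond the years in the data should be treated with caution.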

Additionally, regression analysis provides an R-squared value, which for this graph is 0.702. This number indicates how well your model fits the data. R-squared values range from 0 to 1, with 0 meaning the model explains none of the variation in the data and 1 meaning it explains all of it. As you can see, 0.7 is a fairly decent model, so you can be reasonably confident in your weather prediction.
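As a rough illustration of where an R-squared value comes from, the sketch below compares the model's squared errors with the squared errors you would get by always guessing the average. The function name `r_squared` and the numbers are made up for illustration; they are not the snowfall data behind the graph.

```python
def r_squared(actual, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    # Error left over after using the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Error you would get by always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up observed values and the values a fitted line predicted for them.
actual = [30.0, 25.0, 22.0, 18.0, 15.0]
predicted = [29.0, 26.0, 21.0, 19.0, 15.0]

r2 = r_squared(actual, predicted)
print(f"R-squared: {r2:.3f}")
```

A model that predicts every point exactly gives 1, and a model that is no better than guessing the mean gives 0.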

Overall, regression analysis is a widely used technique in various fields, such as economics, finance, social sciences, and natural sciences, to analyze data and make informed decisions.

## What is multiple regression?

Multiple regression analysis is used to determine if there is a statistically significant relationship between sets of variables and to identify trends in those data sets.

Multiple regression analysis is quite similar to simple linear regression, with the key difference being the number of predictors (“x” variables) used. Simple regression analysis has a single “X” variable for each dependent “Y” variable (e.g., X1, Y1). In contrast, multiple regression uses multiple “X” variables for each dependent variable (e.g., X1, X2, X3, Y1).

In general, you should use linear regression when yo

In single-variable linear regression, you input one dependent variable (e.g., “sales”) against an independent variable (e.g., “profit”). However, you may be interested in examining how different types of sales affect the regression. In this case, you could designate X1 as one type of sales, X2 as another type, and so on.

Ordinary linear regression often fails to account for all real-life factors affecting an outcome. For example, a graph might display a relationship between life expectancy of women and the number of doctors in a population. While it may seem that increasing the number of doctors would improve life expectancy, other factors, such as doctors’ education levels, experience, or access to medical facilities, should also be considered. Incorporating these additional factors requires adding more independent variables to the regression analysis, resulting in a multiple regression analysis model.

Some things to keep in mind when deciding whether to use linear regression or multiple regression:

• Number of observations: If you have a small number of observations, you may not have enough power to detect the effects of multiple independent variables. In this case, you may want to use linear regression.
• Complexity of the relationship between the variables: If the dependent variable depends on several factors, or on a curved relationship that needs extra terms (such as a squared predictor), you may want to use multiple regression. Simple linear regression with a single predictor can only model a straight-line relationship.
• Expertise of the analyst: If the analyst is not familiar with multiple regression, they may want to use linear regression. Multiple regression is more complex than linear regression and can be more difficult to interpret.

## Running multiple regression

Regression analysis is usually conducted using software such as IBM SPSS, R or MATLAB. The output varies depending on the number of variables but is fundamentally similar to the output found in simple linear regression – just more extensive:

• Simple regression: Y = b0 + b1 x.
• Multiple regression: Y = b0 + b1 x1 + b2 x2 + … + bn xn.

The output would include a summary, similar to a summary for simple linear regression, containing:

• R (the multiple correlation coefficient),
• R squared (the coefficient of determination),
• The standard error of the estimate.

These statistics help determine how well a regression model fits the data. The ANOVA table in the output provides the p-value and F-statistic.
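To show what sits behind that output, here is a minimal, hand-rolled sketch in plain Python. It is not how SPSS, R or MATLAB are implemented; it simply solves the least-squares normal equations for Y = b0 + b1 x1 + b2 x2 and reports the three summary statistics listed above. All function names and the data are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def summarize(rows, ys, coefs):
    """Return R, R squared and the standard error of the estimate."""
    preds = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in rows]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = max(0.0, 1 - ss_res / ss_tot)  # guard against tiny negative rounding
    dof = len(ys) - len(coefs)  # observations minus estimated coefficients
    return {
        "R": math.sqrt(r2),
        "R squared": r2,
        "standard error of the estimate": math.sqrt(ss_res / dof),
    }


# Made-up data: two predictors (x1, x2) per observation, roughly y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
ys = [9.2, 7.9, 19.1, 17.8, 29.0]

coefs = fit_multiple_regression(rows, ys)
summary = summarize(rows, ys, coefs)
print("coefficients:", [round(b, 3) for b in coefs])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

In practice you would let a statistics package do this, since it also reports the p-values and F-statistic that this sketch leaves out.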

One area that can be a challenge is deciding on the correct sample size for multiple regression. The answer to the sample size question seems to partly rely on the researcher’s objectives, the research questions being addressed, and the type of model used. Numerous research articles and textbooks offer recommendations for minimum sample sizes in multiple regression, but few reach a consensus on what constitutes a large enough sample size, and not many discuss the prediction aspect of multiple linear regression (MLR) [1].

If you want to run multiple regression to gain a general understanding of trends without needing highly precise estimates, you can use a rule of thumb: It is commonly recommended in the literature to have over 100 items in your sample. While this may sometimes suffice, it is safer to have at least 200 observations, or even better, more than 400.

## How to avoid overfitting

Overfitting occurs when a model is too complex for the given data, often due to a small sample size. Including numerous predictor variables in a regression model can result in a seemingly significant model that fits the data’s peculiarities well but fails to fit additional test samples or the overall population. Consequently, the model’s p-values, R-squared, and regression coefficients can be misleading, as they overstate the model’s performance based on a limited dataset.

To avoid overfitting in modeling, ensure you have at least 10-15 observations for each term being estimated. If this guideline is not followed, overfitting may occur. Terms include interaction effects, polynomial expressions (for modeling curved lines), and predictor variables. Green [3] suggests a minimum sample size of 50 for any regression, with an additional eight observations per term.
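These rules of thumb are simple enough to turn into a quick check. The sketch below is only an illustration: the function names and the example model are hypothetical, and the formulas are just the two guidelines as stated in the paragraph above.

```python
def rule_of_thumb_range(num_terms):
    """Observations suggested by the 10-15 per term guideline, as (low, high)."""
    return 10 * num_terms, 15 * num_terms


def green_minimum(num_terms):
    """Green's suggestion as described above: 50 plus 8 per term."""
    return 50 + 8 * num_terms


# Hypothetical model: 3 predictor variables, 1 interaction effect and
# 1 squared (polynomial) term, which makes 5 terms in total.
num_terms = 3 + 1 + 1

low, high = rule_of_thumb_range(num_terms)
print(f"10-15 rule: {low} to {high} observations")
print(f"Green: at least {green_minimum(num_terms)} observations")
```

Remember to count every estimated term, not just the predictor variables, when applying either guideline.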

Exceptions to the “10-15” rule of thumb exist, such as when multicollinearity is present in the data or if the effect size is small. In these cases, more observations per term may be required, though there is no specific rule of thumb for determining how many extra observations are needed.

To detect and avoid overfitting, consider the following strategies:

1. Increase your sample size by collecting more data or reduce the number of predictors in your model by combining or eliminating them.
2. Use cross-validation to detect overfitting, partition your data, generalize your model, and select the best-performing model. Predicted R-squared is one form of cross-validation.
3. Employ shrinkage and resampling techniques to determine how well your model might fit a new sample.
4. Avoid relying on automated stepwise regression as a solution for small datasets, as it can lead to numerous issues.
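As one illustration of the cross-validation idea, the sketch below computes a predicted R-squared for a simple linear regression by leaving out one observation at a time, refitting, and predicting the point that was left out. The function names and data are made up for illustration, and real statistical software may calculate this value differently.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ordinary_r_squared(xs, ys):
    """R-squared of a line fitted to, and scored on, the same data."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def predicted_r_squared(xs, ys):
    """Leave-one-out version: each point is predicted by a model that never saw it."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    press = 0.0  # sum of squared leave-one-out prediction errors
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1 - press / ss_tot


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.1, 5.9, 8.3, 9.8, 12.2]

r2 = ordinary_r_squared(xs, ys)
pred_r2 = predicted_r_squared(xs, ys)
print(f"R-squared: {r2:.4f}, predicted R-squared: {pred_r2:.4f}")
```

A predicted R-squared that is much lower than the ordinary R-squared is a warning sign that the model fits the sample better than it would fit new data.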

Remember, increasing the sample size or reducing the number of predictors are the most effective ways to prevent overfitting in regression models.

## History of regression analysis

Regression analysis has a long history, with roots around the turn of the 19th century. The least squares method, a statistical technique for estimating model parameters from a given dataset, was first published by the French mathematician Adrien-Marie Legendre in 1805. Carl Friedrich Gauss, a German mathematician, published his own account in 1809 and stated that he had been using the method since 1795.

In the 19th century, British scientist Francis Galton introduced the concept of correlation, a statistical measure reflecting the strength of the relationship between two variables. Additionally, Galton developed the idea of regression to the mean, a phenomenon where offspring of parents with extreme traits tend to be closer to the average for that trait.

Throughout the 20th century, several statisticians, including Ronald Fisher, Karl Pearson, and Jerzy Neyman, contributed further advancements to regression analysis. Fisher popularized the p-value, the probability of obtaining results at least as extreme as those observed if the null hypothesis is true. Pearson developed the chi-square test, a statistical comparison of observed data to expected data. Neyman, working with Egon Pearson, formalized hypothesis testing, a statistical framework for evaluating hypotheses about variable relationships.

## Related articles

Least squares method

Logistic regression

Regression coefficients

## References

[1] Glen, S. (2013). Statisticshowto.com. Used with permission.

[3] Green, S. B. (1991). “How many subjects does it take to do a regression analysis?” Multivariate Behavioral Research 26:499–510.
