What is logistic regression?
Logistic regression, also called the logit model, estimates the probability of an event occurring based on a given data set. Its S-shaped distribution function is similar to that of the standard normal distribution, but the probabilities are easier to calculate.
Logistic regression is often used to predict the probability of a binary outcome (e.g., yes/no, pass/fail), given a single measurement variable (e.g., height, weight). Logistic regression can also be applied to ordinal data — variables with more than two ordered categories, such as survey data — but this is not common. This type of regression is especially useful for classification problems, where you are trying to determine which category new data fits best into. For example, in cyber security, logistic regression is useful in binary classification problems, such as detecting threats, where there are only two classes (threat or not a threat). In natural language processing, this method is the baseline supervised machine learning algorithm for classification.
The logistic regression model is a non-linear transformation of linear regression. More specifically, it is a transformation of the log odds, log(p / (1 − p)), which has an unbounded range. Formally, it is defined as

log(p / (1 − p)) = β0 + x · β

Solving for p, we get

p = 1 / (1 + e^−(β0 + x · β))
Logistic regression predicts probabilities rather than placing data neatly into classes. In other words, it estimates the probability of belonging to a certain category, given a set of predictor variables. The decision boundary is the solution of β0 + x · β = 0 and separates the two predicted classes.
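As a concrete sketch, the probability formula and decision boundary can be written in a few lines of Python. The coefficient values below are made up purely for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients (illustration only).
beta0 = -4.0   # intercept
beta1 = 0.5    # slope for a single predictor x

def predict_prob(x):
    """Estimated probability of belonging to class 1, given predictor x."""
    return sigmoid(beta0 + beta1 * x)

# The decision boundary is where beta0 + beta1 * x = 0, i.e. x = 8 here.
print(predict_prob(8.0))         # exactly on the boundary -> 0.5
print(predict_prob(2.0) < 0.5)   # below the boundary -> class 0 side
print(predict_prob(14.0) > 0.5)  # above the boundary -> class 1 side
```

Note that the model only outputs a probability; turning it into a class label requires choosing a cutoff (0.5 is conventional but not mandatory).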
Types of logistic regression
Which type you choose depends on the nature of the categorical response variable.
- Binary logistic regression predicts the probability of an outcome with two possible values, such as passing or failing a test, responding yes or no to a survey, or having high or low blood pressure.
- Multinomial logistic regression can model scenarios with more than two possible discrete outcomes:
- Nominal logistic regression is used for three or more categories with no natural ordering to levels. Examples of nominal responses include business departments (e.g., HR, IT, Sales), type of degree sought (e.g., computer science, math, English), and hair color (black, brown, red).
- Ordinal logistic regression is used when there are three or more categories with a natural ordering to the levels, but the intervals between levels don’t have to be equal. For example, ordinal responses could be how employees rate their manager’s effectiveness (e.g., good, fair, poor), or heat levels of hot sauces (e.g., mild, medium, hot).
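For a rough sense of how a nominal (multinomial) model behaves, here is a minimal sketch on made-up data with three unordered classes, assuming scikit-learn is installed (the data values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one predictor, three unordered classes (0, 1, 2).
X = np.array([[1.0], [1.5], [2.0], [5.0], [5.5], [6.0], [9.0], [9.5], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

model = LogisticRegression().fit(X, y)

# predict_proba returns one probability per class; they sum to 1.
probs = model.predict_proba([[5.2]])
print(model.predict([[5.2]]))   # most likely class for x = 5.2
print(round(probs.sum(), 6))    # 1.0
```

An ordinal model would instead exploit the ordering of the levels (e.g., via cumulative logits), which plain multinomial regression ignores.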
Why not use ordinary linear regression?
- With linear regression, predicted values will become greater than one or less than zero if you move far enough along the x-axis. As probabilities can only range from 0 to 1, this is nonsensical and can lead to issues when predicted values are used in subsequent analysis. Logistic regression does not have this problem.
- One of the assumptions of regression is homoscedasticity — the variance of the dependent variable is constant across all values of the independent variable. This assumption doesn’t hold for binary variables, because the variance of a binary variable is PQ; the maximum variance is .25, reached when 50 percent of the data are 1s. The variance decreases as you move toward more extreme values. For example, when P = .10, the variance is .1 × .9 = .09; as P approaches one or zero, the variance approaches zero.
- Significance testing of the logistic regression coefficient weights assumes that prediction errors (Y – Y’) are normally distributed. Because the dependent variable Y only takes the values 0 and 1, this assumption is difficult to justify. Therefore, tests of regression weights are suspect if you use linear regression with a binary dependent variable.
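The first point above can be demonstrated numerically. Fitting an ordinary least-squares line to a 0/1 outcome (synthetic data below) produces "probabilities" outside [0, 1], while the logistic function is bounded by construction. The logistic value here simply reuses the OLS linear predictor to show the bounding; it is not a fitted logistic model:

```python
import numpy as np

# Synthetic 0/1 outcome against a single predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Ordinary least squares: y ≈ a + b * x (polyfit returns slope first).
b, a = np.polyfit(x, y, deg=1)

linear_pred = a + b * 20.0            # far to the right on the x-axis
print(linear_pred > 1.0)              # True: a "probability" above 1

logistic_pred = 1.0 / (1.0 + np.exp(-(a + b * 20.0)))
print(0.0 < logistic_pred < 1.0)      # True: always stays in (0, 1)
```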
Logistic regression vs. ANOVA and t-tests
A one-way ANOVA or Student’s t-test can be used to compare the means of two groups on a measurement variable, when groups are defined by a nominal variable. In comparison, logistic regression is used to predict the probability of a binary outcome (e.g., whether or not a person has a heart attack) given a measurement variable (e.g., BMI). That’s not to say that you can’t use all three with the same data; the difference is in what types of questions you want answered.
For example, suppose you want to investigate the relationship between BMI and heart attack risk in 60-year-old women. You could:
- Is BMI associated with heart attack risk? Use a one-way ANOVA to test the hypothesis that there is no difference in mean BMI between women who have had a heart attack and those who have not.
- Is the difference in mean BMI between the two groups statistically significant? Use a Student’s t-test to compare the mean BMIs of the two groups.
- Which women are at high risk of heart attack? Use logistic regression to predict the probability of a woman of a certain age having a heart attack in the next decade, given her BMI.
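The third question can be sketched as follows, with entirely made-up BMI data and assuming scikit-learn is available; this illustrates the workflow, not a real risk model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: BMI values and a 0/1 outcome (1 = had a heart attack).
bmi = np.array([[19.0], [21.0], [23.0], [24.0], [26.0],
                [28.0], [31.0], [33.0], [36.0], [40.0]])
heart_attack = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(bmi, heart_attack)

# Estimated probability of a heart attack for a woman with BMI 35.
p = model.predict_proba([[35.0]])[0, 1]
print(0.0 < p < 1.0)   # a probability, not a hard class label
```

Unlike the ANOVA and t-test answers, the output here is an individual-level probability, which is exactly what a risk screening question calls for.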
In general, logistic regression is the appropriate tool for predicting binary outcomes, since ANOVA and t-tests compare group means rather than produce predictions. However, ANOVA and t-tests can be useful for understanding the relationship between a measurement variable and a nominal variable, even if the relationship is not strong enough to be statistically significant in a logistic regression model.
References
Hilbe, J. (2016). Practical Guide to Logistic Regression. CRC Press.
An Introduction to Logistic Regression. Retrieved October 27, 2023 from: https://www.appstate.edu/~whiteheadjc/service/logit/intro.htm
Edgar, R. & Manz, O. (2017). Exploratory Study. Research Methods for Cyber Security.
Logistic Regression. Retrieved October 27, 2023 from: http://faculty.cas.usf.edu/mbrannick/regression/Logistic.html
Penn State. Logistic Regression. Retrieved October 28, 2023 from: https://online.stat.psu.edu/stat462/node/207/
Jurafsky, D. & Martin, J. (2023). Speech and Language Processing. Draft. Retrieved October 27, 2023 from: https://web.stanford.edu/~jurafsky/slp3/5.pdf