
In statistics, the term **degrees of freedom** (**df**) is a measure of how much freedom you have when selecting values for your data sample. More specifically, it is the maximum number of values in the sample that can vary independently.

## Calculating Degrees of Freedom

There are two methods for calculating degrees of freedom, though both involve subtracting 1:

- Using the formula **df = N – 1**, where N is the number of items in your data sample. So, if your sample contains four items, your degrees of freedom would be 3 (4 – 1 = 3).
- Using the formula **df = k – 1**, where k is the number of parameters being estimated. For example, if you’re estimating both the mean weight loss for a low-carb diet and the population standard deviation, k would be 2 (one for the mean and one for the standard deviation). Therefore, df = 2 – 1 = 1.
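Both formulas are simple enough to sketch in Python. A minimal sketch (the function names are illustrative, not from any library):

```python
def df_from_sample_size(n: int) -> int:
    """df = N - 1, where N is the number of items in the sample."""
    return n - 1

def df_from_parameters(k: int) -> int:
    """df = k - 1, where k is the number of parameters being estimated."""
    return k - 1

print(df_from_sample_size(4))  # four items -> 3
print(df_from_parameters(2))   # two estimated parameters -> 1
```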

It’s important to note that degrees of freedom are not always finite whole numbers. For example, some procedures (such as the Welch approximation for two samples with unequal variances) produce fractional degrees of freedom, and the normal distribution can be treated as a t-distribution with infinite degrees of freedom, which is why some tables list a df of ∞.

## Why do we subtract 1?

Degrees of freedom is the number of values in a dataset that are free to vary. What does “free to vary” mean? Consider an example using the mean (average):

- Choose a set of three numbers with a mean (average) of 10. Some possible sets of numbers include: 9, 10, 11 or 8, 10, 12 or 5, 10, 15.
- Once you’ve selected the first two numbers in the set, the third one becomes fixed. In other words, you cannot choose the third item in the set freely. The only numbers that can vary freely are the first two. You can select 9 + 10 or 5 + 15, but once you’ve made that decision, you must choose a specific number that will result in the desired mean. Therefore, the degrees of freedom for a set of three numbers is two.
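The “free to vary” idea can be checked with a few lines of Python: once the target mean and the first two values are chosen, the third value is forced (a hypothetical sketch, not a library function):

```python
def forced_third_value(first: float, second: float, target_mean: float) -> float:
    """For three numbers with a given mean, the third value is fixed
    once the first two are chosen: third = 3 * mean - (first + second)."""
    return 3 * target_mean - (first + second)

print(forced_third_value(9, 10, 10))  # -> 11.0, giving the set 9, 10, 11
print(forced_third_value(5, 15, 10))  # -> 10.0, giving the set 5, 15, 10
```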

In general, when finding a confidence interval for a single sample, the degrees of freedom are n – 1, where ‘n’ represents the number of items, classes, or categories.

We only subtract 1 for single-sample analysis. For two samples, subtract 2 using this formula:

**Degrees of Freedom (Two Samples): (N₁ + N₂) – 2**
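As a quick sketch of the two-sample formula (illustrative function name):

```python
def df_two_samples(n1: int, n2: int) -> int:
    """Degrees of freedom for a pooled two-sample test: (N1 + N2) - 2."""
    return n1 + n2 - 2

# Two samples with 10 and 12 observations:
print(df_two_samples(10, 12))  # -> 20
```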

## ANOVA degrees of freedom

Degrees of freedom become slightly more complex in ANOVA tests. Unlike a simple parameter (such as a mean), ANOVA tests involve comparing known means within data sets. For instance, in a one-way ANOVA, you compare two means in two cells. The grand mean (the average of the averages) would be: (Mean 1 + Mean 2) / 2 = grand mean. If you chose Mean 1 and knew the grand mean, you wouldn’t have a choice regarding Mean 2, so your **degrees of freedom for a two-group ANOVA is 1.**

**Two Group ANOVA df1 = k – 1**, where k is the number of groups (here, 2 – 1 = 1).

For a three-group ANOVA, you can vary two means, so the degrees of freedom are 2.

In reality, it’s a bit more complicated because there are two degrees of freedom in ANOVA: df1 and df2. The explanation above is for df1. In ANOVA, df2 is the total number of observations in all cells minus the degrees of freedom lost because the cell means are set.

**Two Group ANOVA df2 = N – k**

In this formula, N is the total number of observations and “k” is the number of cell means, groups, or conditions. For example, if you had 200 observations and four cell means, the degrees of freedom would be df2 = 200 – 4 = 196.
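The two ANOVA degrees of freedom can be sketched together (illustrative function name, one-way ANOVA assumed):

```python
def anova_df(total_observations: int, num_groups: int) -> tuple:
    """One-way ANOVA degrees of freedom:
    df1 (between groups) = k - 1; df2 (within groups) = N - k."""
    df1 = num_groups - 1
    df2 = total_observations - num_groups
    return df1, df2

# The worked example from the text: 200 observations, four cell means.
print(anova_df(200, 4))  # -> (3, 196)
```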

## Why do critical values decrease when degrees of freedom increase?

The short answer:

Degrees of freedom are related to the sample size (df = n – 1), so if the degrees of freedom increase, the sample size must also be increasing. As the sample size grows, the graph of the t-distribution develops lighter tails, pushing the critical value towards the mean.

Let’s examine the t-score formula in a hypothesis test:

**t = (x̄ – μ) / (s / √n)**

As *n* (sample size) increases, the t-score also increases. This occurs due to the square root in the denominator: as it becomes larger, the fraction s/√n decreases, resulting in a larger t-score. Although degrees of freedom are defined as *n* – 1, the t-critical value doesn’t increase; instead, it decreases, which may seem counter-intuitive.

However, consider the purpose of a t-test. It is used when the population standard deviation is unknown, and consequently, the graph’s shape is uncertain. The graph could have short, fat tails or long, thin ones. Degrees of freedom influence the shape of the t-distribution graph; as the degrees of freedom increase, the area in the tails shrinks. As degrees of freedom approach infinity, the t-distribution resembles a normal distribution, allowing certainty about the standard deviation (which is 1 in a standard normal distribution).
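The shrinking tails can be seen numerically (this sketch assumes SciPy is installed): the two-tailed 5% critical value falls toward the standard normal value of about 1.96 as the degrees of freedom grow.

```python
from scipy.stats import norm, t

# Two-tailed 5% critical values: the upper 2.5% point of each t-distribution.
for df in (1, 4, 10, 30, 100):
    print(df, round(t.ppf(0.975, df), 3))

# As df -> infinity, the t critical value approaches the normal one (~1.96).
print("normal", round(norm.ppf(0.975), 3))
```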

Suppose you repeatedly sampled weights from four individuals, drawn from a population with an unknown standard deviation. After measuring their weights and calculating the mean difference between the sample pairs, you repeat the process. The small sample size of four will result in a t-distribution with fat tails, indicating a higher probability of extreme values (in this context, extreme values are data points or observations significantly distant from a central point such as the mean or median) in your sample. You test your hypothesis at a 5% alpha level, cutting off the last 5% of the distribution.

Compared to this fat-tailed t-distribution, the normal distribution assigns a lower probability to extreme values, so its cutoff sits closer to the mean: for a two-tailed test, the 5% alpha level cuts off at a critical value of about 1.96 (often rounded to 2) for the normal distribution.

## Degrees of freedom history

The idea behind degrees of freedom (DF) traces back to the work of Carl Friedrich Gauss, a German mathematician and astronomer, in the early 1800s. Gauss used the underlying concept in developing the least squares method, which remains widely used in statistics today; the term “degrees of freedom” itself was popularized later by Ronald Fisher.

Degrees of freedom are applicable in various statistical methods such as hypothesis testing, confidence intervals, and regression analysis:

- In hypothesis testing, DF is used to compute the p-value, a measure of evidence against the null hypothesis. A low p-value signifies evidence strong enough to **reject the null hypothesis.**
- In confidence intervals, DF helps determine the interval’s width. A broader interval suggests greater uncertainty about the true value of the parameter.
- In regression analysis, DF is used to calculate the standard errors of the estimated coefficients. A higher DF indicates smaller standard errors, meaning the estimated coefficients are more precise.
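The confidence-interval point can be illustrated with a short sketch (hypothetical function name; assumes SciPy is installed): for a fixed sample standard deviation, the interval’s half-width shrinks as the sample size, and hence the degrees of freedom, grows.

```python
import math

from scipy.stats import t

def ci_half_width(sample_std: float, n: int, conf: float = 0.95) -> float:
    """Half-width of a t-based confidence interval for a mean:
    t_crit * s / sqrt(n). Both factors shrink as n (and df = n - 1) grow."""
    df = n - 1
    t_crit = t.ppf(0.5 + conf / 2, df)
    return t_crit * sample_std / math.sqrt(n)

# Same sample standard deviation, different sample sizes:
print(round(ci_half_width(10.0, 5), 2))   # small n: wide interval
print(round(ci_half_width(10.0, 50), 2))  # larger n: narrower interval
```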

The concept of DF is fundamental in statistics, being used in numerous statistical methods and crucial for interpreting statistical analysis results.

Other key events in the history of degrees of freedom include:

- 1805: Adrien-Marie Legendre publishes the least squares method independently of Gauss.
- 1900: Karl Pearson introduces the chi-square test.
- 1908: William Sealy Gosset, writing as “Student,” publishes the t-distribution.
- 1920s: Ronald Aylmer Fisher develops analysis of variance and formalizes the modern concept of degrees of freedom.
- 1930s: Jerzy Neyman and Egon Pearson establish the modern framework of hypothesis testing, and Neyman formulates confidence intervals.
- 1950s: William Kruskal and W. Allen Wallis publish their nonparametric rank test.
- 1960s: John Tukey and Peter Huber advance robust statistics.
- 1970s: John Tukey introduces exploratory data analysis.
- 1979: Bradley Efron develops the bootstrap.

The concept of degrees of freedom is a potent tool for analyzing various data types, making it an invaluable resource for anyone working with data.
