Empirical distribution function (EDF, ECDF)


What is an empirical distribution?

While many probability distributions are theoretical, an empirical distribution has its values and associated probabilities determined by observation or experiment [1]. The term “empirical” refers to something that is observed or experienced. Empirical distributions, therefore, are distributions of data that have been observed or collected, like data from random samples.

Figure: The distribution of a fair die-rolling experiment is uniform.

Consider a basic experiment: rolling a six-sided die multiple times and recording the results. We are assuming the die is fair, meaning each face (numbered 1 through 6) has an equal chance of appearing. When we visualize the distribution of probabilities for each possible outcome (each face of the die), we find that each face has a 1/6 probability of appearing. This distribution follows a uniform distribution because all outcomes have the same probability. This is a theoretical probability distribution – it is not based on observed data but on the theoretical probability of each outcome.

Empirical distributions, in contrast, are based on observed data. For example, if we roll a die 10 times and plot the results, the plot may not look like the theoretical probability distribution. However, as we increase the number of rolls, the empirical histogram will begin to look like the uniform probability distribution. This observation leads us to a general principle called the law of averages (also known as the law of large numbers), which states that if a chance experiment is repeated independently under identical conditions, the proportion of times an event occurs will get closer to the theoretical probability of the event as the number of repetitions grows. In this die-rolling example, if we roll a die a large number of times, the proportion of times we roll a four will get closer and closer to 1/6. This principle applies under the condition that each repetition of the experiment is performed the same way, regardless of the results of all other repetitions.
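The law of averages can be checked with a short simulation. The following is a minimal Python sketch, assuming a fair die simulated with the standard library's random module; the function name proportion_of_face and the seed value are illustrative choices, not part of the original example.

```python
import random

def proportion_of_face(face, n_rolls, rng):
    """Roll a simulated fair six-sided die n_rolls times and return
    the proportion of rolls that show `face`."""
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return hits / n_rolls

rng = random.Random(42)  # arbitrary fixed seed so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    p = proportion_of_face(4, n, rng)
    print(f"{n:>7} rolls: proportion of fours = {p:.4f} (theory: {1/6:.4f})")
```

With only 10 rolls the proportion can be far from 1/6; with many rolls it should settle close to it.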

The empirical distribution is fairly basic: if we have n data points, the empirical distribution of the data places probability 1/n on each of the n data points, and probability 0 elsewhere. More formally, we can define it with a vector of numbers x = (x_1, …, x_n). The empirical distribution of x is the probability distribution with expectation [2]:

$$E_n\{g(X)\} = \frac{1}{n} \sum_{i=1}^{n} g(x_i)$$
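Because the empirical distribution puts probability 1/n on each data point, an expectation under it is simply an average over the data. The sketch below illustrates this in Python; the function name empirical_expectation and the four data values are hypothetical, chosen only for illustration.

```python
def empirical_expectation(x, g=lambda v: v):
    """Expectation of g(X) under the empirical distribution of x:
    the average of g over the data points, each weighted 1/n."""
    n = len(x)
    return sum(g(v) for v in x) / n

x = [2, 4, 4, 5]  # hypothetical data

mean = empirical_expectation(x)                       # g(v) = v gives the sample mean
second_moment = empirical_expectation(x, lambda v: v ** 2)
variance = second_moment - mean ** 2                  # variance under the empirical distribution (divides by n)

print(mean, second_moment, variance)
```

Note that the variance computed this way divides by n, not n − 1, because it is the variance of the empirical distribution itself.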

What is an empirical distribution function (EDF)?

The empirical distribution function (EDF) maps each value to the probability that a randomly selected sample member will be less than or equal to that value. It is a step function that jumps up by 1/n at each value in the sample, where n is the sample size.

The EDF is defined as:

F(x) = (number of observations ≤ x) / (total number of observations)

where:

  • x is the value at which the EDF is evaluated (it need not be one of the sample values).

To calculate the EDF:

  • Order the data from smallest to largest values
  • Count the number of observations that are less than or equal to each value.

The EDF is then the fraction of observations that are less than or equal to each value.

For example, suppose we have a sample of 10 observations: (−15, −8, 8, 3, −7, 4, −12, 5, −10, −11). The EDF is:

x        F(x)
< −15    0
−15      0.1
−12      0.2
−11      0.3
−10      0.4
−8       0.5
−7       0.6
3        0.7
4        0.8
5        0.9
≥ 8      1

Here F(x) is the number of observations less than or equal to x, divided by 10.

The EDF shows that 10% of the observations are less than or equal to −15, 20% of the observations are less than or equal to −12, and so on.
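The values for this sample can be reproduced with a few lines of code. This is a minimal Python sketch of the counting definition given earlier; the function name edf is an illustrative choice.

```python
def edf(sample, x):
    """Fraction of observations in `sample` that are less than or equal to x."""
    return sum(1 for obs in sample if obs <= x) / len(sample)

sample = [-15, -8, 8, 3, -7, 4, -12, 5, -10, -11]

# Evaluate the EDF at each observed value, in ascending order.
for value in sorted(sample):
    print(f"F({value:>3}) = {edf(sample, value):.1f}")

# The EDF is defined for any x, not only for sample values.
print(edf(sample, -20))  # below the smallest observation
print(edf(sample, 0))    # between -7 and 3
```

Evaluating at a point between two observations returns the value of the step to its left, which is why the EDF is flat between data points.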

The EDF is a useful tool for understanding the distribution of data. It can be used to visualize the data, to get a better understanding of the shape of the distribution, and to make inferences about the population from which the sample was drawn.

What is an empirical cumulative distribution function (ECDF)?

The empirical cumulative distribution function (ECDF) is a type of probability model used to represent observed data. The ECDF is basically the same as a cumulative distribution function (CDF), but it models actual (observed) data rather than hypothetical data; the word “Empirical” means the function deals with real observations rather than theoretical ones.

The ECDF can be defined more formally as follows. Given a set of order statistics (y_1 < y_2 < … < y_n) from an observed (empirical) random sample, the ECDF is a sum of indicator random variables, scaled by 1/n:

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(y_i \le x)$$

where I is the indicator function [3], equal to 1 when y_i ≤ x and 0 otherwise.

The main purpose of an empirical cumulative distribution function is to visualize the overall shape of your sample’s distribution and compare it with other distributions. For example, you can use an ECDF to compare two different samples to see if they have similar shapes or if one looks more “normal” than another. Additionally, you can use an ECDF to identify outliers in your sample, as these points are often easily spotted on an ECDF graph due to their large distances from neighboring points.

A graph of the ECDF plots the values from a sample on the x-axis and their corresponding cumulative probabilities on the y-axis. In other words, it plots each value from your sample against the proportion of sample values that are less than or equal to it. Let’s say you have a set of experimental (observed) data x1, x2, …, xn. The ECDF will give you the fraction of sample observations less than or equal to a particular value of x.

For example, the next image shows the ECDF for deaths by horsekick in Prussian cavalry corps, 1875-94:

The ECDF plots cumulative probabilities of real life data.

The ECDF does not take into account any theoretical distributions; instead, it simply models what has been observed in your sample data. The ECDF is typically used to determine things like the frequency of occurrence for data points in a given sample. It can also be used to compare different samples of data and to analyze their distributions.

The ECDF is built by plotting each data point in your sample on a graph where the x-axis represents the observed values and the y-axis represents the proportion of observations that are less than or equal to each value. To calculate each point, count the observations that are less than or equal to that value and divide by n, where n is the total number of observations in your sample. This gives the proportion of observations at or below that value. Once you have calculated all these points, plot them to get your ECDF curve, which you can then use to analyze your data set further.
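The plotting coordinates described above can be computed as follows. This is a minimal Python sketch using only the standard library; the function name ecdf_points and the five data values are hypothetical. It groups tied values so that each distinct value appears once, with a correspondingly larger jump.

```python
from collections import Counter

def ecdf_points(sample):
    """Return (xs, ys): the distinct sample values in ascending order and,
    for each, the proportion of observations less than or equal to it."""
    n = len(sample)
    counts = Counter(sample)
    xs = sorted(counts)
    ys = []
    running = 0
    for x in xs:
        running += counts[x]    # tied values add a jump of (count / n)
        ys.append(running / n)
    return xs, ys

xs, ys = ecdf_points([3, 1, 2, 2, 5])  # hypothetical data with one tie
for x, y in zip(xs, ys):
    print(x, y)
```

The resulting pairs can be passed to a plotting library as a step plot; with matplotlib, for instance, one option is plt.step(xs, ys, where="post"), which holds each height until the next data value.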


Cumulative vs CDF vs ECDF

Cumulative probabilities, cumulative distribution function (CDF), and empirical cumulative distribution function (ECDF) are terms that are sometimes used interchangeably, but they have different meanings.

  • Cumulative probabilities refer to the probability that a random variable is less than or equal to a certain value. For example, the cumulative probability of rolling a fair six-sided die and getting a value less than or equal to 3 is 3/6 or 0.5.
  • The cumulative distribution function (CDF) maps each value x to its cumulative probability. It is defined as F(x) = P(X ≤ x), where X is a random variable and F(x) is the cumulative probability that X ≤ x.
  • An empirical cumulative distribution function (ECDF) is a non-parametric estimate of the underlying CDF of a sample of data. In other words, it is an approximation or estimate of the CDF and tells you what theoretical distribution your actual data might fit. It is defined as the proportion of data points that are less than or equal to a certain value, and is built by ordering the data points in increasing order and plotting the proportion of data points less than or equal to each value. Thus, the ECDF is a step function that jumps up by 1/n at each data point, where n is the sample size.
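To illustrate the ECDF as an estimate of the CDF, the sketch below compares the ECDF of simulated die rolls with the theoretical CDF of a fair die and reports the largest gap between them. It is a Python sketch under stated assumptions: the die is simulated with the standard library's random module, and the names die_cdf, ecdf and max_gap are illustrative.

```python
import random

def die_cdf(x):
    """Theoretical CDF of a fair six-sided die: P(X <= x)."""
    if x < 1:
        return 0.0
    return min(int(x), 6) / 6

def ecdf(sample, x):
    """Proportion of observations in `sample` that are <= x."""
    return sum(1 for obs in sample if obs <= x) / len(sample)

def max_gap(sample):
    """Largest absolute difference between the ECDF and the die CDF.
    Both are step functions that change only at the faces 1..6,
    so checking those six points is enough."""
    return max(abs(ecdf(sample, face) - die_cdf(face)) for face in range(1, 7))

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
for n in (10, 100, 10_000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    print(f"n = {n:>6}: largest gap between ECDF and CDF = {max_gap(rolls):.4f}")
```

The gap should tend to shrink as the sample grows, which is the sense in which the ECDF approximates the underlying CDF.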

In conclusion, the empirical cumulative distribution function (ECDF) is a useful tool for visualizing and analyzing sample data. By plotting each value from your sample against its corresponding probability, you can quickly compare two different samples and identify any outliers in your data set.

More on CDF vs EDF

A cumulative distribution function (CDF) is also known as an analytical distribution function because it models hypothetical distributions rather than observed ones like an ECDF does. The main difference between the two graphs is that while both plot probabilities against values, the CDF uses theoretical probabilities whereas the ECDF uses proportions computed from actual counts. The CDF of a continuous variable such as age or height is a continuous curve, whereas an ECDF is always a step function because a sample contains only finitely many values. Both require values that can be ordered, so neither applies to purely categorical variables such as gender or political affiliation.

Figure: A visualization of an empirical cumulative distribution function (ECDF). The grey bars show the samples corresponding to the ECDF and the green curve is the theoretical distribution from which the samples have been drawn.

How to create an ECDF in Excel

You probably won’t want to calculate the ECDF by hand, especially for a large number of data points. Here’s how to create an ECDF in Excel:

Suppose you had a sample of 50 observations. The following example uses Excel to demonstrate how to work the formula:

  1. Enter your data into column A, then sort in ascending order.
  2. In column B, type k/n, where k is the rank of the observation in the sorted data (1 for the smallest value, 2 for the next, and so on) and n is the sample size (50, in this example).

Then, to compare your data to another distribution, enter the values into column C:

Comparing values to a gamma distribution.

In this example, I’m comparing to the gamma distribution. To enter values from the gamma distribution into Excel:

  • Type =GAMMA.DIST( into an empty cell.
  • Type the x value. For example, if you want to evaluate the distribution at x = 6, the function becomes =GAMMA.DIST(6
  • Type your α and β values, separated by commas. For example, if your α is 3 and β is 2, the function becomes: =GAMMA.DIST(6, 3, 2
  • Type FALSE, close the parentheses and then hit the enter key. The function =GAMMA.DIST(6, 3, 2, FALSE) will return the probability density at x = 6 as 0.112020904. To get the cumulative probability instead, which is the quantity an EDF estimates, type TRUE in place of FALSE.

Now that you have two columns of probabilities (your EDF in column B and the gamma in column C), you can compare them. One way to do this is with a scatter plot.
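For readers working outside Excel, the same comparison can be sketched in Python with only the standard library. The data values below are hypothetical, and the function names gamma_pdf and gamma_cdf_integer_shape are illustrative. The CDF helper uses a closed form that is valid only when the shape α is a positive integer, as in the α = 3, β = 2 example; it is not a general gamma CDF.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and scale beta (the density form,
    matching the value 0.112020904 quoted for x=6, alpha=3, beta=2)."""
    return x ** (alpha - 1) * math.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))

def gamma_cdf_integer_shape(x, alpha, beta):
    """Gamma CDF for a positive INTEGER shape alpha (Erlang closed form)."""
    t = x / beta
    partial = sum(t ** j / math.factorial(j) for j in range(alpha))
    return 1 - math.exp(-t) * partial

data = sorted([2.1, 9.4, 4.7, 6.0, 3.3])  # hypothetical observations (column A)
n = len(data)
for k, x in enumerate(data, start=1):
    edf_value = k / n                             # column B: k/n
    cdf_value = gamma_cdf_integer_shape(x, 3, 2)  # column C: cumulative gamma probability
    print(f"x = {x:4.1f}   EDF = {edf_value:.2f}   gamma CDF = {cdf_value:.4f}")

print(gamma_pdf(6, 3, 2))  # density at x = 6
```

Comparing the EDF column against cumulative probabilities, rather than densities, keeps both columns on the same 0 to 1 scale.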

References

[1] Empirical distributions. Retrieved July 5, 2023 from: https://www.unf.edu/~cwinton/html/cop4300/s09/class.notes/EmpiricalDistributions_ppt.pdf

[2] Geyer, C. Stat 5102 Lecture Slides: Deck 1 Empirical Distributions, Exact Sampling Distributions, Asymptotic Sampling Distributions. Retrieved July 5, 2023 from: https://www.stat.umn.edu/geyer/5102/slides/s1.pdf

[3] Mahmoud, H. (2000). Sorting: A Distribution Theory. John Wiley & Sons.
