
A **random variable** is a variable whose value is unknown in advance and determined by chance. Random variables are similar to the familiar *x* and *y* of algebra, but they are represented by capital letters, such as *X* or *Y*, and are connected to random processes. A random process is an event or experiment with unpredictable outcomes, such as rolling a die, drawing a ball from an urn, or testing an experimental drug.

More formally, a random variable is a function that assigns a numerical value to each of an experiment’s outcomes. A function is a mathematical relationship that assigns each input value to exactly one output value, often represented as an equation or a graph.

In mathematical terms, a random variable is a mapping from a sample space (the set of all possible outcomes of an experiment) to the real numbers. For example, let’s say you flip a coin three times. The sample space for this experiment would be {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, which are all the possible ways the coin could land after three flips. The random variable *X* could represent the number of heads that come up in those three flips. In this case, *X* would take on the values 0, 1, 2, or 3. So, *X* = 0 corresponds to the single outcome {TTT}; *X* = 1 (exactly one head) corresponds to the outcomes {HTT, THT, TTH}; and so on.
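
The mapping from outcomes to values can be sketched in a few lines of Python. This is only an illustration of the coin-flip example; the names `sample_space`, `number_of_heads`, and `outcomes_by_value` are made up for this sketch.

```python
from collections import defaultdict
from itertools import product

# Sample space for three coin flips: every sequence of H and T.
sample_space = ["".join(flips) for flips in product("HT", repeat=3)]


def number_of_heads(outcome: str) -> int:
    """The random variable X: maps an outcome to its count of heads."""
    return outcome.count("H")


# Group the outcomes by the value that X assigns to them.
outcomes_by_value = defaultdict(list)
for outcome in sample_space:
    outcomes_by_value[number_of_heads(outcome)].append(outcome)

for value in sorted(outcomes_by_value):
    print(f"X = {value}: {outcomes_by_value[value]}")
```

Each of the eight outcomes is assigned exactly one value, which is what makes *X* a function on the sample space.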

## Discrete vs Continuous random variables

Random variables can be classified as either *discrete* or *continuous*:

- **Discrete random variables** have countable values, meaning the possible values can be listed one by one, such as the integers 1 through 10, the number of people in a room, or the number of cars on a street.
- **Continuous random variables** can take any value within a continuous range (e.g., from 10 to infinity), so they have infinitely many possible values. Height and weight are two examples of continuous random variables.

| Feature | Discrete Random Variable | Continuous Random Variable |
|---|---|---|
| Possible values | Finite or countable | Uncountably infinite |
| Probability of a single value | Between 0 and 1 | 0 |
| Probability distribution | Discrete probability distribution | Continuous probability distribution |
| Example | Number of tails when flipping a coin | Temperature |
| Applications | Counting, gambling, statistics | Engineering, physics, chemistry, finance |

*Table comparing discrete and continuous random variables [1].*

## Use of Random Variables

Random variables are used in many areas of statistics, from hypothesis testing to regression analysis. They can also be used to model a wide variety of phenomena from coin flip outcomes to the number of customers who arrive at a store in a given hour.

A couple of use case examples:

- **Econometric analysis:** This is used to study economic phenomena using mathematical and statistical techniques. In econometric analysis, regressions are commonly run using data from surveys or experiments in order to estimate relationships between different economic quantities (e.g., employment and inflation). The dependent variable in these regressions is typically modeled as a linear function of independent random variables representing different economic quantities (e.g., output).
- **Regression analysis:** This technique is used to predict future events based on past events that are represented by variables in a dataset. It estimates associations between variables, which do not by themselves establish cause-and-effect relationships. Common applications include sales forecasting and demographic trend predictions. The dependent variable in regression analysis is also typically modeled as a linear function of independent random variables representing different predictor variables (e.g., advertising spend).

## Calculating variance of a random variable

The formula for calculating the variance of a discrete random variable is:

$$\sigma^2 = \sum_i (x_i - \mu)^2 f(x_i)$$

Where

- $\sum$ = summation notation (add everything up),
- $\mu$ = expected value (the mean),
- $x_i$ = the *i*-th value the random variable can take,
- $f(x_i)$ = the probability of that value (in function notation), also written as $P_i$.

**Example**: Find the variance of a random variable *X*, obtained from a television factory line, where the following probability distribution gives the number of rejects for every 100 televisions produced:

- *x*: 2, 3, 4, 5, 6
- *f*(*x*): 0.01, 0.25, 0.4, 0.3, 0.04

Step 1: Multiply each value of *x* by *f*(*x*) and add them up to find the mean, μ:

$$\mu = 2(0.01) + 3(0.25) + 4(0.4) + 5(0.3) + 6(0.04) = 4.11$$

Step 2: Insert each *x*-value and probability into the variance formula $\sigma^2 = \sum_i (x_i - \mu)^2 f(x_i)$ along with the mean from Step 1:

$$\sigma^2 = (2 - 4.11)^2(0.01) + (3 - 4.11)^2(0.25) + (4 - 4.11)^2(0.4) + (5 - 4.11)^2(0.3) + (6 - 4.11)^2(0.04) \approx 0.74$$

The variance of the random variable is approximately 0.74.
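
The two steps can be checked with a short Python sketch. It assumes the probabilities sum to 1, so the probability of 6 rejects is taken as 0.04; the helper names `mean` and `variance` are chosen for this illustration only.

```python
import math

# Number of rejects per 100 televisions and their probabilities.
# Assumption: the probabilities sum to 1, so f(6) is taken as 0.04.
x_values = [2, 3, 4, 5, 6]
probabilities = [0.01, 0.25, 0.4, 0.3, 0.04]


def mean(values, probs):
    """Step 1: expected value, the sum of x * f(x)."""
    return sum(x * p for x, p in zip(values, probs))


def variance(values, probs):
    """Step 2: the sum of (x - mu)^2 * f(x)."""
    mu = mean(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))


# A valid probability distribution must sum to 1.
assert math.isclose(sum(probabilities), 1.0)

mu = mean(x_values, probabilities)
var = variance(x_values, probabilities)
print(f"mean = {mu:.2f}")
print(f"variance = {var:.2f}")
```

Rounded to two decimal places, this gives a mean of 4.11 and a variance of 0.74, in line with the worked example.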

## Continuous random variable PDF/CDF

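The formula for the **variance** of a continuous random variable uses integration (from calculus) in place of summation:

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$

Here, μ is the expected value and f(x) is the probability density function described next.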

The probability density function (PDF) of a continuous random variable, f(x), describes how probability is spread over the variable’s range; probabilities are obtained by integrating it [1].

The PDF, f(x), adheres to these two properties:

- f(x) ≥ 0 (f cannot be negative)
- ∫ f(x) dx = 1 (meaning the area under the curve equals 1)
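
Both properties can be checked numerically. The sketch below uses a hypothetical density, f(x) = 2x on [0, 1] and 0 elsewhere, which is not from the source, and a simple midpoint rule written for this illustration rather than a library integrator.

```python
def pdf(x: float) -> float:
    """Hypothetical density: f(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Property 1: the density is never negative (checked on a grid from -1 to 2).
grid = [i / 1000 - 1.0 for i in range(3001)]
non_negative = all(pdf(x) >= 0 for x in grid)

# Property 2: the total area under the curve is 1.
# The density is zero outside [0, 1], so integrating over [0, 1] is enough.
total_area = integrate(pdf, 0.0, 1.0)

print(f"non-negative on the grid: {non_negative}")
print(f"total area: {total_area:.4f}")
```

A grid check is only a spot check of non-negativity, not a proof; for this particular density the property is clear from the formula.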

However, the value of the PDF at a point is not itself a probability, so it does not directly give the probability of an event such as P(X < 5) or P(X = 6). To calculate probabilities for continuous random variables, we integrate the PDF over the range of interest [2]:

$$P(c \le X \le d) = \int_c^d f(x)\,dx$$

Here, f(x) represents the PDF.

Since the probability is an integral over an interval, it’s logical that the probability of any single value is zero: an interval of zero width has zero area above it. Another perspective: if a car’s length is measured with infinite precision, the likelihood of another car having the exact same length is zero.
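
As a rough numerical illustration, the probability of an interval is the area under the PDF over that interval, and the area over a single point is zero. The density f(x) = 2x on [0, 1] and the midpoint-rule helper are assumptions made for this sketch.

```python
def pdf(x: float) -> float:
    """Hypothetical density: f(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability of an interval: the area under the density between 0.2 and 0.5.
p_interval = integrate(pdf, 0.2, 0.5)

# Probability of a single value: the interval [0.5, 0.5] has zero width.
p_point = integrate(pdf, 0.5, 0.5)

print(f"P(0.2 <= X <= 0.5) = {p_interval:.2f}")
print(f"P(X = 0.5) = {p_point:.2f}")
```

For this density the exact interval probability is 0.5² − 0.2² = 0.21, even though the density at 0.5 equals 1; the density value and the probability of the point are different things.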

The cumulative distribution function (CDF) is determined by the integral:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
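
A minimal sketch of the CDF as a running integral of the density, F(x) = P(X ≤ x). It assumes the hypothetical density f(x) = 2x on [0, 1], for which the exact CDF on that interval is x²; the function names are illustrative.

```python
def pdf(x: float) -> float:
    """Hypothetical density: f(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def cdf(x: float) -> float:
    """F(x) = P(X <= x): the area under the density up to x."""
    # The density is zero outside [0, 1], so the upper limit can be clamped.
    upper = min(max(x, 0.0), 1.0)
    return integrate(pdf, 0.0, upper)


for x in (-1.0, 0.25, 0.5, 1.0, 2.0):
    print(f"F({x}) = {cdf(x):.4f}")

# Interval probabilities follow from the CDF: P(a < X <= b) = F(b) - F(a).
p_interval = cdf(0.5) - cdf(0.2)
```

The CDF starts at 0, never decreases, and reaches 1 once all of the probability has been accumulated.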

## Random variable history

The history of random variables can be traced back to the early days of probability theory. Pafnuty Chebyshev first introduced the concept of a random variable in the mid-19th century, defining it as “a real variable which can assume different values with different probabilities.”

Karl Pearson further developed the concept of random variables in the late 19th and early 20th centuries, creating various statistical techniques for analyzing data involving random variables, such as the chi-squared test and the correlation coefficient.

Andrey Kolmogorov established the modern theory of random variables in the early 20th century, providing a formal mathematical definition and demonstrating their application in modeling diverse phenomena.

Key events in the history of random variables include:

- 1888: Francis Galton publishes his first paper on correlation.
- 1900: Karl Pearson publishes his paper introducing the chi-squared test.
- 1933: Andrey Kolmogorov publishes his book *Grundbegriffe der Wahrscheinlichkeitsrechnung*, introducing the modern definition of a random variable.
- 1944: John von Neumann and Oskar Morgenstern publish their book *Theory of Games and Economic Behavior*, using random variables to model economic behavior.
- 1950: William Feller publishes the first volume of his book *An Introduction to Probability Theory and Its Applications*, a classic text on probability and statistics.

Today, random variables are crucial tools in various fields, including probability theory, statistics, machine learning, finance, and economics, used to model diverse phenomena ranging from coin flip outcomes to stock prices.

## References

[1] Kjos-Hanssen, B. Statistics for Calculus Students.

[2] Orloff, J. & Bloom, J. Continuous Random Variables. Retrieved April 29, 2021 from: https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading5b.pdf
