Probability refresher

The goal of probability theory is to quantify uncertainty. Once an event occurs, the outcome is known (e.g. it rained today or it didn’t). However, before an event happens, probability allows us to assess the likelihood of various potential outcomes (e.g. the chance that it will rain). Depending on the relationship between events, probability rules help us calculate the likelihood of a given outcome in a straightforward manner. A fundamental concept in probability is that of random variables, which assign numerical values to the outcomes of random processes. These random variables can take different forms: discrete, where the outcomes are countable (such as the number of heads in a series of coin flips), or continuous, where the outcomes can take any value within a given range (such as the height of a person). Understanding and applying probability theory equips us to make informed decisions based on the inherent uncertainty in real-world phenomena.

Random Variables

A random variable (also called a stochastic variable) is a function that assigns numerical values to the outcomes of a random event. These outcomes are drawn from a sample space, which is the set of all possible results. For instance, when flipping a fair coin, there are two equally likely outcomes: heads or tails. If we define a random variable X that assigns 1 to heads and 0 to tails, the probabilities for each are: P(H)=P(T)=½, which add to 1. Random variables can be discrete (countable outcomes) or continuous (any value within a given range).

In statistics, random variables are usually considered to take real number values, which allows us to define important quantities like the expected value (mean) and the variance.
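As a minimal sketch of these two quantities, we can compute the mean and variance of the coin-flip variable X directly from its probabilities. The dictionary name `pmf` and the use of exact fractions are choices made for this illustration only.

```python
from fractions import Fraction

# Random variable X for one fair coin flip: heads -> 1, tails -> 0.
pmf = {1: Fraction(1, 2), 0: Fraction(1, 2)}

# Expected value: each value weighted by its probability.
mean = sum(x * p for x, p in pmf.items())

# Variance: expected squared deviation from the mean.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean, variance)  # 1/2 1/4
```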

Distributions

A probability distribution lists all possible outcomes of a random variable along with their corresponding probabilities (which must sum to 1). Rather than listing every probability, we can define a function f(x) that gives the probability (or, for continuous variables, the probability density) of each outcome x in a random process. Some of the most common probability distributions are shown below:

[Figure: common probability distributions]

Another one that comes up is the power law distribution: one quantity varies as a power of another (\(y = x^{\alpha}\), where y is the dependent variable, x is the independent variable and \(\alpha\) is a constant). It has heavy tails (a few large values dominate, while many smaller values are common) and scale invariance (the system behaves similarly at different scales, so the pattern remains consistent even when zooming in or out). The extreme values or outliers (like the largest city or the richest person) have a disproportionate effect compared to the “average” values, making it important to understand and account for these distributions in analysis and modelling. Some examples:

When plotted, it typically produces a curve that starts high and quickly tapers off, which can make it difficult to directly see the relationship. However, if we plot the logarithm of both the variable and the frequency, the relationship becomes linear:

[Figure: power law]
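The linear relationship on log-log axes follows from taking logarithms of both sides: \(\log y = \alpha \log x\), a straight line with slope \(\alpha\). The sketch below checks this numerically. The exponent `alpha = -2.0` and the sample points `xs` are hypothetical values picked for illustration.

```python
import math

alpha = -2.0          # hypothetical exponent, chosen only for illustration
xs = [1, 2, 4, 8, 16]
ys = [x ** alpha for x in xs]

log_xs = [math.log(x) for x in xs]
log_ys = [math.log(y) for y in ys]

# Slope between consecutive points in log-log space.
slopes = [
    (log_ys[i + 1] - log_ys[i]) / (log_xs[i + 1] - log_xs[i])
    for i in range(len(xs) - 1)
]

print(slopes)  # every slope is (up to rounding) equal to alpha
```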

PMF / PDF / CDF

The distributions shown above are either a PMF or PDF depending on the outcome type:

| Metric | Formula | Description |
| --- | --- | --- |
| Probability Mass Function (PMF) | \(P(X = x)\) | Describes the probability distribution of a discrete random variable. The PMF gives the probability that a discrete random variable is exactly equal to some value. For example, in a coin toss, the probability of heads or of tails. |
| Probability Density Function (PDF) | \(f(x) = \frac{d}{dx} P(X \leq x)\) | Describes the probability distribution of a continuous random variable. The PDF gives the density of probability around \(x\) rather than a probability itself; its integral over an interval gives the probability of the variable falling within that interval. The total area under the curve equals 1. |
| Cumulative Distribution Function (CDF) | \(F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt\) | Gives the probability that a random variable is less than or equal to a certain value. For a discrete variable, it’s the sum of the probabilities of all outcomes less than or equal to \(x\). For a continuous variable, it’s the area under the PDF curve up to \(x\). |

What they might look like: [Figure: PMF vs PDF]
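For the discrete case, the link between PMF and CDF can be sketched with the number of heads in a series of fair coin flips. The choice of `n = 3` flips is an assumption made for this example; the CDF is just a running sum of the PMF.

```python
from fractions import Fraction
from math import comb

n = 3  # number of fair coin flips (illustrative choice)

# PMF: probability of exactly k heads in n flips.
pmf = {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}

# CDF: probability of at most k heads, built as a running sum of the PMF.
cdf = {}
running = Fraction(0)
for k in range(n + 1):
    running += pmf[k]
    cdf[k] = running

print([str(p) for p in pmf.values()])  # ['1/8', '3/8', '3/8', '1/8']
print([str(p) for p in cdf.values()])  # ['1/8', '1/2', '7/8', '1']
```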

Moments

Moments are used to describe the shape of a distribution. Each “moment” provides different information about the distribution’s characteristics. Here are the first few moments:

Expected Value

The expected value (EV) is the average value over the long run. We can find it by multiplying each numerical outcome by the probability of that outcome, and then summing those products together. Consider a game where we have a \(4/10\) chance of winning $2, a \(4/10\) chance of losing $5, and a \(2/10\) chance of winning $10. The EV is:

\(4/10 \times 2 + 4/10 \times (-5) + 2/10 \times 10 = 0.8 - 2 + 2 = 0.8\).

So on average, we expect to win 80 cents per game in the long run. In statistics, a similar concept is the population mean, which is usually estimated from samples. The sampling distribution is the probability distribution of a statistic (such as the sample mean) obtained from many samples.
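The same calculation can be sketched in code, together with a rough simulation of the "long run". The variable names, the random seed and the number of simulated games are arbitrary choices for this illustration.

```python
import random
from fractions import Fraction

# (probability, payoff in dollars) for each outcome of the game.
game = [(Fraction(4, 10), 2), (Fraction(4, 10), -5), (Fraction(2, 10), 10)]

# Exact expected value: sum of probability * payoff.
expected_value = sum(p * payoff for p, payoff in game)
print(float(expected_value))  # 0.8

# Long-run check by simulation (seed and sample size are arbitrary).
random.seed(0)
payoffs = [payoff for _, payoff in game]
weights = [float(p) for p, _ in game]
n_games = 200_000
total = sum(random.choices(payoffs, weights=weights, k=n_games))
average = total / n_games
print(average)  # close to 0.8, but not exactly equal
```

The simulated average drifts towards the exact value as the number of games grows, which is what "on average in the long run" means here.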

Set notation

A set is simply a collection of distinct elements. In probability, we often deal with sets to define events and sample spaces. The sample space (denoted as S) is the set of all possible outcomes (e.g. S={H, T}). An event is any subset of the sample space (e.g. the event that the coin lands on heads can be represented as A={H}). When describing relationships between events, we use set notation:

[Figure: set notation]

Marginal (unconditional) probability

This is just the probability of event A on its own, written \(P(A)\), without reference to any other event.

Complement rule

The probability that something doesn’t happen is 1 minus the probability that it does happen: \(P(A^c)= 1-P(A)\). The probability of something not happening is often easier to compute, so we can take advantage of that.

Example: Assuming the days are independent, what is the probability of at least 1 day of rain in the next 5 days if P(rain)=0.2 each day?
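One way to work this out is through the complement: "at least 1 day of rain" is the opposite of "no rain on all 5 days", and with independent days the latter is \(0.8^5\). A short sketch, with variable names chosen for this example:

```python
from fractions import Fraction

p_rain = Fraction(2, 10)  # probability of rain on any single day
days = 5

# Independence lets us multiply the daily "no rain" probabilities.
p_no_rain_at_all = (1 - p_rain) ** days

# Complement rule: P(at least one rainy day) = 1 - P(no rainy days).
p_at_least_one = 1 - p_no_rain_at_all

print(float(p_at_least_one))  # 0.67232
```

So there is roughly a 67% chance of at least one rainy day.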

Conditional Probability

Conditional probability is the probability of event A occurring, given that event B has already occurred. The formula for conditional probability is:

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]

In words: divide the joint probability of A and B by the marginal probability of B (which must be greater than 0).
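As a small illustration of the formula, consider one roll of a fair six-sided die. The die and the two events (A: the roll is even, B: the roll is greater than 3) are assumptions made for this sketch.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}            # one roll of a fair die
A = {x for x in sample_space if x % 2 == 0}  # event A: roll is even
B = {x for x in sample_space if x > 3}       # event B: roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

print(prob(A), p_a_given_b)  # 1/2 2/3
```

Knowing that the roll is greater than 3 raises the probability of an even roll from 1/2 to 2/3.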

Impact of dependence

Independent events are events where the occurrence of one event does not affect the occurrence of another. For example, in two independent coin flips, the outcome of the first flip does not affect the outcome of the second flip.

Dependent events are events where the occurrence of one event affects the probability of the other event. An example of this is drawing cards without replacement from a deck. If you draw a card and don’t replace it, the probability of drawing another specific card changes.
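The card example can be made concrete by comparing the probability of drawing two aces with and without replacement. This is a sketch assuming a standard 52-card deck with 4 aces; with replacement the draws are independent, so the probabilities simply multiply.

```python
from fractions import Fraction

aces, deck = 4, 52  # standard deck assumed

# Independent draws: the card is put back, so the second draw is unchanged.
with_replacement = Fraction(aces, deck) * Fraction(aces, deck)

# Dependent draws: one ace and one card are gone before the second draw.
without_replacement = Fraction(aces, deck) * Fraction(aces - 1, deck - 1)

print(with_replacement, without_replacement)  # 1/169 1/221
```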

Odds

Odds are a different way of expressing the likelihood of an event. While probability is the ratio of favorable outcomes to total outcomes, odds are typically expressed as the ratio of favorable outcomes to unfavorable outcomes. For example, if you have a 1 in 5 chance of winning, the odds are 1:4 (one favorable outcome for every four unfavorable ones). Odds magnify small probabilities, so they allow us to compare them more easily by focusing on whole numbers. E.g. comparing 0.01 to 0.005, it’s hard to see that one is twice as large, but comparing odds against of 99 to 1 versus 199 to 1 makes the relative difference easier to see.
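Converting between the two is a one-line calculation in each direction. The helper names `prob_to_odds` and `odds_to_prob` below are made up for this sketch.

```python
from fractions import Fraction

def prob_to_odds(p):
    """Odds in favor: favorable / unfavorable = p / (1 - p)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    return odds / (1 + odds)

print(prob_to_odds(Fraction(1, 5)))  # 1/4, i.e. odds of 1:4
print(odds_to_prob(Fraction(1, 4)))  # 1/5

# Odds against are the reciprocal of odds in favor.
against_a = 1 / prob_to_odds(Fraction(1, 100))
against_b = 1 / prob_to_odds(Fraction(1, 200))
print(against_a, against_b)  # 99 199
```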

Combinations and permutations

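Combinations and permutations are used to count the number of ways outcomes can be selected or arranged. Combinations count the ways to choose k items from n when order does not matter; permutations count the ways when order does matter:

\[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]

\[P(n, k) = \frac{n!}{(n-k)!}\]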
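Both formulas are available in Python's standard library (`math.comb` and `math.perm`, Python 3.8+). The sketch below cross-checks them against a brute-force count. The five-letter list and `k = 2` are arbitrary example values.

```python
from itertools import combinations, permutations
from math import comb, perm

items = ["A", "B", "C", "D", "E"]  # n = 5 example items
k = 2

n_combinations = comb(len(items), k)  # order does not matter
n_permutations = perm(len(items), k)  # order matters

# Brute-force enumeration for comparison.
listed_combinations = list(combinations(items, k))
listed_permutations = list(permutations(items, k))

print(n_combinations, n_permutations)  # 10 20
```

Each pair such as (A, B) counts once as a combination but twice as a permutation, (A, B) and (B, A), which is where the extra factor of \(k!\) comes from.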

Bayes’ Theorem

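Bayes’ Theorem is the consequence of rearranging the conditional probability formula. Since \(P(A \cap B) = P(B \mid A)P(A)\), substituting into \(P(A \mid B) = P(A \cap B)/P(B)\) gives: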

\[P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}\]

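Where \(P(A \mid B)\) is the posterior, \(P(B \mid A)\) is the likelihood, \(P(A)\) is the prior and \(P(B)\) is the evidence.

It can also be written as: posterior = likelihood × prior / evidence.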

Example: A box contains 8 fair coins (heads and tails) and 1 coin which has heads on both sides. I select a coin randomly and flip it 4 times, getting all heads. If I flip this coin again, what is the probability it will be heads?
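One way to approach this is to first update the probability that we are holding the double-headed coin, and then average the chance of heads over the two possibilities. A sketch with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

# Priors: 8 of the 9 coins are fair, 1 is double-headed.
prior_fair = Fraction(8, 9)
prior_double = Fraction(1, 9)

# Likelihood of seeing 4 heads in 4 flips under each hypothesis.
likelihood_fair = Fraction(1, 2) ** 4
likelihood_double = Fraction(1)

# Evidence: total probability of observing 4 heads.
evidence = prior_fair * likelihood_fair + prior_double * likelihood_double

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior_double = likelihood_double * prior_double / evidence
posterior_fair = 1 - posterior_double

# Probability the next flip is heads, averaged over both hypotheses.
p_next_heads = posterior_double * 1 + posterior_fair * Fraction(1, 2)

print(posterior_double, p_next_heads)  # 2/3 5/6
```

After four heads, the double-headed coin has gone from a 1/9 prior to a 2/3 posterior, so the next flip lands heads with probability 5/6.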

Simpson’s paradox

Simpson’s paradox is a counterintuitive phenomenon that occurs when trends observed in different groups of data reverse when the data is combined, meaning the aggregated data may lead to a different conclusion. This is a specific type of confounding, where the confounding variable might be the group size or the way the data is aggregated across different groups.

Example: Consider two doctors performing two types of surgeries. Doctor A has a higher success rate than Doctor B for both types of surgery individually, but performs the more difficult surgery more often (which has a lower success rate for both doctors), resulting in a lower overall success rate.
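To see how the reversal can arise, here is a sketch with made-up counts. These numbers are invented purely for illustration and are not real data. Each entry is (successes, operations).

```python
# Hypothetical (successes, operations) counts, invented for illustration only.
results = {
    "Doctor A": {"easy": (19, 20), "hard": (60, 80)},
    "Doctor B": {"easy": (72, 80), "hard": (14, 20)},
}

def rate(successes, total):
    return successes / total

def overall(doctor):
    """Success rate after pooling both surgery types together."""
    successes = sum(s for s, _ in results[doctor].values())
    total = sum(t for _, t in results[doctor].values())
    return rate(successes, total)

# Within each surgery type, Doctor A has the higher success rate.
for surgery in ("easy", "hard"):
    a = rate(*results["Doctor A"][surgery])
    b = rate(*results["Doctor B"][surgery])
    print(surgery, a, b)

# Pooled together, the ordering flips.
print(overall("Doctor A"), overall("Doctor B"))  # 0.79 0.86
```

Doctor A does 80 of 100 operations on the hard surgery, while Doctor B does only 20 of 100, so A's pooled rate is dragged down by the mix of cases rather than by worse performance.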

This is why it’s often useful to break down data into meaningful subgroups to get a more accurate understanding of relationships. We need to consider both the whole and the parts to get the full story.

Conclusion

Probability allows us to handle uncertainty by quantifying the likelihood of different outcomes. In statistics, it underpins the process of making inferences, testing hypotheses, and predicting future events. For hands-on practice, I would recommend using something like brilliant.org, as they have plenty of examples to work through.