To divide by n-1, or to not divide by n-1
Dec. 30, 2023 | Categories: Education, Inference | Last Modified: May 29, 2024, 8:19 p.m.
The Standard Deviation
If you've taken an introductory or elementary-level statistics class, you will be familiar with the equation:
\begin{equation}
s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}
\end{equation}
for the sample standard deviation of a set of data. However, at the same time, you will be told that the population standard deviation is written as:
\begin{equation}
\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{n}}
\end{equation}
So, here's the idea: when we want to estimate the standard deviation of a population from a sample of data, we use the first formula, and when we actually want to calculate the standard deviation of the data we have, we use the second. Sure, sounds great. But here's the problem: why? Too often in introductory statistics this issue is brushed over, and it leaves many students (such as myself) in a state of limbo where you kind of just guess which formula to use and when. So, in this post, I intend to explain the rationale behind these formulas at 5 (pending) different levels. For the purposes of demonstration, I will be using the height and weight dataset found here. Since male and female heights follow different distributions, we will only be working with female heights.
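If you want to follow along, here is a minimal sketch of the setup in Python. The filename and column names are assumptions about the CSV, so adjust them to match your copy of the dataset:

```python
# Minimal setup sketch; "weight-height.csv", "Gender", and "Height" are
# assumed names -- adjust them to match the actual file.
import pandas as pd

df = pd.read_csv("weight-height.csv")
heights = df.loc[df["Gender"] == "Female", "Height"].to_numpy()
print(len(heights), heights.mean(), heights.std())
```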
Level 1: No Statistics Knowledge
Let's say we have some data that we want to draw some conclusions from. For example, let's say we want to make some conclusions about the height of women in the US. Obviously, we can't collect the data for every single woman in America, so instead we settle for an estimate by sampling a smaller group.
Let's say we sample a group of around 100 women and then ask two questions:
- What is the average height for a woman in the US?
- How spread out is the data? That is, how far on average does a measurement fall from the average (since obviously, not every single woman has the same height)?
To answer the first, we use something called the sample mean, a measurement most people are familiar with. Essentially, we sum up all of our 100 data points and divide by 100, the number of people we sampled. That gives us our best estimate of the average height of a woman. Given enough samples, we will eventually approach the true mean!
Now, to answer the second, we use something called the sample standard deviation. Like the sample mean, we use the sample standard deviation to estimate the actual population standard deviation. This is defined as the square root of the average squared distance between each point and the mean of the dataset (in other words, the farther a point is from the mean, the more we penalize it).
When we estimate something (like the sample mean), we want to make sure our estimate is accurate. Unfortunately, if we define the sample standard deviation exactly the same way as the population standard deviation (dividing by the number of points), our estimate won't be as accurate as it is when we divide by one less than the number of points!
So, when estimating the standard deviation, we use:
\begin{equation}
\text{estimate of standard deviation}= \sqrt{\frac{\text{sum of squared distances in sample}}{\text{(number of points)} - 1}}
\end{equation}
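To make this concrete, here is a tiny sketch of that recipe in Python on a handful of made-up heights (the numbers are purely illustrative):

```python
# Made-up heights (in inches), purely for illustration.
heights = [63.2, 65.1, 61.8, 64.4, 66.0]

average = sum(heights) / len(heights)                      # the sample mean
squared_distances = [(h - average) ** 2 for h in heights]  # squared distance of each point from the mean

# Divide by (number of points - 1), then take the square root.
estimate = (sum(squared_distances) / (len(heights) - 1)) ** 0.5
print(average, estimate)
```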
Level 2: Elementary Statistics
When estimating the mean of a dataset, we use the familiar sample average:
\begin{equation}
\hat{\mu} = \frac{\sum_i x_i}{n}
\end{equation}
where \( n \) is the number of datapoints in your sample. As \( n \) grows infinitely large, our sample estimate will approach the actual mean \( \mu \). However, if we do the same with the standard deviation:
\begin{equation}
\hat{\sigma}_{\text{naive}} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}
\end{equation}
this estimate comes out systematically too small: for any fixed \( n \), its average over repeated samples is a scaled-down version of \( \sigma \). To correct for this scaling, we rescale the sum of squared deviations, which ends up being:
\begin{equation}
\hat{\sigma}_{\text{accurate}} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}
\end{equation}
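To see the scaling for yourself, here is a rough simulation sketch (the "true" mean and standard deviation are arbitrary choices): averaging the naive variance estimate over many samples lands near \( \frac{n-1}{n}\sigma^2 \), which is exactly the factor the \( n - 1 \) version corrects for.

```python
# Rough check: average the naive variance estimate over many samples of size n.
import numpy as np

rng = np.random.default_rng(0)
true_sigma, n, trials = 3.8, 10, 100_000   # arbitrary "true" values for this sketch

naive_vars = np.empty(trials)
for t in range(trials):
    x = rng.normal(loc=63.7, scale=true_sigma, size=n)
    naive_vars[t] = np.sum((x - x.mean()) ** 2) / n   # divide by n, the naive way

print(naive_vars.mean())                 # close to ((n - 1) / n) * sigma^2
print((n - 1) / n * true_sigma**2)       # the scaled-down target, about 12.996
print(naive_vars.mean() * n / (n - 1))   # rescaled: close to sigma^2 = 14.44
```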
Level 3: Introductory Statistics/AP Statistics
When discussing how good an estimator is on a sample dataset, we typically start by defining the bias of the estimator. That is, on average, how far away is our estimate from the actual value?
Formally, we use the Expectation operator "\( E \)", which essentially means "the average over all possibilities". So, let \( \hat\theta \) be our estimator and \( \theta \) be our estimand (thing we are trying to estimate). We define bias as:
\begin{equation}
\text{Bias} = E[ \hat\theta - \theta ]
\end{equation}
To help visualize bias, let's use the dataset introduced in the beginning. We can plot a histogram of the data and, for the sake of exploring the dataset, assume that our dataset is the total population of women. The data looks like the following:
Which gives us our true mean and standard deviation. Now, let's assume that we can't get the actual values and instead sample 10 randomly selected women from the population and compute their average height. Let's assume there are two ways of doing so: taking the sum and dividing by n (the correct way), or taking the sum and dividing by n - 1 (the incorrect way). We get:
So, on average, we have less bias when using the estimator we are familiar with (in fact, this estimator has a bias of exactly 0).
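Here is a sketch of that experiment. It uses a synthetic normal population as a stand-in for the real heights (the population parameters below are made up), but the same code works on the loaded dataset:

```python
# Approximate the bias E[theta_hat - theta] of the two candidate mean estimators
# by resampling from a stand-in population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(63.7, 2.7, size=5_000)   # stand-in for the female heights
true_mean = population.mean()

diff_n, diff_n_minus_1 = [], []
for _ in range(100_000):
    sample = rng.choice(population, size=10, replace=False)
    diff_n.append(sample.sum() / 10 - true_mean)          # divide by n
    diff_n_minus_1.append(sample.sum() / 9 - true_mean)   # divide by n - 1

print(np.mean(diff_n))            # hovers around 0
print(np.mean(diff_n_minus_1))    # clearly positive: dividing by n - 1 overshoots
```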
Now, let's do the same thing for standard deviation. And we'll compare two estimators:
$$\hat\sigma_{n - 1} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$$
$$\hat\sigma_{n} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$$
And we get:
The use of \( n - 1 \) gives an ever so slightly more accurate estimate on average. In fact, the squared version of this estimator (the sample variance) has a bias of exactly 0, as we'll see in Level 4.
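If you'd like to reproduce a comparison like this yourself, here is a sketch (again using a synthetic stand-in population, so the exact numbers will differ from the real dataset):

```python
# Compare the two standard deviation estimators by resampling from a stand-in population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(63.7, 2.7, size=5_000)   # stand-in for the female heights
true_sd = population.std()

est_n, est_n_minus_1 = [], []
for _ in range(100_000):
    sample = rng.choice(population, size=10, replace=False)
    ss = np.sum((sample - sample.mean()) ** 2)
    est_n.append(np.sqrt(ss / 10))        # divide by n
    est_n_minus_1.append(np.sqrt(ss / 9)) # divide by n - 1

print(np.mean(est_n) - true_sd)           # noticeably negative (underestimates)
print(np.mean(est_n_minus_1) - true_sd)   # much closer to 0
```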
So, intuitively, why do we divide by \( n - 1 \)? The real reason is a bit complicated and requires some knowledge of probability beyond the AP Statistics level. However, at the level of introductory statistics, you should be familiar with the concept of degrees of freedom, which essentially translates to "how many of the datapoints are free to vary in our formula?"
Let's take two examples. The first will be the sample average. If we collect \( n \) datapoints \( x_1, x_2, \ldots, x_n \), all \( n \) of these values can vary! So, we have \( n \) degrees of freedom, and when we make our estimator, we divide by our degrees of freedom:
$$\hat\mu = \frac{\text{sum of points}}{\text{degrees of freedom}} = \frac{\sum_i x_i}{n}$$
The second example is our sample standard deviation. Our sample standard deviation also involves \( n \) datapoints, but it additionally requires that we know the sample mean \( \bar{x} \). Because of this, if we know \( n - 1 \) of the data points already, we can solve for the last one from the formula for \( \bar{x} \). So, when making our estimator for \( \sigma \), we divide by \( n - 1 \) instead!
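To see this concretely, here is a tiny sketch: given the sample mean and all but one of the points, the remaining point is forced (the numbers are the same made-up heights as in the Level 1 sketch).

```python
# With n = 5 points and a known sample mean, the last point is not free to vary.
known = [63.2, 65.1, 61.8, 64.4]   # n - 1 of the points
x_bar = 64.1                        # the (known) sample mean

# Sum of all points = n * x_bar, so the missing point must be:
last = 5 * x_bar - sum(known)
print(last)                         # 66.0 -- the only value consistent with x_bar
```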
Level 4: College Probability and Inference
At the level of college probability and inference, we can now use knowledge of some statistical tools to make some pretty precise claims.
For one, it's important to note that expectation is linear. Second, we assume that the data in our sample are i.i.d. (independent and identically distributed). It is also nice to know that, if we assume our data comes from a normal distribution, the MLE of \( \sigma \) actually divides by \( n \) rather than \( n - 1 \). The proof for this is omitted, but can easily be found online!
Essentially, we can base our proof on the idea that we want our bias to be 0. That is,
$$E[\hat\sigma - \sigma] = 0$$
However, in statistics, we usually treat the variance as the quantity we actually care about, with the standard deviation just a function of it. And by Slutsky's theorem, if we find an unbiased estimator of \( \sigma^2 \), taking its square root gives an asymptotically unbiased estimator of \( \sigma \). So it is enough to require:
$$E[\hat\sigma^2 - \sigma^2] = 0$$
So, we need to find a constant \( c \) such that
$$E[c \sum_{i} (x_i - \bar{x})^2] = \sigma^2$$
We can write this out as:
\begin{align}
E[c \sum_{i} (x_i - \bar{x})^2] = c E[ \sum_{i} (x_i - \bar{x})^2] \\
= c \left(E[\sum_{i} x_i^2] - 2E[\bar{x} \sum_i x_i] + E[\sum_i \bar{x}^2]\right) \\
= c \left( nE[x_i^2] - 2n E[\bar{x}^2] + n E[\bar{x}^2] \right) \\
= cn \left(E[x_i^2] - E[\bar{x}^2] \right)
\end{align}
by the i.i.d. assumption (together with the fact that \( \sum_i x_i = n\bar{x} \)). Then, note that \( Var(X) = E(X^2) - (EX)^2 \), so \( E[x_i^2] = \sigma^2 + \mu^2 \) and \( E[\bar{x}^2] = \frac{\sigma^2}{n} + \mu^2 \), to get:
\begin{align}
cn \left(E[x_i^2] - E[\bar{x}^2] \right) = cn \left(\sigma^2 + \mu^2 - \left(\frac{\sigma^2}{n} + \mu^2\right) \right) \\
= cn \left(\sigma^2 -\frac{\sigma^2}{n} \right) = c \left(n \sigma^2 - \sigma^2 \right) \\
= c \left((n-1)\sigma^2 \right)
\end{align}
And we get that:
$$c \left((n-1)\sigma^2 \right)= \sigma^2$$
$$\boxed{c = \frac{1}{n - 1}}$$
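As a quick sanity check on the key identity \( E\left[\sum_i (x_i - \bar{x})^2\right] = (n - 1)\sigma^2 \), here is a small simulation sketch (the values of \( \sigma \) and \( n \) are arbitrary):

```python
# Numerically check that the expected sum of squared deviations from the
# sample mean is (n - 1) * sigma^2, not n * sigma^2.
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.5, 8, 200_000

ss = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, sigma, size=n)
    ss[t] = np.sum((x - x.mean()) ** 2)

print(ss.mean())                          # close to (n - 1) * sigma^2 = 43.75
print((n - 1) * sigma**2, n * sigma**2)   # 43.75 vs 50.0
```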
So, indeed, our unbiased estimator of the variance is \( \hat\sigma^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1} \), which gives us the familiar
$$\hat\sigma = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$$
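As a practical aside, most libraries expose both conventions, so it helps to know which one you are getting. For example, NumPy divides by \( n - \text{ddof} \) with ddof defaulting to 0, while Python's statistics module provides stdev (divide by \( n - 1 \)) and pstdev (divide by \( n \)):

```python
import numpy as np
import statistics

x = [63.2, 65.1, 61.8, 64.4, 66.0]

print(np.std(x))             # divides by n       (population formula)
print(np.std(x, ddof=1))     # divides by n - 1   (sample formula derived above)
print(statistics.pstdev(x))  # divides by n
print(statistics.stdev(x))   # divides by n - 1
```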
Level 5: Graduate Inference
To be added when I take a graduate inference class (haha).
Conclusion
Hopefully this post can be beneficial to you no matter your level of statistical knowledge. I always try to think of statistics as telling a story, and what you learn in class are the generally accepted methods through which we tell those stories (similar to how you may learn different accepted styles of writing in an English class).