'n-1' versus 'n' in sampling variance

November 12, 2014

My book says on average the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math]

Is there a more intuitive or formal explanation of this?

Edited November 12, 2014 by DylsexicChciken

November 12, 2014

It didn't make sense to me either. The sample could be considered a population in its own right, so why treat it differently? My guess is that even the best sampling methods tend to reduce variability slightly, but then why not divide by [math]n-(n/?)[/math] instead?

Edited November 12, 2014 by MonDie

November 12, 2014

This is called Bessel's correction. The associated Wikipedia article has a section explaining the source of the bias when dividing by n, as well as three proofs, the third of which includes a subsection dealing with the intuition behind the proof.

November 12, 2014

I've always intuited it thusly: since you are only taking a sample, you need to be more conservative in the possible variance in the whole population. Whereas if you know the entire population, you don't need to be conservative because you know the variance by definition. To me, it is simply a way of being a little more sure that your sample variance range has captured the true population variance.

November 12, 2014

John, your link doesn't go to a Wiki page.

Edited November 12, 2014 by MonDie

November 12, 2014

In addition to the Wiki article:

I was hoping to post a table of Bessel's correction, but I have had to ask how to post a table (http://www.scienceforums.net/topic/86509-posting-a-table/).

Bessel's correction and the (n-1) are also associated with statistical 'degrees of freedom'.

The ultimate example of this is Gosset's 'Student's t distribution'.

Edit: I now have the table (thanks Acme), and it is interesting how quickly the correction approaches 1 as the number in the sample increases.

Number in sample,n	Bessel's Correction, n/(n-1)
2	2.0
5	1.25
10	1.11111
100	1.01010
1000	1.00100
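
The table is easy to regenerate for any sample size. A minimal Python sketch (the helper name `bessel_factor` is just an illustrative label, not a library function):

```python
def bessel_factor(n: int) -> float:
    """Return n / (n - 1), the factor that converts the divide-by-n
    variance into the divide-by-(n-1) variance."""
    if n < 2:
        raise ValueError("Bessel's correction needs at least two observations")
    return n / (n - 1)

# Same sample sizes as in the table above.
for n in (2, 5, 10, 100, 1000):
    print(f"{n}\t{bessel_factor(n):.5f}")
```

The factor is undefined for n = 1, which is why the sketch rejects it: a single observation carries no information about spread.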

Edited November 12, 2014 by studiot

November 12, 2014

My book says on average the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math]

Is there a more intuitive or formal explanation of this?

[math]E(\dfrac{\sum (x- \bar{x})^2}{n-1}) [/math] = variance, when [math]\bar{x}[/math] is the sample average.

Note your expressions left out the squaring.

November 12, 2014

Number in sample, n	Bessel's Correction, n/(n-1)
2	2.0
5	1.25
10	1.11111
100	1.01010
1000	1.00100

That's exactly my point. It's going to exaggerate the standard deviation less if your sample is larger. How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?

November 13, 2014

Chicken

Is there a more intuitive or formal explanation of this?

Mondie

How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?

Both the full population and the sample have a mean and a variance (or standard deviation).

There is no reason for these two parameters to be the same in both the population and the sample or between two samples unless the sample size equals the whole population.

If we take the variance of the sample to be

[math]{T^2} = \sum {\frac{{{{\left( {{X_i} - \overline X } \right)}^2}}}{n}} [/math]

We would wish it to be equal, on average, to the variance of the population, [math]{\sigma ^2}[/math].

However, several lines of algebra show that its expected value is actually

[math]{\sigma ^2} - \frac{{{\sigma ^2}}}{n} = \frac{{n - 1}}{n}{\sigma ^2}[/math]

So if we 'correct' this deficiency by multiplying this by

[math]\frac{n}{{n - 1}}[/math]

we obtain the wanted equality.

You can see that Bessel's correction is equivalent to using (n-1) instead of n in the calculation of sample variance.
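
For anyone who would rather check the factor numerically than follow the algebra, here is a small sketch. It assumes independent draws with replacement (the setting in which the (n-1)/n factor is exact) and uses a made-up population, the faces of a die; the function names are my own.

```python
from fractions import Fraction
from itertools import product

def biased_variance(sample):
    """Divide-by-n variance about the sample's own mean (T^2 above)."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / n

def expected_biased_variance(population, n):
    """Average T^2 over every ordered sample of size n drawn with replacement."""
    samples = list(product(population, repeat=n))
    return sum(biased_variance(s) for s in samples) / len(samples)

population = [1, 2, 3, 4, 5, 6]        # hypothetical population: faces of a die
sigma2 = biased_variance(population)   # true population variance, 35/12

for n in (2, 3, 4):
    ratio = expected_biased_variance(population, n) / sigma2
    # The ratio matches (n - 1) / n exactly, thanks to Fraction arithmetic.
    print(n, ratio, Fraction(n - 1, n))
```

Exact fractions are used so that the comparison is not blurred by floating-point rounding.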

Do you really want the algebra proof?

Edited November 13, 2014 by studiot

November 13, 2014

Do you really want the algebra proof?

I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes.

Edited November 13, 2014 by MonDie

November 13, 2014

OK so what do we actually want to measure when we sample?

In other words why do we sample?

Well we don't want the actual value for one item. We want a single number that will best represent the whole population.

So we want the population average or mean, [math]\mu [/math].

This is given by the formula

[math]\mu [/math] = [math]\sum {\frac{{\left( {{X_i}} \right)}}{N}} [/math]

That is, we add up all the individual values, [math]X_i[/math], and divide by the number of values in the population.

But we also (often) want an idea of the spread of the data.

We obtain this as the variance (often reported as the standard deviation, [math]\sigma [/math] or square root of the variance) and given by the formula

[math]{\sigma ^2} = \sum {\frac{{\left( {X_i} - \mu \right)^2}}{N}} [/math]

That is, we square each deviation from the mean, add up the squares, and divide the result by the number of values in the population.

But what about the sample?

Using upper case letter to denote values from the population, and lower case for values from the sample:

If we did the same for only some of the values, would we be fairly representing the population mean and variance?

Well it turns out that if we took every possible sample of size n < N we find that the average of all the sample means of size n is the same as the population average, [math]\mu [/math], although the sample mean for any particular sample may not be the same as that of the population.

But

If we take the average variance of all possible samples of size n < N, we find it is smaller than the population variance, [math]{\sigma ^2}[/math].

Remembering that we are really interested in the parameter for the population, not the individual sample we find that we can take the sample average as a fair representation of the population average,

But, and this is what we want to 'fix'

We cannot take the variance of the sample as calculated by the formula

[math]{\sigma _s}^2 = \sum {\frac{{\left( {x_i} - \bar{x} \right)^2}}{n}} [/math]

as a fair representation of the population variance.

Instead of algebra to prove this for all cases the attachment shows a worked example for a very simple case of the population being three numbers {10,20,30}

and the sample size being two numbers. So N = 3 and n = 2

It can be seen that the mean of all the sample means is the same as the population mean,

but the average variance of all the samples is only half that of the population variance.

It can also be seen that Bessel's correction for this is exactly 2.

[math]\frac{n}{{\left( {n - 1} \right)}} = \frac{2}{{\left( {2 - 1} \right)}} = 2[/math]

Please also note I have tried to bring out when to use the N or n and when to use (n-1) - we don't use (N-1).
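
For readers without the attachment, the worked example can be reconstructed roughly as follows. This is a sketch under one assumption: the samples are ordered draws with replacement, giving 9 samples of size 2, which is the reading that reproduces the "half the population variance" result.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [10, 20, 30]
n = 2

# Assumption: ordered draws with replacement, so 3**2 = 9 possible samples.
samples = list(product(population, repeat=n))

sample_means = [mean(s) for s in samples]
var_n = [pvariance(s) for s in samples]          # divide by n
var_n_minus_1 = [variance(s) for s in samples]   # divide by n - 1

print("population mean:       ", mean(population))
print("mean of sample means:  ", mean(sample_means))
print("population variance:   ", pvariance(population))
print("average variance (n):  ", mean(var_n))
print("average variance (n-1):", mean(var_n_minus_1))
```

`statistics.pvariance` divides by the number of values and `statistics.variance` divides by one less, so the two lists differ by exactly the factor n/(n-1) = 2.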

Edited November 13, 2014 by studiot

November 13, 2014

John, your link doesn't go to a Wiki page.

Thanks for letting me know. The forum software is a bit weird with links containing apostrophes. Since I can no longer edit my post, here is the corrected link: Bessel's correction.

Edited November 13, 2014 by John

November 13, 2014

Studiot, the attachment is probably exactly what I wanted, but I'll have time later.

btw, your formula doesn't square the deviations.

^{2}

November 13, 2014

btw, your formula doesn't square the deviations.

^{2}

Glad you were awake enough to spot the deliberate mistake, now corrected.

Hope the rest is helpful, read the attachment in conjunction with the text in the post.

November 14, 2014

Studiot, I understand your attachment. Calculating the sample variances (s) with n-1 leads to an average s that equals the population variance ([math]\mu[/math]). However, giving an average that is closer does not mean it's a better estimate.

Using those same numbers... If you calculate the average absolute deviation of s from [math]\mu[/math], i.e. the value of [math]\frac{\sum|s_{i} - \mu|}{n}[/math], or even if you find the square root of the average of the error squared, [math](\frac{\sum(s_{i} - \mu)^{2}}{n})^{0.5}[/math], you find that n-1 produces more error (59.44, 74.5) than n-0 (48.3, 50.22).
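
This comparison can be recomputed with a short script. It is a sketch that assumes the 9 ordered samples of size 2 drawn with replacement from {10, 20, 30}, each weighted equally, and it measures error against the population variance (200/3); the function names are illustrative. The results land close to, but not exactly on, the figures quoted above, presumably because of rounding in the hand calculation.

```python
from itertools import product
from math import sqrt

population = [10, 20, 30]
n = 2
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # 200/3

def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over the given divisor."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor

# Assumption: all ordered samples with replacement, equally likely.
samples = list(product(population, repeat=n))

def errors(divisor):
    """Return (mean absolute error, root mean squared error) of the estimates."""
    estimates = [sample_variance(s, divisor) for s in samples]
    mean_abs = sum(abs(e - sigma2) for e in estimates) / len(estimates)
    rmse = sqrt(sum((e - sigma2) ** 2 for e in estimates) / len(estimates))
    return mean_abs, rmse

print("divide by n:  ", errors(n))      # about (48.15, 50.00)
print("divide by n-1:", errors(n - 1))  # about (59.26, 74.54)
```

So for this example the divide-by-n estimate has the smaller typical error even though its average is too low; being right on average and being close on average are different criteria.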

Edited November 14, 2014 by MonDie

November 14, 2014

I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes.

The correction compensates for the fact that the sample average is not the true mean. The definition of variance is based on differences from the true mean. However, when we don't know the true mean we estimate it with the sample average, and the sample values lie closer, on average, to their own average than to the true mean. Using n-1 makes the sample variance a fair (unbiased) estimate, that is, the average of the sample variance equals the true variance.
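
A way to see that the bias comes from substituting the sample average for the true mean: compute the divide-by-n quantity twice, once about the true mean and once about each sample's own average. The sketch below assumes draws with replacement and uses a made-up population (die faces); names are illustrative.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population
n = 3
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def mean_sq_dev(sample, centre):
    """Average squared deviation of the sample values from `centre` (divide by n)."""
    return sum((x - centre) ** 2 for x in sample) / len(sample)

# Every ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

about_true_mean = sum(mean_sq_dev(s, mu) for s in samples) / len(samples)
about_sample_mean = sum(
    mean_sq_dev(s, Fraction(sum(s), n)) for s in samples
) / len(samples)

print(sigma2)             # 35/12
print(about_true_mean)    # 35/12: no correction needed when the true mean is known
print(about_sample_mean)  # 35/18, which is (n-1)/n times 35/12
```

Multiplying the last value by n/(n-1) = 3/2 recovers 35/12, which is the whole effect of dividing by n-1 instead of n.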

November 14, 2014

Yes, I see that now (although you seem to define "variance" differently). That last post was an edit rollercoaster because I was confusing the population n with the sample n.

From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample.

Edited November 14, 2014 by MonDie

November 14, 2014

From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample.

You do not correct the sample mean, only the sample variance.

So you always use n-0 to calculate the sample mean.

Please also note I have tried to bring out when to use the N or n and when to use (n-1) - we don't use (N-1).

The issue is, as mathematic pointed out, that the mean of a single sampling will probably not match the mean of the whole population exactly.

In my example, although the population mean is the most common value amongst the sample means, an individual sample mean equals the population mean in only 1/3 of the possible samples.

With only one sample you cannot estimate the variance or standard deviation, unless N = n = 1.

November 14, 2014

I made a mistake. I was using [math]\mu[/math] where I should have used [math]\sigma^{2}[/math], but I was still talking about variance. If you take the average value of [math]|s_{i}-\sigma|[/math] or [math]|s_{i}^{2}-\sigma^{2}|[/math], you find that Bessel's correction results in a higher average error (with the numbers given).

Edited November 14, 2014 by MonDie

November 14, 2014

MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful.

Take a look at http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html

You are expecting

[math]\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy [/math]

where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample).

And there is no real reason why the two should be equal in the general case.

November 14, 2014

BigNose, in the link, are those brackets for absolute value or a floor/ceiling function?

November 14, 2014

BigNose, in the link, are those brackets for absolute value or a floor/ceiling function?

Neither. Just square brackets used so that they look different than regular parentheses.

November 14, 2014

MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful.

Take a look at http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html

You are expecting

[math]\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy [/math]

where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample).

And there is no real reason why the two should be equal in the general case.

Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math]

November 15, 2014

Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math]

You dropped the subscript c on the X term. That is important. See the definition of [math]\bar{X}_c[/math].
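
On why the squared difference between means is added: each group's variance measures spread about that group's own mean, so the spread of the group means about the combined mean has to be added back. The sketch below assumes the linked page uses the usual combined-variance formula, with divide-by-n group variances; the groups are made up for illustration.

```python
from statistics import mean, pvariance

# Hypothetical groups, chosen only for illustration.
groups = [[10, 20, 30], [40, 50], [5, 15, 25, 35]]

sizes = [len(g) for g in groups]
means = [mean(g) for g in groups]
variances = [pvariance(g) for g in groups]   # divide-by-n_i variance of each group

total = sum(sizes)
# Combined mean: size-weighted average of the group means.
combined_mean = sum(n_i * m_i for n_i, m_i in zip(sizes, means)) / total

# Combined variance: within-group variance plus squared distance of each
# group mean from the combined mean, weighted by group size.
combined_variance = sum(
    n_i * (s2_i + (m_i - combined_mean) ** 2)
    for n_i, s2_i, m_i in zip(sizes, variances, means)
) / total

pooled = [x for g in groups for x in g]
print(combined_variance, pvariance(pooled))   # the two values agree
```

Dropping the between-means term leaves only the weighted average of the group variances, which understates the variance of the pooled data whenever the group means differ.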

November 16, 2014

I didn't represent it correctly, but I had the correct meaning in my head.

BigNose, I didn't know any calculus until your integrals spurred some independent learning. Now I understand that the integral of a function over an interval is the mean y-value multiplied by the length of the interval (I avoided the words "area" and "volume" because my understanding is that the lower quadrants are scored negatively). I also read the first six lessons of Capn's derivatives tutorial. Although your equation may not be true, I do want to know how you were using integrals to represent variance. I'm not familiar with integral notation yet.
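
As a rough illustration of variance written as an integral: for values described by a density f(x), the sum over the population becomes an integral, with the mean given by the integral of x f(x) and the variance by the integral of (x - mean)^2 f(x). The sketch below is my own example, not taken from the thread: it assumes values spread uniformly between 0 and 1 and uses a simple midpoint rule for the integration.

```python
def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    width = (b - a) / steps
    return sum(g(a + (i + 0.5) * width) for i in range(steps)) * width

# Hypothetical example: a uniform density on [0, 1], so f(x) = 1 there.
def f(x):
    return 1.0

a, b = 0.0, 1.0
mu = integrate(lambda x: x * f(x), a, b)                   # mean, about 0.5
sigma2 = integrate(lambda x: (x - mu) ** 2 * f(x), a, b)   # variance, about 1/12

print(mu, sigma2)
```

The density f(x) plays the role that 1/N plays in the sum: it weights each squared deviation by how much of the population sits near x.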

Edited November 16, 2014 by MonDie
