
My book says on average the sampling variance $\dfrac{\sum (x- \bar{x})}{n}$ is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: $\dfrac{\sum (x- \bar{x})}{n-1}$

Is there a more intuitive or formal explanation of this?

Edited by DylsexicChciken

##### Share on other sites

It didn't make sense to me either. The sample could be considered a population in its own right, so why treat it differently? My guess is that even the best sampling methods tend to reduce variability slightly, but then why not divide by $n-(n/?)$ instead?

Edited by MonDie

##### Share on other sites

This is called Bessel's correction. The associated Wikipedia article has a section explaining the source of the bias when dividing by n, as well as three proofs, the third of which includes a subsection dealing with the intuition behind the proof.

##### Share on other sites

I've always intuited it thusly: since you are only taking a sample, you need to be more conservative in the possible variance in the whole population. Whereas if you know the entire population, you don't need to be conservative because you know the variance by definition. To me, it is simply a way of being a little more sure that your sample variance range has captured the true population variance.

##### Share on other sites

I was hoping to post a table of Bessel's correction, but I have had to ask how to post a table (http://www.scienceforums.net/topic/86509-posting-a-table/).

Bessel's correction and the (n-1) are also associated with statistical 'degrees of freedom'.

The ultimate example of this is Gosset's 'Student's t distribution'.

Edit: I now have the table (thanks Acme), and it is interesting how quickly the correction approaches 1 as the sample size increases.

| Number in sample, n | Bessel's correction, n/(n-1) |
|---|---|
| 2 | 2.0 |
| 5 | 1.25 |
| 10 | 1.11111 |
| 100 | 1.01010 |
| 1000 | 1.00100 |

Edited by studiot
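The factors in the table can be recomputed with a few lines of Python. This is only a sketch; the function name `bessel_correction` is made up for illustration.

```python
def bessel_correction(n: int) -> float:
    """Factor n/(n-1) that rescales the divide-by-n variance."""
    if n < 2:
        raise ValueError("need at least two observations")
    return n / (n - 1)

# Print the factor for the sample sizes used in the table.
for n in (2, 5, 10, 100, 1000):
    print(f"{n:>5}  {bessel_correction(n):.5f}")
```

The factor is undefined for n = 1, which is why the sketch rejects samples with fewer than two observations.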

##### Share on other sites

My book says on average the sampling variance $\dfrac{\sum (x- \bar{x})}{n}$ is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: $\dfrac{\sum (x- \bar{x})}{n-1}$

Is there a more intuitive or formal explanation of this?

$E(\dfrac{\sum (x- \bar{x})^2}{n-1})$ = variance, when $\bar{x}$ is the sample average.

Note your expressions left out the squaring.
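A rough numerical check of this expectation is sketched below. The normal population with standard deviation 2, the sample size of 5, the number of trials and the seed are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
import random

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample average."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

def average_estimates(n=5, trials=20000, sigma=2.0, seed=0):
    """Average the divide-by-n and divide-by-(n-1) estimates over many samples."""
    rng = random.Random(seed)
    total_n = total_n1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        ss = sum_sq_dev(sample)
        total_n += ss / n          # divide by n
        total_n1 += ss / (n - 1)   # divide by n-1
    return total_n / trials, total_n1 / trials

biased_avg, unbiased_avg = average_estimates()
print(biased_avg, unbiased_avg)
```

With a true variance of 4, the divide-by-(n-1) average should land near 4 and the divide-by-n average near 4 × 4/5 = 3.2, up to simulation noise.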

##### Share on other sites

| Number in sample, n | Bessel's correction, n/(n-1) |
|---|---|
| 2 | 2.0 |
| 5 | 1.25 |
| 10 | 1.11111 |
| 100 | 1.01010 |
| 1000 | 1.00100 |

That's exactly my point. It's going to exaggerate the standard deviation less if your sample is larger. How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?

##### Share on other sites

Chicken

Is there a more intuitive or formal explanation of this?

Mondie

How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?

Both the full population and the sample have a mean and a variance (or standard deviation).

There is no reason for these two parameters to be the same in both the population and the sample or between two samples unless the sample size equals the whole population.

If we take the variance of the sample to be

${T^2} = \sum {\frac{{{{\left( {{X_i} - \overline X } \right)}^2}}}{n}}$

We would wish its expected value (its average over all possible samples) to be equal to the variance of the population, ${\sigma ^2}$

However, several lines of algebra show that the expected value is actually

$E\left( {{T^2}} \right) = {\sigma ^2} - \frac{{{\sigma ^2}}}{n} = \frac{{n - 1}}{n}{\sigma ^2}$

So if we 'correct' this deficiency by multiplying this by

$\frac{n}{{n - 1}}$

we obtain the wanted equality.

You can see that Bessel's correction is equivalent to using (n-1) instead of n in the calculation of sample variance.

Do you really want the algebra proof?

Edited by studiot

##### Share on other sites

Do you really want the algebra proof?

I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes.

Edited by MonDie

##### Share on other sites

OK so what do we actually want to measure when we sample?

In other words why do we sample?

Well we don't want the actual value for one item. We want a single number that will best represent the whole population.

So we want the population average or mean, $\mu$.

This is given by the formula

$\mu$ = $\sum {\frac{{\left( {{X_i}} \right)}}{N}}$

That is, we add up all the individual values, $X_i$, and divide by the number of values in the population.

But we also (often) want an idea of the spread of the data.

We obtain this as the variance (often reported as the standard deviation, $\sigma$ or square root of the variance) and given by the formula

${\sigma ^2} = \sum {\frac{{{{\left( {{X_i} - \mu } \right)}^2}}}{N}}$

That is, we square each deviation from the mean, add the squares up, and divide the result by the number of values in the population.

Using upper case letters to denote values from the population, and lower case for values from the sample:

If we did the same for only some of the values, would we be fairly representing the population mean and variance?

Well it turns out that if we took every possible sample of size n < N we find that the average of all the sample means of size n is the same as the population average, $\mu$, although the sample mean for any particular sample may not be the same as that of the population.

But

If we take the average variance of all possible samples of size n < N, we find it is smaller than the population variance, ${\sigma ^2}$.

Remembering that we are really interested in the parameter for the population, not the individual sample we find that we can take the sample average as a fair representation of the population average,

But, and this is what we want to 'fix'

We cannot take the variance of the sample as calculated by the formula

${\sigma _s}^2 = \sum {\frac{{{{\left( {{x_i} - \bar x} \right)}^2}}}{n}}$

(with $\bar x$ the sample mean) as a fair representation of the population variance.

Instead of algebra to prove this for all cases, the attachment shows a worked example for a very simple case: the population is the three numbers {10,20,30} and each sample consists of two numbers. So N = 3 and n = 2.

It can be seen that the mean of all the sample means is the same as the population mean,

but the average variance of all the samples is only half that of the population variance.

It can also be seen that Bessel's correction for this is exactly 2.

$\frac{n}{{\left( {n - 1} \right)}} = \frac{2}{{\left( {2 - 1} \right)}} = 2$
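The attachment is not reproduced in this text. A minimal sketch of the same calculation follows, assuming the samples are all nine ordered pairs drawn with replacement (an assumption, chosen because it reproduces the factor of one half stated above). The helper names `mean` and `variance` are hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [10, 20, 30]

def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values)) / len(values)

def variance(values, ddof=0):
    """Squared deviations from the mean, divided by len(values) - ddof."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)

# All ordered samples of size n = 2, drawn with replacement (assumed).
samples = list(product(population, repeat=2))

pop_mean = mean(population)
pop_var = variance(population)                             # divide by N
mean_of_means = mean([mean(s) for s in samples])
avg_var_n = mean([variance(s, ddof=0) for s in samples])   # divide by n
avg_var_n1 = mean([variance(s, ddof=1) for s in samples])  # divide by n-1

print(pop_mean, mean_of_means)         # population mean vs mean of sample means
print(pop_var, avg_var_n, avg_var_n1)  # population variance vs the two averages
```

Under these assumptions the mean of the sample means is 20, the divide-by-n variances average 100/3 (half the population variance of 200/3), and the divide-by-(n-1) variances average exactly 200/3.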

Please also note I have tried to bring out when to use N or n and when to use (n-1); we don't use (N-1).

Edited by studiot

##### Share on other sites

Thanks for letting me know. The forum software is a bit weird with links containing apostrophes. Since I can no longer edit my post, here is the corrected link: Bessel's correction.

Edited by John

##### Share on other sites

Studiot, the attachment is probably exactly what I wanted, but I'll have time later.

btw, your formula doesn't square the deviations; it is missing the $^{2}$.

##### Share on other sites

btw, your formula doesn't square the deviations; it is missing the $^{2}$.

Glad you were awake enough to spot the deliberate mistake, now corrected. Hope the rest is helpful, read the attachment in conjunction with the text in the post.

##### Share on other sites

Studiot, I understand your attachment. Calculating the sample variances (s) with n-1 leads to an average s that equals the population variance ($\mu$). However, giving an average that is closer does not mean it's a better estimate.

Using those same numbers... If you calculate the average absolute deviation of s from $\mu$, i.e. the value of $\frac{\sum|s_{i} - \mu|}{n}$, or even if you find the square root of the average of the error squared, $(\frac{\sum(s_{i} - \mu)^{2}}{n})^{0.5}$, you find that n-1 produces more error (59.44, 74.5) than n-0 (48.3, 50.22).

Edited by MonDie

##### Share on other sites

I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes.

The correction corrects for the fact that the sample average is not the true mean. The definition of variance is based on sample differences from the true mean. However when we don't know the true mean we estimate it by using the sample average. Using n-1 results in the estimate of the sample variance being fair, that is the average of the sample variance equals the true variance.
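One way to make this precise is the short derivation below. It is a sketch that assumes the $x_i$ are independent draws (for example, sampling with replacement) from a population with mean $\mu$ and variance $\sigma^2$; these assumptions are not stated explicitly in the thread.

$$
\begin{aligned}
\sum (x_i - \bar{x})^2 &= \sum (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\
E\left[\sum (x_i - \mu)^2\right] &= n\sigma^2, \qquad E\left[(\bar{x} - \mu)^2\right] = \frac{\sigma^2}{n} \\
E\left[\sum (x_i - \bar{x})^2\right] &= n\sigma^2 - \sigma^2 = (n-1)\sigma^2
\end{aligned}
$$

So squared deviations measured from $\bar{x}$ total, on average, one $\sigma^2$ less than squared deviations measured from $\mu$. Dividing by n-1 rather than n compensates for exactly that shortfall.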

##### Share on other sites

Yes, I see that now (although you seem to define "variance" differently). That last post was an edit rollercoaster because I was confusing the population n with the sample n.

From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample.

Edited by MonDie

##### Share on other sites

From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample.

You do not correct the sample mean, only the sample variance.

So you always use n-0 to calculate the sample mean.

Please also note I have tried to bring out when to use the N or n and when to use (n-1) - we don't use (N-1).

The issue is, as mathematic pointed out, that the mean of a single sampling will probably not match the mean of the whole population exactly.

In my example, although the population mean is the most common value amongst the sample means, an individual sample mean equals the population mean in only 1/3 of the possible samples.

With only one sample you cannot estimate the variance or standard deviation, unless N = n = 1.

##### Share on other sites

I made a mistake. I was using $\mu$ where I should have used $\sigma^{2}$, but I was still talking about variance. If you take the average value of $|s_{i}-\sigma|$ or $|s_{i}^{2}-\sigma^{2}|$, you find that Bessel's correction results in a higher average error (with the numbers given).
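This comparison can be checked directly. The sketch below assumes the population {10,20,30} and all nine ordered samples of size two drawn with replacement; the function names are hypothetical. It measures how far each sample variance lands from $\sigma^{2}$, both as a mean absolute error and as a root-mean-square error.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

population = [10, 20, 30]

def variance(values, ddof=0):
    """Squared deviations from the mean, divided by len(values) - ddof."""
    m = Fraction(sum(values), len(values))
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)

sigma2 = variance(population)                  # population variance, divide by N
samples = list(product(population, repeat=2))  # assumed: with replacement

def errors(ddof):
    """Mean absolute error and RMS error of the sample variances about sigma2."""
    devs = [variance(s, ddof) - sigma2 for s in samples]
    mae = sum(abs(d) for d in devs) / len(devs)
    rmse = sqrt(sum(d * d for d in devs) / len(devs))
    return mae, rmse

mae_n, rmse_n = errors(ddof=0)    # divide by n
mae_n1, rmse_n1 = errors(ddof=1)  # divide by n-1

print(float(mae_n), rmse_n)
print(float(mae_n1), rmse_n1)
```

Under these assumptions the divide-by-n estimator has the smaller error on both measures (about 48.1 and 50.0, against about 59.3 and 74.5 with n-1), even though only the n-1 version averages out to $\sigma^{2}$. Being unbiased and having the smallest typical error are different criteria.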

Edited by MonDie

##### Share on other sites

MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful.

You are expecting

$\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy$

where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample).

And there is no real reason why the two should be equal in the general case.

##### Share on other sites

BigNose, in the link, are those brackets for absolute value or a floor/ceiling function?

Neither. Just square brackets used so that they look different than regular parentheses.

##### Share on other sites

MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful.

You are expecting

$\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy$

where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample).

And there is no real reason why the two should be equal in the general case.

Regarding the link, I don't understand why they're summing the sample variance $S_{i}^{2}$ with the difference between means squared $(\bar{x}_{i} - \bar{X})^{2}$

##### Share on other sites

Regarding the link, I don't understand why they're summing the sample variance $S_{i}^{2}$ with the difference between means squared $(\bar{x}_{i} - \bar{X})^{2}$

You dropped the subscript c on the X term. That is important. See the definition of $X_c$.

##### Share on other sites

I didn't represent it correctly, but I had the correct meaning in my head.

BigNose, I didn't know any calculus until your integrals spurred some independent learning. Now I understand that the integral of a function over an interval is the mean y-value multiplied by the length of the interval (I avoided the words "area" and "volume" because my understanding is that the lower quadrants are scored negatively). I also read the first six lessons of Capn's derivatives tutorial. Although your equation may not be true, I do want to know how you were using integrals to represent variance. I'm not familiar with integral notation yet.

Edited by MonDie
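For readers in the same position, the usual way to write the mean and variance of a continuous distribution is sketched below. It assumes that $f(x)$ in the integrals quoted earlier is a probability density, meaning it is non-negative and integrates to 1; the thread does not say so explicitly.

$$
\mu = \int x\, f(x)\, dx, \qquad \sigma^2 = \int (x - \mu)^2 f(x)\, dx
$$

This is the continuous analogue of $\sum (X_i - \mu)^2 / N$: each squared deviation is weighted by how likely the value $x$ is, instead of every value receiving the equal weight $1/N$.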
