'n-1' versus 'n' in sampling variance


DylsexicChciken


My book says the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased: on average it gives estimates smaller than the actual variance of the population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math]

 

Is there a more intuitive or formal explanation of this?

Edited by DylsexicChciken

It didn't make sense to me either. The sample could be considered a population in its own right, so why treat it differently? My guess is that even the best sampling methods tend to reduce variability slightly, but then why not divide by [math]n-(n/?)[/math] instead?

Edited by MonDie

I've always intuited it this way: since you are only taking a sample, you need to be more conservative about the possible variance in the whole population. Whereas if you know the entire population, you don't need to be conservative, because you know the variance by definition. To me, it is simply a way of being a little more sure that your sample variance range has captured the true population variance.


In addition to the Wiki article:

I was hoping to post a table of Bessel's correction, but I have had to ask how to post a table (http://www.scienceforums.net/topic/86509-posting-a-table/).

Bessel's correction and the (n-1) are also associated with statistical 'degrees of freedom'.

The ultimate example of this is Gosset's 'Student's t-distribution'.

Edit: I now have the table (thanks Acme), and it is interesting how quickly the correction approaches 1 as the sample size increases.

 

Number in sample, n     Bessel's correction, n/(n-1)
2                       2.00000
5                       1.25000
10                      1.11111
100                     1.01010
1000                    1.00100
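
For anyone who wants to regenerate the table, a minimal sketch in Python (the sample sizes are just the ones above):

[code]
# Print Bessel's correction factor n/(n-1) for the sample sizes in the table.
for n in (2, 5, 10, 100, 1000):
    print(f"{n:>5}  {n / (n - 1):.5f}")
[/code]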
Edited by studiot

My book says the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased: on average it gives estimates smaller than the actual variance of the population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math]

 

Is there a more intuitive or formal explanation of this?

[math]E\left(\dfrac{\sum (x- \bar{x})^2}{n-1}\right)[/math] = population variance, where [math]\bar{x}[/math] is the sample average.

 

Note that your expressions left out the squaring.
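
A quick Monte Carlo sketch of this (the normal population and the sample size n = 5 are arbitrary choices for illustration):

[code]
import random

# Compare the average of sum(x - xbar)^2 / n with sum(x - xbar)^2 / (n - 1)
# over many samples from a population whose true variance is 1.0.
random.seed(0)
n, trials = 5, 100_000
avg_div_n = avg_div_n1 = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    avg_div_n += ss / n / trials        # biased estimator
    avg_div_n1 += ss / (n - 1) / trials  # Bessel-corrected estimator
print(avg_div_n)   # about 0.8 = (n-1)/n times the true variance
print(avg_div_n1)  # about 1.0 = the true variance
[/code]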


Number in sample, n     Bessel's correction, n/(n-1)
2                       2.00000
5                       1.25000
10                      1.11111
100                     1.01010
1000                    1.00100

 

That's exactly my point. It's going to exaggerate the standard deviation less if your sample is larger. How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?


 

Chicken:

Is there a more intuitive or formal explanation of this?

 

 

 

MonDie:

How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?

 

 

Both the full population and the sample have a mean and a variance (or standard deviation).

 

There is no reason for these two parameters to be the same in the population and the sample, or between two samples, unless the sample size equals the whole population.

 

If we take the variance of the sample to be

 

[math]{T^2} = \frac{\sum (x_i - \bar{x})^2}{n}[/math]

 

We would wish it to be equal to the variance of the population, [math]{\sigma ^2}[/math].

 

However, a few lines of algebra show that, averaged over all possible samples, it is actually equal to

[math]{\sigma ^2} - \frac{{{\sigma ^2}}}{n} = \frac{{n - 1}}{n}{\sigma ^2}[/math]

 

So if we 'correct' this deficiency by multiplying this by

 

[math]\frac{n}{{n - 1}}[/math]

 

we obtain the equality we want.

 

You can see that Bessel's correction is equivalent to using (n-1) instead of n in the calculation of sample variance.
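
In symbols, that equivalence is just

[math]\frac{n}{n-1} \times \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum (x_i - \bar{x})^2}{n-1}[/math]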

 

Do you really want the algebra proof?

 

 

Edited by studiot

 

Do you really want the algebra proof?

 

 

I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes.

Edited by MonDie

OK so what do we actually want to measure when we sample?

 

In other words why do we sample?

 

Well we don't want the actual value for one item. We want a single number that will best represent the whole population.

 

So we want the population average or mean, [math]\mu [/math].

 

This is given by the formula

[math]\mu = \frac{\sum X_i}{N}[/math]

That is, we add up all the individual values [math]X_i[/math] and divide by the number of values in the population, N.

But we also (often) want an idea of the spread of the data.

 

We obtain this as the variance (often reported as the standard deviation, [math]\sigma [/math], the square root of the variance), given by the formula

 

[math]{\sigma ^2} = \frac{\sum (X_i - \mu)^2}{N}[/math]

 

That is, we square all the deviations from the mean, add them up, and divide the result by the number of values in the population.
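
As a concrete sketch, here are those two formulas in Python (using the three-number population that appears later in this thread):

[code]
# Population mean and variance, exactly as in the two formulas above.
def pop_mean(X):
    return sum(X) / len(X)

def pop_var(X):
    mu = pop_mean(X)
    return sum((x - mu) ** 2 for x in X) / len(X)

print(pop_mean([10, 20, 30]))  # 20.0
print(pop_var([10, 20, 30]))   # 66.666..., i.e. 200/3
[/code]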

 

But what about the sample?

Using upper-case letters to denote values from the population, and lower-case for values from the sample:

 

If we did the same for only some of the values, would we be fairly representing the population mean and variance?

 

Well, it turns out that if we take every possible sample of size n < N, the average of all the sample means is the same as the population mean, [math]\mu [/math], although the mean of any particular sample may not be the same as that of the population.

But

 

If we take the average variance of all possible samples of size n < N, we find it is smaller than the population variance, [math]{\sigma ^2}[/math].

 

Remembering that we are really interested in the parameters of the population, not of the individual sample, we find that we can take the sample mean as a fair representation of the population mean.

 

But, and this is what we want to 'fix':

We cannot take the variance of the sample as calculated by the formula

[math]{\sigma _s}^2 = \frac{\sum (x_i - \bar{x})^2}{n}[/math]

as a fair representation of the population variance.

 

Instead of algebra to prove this for all cases, the attachment shows a worked example for a very simple case: the population is the three numbers {10, 20, 30} and the sample size is two, so N = 3 and n = 2.

 

It can be seen that the mean of all the sample means is the same as the population mean,

but the average variance of all the samples is only half that of the population variance.

 

It can also be seen that Bessel's correction for this case is exactly 2.

[math]\frac{n}{{\left( {n - 1} \right)}} = \frac{2}{{\left( {2 - 1} \right)}} = 2[/math]

 

Please also note that I have tried to bring out when to use N or n, and when to use (n-1) - we never use (N-1).

[Attached image: worked example table - post-74263-0-48019000-1415878050_thumb.jpg]
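
The attachment's arithmetic can also be checked with a short script (a sketch; it counts all nine ordered samples drawn with replacement, which is the counting that reproduces the attachment's figures):

[code]
from itertools import product

# Enumerate every ordered sample of size n = 2 drawn with replacement
# from the population {10, 20, 30}.
pop = [10, 20, 30]
n = 2
samples = list(product(pop, repeat=n))  # 9 equally likely samples

mu = sum(pop) / len(pop)                             # population mean, 20.0
sigma2 = sum((x - mu) ** 2 for x in pop) / len(pop)  # population variance, 66.67

means = [sum(s) / n for s in samples]
var_n = [sum((x - sum(s) / n) ** 2 for x in s) / n for s in samples]
var_n1 = [sum((x - sum(s) / n) ** 2 for x in s) / (n - 1) for s in samples]

print(sum(means) / len(samples))   # 20.0  - matches the population mean
print(sum(var_n) / len(samples))   # 33.33 - only half the population variance
print(sum(var_n1) / len(samples))  # 66.67 - matches after Bessel's correction
[/code]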

Edited by studiot

Studiot, I understand your attachment. Calculating the sample variances (s) with n-1 leads to an average s that equals the population variance ([math]\mu[/math]). However, giving an average that is closer does not mean it's a better estimate.

 

Using those same numbers... If you calculate the average absolute deviation of s from [math]\mu[/math], i.e. the value of [math]\frac{\sum|s_{i} - \mu|}{n}[/math], or even if you find the square root of the average of the error squared, [math](\frac{\sum(s_{i} - \mu)^{2}}{n})^{0.5}[/math], you find that n-1 produces more error (59.44, 74.5) than n-0 (48.3, 50.22).
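
A short script recomputing those error measures over the nine possible samples gives roughly 59.3 and 74.5 for n-1 versus 48.1 and 50.0 for n-0, agreeing with the figures above up to rounding (a sketch, using studiot's population and sampling scheme):

[code]
from itertools import product

# Mean absolute error and root-mean-square error of the sample variance,
# divisor n-1 versus divisor n, measured against the population variance.
pop = [10, 20, 30]
n = 2
mu = sum(pop) / len(pop)
sigma2 = sum((x - mu) ** 2 for x in pop) / len(pop)  # 66.67

def error_measures(divisor):
    variances = []
    for s in product(pop, repeat=n):  # all 9 ordered samples with replacement
        m = sum(s) / n
        variances.append(sum((x - m) ** 2 for x in s) / divisor)
    mae = sum(abs(v - sigma2) for v in variances) / len(variances)
    rmse = (sum((v - sigma2) ** 2 for v in variances) / len(variances)) ** 0.5
    return mae, rmse

print(error_measures(n - 1))  # about (59.26, 74.54)
print(error_measures(n))      # about (48.15, 50.00)
[/code]

So the n-1 estimator is unbiased, yet on these numbers the uncorrected estimator sits closer to the true variance on average: unbiasedness and smallest error are different criteria.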

Edited by MonDie

 

I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes.

The correction corrects for the fact that the sample average is not the true mean. The definition of variance is based on deviations from the true mean. However, when we don't know the true mean, we estimate it using the sample average. Using n-1 results in the estimate of the variance being fair, that is, the average of the sample variance equals the true variance.
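
The algebra behind this is short. Writing each deviation from the sample average in terms of deviations from the true mean gives (a sketch, for an independent sample of size n):

[math]\sum (x_i - \bar{x})^2 = \sum (x_i - \mu)^2 - n(\bar{x} - \mu)^2[/math]

Averaging over all possible samples, [math]E\left(\sum (x_i - \mu)^2\right) = n\sigma^2[/math] and [math]E\left((\bar{x} - \mu)^2\right) = \sigma^2/n[/math], so

[math]E\left(\sum (x_i - \bar{x})^2\right) = n\sigma^2 - \sigma^2 = (n-1)\sigma^2[/math]

and dividing by n-1 rather than n exactly cancels the shortfall.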


Yes, I see that now (although you seem to define "variance" differently). That last post was an edit rollercoaster because I was confusing the population N with the sample n.

 

From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample.

Edited by MonDie

 

From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample.

 

 

 

You do not correct the sample mean, only the sample variance.

 

So you always use n-0 to calculate the sample mean.

 

 

Please also note that I have tried to bring out when to use N or n, and when to use (n-1) - we never use (N-1).

 

The issue is, as mathematic pointed out, that the mean of a single sample will probably not match the mean of the whole population exactly.

In my example, although the population mean is the most common value amongst the sample means, an individual sample mean equals the population mean in only 1/3 of the possible samples.

 

 

With a sample of only one value you cannot estimate the variance or standard deviation, unless N = n = 1.


I made a mistake. I was using [math]\mu[/math] where I should have used [math]\sigma^{2}[/math], but I was still talking about variance. If you take the average value of [math]|s_{i}-\sigma|[/math] or [math]|s_{i}^{2}-\sigma^{2}|[/math], you find that Bessel's correction results in a higher average error (with the numbers given).

Edited by MonDie

MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful.

 

Take a look at http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html

 

You are expecting

 

[math]\int_{POP} (x-\mu_{POP})^2 f(x)\, dx = \int_{SAMP} \left[ \int_{\Omega_y} (x-\mu_{y})^2 f(x)\, dx \right] dy [/math]

 

where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample).

 

And there is no real reason why the two should be equal in the general case.
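
For anyone unfamiliar with the notation: for a continuous population with probability density f, the integrals play the role of the sums used earlier in the thread,

[math]\mu = \int x f(x)\, dx, \qquad \sigma^2 = \int (x - \mu)^2 f(x)\, dx[/math]

which is just the continuous analogue of adding up the squared deviations and dividing by the population size.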


MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful.

Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math].


Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math].

You dropped the subscript c on the [math]\bar{X}[/math] term. That is important. See the definition of [math]\bar{X}_{c}[/math] in the link.
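
A sketch of what the link's formula appears to be doing (the three groups below are made-up example data): each group contributes its own variance [math]S_{i}^{2}[/math] plus the squared offset of its mean from the combined mean [math]\bar{X}_{c}[/math], weighted by its size, and that extra term is exactly what makes the result equal the variance of all the data pooled together.

[code]
# Combined variance of several groups: weight each group's variance plus the
# squared offset of its mean from the combined mean, then check against the
# variance of all the data pooled together.
groups = [[10, 20, 30], [5, 15], [40, 50, 60, 70]]  # made-up example data

ns = [len(g) for g in groups]
means = [sum(g) / len(g) for g in groups]
variances = [sum((x - m) ** 2 for x in g) / len(g)
             for g, m in zip(groups, means)]

N = sum(ns)
x_c = sum(n * m for n, m in zip(ns, means)) / N  # combined mean
combined = sum(n * (v + (m - x_c) ** 2)
               for n, v, m in zip(ns, variances, means)) / N

pooled = sum((x - x_c) ** 2 for g in groups for x in g) / N
print(combined, pooled)  # the two agree
[/code]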


I didn't represent it correctly, but I had the correct meaning in my head.

 

BigNose, I didn't know any calculus until your integrals spurred some independent learning. Now I understand that the integral of a function over an interval is the mean y-value multiplied by the width of the interval (I avoided the words "area" and "volume" because my understanding is that the lower quadrants are scored negatively). I also read the first six lessons of Capn's derivatives tutorial. Although your equation may not be true, I do want to know how you were using integrals to represent variance. I'm not familiar with integral notation yet.

Edited by MonDie
