Stat Analysis


Not sure if this belongs in the Physics or Math forum. I'm leaning towards Math more.

Say, when scientists do experiments, record data, and come up with a formula, how do they prove that the formula is correct? I'm talking about those formulae that cannot be mathematically derived; one example might be Hooke's law. There will inevitably be some error, so how do they show, beyond just looking at the graph, that the formula should still stand in spite of the errors affecting the data? I know that they can't prove a formula is 100% true, but they can show, using statistical analysis, that the formula is highly unlikely to be untrue. My question is, HOW?

Say I've been experimenting on Hooke's Law F=-kx with a spring, a ruler, and some masses. There will be imprecision and random errors from the ruler, and if I collect the data and plot it, it will not form a perfect line. So my question is: how do I know that I should fit a linear regression to it? How can I show that the relationship is very likely linear and not something else, like an exponential relationship? It looks linear on the graph, yes, but how do I demonstrate mathematically, on paper or with software, that a linear relationship is highly likely?
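For what it's worth, the whole setup is easy to simulate. In this sketch the spring constant (25 N/m), the noise level, and the extensions are all made-up numbers, and I use the applied force F = kx rather than the restoring force -kx:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated experiment: applied force F = k*x with a made-up k = 25 N/m,
# plus random measurement error from reading the ruler.
k_true = 25.0
x = np.linspace(0.01, 0.10, 10)                # extensions in metres
F = k_true * x + rng.normal(0, 0.05, x.size)   # measured forces in newtons

# Least-squares straight-line fit; cov=True also returns the covariance
# matrix of the fitted coefficients, giving an uncertainty on the slope.
coeffs, cov = np.polyfit(x, F, 1, cov=True)
slope, intercept = coeffs
slope_err = np.sqrt(cov[0, 0])

print(f"k estimate: {slope:.2f} +/- {slope_err:.2f} N/m")
```

The estimated slope should land near the true k, with an error bar that quantifies how much the ruler noise blurs it.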

Thanks a lot.


I don't know about that particular equation, but in general there is no 100% certainty that an equation is correct; it has just managed to stand up against experimental data better than any contenders.

And it is not always the case that such equations cannot be derived from simpler premises. I found it interesting that Newton's law of cooling, the equations of motion, etc. can all be derived from very simple differential equations.


Okay, let's make the scenario into deciding whether there is a linear relationship.

However, after you plot your points there are usually some high-degree polynomial curves (with some extremely strange coefficients) that fit the points better than a line does. Of course a line looks pretty close to the data, but the higher-degree polynomials are even closer. Say we know that the relationship is indeed linear, and the better fit of the higher-degree polynomials is caused by inevitable random errors in the sample. It looks linear, but a high-degree polynomial passes through more of the points than the line does. In this case, how can we show that the relationship is indeed linear despite the fact that a high-degree polynomial fits the data points better?



You can fit anything to a polynomial of sufficient order, so that's really a non-starter; it's ad hoc. You want a minimum of free parameters. If you already have a model, you fit to it and see if the fit is reasonable. If you have no clue about the relationship, you go with the standards: linear, quadratic, or exponential. You don't pull out a 23rd-degree polynomial just because it fits well.
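The "minimum of free parameters" idea has a standard quantitative form in information criteria such as the AIC, which charges each extra coefficient a penalty it must earn back in fit quality. A minimal sketch on made-up, genuinely linear data (the degrees compared and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = 3.0 * x + 1.0 + rng.normal(0, 0.2, x.size)   # truly linear toy data

def rss_and_aic(degree):
    """Residual sum of squares, and the Akaike information criterion
    (Gaussian errors, up to an additive constant). Lower AIC is better:
    it rewards small residuals but adds 2 per free parameter, so extra
    polynomial terms must earn their keep."""
    p = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(p, x)) ** 2)
    n = x.size
    return rss, n * np.log(rss / n) + 2 * (degree + 1)

for degree in (1, 2, 5):
    rss, score = rss_and_aic(degree)
    print(f"degree {degree}: RSS = {rss:.3f}, AIC = {score:.1f}")
```

The RSS always shrinks as the degree grows (more freedom never fits worse), which is exactly why raw fit quality alone can't pick the model; the penalized score typically favors the simple line when the data really are linear.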


I'm talking about those formulae that cannot be mathematically derived. One example might be Hooke's Law.

This may be a bit off topic, but Hooke's law can be derived. It is a conservation-of-momentum equation for a very specific case. Conservation laws have become laws because in every case that has ever been studied, the appropriate quantities (mass, momentum, energy) have obeyed them. Starting with the conservation principle, you can get pretty close to most of the equations of fluid mechanics and solid mechanics. Most of the time a constitutive relationship needs to be assumed (like defining or assuming what the viscosity of a fluid or the elasticity of a solid is), but with the conservation and constitutive equations, the equations that describe nature can be derived.

I guess in these cases, if you don't consider the conservation laws as a basis for the derivations, then they cannot be completely derived, but starting with conservation and using appropriate constitutive relations, quite a lot can be done.
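There is also a much lighter argument that makes Hooke's law plausible without the full continuum-mechanics machinery: Taylor-expand the potential energy of any smooth restoring force about a stable equilibrium at x = 0:

```latex
% Taylor expansion of a smooth potential about a stable equilibrium at x = 0:
U(x) \approx U(0) + \underbrace{U'(0)}_{=\,0}\,x + \tfrac{1}{2}\,U''(0)\,x^{2}
% U'(0) = 0 because x = 0 is an equilibrium; defining k \equiv U''(0) > 0,
F(x) = -\frac{dU}{dx} \approx -U''(0)\,x = -k\,x
```

Any smooth potential looks quadratic close enough to its minimum, which is why the linear force law shows up so universally for small displacements.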

All that said, there are statistical methods that give the probabilities of a linear fit versus a quadratic fit. Look into almost any statistical design book. R. A. Fisher's Statistical Methods, Experimental Design, and Scientific Inference is very well thought of; especially see the section "Tests of Goodness of Fit".
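One concrete version of that linear-versus-quadratic comparison is the classic F-test for nested models: does adding the quadratic term reduce the residuals by more than chance alone would? A sketch on made-up, genuinely linear data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 25)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, x.size)   # truly linear toy data

def rss(degree):
    """Residual sum of squares of a least-squares polynomial fit."""
    p = np.polyfit(x, y, degree)
    return np.sum((y - np.polyval(p, x)) ** 2)

n = x.size
rss1, rss2 = rss(1), rss(2)   # linear (2 params) vs quadratic (3 params)

# F statistic for one extra parameter; large F means the quadratic term
# buys far more residual reduction than noise alone could explain.
F = (rss1 - rss2) / (rss2 / (n - 3))
p_value = stats.f.sf(F, 1, n - 3)
print(f"F = {F:.2f}, p = {p_value:.3f}")
```

A large p-value says the quadratic term is not statistically justified, so you keep the line; a tiny p-value says the line is leaving real structure in the residuals.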


Really? Hooke's Law can be derived? Can you show me how?

Also, is comparing the coefficient of determination for different fits a valid way of deciding whether the relationship is linear?

Thanks.


Please don't think of this as ducking the question, but any book on Mechanics of Solids will have a derivation.

I can do it, but it is rather lengthy, and (please forgive me if I am making a hugely wrong assumption about a 16-year-old here) the math is probably beyond what you have learned. The derivation all the way from conservation of momentum involves integrals of vector and tensor fields, 3-D calculus operators like the gradient, and mathematical relations that turn volume integrals into surface integrals (the Divergence Theorem).


Never mind.

Thanks a lot though.

By the way, about the other question: is comparing the coefficient of determination for different fits a valid way of deciding whether the relationship is linear?


Coefficient of determination, that is R^2, right? If so, R^2 is the most abused statistic ever.

Let me propose a thought experiment. Suppose you have 20 data points. Fit them with a 19th-degree polynomial (20 coefficients, one per point, so it passes through every point exactly). What is the R^2 in this case? Exactly 1. Does this indicate to you that the 19th-degree polynomial is the best fit? What about a 20th, 21st, or 100th degree? Each of those also has an R^2 of exactly 1.

R^2 tells you how well the data fit your guessed function, and that is all it does. Period. It does not compare different trial functions. It does let you compare between guesses within the same form of trial function: i.e., it will change the m and b in the linear model y = mx + b, but it does not tell you whether the model should be y = mx + b or y = mx^2 + nx + b. I suspect you know, or you can certainly contrive examples where, over a small range of x, the y = x and y = x^2 models (and y = x^3, and so on) all look pretty similar. Which one is right? Hopefully the physics of the situation will lead you down the right path.
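That thought experiment can be run directly. This sketch uses made-up linear-plus-noise data with only 6 points, so a degree-5 polynomial interpolates them exactly and posts a perfect R^2 even though the true relationship is a line:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 6)
y = 2.0 * x + rng.normal(0, 0.1, x.size)   # genuinely linear data + noise

def r_squared(degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    p = np.polyfit(x, y, degree)
    ss_res = np.sum((y - np.polyval(p, x)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_line = r_squared(1)
r2_interp = r_squared(5)   # degree 5 through 6 points: exact interpolation
print(f"linear fit:          R^2 = {r2_line:.4f}")
print(f"degree-5 polynomial: R^2 = {r2_interp:.10f}")
```

The interpolating polynomial "wins" on R^2 by construction, which is exactly why R^2 alone cannot choose between model forms.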


Swansont just said I shouldn't fit something like a 23rd degree polynomial to it just because it fits well...

Well, thanks. But can you give me an example of a method that WOULD be appropriate for determining whether a relationship is linear?


Yes, you have to look up the Goodness of Fit test; many statistical design-of-experiments books will have info on it, like that R. A. Fisher book I cited above.

Swansont and I are saying the same thing. I was just trying to show you how meaningless R^2 values can become. A polynomial of degree n-1 (or any higher degree) fit to n data points has R^2 exactly equal to 1, so in that regime what does R^2 tell you? Essentially nothing. It was a contrived example to show how little R^2 alone proves.


Swansont didn't state it strongly enough. If I have 100 data points, I will almost always get a better fit with a higher-order polynomial than with a lower-order one. That does not mean the higher-order polynomial is better. It is often much, much worse.

The best model is, of course, the right model. Suppose you know a priori the general characteristic of some curve; you just don't know the exact details. Fitting the data to the known model will provide the details and will also provide a test of whether the a priori model is correct.

If no a priori model is available, then fitting the data to some model is more or less a "watch me pull a rabbit out of this hat" game. If you at least have estimates of the errors in the data points, you can tell whether the shiny 23rd-order rabbit has any more meaning than the dull first-order rabbit. You are done if the first-order rabbit (the linear model) explains the variance in the data to within the measurement errors. The idea of a polynomial model becomes dubious if you have to go higher than 2nd or 3rd order to get a good fit.
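That "explains the variance to within the measurement errors" criterion is usually phrased as the reduced chi-squared: with known per-point error bars sigma, a correct model gives chi^2 per degree of freedom near 1. A sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.1                                         # known measurement error
x = np.linspace(0, 1, 20)
y = 4.0 * x + 1.0 + rng.normal(0, sigma, x.size)    # truly linear toy data

def reduced_chi2(degree):
    """Chi-squared per degree of freedom for a polynomial fit.
    Near 1: the model accounts for the scatter to within the error bars.
    Much larger: a poor model. Much smaller: overfitting, or the
    errors were overestimated."""
    p = np.polyfit(x, y, degree)
    resid = (y - np.polyval(p, x)) / sigma
    dof = x.size - (degree + 1)    # data points minus fitted parameters
    return np.sum(resid ** 2) / dof

print(f"linear:    {reduced_chi2(1):.2f}")
print(f"quadratic: {reduced_chi2(2):.2f}")
```

If the linear fit already lands near 1, the extra quadratic term has nothing left to explain, which is the quantitative version of "you are done with the first-order rabbit."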

The problem with high-order polynomial fits is that they tend to shoot off to infinity very quickly outside the bounds of the dataset: they have extremely poor extrapolative capabilities. They also have poor interpolative capabilities: the high-order polynomial hits the input data points very accurately, but it has a good chance of shooting off to never-never land at points between the input values.
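That interpolation failure is Runge's classic example: a degree-10 polynomial through 11 equally spaced samples of a smooth bell-shaped curve hits every sample exactly, yet swings far from the true curve between the outer nodes:

```python
import numpy as np

# Runge's function, sampled at 11 equally spaced points on [-1, 1].
f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)
nodes = np.linspace(-1, 1, 11)
coeffs = np.polyfit(nodes, f(nodes), 10)   # degree 10: exact interpolation

# The polynomial matches f at every node, but between the outer nodes
# it oscillates wildly away from the true curve.
dense = np.linspace(-1, 1, 1001)
max_err = np.max(np.abs(np.polyval(coeffs, dense) - f(dense)))
print(f"max interpolation error on [-1, 1]: {max_err:.2f}")
```

The true function never exceeds 1, yet the interpolant's worst-case miss between nodes is larger than the function's entire range, and it only gets worse at higher degree.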
