
Determining the type of probability distribution



I have collected 53,000 data samples, and I want to find out how they are distributed, i.e. whether they follow a normal distribution, a Laplace distribution, etc.

 

Is there a general way of determining this?


The ideal place to start, though it is not always possible, is the physical phenomenon that generated the sample. Is the physical phenomenon expected to have a certain distribution? If it is, then the data should as well, or at least follow a "similar" function.

 

Secondly, examining the measuring device of the data can also sometimes yield insight. I remember a buddy taking a psychology course where the teacher confidently proclaimed that every human trait is distributed normally. That is utter nonsense, because I don't think that hetero-homo sexuality is normal with a large number of "sort of straight and sort of gay" in the middle -- it's probably pretty bimodal, with large lumps straight and large lumps gay and only a few in the middle who are unsure. But the teacher can make that proclamation because every trait in psychology is measured by filling out the "strongly agree, agree, neutral, disagree, strongly disagree" questionnaire (or something similar), and then the trait the test is supposed to measure is obtained from the average of how the person answers those questions.

 

It is the Central Limit Theorem in practice. It isn't that every trait is normal, but that the sum of a large number of independent observations from any distribution with finite variance will approach normality. The test measured the trait over and over again, enough that the CLT took over and made it appear as if every trait is distributed normally. But the psychology teacher didn't have the math knowledge to understand that.
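To make the averaging effect concrete, here is a minimal sketch. The answer weights, the number of items, and the function names are all made up for illustration; they are not taken from any real questionnaire. Individual answers are drawn from a strongly bimodal pattern, yet the per-person averages come out close to bell-shaped, which shows up as an excess kurtosis near zero.

```python
import random
import statistics

# Hypothetical bimodal answer pattern on a 1..5 scale:
# mostly "strongly agree" or "strongly disagree", rarely the middle.
ANSWER_VALUES = [1, 2, 3, 4, 5]
ANSWER_WEIGHTS = [40, 8, 4, 8, 40]


def simulate_questionnaire_scores(n_people=5000, n_items=30, seed=0):
    """Return one averaged score per simulated person."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_people):
        answers = rng.choices(ANSWER_VALUES, weights=ANSWER_WEIGHTS, k=n_items)
        scores.append(statistics.fmean(answers))
    return scores


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (zero for a normal distribution)."""
    m = statistics.fmean(xs)
    var = statistics.fmean([(x - m) ** 2 for x in xs])
    fourth = statistics.fmean([(x - m) ** 4 for x in xs])
    return fourth / var ** 2 - 3.0


if __name__ == "__main__":
    rng = random.Random(1)
    single = rng.choices(ANSWER_VALUES, weights=ANSWER_WEIGHTS, k=5000)
    print("single answers :", excess_kurtosis(single))
    print("averaged scores:", excess_kurtosis(simulate_questionnaire_scores()))
```

The single answers should show a clearly negative excess kurtosis (two lumps, little in the middle), while the averaged scores should sit near zero.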

 

That aside is just to show that the measuring device can influence the expected statistics as well. If it was a temperature probe that averages the temperature over a period of time, for instance, you may run into CLT issues again, especially if the averaging period is much longer than the fluctuation period.

 

However, all that said, if you don't have any clues, there is a branch of statistics that deals with goodness of fit. You should be able to find topics on goodness of fit in more advanced statistics books.
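As one example of a goodness-of-fit measure, here is a rough sketch of the Kolmogorov-Smirnov statistic, which is the largest gap between the empirical CDF of the samples and a candidate CDF. The function names and the synthetic data are my own assumptions, not anything from the original experiment. Note that if the distribution parameters are estimated from the same data, the standard KS critical values no longer apply directly, so treat the number as a comparison tool rather than a formal test.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def laplace_cdf(x, mu, b):
    """CDF of a Laplace distribution with location mu and scale b."""
    z = (x - mu) / b
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def ks_statistic(samples, cdf):
    """Largest distance between the empirical CDF and the candidate cdf."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic Laplace(40, 2) data: the difference of two i.i.d. exponentials
    # with mean 2 is Laplace-distributed with scale 2.
    data = [40 + rng.expovariate(0.5) - rng.expovariate(0.5) for _ in range(5000)]
    print("D vs Laplace:", ks_statistic(data, lambda x: laplace_cdf(x, 40, 2)))
    print("D vs normal :", ks_statistic(data, lambda x: normal_cdf(x, 40, 2 * math.sqrt(2))))
```

A smaller D means the candidate CDF tracks the data more closely.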


I feel like providing you with a bit more information about my experiment. It would be nice with a small discussion.

 

I am running a thread (in .NET) on a computer (running Windows). Every time the thread is executed (every 40 ms), I timestamp that execution, which leaves me with the time between consecutive thread executions; these intervals are my samples.

 

Now, because of the intrinsic structure of Windows thread management and time slicing (distribution of CPU-time to each running process), I have assumed that these samples are normally distributed around 40 ms. The assumption is loosely based on the CLT, which I took to suggest that unknown phenomena are probably normally distributed.

 

Take a look at the histograms of the samples.

The first one has a resolution of 10 ms. At first glance this might be argued to look normally distributed.

q10ms.png

Now, the resolution is 1 ms.

q1ms.png

And 1/10 ms.

q1_10thms.png
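For anyone who wants to reproduce this kind of plot, below is a minimal sketch of binning the same intervals at different resolutions. The sample values are invented for illustration and the function name is hypothetical; the real data set has 53,000 intervals.

```python
import math
from collections import Counter


def histogram(samples_ms, bin_width_ms):
    """Count samples per bin; keys are the lower edge of each bin in ms."""
    counts = Counter(math.floor(x / bin_width_ms) for x in samples_ms)
    return {round(k * bin_width_ms, 6): counts[k] for k in sorted(counts)}


if __name__ == "__main__":
    # Made-up intervals in milliseconds.
    intervals = [39.8, 40.1, 40.2, 44.9, 50.0, 50.1, 60.3]
    print(histogram(intervals, 10))  # coarse bins hide the fine structure
    print(histogram(intervals, 1))   # finer bins start to separate the peaks
```

With very fine bin widths such as 0.1 ms, values sitting exactly on a bin edge can land in the neighbouring bin because of floating-point rounding, so integer timestamps (e.g. microseconds) are safer to bin.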

 

These look more like a Laplacian distribution. However, I cannot verify that they indeed are Laplacian, and not logarithmic or some other distribution.


The thing is, hobz, that just fitting your data to a curve is fairly easy. You define some fitness measure -- sum of squares is a popular choice -- then adjust the parameters of each candidate distribution to minimize that measure, and finally pick the curve whose best set of parameters fits best overall.
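A rough sketch of that procedure follows, under several assumptions of mine: the data are synthetic, the candidates are only the normal and Laplace densities, and the "adjustment" is a coarse grid search rather than a proper optimizer. All names are hypothetical. The fitness measure is the sum of squared differences between the histogram density and the candidate pdf evaluated at the bin centers.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, mu, b):
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


def histogram_density(samples, lo, hi, n_bins):
    """Return bin centers and the density estimate (count / (n * width)) per bin."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    density = [c / (len(samples) * width) for c in counts]
    return centers, density


def best_fit(pdf, centers, density, mu_grid, scale_grid):
    """Grid search for the (sse, mu, scale) that minimizes the sum of squares."""
    best = None
    for mu in mu_grid:
        for s in scale_grid:
            sse = sum((pdf(c, mu, s) - d) ** 2 for c, d in zip(centers, density))
            if best is None or sse < best[0]:
                best = (sse, mu, s)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic Laplace(40, 2) samples, standing in for the measured intervals.
    data = [40 + rng.expovariate(0.5) - rng.expovariate(0.5) for _ in range(20000)]
    centers, density = histogram_density(data, 20.0, 60.0, 80)
    mu_grid = [38 + 0.25 * i for i in range(17)]      # 38.0 .. 42.0
    scale_grid = [0.5 + 0.25 * i for i in range(23)]  # 0.5 .. 6.0
    print("Laplace:", best_fit(laplace_pdf, centers, density, mu_grid, scale_grid))
    print("Normal :", best_fit(normal_pdf, centers, density, mu_grid, scale_grid))
```

The candidate with the smaller minimized sum of squares is the better fit by this measure; the result does depend on the bin width chosen for the histogram.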

 

But I want to return again to the nature of the phenomenon being measured. This should give you the best insight as to which distribution is the most meaningful. I won't repeat the examples from my first post, but just give you another. The Weibull distribution is used when analyzing failures (of anything, mechanical, electrical, etc.) because it is based on the idea that the weakest link in the chain breaks first.

 

A good statistics and probability book should have some good discussions on the origins of the distributions.

 

A curve fit is also kind of a trivial exercise without further data gathering -- what curve best fits not only the current data but the future data as well. As more data comes in, it may become more apparent which shape of curve is better.

 

Finally, until it is resolved whether the maximum is peaked or rounded, the exponential and Laplace distributions are both going to fit that data pretty well.


Thanks for the insight!

 

It would appear that in this case the maximum is peaked. This is a bit unfortunate, since the pdf with the most meaning assigned to it (prior to the experiment being performed) would be the Gaussian distribution.


You already have the distribution.

Unfortunately for you, it might not fit any of the "usual" distributions very well. The data might not be normally distributed, and they might not fit a Laplace or Weibull distribution either.

I am running a thread (in .NET) on a computer (running windows). Every time the thread is executed (every 40 ms), I timestamp the instance of the execution, leaving me with the time between each thread execution; my samples.

One thing is certain: these data cannot truly be normally distributed, because the intervals are non-negative. That does not mean you cannot model them as having a normal distribution. The normal distribution has so many useful mathematical properties that people frequently use it to model processes that technically cannot be normal. Nothing per se is wrong with this; the normal distribution is often approximately correct in many instances where it is technically invalid.
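A quick way to judge how much the non-negativity matters is to compute how much probability a normal model would put below zero. The sketch below assumes a mean of 40 ms and a few hypothetical standard deviations; the actual spread of the measured data is not stated in this thread.

```python
import math


def normal_cdf(x, mu, sigma):
    """Normal CDF written with erfc so that small tail values keep precision."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))


def mass_below_zero(mu, sigma):
    """Probability that a normal(mu, sigma) model assigns to negative values."""
    return normal_cdf(0.0, mu, sigma)


if __name__ == "__main__":
    for sigma in (5.0, 10.0, 40.0):  # hypothetical spreads in ms
        print(f"sigma = {sigma:5.1f} ms -> P(X < 0) = {mass_below_zero(40.0, sigma):.3g}")
```

When the mean is several standard deviations above zero, the impossible negative region carries negligible probability, which is why the normal model can still be a usable approximation in such cases.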

 

In this case, the skewed nature of the histogram (10 ms resolution), the very long tail, and the multimodal behavior (1 ms resolution) pretty much rule out the normal distribution. In fact, the last feature (multimodal behavior) pretty much rules out any textbook distribution. You appear to have extremely narrow peaks at about 45, 50, 55, and 60 ms. Are these real or are they artifacts of the measurement process?
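If you want to flag such narrow peaks programmatically rather than by eye, a crude sketch is to look for histogram bins that stand well above both neighbours. The threshold ratio and the example counts below are invented for illustration, and this is only a heuristic, not a statistical test for multimodality.

```python
def find_peaks(counts, min_ratio=3.0):
    """Return indices of bins that are strict local maxima and at least
    min_ratio times the average of their two neighbours."""
    peaks = []
    for i in range(1, len(counts) - 1):
        left, mid, right = counts[i - 1], counts[i], counts[i + 1]
        if mid > left and mid > right and mid >= min_ratio * (left + right) / 2.0:
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    # Made-up counts for consecutive 1 ms bins.
    counts = [1, 2, 30, 3, 2, 2, 2, 15, 2, 1, 1, 1, 12, 1, 0]
    print(find_peaks(counts))
```

On real data the bin counts are noisy, so the ratio threshold would need tuning against the bin width and sample size.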

You appear to have extremely narrow peaks at about 45, 50, 55, and 60 ms. Are these real or are they artifacts of the measurement process?

They are real. Windows assigns CPU-time in slices on the order of tens of milliseconds. This accounts for the "preferred" intervals of 40, 50 and 60 ms. The intervals in between are, to the best of my knowledge, a random phenomenon arising from the intrinsic CPU scheduling algorithms of Windows, along with all the other processes competing for CPU-time at the same time (simulated by multithreading at least) as the thread I am running.

 

I read somewhere that over time the CPU burst times (i.e. the time a process actually uses the CPU) are exponentially distributed, thus leading to some understanding of the results shown in the graph. Assuming that we displace this exponential distribution, which in fact we do by requesting CPU-time every 40 ms, then a double sided exponential distribution (Laplace distribution) would seem a good approximation. Of course, the Laplace does not account for the spikes at 50 and 60 ms.

