
A probability problem


Alfred001


I'm trying to figure out how likely it is that abbreviations would get picked out of a dictionary by a particular sampling method. We can limit ourselves to one section (one letter).

 

The sampling method is that you pick every fifth word out of a double page of the dictionary. By double page I mean that when you reach the end of one page you don't restart the count on the next page, but keep counting up to five.
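
To make the procedure concrete, here is a minimal sketch of the count as described. It assumes each double page is given as one flat list of words and that the count restarts at the next double page; both are assumptions for illustration, and the function name and sample data are hypothetical.

```python
def sample_every_nth(double_pages, n=5):
    """Pick every n-th word from each double page.

    Assumption: the count runs continuously across the two facing
    pages and restarts at the next double page.
    """
    picked = []
    for words in double_pages:
        # 1-based positions n, 2n, 3n, ... are list indices n-1, 2n-1, ...
        picked.extend(words[n - 1::n])
    return picked


# Hypothetical double pages, each given as one flat list of words.
pages = [
    ["w%d" % i for i in range(1, 13)],  # 12 words
    ["x%d" % i for i in range(1, 8)],   # 7 words
]
sample = sample_every_nth(pages)
print(sample)  # ['w5', 'w10', 'x5']
```

Leftover words at the end of a double page (here w11, w12, x6, x7) are never picked under this assumption.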

 

Now, my question is, if you list all the abbreviations at the beginning of the section rather than listing them alphabetically like other words, does that make them less, more or equally likely to get picked than any other word?


There is no simple answer to this; it is going to depend upon the sizes of the abbreviation population and the population of other words.

 

Say there are 6 abbreviations and 30 words ( a short section).

 

If you lump all the abbreviations together at the beginning you are guaranteed only 1 hit regardless of the subsequent number of words, but your accuracy will decrease as the number of words increases as you will never get another hit.

 

If however the abbreviations are distributed it is possible to either get them all or miss them all depending upon the size of the rest of the word population.

With 30 or greater as I offered you could get none or all six.
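
The 6-and-30 example can be checked by hand or with a few lines of code. This is only a sketch: it assumes the count starts at the first entry, so positions 5, 10, 15, ... are picked, and the two scattered layouts are hypothetical extremes chosen for illustration.

```python
def count_hits(layout, n=5):
    """Count abbreviations ('A') among every n-th item of a layout string."""
    return layout[n - 1::n].count("A")


# 6 abbreviations ('A') and 30 ordinary words ('w'), 36 entries in all.
front_loaded = "A" * 6 + "w" * 30

# Two hypothetical scattered layouts: one puts an abbreviation on
# positions 5, 10, ..., 30; the other avoids every multiple of 5.
all_hit = "".join("A" if i % 5 == 0 and i <= 30 else "w" for i in range(1, 37))
none_hit = "".join("A" if i % 5 == 1 and i <= 26 else "w" for i in range(1, 37))

print(count_hits(front_loaded))  # 1
print(count_hits(all_hit))       # 6
print(count_hits(none_hit))      # 0
```

In each layout the sample has 7 picks (positions 5 to 35), so only the number of abbreviations among them changes.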

Edited by studiot

I recommend that you frame your question's parameters to fit one of these distributions in order to calculate the probability:

--->> binomial distribution

--->> Bernoulli distribution

--->> nominal distribution

--->> geometric distribution

--->> hypergeometric distribution

--->> etc.: any distribution that is not continuous and best suits your inquiry.

Note: although the normal distribution is related to the binomial distribution under suitable conditions, you should not use it here.

The best thing you can do is choose one of the distributions listed above.

Note: I have listed the types only very roughly.

Edited by blue89

For a large word population I don't think it matters what distribution you assume; sampling every 5th word will divide it into uniform blocks, a bit like Latin Squares, and the result will tell you what the distribution is.

 

This assumes all blocks are treated equally, according to the distribution function, in the seeding of abbreviations.

But this population is being compared to a second circumstance where the first block or few blocks are treated differently from the rest.

 

And the question was essentially which circumstance will lead to a better answer?

 

The answer to this is that neither will always lead to a better answer.

 

For very small word populations putting the abbreviations first is more likely to lead to a better answer.

For very large word populations putting the abbreviations first is more likely to lead to a significantly worse answer, but will only be definitely worse for an infinite word population.

The actual crossover point depends upon the actual populations.


 

I don't know what you mean by a better answer. A representative sample? What I'm wondering about is this:

If the abbreviations are distributed alphabetically, like all other words, then their likelihood of being picked is the same as for any other type of word. But if they are all placed at the beginning of the section, does that make them less, more or equally likely to get picked?

 

Now, I appreciate what you're saying, that the answer depends on the actual numbers, but can this somehow be expressed mathematically?
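
One way the front-loaded case could be written down, under stated assumptions: if positions 5, 10, 15, ... are picked and a contiguous block of m abbreviations is preceded by s other entries in the same count, the number of abbreviations picked is floor((s + m) / 5) - floor(s / 5). That is always m / 5 rounded either down or up. The sketch below checks the expression against a brute-force count; the names s, m and n are introduced here for illustration, and where a count restarts (for example at a new double page) the expression would have to be applied to each piece of the block separately.

```python
def hits_in_block(s, m, n=5):
    """Picked positions (multiples of n) falling in positions s+1 .. s+m."""
    return (s + m) // n - s // n


def hits_brute_force(s, m, n=5):
    """Same count, obtained by walking through the block position by position."""
    return sum(1 for pos in range(s + 1, s + m + 1) if pos % n == 0)


# The closed form agrees with the brute-force count on a grid of cases.
for s in range(0, 12):
    for m in range(0, 40):
        assert hits_in_block(s, m) == hits_brute_force(s, m)

print(hits_in_block(0, 6))   # 1
print(hits_in_block(0, 23))  # 4
print(hits_in_block(3, 23))  # 5
```

So a block of 23 abbreviations yields 4 or 5 picks depending on where it starts in the count, against 23 / 5 = 4.6.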


 

I'm virtually illiterate mathematically so I've no idea what any of those different distributions are.


 

Don't worry, and don't feel bad about it.

Mathematics is a nice and entertaining field.

Search for any of them on Google and study them if you have enough time.

Please note: in mathematics, probability is never treated as an empty subject, nor as something we determine through our own predictions or emotions.

Every subject in mathematics is systematic.

(Strictly speaking, we can find only probabilities, not predictions.) :)


 

That got me to wonder whether one could be mathematically illiterate, or whether it had to be mathematically innumerate. I guess that has a more specific meaning related to number manipulation.


Alfred, have you checked up on Latin Squares?

 

They are not difficult to understand.

 

Remember I said the question is like them, not exactly the same.

 

Personally I don't think it is possible to develop a single mathematical expression that covers all possible cases with the front loading described.

You would have to chop up the word count range and perhaps the % of abbreviations as well into segments and develop a separate formula for each.


 

I did look into Latin Squares, but I don't understand how they apply here.

 

Do other math-savvy folks agree that this can't all be expressed through a single formula?


 

I think if you were to narrow the format of the book down a bit then it could well be expressed as a single formula made up of estimations and projections. If all the abbreviations come at the beginning of the section for each letter (i.e. A.Z.X comes before aardvark) then you have 26 instances in which you are likely (how likely?) to get abbreviations. If the dictionary starts a new page for a new letter then that likelihood becomes a certainty on 26 pages and a possibility on 26 more (over two pages of abbreviations) or 52 (+4 pages) etc.

The easiest way to proceed with the task is of course to skip any page with any abbreviations on it. This of course would lower the percentage of words with a second letter near the beginning of the alphabet and near the end of the alphabet.


Imatfaal is right, you need to put some meat on the bones and demonstrate that you follow the answers you are receiving.

 

In particular I am concerned that you will use the sampling in a statistically unsatisfactory manner.

 

Let us say that 10% of the picked out words are abbreviations

 

Do you understand that it is unsatisfactory to simply say 10% of the words in the dictionary (or this section of the dictionary) are abbreviations?

 

Because you are comparing two different distributions - the sample and the original population you also need to estimate the probability that you are right (or wrong).

 

That is what confidence intervals are all about.
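
As a rough illustration of what such an interval looks like, the sketch below computes an approximate 95% Wilson score interval for a sample proportion. The sample size of 200 is hypothetical, chosen only so that 20 hits give the 10% figure above. The Wilson formula assumes independent random draws, which every-fifth-word sampling from an ordered list is not, so treat the result as indicative only.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 ~ 95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical sample: 20 abbreviations among 200 sampled words (10%).
low, high = wilson_interval(hits=20, n=200)
print(round(low, 3), round(high, 3))
```

With these made-up numbers the interval runs from roughly 6.5% to roughly 15%, which is why quoting "10%" on its own says too little.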


 

I don't understand what you mean by the last paragraph, I think we misunderstood each other.


 

No, no, I'm not going to be doing the sampling; it's already done. I'm trying to figure out whether there was bias in it.

Let me explain fully what's going on so there aren't any misunderstandings:

 

A guy did a study on word formation processes in English, wanted to see what % of words are created by each word formation process, of which abbreviation is one.

 

So he sampled the Supplement to the Oxford English Dictionary (OEDS) in the way I described and then he analyzed the results.

 

So what I'm wondering is whether his sample was biased given that the OEDS lists all abbreviations at the beginning of each letter section, rather than alphabetically, and if there WAS bias, was it against or towards the inclusion of abbreviations.


 

Sorry Alfred I was remembering your other thread - to wit

 

A sample was taken from The Supplement to the Oxford English Dictionary (OEDS) (1972-86) using the following method. The single-digit number 5 was chosen at random from a table of random numbers. Every fifth word was taken from each double page of the OEDS,...

 

I was assuming the two threads were connected and that the abbreviation question was seeking to quantify potential errors with this method.


Definitely there are some misunderstandings.

 

And yes I remember your other thread.

 

The problem is lack of precision.

 

What do you mean by bias?

 

Selection bias

Self selection bias

some other process

 

And as to the sampling technique.

 

Just opening the OED at the letter D shows imprecision of definition.

 

I had assumed that you had not included in your 'every 5th word' all the text associated with each entry.

But you seem to include all the text.

 

If you stuck with the main entry heading

 

eg

dab, dabble, dabby, dabchick, dabitis

 

being a sequence of 5 entries

 

Then all abbreviations are listed at the beginning under abbreviations, which is a sub heading under D,d (the first entry in the D section)

As such no abbreviations so listed at the beginning would be picked up.

 

So the word 'dab' has five separate entries.

How many would you count or reject?

 

The text associated with entries usually includes abbreviations that are repeated many times

eg v.t. ; colloq ; adv ; f.


Imatfaal, that IS what I'm asking :)

 


 

Regarding what I mean by bias, I don't know what the difference is between the different types of bias, but what I mean is are the abbreviations, given how they are listed, more, less or equally likely to be sampled than the words that are listed alphabetically.

 

Regarding the question about one entry having multiple words, just to see if I'm understanding you:

You're saying that some entries have multiple words, like:

 

dab, dabble, dabby, dabchick, dabitis

and you're asking do you count just the entry or would you, in your every fifth word count, count off all the words in that entry, so that if an entry has 20 words in it you take every fifth from it and you end up with 4 words from that single entry. You're asking because abbreviations are listed in the same way.
Right?

I would have to assume you count each of the words in an entry, because, obviously, if you didn't then you'd just skip over all the abbreviations, as you said. I doubt the guy'd pick a method that foolish. Also, in the dab example, all of those different words are instances of a different word formation process, so they would each have to be counted, because, again, the idea is to investigate what % of the English word stock is being created by what word formation process.

You would certainly not count the abbreviations that are part of the writeup of the entry, the dictionary's own abbreviations. The idea of the research is to take an inventory of the word stock, so you just count the actual words in the dictionary. So, regarding what I said earlier, that you take every word in the entry, that does not apply to the text of the entry, the explanation of the word, it only applies to all the words in the heading, like all the dab words.

Incidentally, you seem to have the OED at hand. Do you happen to have the Supplement to the OED?
Edited by Alfred001

 

 

 

I can see you are having as much trouble understanding me as I am understanding you.

 

:)

 

Yes I have the OED, but unfortunately not the supplement so I don't know what the layout is for that part.

 

As regards my other questions I have annotated the page in question.

 

1) Are you only choosing from the head word as I noted, 1,2,3,4,5 all in red?

 

(These are different words not from the same stem.)

 

2) Where the word appears multiple times as a head word as I have underlined dab several times, how many times does it count?

 

3) If you are only choosing head words you will never see the abbreviations, since they are in the body of the text, as ringed.

 

4) if you are including the body of the text in the count, note the many standard abbreviations as arrowed in red. How are these reckoned?

 

[Attachment: OED1.jpg]


Let me copy the part where the guy talks about his sampling method:

 

A sample was taken from The Supplement to the Oxford English Dictionary (OEDS) (1972-86) using the following method. The single-digit number 5 was chosen at random from a table of random numbers. Every fifth word was taken from each double page of the OEDS, providing that
1. The word was not an addition to an entry in the first edition of the Oxford English Dictionary (OED 1).
2. The word was not spelled in precisely the same way as a word already listed in OED 1.
Q. What kinds of innovation will be missed by this method?
A. New meanings of old forms will be missed, quite deliberately. The experiment is about new forms, not about new meanings of old forms. Vocabulary change involves both of these aspects, but only one is considered here.

So this is slightly ambiguous. He says that words spelled like words already in OED 1 are skipped, because he's interested in new forms, not new meanings of old forms. Does that mean he would only count the one "dab" (we see multiple dabs in the picture you give), or that, provided none of those dabs were in OED 1, he would count all of them? I have to assume it's the former, since the latter doesn't make sense.

 

 


 

1) You mean do I skip all those abbreviations because they are in the text and not at the beginning of an entry as a head word?

No, I have to assume the guy would have collected all those abbrevs with the every fifth method.

 

This answers 3), too

 

2) so the stuff above the quote answers this, Yeah, you only take the one dab

 

4) no, you discount those. Only abbrevs that are entries in the dictionary. Only abbrevs that are head words.

Edited by Alfred001

Sorry but I think you have only answered one of my 4 questions.

 

1) is the most important and not the same as 3

 

[Attachment: OED2.jpg]

 

It is still not clear what words or symbols that are on the page are discarded and what words or symbols are used to form a list from which one in 5 is then chosen.

 

So I have reproduced number 2 of the words on my earlier list - note you originally rejected all of them,

 

but the attachment contains every word associated with what I call the head word - dabble- in this case.

 

Now I have asked and received directly conflicting answers to this question

 

Do you consider just Dabble as a candidate word, or all the words in my attachment? For example, are Mitchell, moisten, or, soil counted?

 


 

 

From your attachment you only take the words in bold. You don't take anything from any text that is part of a description of a word or that gives examples on it.

 

So if you had a dictionary that was all just a list of words with no description or explanation you'd count everything there (provided it's not spelled the same as another word already in there), or rather every fifth of those words. Anything else, whether part of an explanation of a word or a citation of a text exemplifying its use, you disregard.
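
The counting rule described here could be sketched as follows. The entry structure (a dict with "headwords" and "text" keys) is purely hypothetical, and the check against OED 1 from the study's provisos is left out because it would need a word list that isn't available here.

```python
def candidate_headwords(entries):
    """Collect bold headwords in page order, keeping one copy per spelling."""
    seen = set()
    candidates = []
    for entry in entries:
        for word in entry["headwords"]:  # bold words only
            if word not in seen:         # skip repeated spellings
                seen.add(word)
                candidates.append(word)
        # entry["text"] (definitions, citations) is deliberately ignored
    return candidates


# Hypothetical entries modelled on the 'dab' page discussed above.
entries = [
    {"headwords": ["dab"], "text": "definition text"},
    {"headwords": ["dab"], "text": "definition text"},
    {"headwords": ["dabble"], "text": "definition text"},
    {"headwords": ["dabby"], "text": "definition text"},
    {"headwords": ["dabchick"], "text": "definition text"},
    {"headwords": ["dabitis"], "text": "definition text"},
]
candidates = candidate_headwords(entries)
print(candidates)        # ['dab', 'dabble', 'dabby', 'dabchick', 'dabitis']
print(candidates[4::5])  # ['dabitis']
```

Whether a repeated spelling should be dropped before or after counting to five is exactly the kind of detail the study's description leaves open; this sketch drops it before.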


 

Thank you.

 

In which case you need to clarify the layout in the Supplement, since it can't be the same as in the OED itself.

Sampling the OED according to these rules will yield precisely zero abbreviations.


But why do you say that?

Maybe you misunderstood what I meant when I said you don't take anything from the text that is part of the description of a word. That doesn't apply to a bolded word in the text: you take everything that is bolded, as it represents another entry, another word.

 

So if you look at that earlier attachment, all the abbreviations, they are not listed at the beginning of a paragraph, rather in the text, but they are bolded. So you WOULD count them.

 

You count every word that is an entry in the dictionary, you don't count any of the text describing the word.
