Welcome to ScienceForums.Net!
|
After you've registered, come in and introduce yourself, or visit the forum index. If you need any help registering, posting, or if you just have some questions about our site, please feel free to contact us at staff at scienceforums dot net.
|
|
| Guest Message © 2012 DevFuse | |
Probability puzzle Picking samples from sets
#1 3 December 2011 - 07:54 PM
Suppose I have a set with T members. I randomly choose a subset with X members. Independently of that, someone else randomly chooses a subset with Y members. Without loss of generality, assume X <= Y. Given a number N (N <= X), what is the probability that the two subsets have exactly N members in common?
Now for the real-life context. Historians often find multiple copies of very old manuscripts but don't know how they came into existence. One technique they use is to look for common mistakes. If two documents have exactly the same mistake, then the suspicion arises that one was copied from the other. The more common mistakes there are, the more certain it becomes that one document was copied from the other, because it would be too much of a coincidence for the same mistakes to get made independently.
I am trying to do the maths behind all this but struggling. I thought someone might enjoy this as a probability exercise.
- Posts: 13 | Joined: 03-December 11
Reply
#2 4 December 2011 - 03:53 AM
beaumontfletcher, on 3 December 2011 - 07:54 PM, said:
Suppose I have a set with T members. I randomly choose a subset with X members. Independently of that, someone else randomly chooses a subset with Y members. Without loss of generality, assume X <= Y. Given a number N (N <= X), what is the probability that the two subsets have exactly N members in common?
I would try this:
Suppose you have your set A, and the other has their set B.
For an arbitrary subset of A, with N members, what is the chance that each of the subset's members is in B?
Also what is the chance, assuming that each of the subset's members is in B, that each of the remaining members in A is not in B?
Then use combination to find the number of ways to choose a subset of A with N members.
Without filling in the details, I don't know if this work properly and simply enough. Is it enough to inspire a solution???
This post has been edited by md65536: 4 December 2011 - 04:23 AM
- Posts: 790 | Joined: 28-July 10
Reply
#3 4 December 2011 - 05:18 PM
Assume you've picked your first subset and have picked N members from that subset. Now look in the second subset, the one that has Y members. The probability that all N of your members got picked for the second subset as well must be the product of (Y/T), (Y-1 / T-1), (Y-2 / T-2)...and so on, down to (Y-(N-1) / T-(N-1)). This evaluates to:
Y!(T-N)! / T!(Y-N)!
This formula looks right to me. E.g. if you simplify by setting N=Y, then the formula becomes Y!(T-Y)!/T! which is obviously the probability of picking a given subset of Y members from a set containing T members (i.e. the inverse of the number of ways of choosing Y from T).
I think that's what the answer to your first suggestion is. I got a bit stuck thereafter, trying to tie this probability with the probability of those N members being chosen from the first subset, the one that contains X members.
Unless someone hands me a solution, I'm going to try to find some textbook which discusses problems like this. I doubt it will have the same problem exactly, but it should help me get unstuck.
This post has been edited by beaumontfletcher: 4 December 2011 - 05:20 PM
- Posts: 13 | Joined: 03-December 11
Reply
#4 4 December 2011 - 05:59 PM
beaumontfletcher, on 4 December 2011 - 05:18 PM, said:
I haven't worked through the math or logic of the first part, but I'll go on to this part.
The first part deals with the probability P of an arbitrary selection of N elements matching in sets A and B (and importantly if you haven't done it*, it must also include the probability that the remain elements don't match).
http://en.wikipedia....iki/Combination
X C N ("X choose N") tells you the number of different ways of choosing N elements from a set of X members.
Each of these possible subsets are similar and each has a probability of P of matching exclusively to sets A and B.
Each of these possible subsets is equally likely to be chosen as an "arbitrary selection of N elements".
Each of these possible subsets are unique, so the possibility of choosing one vs another is mutually exclusive. That means that you can just add up the probabilities of any one of them matching (which is P), for each of the possibilities of actually choosing that group.
So the final answer should be: P * (X C N)
If it's confusing, consider it to be rephrasing the question to be: "What is the probability that this group of N elements from A matches exclusively with B, or that this other group of N elements does, or that this other group does (and so on for each possible group of N)?"
* Glancing at your solution to the first part, I'm guessing that it doesn't take this into account? If you calculate the probability that all N elements match, but leave open the possibility that additional elements will match, then the probabilities of one combination matching vs another are not mutually exclusive or whatever, and the second part won't work.
- Posts: 790 | Joined: 28-July 10
Reply
#5 5 December 2011 - 08:43 PM
The XCN formula you mention might be correct (I haven't got my head round it yet) but, if so, it means that my formula was wrong, because if I combine yours with mine, the result is demonstrably wrong. What you have written as P * XCN evaluates as follows if I substitute P by my formula given above:
( Y!(T-N)! * X! ) / ( T!(Y-N)! * N! * (X-N)! )
This looks plausible because it is at least symmetrical in X and Y. But if we set N=1 then the formula collapses to XY/T. That's obviously wrong because it will exceed 1 if the two subsets are large enough.
So either my formula is wrong or the extension you suggested is wrong. I need to get my head round that!
- Posts: 13 | Joined: 03-December 11
Reply
#6 5 December 2011 - 09:11 PM
beaumontfletcher, on 5 December 2011 - 08:43 PM, said:
I think that using the number of combinations and simply multiplying by that, assumes that only one possible combination can work. Otherwise, the probability of one combination working can overlap with the probability of another combination working, so you can't just add their probabilities together.
As an example, say that X = Y = T = 10, and we choose N = 5.
Obviously, all 10 members in your chosen set (A) match all 10 members in the other's chosen set (B). Probability is 1.
Any subset of 5 that we choose will also match (but not exclusively; the remaining 5 will also match).
10 C 5 is 252, so if we just add up the probabilities that each possible set of 5 matches, then we get a probability of 252, which is wrong.
My formula is wrong for doing it this way. If your formula is right, mine won't work with it.
However, with the original question of matching only exactly 5, if say X = Y = 10 but T = 100, and say that the intersection of X and Y has 5 members... then there is only 1 possible combination of 5 members from X that matches exclusively with 5 members of Y. The chance that one combination matches excludes the chance that some other combination also matches.
I haven't wrapped my head around the first part of the puzzle enough to know whether either of our methods seems right or not.
That is... my strategy is to find the probability that some specific (but generic) set of size N will match exclusively, and then multiply by all the possible ways you can choose that set. If you can find the probability that any set of size N will match exclusively, then you don't have to worry about the number of ways you can choose such a set.
This post has been edited by md65536: 5 December 2011 - 09:25 PM
- Posts: 790 | Joined: 28-July 10
Reply
#7 6 December 2011 - 09:05 AM
- Posts: 13 | Joined: 03-December 11
Reply
#8 6 December 2011 - 06:42 PM
beaumontfletcher, on 6 December 2011 - 09:05 AM, said:
So, out of curiosity I tried working through the first part and I get the same answer you did.
I should warn you that I'm not a mathemagician!
ALSO NOTE: At the end of writing this I ran into a problem that puts the math for "P2" below in doubt. If you can follow my reasoning below you might be able to figure out if it's okay or if not, maybe you can see the right way to do it. I'll probably look at this again later and see if I can figure out the problem, but it might be beyond me!!!
First part:
Consider the first N elements of set A. The chance that the first element is in Y is: Y / T. The second (Y-1)/(T-1), etc.
The product of all N is: P1 = Y!/(Y-N)! / [T!/(T-N)!]
This is the same answer that you came up with.
So while we were doing this we removed matching elements from B and the set that had T elements so that they didn't affect the probability of matching subsequent elements.
We're left with (X-N) elements that we want to make sure are not in a set of (Y-N) elements, which can be chosen randomly from a set of (T-N) elements.
Sorry for the mess of notation but let's call C the matching subset with N members.
The chance that the first element in (A \ C) IS in (B \ C) would be: (Y-N)/(T-N)
So the chance that it's not is 1 - (Y-N)/(T-N)
Now, we don't have to remove any elements from (B \ C), because we didn't match any!
I'm not so sure if we should remove it from the "main set" that has (T-N) elements left in it. But I think that we should!
So the chance that the second element in (A \ C) is NOT in (B \ C) is 1 - (Y-N)/(T-N-1)
We do this for all (X-N) elements...
... exceeding my math abilities...
We get the product P2 = [1 - (Y-N)/(T-N)][1 - (Y-N)/(T-N-1)] ... [1 - (Y-N)/(T-N-(X-N-1)]
I don't know how to simplify that. There's gotta be a simpler expression!
Finally X C N = X! / (N! (X-N)!)
Then the final answer would be P1*P2*(X C N)
Note: If (X-N)+(Y-N) > (T-N) then P2 must be 0, because there is no way to divide the "leftover" elements of the full set between sets A and B, with no overlap between them.
If we allow such a case, then one of the products in P2 should be 0, but you could also end up dividing by 0.
Hrm.....................
- Posts: 790 | Joined: 28-July 10
Reply
#9 6 December 2011 - 08:32 PM
As I said, I will try the N=1 case or some other simplification. It's obviously a bit too ambitious to have a go at the general case yet.
This post has been edited by beaumontfletcher: 6 December 2011 - 09:28 PM
- Posts: 13 | Joined: 03-December 11
Reply
#10 7 December 2011 - 03:26 AM
beaumontfletcher, on 6 December 2011 - 08:32 PM, said:
As I said, I will try the N=1 case or some other simplification. It's obviously a bit too ambitious to have a go at the general case yet.
Well, it's an interesting puzzle and I think I must have OCD.
In the case that X=Y=T, the probability that N elements in X match exclusively with N elements in Y is 1 iff N = X, otherwise the probability is 0. I think it's okay for the formulae to fail in impossible cases, and usually you exclude impossible cases in the description of the problem (because we'll already know that it's impossible if N is too low relative to T, etc).
The reasoning is that "This formula only works when there is enough of the T elements to be split up among A and B so that after the N elements are matched, there are still enough unique elements to fill out A and B." Otherwise if it's impossible to do so the probability is 0. So we can say "For X-N+Y-N > T-N, the probability is 0, otherwise it's... (some formula)".
But I must have been on crack in trying to figure out the "P2" part in my last post. So I'll give it another go!
Basically, in the P1 part we match the first (in an arbitrarily chosen order) N elements in A with elements in B. We can remove this subset C of matching elements from A, B, and the full set with T elements. We're left with 3 smaller sets and we want to find the probability that none of the elements of the first are in the second, where both sets are subsets of the 3rd.
So let's look only at the P2 part with some new names. Let's call the sets E, F, and G, with size e,f,g respectively, where E = A \ C, F = B \ C, and G = {the full set with T elements} \ C.
We want the probability that no members of E are in F. I tried to do it one way above and the math didn't come out nice.
But I realized that the problem is equivalent to finding the probability that all the members of E are in the remainder of G after removing F. That is, if all members in E are in G \ F, then no members of E can be in F.
So like with figuring out P1, the chance that the first element in E is in G \ F is: (g-f)/g
And for the next element is (g-f-1)/(g-1)
etc for all e members... with a product of P2 = (g-f)!/(g-f-e)! / (g!/(g-e)!)
I just checked an example in a spreadsheet and I get the same value as the convoluted math of my last post!
Note that e=X-N, f=Y-N, g=T-N
So I think the P1*P2*(XCN) formula is now something manageable, when it's all put together!
This post has been edited by md65536: 7 December 2011 - 03:44 AM
- Posts: 790 | Joined: 28-July 10
Reply
#12 11 December 2011 - 03:51 PM
Firstly, I thought I'd reduce the number of symbols we need by using the same symbol to represent a set and the number of elements in it. So T is both the set of all elements and also the number of those elements. The context should make it obvious which meaning is being used at any point in the analysis.
I drew two diagrams, one to show the general case when we have a match, i.e. N is contained in both X and Y; and the general case when we do not have a match, i.e. the set N we picked from X is only partly contained in Y.

Using this, I agree with your formula for P2. It is:
P2 = (G-F)!(G-E)! / G!(G-F-E)!
We already know P1 and XCN. The problem is that we cannot work out the final answer unless we know what F, G and E are. Two of these are easy:
G = T - N
E = X - N
In the match case, obviously F = Y - N. But, as my question mark in the picture shows, we do not know what F is in the non-match case.
Without a formula for F, the P2 expression cannot be evaluated...
Ah! I think I was wrong to worry about the non-match case, since you are calculating P2 on the premise that we have a match (that is what P1 is for). Therefore, we have F = Y - N. So our final probability is
p = P1 * P2 * XCN
We know from above that P1 = Y!(T-N)! / T!(Y-N)!
From your post,
P2 = (G-F)!(G-E)! / G!(G-F-E)!
Substituting the expressions for F, G and E we get
P2 = (T-Y)!(T-X)! / (T-N)!(T-X-Y+N)!
So the final probability:
p = Y!(T-N)!(T-Y)!(T-X)!X! / T!(Y-N)!(T-N)!(T-X-Y+N)!N!(X-N)!
This simplifies to:
p = X!Y!(T-X)!(T-Y)! / T!N!(X-N)!(Y-N)!(T-X-Y+N)!
This is at least symmetric in X and Y. I will see how I can check this with some examples.
This post has been edited by beaumontfletcher: 11 December 2011 - 04:06 PM
- Posts: 13 | Joined: 03-December 11
Reply
#13 11 December 2011 - 05:21 PM
beaumontfletcher, on 11 December 2011 - 03:51 PM, said:
Yeah! I think that basically when you're multiplying probabilities, including with all the factorial factors, you're essentially saying "What is the probability that this first thing happens, and assuming that it does, what is the probability that this second thing also happens, and assuming that it does, what is the probability that this third thing also happens, etc..."
If you ever get a probability factor of 0 for any of the individual factors, then the whole thing comes out to 0. BUT ALSO, if the probability that "the second thing" is 0, then you can't assume that it happens when looking at the probability of the third thing. That is why the formula can fail in impossible cases, because it assumes impossible things. So you would state the domain of the function, and either state that you assume the problem is within the domain (ie assume that T is big enough) or you would handle the other cases separately.
I think the reasoning is good; I hope the math works out! I haven't tried putting it all together myself, but I usually end up making mistakes that need correcting.
- Posts: 790 | Joined: 28-July 10
Reply
#14 11 December 2011 - 05:58 PM
I will come back and update the thread when I have done the programs. In the meantime, thanks for all your help!
- Posts: 13 | Joined: 03-December 11
Reply
#15 14 December 2011 - 04:26 PM
If you use Excel and put T in cell A1, X in cell B1, Y in cell C1 and N in cell D1, then the Excel formula for the probability is:
=FACT(B1)*FACT(C1)*FACT(A1-B1)*FACT(A1-C1)/(FACT(A1)*FACT(D1)*FACT(B1-D1)*FACT(C1-D1)*FACT(A1-B1-C1+D1))
The surprising thing is that, at the start, the probability goes up as T goes up (with X, Y and N constant) before starting to fall:
T Probability
10 0.099206349
11 0.162337662
12 0.220959596
13 0.271950272
14 0.314685315
15 0.34965035
16 0.377747253
17 0.399967679
18 0.417250233
19 0.430426557
20 0.440208978
21 0.447196422
22 0.451887294
23 0.454694047
24 0.455957086
25 0.455957086
26 0.454925509
27 0.453053388
28 0.450498575
29 0.447391689
30 0.443840961
31 0.439936202
32 0.435752026
33 0.43135049
34 0.42678325
35 0.422093324
So the probability peaks at 24 and 25 before starting to fall. Why does it rise between 10 and 24? It doesn't make sense.
- Posts: 13 | Joined: 03-December 11
Reply
#16 14 December 2011 - 07:25 PM
beaumontfletcher, on 14 December 2011 - 04:26 PM, said:
My guess is that for small values of T, you're much more likely to match 2 or 3 or more, than just 1.
For T=5, the probability is 1 for N=5.
For T=9, the sum of the probabilities for N=[1,5] will be 1.
After that, the sum of probabilities for N >= 1 will start falling to 0 as T approaches infinity, and the value of N that has the highest probability of being the number of exact matches will go from 5 down to 1 and eventually to 0 as T goes from 5 to infinity (that is, with a large enough T the most likely number of matches will be 0).
This post has been edited by md65536: 14 December 2011 - 07:27 PM
- Posts: 790 | Joined: 28-July 10
Reply
#17 14 December 2011 - 10:04 PM
md65536, on 14 December 2011 - 07:25 PM, said:
You're right. I fixed T=10 and X=Y=5, and varied N instead. This yields:
N p
1 0.099206
2 0.396825
3 0.396825
4 0.099206
5 0.003968
So maybe we're OK after all. The only way to check is for me to go ahead and write my brute force program which will calculate these probabilities by using a random number generator, without using the formula, and see if I get the same values as above. I'll be back.
- Posts: 13 | Joined: 03-December 11
Reply
#18 14 December 2011 - 10:19 PM
beaumontfletcher, on 14 December 2011 - 10:04 PM, said:
1 0.099206
2 0.396825
3 0.396825
4 0.099206
5 0.003968
The symmetry in probabilities also makes sense, because if you chose Y from T and are left with (T \ Y) with 5 members, it is equivalent to choosing (T \ Y) first: taking 5 members away from T and leave the remaining ones as a set called Y.
If exactly 2 members of X match with Y, then the rest (exactly 3) must match with (T \ Y).
If Y and (T \ Y) are symmetrical or whatever, the probability of matching 2 in one and 3 in the other must equal the probability of matching 3 in one and 2 in the other.
Looks promising!
- Posts: 790 | Joined: 28-July 10
Reply
#19 15 December 2011 - 05:00 PM
- Posts: 13 | Joined: 03-December 11
Reply
#20 15 December 2011 - 07:10 PM
beaumontfletcher, on 15 December 2011 - 05:00 PM, said:
If you change one entry in each row to a 2, indicating where both members match, the number of 1s (where exactly 1 member matches) is 60.
Hmmm... the chart gets me thinking of a different strategy to solve the problem.
Say you already have the set X and T, and you want to construct Y so that exactly N members match. How many ways can you do this?
You would choose N members from X, and then fill out the rest of Y with remaining members from T that aren't in X. The number of ways to do this is "From X choose N" times "From |(T \ X)| choose Y-N".
Then "From T choose Y" is the total number of ways that Y can be chosen.
Then the probability of randomly getting one of the sets where exactly N match (being lazy with the notation) would be...
(x C n)(t-x C y-n) / (t C y)
which expands to... uh I don't want to bother. It might be interesting to expand and compare it to the previous formula.
But LibreOffice has a COMBIN function, so it's easier to keep it as combinations for spreadsheet math!...
Using X=Y=2 and N=1 and T=5, it comes to 2 * 3 / 10 = 0.6
This post has been edited by md65536: 15 December 2011 - 07:26 PM
- Posts: 790 | Joined: 28-July 10
Reply

Help
Sign In »
Register Now!












