beaumontfletcher 0 Posted December 3, 2011 I have a problem which I will first state formally and then give some real-life context. Any help appreciated! Suppose I have a set with T members. I randomly choose a subset with X members. Independently of that, someone else randomly chooses a subset with Y members. Without loss of generality, assume X <= Y. Given a number N (N <= X), what is the probability that the two subsets have exactly N members in common? Now for the real-life context. Historians often find multiple copies of very old manuscripts but don't know how they came into existence. One technique they use is to look for common mistakes. If two documents have exactly the same mistake, then the suspicion arises that one was copied from the other. The more common mistakes there are, the more certain it becomes that one document was copied from the other, because it would be too much of a coincidence for the same mistakes to get made independently. I am trying to do the maths behind all this but struggling. I thought someone might enjoy this as a probability exercise. 0 Share this post Link to post Share on other sites

md65536 373 Posted December 4, 2011 (edited) I have a problem which I will first state formally and then give some real-life context. Any help appreciated! Suppose I have a set with T members. I randomly choose a subset with X members. Independently of that, someone else randomly chooses a subset with Y members. Without loss of generality, assume X <= Y. Given a number N (N <= X), what is the probability that the two subsets have exactly N members in common? I would try this: Suppose you have your set A, and the other has their set B. For an arbitrary subset of A, with N members, what is the chance that each of the subset's members is in B? Also what is the chance, assuming that each of the subset's members is in B, that each of the remaining members in A is not in B? Then use combination to find the number of ways to choose a subset of A with N members. Without filling in the details, I don't know if this work properly and simply enough. Is it enough to inspire a solution??? Edited December 4, 2011 by md65536 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 4, 2011 (edited) Thanks. I had a similar idea to yours. You're using subsets A and B, having X and Y members respectively. Assume you've picked your first subset and have picked N members from that subset. Now look in the second subset, the one that has Y members. The probability that all N of your members got picked for the second subset as well must be the product of (Y/T), (Y-1 / T-1), (Y-2 / T-2)...and so on, down to (Y-(N-1) / T-(N-1)). This evaluates to: Y!(T-N)! / T!(Y-N)! This formula looks right to me. E.g. if you simplify by setting N=Y, then the formula becomes Y!(T-Y)!/T! which is obviously the probability of picking a given subset of Y members from a set containing T members (i.e. the inverse of the number of ways of choosing Y from T). I think that's what the answer to your first suggestion is. I got a bit stuck thereafter, trying to tie this probability with the probability of those N members being chosen from the first subset, the one that contains X members. Unless someone hands me a solution, I'm going to try to find some textbook which discusses problems like this. I doubt it will have the same problem exactly, but it should help me get unstuck. Edited December 4, 2011 by beaumontfletcher 0 Share this post Link to post Share on other sites

md65536 373 Posted December 4, 2011 I think that's what the answer to your first suggestion is. I got a bit stuck thereafter, trying to tie this probability with the probability of those N members being chosen from the first subset, the one that contains X members. I haven't worked through the math or logic of the first part, but I'll go on to this part. The first part deals with the probability P of an arbitrary selection of N elements matching in sets A and B (and importantly if you haven't done it*, it must also include the probability that the remain elements don't match). http://en.wikipedia.org/wiki/Combination _{X }C_{ N} ("X choose N") tells you the number of different ways of choosing N elements from a set of X members. Each of these possible subsets are similar and each has a probability of P of matching exclusively to sets A and B. Each of these possible subsets is equally likely to be chosen as an "arbitrary selection of N elements". Each of these possible subsets are unique, so the possibility of choosing one vs another is mutually exclusive. That means that you can just add up the probabilities of any one of them matching (which is P), for each of the possibilities of actually choosing that group. So the final answer should be: P * (_{X }C_{ N}) If it's confusing, consider it to be rephrasing the question to be: "What is the probability that this group of N elements from A matches exclusively with B, or that this other group of N elements does, or that this other group does (and so on for each possible group of N)?" * Glancing at your solution to the first part, I'm guessing that it doesn't take this into account? If you calculate the probability that all N elements match, but leave open the possibility that additional elements will match, then the probabilities of one combination matching vs another are not mutually exclusive or whatever, and the second part won't work. 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 5, 2011 Thanks again for giving time to this. You are right that I had not taken into account that the remaining elements must not match. What I worked out was actually the probability of "at least N" rather than "exactly N" common members. For my purposes, this will do for now. If I find the formula for the probability of at least N matches, it should be straightforward to work out the formula for exactly N. The _{X}C_{N} formula you mention might be correct (I haven't got my head round it yet) but, if so, it means that my formula was wrong, because if I combine yours with mine, the result is demonstrably wrong. What you have written as P * _{X}C_{N} evaluates as follows if I substitute P by my formula given above: ( Y!(T-N)! * X! ) / ( T!(Y-N)! * N! * (X-N)! ) This looks plausible because it is at least symmetrical in X and Y. But if we set N=1 then the formula collapses to XY/T. That's obviously wrong because it will exceed 1 if the two subsets are large enough. So either my formula is wrong or the extension you suggested is wrong. I need to get my head round that! 0 Share this post Link to post Share on other sites

md65536 373 Posted December 5, 2011 (edited) So either my formula is wrong or the extension you suggested is wrong. I need to get my head round that! I think that using the number of combinations and simply multiplying by that, assumes that only one possible combination can work. Otherwise, the probability of one combination working can overlap with the probability of another combination working, so you can't just add their probabilities together. As an example, say that X = Y = T = 10, and we choose N = 5. Obviously, all 10 members in your chosen set (A) match all 10 members in the other's chosen set (B). Probability is 1. Any subset of 5 that we choose will also match (but not exclusively; the remaining 5 will also match). 10 C 5 is 252, so if we just add up the probabilities that each possible set of 5 matches, then we get a probability of 252, which is wrong. My formula is wrong for doing it this way. If your formula is right, mine won't work with it. However, with the original question of matching only exactly 5, if say X = Y = 10 but T = 100, and say that the intersection of X and Y has 5 members... then there is only 1 possible combination of 5 members from X that matches exclusively with 5 members of Y. The chance that one combination matches excludes the chance that some other combination also matches. I haven't wrapped my head around the first part of the puzzle enough to know whether either of our methods seems right or not. That is... my strategy is to find the probability that some specific (but generic) set of size N will match exclusively, and then multiply by all the possible ways you can choose that set. If you can find the probability that any set of size N will match exclusively, then you don't have to worry about the number of ways you can choose such a set. Edited December 5, 2011 by md65536 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 6, 2011 I think I may try a less general case first, such as N=1 or X=Y, and see if I can work out the formula for that, and hope that it will help me see the general problem in the right way. The second part will have to be done in some such way as the one you suggest, but I haven't found a way to think about that yet. 0 Share this post Link to post Share on other sites

md65536 373 Posted December 6, 2011 I think I may try a less general case first, such as N=1 or X=Y, and see if I can work out the formula for that, and hope that it will help me see the general problem in the right way. The second part will have to be done in some such way as the one you suggest, but I haven't found a way to think about that yet. So, out of curiosity I tried working through the first part and I get the same answer you did. I should warn you that I'm not a mathemagician! ALSO NOTE: At the end of writing this I ran into a problem that puts the math for "P2" below in doubt. If you can follow my reasoning below you might be able to figure out if it's okay or if not, maybe you can see the right way to do it. I'll probably look at this again later and see if I can figure out the problem, but it might be beyond me!!! First part: Consider the first N elements of set A. The chance that the first element is in Y is: Y / T. The second (Y-1)/(T-1), etc. The product of all N is: P1 = Y!/(Y-N)! / [T!/(T-N)!] This is the same answer that you came up with. So while we were doing this we removed matching elements from B and the set that had T elements so that they didn't affect the probability of matching subsequent elements. We're left with (X-N) elements that we want to make sure are not in a set of (Y-N) elements, which can be chosen randomly from a set of (T-N) elements. Sorry for the mess of notation but let's call C the matching subset with N members. The chance that the first element in (A \ C) IS in (B \ C) would be: (Y-N)/(T-N) So the chance that it's not is 1 - (Y-N)/(T-N) Now, we don't have to remove any elements from (B \ C), because we didn't match any! I'm not so sure if we should remove it from the "main set" that has (T-N) elements left in it. But I think that we should! So the chance that the second element in (A \ C) is NOT in (B \ C) is 1 - (Y-N)/(T-N-1) We do this for all (X-N) elements... ... exceeding my math abilities... We get the product P2 = [1 - (Y-N)/(T-N)][1 - (Y-N)/(T-N-1)] ... [1 - (Y-N)/(T-N-(X-N-1)] I don't know how to simplify that. There's gotta be a simpler expression! Finally X C N = X! / (N! (X-N)!) Then the final answer would be P1*P2*(X C N) Note: If (X-N)+(Y-N) > (T-N) then P2 must be 0, because there is no way to divide the "leftover" elements of the full set between sets A and B, with no overlap between them. If we allow such a case, then one of the products in P2 should be 0, but you could also end up dividing by 0. Hrm..................... 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 6, 2011 (edited) Thanks again for having a go at this! I hope it's been some fun for you. I am not convinced by the reasoning of removing elements from one set but not the other. I think there must be an error because I think we can show that the formula is wrong by looking at the simple case X=Y=T, i.e. my two subsets are the whole set. In this case, it doesn't matter what N is, the probability has to be 1, because whatever element(s) you pick from one subset are bound to be in the other. But your formula for P2 gives 0 if Y=T, so the final probability also comes out to be 0. As I said, I will try the N=1 case or some other simplification. It's obviously a bit too ambitious to have a go at the general case yet. Edited December 6, 2011 by beaumontfletcher 0 Share this post Link to post Share on other sites

md65536 373 Posted December 7, 2011 (edited) Thanks again for having a go at this! I hope it's been some fun for you. I am not convinced by the reasoning of removing elements from one set but not the other. I think there must be an error because I think we can show that the formula is wrong by looking at the simple case X=Y=T, i.e. my two subsets are the whole set. In this case, it doesn't matter what N is, the probability has to be 1, because whatever element(s) you pick from one subset are bound to be in the other. But your formula for P2 gives 0 if Y=T, so the final probability also comes out to be 0. As I said, I will try the N=1 case or some other simplification. It's obviously a bit too ambitious to have a go at the general case yet. Well, it's an interesting puzzle and I think I must have OCD. In the case that X=Y=T, the probability that N elements in X match exclusively with N elements in Y is 1 iff N = X, otherwise the probability is 0. I think it's okay for the formulae to fail in impossible cases, and usually you exclude impossible cases in the description of the problem (because we'll already know that it's impossible if N is too low relative to T, etc). The reasoning is that "This formula only works when there is enough of the T elements to be split up among A and B so that after the N elements are matched, there are still enough unique elements to fill out A and B." Otherwise if it's impossible to do so the probability is 0. So we can say "For X-N+Y-N > T-N, the probability is 0, otherwise it's... (some formula)". But I must have been on crack in trying to figure out the "P2" part in my last post. So I'll give it another go! Basically, in the P1 part we match the first (in an arbitrarily chosen order) N elements in A with elements in B. We can remove this subset C of matching elements from A, B, and the full set with T elements. We're left with 3 smaller sets and we want to find the probability that none of the elements of the first are in the second, where both sets are subsets of the 3rd. So let's look only at the P2 part with some new names. Let's call the sets E, F, and G, with size e,f,g respectively, where E = A \ C, F = B \ C, and G = {the full set with T elements} \ C. We want the probability that no members of E are in F. I tried to do it one way above and the math didn't come out nice. But I realized that the problem is equivalent to finding the probability that all the members of E are in the remainder of G after removing F. That is, if all members in E are in G \ F, then no members of E can be in F. So like with figuring out P1, the chance that the first element in E is in G \ F is: (g-f)/g And for the next element is (g-f-1)/(g-1) etc for all e members... with a product of P2 = (g-f)!/(g-f-e)! / (g!/(g-e)!) I just checked an example in a spreadsheet and I get the same value as the convoluted math of my last post! Note that e=X-N, f=Y-N, g=T-N So I think the P1*P2*(_{X}C_{N}) formula is now something manageable, when it's all put together! Edited December 7, 2011 by md65536 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 8, 2011 Sorry, I need a fresh mind to go through this, so I will have to wait till the weekend. Thanks again for the new ideas! 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 11, 2011 (edited) Hello, md65536 (if you're still round). I sat down this weekend and went through your new analysis (for which, thanks again). Firstly, I thought I'd reduce the number of symbols we need by using the same symbol to represent a set and the number of elements in it. So T is both the set of all elements and also the number of those elements. The context should make it obvious which meaning is being used at any point in the analysis. I drew two diagrams, one to show the general case when we have a match, i.e. N is contained in both X and Y; and the general case when we do not have a match, i.e. the set N we picked from X is only partly contained in Y. Using this, I agree with your formula for P2. It is: P2 = (G-F)!(G-E)! / G!(G-F-E)! We already know P1 and _{X}C_{N}. The problem is that we cannot work out the final answer unless we know what F, G and E are. Two of these are easy: G = T - N E = X - N In the match case, obviously F = Y - N. But, as my question mark in the picture shows, we do not know what F is in the non-match case. Without a formula for F, the P2 expression cannot be evaluated... Ah! I think I was wrong to worry about the non-match case, since you are calculating P2 on the premise that we have a match (that is what P1 is for). Therefore, we have F = Y - N. So our final probability is p = P1 * P2 * _{X}C_{N} We know from above that P1 = Y!(T-N)! / T!(Y-N)! From your post, P2 = (G-F)!(G-E)! / G!(G-F-E)! Substituting the expressions for F, G and E we get P2 = (T-Y)!(T-X)! / (T-N)!(T-X-Y+N)! So the final probability: p = Y!(T-N)!(T-Y)!(T-X)!X! / T!(Y-N)!(T-N)!(T-X-Y+N)!N!(X-N)! This simplifies to: p = X!Y!(T-X)!(T-Y)! / T!N!(X-N)!(Y-N)!(T-X-Y+N)! This is at least symmetric in X and Y. I will see how I can check this with some examples. Edited December 11, 2011 by beaumontfletcher 0 Share this post Link to post Share on other sites

md65536 373 Posted December 11, 2011 Ah! I think I was wrong to worry about the non-match case, since you are calculating P2 on the premise that we have a match (that is what P1 is for). Yeah! I think that basically when you're multiplying probabilities, including with all the factorial factors, you're essentially saying "What is the probability that this first thing happens, and assuming that it does, what is the probability that this second thing also happens, and assuming that it does, what is the probability that this third thing also happens, etc..." If you ever get a probability factor of 0 for any of the individual factors, then the whole thing comes out to 0. BUT ALSO, if the probability that "the second thing" is 0, then you can't assume that it happens when looking at the probability of the third thing. That is why the formula can fail in impossible cases, because it assumes impossible things. So you would state the domain of the function, and either state that you assume the problem is within the domain (ie assume that T is big enough) or you would handle the other cases separately. I think the reasoning is good; I hope the math works out! I haven't tried putting it all together myself, but I usually end up making mistakes that need correcting. 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 11, 2011 I am going to write two software programs, one to calculate these probabilities according to your formula (for big numbers, even Excel can't do factorials). The other program will be to calculate the probability by brute force, i.e. given T, use an industrial-strength random number generator to pick sets X, Y and N, and see if N is common to X and Y. Do this a million times and see what the actual probability comes out as. If this equals the theoretical probability for a varied set of values of X, Y and N, then we can be fairly confident that your formula is correct. I will come back and update the thread when I have done the programs. In the meantime, thanks for all your help! 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 14, 2011 Hello again. I'm afraid the formula now looks wrong to me, after trying it out in Excel for small values. I set X=Y=5 and N=1 and varied only T. In other words, given a set T, pick two subsets of size 5 and ask what is the probability that they have exactly one element in common. If you use Excel and put T in cell A1, X in cell B1, Y in cell C1 and N in cell D1, then the Excel formula for the probability is: =FACT(B1)*FACT(C1)*FACT(A1-B1)*FACT(A1-C1)/(FACT(A1)*FACT(D1)*FACT(B1-D1)*FACT(C1-D1)*FACT(A1-B1-C1+D1)) The surprising thing is that, at the start, the probability goes up as T goes up (with X, Y and N constant) before starting to fall: T Probability 10 0.099206349 11 0.162337662 12 0.220959596 13 0.271950272 14 0.314685315 15 0.34965035 16 0.377747253 17 0.399967679 18 0.417250233 19 0.430426557 20 0.440208978 21 0.447196422 22 0.451887294 23 0.454694047 24 0.455957086 25 0.455957086 26 0.454925509 27 0.453053388 28 0.450498575 29 0.447391689 30 0.443840961 31 0.439936202 32 0.435752026 33 0.43135049 34 0.42678325 35 0.422093324 So the probability peaks at 24 and 25 before starting to fall. Why does it rise between 10 and 24? It doesn't make sense. 0 Share this post Link to post Share on other sites

md65536 373 Posted December 14, 2011 (edited) So the probability peaks at 24 and 25 before starting to fall. Why does it rise between 10 and 24? It doesn't make sense. My guess is that for small values of T, you're much more likely to match 2 or 3 or more, than just 1. For T=5, the probability is 1 for N=5. For T=9, the sum of the probabilities for N=[1,5] will be 1. After that, the sum of probabilities for N >= 1 will start falling to 0 as T approaches infinity, and the value of N that has the highest probability of being the number of exact matches will go from 5 down to 1 and eventually to 0 as T goes from 5 to infinity (that is, with a large enough T the most likely number of matches will be 0). Edited December 14, 2011 by md65536 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 14, 2011 My guess is that for small values of T, you're much more likely to match 2 or 3 or more, than just 1. You're right. I fixed T=10 and X=Y=5, and varied N instead. This yields: N p 1 0.099206 2 0.396825 3 0.396825 4 0.099206 5 0.003968 So maybe we're OK after all. The only way to check is for me to go ahead and write my brute force program which will calculate these probabilities by using a random number generator, without using the formula, and see if I get the same values as above. I'll be back. 0 Share this post Link to post Share on other sites

md65536 373 Posted December 14, 2011 N p 1 0.099206 2 0.396825 3 0.396825 4 0.099206 5 0.003968 The symmetry in probabilities also makes sense, because if you chose Y from T and are left with (T \ Y) with 5 members, it is equivalent to choosing (T \ Y) first: taking 5 members away from T and leave the remaining ones as a set called Y. If exactly 2 members of X match with Y, then the rest (exactly 3) must match with (T \ Y). If Y and (T \ Y) are symmetrical or whatever, the probability of matching 2 in one and 3 in the other must equal the probability of matching 3 in one and 2 in the other. Looks promising! 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 15, 2011 Sorry but I think I have disproved the formula. I took a small set T = {a,b,c,d,e} and let X=Y=2 and N=1. The matrix below shows all possible combinations of sets X and Y, with a 1 indicating that they have an element in common and 0 indicating that they do not. There are exactly sevens 1's in each row and column, totalling 70. Since there are 100 cells, the probability is 0.7 whereas the formula gives 0.6 0 Share this post Link to post Share on other sites

md65536 373 Posted December 15, 2011 (edited) Sorry but I think I have disproved the formula. I took a small set T = {a,b,c,d,e} and let X=Y=2 and N=1. The matrix below shows all possible combinations of sets X and Y, with a 1 indicating that they have an element in common and 0 indicating that they do not. There are exactly sevens 1's in each row and column, totalling 70. Since there are 100 cells, the probability is 0.7 whereas the formula gives 0.6 If you change one entry in each row to a 2, indicating where both members match, the number of 1s (where exactly 1 member matches) is 60. Hmmm... the chart gets me thinking of a different strategy to solve the problem. Say you already have the set X and T, and you want to construct Y so that exactly N members match. How many ways can you do this? You would choose N members from X, and then fill out the rest of Y with remaining members from T that aren't in X. The number of ways to do this is "From X choose N" times "From |(T \ X)| choose Y-N". Then "From T choose Y" is the total number of ways that Y can be chosen. Then the probability of randomly getting one of the sets where exactly N match (being lazy with the notation) would be... (x C n)(t-x C y-n) / (t C y) which expands to... uh I don't want to bother. It might be interesting to expand and compare it to the previous formula. But LibreOffice has a COMBIN function, so it's easier to keep it as combinations for spreadsheet math!... Using X=Y=2 and N=1 and T=5, it comes to 2 * 3 / 10 = 0.6 Edited December 15, 2011 by md65536 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 15, 2011 Doh! Sorry, you're right. I wrote what I wrote, went out and realized while I was out that I had worked out the probability of at least N matches rather than exactly N matches. I expanded your new formula and it comes out the same as the previous one, but I find the new reasoning behind it easier to understand. I will work on my brute force program tomorrow so I can try to calculate these probabilities in an independent way and compare with the formula answers. 0 Share this post Link to post Share on other sites

md65536 373 Posted December 15, 2011 (edited) Doh! Sorry, you're right. I wrote what I wrote, went out and realized while I was out that I had worked out the probability of at least N matches rather than exactly N matches. I expanded your new formula and it comes out the same as the previous one, but I find the new reasoning behind it easier to understand. I will work on my brute force program tomorrow so I can try to calculate these probabilities in an independent way and compare with the formula answers. No worries. Figuring out the problems is what's interesting. I actually expected the formula to come out different!, that it would be simpler with some obscure simplification. Working with combinations instead of factorials should be a lot easier (programmatically or spreadsheet-ically) because you can have large values yet avoid calculating the full factorials. Edit: Actually I guess the same type of shortcuts could be made with either. Edited December 15, 2011 by md65536 0 Share this post Link to post Share on other sites

beaumontfletcher 0 Posted December 16, 2011 Success! The formula is confirmed. I wrote my program, which works out probabilities as follows: Given integers T, X, Y and N: 1. Initialise a counter S to 0. 2. Pick X different random integers in the range 1-T. 3. Pick Y different random integers in the range 1-T. 4. Iff the sets picked in steps 2 and 3 have exactly N integers in common, then add 1 to S. 5. Repeat steps 2-4 a million times. 6. The actual probability = (S / 1,000,000). Doing this for a few examples today, I found the actual probability equals the theoretical probability from your formula every time, though sometimes I had to go as high as ten million iterations to get the actual probability to be the same as the theoretical probability to three decimal places. A million iterations usually gives an answer correct to two decimal places. I think we can be sure the formula is correct. I also did some small examples by hand, e.g. for T=6, X=Y=3 and N=2, and got the same answer as formula. Just to recap, the formula is: p = X!Y!(T-X)!(T-Y)! / T!N!(X-N)!(Y-N)!(T-X-Y+N)! Thanks once again for all your help with this over the past two weeks! I was going to say to you that if I get a publishable paper out of this, I will contact you so you can be credited under your real name. I am not one of those people who use other people's work without acknowledgement. That still applies, though for the kind of sets I am interested in (with large T but small N) the probabilities are so close to 0 as to be useless for any real-life work. But I've learnt from doing all this and I hope it's been stimulating for you. Merry Christmas! 0 Share this post Link to post Share on other sites