  # beaumontfletcher

Members

13

0 Neutral

• Rank
Quark

## Profile Information

• Favorite Area of Science
Maths
1. Success! The formula is confirmed. I wrote my program, which works out probabilities as follows: Given integers T, X, Y and N: 1. Initialise a counter S to 0. 2. Pick X different random integers in the range 1-T. 3. Pick Y different random integers in the range 1-T. 4. Iff the sets picked in steps 2 and 3 have exactly N integers in common, then add 1 to S. 5. Repeat steps 2-4 a million times. 6. The actual probability = (S / 1,000,000). Doing this for a few examples today, I found the actual probability equals the theoretical probability from your formula every time, though sometimes I had to go as high as ten million iterations to get the actual probability to be the same as the theoretical probability to three decimal places. A million iterations usually gives an answer correct to two decimal places. I think we can be sure the formula is correct. I also did some small examples by hand, e.g. for T=6, X=Y=3 and N=2, and got the same answer as formula. Just to recap, the formula is: p = X!Y!(T-X)!(T-Y)! / T!N!(X-N)!(Y-N)!(T-X-Y+N)! Thanks once again for all your help with this over the past two weeks! I was going to say to you that if I get a publishable paper out of this, I will contact you so you can be credited under your real name. I am not one of those people who use other people's work without acknowledgement. That still applies, though for the kind of sets I am interested in (with large T but small N) the probabilities are so close to 0 as to be useless for any real-life work. But I've learnt from doing all this and I hope it's been stimulating for you. Merry Christmas!
2. Doh! Sorry, you're right. I wrote what I wrote, went out and realized while I was out that I had worked out the probability of at least N matches rather than exactly N matches. I expanded your new formula and it comes out the same as the previous one, but I find the new reasoning behind it easier to understand. I will work on my brute force program tomorrow so I can try to calculate these probabilities in an independent way and compare with the formula answers.
3. Sorry but I think I have disproved the formula. I took a small set T = {a,b,c,d,e} and let X=Y=2 and N=1. The matrix below shows all possible combinations of sets X and Y, with a 1 indicating that they have an element in common and 0 indicating that they do not. There are exactly sevens 1's in each row and column, totalling 70. Since there are 100 cells, the probability is 0.7 whereas the formula gives 0.6
4. You're right. I fixed T=10 and X=Y=5, and varied N instead. This yields: N p 1 0.099206 2 0.396825 3 0.396825 4 0.099206 5 0.003968 So maybe we're OK after all. The only way to check is for me to go ahead and write my brute force program which will calculate these probabilities by using a random number generator, without using the formula, and see if I get the same values as above. I'll be back.
5. Hello again. I'm afraid the formula now looks wrong to me, after trying it out in Excel for small values. I set X=Y=5 and N=1 and varied only T. In other words, given a set T, pick two subsets of size 5 and ask what is the probability that they have exactly one element in common. If you use Excel and put T in cell A1, X in cell B1, Y in cell C1 and N in cell D1, then the Excel formula for the probability is: =FACT(B1)*FACT(C1)*FACT(A1-B1)*FACT(A1-C1)/(FACT(A1)*FACT(D1)*FACT(B1-D1)*FACT(C1-D1)*FACT(A1-B1-C1+D1)) The surprising thing is that, at the start, the probability goes up as T goes up (with X, Y and N constant) before starting to fall: T Probability 10 0.099206349 11 0.162337662 12 0.220959596 13 0.271950272 14 0.314685315 15 0.34965035 16 0.377747253 17 0.399967679 18 0.417250233 19 0.430426557 20 0.440208978 21 0.447196422 22 0.451887294 23 0.454694047 24 0.455957086 25 0.455957086 26 0.454925509 27 0.453053388 28 0.450498575 29 0.447391689 30 0.443840961 31 0.439936202 32 0.435752026 33 0.43135049 34 0.42678325 35 0.422093324 So the probability peaks at 24 and 25 before starting to fall. Why does it rise between 10 and 24? It doesn't make sense.
6. I am going to write two software programs, one to calculate these probabilities according to your formula (for big numbers, even Excel can't do factorials). The other program will be to calculate the probability by brute force, i.e. given T, use an industrial-strength random number generator to pick sets X, Y and N, and see if N is common to X and Y. Do this a million times and see what the actual probability comes out as. If this equals the theoretical probability for a varied set of values of X, Y and N, then we can be fairly confident that your formula is correct. I will come back and update the thread when I have done the programs. In the meantime, thanks for all your help!
7. Hello, md65536 (if you're still round). I sat down this weekend and went through your new analysis (for which, thanks again). Firstly, I thought I'd reduce the number of symbols we need by using the same symbol to represent a set and the number of elements in it. So T is both the set of all elements and also the number of those elements. The context should make it obvious which meaning is being used at any point in the analysis. I drew two diagrams, one to show the general case when we have a match, i.e. N is contained in both X and Y; and the general case when we do not have a match, i.e. the set N we picked from X is only partly contained in Y. Using this, I agree with your formula for P2. It is: P2 = (G-F)!(G-E)! / G!(G-F-E)! We already know P1 and XCN. The problem is that we cannot work out the final answer unless we know what F, G and E are. Two of these are easy: G = T - N E = X - N In the match case, obviously F = Y - N. But, as my question mark in the picture shows, we do not know what F is in the non-match case. Without a formula for F, the P2 expression cannot be evaluated... Ah! I think I was wrong to worry about the non-match case, since you are calculating P2 on the premise that we have a match (that is what P1 is for). Therefore, we have F = Y - N. So our final probability is p = P1 * P2 * XCN We know from above that P1 = Y!(T-N)! / T!(Y-N)! From your post, P2 = (G-F)!(G-E)! / G!(G-F-E)! Substituting the expressions for F, G and E we get P2 = (T-Y)!(T-X)! / (T-N)!(T-X-Y+N)! So the final probability: p = Y!(T-N)!(T-Y)!(T-X)!X! / T!(Y-N)!(T-N)!(T-X-Y+N)!N!(X-N)! This simplifies to: p = X!Y!(T-X)!(T-Y)! / T!N!(X-N)!(Y-N)!(T-X-Y+N)! This is at least symmetric in X and Y. I will see how I can check this with some examples.
8. Sorry, I need a fresh mind to go through this, so I will have to wait till the weekend. Thanks again for the new ideas!
9. Thanks again for having a go at this! I hope it's been some fun for you. I am not convinced by the reasoning of removing elements from one set but not the other. I think there must be an error because I think we can show that the formula is wrong by looking at the simple case X=Y=T, i.e. my two subsets are the whole set. In this case, it doesn't matter what N is, the probability has to be 1, because whatever element(s) you pick from one subset are bound to be in the other. But your formula for P2 gives 0 if Y=T, so the final probability also comes out to be 0. As I said, I will try the N=1 case or some other simplification. It's obviously a bit too ambitious to have a go at the general case yet.
10. I think I may try a less general case first, such as N=1 or X=Y, and see if I can work out the formula for that, and hope that it will help me see the general problem in the right way. The second part will have to be done in some such way as the one you suggest, but I haven't found a way to think about that yet.
11. Thanks again for giving time to this. You are right that I had not taken into account that the remaining elements must not match. What I worked out was actually the probability of "at least N" rather than "exactly N" common members. For my purposes, this will do for now. If I find the formula for the probability of at least N matches, it should be straightforward to work out the formula for exactly N. The XCN formula you mention might be correct (I haven't got my head round it yet) but, if so, it means that my formula was wrong, because if I combine yours with mine, the result is demonstrably wrong. What you have written as P * XCN evaluates as follows if I substitute P by my formula given above: ( Y!(T-N)! * X! ) / ( T!(Y-N)! * N! * (X-N)! ) This looks plausible because it is at least symmetrical in X and Y. But if we set N=1 then the formula collapses to XY/T. That's obviously wrong because it will exceed 1 if the two subsets are large enough. So either my formula is wrong or the extension you suggested is wrong. I need to get my head round that!
12. Thanks. I had a similar idea to yours. You're using subsets A and B, having X and Y members respectively. Assume you've picked your first subset and have picked N members from that subset. Now look in the second subset, the one that has Y members. The probability that all N of your members got picked for the second subset as well must be the product of (Y/T), (Y-1 / T-1), (Y-2 / T-2)...and so on, down to (Y-(N-1) / T-(N-1)). This evaluates to: Y!(T-N)! / T!(Y-N)! This formula looks right to me. E.g. if you simplify by setting N=Y, then the formula becomes Y!(T-Y)!/T! which is obviously the probability of picking a given subset of Y members from a set containing T members (i.e. the inverse of the number of ways of choosing Y from T). I think that's what the answer to your first suggestion is. I got a bit stuck thereafter, trying to tie this probability with the probability of those N members being chosen from the first subset, the one that contains X members. Unless someone hands me a solution, I'm going to try to find some textbook which discusses problems like this. I doubt it will have the same problem exactly, but it should help me get unstuck.
13. I have a problem which I will first state formally and then give some real-life context. Any help appreciated! Suppose I have a set with T members. I randomly choose a subset with X members. Independently of that, someone else randomly chooses a subset with Y members. Without loss of generality, assume X <= Y. Given a number N (N <= X), what is the probability that the two subsets have exactly N members in common? Now for the real-life context. Historians often find multiple copies of very old manuscripts but don't know how they came into existence. One technique they use is to look for common mistakes. If two documents have exactly the same mistake, then the suspicion arises that one was copied from the other. The more common mistakes there are, the more certain it becomes that one document was copied from the other, because it would be too much of a coincidence for the same mistakes to get made independently. I am trying to do the maths behind all this but struggling. I thought someone might enjoy this as a probability exercise.
×