Jump to content

Need help with code (to calculate mutual information)


ecoli

Recommended Posts

Hi all, I'm writing this code in R and it should be relative straight forward, but I keep running into negative numbers, which MI should NOT have! Not sure if the code is buggy or if I'm getting the math wrong. Am I correct in thinking there should be 1 value for MI between each number in an associated list?

 

First the pseudocode:

 

1 initialize parameters
2 Generate 2 lists of random numbers btw 0,1 called x and y
3 cut number list into bins
4 create frequency table for number of counts in each bin
5 create function to calculate marginal distribution of binned x
       for every x
            find the frequency of that value in x
            save as ki
Do the same for y and call it kj
6 create function to calculate joint distribution of x and y
  a. generate frequency table for every x and y combo
     for every x and associated y
          find the frequency of that combination in the data
          call it kij
7 calculate mutual information (mi) 
mi = 0
for each value taken by x
{  sum = 0
  for each value taken by y:
  {  kij = find freq where x and y are associated
     ki = find freq of x
     kj = find freq of y
     sum = sum + kij * log2(kij/(ki*kj))
  }
  mi = mi + sum
}

 

 

Here's the actual code in R

 

[font="Courier New"]#calculate MI of random variables. Notation is taken from Steuer, et al. 2002

#Set sample size & bin number
N <- 30
bins <- 2


#Generate N random "data" points for x and y in the space [0,1]
x <- runif(N, 0, 1)
y <- runif(N, 0, 1)

#Cut x and y into bins
xcut <- cut(x, bins)     #ai
ycut <- cut(y, bins)     #bj

#create frequency table for counts per bin
xfreq <- data.frame(table(xcut))  #table for kis
yfreq <- data.frame(table(ycut))  #table for kjs

#script for reading freq table into kis/kjs
#b/c cutting data into bins leads to weird character types
ki.find <- function(n) {        #### n <- xcut[i] 
   for (k in 1:bins) {
       if (as.character(n) == xfreq$xcut[k]) {
           ki <- xfreq$Freq[k]
           return(ki)
       }
    }
}            

kj.find <- function(m) {        #### n <- ycut[i] 
   for (l in 1:bins) {
       if (as.character(m) == yfreq$ycut[l]) {
           kj <- yfreq$Freq[l]
           return(kj)
       }
   }
}      

#kij is the joint prob distribution, 
#AKA freqency that 2 bins are associated at a (x,y) point 
cut.table <- as.data.frame(table(xcut,ycut))

kij.find <- function(p) {
   for (g in 1:(2*bins)) {
      if (as.character(xcut[p]) == cut.table$xcut[g] && as.character(ycut[p]) == cut.table$ycut[g]) {
           kij <- cut.table$Freq[g]
       }    
   }
   return(kij)
}        

#start with assumption of no MI
#then calculate MI

MI <- 0
for (i in 1:N) {
   sum.temp <- 0
   for (j in 1:N) {
       ki <- ki.find(xcut[j])
       kj <- kj.find(ycut[j])
       kij <- kij.find(j)
       sum.temp <-  sum.temp + (kij * log2((kij/(ki*kj))))
       sum.temp2 <- (sum.temp/N)
       }  
    MI <- c(MI, log2(N) + sum.temp2)   
}         
[/font]

 

sample output is:

> MI
[1]  82.288187  44.607130  36.576861  28.348042  20.119224  11.890405
[7]   3.661586  -4.567232 -12.796051 -21.024870 -29.253688

 

Please help me!!!

Link to comment
Share on other sites

I think if you adjust your last bit of code as follows:

 


MI <- 0
MI.run <- function() {
   for (i in 1:N) {
       sum.temp <- 0
       for (j in 1:N) {
           ki <- ki.find(xcut[j])
           kj <- kj.find(ycut[j])
           kij <- kij.find(j)
           sum.temp <-  sum.temp + (kij * log2((kij/(ki*kj))))
           sum.temp2 <- (sum.temp/N)
           print(ki)
           print(kj)
           print(kij)
           print(sum.temp)
           print(sum.temp2) }  
        MI <- c(MI, log2(N) + sum.temp2)   
   }
}

 

and maybe set N<-5 you will quickly see why kij * log2((kij/(ki*kj))) is giving you a negative number :/

Link to comment
Share on other sites

I think if you adjust your last bit of code as follows:

 

and maybe set N<-5 you will quickly see why kij * log2((kij/(ki*kj))) is giving you a negative number :/

 

So it's because i'm taking a log of a fraction < 1, which I guess means i'm doing the calculation incorrectly.

Link to comment
Share on other sites

You're applying a variation of Mutual Information that involves discrete variables and you are using continuous. Your cut is being made with no break and as such your bins are not all formed the same so if you are looking for the members of two sets that intersect but you are using a top half bottom half index to decide membership there might be some inconsistency. I wasn't really sure what "#b/c cutting data into bins leads to weird character types" meant exactly. Do you have a link/reference to this "#calculate MI of random variables. Notation is taken from Steuer, et al. 2002"?

Link to comment
Share on other sites

You're applying a variation of Mutual Information that involves discrete variables and you are using continuous. Your cut is being made with no break and as such your bins are not all formed the same so if you are looking for the members of two sets that intersect but you are using a top half bottom half index to decide membership there might be some inconsistency. I wasn't really sure what "#b/c cutting data into bins leads to weird character types" meant exactly. Do you have a link/reference to this "#calculate MI of random variables. Notation is taken from Steuer, et al. 2002"?

 

http://bioinformatics.oxfordjournals.org/content/18/suppl_2/S231.full.pdf

 

I'm not sure that it matters the bins are not formed the same. For example if your comparing two lists of different types the only thing that should matter is the probability distribution, not the exact numbers.

 

In R, the cut function creates a list that looks something that replaces an exact number with a range, like (0.43,0.616], which is not numeric type. I can create a frequency table that counts how many times each range is present in the list.

I have to do a character match to find the associated frequency. in the table, which I think is the marginal distribution (divided by N).

 

The probability of the intersection is the frequency that both ranges appear at the same time in both lists. Unless I've misunderstood.

 

Thanks a bunch for your help so far, btw.

Link to comment
Share on other sites

After reading the paper the idea has formed much clearer in my head. The entire purpose of binning, as you have been, is to discretize the continuous function(which I'm sure you already knew :) ) so that a discrete model may be applied. What I had said before becomes even more important in that you will only have discrete values if the bins form uniformly. In case you may have thought they were uniform, having directly observed these values myself they are in fact not(sometimes.) This requires a partitioning function as described in the three sections following 'NUMERICAL ESTIMATION' and this would be applied to the break value and correspondingly to the number of breaks produced.

 

Thanks for the paper, I think I might play with this for a little while!

 

break: either a vector of cut points or number giving the number of intervals which x is to be cut into.

Edited by Xittenn
Link to comment
Share on other sites

After reading the paper the idea has formed much clearer in my head. The entire purpose of binning, as you have been, is to discretize the continuous function(which I'm sure you already knew :) ) so that a discrete model may be applied. What I had said before becomes even more important in that you will only have discrete values if the bins form uniformly. In case you may have thought they were uniform, having directly observed these values myself they are in fact not(sometimes.) This requires a partitioning function as described in the three sections following 'NUMERICAL ESTIMATION' and this would be applied to the break value and correspondingly to the number of breaks produced.

 

Thanks for the paper, I think I might play with this for a little while!

 

break: either a vector of cut points or number giving the number of intervals which x is to be cut into.

 

ah shit. I hope this is the major problem I've been having. I'm looking for an easy solution in R to cut a vector into equal bins. Histogram does it, but only to plot data. weird.

Link to comment
Share on other sites

[math] k_i \neq freq_x \; k_i = \frac{freq_x}{N} [/math]

 

Correct me if I'm wrong but this is in fact the issue, if you hadn't already deduced it. I had already, yesterday, but I wasn't sure what you were doing.

Link to comment
Share on other sites

[math] k_i \neq freq_x \; k_i = \frac{freq_x}{N} [/math]

 

Correct me if I'm wrong but this is in fact the issue, if you hadn't already deduced it. I had already, yesterday, but I wasn't sure what you were doing.

 

Then is equation 17 in the paper incorrect?

Link to comment
Share on other sites

or better yet

 

[math] k_i = freq_x \land p(a) = \frac{k_i}{N} [/math]

 

which you weren't doing!

 

sorry i think we crossposted, but this just doesn't seem like what's going on in equation 17. He's taken N out of each k term, but adds it into the MI calculation separately.

Link to comment
Share on other sites

It isn't the same, I think multiple forms are being presented and you are equating them to being the same. Your code looks like it is of the form that is taken from wolfram which is not the form of 17 and where certain definitions seem to have taken different meaning on account of various mathematical assumptions.

 

My approach on this would be to start back at the beginning from scratch and build up again and to this point of knowledge. Sometimes it takes less time to figure out a problem if you throw away everything you think you know and start from square one. This is just my humble opinion however.

 

Sorry I can't give you a better explanation, I'm just trying this out for the first time myself. I took interest because it is something I was already working on, but in different areas.

Edited by Xittenn
Link to comment
Share on other sites

It isn't the same, I think multiple forms are being presented and you are equating them to being the same. Your code looks like it is of the form that is taken from wolfram which is not the form of 17 and where certain definitions seem to have taken different meaning on account of various mathematical assumptions.

 

My approach on this would be to start back at the beginning from scratch and build up again and to this point of knowledge. Sometimes it takes less time to figure out a problem if you throw away everything you think you know and start from square one. This is just my humble opinion however.

 

Sorry I can't give you a better explanation, I'm just trying this out for the first time myself. I took interest because it is something I was already working on, but in different areas.

 

 

Ahh. damn I see what's going on a bit better now, and rebinning into equal portions definitely helped.

Link to comment
Share on other sites

>cross posting there<

 

Yeah sorry, I see where I went off track .... :/

Yeah but you are right that he's estimating the probability distribution differently because he's using a histogram (fixed that problem, btw, my data points are being split up into equal bins now). That assumption probably won't work with real data though.

Link to comment
Share on other sites

I have two code samples, one fulfills Kullback Entropy equivalent of Mutual Information and the other fulfills the histogram model. The results of each leave me asking the value of the information provided by each value for which each model results.

 

The problem I am seeing with your original post is again that you have mixed the two concepts together. You say 'fixed' I'm curious how and if we have come to the same conclusion. I have concluded that

 

for (i in 1:N) {   <------  ******
   sum.temp <- 0
   for (j in 1:N) {   <------  ******
       ki <- ki.find(xcut[j])   <------  ******
       kj <- kj.find(ycut[j])

 

was the cause of the problem and that in fact kij * log2((kij/(ki*kj))) should be negative and that the sum of ij kij * log2((kij/(ki*kj))) should never be greater than log(N). You should be stepping through the bin divisions not the samples .... :/

 

note the little note on S233

 

and modify this appropriately

 


if ( ( as.character( ycut[i] ) == cut.table$xcut[g] ) && ( as.character( ycut[j] ) == cut.table$ycut[g] ) ) {

Edited by Xittenn
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.