
How to use machine learning for this problem.


Ernst


Recently I have been trying to understand machine learning. I have a problem that machine learning might be able to solve.

 

In principle this is my problem (in the real case there are more than 10 individuals in each set).

Every day I get 9 data sets that look like this:

 

1 4 9 3 2 0

2 1 4 6 8 1

3 2 1 4 3 0

4 10 3 1 5 0

5 3 2 5 1 0

6 5 6 8 7 0

.

.

9 . . . . 0

10 . . . . 0

 

 

The first column is always identical across the sets; let's call its values identities. For the 9 data sets I want a solution that predicts the positive outcome (1) in column 6. In this single data set, row 2 is the one of interest.

Columns 2 to 5 can be regarded as ranks.

 

 

9 sets give 10 ^ 9 possible combinations and only one is correct. 1 / 10 ^ 9 is a small number.

 

I could build a logistic classification model, but I think that would give a poor solution.

 

On the other hand, there are very strong indications that in nearly every combination of the 9 data sets, 4 to 7 of the 9 positive outcomes have a value between 1 and 3 in the first column. A typical result, listing the column 1 value of the row with a 1 in column 6 for each of the 9 sets, would be:

 

(2, 7, 1, 3, 2, 6, 1, 5, 3), i.e. 6 individuals in the interval [1, 3].
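
The "4-7 knowledge" can be written down as a simple check, and the number of combinations that satisfy it can be counted directly. The following is a minimal sketch, assuming 9 sets, 10 individuals per set and the interval [1, 3] as described above; the function names `low_count` and `combos_with_low_count` are made up for illustration.

```python
from math import comb

def low_count(winners, low_max=3):
    """Number of winning identities that fall in the interval [1, low_max]."""
    return sum(1 for w in winners if 1 <= w <= low_max)

def combos_with_low_count(k, n_sets=9, n_ids=10, low_max=3):
    """Count combinations where exactly k of the n_sets winners are in [1, low_max]."""
    # Choose which k sets have a low winner, then pick the identity in each set.
    return comb(n_sets, k) * low_max ** k * (n_ids - low_max) ** (n_sets - k)

example = (2, 7, 1, 3, 2, 6, 1, 5, 3)
print(low_count(example))  # 6

# Combinations with 4, 5, 6 or 7 winners in [1, 3].
allowed = sum(combos_with_low_count(k) for k in range(4, 8))
print(allowed, allowed / 10 ** 9)
```

Under these assumptions the rule on its own leaves about 27 % of the 10^9 combinations, so it narrows the field but would still have to be combined with per-row predictions to be useful.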

 

If I regard the logistic classification as horizontal modeling, which I don't think will implicitly capture "the 4-7 knowledge", how can I get that knowledge, which I regard as vertical modeling, into the complete model?

 

Best regards

Ernst


i'm afraid your problem is poorly worded, i can't make much heads or tails of what you're looking for.

it sounds like you're looking for a probabilistic model for determining what numbers occur with what frequency in each column.

here would be one way you could write such a program.

# Read whitespace-separated integer rows from data.txt, skipping blank lines.
data = []
with open("data.txt", "r") as f:
    for line in f:
        fields = line.split()
        if fields:
            data.append([int(v) for v in fields])

# Largest value anywhere in the table; values are used as indices below.
max_value = max(max(row) for row in data)

# counts[v] = number of times the value v occurs in the chosen column.
counts = [0] * (max_value + 1)
x = int(input("which column would you like to investigate? (0 = first column) "))
for row in data:
    counts[row[x]] += 1

# Convert the counts to percentages.
total = float(sum(counts))
print("the probabilities are:")
for value, count in enumerate(counts):
    print(value, count / total * 100)


You can regard the 9 sets as 9 horse races and the 10 individuals as 10 different horses, with start numbers from 1 to 10. You are supposed to find the 9 winning start numbers, and there are 10^9 possible combinations.

 

The data are historical, and I want to predict the outcome when I get the next 9 races with 10 horses, i.e. 9 sets with 5 (five) columns of data on each of the 10 horses in each race. There are only numbers, so the horses have no real identity.

 

Column 1: Start numbers

Columns 2 - 5: Ranking of the 10 horses in each race, where rank 1 is the best and 10 the worst, based on certain given data concerning each horse.

Column 6: Winner = 1, losers = 0.

 

Then I think you understand that I could do logistic classification on the historical data and use that model to predict the winners (1) and losers (0). I regard that as "horizontal modeling".
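
One way to make such a "horizontal" model respect the fact that each race has exactly one winner is to turn per-horse scores into probabilities that sum to 1 within a race (a softmax over the horses in the race). The sketch below assumes a linear score on the four rank columns with made-up weights; in practice the weights would have to be fitted on the historical data. The names `WEIGHTS` and `race_probabilities` are illustrative only.

```python
import math

# Hypothetical weights for rank columns 2-5 (negative: a lower rank number is better).
WEIGHTS = [-0.4, -0.2, -0.1, -0.1]

def race_probabilities(race, weights=WEIGHTS):
    """Map each start number to a win probability; probabilities sum to 1 per race.

    race: list of rows [start_number, rank2, rank3, rank4, rank5].
    """
    scores = {row[0]: sum(w * r for w, r in zip(weights, row[1:5])) for row in race}
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {h: math.exp(s - top) for h, s in scores.items()}
    total = sum(exp_scores.values())
    return {h: e / total for h, e in exp_scores.items()}

# The six complete rows from the example data set in the first post.
race = [
    [1, 4, 9, 3, 2],
    [2, 1, 4, 6, 8],
    [3, 2, 1, 4, 3],
    [4, 10, 3, 1, 5],
    [5, 3, 2, 5, 1],
    [6, 5, 6, 8, 7],
]
probs = race_probabilities(race)
for horse, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(horse, round(p, 3))
```

The per-race probabilities can then be multiplied across the 9 races to score a whole combination of winners.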

 

But I have plenty of data that strongly indicates that, for the start numbers in column 1, very often 4 to 7 of the 9 winning start numbers come from the interval [1, 3]. I regard that as "vertical knowledge".

 

I hope this makes it clearer what I want.

 

I don't think a logistic classification could give a model that predicts, with any accuracy, which one of the 10^9 possible outcomes is the right one. The "vertical knowledge" could maybe narrow down the number of combinations that are of interest. I would be happy if I had 7 of the races correct and if the model could pick out the 5 000 000 (0.5 %) best combinations to (randomly) choose among, since I of course can't afford to pay for 5 000 000 combinations.
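
A possible way to combine the two kinds of knowledge is to score every combination by the product of its per-race win probabilities, discard combinations that break the interval rule, and keep the best ones. The sketch below is only an illustration on a toy input: the probabilities are made up, the function name `best_combinations` is hypothetical, and the rule is scaled down to 3 races. Enumerating all 10^9 combinations this way in pure Python would be slow.

```python
import heapq
from itertools import product
from math import prod

def best_combinations(race_probs, n_best, low_max=3, min_low=4, max_low=7):
    """Return the n_best (probability, combination) pairs that satisfy the rule.

    race_probs: one dict per race mapping start number -> win probability.
    A combination is kept only if between min_low and max_low of its
    winners have a start number in [1, low_max].
    """
    def candidates():
        for combo in product(*(sorted(p) for p in race_probs)):
            low = sum(1 for h in combo if h <= low_max)
            if min_low <= low <= max_low:
                yield prod(p[h] for p, h in zip(race_probs, combo)), combo
    return heapq.nlargest(n_best, candidates())

# Toy input: 3 races with 5 horses each and made-up probabilities,
# with the rule scaled down to "1 or 2 winners among start numbers 1-3".
toy = [
    {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1},
    {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.5, 5: 0.1},
    {1: 0.3, 2: 0.3, 3: 0.1, 4: 0.2, 5: 0.1},
]
for p, combo in best_combinations(toy, n_best=5, min_low=1, max_low=2):
    print(combo, round(p, 4))
```

A softer variant would multiply each combination's score by a weight that depends on how many winners fall in [1, 3], instead of discarding combinations outright.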

 

 

Best regards

Ernst


okay that makes much more sense.

so basically what you're looking for is the number of times a "horse" gets ranked 1-3, with the best horses given the highest chances of winning.

for example, horse 3 in the data block at the initial post should receive a high probability, as should horse 5.

in particular, the first non-identity column gives the best chances for winning, so 2 should also receive a high probability.

with the reference to learning algorithms, I'm guessing you want the computer to figure out how valuable each column is, based on the number of wins and losses each horse receives with the given data.
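
a rough first look at how valuable each column is could be to check, per rank column, how often the winner was ranked 1-3 in that column. this is only a sketch: the function name `top3_hit_rate` is made up, and the only data used is the one example block from the initial post, so the numbers it prints mean nothing until real history is fed in.

```python
def top3_hit_rate(races, top=3):
    """For each rank column, the fraction of winners that were ranked 1..top.

    races: list of races; each race is a list of rows
           [start_number, rank2, rank3, rank4, rank5, won].
    """
    hits = [0, 0, 0, 0]
    n_winners = 0
    for race in races:
        for row in race:
            if row[5] == 1:  # this row is the winner
                n_winners += 1
                for c in range(4):
                    if row[1 + c] <= top:
                        hits[c] += 1
    return [h / n_winners for h in hits] if n_winners else [0.0] * 4

# The six complete rows of the example data set from the initial post.
example_race = [
    [1, 4, 9, 3, 2, 0],
    [2, 1, 4, 6, 8, 1],
    [3, 2, 1, 4, 3, 0],
    [4, 10, 3, 1, 5, 0],
    [5, 3, 2, 5, 1, 0],
    [6, 5, 6, 8, 7, 0],
]
rates = top3_hit_rate([example_race])
print(rates)
```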

give me a few days, let me see what i can drum up.
