
finding duplicates in a huge list


Dak


I have a large list of words, with several duplicates, so i wrote a small python script to generate a list of all the dupes:

 

dicnohash = the list

dups = the list of duplicates in the list

 

dups = []
for a in range(0, len(dicnohash)):
    for b in range(a+1, len(dicnohash)):
        if dicnohash[a] == dicnohash[b]:
            dups.append(dicnohash[b])
            break

 

it just compares each item with each item after it, and if they match, it logs the match in the dups list (which is later written to file), and then breaks to the next item in the list.

 

however, there are nearly 100,000 items in the list, so i guess it's going to do a max of somewhere in the range of 5,000,000,000 comparisons :eek:
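(For n items, that nested loop does at most n*(n-1)/2 comparisons, so for 100,000 items:)

n = 100000
print(n * (n - 1) // 2)     # 4999950000 -- just under five billion pairwise comparisons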

 

that's probably why it's taking so long :D

 

my question is, how could this be done faster? I honestly can't think of any way of doing it that'd take fewer comparisons than the way above. it only needs to be done once, so i'm using the above code + patience, but i'm still interested.


O.k., real computer science, whilst far more interesting, isn't exactly my strong point, but I'm thinking that if you ran a bubble sort algorithm (augmented to record duplicates) then fewer comparisons might be made. Also, you would get the list sorted as a bonus.
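
A minimal sketch of that idea, assuming the words live in a Python list (the function and variable names here are illustrative, not from the thread):

def bubble_sort_recording_dups(wordlist):
    # Bubble sort augmented to record duplicates: equal neighbours are
    # noted as they are compared. A set is used because the same equal
    # pair can be re-compared on a later pass.
    dups = set()
    n = len(wordlist)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if wordlist[j] > wordlist[j + 1]:
                wordlist[j], wordlist[j + 1] = wordlist[j + 1], wordlist[j]
                swapped = True
            elif wordlist[j] == wordlist[j + 1]:
                dups.add(wordlist[j])
        if not swapped:
            break               # no swaps means the list is already sorted
    return dups

It's still O(n²) comparisons in the worst case, though, so for 100,000 words it wouldn't save much over the original nested loop.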


Things you could try:

1) eliminate words from the text that were already found as duplicates, so they're not checked again later. That's little effort with probably little gain; in fact, whether you get a performance gain from that will depend strongly on how the text structure is stored and how it performs deletion of intermediate words.

2) read the whole text into a better-suited structure (e.g. an alphabetically sorted binary tree where each node counts the number of occurrences of the word). That changes the complexity from O(n²) to O(n*X) + Y, where X is the complexity of writing to the structure (i think that's typically log(n) for trees) and Y is the complexity of reading out your data. A lot of effort with a questionable outcome, but it could be worthwhile investigating - even if it's just for getting a feeling for the advantages and disadvantages of the different approaches.
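
A rough sketch of suggestion 2, assuming a plain (unbalanced) binary search tree; the names here are made up for illustration:

class Node:
    def __init__(self, word):
        self.word = word
        self.count = 1          # number of times this word has been seen
        self.left = None
        self.right = None

def insert(root, word):
    # Walk down the tree; bump the count if the word is already there.
    if root is None:
        return Node(word)
    if word == root.word:
        root.count += 1
    elif word < root.word:
        root.left = insert(root.left, word)
    else:
        root.right = insert(root.right, word)
    return root

def collect_dups(root, out):
    # In-order walk: every word with count > 1 is a duplicate.
    if root is not None:
        collect_dups(root.left, out)
        if root.count > 1:
            out.append(root.word)
        collect_dups(root.right, out)
    return out

root = None
for word in dicnohash:
    root = insert(root, word)
dups = collect_dups(root, [])

(Being unbalanced, the tree degrades back towards O(n²) if the input is already nearly sorted, and the recursion depth gets uncomfortable in that case.)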


It's all about data structure.

 

If your list is just a simple file or an array, then without restructuring you have to do it just how you are: checking every word against all the others.

 

As far as suggestions for deleting duplicates go, that would take just as long as finding them, since in order to delete you have to first find.

 

I suppose the best method of attack would be, like someone else mentioned, to first alphabetize the list before removing duplicates; that way you significantly reduce the comparisons that must be done with the duplicate words.


As far as suggestions for deleting duplicates go, that would take just as long as finding them, since in order to delete you have to first find.

If you take a close look at Dak's code then you'll see that he finds a duplicate ("if dicnohash[a] == dicnohash[b]:"), correctly copies it to the list of duplicates ("dups.append(dicnohash[b])") and later checks it again over the range [a'+1, len(list)] when a' = b.


he was. e.g., if there's a duplicate at points 3 and 7, then, once you've compared 3 with 7 and id'd the duplicate, it's pointless to then compare 4, 5, and 6 with 7 (as you've already compared them with 3, which == 7), and it's pointless to compare 7 with everything past 7, as you already know the item entered at point 7 is a duplicate.

 

if, once you've compared 3 with 7 and id'd the duplication, you then delete item 7, you prevent the above. essentially, you'd save len(list) - currentListItem comparisons (though i think what atheist was originally getting at was that deleting items from a list could take a lot of re-arranging of data too, so it might not be worth it).
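
For what it's worth, a sketch of that delete-as-you-find variant (it modifies dicnohash in place, and each del on a Python list is itself O(n), which is the re-arranging cost mentioned above):

dups = []
a = 0
while a < len(dicnohash):
    b = a + 1
    while b < len(dicnohash):
        if dicnohash[a] == dicnohash[b]:
            dups.append(dicnohash[b])
            del dicnohash[b]        # don't advance b: the next item has slid into this slot
        else:
            b += 1
    a += 1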

 

oh, by-the-by, it took 3 hours in the end :-(

 

O.k., real computer science, whilst far more interesting, isn't exactly my strong point, but I'm thinking that if you ran a bubble sort algorithm (augmented to record duplicates) then fewer comparisons might be made. Also, you would get the list sorted as a bonus.

 

ta. i didn't think of sorting them first.

 

but a quick google suggests that bubble-sort isn't very good:

 

http://en.wikipedia.org/wiki/Bubble_sort

http://en.wikipedia.org/wiki/Quicksort

http://www.cs.ubc.ca/spider/harrison/Java/sorting-demo.html

 

Look into hash tables

 

will do, but i don't think python has them.

 

guess this would be a good time to learn C...
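
(Worth noting, though: Python's built-in dict is a hash table under the hood, so a hash-based check needs only a few lines. A sketch, assuming dicnohash is the word list:)

seen = {}
dups = []
for word in dicnohash:
    if word in seen:            # dict lookup is a hash-table lookup, average O(1)
        dups.append(word)
    else:
        seen[word] = True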

 

===edit===

 

dups = []
dicnohash.sort()

for a in range(0, len(dicnohash) - 1):   # -1, so a+1 can't run off the end of the list
    if dicnohash[a] == dicnohash[a+1]:
        dups.append(dicnohash[a])

 

took, like, a few seconds :mad:


There's a 4-line implementation of a quicksort for Python on Wikibooks. It might not be the best possible solution due to relatively large stack usage, but you could try that ... *looks it up* ... http://en.wikibooks.org/wiki/Algorithm_implementation/Sorting/Quicksort#Python
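
Roughly in that spirit (a sketch along the same lines, not copied from the linked page):

def qsort(lst):
    # Recursive quicksort using the first element as pivot; recursion
    # depth (hence stack usage) grows badly on already-sorted input.
    if len(lst) <= 1:
        return lst
    pivot, rest = lst[0], lst[1:]
    return (qsort([x for x in rest if x < pivot])
            + [pivot]
            + qsort([x for x in rest if x >= pivot]))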

 

Considering that, and the relatively large programming overhead in creating a useful tree structure, I think doing a sort on the list would be option #1.


ta. i didn't think of sorting them first.

 

but a quick google suggests that bubble-sort isn't very good:

Oh you're right that bubble-sort is a piece of crap for actually sorting anything, but I was suggesting finding duplicates as you sort.

Assuming they are lists, the easiest way would be just to use the Python list's sort() method (which I think uses a variation on quicksort called samplesort (ref)) and then go through linearly to find the duplicates. You will still get O(n log n) performance, although the constant involved will be slightly bigger due to that additional linear search. You'll still get much better performance than you are getting now, with fairly minimal effort.
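
That approach is only a couple of lines; a sketch, reusing the dicnohash name from earlier in the thread:

words = sorted(dicnohash)       # sorted() returns a new list; dicnohash.sort() sorts in place instead
dups = [a for a, b in zip(words, words[1:]) if a == b]
# a word occurring k times appears k-1 times in dups; wrap it in set() to list each word once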


Not to say this is best, but if I were you and needed a fast algo, I'd build 26 lists (call them LA, LB, LC, ...) and do one pass splitting the words by first letter. That would give you a distribution of 26 buckets with an average weight of ~4,000 words each.

 

Each list can be sorted (sounds odd, but a normal PC takes some 1-2 seconds to do that). Once that is done, scan each list from 0 to Size-2, and if LA[i] == LA[i+1] then delete LA[i]. (In the real world this is best done counting backwards.) This equates to two passes over the data and 26 QSorts; overall it should be quite fast.
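
A rough sketch of that bucketing idea, using a Python dict keyed on the first letter (the names are illustrative, and it assumes no empty strings in the list):

buckets = {}
for word in dicnohash:
    buckets.setdefault(word[0].lower(), []).append(word)   # one pass: split by first letter

dups = []
for bucket in buckets.values():
    bucket.sort()                                          # one (much smaller) sort per bucket
    for i in range(len(bucket) - 1):
        if bucket[i] == bucket[i + 1]:
            dups.append(bucket[i])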

 

This eliminates the overhead of a more complex (yet effective) hash. You can tweak it in real life by timing QSort versus the scanning of the 26 lists. If the sorting is slower, do a wider spread, say the first 2 letters. It all depends on the input data. Note that QSort performs VERY poorly if applied to small or already-sorted lists. If you expect semi-sorted lists, then perhaps a double bubble would be better.

 

Though the simplest approach would most likely be sorting then scanning. String comparison is slower so you could do some dirty(er) tricks, like sorting by word length. That should speed things up a bit.

