
finding duplicates in a huge list


Dak


I have a large list of words, with several duplicates, so i wrote a small python script to generate a list of all the dupes:

 

dicnohash = the list

dups = the list of duplicates in the list

 

dups = []
for a in range(0, len(dicnohash)):
    for b in range(a+1, len(dicnohash)):
        if dicnohash[a] == dicnohash[b]:
            dups.append(dicnohash[b])
            break

 

it just compares each item with each item after it, and if they match, it logs the match in the dups list (which is later written to file), and then breaks to the next item in the list.

 

however, there are nearly 100,000 items in the list, so i guess it's going to do a max of somewhere in the range of 5,000,000,000 comparisons :eek:
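(For n items, that nested loop does at most n*(n-1)/2 comparisons, so for 100,000 items:)

n = 100000
print(n * (n - 1) // 2)     # 4999950000 -- just under five billion pairwise comparisons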

 

that's probably why it's taking so long :D

 

my question is, how could this be done faster? I honestly can't think of any way of doing it that'd take fewer comparisons than the way above. it only needs to be done once, so i'm using the above code + patience, but i'm still interested.


O.k., real computer science, whilst far more interesting, isn't exactly my strong point, but I'm thinking that if you ran a bubble sort algorithm (augmented to record duplicates) then fewer comparisons might be made. Also, you would get the list sorted as a bonus.
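
A minimal sketch of that idea, assuming the words live in a Python list (the function and variable names here are illustrative, not from the thread):

def bubble_sort_recording_dups(wordlist):
    # Bubble sort augmented to record duplicates: equal neighbours are
    # noted as they are compared. A set is used because the same equal
    # pair can be re-compared on a later pass.
    dups = set()
    n = len(wordlist)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if wordlist[j] > wordlist[j + 1]:
                wordlist[j], wordlist[j + 1] = wordlist[j + 1], wordlist[j]
                swapped = True
            elif wordlist[j] == wordlist[j + 1]:
                dups.add(wordlist[j])
        if not swapped:
            break               # no swaps means the list is already sorted
    return dups

It's still O(n²) comparisons in the worst case, though, so for 100,000 words it wouldn't save much over the original nested loop.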


Things you could try:

1) eliminate words from the text that were already found as duplicates, so they're not checked again later. That's little effort with probably little gain; in fact, whether you get a performance gain from that will depend strongly on how the text structure is stored and how it performs deletion of intermediate words.

2) read the whole text into a better-suited structure (e.g. an alphabetically sorted binary tree where each node counts the number of occurrences of the word). That changes the complexity from O(n²) to O(n*X) + Y, where X is the complexity of writing to the structure (i think that's typically log(n) for trees) and Y is the complexity of reading out your data. A lot of effort with a questionable outcome, but it could be worthwhile investigating - even if it's just for getting a feeling for the advantages and disadvantages of the different approaches.
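
A rough sketch of suggestion 2, assuming a plain (unbalanced) binary search tree; the names here are made up for illustration:

class Node:
    def __init__(self, word):
        self.word = word
        self.count = 1          # number of times this word has been seen
        self.left = None
        self.right = None

def insert(root, word):
    # Walk down the tree; bump the count if the word is already there.
    if root is None:
        return Node(word)
    if word == root.word:
        root.count += 1
    elif word < root.word:
        root.left = insert(root.left, word)
    else:
        root.right = insert(root.right, word)
    return root

def collect_dups(root, out):
    # In-order walk: every word with count > 1 is a duplicate.
    if root is not None:
        collect_dups(root.left, out)
        if root.count > 1:
            out.append(root.word)
        collect_dups(root.right, out)
    return out

root = None
for word in dicnohash:
    root = insert(root, word)
dups = collect_dups(root, [])

(Being unbalanced, the tree degrades back towards O(n²) if the input is already nearly sorted, and the recursion depth gets uncomfortable in that case.)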


It's all about data structure.

 

If your list is just a simple file or an array, then without restructuring you have to do it just how you are: checking every word against all the others.

 

As far as suggestions for deleting duplicates go, that would take just as long as finding them, since in order to delete you have to first find.

 

I suppose the best method of attack would be, like someone else mentioned, to first alphabetize the list before removing duplicates; that way you significantly reduce the comparisons that must be done with the duplicate words.


As far as suggestions for deleting duplicates go, that would take just as long as finding them, since in order to delete you have to first find.

If you take a close look at Dak's code then you'll see that he finds a duplicate ("if dicnohash[a] == dicnohash[b]:"), correctly copies it to the list of duplicates ("dups.append(dicnohash[b])") and later checks it again over the range [a'+1, len(list)] when a' = b.


he was. e.g., if there's a duplicate at points 3 and 7, then, once you've compared 3 with 7 and id'd the duplicate, it's pointless to then compare 4, 5, and 6 with 7 (as you've already compared them with 3, which == 7), and it's pointless to compare 7 with everything past 7, as you already know the item entered at point 7 is a duplicate.

 

if, once you've compared 3 with 7 and id'd the duplication, you then delete item 7, you prevent the above. essentially, you'd save len(list) - currentListItem comparisons (though i think what atheist was originally getting at was that deleting items from a list could take a lot of re-arranging of data too, so it might not be worth it).
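
For what it's worth, a sketch of that delete-as-you-find variant (it modifies dicnohash in place, and each del on a Python list is itself O(n), which is the re-arranging cost mentioned above):

dups = []
a = 0
while a < len(dicnohash):
    b = a + 1
    while b < len(dicnohash):
        if dicnohash[a] == dicnohash[b]:
            dups.append(dicnohash[b])
            del dicnohash[b]        # don't advance b: the next item has slid into this slot
        else:
            b += 1
    a += 1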

 

oh, by-the-by, it took 3 hours in the end :-(

 

O.k., real computer science, whilst far more interesting, isn't exactly my strong point, but I'm thinking that if you ran a bubble sort algorithm (augmented to record duplicates) then fewer comparisons might be made. Also, you would get the list sorted as a bonus.

 

ta. i didn't think of sorting them first.

 

but a quick google suggests that bubble-sort isn't very good:

 

http://en.wikipedia.org/wiki/Bubble_sort

http://en.wikipedia.org/wiki/Quicksort

http://www.cs.ubc.ca/spider/harrison/Java/sorting-demo.html

 

Look into hash tables

 

will do, but i don't think python has them.

 

guess this would be a good time to learn C...
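
(Worth noting, though: Python's built-in dict is a hash table under the hood, so a hash-based check needs only a few lines. A sketch, assuming dicnohash is the word list:)

seen = {}
dups = []
for word in dicnohash:
    if word in seen:            # dict lookup is a hash-table lookup, average O(1)
        dups.append(word)
    else:
        seen[word] = True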

 

===edit===

 

dups = []
dicnohash.sort()

for a in range(0, len(dicnohash) - 1):   # -1, so a+1 can't run off the end of the list
    if dicnohash[a] == dicnohash[a+1]:
        dups.append(dicnohash[a])

 

took, like, a few seconds :mad:


There's a 4-line implementation of a quicksort for Python on Wikibooks. It might not be the best possible solution due to relatively large stack usage, but you could try that ... *looks it up* ... http://en.wikibooks.org/wiki/Algorithm_implementation/Sorting/Quicksort#Python
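
Roughly in that spirit (a sketch along the same lines, not copied from the linked page):

def qsort(lst):
    # Recursive quicksort using the first element as pivot; recursion
    # depth (hence stack usage) grows badly on already-sorted input.
    if len(lst) <= 1:
        return lst
    pivot, rest = lst[0], lst[1:]
    return (qsort([x for x in rest if x < pivot])
            + [pivot]
            + qsort([x for x in rest if x >= pivot]))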

 

Considering that, and the relatively large programming overhead in creating a useful tree structure, I think doing a sort on the list would be option #1.


ta. i didn't think of sorting them first.

 

but a quick google suggests that bubble-sort isn't very good:

Oh you're right that bubble-sort is a piece of crap for actually sorting anything, but I was suggesting finding duplicates as you sort.

Assuming they are lists, the easiest way would be just to use the Python list's sort() method (which I think uses a variation on quicksort called samplesort (ref)) and then go through linearly to find the duplicates. You will still get O(n log n) performance, although the constant involved will be slightly bigger due to that additional linear search. You'll still get much better performance than you are getting now, with fairly minimal effort.
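
That approach is only a couple of lines; a sketch, reusing the dicnohash name from earlier in the thread:

words = sorted(dicnohash)       # sorted() returns a new list; dicnohash.sort() sorts in place instead
dups = [a for a, b in zip(words, words[1:]) if a == b]
# a word occurring k times appears k-1 times in dups; wrap it in set() to list each word once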


Not to say this is best, but if I were you and needed a fast algo, I'd build 26 lists (call them LA, LB, LC, ...) and do one pass splitting the words by first letter. That would give you a distribution of 26 buckets with an average weight of ~4,000 words each.

 

Each list can be sorted (sounds odd, but a normal PC takes some 1-2 seconds to do that). Once that is done, scan each list from 0 to Size-2, and if LA[i] == LA[i+1] then delete LA[i]. (In the real world this is best done counting backwards.) This equates to two passes over the data and 26 QSorts; overall it should be quite fast.
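
A rough sketch of that bucketing idea, using a Python dict keyed on the first letter (the names are illustrative, and it assumes no empty strings in the list):

buckets = {}
for word in dicnohash:
    buckets.setdefault(word[0].lower(), []).append(word)   # one pass: split by first letter

dups = []
for bucket in buckets.values():
    bucket.sort()                                          # one (much smaller) sort per bucket
    for i in range(len(bucket) - 1):
        if bucket[i] == bucket[i + 1]:
            dups.append(bucket[i])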

 

This eliminates the overhead of a more complex (yet effective) hash. You can tweak it in real life by timing QSort versus the scanning of the 26 lists. If the sorting is slower, do a wider spread, say the first 2 letters. It all depends on the input data. Note that QSort performs VERY poorly if applied to small or already-sorted lists. If you expect semi-sorted lists, then perhaps a double bubble would be better.

 

Though the simplest approach would most likely be sorting then scanning. String comparison is slower so you could do some dirty(er) tricks, like sorting by word length. That should speed things up a bit.

