Note: This entry has been restored from old archives.
Time for a little Python “spot the difference” game! I’ve been working primarily with Python lately and a few weeks ago I found this hideous performance bug lurking in my code.
This:
setHash = {}
for featureSet in featureSets:
    for feature in featureSet:
        if feature not in setHash.keys():
            setHash[feature] = []
        setHash[feature].append(featureSet)
Versus this:
setHash = {}
for featureSet in featureSets:
    for feature in featureSet:
        if not setHash.has_key(feature):
            setHash[feature] = []
        setHash[feature].append(featureSet)
Now, consider that I have huge data sets. Guess what happens? Yes, that’s right, Tommy! The first example is much slower; in fact, it runs about 1000 times slower. Beware the Indictkeys, my script! The lists that iterate, the keys that come! Beware the Hashhash loop, and shun the pythious Syntactum.
The lesson here is: remember has_key.
On reflection, the wrongness of what I originally wrote seems obvious. I don’t know exactly what happens “under the hood”, but I can make a fairly accurate guess: in Python 2, keys() builds a complete list of the dictionary’s keys, and the in test then scans that list linearly, while has_key does a single hash lookup. It just goes to show that even in a language as pleasant to work with as Python it isn’t too difficult to trip yourself up with simple, everyday foolishness.
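For what it’s worth, the usual fix keeps the `in` operator but drops the `.keys()` call: testing membership on the dictionary itself is a single hash lookup, with no intermediate list. This is a sketch with made-up sample data standing in for my featureSets (and it happens to be the idiomatic form in modern Python 3, where has_key no longer exists):

```python
# Hypothetical sample data; the real featureSets was far larger.
featureSets = [("a", "b"), ("b", "c")]

setHash = {}
for featureSet in featureSets:
    for feature in featureSet:
        # Membership test on the dict itself: one hash lookup,
        # no list of keys built and scanned on every iteration.
        if feature not in setHash:
            setHash[feature] = []
        setHash[feature].append(featureSet)
```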
[Of course, there are always going to be other ways to do it!
It does look neater without the has_key.]
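One such other way, sketched here with hypothetical sample data and assuming Python 2.5 or later: collections.defaultdict makes the existence check disappear entirely, since a missing key gets a fresh empty list on first access.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the post's featureSets.
featureSets = [("a", "b"), ("b", "c")]

# defaultdict(list) creates the empty list automatically, so the
# "if key not present, initialise it" branch is no longer needed.
setHash = defaultdict(list)
for featureSet in featureSets:
    for feature in featureSet:
        setHash[feature].append(featureSet)
```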