Thursday, December 31, 2009

I've spent much of the last few days toying with data structures and testing a few dozen hash functions for use with a simple hash table module. It's actually the first time I've ever had to implement a hash table myself, lol. It's a fairly simple creation built on top of chaining.

The perfect hungry man's analogy for those who know squat about data structures: let's say you want to find a recipe for peanut butter cookies in a cookbook. You could thumb through the book every couple of pages, looking for cookie recipes, or even flip over to the index in the back and, depending on the quality of the index, work your way down it, systematically checking all the cookie references; linked lists and arrays work more or less similarly. A hash table is more like a table of contents: you flip open the book, go to the ToC, find the section with the cookie recipes, then flip to that page and start flippin' every few pages until you find the one you want.

Hash tables work essentially the same way: you feed in the key (peanut butter cookies, yum) and hash it down to an index where you can find the target. Although hash functions that don't collide (often enough to care) are possible, when you have to bind the generated index within a given size, the odds of several different keys sharing the same index skyrocket. So instead of a direct 1 to 1 mapping, you have a needle-to-which-haystack mapping.
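To make the "hash it down to an index" step concrete, here is a minimal sketch. It assumes string keys and uses the well-known djb2 hash purely as a stand-in, since the post doesn't name the functions that were tested; hash_djb2 and bucket_index are hypothetical names, not part of the module described here.

```c
#include <stddef.h>

/* djb2 string hash (Dan Bernstein). Picked only for illustration;
 * it is an assumption, not the function the module actually uses. */
static unsigned long hash_djb2(const char *key)
{
    unsigned long h = 5381;
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; p++)
        h = h * 33 + *p;
    return h;
}

/* Squash the full-width hash into the range [0, nbuckets).
 * nbuckets must be non-zero. */
static size_t bucket_index(const char *key, size_t nbuckets)
{
    return (size_t)(hash_djb2(key) % nbuckets);
}
```

The modulo is where the collisions come from: the hash has billions of possible values, but the index has only nbuckets, so with more keys than buckets at least two keys are guaranteed to share one.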

My implementation uses a dynamic array of bucket lists, allocated by HashTableCreate(). Each element of the array is a list, and every key is hashed, then squashed down to the size of the array. When a new key:value pair is inserted, it gets added to the matching bucket list. On lookup, the key is hashed back to the same value in order to find the correct list, in which to hunt down the correct entry.
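A rough sketch of that layout, for the curious. This is not the actual module: the ht_* names, the int values, and the djb2 hash are all assumptions made up for the example, and the real HashTableCreate() signature isn't shown in this post.

```c
#include <stdlib.h>
#include <string.h>

/* One key:value entry in a bucket's singly linked list. */
typedef struct Entry {
    char *key;
    int value;
    struct Entry *next;
} Entry;

/* The table: a dynamically allocated array of list heads. */
typedef struct {
    Entry **buckets;
    size_t nbuckets;
} Table;

/* djb2, used here only as a placeholder hash (an assumption). */
static unsigned long ht_hash(const char *key)
{
    unsigned long h = 5381;
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; p++)
        h = h * 33 + *p;
    return h;
}

Table *ht_create(size_t nbuckets)
{
    if (nbuckets == 0)
        return NULL;
    Table *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->buckets = calloc(nbuckets, sizeof *t->buckets); /* all lists empty */
    if (t->buckets == NULL) {
        free(t);
        return NULL;
    }
    t->nbuckets = nbuckets;
    return t;
}

/* Insert or update; returns 0 on success, -1 if out of memory. */
int ht_insert(Table *t, const char *key, int value)
{
    size_t i = ht_hash(key) % t->nbuckets;
    for (Entry *e = t->buckets[i]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            e->value = value; /* key already present: overwrite */
            return 0;
        }
    }
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    size_t len = strlen(key) + 1;
    e->key = malloc(len);
    if (e->key == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->key, key, len);
    e->value = value;
    e->next = t->buckets[i]; /* prepend to the bucket list */
    t->buckets[i] = e;
    return 0;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
const int *ht_lookup(const Table *t, const char *key)
{
    size_t i = ht_hash(key) % t->nbuckets;
    for (const Entry *e = t->buckets[i]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return &e->value;
    return NULL;
}

void ht_destroy(Table *t)
{
    if (t == NULL)
        return;
    for (size_t i = 0; i < t->nbuckets; i++) {
        for (Entry *e = t->buckets[i]; e != NULL; ) {
            Entry *next = e->next;
            free(e->key);
            free(e);
            e = next;
        }
    }
    free(t->buckets);
    free(t);
}
```

Creating the table with a single bucket is a handy way to check the chaining itself: every key collides, and lookups still have to come back right.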


One reason I chose separate chaining, other than that it fits with how my brain works: many schemes for open addressing (the alternative) feel more like something I would think up in order to skip writing a hash table >_>. Although I must admit the possibilities of open addressing combined with quadratic probing are interesting, I'm more in favour of the chains. I doubt my implementation is memory efficient, and since the only two design goals were to be faster than a (pure) linear search and quick to hash out the code, it undoubtedly has its flaws. For the sake of testing how it affects the intended usage, I may try augmenting or replacing the current behavior of prepending new entries to the lists with moving the last requested key to the head, or switch to using red-black trees in place of linked lists.
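The move-to-head idea is small enough to sketch on a bare chain. This is only an illustration of the technique under my own assumptions: Node and chain_find_mtf are hypothetical names, and the nodes here borrow their keys rather than owning them.

```c
#include <stddef.h>
#include <string.h>

/* A single link in one bucket's chain. */
typedef struct Node {
    const char *key;
    int value;
    struct Node *next;
} Node;

/* Search the chain starting at *head. On a hit, unlink the node and
 * re-insert it at the front, so recently requested keys are found
 * sooner next time. Returns the node, or NULL if the key is absent. */
Node *chain_find_mtf(Node **head, const char *key)
{
    Node *prev = NULL;
    for (Node *n = *head; n != NULL; prev = n, n = n->next) {
        if (strcmp(n->key, key) == 0) {
            if (prev != NULL) {        /* not already at the head */
                prev->next = n->next;  /* unlink from current spot */
                n->next = *head;       /* splice in at the front */
                *head = n;
            }
            return n;
        }
    }
    return NULL;
}
```

The trade-off is that lookups now write to the list, which costs a little on every hit and only pays off if a few keys get requested far more often than the rest.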


I spent several hours testing different hash functions, using a small input set of words and a larger system dictionary file. HashTableCreate() allows one to specify which hash function should be used, and it's possible to override it with a user-supplied one. In testing, I've found that hash functions created by people with a background in mathematics work significantly better than my own, as duly expected lol. Other than dealing with hash function issues, the rest was a cakewalk.
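For a feel of what that kind of comparison looks like, here is a small sketch that takes the hash function as a pointer and reports the longest chain a word list would produce. Everything in it is an assumption for illustration: hash_fn, hash_naive_sum, and longest_chain are made-up names, FNV-1a is just one well-known published hash, and none of this is the harness or the functions I actually used.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Any function with this shape can be plugged in. */
typedef uint32_t (*hash_fn)(const char *key);

/* A deliberately weak home-made hash: the sum of the bytes.
 * Anagrams always collide, since addition ignores order. */
uint32_t hash_naive_sum(const char *key)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; p++)
        h += *p;
    return h;
}

/* 32-bit FNV-1a, a widely published general-purpose hash. */
uint32_t hash_fnv1a(const char *key)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; p++) {
        h ^= *p;
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Count how many words land in each bucket and return the size of the
 * fullest one. Returns 0 if nbuckets is 0 or allocation fails. */
size_t longest_chain(hash_fn fn, const char *const *words,
                     size_t nwords, size_t nbuckets)
{
    if (nbuckets == 0)
        return 0;
    size_t *counts = calloc(nbuckets, sizeof *counts);
    if (counts == NULL)
        return 0;
    size_t longest = 0;
    for (size_t i = 0; i < nwords; i++) {
        size_t b = fn(words[i]) % nbuckets;
        if (++counts[b] > longest)
            longest = counts[b];
    }
    free(counts);
    return longest;
}
```

Feed it anagrams like "listen" and "silent" and the byte-sum hash piles them all into one bucket no matter how many buckets there are, which is exactly the sort of flaw a home-grown function tends to have.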
