A hash function, is a function that maps a big set of data to a smaller set of data. For example, a hash function can map people’s names, to an integer between 1 and 10. Hash is explained in the video using terms like keys and buckets. In this case, the keys are the names, and there are 10 buckets, numbered 0 to 9.
A good hash function has certain properties. It tries to spread the keys evenly over all the buckets, so the size of each buckets after hashing is roughly the same. And for each key, it will definitely output a hash value that is valid, i.e. less than or equals to the numbers of buckets available.
A function provided by Python works similar to a hash function: ord(). Given a string of length one, this method outputs an integer representing the Unicode code point of a Unicode object For example:
print ord('A') # gives you 65
print ord('B') # gives you 66
A hash function that we would want to use for our search engine is similar, given a character, it will assign a bucket number to dump the key into. One way to quickly do this is using the modulus operator ‘%’:
print 12%2 # gives you 0
print 12%5 # gives you 2
The modulus operator gives you the remainder when you take the expression to its left, and divide it by the expression on the right. Suppose you take
x%y: the range of results you will get is from
0 all the way till
y - 1, which effectively gives you
y buckets to dump your keys into.
This is a very simple hash function, and very problematic too. Suppose you provide the words of the English Language as keys, and you hash them according to the first letter. Some buckets will definitely be bigger than the rest, compare ‘Q’ with ‘S’.
Hence a better hash function would look at all the letters in the word, and return an integer after taking into account all the letters, simply:
hash_value = 0
for letter in word:
hash_value += (ord(letter)) % buckets
The ‘+=’ is a short hand operator:
a += b # is equivalent to a = a + b