Hashing, Bloom Filters

Hash function: Does the mapping between the numbers and the required key.

Purpose: Maintain a possibly evolving set of stuff.
Operations: Insert, Delete, Lookup (All operations in O(1) time)

NOTE: Hash tables do not have ordering among its keys.
Hash tables speed up programs that require repeated lookups. Or for keeping track of stuff.

Applications:
1) Deduplication: Given a stream of objects, remove duplicates by looking up in a hash table.

2) The 2-SUM problem: Unsorted array of integers. Find two values with target sum T.
Sol: Insert elements in a hash table. For each X, lookup T-x in H.

Implementation:
High Level Idea- Consider Universe U (A REALLLY BIG SET).
Goal: Maintain a feasible evolving subset S of U.

Structure: Consider a array of size N and consider each spot to be a bucket. Hence we have N buckets. Now use a hash function. Collisions might occur for small datasets itself.

Resolving Collisions:
Chaining: Maintain a list in every bucket.
Open Addressing: No list, every bucket has one item. If the bucket is full, try another bucket. Keep trying until success.
Eg: Linear Probing, Double Hashing,(Second hash function will give the offset which will be added continuously to the first hash value to get an empty bucket)

Pro's and Con's: Chaining will take up extra space. Deletion is difficult in open addressing.

Good Hash Function: Should spread data evenly, be able to compute in constant time. Try to make it completely random.

Quick and dirty hash functions: Making simple hash functions which are useful. Basically, convert the main string into integers, then apply a compression function to fit them to the number of buckets.

How to choose number of buckets n?
1)Let n be a prime number.
2) Not too close to a power of 2 and 10.

Load factor of a hash table: (alpha) : # of objects in table/# of buckets

alpha = 1 is a necessary condition for the operations to run in constant time.

EVERY HASH FUNCTION HAS A PATHOLOGICAL DATA SET.

Solutions? 1) Cryptographic hash function 2) Use randomization over a family of hash functions.

-------------------------------------------------------------------------------------------------------------------------------------------------

BLOOM FILTER:

1) More space efficient over hash tables.
2) They allow for errors.
3) Doesnt remember(store) the object itself, just remembers whether it has seen it or not.
4) Basic ones have no deletions.
5) Possible mistakes. Flase positives. Despite not having inserted before, filter will say its there.

Applications: Early spell checkers,List of forbidden passwords, network routers,

Ingredients: Array and hash functions. Array only has 0 or 1.

Consider object . We use k hash functions. Run each hash function with the object to get an index and mark the value with that index as 1, regardless of its previous value. This is for insertion. For look-up, run the hash functions again and check the values of the corresponding posititons.

K =(log2).b (b=bits per object)

Hashing, Bloom Filters

Next up

Previous

Hashing, Bloom Filters