A hash function is a pure function: when called on the same argument twice, it returns the same value. A real-world hash function is also expected to return a useful key: a small value (in terms of memory footprint) with a trivial comparison operator, and such that the probability of collision (two different arguments having the same key) is small.
Such hash functions are useful: they turn a large object that is difficult to store and compare (the argument) into a small one that is easy to store and compare, with the benefit that if the small objects are different, then the large objects are different, and if the small objects are equal then there is a very high probability that the large objects are also equal.
A perfect hash function has a probability of collision of zero: two objects are equal if and only if their keys are equal.
The problem is that generic perfect hash functions cannot exist. If a hash function always creates a 32-bit number, then that hash function can only discriminate between 4 294 967 296 different objects. Meaning that if you manipulate 4 294 967 297 objects, you will have a collision and the function will not be perfect.
Specific perfect hash functions do exist: if the entire set of manipulated objects is used, then a perfect hash function can be found. However, this requires knowledge of the set, which is not always possible.
Approximation
I propose here an approximation of a generic perfect hash function, which uses the Objective Caml garbage collection to make an arbitrary number of values fit within an Objective Caml integer.
The interface to this approximation is as follows:
type 'a handle val (!!) : 'a handle -> 'a val make : unit -> 'a -> 'a handle
Here is what happens:
- ‘make ()’ creates a new perfect hash function that handles values of a certain type. When called on an argument, that hash function returns a handle that is associated to that argument. If two handles are associated to arguments that are structurally equal, then the handles are referentially equal. If two handles are associated to arguments that are not structurally equal, then the handles are not structurally equal either (however, comparison of two handles from distinct hash functions is unspecified).
- ‘!! handle’ returns the argument to which a handle was associated. Since this is a perfect hash function, there is only exactly one argument associated with each handle.
A simple implementation of the above consists in creating a hash table that maps arguments to handles, and an integer reference. If the argument is in the hash table, the hash function returns the associated handle. Otherwise, the integer reference is incremented, a handle is created from that integer reference (to ensure structural inequality), added to the hash table, and returned.
The problem with that simple implementation is that it is not GC-friendly: even if the values which are hashed disappear from the user code, the hash table will keep them alive until the program ends or the hash function itself is flushed (which, if the function was global, never happens).
Optimization
The proposed optimization is to make the previous implementation GC-friendly by using weak references. The implementation idea is as follows:
- The hash function stores an array of weak references to handles, and a hash table mapping values to indices in that array.
- Whenever a major GC iteration occurs, the hash function contents are scanned. If a weak reference is gone, then the corresponding hash table entry disappears.
- The value associated to a handle is stored outside the handle and in the hash function, so that it remains available once the handle is gone, in order to clean up.
The final implementation is as follows:
type 'a handle = { id : int ; owner : 'a handles } and 'a handles = { mutable handles : 'a handle Weak.t ; mutable values : 'a option array ; hash : ('a,int) Hashtbl.t } let find_free_index handles = let i = ref 0 in let s = Array.length handles.values in while !i < s && handles.values.(!i) <> None do incr i done ; if !i < s then !i else begin let new_handles = Weak.create (s*2) in let new_values = Array.make (s*2) None in Weak.blit handles.handles 0 new_handles 0 s ; Array.blit handles.values 0 new_values 0 s ; handles.handles <- new_handles ; handles.values <- new_values ; s end let get handles value = match begin try Weak.get handles.handles (Hashtbl.find handles.hash value) with Not_found -> None end with | Some handle -> handle | None -> let handle = { id = find_free_index handles ; owner = handles } in Weak.set handles.handles (handle.id) (Some handle) ; handles.values.(handle.id) <- Some value ; Hashtbl.add handles.hash value handle.id ; handle let (!!) handle = match handle.owner.values.(handle.id) with | Some x -> x | None -> assert false let clean handles = for i = 0 to Array.length handles.values do if not (Weak.check handles.handles i) then match handles.values.(i) with | Some x -> Hashtbl.remove handles.hash x ; handles.values.(i) <- None | None -> () done let make_handler size = let handles = { handles = Weak.create size ; values = Array.make size None ; hash = Hashtbl.create size } in let alarm = ref None in let handler = Weak.create 1 in Weak.set handler 0 (Some handles) ; alarm := Some (Gc.create_alarm (fun () -> match Weak.get handler 0 with | None -> (match !alarm with | None -> () | Some alarm -> Gc.delete_alarm alarm) | Some handles -> clean handles)) ; handles let make () = let handles = make_handler 128 in (fun value -> get handles value)
This code is hereby placed in the public domain, without warranty of any kind.
Hi. I'm Victor Nicollet,
0 Responses to “Generic Perfect Hash”