Perfect Hashes in Java
Given a set of m keys, a minimal perfect hash function maps each key to an integer 0 to m-1, and (most importantly) each key maps to a different integer. This means you can use the "perfect hash" number as a index into an array (i.e. use it as a hashmap) for guaranteed O(1) insertions & lookups. I'm going explain the BMZ algorithm, roughly following the author's C implmentation as it creates perfect hashes in O(m) space and time. I'll end up with an implementation of Google Guava's Equivalence as then you can use wrappers and standard Java HashMaps to create an efficient Collection with a minimum of wheel-reinventing.
But first I'll start with a simple example. Working in Java is useful as we can re-use our key Objects' hashCode methods to do most of the work. However, it's unlikely that the numbers that hashCode returns are "perfect" - so we'll have to modify them deterministically. I'll use an idea I got from the Jenkins hash algorithm - basically choose a seed integer and mix that with the hashCodes of the keys. As we want the resulting hashCode to lie between 0 and m-1 we'll just do mod-m on the result after mixing in the seed - so then now we just have to worry about choosing a seed that makes each object map to a different number.
8 9 10 11 12 13 14 15 16 17 18 19 |
⟨First Draft⟩+≡ private static final class MixSeed<E> extends Equivalence<E> { private final int seed; private final int m; MixSeed(int seed, int m) { this.seed = seed; this.m = m; } protected boolean doEquivalent(E a, E b) { return a.equals(b); } protected int doHash(E t) { return (t.hashCode() ^ seed) % m; } } |
The first - draft approach is simply to guess a seed; if the resulting hashCodes are perfect, then return an Equivalence that uses that seed, but if not try again. We don't want to keep looping forever, so fix the number of tries and fail if no perfect hash is found.
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
⟨First Draft⟩+≡ public static <E> Equivalence<E> create(Collection<? extends E> keys) { Random r = new Random(17); // fixed seed, so deterministic for (int tries = 0; tries < 1000; ++tries) { int seed = r.nextInt(); SortedSet<Integer> perfectHashes = Sets.newTreeSet(); for (E key : keys) { perfectHashes.add((key.hashCode() ^ seed) % keys.size()); } if (perfectHashes.first() == 0 && perfectHashes.last() == keys.size() - 1) { return new MixSeed<E>(seed, keys.size()); } } ⟨Give Up⟩ } |
This is clearly not very likely to succeed. To work out the exact probability of an iteration finding a perfect hash, we'll assume the hashCode mixed with the seed is uniformly distributed between 0 and m-1. The first key can be mapped to any of the m integers in this range, the second to any of the m-1 remaining integers, the third to the m-2 remaining integers, &c., and the probablity of this happening is m/m * (m-1)/m * (m-2)/m * ... * 1/m, which is m!/m^{m} - so not very likely!
The BMZ algorithm takes a pretty interesting approach. To build the perfect hash in O(m) time we can only store an O(m) amount of state. We want to make the constant as big as possible (which uses a lot of memory - not ideal), so we could either store really big state objects, or make several queries smaller state objects (which BMZ does). The problem them becomes: (1) how do you work out what queries to make, and more importantly (2) how do you build up the state such that each key makes result in a different hash number.
BMZ queries the state twice to get the data it needs to return the hash number, and solves the first step by a logical extension of the first draft above: instead of having one seed, have two! The Equivalence below takes the shared state g (an array whose length is not m), queries it twice with the two different seeds, and combines them by simply summing the two states it finds. I've made the Equivalence Serializable so once you've done the hard work of generating it you can persist it somewhere and load it in other applications.
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
⟨Returned Equivalence⟩+≡ private static final class BMZ<E> extends Equivalence<E> implements Serializable { private final int seed1; private final int seed2; private final int[] g; BMZ(int seed1, int seed2, int[] g) { this.seed1 = seed1; this.seed2 = seed2; this.g = g; } protected boolean doEquivalent(E a, E b) { return a.equals(b); } protected int doHash(E t) { int[] hashes = getTwoHashes(t, seed1, seed2, g.length); return g[hashes[0]] + g[hashes[1]]; } } /** we'll use this elsewhere, so let's extract this logic into its own method */ private static int[] getTwoHashes(Object t, int seed1, int seed2, int n) { int hc = t.hashCode(); // don't call twice - premature optimization? int h1 = (hc ^ seed1) % n; int h2 = (hc ^ seed2) % n; if(h1 < 0) { h1 += n; } // Java modulus gives numbers -n < h1 < n... if(h2 < 0) { h2 += n; } // ...but we want positive numbers to use as indices if(h1 == h2) { h2 = (h2 + 1) % n; } // h1 == h2 violates some assumptions (see later) - this is a quick fix! return new int[]{h1, h2}; } |
That was the easy part - so how do we know what to put in g? The BMZ algorithm centres around treating this state as a graph. Each key is mapped to an edge (so that's it uses two queries - one for the vertex at each end) and each vertex has an integer attached to it. The vertices are numbered from 0 to n (I'll use the same letters as the paper to make it easier to read this side-by-side), and the integer attached to each vertex v is stored in the g array at index v. This means that the lookup operation in the Equivalence above adds the two numbers attached to vertices at either end of the edge that corresponds to the key.
So how should we choose how big n is? The answer again parallels the "First Draft" solution: we relax the problem slightly, and say that we only require a solution (i.e. a perfect hash Equivalence) with a reasonable probability. As above, we make several guesses, and fail if none of them reach an answer - and the relaxed problem means we can choose an n that is reasonable likely to give us a solution (much easier than working out an exact answer); the paper suggests this should be 1.15m.
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 |
⟨BMZ Method⟩+≡ public static <E> Equivalence<E> createBMZ( Collection<? extends E> keys, int maxTries /*100 in c implementation*/, double c /*1.15 suggested*/) { Random r = new Random(17); // fixed seed, so deterministic for (int tries = 0; tries < maxTries; ++tries) { int seed1 = r.nextInt(); int seed2 = r.nextInt(); int[] g = new int[(int)Math.ceil(c * keys.size())]; ⟨Put Numbers in g⟩ } ⟨Give Up⟩ } |
Now we have to choose what number to give each vertex so that the edges match to the perfect hash codes of the keys. It'll help if we break this problem down. In the following situations, a, b, c and d are vertices and the edges are numbered in square brackets (how we choose which number gets assigned to which edge comes later).
a
a single vertex can be trivially assigned zeroa--[4]--b
if either a or b are known, then the other is chosen so that a+b=4. If neither a nor b have been assigned then we can choose any pair of integers that sum to 4a--[4]--b--[7]--c--[4]--d
an interesting case - if we know the integer of one of the end vertices (a or d) in this chain we can work out the other vertices by walking along the chain, at each step picking a number to make the edge just walked calculate correctlya--b--c--a
this is going to be the really hard case to solve - cycles or vertices of degree) greater than two.
We'll therefore divide the vertices of the graph into two parts - one set that have to be solved the hard way (case 4 - called "critical nodes" in the paper), and others that can be solved by walking down chains or the other two simple cases. You can also see that loops in the graph (edges with both ends at the same vertex) will cause real problems - as (e.g.) if the edge needs to be an odd number, and the vertex stores an integer then we can't solve this graph. This is why the BMZ Equivalence class adds one to one of the hashes in a lookup if both hashes are the same - this turns loops into normal edges.
We'll first need to convert the Objects passed to the graph into a set of edges (in O(m) time and space - or we'll lose any big-O speedup this algorithm gives). We'll make our domain objects immutable, and not worry about all the garbage they make.
111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 |
⟨Graph Utilities⟩+≡ private static final class Edge { final int a; final int b; Edge(int[] ab) { this.a = ab[0]; this.b = ab[1]; } public String toString() { return String.format("(%d,%d)", a, b); } } private static final class Graph { final List<Edge> edges; /** indexed by vertex, holds list of vertices that vertex is connected to */ final List<Integer>[] adjacencyList; Graph(int n, int m) { this.edges = new ArrayList<Edge>(m); this.adjacencyList = new List[n]; } /** @returns true if this edge is a duplicate */ boolean addEdge(Edge e) { edges.add(e); if(getAdjacencyList(e.a).contains(e.b)) return true; // linear, but list should be v. small getAdjacencyList(e.a).add(e.b); getAdjacencyList(e.b).add(e.a); return false; } private List<Integer> getAdjacencyList(int forVertex) { List<Integer> ret = adjacencyList[forVertex]; return ret == null ? (adjacencyList[forVertex] = new LinkedList<Integer>()) : ret; } } private static Graph toGraph(Collection<?> objects, int seed1, int seed2, int n) { Graph ret = new Graph(n, objects.size()); for(Object object : objects) { if(ret.addEdge(new Edge(getTwoHashes(object, seed1, seed2, n)))) return null; } return ret; } ⟨Put Numbers in g⟩+≡ Graph graph = toGraph(keys, seed1, seed2, g.length); if(graph == null) { continue; } // some duplicates - try again with new seeds |
So how do we work out if a node is "critical" or not? We know that degree 0 and 1 nodes definitely aren't critical, so we'll start by eliminating them. This leaves us with the remaining tangle mess (or messes - the graph could be disconnected). We can then "strip off" any chains of edges (case 3 above) as we can solve them the easy way. We can find the ends of all the chains (if there are any) by looking through all the degree-one vertices, and then follow the chain towards the mess as far as it'll go, removing any vertices we cross from the critical set:
151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 |
⟨Find Critical Nodes Helper⟩+≡ private static BitSet findCriticalNodes(Graph graph, int n) { // calculate node degrees... int[] degrees = new int[n]; for(Edge edge : graph.edges){ ++degrees[edge.a]; ++degrees[edge.b]; }; // ...and trim the chains... List<Integer> degree1 = new LinkedList<Integer>(); for(int i=0; i<n; ++i) { if(degrees[i] == 1) degree1.add(i); } while(!degree1.isEmpty()){ int v = degree1.remove(0); --degrees[v]; if(graph.adjacencyList[v] != null) for(int adjacent : graph.adjacencyList[v] ) if(--degrees[adjacent] == 1 ) degree1.add(adjacent); } // ...and return a bitmap of critical vertices BitSet ret = new BitSet(n); // all non-critical by default - very useful! for(int i=0; i<n; ++i) { if(degrees[i] > 1) ret.set(i); } return ret; } ⟨Put Numbers in g⟩+≡ BitSet criticalNodes = findCriticalNodes(graph, g.length); |
Now that we've classified the vertices into "critical" and (therefore) "non-critical" ones, we can start assigning integers to them. Assigning numbers to the critical vertices is essentially a graph colouring problem - we want to choose the integers so that adjacent nodes sum to the value of the edge (also - we haven't assigned the integers 0 to m-1 to the edges yet!). We'll therefore decide what integer each edge should have as we go along - this gives us a bit more flexibility when we assign integers to vertices. As we've still not assigned numbers to the non-critical vertices we don't have to assign edge integers sequentially in this step. We can skip any edge integers that would require impossible combinations of vertex integers, and assign these leftover edge integers to the non-critical vertices later.
We'll therefore have a bitmap ae that stores all the edge integers we've assigned so far. We can only assign each integer to an edge once or we won't end up with a perfect hash (remember, each edge is a key and a perfect hash assigns a different integer to each key).
179 180 |
⟨Put Numbers in g⟩+≡ BitSet ae = new BitSet(g.length); |
We'll call the value we'll try to give to the next critical vertex x, and will start our assignment at the lowest critical vertex (this is an arbitary choice - we need to start our depth-first search somewhere). For each vertex we process, we must make sure the integer we give it (i.e. x) doesn't cause two edges to end up with the same integer (as each edge is a key, and two keys that hash to the same number means our hash isn't perfect). We'll therefore just keep incrementing the x (in getXThatSatifies) until it doesn't break this invariant. However, we mustn't forget the other invariant - the hash of each key (i.e. integer assigned to each edge) must be between 0 and m-1. We'll have to add a bit of validation every time we pick a new x; we'll check every adjacent vertex to make sure this new x doesn't cause the edge to have the same value as one of the other edges.
186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 |
⟨Label Critical Nodes Helper⟩+≡ /** @returns false if we couldn't assign the integers */ private static boolean assignIntegersToCriticalVertices(Graph graph, int[] g, BitSet ae, BitSet criticalNodes) { int x = 0; List<Integer> toProcess = new LinkedList<Integer>(); BitSet assigned = new BitSet(g.length); while(!assigned.equals(criticalNodes)) { BitSet unprocessed = ((BitSet)criticalNodes.clone()); unprocessed.andNot(assigned); toProcess.add(unprocessed.nextSetBit(0)); // start at the lowest unassigned critical vertex // assign another "tree" of vertices - not all critical ones are necessarily connected! x = processCriticalNodes(toProcess, graph, ae, g, x, assigned, criticalNodes); if(x < 0) return false; // x is overloaded as a failure signal } return true; } /** process a single "tree" of connected critical nodes, rooted at the vertex in toProcess */ private static int processCriticalNodes(List<Integer> toProcess, Graph graph, BitSet ae, int[] g, int x, BitSet assigned, BitSet criticalNodes) { while(!toProcess.isEmpty()) { int v = toProcess.remove(0); if(v < 0 || assigned.get(v)) continue; // there are no critical nodes || already done this vertex if(graph.adjacencyList[v] != null) { x = getXThatSatifies(graph.adjacencyList[v], x, ae, assigned, g); for(Integer adjacent : graph.adjacencyList[v]) { if(!assigned.get(adjacent) && criticalNodes.get(adjacent) && v!= adjacent) { // give this one an integer, & note we shouldn't have loops - except if there is one key toProcess.add(adjacent); } if(assigned.get(adjacent)) { int edgeXtoAdjacent = x + g[adjacent]; // if x is ok, then this edge is now taken if(edgeXtoAdjacent >= graph.edges.size()) return -1; // this edge is too big! we're only assigning between 0 & m-1 ae.set(edgeXtoAdjacent); } } } g[v] = x; assigned.set(v); // assign candidate x to g ++x; // next v needs a new candidate x } return x; // will use this as a candidate for other "trees" of critical vertices } private static int getXThatSatifies(List<Integer> adjacencyList, int x, BitSet ae, BitSet assigned, int[] g) { for(Integer adjacent : adjacencyList) { if(assigned.get(adjacent) /*only covers critical nodes*/ && ae.get(g[adjacent] + x)) { // if we assign x to v, then the edge between v & and 'adjacent' will // be a duplicate - so our hash code won't be perfect! Try again with a new x: return getXThatSatifies(adjacencyList, x + 1, ae, assigned, g); } } return x; // this one satisfies all edges } ⟨Put Numbers in g⟩+≡ if(!assignIntegersToCriticalVertices(graph, g, ae, criticalNodes)) continue; // try again from the start with different seeds |
We've done the hard part - now it's all downhill from here. We've got all integers we haven't assigned to edges as zeros in the ae BitSet, and we know that the edges between vertices in the non-critical group are just single chains (i.e case 3 above). We'll therefore do a breadth-first search of the vertices starting at the critical ones, and every time we go from a critical to a non-critical vertex or go from one non-critical vertex to another we'll assign integers to those non-critical vertices so that the edge between them is the next edge unassigned in the ae set:
242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 |
⟨Label Non Critical Nodes Helper⟩+≡ private static void assignIntegersToNonCriticalVertices(Graph graph, int[] g, BitSet ae, BitSet criticalNodes) { List<Integer> toProcess = new LinkedList<Integer>(); for(int v = criticalNodes.nextSetBit(0); v != -1; v = criticalNodes.nextSetBit(v+1)) { toProcess.add(v); } // load with the critical vertices BitSet visited = (BitSet) criticalNodes.clone(); processNonCriticalNodes(toProcess, graph, ae, visited, g); // process the critical nodes // we've done everything reachable from the critical nodes - but // what about isolated chains? for(int v = visited.nextClearBit(0); v != -1 && v < g.length; v = visited.nextClearBit(v+1)) { toProcess.add(v); processNonCriticalNodes(toProcess, graph, ae, visited, g); } } /** process everything in the list and all vertices reachable from it */ private static void processNonCriticalNodes(List<Integer> toProcess, Graph graph, BitSet ae, BitSet visited, int[] g) { int nextEdge = ae.nextClearBit(0); while(!toProcess.isEmpty()) { int v = toProcess.remove(0); if(v < 0) continue; // there are no critical nodes if(graph.adjacencyList[v] != null) { for(int adjacent : graph.adjacencyList[v]) { if(!visited.get(adjacent) && v != adjacent) { // shouldn't have loops - only if one key // we must give it a value g[adjacent] = nextEdge - g[v]; // i.e. g[v] + g[a] = edge as needed toProcess.add(adjacent); ae.set(nextEdge); nextEdge = ae.nextClearBit(nextEdge + 1); } } } visited.set(v); } } ⟨Put Numbers in g⟩+≡ assignIntegersToNonCriticalVertices(graph, g, ae, criticalNodes); // this can't fail |
And that's it! Every vertex has a value so our graph is complete. We'll just return it, wrapped in the Equivalence we made above:
282 283 |
⟨Put Numbers in g⟩+≡ return new BMZ<E>(seed1, seed2, g); |
To make this code into a useful library we'll add an public static method that chooses the hash algorithm and fills in some of the default parameters:
289 290 291 292 293 |
⟨Generic Method⟩+≡ /** makes a perfect hash function for the given set of keys */ public static <E> Equivalence<E> create(Collection<? extends E> keys) { return createBMZ(keys, 100, 1.15); } |
And here's the overall framework of the class:
299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 |
⟨com/googlecode/perfecthashes/PerfectHashes.java:*⟩+≡ package com.googlecode.perfecthashes; ⟨Imports⟩ public final class PerfectHashes { /* ⟨First Draft⟩ */ ⟨Returned Equivalence⟩ ⟨BMZ Method⟩ ⟨Generic Method⟩ // Helpers ⟨Graph Utilities⟩ ⟨Find Critical Nodes Helper⟩ ⟨Label Critical Nodes Helper⟩ ⟨Label Non Critical Nodes Helper⟩ } |
And some final Java boilerplate:
320 321 322 323 324 325 326 327 328 329 330 |
⟨Imports⟩+≡ import java.io.Serializable; import java.util.ArrayList; import java.util.BitSet; import java.util.Collection; import java.util.LinkedList; import java.util.List; import java.util.Random; import com.google.common.base.Equivalence; ⟨Give Up⟩+≡ throw new IllegalStateException("giving up - perfect hashcode too hard to find!"); |
And we're finished! The code's here and you can use it in a maven project by adding the dependency:
<dependency> <groupId>com.googlecode.perfect-hashes</groupId> <artifactId>perfect-hashes</artifactId> <version>1.0.1</version> </dependency>
Too late to finish the article, but there is an integer overflow bug in the getTwoHashes method, in the h1 == h2 case. If h1 == h2 == Integer.MAX_VALUE, h2 + 1 < 0, so h2_final = (h2 + 1) % n < 0. Move the line up, and you're right as rain. You even save a modulus operation in that case!
ReplyDeleteprivate static int[] getTwoHashes(Object t, int seed1, int seed2, int n) {
int hc = t.hashCode(); // don't call twice - premature optimization?
int h1 = (hc ^ seed1) % n;
int h2 = (hc ^ seed2) % n;
if(h1 == h2) { h2 = h2 + 1; }
if(h1 < 0) { h1 += n; } // Java modulus gives numbers -n < h1 < n...
if(h2 < 0) { h2 += n; } // ...but we want positive numbers to use as indices
return new int[]{h1, h2};
}
You're right about fewer modulus problems - but I've written unit tests and think this bit's safe from overflows. h1 and h2 will only ever be between 0 and Integer.MAX_VALUE - 1 due to the mod-n (e.g. n = 0 or n = Integer.MAX_VALUE) so if h1 == h2 == Integer.MAX_VALUE - 1 then adding one to h1 or h2 won't overflow.
DeleteDoes the solution assume that hashCode() never returns the same hash code for different keys?
ReplyDeleteYes - although it will fail gracefully (by throwing an IllegalStateException). You can always work around this by wrapping your keys to change their hashCode (e.g. to System.identityHashCode, although that's not unique either)...
Delete