Wednesday 28 March 2012

Perfect Hashes in Java

Given a set of m keys, a minimal perfect hash function maps each key to an integer between 0 and m-1, and (most importantly) each key maps to a different integer. This means you can use the "perfect hash" number as an index into an array (i.e. use it as a hashmap) for guaranteed O(1) insertions & lookups. I'm going to explain the BMZ algorithm, roughly following the author's C implementation, as it creates perfect hashes in O(m) space and time. I'll end up with an implementation of Google Guava's Equivalence, as then you can use wrappers and standard Java HashMaps to create an efficient Collection with a minimum of wheel-reinventing.
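As a toy illustration of why that's useful (with hypothetical keys and hand-assigned hash values - the rest of this post is about computing them automatically), a minimal perfect hash lets a plain array act as a collision-free map:

```java
public class MinimalPerfectHashDemo {
    // A hand-made "perfect hash" for exactly these three keys: each key gets
    // a distinct integer in [0, m-1], so an array can act as the hashmap.
    static int perfectHash(String key) {
        if (key.equals("apple")) return 0;
        if (key.equals("banana")) return 1;
        if (key.equals("cherry")) return 2;
        throw new IllegalArgumentException("not one of the m keys: " + key);
    }

    public static void main(String[] args) {
        String[] values = new String[3]; // no buckets, no collision handling
        values[perfectHash("apple")] = "red";
        values[perfectHash("banana")] = "yellow";
        values[perfectHash("cherry")] = "dark red";
        System.out.println(values[perfectHash("banana")]); // prints "yellow"
    }
}
```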

But first I'll start with a simple example. Working in Java is useful as we can re-use our key Objects' hashCode methods to do most of the work. However, it's unlikely that the numbers hashCode returns are "perfect" - so we'll have to modify them deterministically. I'll use an idea I got from the Jenkins hash algorithm: choose a seed integer and mix it with the hashCodes of the keys. As we want the resulting hashCode to lie between 0 and m-1 we'll just do mod-m on the result after mixing in the seed - so now we just have to worry about choosing a seed that makes each object map to a different number.

⟨First Draft⟩+≡ 
private static final class MixSeed<E> extends Equivalence<E> {
    private final int seed;
    private final int m;
    MixSeed(int seed, int m) { this.seed = seed; this.m = m; }
    protected boolean doEquivalent(E a, E b) {
        return a.equals(b);
    }
    protected int doHash(E t) {
        int h = (t.hashCode() ^ seed) % m;
        return h < 0 ? h + m : h; // Java's % can return negative values
    }
}

The first-draft approach is simply to guess a seed: if the resulting hashCodes are perfect, return an Equivalence that uses that seed; if not, try again. We don't want to keep looping forever, so we fix the number of tries and fail if no perfect hash is found.

⟨First Draft⟩+≡ 
public static <E> Equivalence<E> create(Collection<? extends E> keys) {
    Random r = new Random(17); // fixed seed, so deterministic
    for (int tries = 0; tries < 1000; ++tries) {
        int seed = r.nextInt();
        SortedSet<Integer> perfectHashes = Sets.newTreeSet();
        for (E key : keys) {
            int h = (key.hashCode() ^ seed) % keys.size();
            perfectHashes.add(h < 0 ? h + keys.size() : h);
        }
        // m distinct hashes in the range [0, m-1] means the hash is perfect
        if (perfectHashes.size() == keys.size()) {
            return new MixSeed<E>(seed, keys.size());
        }
    }
    ⟨Give Up⟩
}

This is clearly not very likely to succeed. To work out the exact probability of an iteration finding a perfect hash, we'll assume the hashCode mixed with the seed is uniformly distributed between 0 and m-1. The first key can be mapped to any of the m integers in this range, the second to any of the m-1 remaining integers, the third to the m-2 remaining integers, &c., and the probability of this happening is m/m * (m-1)/m * (m-2)/m * ... * 1/m, which is m!/m^m - so not very likely!
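To get a feel for just how unlikely, here's a quick back-of-the-envelope sketch (not part of the library) that evaluates m!/m^m by multiplying the per-key survival probabilities:

```java
public class CollisionOdds {
    // probability that m uniform hashes into [0, m-1] are all distinct: m!/m^m
    static double perfectProbability(int m) {
        double p = 1.0;
        for (int k = 1; k < m; ++k) {
            p *= (double) (m - k) / m; // chance the (k+1)'th key lands on a free slot
        }
        return p;
    }

    public static void main(String[] args) {
        for (int m : new int[]{5, 10, 20}) {
            System.out.printf("m=%2d: p=%.2e%n", m, perfectProbability(m));
        }
    }
}
```

Even at m=10 the odds of a single guess working are about 3.6e-4, so a thousand tries will usually fail for any realistically sized key set.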

The BMZ algorithm takes a pretty interesting approach. To build the perfect hash in O(m) time we can only store an O(m) amount of state, and within that budget we could either store a few really big state objects, or make several queries to smaller state objects (which is what BMZ does). The problem then becomes: (1) how do you work out what queries to make, and more importantly (2) how do you build up the state so that each key ends up with a different hash number.

BMZ queries the state twice to get the data it needs to return the hash number, and solves the first step by a logical extension of the first draft above: instead of having one seed, have two! The Equivalence below takes the shared state g (an array whose length n is a little larger than m), queries it twice with the two different seeds, and combines the two values it finds by simply summing them. I've made the Equivalence Serializable so that once you've done the hard work of generating it you can persist it somewhere and load it in other applications.

⟨Returned Equivalence⟩+≡ 
private static final class BMZ<E> extends Equivalence<E> implements Serializable {
    private final int seed1;
    private final int seed2;
    private final int[] g;
    BMZ(int seed1, int seed2, int[] g) { 
        this.seed1 = seed1; this.seed2 = seed2; this.g = g;
    }
    protected boolean doEquivalent(E a, E b) {
        return a.equals(b);
    }
    protected int doHash(E t) {
        int[] hashes = getTwoHashes(t, seed1, seed2, g.length); 
        return g[hashes[0]] + g[hashes[1]];
    }
}
/** we'll use this elsewhere, so let's extract this logic into its own method */
private static int[] getTwoHashes(Object t, int seed1, int seed2, int n) {
    int hc = t.hashCode(); // don't call twice - premature optimization?
    int h1 = (hc ^ seed1) % n;
    int h2 = (hc ^ seed2) % n;
    if(h1 < 0) { h1 += n; } // Java modulus gives numbers -n < h1 < n...
    if(h2 < 0) { h2 += n; } // ...but we want positive numbers to use as indices
    if(h1 == h2) { h2 = (h2 + 1) % n; } // h1 == h2 violates some assumptions (see later) - this is a quick fix!
    return new int[]{h1, h2};
}

That was the easy part - so how do we know what to put in g? The BMZ algorithm centres around treating this state as a graph. Each key is mapped to an edge (which is why it makes two queries - one for the vertex at each end), and each vertex has an integer attached to it. The vertices are numbered from 0 to n-1 (I'll use the same letters as the paper to make it easier to read this side-by-side), and the integer attached to each vertex v is stored in the g array at index v. This means that the lookup operation in the Equivalence above adds the two numbers attached to the vertices at either end of the edge that corresponds to the key.
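A toy example (with made-up g values, not output from the algorithm) shows the invariant we're after - each key's edge must have endpoint values that sum to a distinct integer in [0, m-1]:

```java
public class EdgeSumSketch {
    public static void main(String[] args) {
        int[] g = {0, 0, 1, 1};                   // integer attached to each vertex 0..3
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}}; // one edge per key (m = 3 keys)
        for (int[] e : edges) {
            // the key's hash is the sum of the g values at the edge's endpoints
            System.out.println(g[e[0]] + g[e[1]]); // prints 0, 1, 2 - a perfect hash
        }
    }
}
```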

So how should we choose how big n is? The answer again parallels the "First Draft" solution: we relax the problem slightly, and say that we only require a solution (i.e. a perfect hash Equivalence) with a reasonable probability. As above, we make several guesses, and fail if none of them reaches an answer - and the relaxed problem means we can choose an n that is reasonably likely to give us a solution (much easier than working out an exact answer); the paper suggests this should be 1.15m.

⟨BMZ Method⟩+≡ 
public static <E> Equivalence<E> createBMZ(
    Collection<? extends E> keys,
    int maxTries /*100 in c implementation*/, 
    double c /*1.15 suggested*/) 
{
    Random r = new Random(17); // fixed seed, so deterministic
    for (int tries = 0; tries < maxTries; ++tries) {
        int seed1 = r.nextInt();
        int seed2 = r.nextInt();
        int[] g = new int[(int)Math.ceil(c * keys.size())];
        ⟨Put Numbers in g⟩
    }
    ⟨Give Up⟩
}

Now we have to choose what number to give each vertex so that the edges match to the perfect hash codes of the keys. It'll help if we break this problem down. In the following situations, a, b, c and d are vertices and the edges are numbered in square brackets (how we choose which number gets assigned to which edge comes later).

  1. a
    a single vertex can be trivially assigned zero
  2. a--[4]--b
    if either a or b is known, then the other is chosen so that a+b=4. If neither a nor b has been assigned then we can choose any pair of integers that sum to 4
  3. a--[4]--b--[7]--c--[4]--d
    an interesting case - if we know the integer of one of the end vertices (a or d) in this chain we can work out the other vertices by walking along the chain, at each step picking a number that makes the edge just walked calculate correctly
  4. a--b--c--a
    this is going to be the really hard case to solve - cycles, or vertices of degree greater than two
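Case 3 is worth a quick sketch: fixing one end of the chain a--[4]--b--[7]--c--[4]--d to zero determines all the other vertices, because each step just subtracts the value we've already assigned from the value the edge needs (the numbers here are from the example above, not from the real algorithm):

```java
import java.util.Arrays;

public class ChainWalk {
    // walk a chain of edges, assigning each vertex so the edge sums work out
    static int[] solveChain(int[] edgeValues) {
        int[] g = new int[edgeValues.length + 1];
        g[0] = 0; // free choice at the end of the chain
        for (int i = 0; i < edgeValues.length; ++i) {
            g[i + 1] = edgeValues[i] - g[i]; // make the edge just walked sum correctly
        }
        return g;
    }

    public static void main(String[] args) {
        // a--[4]--b--[7]--c--[4]--d
        System.out.println(Arrays.toString(solveChain(new int[]{4, 7, 4}))); // [0, 4, 3, 1]
    }
}
```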

We'll therefore divide the vertices of the graph into two parts - one set that has to be solved the hard way (case 4 - called "critical nodes" in the paper), and the rest, which can be solved by walking down chains or by the other two simple cases. You can also see that loops in the graph (edges with both ends at the same vertex) will cause real problems: a loop at vertex a has hash g[a] + g[a] = 2*g[a], which is always even, so if that edge needs to be an odd number we can't solve the graph. This is why the BMZ Equivalence class adds one to one of the hashes in a lookup if both hashes are the same - it turns loops into normal edges.

We'll first need to convert the Objects passed in into a graph of edges (in O(m) time and space - or we'll lose any big-O speedup this algorithm gives). We'll make our domain objects immutable, and not worry about all the garbage they create.

⟨Graph Utilities⟩+≡ 
private static final class Edge {
     final int a; final int b;
     Edge(int[] ab) { this.a = ab[0]; this.b = ab[1]; } 
     public String toString() { return String.format("(%d,%d)", a, b); }
}
private static final class Graph {
    final List<Edge> edges;
    /** indexed by vertex, holds list of vertices that vertex is connected to */
    final List<Integer>[] adjacencyList;
    Graph(int n, int m) { 
        this.edges = new ArrayList<Edge>(m);
        this.adjacencyList = new List[n];  
    }
    /** @returns true if this edge is a duplicate */
    boolean addEdge(Edge e) {
        edges.add(e);
        if(getAdjacencyList(e.a).contains(e.b)) return true; // linear, but list should be v. small
        getAdjacencyList(e.a).add(e.b); 
        getAdjacencyList(e.b).add(e.a); 
        return false;
    }
    private List<Integer> getAdjacencyList(int forVertex) {
        List<Integer> ret = adjacencyList[forVertex];
        return ret == null ? (adjacencyList[forVertex] = new LinkedList<Integer>()) : ret;
    }
}
private static Graph toGraph(Collection<?> objects, int seed1, int seed2, int n) {
    Graph ret = new Graph(n, objects.size());
    for (Object object : objects) {
        if (ret.addEdge(new Edge(getTwoHashes(object, seed1, seed2, n)))) return null; // duplicate edge
    }
    return ret;
}
⟨Put Numbers in g⟩+≡
Graph graph = toGraph(keys, seed1, seed2, g.length);
if(graph == null) { continue; } // some duplicates - try again with new seeds

So how do we work out if a node is "critical" or not? We know that degree 0 and 1 nodes definitely aren't critical, so we'll start by eliminating them. This leaves us with the remaining tangled mess (or messes - the graph could be disconnected). We can then "strip off" any chains of edges (case 3 above) as we can solve them the easy way. We can find the ends of all the chains (if there are any) by looking through all the degree-one vertices, and then follow each chain towards the mess as far as it'll go, removing any vertices we cross from the critical set:

⟨Find Critical Nodes Helper⟩+≡
private static BitSet findCriticalNodes(Graph graph, int n) {
    // calculate node degrees...
    int[] degrees = new int[n];
    for(Edge edge : graph.edges){ ++degrees[edge.a]; ++degrees[edge.b]; };
    // ...and trim the chains...
    List<Integer> degree1 = new LinkedList<Integer>();
    for(int i=0; i<n; ++i) { if(degrees[i] == 1) degree1.add(i); }
    while(!degree1.isEmpty()){
        int v = degree1.remove(0); --degrees[v];
        if(graph.adjacencyList[v] != null) 
            for(int adjacent : graph.adjacencyList[v] ) 
                if(--degrees[adjacent] == 1 ) degree1.add(adjacent);
    }
    // ...and return a bitmap of critical vertices
    BitSet ret = new BitSet(n); // all non-critical by default - very useful!
    for(int i=0; i<n; ++i) { if(degrees[i] > 1) ret.set(i); }
    return ret;
}
⟨Put Numbers in g⟩+≡
BitSet criticalNodes = findCriticalNodes(graph, g.length);

Now that we've classified the vertices into "critical" and (therefore) "non-critical" ones, we can start assigning integers to them. Assigning numbers to the critical vertices is essentially a graph colouring problem - we want to choose the integers so that adjacent nodes sum to the value of the edge (also - we haven't assigned the integers 0 to m-1 to the edges yet!). We'll therefore decide what integer each edge should have as we go along - this gives us a bit more flexibility when we assign integers to vertices. As we've still not assigned numbers to the non-critical vertices we don't have to assign edge integers sequentially in this step. We can skip any edge integers that would require impossible combinations of vertex integers, and assign these leftover edge integers to the non-critical vertices later.

We'll therefore have a bitmap ae that stores all the edge integers we've assigned so far. We can only assign each integer to an edge once or we won't end up with a perfect hash (remember, each edge is a key and a perfect hash assigns a different integer to each key).

⟨Put Numbers in g⟩+≡
BitSet ae = new BitSet(g.length);

We'll call the value we'll try to give to the next critical vertex x, and will start our assignment at the lowest critical vertex (this is an arbitrary choice - we just need to start our search somewhere). For each vertex we process, we must make sure the integer we give it (i.e. x) doesn't cause two edges to end up with the same integer (as each edge is a key, and two keys that hash to the same number means our hash isn't perfect). We'll therefore just keep incrementing x (in getXThatSatifies) until it doesn't break this invariant. However, we mustn't forget the other invariant - the hash of each key (i.e. the integer assigned to each edge) must be between 0 and m-1. We'll have to add a bit of validation every time we pick a new x; we'll check every adjacent vertex to make sure this new x doesn't cause the edge to have the same value as one of the other edges.

⟨Label Critical Nodes Helper⟩+≡ 
/** @returns false if we couldn't assign the integers */
private static boolean assignIntegersToCriticalVertices(Graph graph, int[] g, BitSet ae, BitSet criticalNodes) {
    int x = 0;
    List<Integer> toProcess = new LinkedList<Integer>(); 
    BitSet assigned = new BitSet(g.length);
    while(!assigned.equals(criticalNodes)) {
        BitSet unprocessed = ((BitSet)criticalNodes.clone()); unprocessed.andNot(assigned);
        toProcess.add(unprocessed.nextSetBit(0)); // start at the lowest unassigned critical vertex
        // assign another "tree" of vertices - not all critical ones are necessarily connected!
        x = processCriticalNodes(toProcess, graph, ae, g, x, assigned, criticalNodes);
        if(x < 0) return false; // x is overloaded as a failure signal
    }
    return true;
}
/** process a single "tree" of connected critical nodes, rooted at the vertex in toProcess */
private static int processCriticalNodes(List<Integer> toProcess, Graph graph, BitSet ae, int[] g, int x, BitSet assigned, BitSet criticalNodes) {
    while(!toProcess.isEmpty()) {
        int v = toProcess.remove(0);
        if(v < 0 || assigned.get(v)) continue; // there are no critical nodes || already done this vertex
        if(graph.adjacencyList[v] != null) {
            x = getXThatSatifies(graph.adjacencyList[v], x, ae, assigned, g);
            for(Integer adjacent : graph.adjacencyList[v]) {
                if(!assigned.get(adjacent) && criticalNodes.get(adjacent) && v!= adjacent) { 
                    // give this one an integer, & note we shouldn't have loops - except if there is one key 
                    toProcess.add(adjacent); 
                } 
                if(assigned.get(adjacent)) {  
                    int edgeXtoAdjacent = x + g[adjacent]; // if x is ok, then this edge is now taken
                    if(edgeXtoAdjacent >= graph.edges.size()) return -1; // this edge is too big! we're only assigning between 0 & m-1
                    ae.set(edgeXtoAdjacent); 
                } 
            }
        }
        g[v] = x; assigned.set(v); // assign candidate x to g
        ++x; // next v needs a new candidate x
    } 
    return x; // will use this as a candidate for other "trees" of critical vertices
}
private static int getXThatSatifies(List<Integer> adjacencyList, int x, BitSet ae, BitSet assigned, int[] g) {
    for(Integer adjacent : adjacencyList) {
        if(assigned.get(adjacent) /*only covers critical nodes*/ && ae.get(g[adjacent] + x)) { 
            // if we assign x to v, then the edge between v & and 'adjacent' will
            // be a duplicate - so our hash code won't be perfect! Try again with a new x:
            return getXThatSatifies(adjacencyList, x + 1, ae, assigned, g);
        } 
    }
    return x; // this one satisfies all edges
}
⟨Put Numbers in g⟩+≡
if(!assignIntegersToCriticalVertices(graph, g, ae, criticalNodes)) continue; // try again from the start with different seeds

We've done the hard part - it's all downhill from here. All the integers we haven't yet assigned to edges are zeros in the ae BitSet, and we know that the edges between vertices in the non-critical group are just single chains (i.e. case 3 above). We'll therefore do a breadth-first search of the vertices starting at the critical ones, and every time we go from a critical to a non-critical vertex, or from one non-critical vertex to another, we'll assign integers to those non-critical vertices so that the edge between them gets the next unassigned integer in the ae set:

⟨Label Non Critical Nodes Helper⟩+≡ 
private static void assignIntegersToNonCriticalVertices(Graph graph, int[] g, BitSet ae, BitSet criticalNodes) {
    List<Integer> toProcess = new LinkedList<Integer>();
    for(int v = criticalNodes.nextSetBit(0); v != -1; v = criticalNodes.nextSetBit(v+1)) { toProcess.add(v); } // load with the critical vertices
    BitSet visited = (BitSet) criticalNodes.clone();
    processNonCriticalNodes(toProcess, graph, ae, visited, g); // process the critical nodes
    // we've done everything reachable from the critical nodes - but
    // what about isolated chains?
    for(int v = visited.nextClearBit(0); v != -1 && v < g.length; v = visited.nextClearBit(v+1)) { 
        toProcess.add(v);
        processNonCriticalNodes(toProcess, graph, ae, visited, g);
    }    
}
/** process everything in the list and all vertices reachable from it */
private static void processNonCriticalNodes(List<Integer> toProcess, Graph graph, BitSet ae, BitSet visited, int[] g) {
    int nextEdge = ae.nextClearBit(0);
    while(!toProcess.isEmpty()) {
        int v = toProcess.remove(0);
        if(v < 0) continue; // there are no critical nodes
        if(graph.adjacencyList[v] != null) {
            for(int adjacent : graph.adjacencyList[v]) {
                if(!visited.get(adjacent) && v != adjacent) { // shouldn't have loops - only if one key 
                    // we must give it a value
                    g[adjacent] = nextEdge - g[v]; // i.e. g[v] + g[a] = edge as needed
                    toProcess.add(adjacent);
                    ae.set(nextEdge);
                    nextEdge = ae.nextClearBit(nextEdge + 1);                    
                }
            }
        }
        visited.set(v);
    } 
}
⟨Put Numbers in g⟩+≡
assignIntegersToNonCriticalVertices(graph, g, ae, criticalNodes); // this can't fail

And that's it! Every vertex has a value so our graph is complete. We'll just return it, wrapped in the Equivalence we made above:

⟨Put Numbers in g⟩+≡ 
return new BMZ<E>(seed1, seed2, g);

To make this code into a useful library we'll add a public static method that chooses the hash algorithm and fills in some of the default parameters:

⟨Generic Method⟩+≡
/** makes a perfect hash function for the given set of keys */
public static <E> Equivalence<E> create(Collection<? extends E> keys) {
    return createBMZ(keys, 100, 1.15);
}
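To close the loop on the wrappers-plus-HashMap idea from the introduction: Guava's Equivalence.wrap(key) returns a Wrapper whose hashCode and equals delegate to the Equivalence, so a standard HashMap keyed on wrappers ends up using the perfect hash codes. Here's a stdlib-only sketch of that mechanism (the precomputed table below is a hypothetical stand-in for the BMZ doHash, so this compiles without Guava):

```java
import java.util.HashMap;
import java.util.Map;

public class WrapperSketch {
    // hypothetical stand-in for a perfect-hash Equivalence's doHash
    static final Map<String, Integer> PERFECT = new HashMap<String, Integer>();
    static {
        PERFECT.put("foo", 0); PERFECT.put("bar", 1); PERFECT.put("baz", 2);
    }

    // like Guava's Equivalence.Wrapper: delegates hashCode to the equivalence
    static final class Wrapper {
        final String key;
        Wrapper(String key) { this.key = key; }
        @Override public int hashCode() { return PERFECT.get(key); }
        @Override public boolean equals(Object o) {
            return o instanceof Wrapper && key.equals(((Wrapper) o).key);
        }
    }

    public static void main(String[] args) {
        Map<Wrapper, Integer> map = new HashMap<Wrapper, Integer>();
        map.put(new Wrapper("foo"), 42);
        System.out.println(map.get(new Wrapper("foo"))); // prints 42
    }
}
```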

And here's the overall framework of the class:

⟨com/googlecode/perfecthashes/PerfectHashes.java:*⟩+≡
package com.googlecode.perfecthashes;
⟨Imports⟩
public final class PerfectHashes {
    /*
    ⟨First Draft⟩
    */
    ⟨Returned Equivalence⟩
    ⟨BMZ Method⟩
    ⟨Generic Method⟩
    // Helpers
    ⟨Graph Utilities⟩
    ⟨Find Critical Nodes Helper⟩
    ⟨Label Critical Nodes Helper⟩
    ⟨Label Non Critical Nodes Helper⟩
}

And some final Java boilerplate:

⟨Imports⟩+≡ 
import java.io.Serializable;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;
import com.google.common.base.Equivalence;
⟨Give Up⟩+≡
throw new IllegalStateException("giving up - perfect hashcode too hard to find!");

And we're finished! The code's here and you can use it in a maven project by adding the dependency:

<dependency>
  <groupId>com.googlecode.perfect-hashes</groupId>
  <artifactId>perfect-hashes</artifactId>
  <version>1.0.1</version>
</dependency>

Friday 9 March 2012

A Literate Programming Tool

This is the last blog post I'll be hand-writing html for. I'd like a tool that can take a markdown document (as they're easy to write) and either "weave" it into a legible html document or "tangle" it into source code. There are loads of tools that do this already - but I want one that requires a minimum of markdown syntax, can understand (read: syntax-highlight) many different languages, and most importantly can produce more than one source file from a single markdown document (for polyglot programming, or code & config, or antlr grammars and other DSLs). I don't want to re-invent any more of the wheel than I have to, so I'll use redcarpet to do the markdown parsing, trollop for a command-line interface and coderay to generate syntax-highlighted html.

I'll start with a basic overview of the functionality:


require 'trollop'
require 'redcarpet'
require 'coderay'
require 'fileutils'

opts = Trollop::options do
  opt :weave, "Produce documentation", :short => 'w'
  opt :tangle, "Produce code", :short => 't'
  opt :outputdir, "Directory to write files to", :default => Dir.pwd, :short => 'o'
  opt :lang, "Default language of code", :default => "ruby", :short => 'l'
  opt :files, "Files to process", :type => :string, :short => 'f', :required => true
end

It's pretty straightforward; I want to tangle and weave in a single command as well as one or the other, and want to be able to override the output directory to aid integration into build chains. Coderay can't guess what (programming) language a piece of text is in, so I'll have to either specify it in the markdown (as part of a fenced code block) or use some default. I'll add some other markdown extensions that seem useful:


html_opts = {
  :fenced_code_blocks => true, 
  :superscript => true, 
  :tables => true, 
  :lax_html_blocks => true, 
  :strikethrough => true
}

The markdown must specify how to glue the various bits of code together. I like to think of it in terms of "anchors", or as a tree of bits of code. Each section of code can have one or more "anchors", or named markers, and should say which anchor it should be attached to. Each output source file has an artificial root anchor (called *). I'll delimit anchors using at-signs (as that seems to be a common feature of literate tags in other implementations), so something like @My Anchor Name@. I'll also indicate that a section of code should be added to an anchor using anchor-name-plus-equals, e.g. @My Anchor Name@ += .... A markdown code block can have zero or more of these plus-equals, and any code before the first plus-equals is added to the root (i.e. *) anchor of a file with the same name as the markdown document, but with a file extension corresponding to the default language (let's call this the "default source file"). You can add a block to the root of a specific source file by prepending the source file path (including folders) to the *, so something like @my_folder/my_output.c:*@ += .... We'll find these tags in the code using regexes:


$allowed = '(\w| )*|((.*:)?\*)'

And we'll have a poor-but-simple way of guessing the output extension for the default source file: a hardcoded hash keyed by the default language (config files, command line overrides or generally anything not hard-coded would be useful here - I haven't found a library that does this):


$ext_for_lang = {
  :ruby => 'rb',
  :c => 'c',
}

And as we should create any directory structure needed to write the source files to the locations given in their root tags, I'll add a helper (this should really be an optional argument to File.open - maybe I should monkey-patch it?):


def write_to path, data
  FileUtils.mkdir_p(File.dirname(path))
  File.open(path, 'w') {|f| f.write(data)}
end

I'll do "weaving" (converting to HTML) first. I'll use redcarpet's XHTML renderer for the most part - we just have to add some highlighting to the code blocks. I'll also use Coderay's :line_number_start feature to print the line numbers in the markdown document. There's no obvious helper method that redcarpet can give me (it's written in C, and only exposes some lightweight ruby object wrappers, so there's not much in the way of utility functions - or even any way to interact with the parsing process that I can see), so I'll just count the new lines between the start of the markdown document and this fragment of code (and yes, this is very inefficient and potentially incorrect if the code block is copy+pasted - maybe I should add a new-line counter to every callback).

There's one really big problem: "@" symbols and unicode. Any "@" passed to Coderay gets highlighted (i.e. surrounded by HTML tags), which will interfere with the anchor display. I want to display the tags as left- and right-angle brackets (a bit like the latex output of Knuth's web), so I'll have to substitute dummy strings in place of the "@"s before passing the code to Coderay for formatting, and then replace the dummy strings with the HTML entity codes for the symbols I want. Ideally I'd just pass the unicode equivalent of the left- and right-angle brackets, but that gets tagged by Coderay too.


class Weave < Redcarpet::Render::XHTML
  attr_accessor :default_lang, :original_text
  def block_code(code, lang)
    l_ang, r_ang, equiv = "__lang__", "__rang__", "__equiv__"
    line_num = @original_text[0, @original_text.index(code)].count("\n") + 1
    code = code.
      gsub(/@(#{$allowed})@\s*\+=/,"#{l_ang}\\1#{r_ang}+#{equiv}").
      gsub(/@(#{$allowed})@/,"#{l_ang}\\1#{r_ang}")
    code = CodeRay.
      scan(code, lang.nil? ? @default_lang : lang.to_sym).
      html(
        :wrap => :div, 
        :css => :style, 
        :line_numbers => :table,
        :line_number_start => line_num)
    code.
      gsub(/#{l_ang}(#{$allowed})#{r_ang}(\s*)\+#{equiv}/,'⟨\\1⟩+≡').
      gsub(/#{l_ang}(#{$allowed})#{r_ang}/,'⟨\\1⟩')
  end  
  def codespan(code); block_code(code,nil) end
end

The code that calls this is pretty standard boilerplate: create a parser, iterate over the input markdown files, render them to HTML & write the output to an HTML file with the same name as the markdown file (with an HTML file extension tacked on):


if opts.weave
  r = Redcarpet::Markdown.new(Weave, html_opts)
  r.renderer.default_lang = opts.lang.to_sym
  opts.files.split(',').each{|file|
    code = File.open(file, 'r'){|f|f.read}
    r.renderer.original_text = code
    html = r.render(code)
    write_to("#{opts.outputdir}/#{file}.html", html)
  }
end

The tangle code's a bit more interesting. We have to build up a hash of what strings of code are attached to each anchor, and then run a pass after processing every file to construct the output source files (fully & recursively resolving all the anchors in the code blocks). The normalise method converts the default root node * into a root node with the default source file path - so it needs to know what the markdown file name is (@file_no_ext) and what language the code is in (if it's not given) (@default_lang).

The chunk processing code splits the code block into sections to be added to different anchors (including a synthetic anchor-add for everything before the first anchor). It picks up the start index and the length of the anchor tokens so we can remove them from the code blocks - they won't be valid c/ruby/&c. code. There's also a synthetic anchor-add at the end to make the iteration easier.


class Tangle < Redcarpet::Render::Base
  attr_accessor :default_lang, :links, :file_no_ext
  def block_code(code, lang)
    chunks = 
      [{:start => 0, :anchor => '*', :anchor_len => 0}] + 
      code.scan(/(@(#{$allowed})@\s*\+=)/).map{|x|
        {:start => code.index(x[0]), :anchor => x[1], :anchor_len => x[0].length}
      } +
      [{:start => code.length}]
    (1..chunks.length-1).each{|index|
      last, this = chunks[index-1], chunks[index]
      @links[normalise(last[:anchor],lang)] << code[(last[:start] + last[:anchor_len])...this[:start]]
    }
    nil
  end
  def normalise(link_name, lang)
    if link_name == '*'
      "#{@file_no_ext}.#{$ext_for_lang[lang.nil? ? @default_lang : lang.to_sym]}:*"
    else
      link_name
    end
  end
  def codespan(code); block_code(code,nil) end
end

The last bit of code does the tangling. As in the weaving code it uses redcarpet to process each file, but this time we don't care about the output. Instead we're left with the links Hash, which has all the sections of code to be added to each anchor point. The (normalised) root anchor(s) will also be keys of this hash. There's one final step - we'd like to just extract the roots and write their arrays of code snippets to a file. As these snippets may still have anchors in them, we should remove the anchor tokens and replace them with any snippets that should be attached to those anchors, recursively, until no anchors are left. My implementation isn't very safe - you could easily make it recurse infinitely, or forget to attach snippets to an anchor due to a typo. It also doesn't warn you if anchors don't have any code attached to them - another potential sign of misspelt anchor names.


if opts.tangle
  links = Hash.new{|h,k|h[k]=[]}
  r = Redcarpet::Markdown.new(Tangle, html_opts)
  r.renderer.default_lang, r.renderer.links = opts.lang.to_sym, links
  opts.files.split(',').each{|file|
    r.renderer.file_no_ext = file[0,file.rindex('.')]
    r.render(File.open(file, 'r'){|f|f.read})
  }
  resolve = lambda{|parts| parts.join('').gsub(/@(#{$allowed})@/) {|match|resolve.call(links[match[1..-2]])} }
  links.keys.select{|k| k.end_with? ':*'}.each{|root|
    write_to(
      "#{opts.outputdir}/#{root[0..-3]}", 
      resolve.call(links[root]))
  }
end

So that's it! An example: the following markdown...


### my file ###
add some *text*:

~~~~~
@*@ +=
int main() { 
  @Do Stuff@
  @Do More Stuff@
  return 0; 
}
~~~~~

and some more text. we should probably say what dostuff is; it's:

~~~~~
@Do Stuff@ +=
int* me = &1;
*me++;
~~~~~

was that clearer? More stuff is the same as the first:

~~~~
@Do More Stuff@ += @Do Stuff@
~~~~

...can be woven into the following html using literate_md --lang=c --weave -f example.md...

my file

add some text:

5
6
7
8
9
10
⟨*⟩+≡
int main() { 
  ⟨Do Stuff⟩
  ⟨Do More Stuff⟩
  return 0; 
}

and some more text. we should probably say what dostuff is; it's:

16
17
18
⟨Do Stuff⟩+≡
int* me = &1;
*me++;

was that clearer? More stuff is the same as the first:

24
⟨Do More Stuff⟩+≡ ⟨Do Stuff⟩

...or tangled using literate_md --lang=c --tangle -f example.md into...


int main() { 
  
int* me = &1;
*me++;

   
int* me = &1;
*me++;


  return 0; 
}

I've put this script as a gem here - let me know if you have any improvements!