radixtree: a prefix-tree parsing library for Haskell
This is just a brief post about radixtree, which is a library that:
- Produces radix (alternatively, prefix?) trees from
- Provides a generic parser suitable for use with
parsec, or anything with a
I’m developing a requirements authoring tool called Respecify, at Transport Engineering. One of its core features is a nifty (if I do say so myself) parser generator that I’ve used to create an extremely constrained English grammar for, which then parses requirements, and enables some other core features of Respecify.
The parser has to be able to quickly parse hundreds of requirements at once, each containing terms from a fixed (-ish) dictionary of user-specified nouns and verbs. The number of terms in these dictionaries can easily number in the thousands, so it’s critical that this is fast. See this short project page for more information about Respecify.
Radix trees are a much-beloved data structure useful for parsing terms from large dictionaries with lots of similar prefixes (e.g., English). In this context, each edge from a specific node is labelled with the text needed to advance a parser to the next node, and each node is marked with whether or not the parser can terminate successfully there, returning whatever datum is at that node.
The motivation is simply speed. Parsing large corpuses can be much faster with this approach. Benchmark results on a corpus of around 800 English terms demonstrate this:
Benchmark radixtree-parsing: RUNNING... benchmarking attoparsec/radix time 5.032 μs (5.025 μs .. 5.041 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 5.035 μs (5.027 μs .. 5.047 μs) std dev 31.24 ns (23.48 ns .. 45.45 ns) benchmarking attoparsec/radix compressed time 5.041 μs (5.031 μs .. 5.053 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 5.060 μs (5.045 μs .. 5.084 μs) std dev 59.94 ns (42.56 ns .. 82.04 ns) benchmarking attoparsec/naiive time 69.32 μs (69.16 μs .. 69.47 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 69.54 μs (69.28 μs .. 69.89 μs) std dev 1.028 μs (778.0 ns .. 1.251 μs) Benchmark radixtree-parsing: FINISH
(Here, “naiive” refers to, basically,
choice . map text . sortOn (negate .
I don’t know and haven’t measured if, for Respecify’s use-case,
actually faster than e.g., tokenising and using a hashtable /
that approach also has its own unique downsides. Namely, that you actually have
to define what a token is: even though a user might want the term “IEEE 754” in
their references dictionary, the parser would have to lookup “IEEE”, then
“IEEE 754”. But if the user just wants to ignore the parse error because “IEEE
754” isn’t in their reference list yet, the parser will have to just keep trying
with each new token, e.g. “IEEE 754 is”, “IEEE 754 is great”, etc.
The trade-off is that a
RadixTree uses much more memory than a simple list of
terms, though this is slightly mitigated with a clever (i.e., probably horribly
dangerous – though it hasn’t done anything evil to me yet, and the test suite
seems to validate it) variant named
CompressedRadixTree, which just relies on
Text array for all of its edges and leafs.
|Variant||Size in bytes|
- Become less afraid of
- Allow arbitrary data to be accepted by
RadixTree, so they can be used as an alternative to
blog comments powered by Disqus