radixtree: a prefix-tree parsing library for Haskell
This is just a brief post about radixtree, which is a library that:
- Produces radix (alternatively, prefix?) trees from
Text
values - Provides a generic parser suitable for use with
attoparsec
,trifecta
,parsec
, or anything with aCharParsing
(from parsers)
Background
I’m developing a requirements authoring tool called Respecify, at Transport Engineering. One of its core features is a nifty (if I do say so myself) parser generator that I’ve used to create an extremely constrained English grammar for, which then parses requirements, and enables some other core features of Respecify.
The parser has to be able to quickly parse hundreds of requirements at once, each containing terms from a fixed (-ish) dictionary of user-specified nouns and verbs. The number of terms in these dictionaries can easily number in the thousands, so it’s critical that this is fast. See this short project page for more information about Respecify.
Foreground
Radix trees are a much-beloved data structure useful for parsing terms from large dictionaries with lots of similar prefixes (e.g., English). In this context, each edge from a specific node is labelled with the text needed to advance a parser to the next node, and each node is marked with whether or not the parser can terminate successfully there, returning whatever datum is at that node.
Performance
The motivation is simply speed. Parsing large corpuses can be much faster with this approach. Benchmark results on a corpus of around 800 English terms demonstrate this:
Benchmark radixtree-parsing: RUNNING...
benchmarking attoparsec/radix
time 5.032 μs (5.025 μs .. 5.041 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 5.035 μs (5.027 μs .. 5.047 μs)
std dev 31.24 ns (23.48 ns .. 45.45 ns)
benchmarking attoparsec/radix compressed
time 5.041 μs (5.031 μs .. 5.053 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 5.060 μs (5.045 μs .. 5.084 μs)
std dev 59.94 ns (42.56 ns .. 82.04 ns)
benchmarking attoparsec/naiive
time 69.32 μs (69.16 μs .. 69.47 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 69.54 μs (69.28 μs .. 69.89 μs)
std dev 1.028 μs (778.0 ns .. 1.251 μs)
Benchmark radixtree-parsing: FINISH
(Here, “naiive” refers to, basically, choice . map text . sortOn (negate .
length)
.)
I don’t know and haven’t measured if, for Respecify’s use-case, radixtree
is
actually faster than e.g., tokenising and using a hashtable / HashMap
, though
that approach also has its own unique downsides. Namely, that you actually have
to define what a token is: even though a user might want the term “IEEE 754” in
their references dictionary, the parser would have to lookup “IEEE”, then
“IEEE 754”. But if the user just wants to ignore the parse error because “IEEE
754” isn’t in their reference list yet, the parser will have to just keep trying
with each new token, e.g. “IEEE 754 is”, “IEEE 754 is great”, etc.
Memory use
The trade-off is that a RadixTree
uses much more memory than a simple list of
terms, though this is slightly mitigated with a clever (i.e., probably horribly
dangerous – though it hasn’t done anything evil to me yet, and the test suite
seems to validate it) variant named CompressedRadixTree
, which just relies on
a single Text
array for all of its edges and leafs.
Variant | Size in bytes |
---|---|
[Text] |
69840 |
Vector Text |
56952 |
CompressedRadixTree |
254032 |
RadixTree |
709904 |
Example code
TODOs
- Become less afraid of
CompressedRadixTree
- Allow arbitrary data to be accepted by
RadixTree
, so they can be used as an alternative toHashMap Text
.
Links
blog comments powered by Disqus