begriffs

Fast, Elegant Regexes in Haskell

June 27, 2016

“Typed, elegant and efficient – we can have all three.” Gabriel Gonzalez introduces his new regular expression library which breaks away from the conventional way of using strings to represent regexes. Gabriel’s library treats regexes as algebraic objects that can be combined together much like numbers.

Not only is the library beautiful, it’s fast. The latest version approaches the speed of GNU grep and even beats it in certain situations.

Slides | Download Video: (HD / SD).

Summary

  • Regular expressions are not historically a pleasure to work with in Haskell
    • This is primarily because in Haskell we have really good facilities for parsers and parser combinators
    • People tend to reach for parsers and let regexes languish
    • But sometimes it’s faster to just write a regex to match a string!
  • How can we provide an fast, elegant, typed Haskell regex API?
    • Regexes behave in some ways like numbers
    • Call the regex matching nothing 0, that matching the empty sequence 1. Define addition as regex alternative and multiplication as match one then the next regex
    • Note that ordinary laws of these operations hold, and we can overload the Num typeclass
    • We can also implement the IsString typeclass so that the programmer can write a sequence of text to match as a simple quoted string
    • With this and a few helper functions to build regexes we can express everything we need
  • Haskell Regex libraries fall in three categories
    • Simple wrappers for a C library which takes raw strings
    • Those implemented in Haskell but still using a string API
    • Those implemented in Haskell having an elegant typed API
  • There is only one library in the last category, regex-applicative
    • It’s close to what we want except it is too slow!
    • It’s about 60x slower than GNU grep for complicated regexes, with an even worse disparity for simple string search
  • Gabriel wrote a library that is 4x slower than GNU grep, but he is working on a new version using a mix of Haskell and C that is actually much faster than grep by… a lot!
  • Review of (non-)deterministic finite state automata
  • We’ll use a non-determinstic finite state automaton (NFA) to implement a regular expression
    • An example hand-written NFA to match “hello”
    • The naive implementation takes 13 nanoseconds per element, which is OK, but searching for an exact string match in many cases does not have to do much work
    • Let’s trying matching “hello” anywhere in the interior of a string
    • This takes 20 ns, which isn’t that hot
    • The regex-applicative library is 5x slower. In all fairness that library also captures matches for later so it is doing more work
  • Building state machines from the mathematical API
  • Oh no, our first attempt at the Thompson construction is horribly slow! 3.2 microseconds per element
    • We want to speed it up while preserving the API
    • We’ve been using heavy-weight Set implementation. We can instead pack bits representing possible states into the Haskell Integer type and use bitwise operations to manipulate them
    • Bitwise sets makes the code 5x faster
    • But still slower than regex-applicative
    • Rather than Integer, we can use Int when the regex has up to 64 states
    • Now it’s 8x faster than before, which is faster than regex-applicative
  • Still 10x slower than grep, how much faster can we go?
    • ByteString-specific optimizations
    • We loop through the sequence with ByteString internals and use mmap to read files quickly
    • Now the code is 3x faster, about 18ns per element vs grep’s 4ns
    • This is as fast as Gabriel could make the pure Haskell solution
  • He is now working on a mixed C and Haskell library and it is beating grep
  • In conclusion Haskell is typed, elegant, and efficient at the same time. You don’t have to choose some properties but not others.