Building a Regex Search Engine for DNA

This winter, Tony interned at Benchling and built the foundation of our DNA search feature. Somak and I brought it into production shortly after his internship ended, and we're finally writing it up!

We recently launched DNA search on Benchling - you can have a library of thousands of plasmids and primers, and we'll search through them in less than 100ms end to end. It was a fun project, and through it we ended up building a regex search engine (on top of Elasticsearch) to support the various search needs of scientists. Here's how we did it.


Some rough biology: each DNA strand is modeled as a string of A, T, C, and G characters, and each character is called a base. For a string like ATG AAA TTT, the DNA gets read 3 bases at a time. 3 bases together are called a codon, and each codon gets translated into an amino acid: ATG becomes methionine, AAA becomes lysine, and TTT becomes phenylalanine. Amino acids are then assembled into proteins, which perform an astounding number of functions inside an organism. In this way, DNA stores the "code" of life.

DNA search is essentially substring search across a database of strings (albeit with a smaller alphabet). Some common use cases: one, scientists will search for certain genes that they've used in engineered plasmids. Two, since multiple codons can translate to the same amino acid, a process called codon optimization might replace codons with equivalent ones that work better given a certain kind of organism - your CAA (glutamine) might become CAG (still glutamine) instead. You might want to search by amino acid ("glutamine") so that you find it regardless of whether CAA or CAG was used.

As scientists design and document their research on Benchling, the constructs that they create build up over time. DNA search should help scientists cut through the noise and find the relevant sequences they're looking for.

General Approach

Scientists have a few ways of describing sequences, all of which map well to regular expressions:

  • IUPAC notation describes DNA bases. On top of A, C, T, and G, you have letters like N (any base: A, C, G, or T) and W (A or T). You can search for TAGNGG, which becomes TAG(A|C|T|G)GG.
  • Amino acids have their own set of letters: glutamine is Q, for example. Scientists can search by amino acids, and since multiple codons translate to the same amino acid, Q maps to the regex (CAA|CAG). Searching for QWD becomes (CAA|CAG)(TGG)(GAT|GAC).
  • Finally, amino acids have a regex-like syntax introduced by the protein database Prosite - Q(3)-x-x denotes three glutamines Q followed by any two amino acids x, or (CAA|CAG)(CAA|CAG)(CAA|CAG)(NNN)(NNN).
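
The amino-acid-to-regex translation above can be sketched in a few lines of Python. The codon table here is truncated to just the three amino acids from the example (a real table covers all twenty, plus stop codons), and the names are ours, for illustration:

```python
# Truncated codon table: amino acid letter -> regex over its codons.
CODON_REGEX = {
    "Q": "(CAA|CAG)",  # glutamine
    "W": "(TGG)",      # tryptophan
    "D": "(GAT|GAC)",  # aspartate
}

def amino_acids_to_regex(query):
    """Translate an amino acid string like 'QWD' into a DNA regex."""
    return "".join(CODON_REGEX[aa] for aa in query)
```

For example, `amino_acids_to_regex("QWD")` produces `(CAA|CAG)(TGG)(GAT|GAC)`, matching the expansion above.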

To address these three cases, we built a "regex search engine" - users can search with IUPAC notation, amino acid notation, or Prosite notation, and we'll translate the query into a regex to feed into the engine.

Our approach uses a combination of regex parsing and Elasticsearch indexing:

  • We build an inverted trigram index of DNA bases
  • For each regex, we convert it into a set of constraints that we can use as a filter. The filter can allow for false positives (but not false negatives) - it helps narrow the search space: (A|C|G|T)GG becomes AGG OR CGG OR GGG OR TGG
  • That filter is sent to Elasticsearch, which uses the trigram index to quickly find potentially matching bases. We use a post-search regex filter to eliminate false positives
  • That's it! The result is sub 100ms searches across a user's entire collection of sequences.
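
As a toy illustration of that filter expansion, here's how a fixed-width pattern like (A|C|G|T)GG expands into concrete strings. This is not our actual constraint generator (described below), which handles arbitrary regexes - it's a sketch that only understands literals and single-base alternations:

```python
import itertools
import re

def expand_to_strings(pattern):
    """Expand a regex of literals and parenthesized alternations,
    e.g. '(A|C|G|T)GG', into the concrete strings it can match."""
    # Each part is either an alternation group or a literal run.
    parts = re.findall(r"\(([^)]*)\)|([ACGT]+)", pattern)
    choices = [alt.split("|") if alt else [lit] for alt, lit in parts]
    return ["".join(combo) for combo in itertools.product(*choices)]
```

`expand_to_strings("(A|C|G|T)GG")` yields AGG, CGG, GGG, and TGG - exactly the OR filter above.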

We used Elasticsearch because we already used it for indexing and searching the various file types we have on Benchling. Searching for bases is just another way to match items and combines well with our existing filters and scoring functions. (We considered building it using pg_trgm / Postgres instead, but Elasticsearch has Postgres beat in terms of built-in analyzers and query features.)

We're definitely not the first to build regex search, and we won't be the last - Russ Cox has a great writeup of how it was done for Google code search. What differentiates our approach is that we built it for DNA search using relatively standard software bits.

Finally, a quick note on BLAST: it and several other similar algorithms have been developed for quickly finding similar sequences given an input sequence. The exact use case here is different, though: BLAST is based on a heuristic algorithm to approximately match against a large set of sequences, whereas here we're focused on precise matching of a scientist's personal library. The narrower scope allows us to be more precise and support more complex regexes, but as you'll see it still requires some work to be performant.

Parsing Regex

It turned out that finding a Python library to parse regexes (and not parse using regexes) was surprisingly difficult. You could use the debug flag (re.compile(regex, re.DEBUG)) to see the AST Python parses, but 1) it prints to stdout and 2) we'd need to reparse that output to get our own AST, so it seemed too hacky an approach:

>>> re.compile('a(b.)*', re.DEBUG)
literal 97  
max_repeat 0 4294967295  
  subpattern 1
    literal 98
    any None

(Update: Andrew Dalke had a great tip - there's actually an undocumented sre_parse module that returns a parse tree - definitely worth looking into.)

We ended up writing a simple recursive descent parser to parse our subset of supported regex. It sounds fancier than it is: Matt Might has a great article on writing one that came in handy. The nice thing about recursive descent parsers is that they're pretty simple to write for a language like regex. I wrote the core parser on a plane ride - it weighs in at about 200 lines of code and is easy to test. So we started off with that as a proof of concept.

parse(...) takes in the regex string to parse and outputs a data structure commonly referred to as an abstract syntax tree (AST):

>>> parse('a(b.)*')
ArrayNode([CharNode('a'), StarNode(ArrayNode([CharNode('b'), PeriodNode()]), False)])  

Our AST generally consists of ArrayNode, which corresponds to a sequence of terms, and AlternateNode, which corresponds to | in regex. (There's a bunch of other nodes for specific regex features like PlusNode and StarNode.) So A(T|C)G becomes ArrayNode('A', AlternateNode('T', 'C'), 'G'). This AST provides an easier-to-use representation than a regex string, and it'll be very useful for further analysis (see Constraint Generation below).
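
The post doesn't show the node classes themselves, but based on the reprs above they might look roughly like this (the names match the output; the exact shapes are assumptions):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class CharNode:
    char: str

@dataclass
class PeriodNode:       # '.' in regex
    pass

@dataclass
class ArrayNode:        # a sequence of terms
    children: List[Any]

@dataclass
class AlternateNode:    # '|' in regex
    options: List[Any]

@dataclass
class StarNode:         # '*' in regex; the bool marks non-greedy
    child: Any
    lazy: bool = False

# The AST for 'a(b.)*' from the example above:
ast = ArrayNode([
    CharNode("a"),
    StarNode(ArrayNode([CharNode("b"), PeriodNode()]), False),
])
```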

It wasn't completely without hitches. Python doesn't do tail recursion elimination (Guido doesn't like it), and we quickly hit recursion limits. Our recursive code looked something like parse(ATG) => Sequence(A, parse(TG)) => Sequence(A, Sequence(T, parse(G))). The code is dead simple this way, but you quickly run out of stack space for a longer regex. We tokenized consecutive characters, so ATG is treated as a single token instead of three tokens, which addresses the most common use case of consecutive, normal bases. For genuinely long regexes (e.g. a long amino acid chain: (AAA|TTT){1000 more}), we simply bumped the recursion limit. We have some ideas for improving the parser, but more on that later.

Constraint Generation

Now that we have a parsed regex, we need to extract as much information as we can for the search. The idea here is to map the regex to a filter consisting of boolean expressions - for a given regex, what filters can we generate to help narrow down potential matches? We have a function generate_constraints(...) that takes in a regex and returns a tree of AND and OR nodes (with strings at the leaves). We'll use this as an example:

ATTAAA.*(TTT|GGG)

After parsing, we have

=> ArrayNode(ATTAAA, StarNode(DotNode), AlternateNode(TTT, GGG))

Our generate_constraints function takes in regex in the form of a parsed AST and recursively calls itself: generate(ArrayNode(X, Y)) becomes AND(generate(X), generate(Y)) and generate(AlternateNode(X, Y)) becomes OR(generate(X), generate(Y)). Any nodes that are optional - StarNode, for example - are replaced with an ANY() node. Any string literals are left alone, as strings.
Our example becomes

=> AND(ATTAAA, ANY(), OR(TTT, GGG))
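
The recursion itself is short. Here's a sketch using plain tuples in place of our actual node classes - the shapes ("array", ...), ("alt", ...), and so on are illustrative, not our real code:

```python
def generate_constraints(node):
    """Map an AST to a constraint tree of ("and", ...), ("or", ...),
    ("any",), and string literals."""
    if isinstance(node, str):                # literal: left alone
        return node
    kind, payload = node
    if kind == "array":                      # sequence -> AND
        return ("and", [generate_constraints(c) for c in payload])
    if kind == "alt":                        # alternation -> OR
        return ("or", [generate_constraints(c) for c in payload])
    if kind in ("star", "dot"):              # optional/unknown -> ANY
        return ("any",)
    raise ValueError(kind)

# The example from the post: ATTAAA.*(TTT|GGG)
ast = ("array", ["ATTAAA", ("star", ("dot", None)), ("alt", ["TTT", "GGG"])])
constraints = generate_constraints(ast)
```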

We then simplify the tree: ANY() inside of AND() is removed, and OR() that contains ANY() is replaced with ANY(). Any single-term AND or OR clause is unwrapped - AND(X) becomes X.


Finally, we flatten constraints a bit - you get a lot of AND(X, AND(Y, Z)) due to how we parsed things, which we replace with AND(X, Y, Z) since AND is associative. The same goes for OR. This is actually really important - we found that Elasticsearch doesn't do much algebraic simplification, so if you search for AND(AND(X, Y)) a query could take 4 seconds while AND(X, Y) would take 10 milliseconds.
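
A sketch of the simplify-and-flatten pass, on the same tuple representation as above - ("and", [...]), ("or", [...]), ("any",), or a bare string (illustrative, not our production code):

```python
def simplify(node):
    """Drop ANY() inside AND, collapse OR containing ANY() to ANY(),
    flatten nested same-kind clauses, and unwrap single-term clauses."""
    if isinstance(node, str) or node == ("any",):
        return node
    kind, children = node
    children = [simplify(c) for c in children]
    if kind == "and":
        children = [c for c in children if c != ("any",)]
        if not children:
            return ("any",)
    elif kind == "or" and ("any",) in children:
        return ("any",)
    # AND(X, AND(Y, Z)) -> AND(X, Y, Z); same for OR (both associative).
    flat = []
    for c in children:
        if isinstance(c, tuple) and c[0] == kind:
            flat.extend(c[1])
        else:
            flat.append(c)
    # AND(X) -> X.
    return flat[0] if len(flat) == 1 else (kind, flat)
```

Running it on the example's constraints drops the ANY() and leaves AND(ATTAAA, OR(TTT, GGG)).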

Now that we have a set of boolean constraints, we can pass it into Elasticsearch.

Elasticsearch Indexing

If you're not familiar, Elasticsearch is a distributed document store that is really good at creating inverted indices. Given an input source, you configure how it extracts "terms" out of the source, and Elasticsearch then provides a rich API to query for which documents contain your terms.

In our case, we used Elasticsearch to index DNA bases. Elasticsearch's out-of-the-box text indexing won't work here: Elasticsearch splits by whitespace and normalizes to lowercase, which works great for e.g. English documents. However, DNA is one long string - a common approach, especially for substring search, is to index by trigrams instead. (An n-gram is just a fancy word for a substring of length n. Given abcd, the 3-grams, or trigrams, are abc and bcd.) We pass it a DNA string and have it extract all trigrams for indexing.
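
Trigram extraction itself is tiny - a sketch:

```python
def trigrams(seq):
    """All overlapping 3-grams of a string: 'abcd' -> ['abc', 'bcd']."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]
```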

We then search using trigrams - if our query is ATTG, we'll search for any DNA that has ATT and TTG. This can have false positive matches (such as a sequence TTGGATT), but the goal here is to quickly eliminate sequences that definitely don't match. For the remaining matches we can run the actual query to verify - as long as our false positive rate isn't too high, this works well.

It gets one level deeper since we're using boolean filters from above. Elasticsearch uses a JSON-formatted DSL, so our constraint from above looks something like:

  {
    "and": [
      {"query": {"match": "ATTAAA"}},
      {
        "or": [
          {"query": {"match": "TTT"}},
          {"query": {"match": "GGG"}}
        ]
      }
    ]
  }
Elasticsearch will (basically) perform each query independently - each query gets the same trigram treatment, so ATTAAA actually becomes ATT, TTA, TAA, AAA and a sequence that has all the trigrams will match that clause. Those results are intersected (due to the and) with the sequences that match the or clause. After running it through the set of boolean clauses, the sequences that match have been narrowed down and are good candidates for the original regex. In our experience, this is super fast (a few milliseconds), even with millions of sequences.
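
To make the mechanics concrete, here's a toy in-memory version of that candidate filtering - a sketch of the behavior, not of how Elasticsearch actually executes queries:

```python
def trigrams(seq):
    """The 3-grams of a string, as a set."""
    return {seq[i:i + 3] for i in range(len(seq) - 2)}

# Toy corpus and inverted index: trigram -> ids of sequences containing it.
DOCS = {1: "ATTAAACCCTTT", 2: "ATTAAAGGG", 3: "CCCTTTGGG"}
INDEX = {}
for doc_id, seq in DOCS.items():
    for gram in trigrams(seq):
        INDEX.setdefault(gram, set()).add(doc_id)

def candidates(clause):
    """Evaluate a constraint tree - ("and", [...]), ("or", [...]), or a
    bare string - against the index. A string clause matches documents
    containing all of its trigrams (which can include false positives,
    removed later by the real regex)."""
    if isinstance(clause, str):
        docs = set(DOCS)
        for gram in trigrams(clause):
            docs &= INDEX.get(gram, set())
        return docs
    kind, children = clause
    results = [candidates(c) for c in children]
    return set.intersection(*results) if kind == "and" else set.union(*results)
```

The example constraint AND(ATTAAA, OR(TTT, GGG)) narrows the toy corpus down to documents 1 and 2.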

Inspired by a thread on the Elasticsearch mailing list, we use a Groovy script to run the original regex over each of the final matching sequences. Originally we tried using the built-in Regexp Filter, but Elasticsearch would attempt to run it first, before any filtering from the boolean query, resulting in a slow scan over the entire database. Elasticsearch runs custom scripts last, which avoids this problem.

The Groovy script is run against each matching document and filters by the regex. This is where false positives are removed, leaving us with the DNA we were looking for!

(Side note: configuring the Groovy script was surprisingly annoying. Elasticsearch allows both dynamic scripts (where the query itself contains the script as a string) and stored scripts (where the script lives on the node). Running dynamic scripts on Elasticsearch looked like a security vulnerability (they're not sandboxed and are discouraged), so we had to use Chef to upload a .groovy file to each node in the cluster.)

Gotcha: Indexing

We found indexing to be the most problematic step. We already had indexing infrastructure in place to re-index sequences on change, so that part was easy. However, out of the box, Elasticsearch choked on longer sequences, and we haven't had much time to tune the cluster.

We originally indexed not only trigrams but 4-grams and 5-grams as well, the thinking being that larger n-grams would produce fewer false positives. We measured that indexing time increased linearly with the number of n-gram types: indexing 3-grams alone took half as long as indexing both 3-grams and 4-grams. (This is not so surprising if you assume it just reads the sequence over once per n-gram type, but ideally it could extract them all in one pass.) Luckily, 3-grams alone still give us fast enough searching (not too many false positives), so we stuck with that.

It still takes 30 seconds for a 200,000 basepair sequence. This is on the higher end of what our indexing processes can bear, so we currently only index sequences shorter than 200kb (200 kilobases). A ton of use cases are addressed with indexing sequences under 200kb, but we'd obviously like to support larger sequences as well. More on that later.

Gotcha: Circular Sequences

In bacteria, DNA can come in a circular form (called a plasmid) - this means that for circular DNA, our regexes can wrap around the end of a string: T.*A matches AAATTT if AAATTT is circular! This is a twist on the age-old interview question of checking if one string is a rotation of the other - we store the DNA, if circular, as a doubled string. So AAATTT is stored as AAATTTAAATTT - we run the same search as before against it, but in our groovy script we also check that the length of the match is less than the length of the original DNA string. (Otherwise, something like AAA.*AAA would incorrectly match AAATTTAAATTT.)

We need one last change - a regex like AAA.*T will match greedily and now breaks: it matches AAATTTAAATTT and is filtered out by the length filter we just added. Instead, during our parsing step we transform the AST to a non-greedy version - AAA.*T becomes AAA.*?T. Now, the match is just AAAT, and we're all set.
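
The whole circular-matching trick fits in a few lines. This sketch (not our actual implementation) takes a pattern that has already been made non-greedy and applies the doubled-string and match-length checks:

```python
import re

def circular_search(pattern, seq, circular):
    """Match a regex against a possibly-circular DNA string using the
    doubled-string trick. `pattern` should use non-greedy quantifiers."""
    haystack = seq + seq if circular else seq
    for match in re.finditer(pattern, haystack):
        # Reject wrap-around artifacts longer than the original sequence.
        if len(match.group()) <= len(seq):
            return match.group()
    return None
```

With this, circular_search(r"T.*?A", "AAATTT", circular=True) finds a wrap-around match, while the same search on the linear string finds nothing.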

Next Steps

As mentioned earlier, our recursive descent parser works but isn't a great fit for Python due to the lack of tail recursion optimization. Moving forward, we'd probably use a parser generator like ANTLR to generate an LL parser, which won't have the same recursion issues. ANTLR is heavier and requires some more tooling to work, but it's a lot more efficient, pretty standard in the parsing community, and less likely to be buggy than something we write ourselves. Our recursive descent parser works in the meantime, though!

Matching longer sequences is going to take more infrastructure work. Our current thinking is that we can cover a large percentage of the remaining use cases if we support indexing long sequences but still cap the length of the allowed match (to about 100kb). We could index sequences in 200kb chunks, each overlapping the next by 100kb, and parallelize across multiple Elasticsearch shards to speed up indexing as well as search. A search would look across all chunks (any match shorter than 100kb would be guaranteed to be found) and then de-duplicate the resulting sequences.
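
A sketch of that chunking idea, with the chunk and overlap sizes from above as defaults (hypothetical code, nothing we've shipped):

```python
def chunk_sequence(seq, chunk=200_000, overlap=100_000):
    """Split a sequence into overlapping chunks for indexing. With these
    sizes, any match up to `overlap` bases long is guaranteed to lie
    entirely within at least one chunk."""
    step = chunk - overlap
    return [(start, seq[start:start + chunk])
            for start in range(0, max(len(seq) - overlap, 1), step)]
```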

Comments or ideas? I'd love to hear them: josh at

Finally, as always, we're hiring engineers who are excited to make research faster by solving these sorts of problems.

Discuss on Hacker News