Using Suffix Trees to Detect Homology at Scale

Karen Gu
Benchling Engineering
6 min readAug 1, 2022

--

Benchling’s mission is to unlock the power of biotechnology, which often involves creating new tools that help scientists do their work faster and more intelligently. Many of these tools, such as the homology detection tool described here, sit at a unique intersection where results in computer science are used to improve everyday biotechnology workflows.

Benchling is the single source of truth for biotech research and development (R&D) workflows. One particularly common workflow is molecular cloning, where a scientist creates many copies of a gene or protein. This process involves inserting a DNA fragment into a plasmid, a circular piece of DNA, that can be replicated in large quantities by microorganisms.

Learn the basics of Bulk DNA Assembly

Scientists first complete this workflow in silico (an experiment modeled on a computer or performed via computer simulation) on Benchling by finding the fragments they need to combine. Each fragment may have slight variations or regions of scientific interest (e.g. a gene for antibiotic resistance). The fragments are joined together using different cloning methods, which correspond to different scientific procedures. This can result in hundreds of possible combinations that would be infeasible to determine by hand.

Benchling offers a Bulk Assembly product to address this use case: users identify the fragments of DNA to combine, and Benchling determines what the result of each combination should be, saving users hours of manual work.

What is homology detection?

One important tool in molecular cloning workflows is homology-based cloning. To explain this, we’ll first dive into some of the basics of DNA structure.

The key to this method is the double-stranded nature of DNA. DNA is composed of two complementary strands of nucleic acids. Each nucleic acid has one of four nitrogenous bases: adenine (A), thymine (T), guanine (G), or cytosine (C). These four bases form two pairs that preferentially bond with each other — A with T, and G with C (the so-called base pairing rules). This means that a single strand of DNA exactly specifies what its corresponding strand should be. For example, if an A appears at a given position in a DNA strand, we know that the corresponding strand should have a T at that same position.

By Madprime (talk · contribs) — This vector image was created with Inkscape., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1848174

Homology-based cloning methods, such as the Gibson assembly method depicted below, rely on this complementarity to join fragments of DNA together. Single strands of DNA will join together (anneal) according to the base pairing rules if they contain complementary regions, or homology regions. Thus, a homology region is simply a region of shared bases between two different fragments of DNA (the region overlapping the two different fragments in the second step of the procedure).

By Tobias Vornholt — Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=39652308

On Benchling, scientists perform the molecular cloning workflow in silico by specifying two adjacent fragments of DNA and a homology region that the fragments have in common. However, the homology region is just a string of A, T, G, and C. This string can be difficult to remember or identify, so Benchling provides a homology detection feature that determines the longest shared region between two fragments of DNA. And since the key benefit of Bulk Assembly is being able to combine more than just two fragments, this is actually the longest shared region between many fragments of DNA.

To put this into computer science terms: it’s just the longest common substring problem, on strings consisting of only the letters A,T,G or C.

The Old Way

Our first attempt to solve this problem was simple. We knew that our customers were using the homology detection feature in the context of a specific cloning workflow (Gibson assembly) where the length of the homology region is quite small — on the order of tens of bases. We therefore assumed that the length of the longest homology region would fall within a narrow window of a few tens of bases, i through j.

This approach was easy to implement and to understand, but it was not performant due to the step that computed all potential substrings of a given length. When there is no cap on substring length, the overall process requires O((length of the sequence)³) time. It worked under our previous assumptions, but broke down as our customers came to us with new use cases with hundreds to thousands of bases.

Suffix Trees

Before we describe our second attempt to solve this problem, let’s talk about the data structure that underlies the new solution: suffix trees.

A suffix tree is a compact way to store all of the suffixes of a given input string, which is concatenated with a unique end-of-string character. Each internal node can have one or more outgoing edges, at most one for each letter of the input alphabet. Each edge can have one or more letters, and points to another node. Then, by traversing the path from the root to a leaf, recording the letters on each edge, a suffix of the input will be created.

For example, in the suffix tree below for the string BANANA, the substring NANA is represented by an edge extending from the root labeled NA, followed by another edge labeled NA$, which form the desired substring NANA (plus the end-of-string character $) when concatenated together.

Adapted from https://commons.wikimedia.org/w/index.php?curid=1197251

A generalized suffix tree is a suffix tree that represents multiple input strings.
This is done by creating a simple suffix tree with the inputs concatenated together, separated by a unique end-of-string character that is not found in any of the inputs. For example, to represent the inputs GATTACA, ATCG, ATTAC, we would create a suffix tree with the single input GATTACA$ATCG#ATTAC@, using a different end-of-string character for each input.

The New Way

In the new world, we perform homology detection in three steps:

  1. Create a generalized suffix tree using Ukkonen’s algorithm
  2. Annotate each node in the tree according to which input strings correspond to it
  3. Perform a depth-first search over the tree to find the deepest node(s) annotated with all input strings

Since steps (1) and (3) are described thoroughly elsewhere, we’ll only discuss (2) in more detail.

By definition, each leaf in the suffix tree represents a suffix of the original string. Then, since we know the length of each suffix and the length of the original string, we can determine the suffix’s start position in the original string:

Furthermore, we know the length of each of the input strings that were concatenated together to form the original string. This allows us to uniquely identify which input string corresponds to each leaf of the tree.

We can then recursively annotate the tree from leaf to root, where an internal node will receive an annotation with each input string for which it contains a descendant leaf.

This method performs significantly better than the old method for homology regions on the order of thousands of bases. Using the old method, we would have rejected any requests at the scale represented by the rows in the below table, as they would’ve caused out-of-memory (OOM) errors. However, it still runs into its limits.

Next Steps

As customers continue to use the homology detection feature to handle new scientific use cases, we may need to make further performance enhancements:

  • Use a C-based implementation to avoid incurring the Pythonic overhead
  • Look into more state-of-the-art algorithms for solving the longest common substring problem
  • Investigate a distributed solution (this may bring us back to our original implementation which allows us to chunk by homology region length)

We’re hiring!

If you’re interested in working with us to build the future of biotech R&D platforms, check out our careers page or contact us!

--

--