CRISPR is a genome-editing technique that has rapidly gained popularity in the world of synthetic biology over the last two years.

CRISPR allows scientists to edit genomes with unprecedented precision, efficiency, and flexibility. Gizmodo

It is completely transforming how medical research is done - scientists are already applying CRISPR towards curing genetic disorders, creating disease models, accelerating drug development, and more^{1}. We believe software will play a crucial role in these advancements, and we've released the Benchling CRISPR tool to make working with CRISPR faster and more accurate.

This post focuses on how we've improved 15-minute batch jobs to sub-second searches on a 3-billion base genome. Faster analysis means faster feedback, and faster feedback means faster iteration. Without going into too much detail about the biology of CRISPR (see Andrew Gibiansky's post for a great primer), let's go over a few basics to first understand the motivation. Then we'll dive into the code.

## CRISPR Basics

We model DNA as a string of *base molecules*, where each base is either adenine (A), thymine (T), cytosine (C), or guanine (G). So the human genome is a long string of `ACGT`

characters - that's it^{2}.

At its core, to work with CRISPR scientists design a 20 base string of DNA called a "guide." This guide is integrated into a special protein called Cas9, which then *cuts the genome* at all occurrences of the guide. Cutting the genome at a position means actually breaking the DNA strands apart. This is where CRISPR gets its powers - by designing the right 20 bases, you can break apart specific parts of the genome.

For example, to break a gene:

- Find a 20 base guide that occurs near the beginning of the gene
- Use the CRISPR/Cas9 system to cut at that location
- The cell's natural repair mechanisms will attempt to repair the cut. Since it's an error-prone process, continual cutting by Cas9 eventually results in an incorrect repair. The gene is essentially "corrupted" around that region, and the cell is no longer able to express that gene as intended.

However, this is also the hard part of CRISPR - given a 20 base guide, it's possible for it to occur elsewhere in the genome. You'd like to pick your guides to target a certain gene but avoid breaking other things - while CRISPR makes this much easier, it still has to be carefully done. *Additionally*, a perfect match isn't necessary for CRISPR to cut at a location - empirically, up to 4 mismatches can occur and still result in cutting.

So we want to find the optimal guide that matches near a gene of interest, without having too many false positives. This sounds like something a computer would be great at doing, and it's one of our core CRISPR features.

We help scientists find a set of guides and, for each guide, determine where else it binds in a 3 billion base genome.

## Our Problem

Let's define a function `match(guide, genome_region)`

which returns true if `guide`

and `genome_region`

are close enough that Cas9 would successfully cut at the `genome_region`

. In other words, it will return true if they are 1) the same length, and 2) differ in at most 4 characters.

```
match('AAAAAAAAAAAAAAAAAAAA', 'AAAAAAAAAAAAAAAATTTT') # True
match('AAAAAAAAAAAAAAAAAAAA', 'AAAAAAAAAAAAAAATTTTT') # False
```

Given a genome `G`

and a set of guides `S`

(all with the same length), we'd like to find all locations in `G`

that `match`

at least one guide in `S`

^{3}.

Let's optimize our algorithm for the following constraints:

- A genome length of 3 billion bases (the length of the human genome)
- A guide length of 20
- 200 guides in
`S`

Our algorithm should be both parallelizable and cost efficient. We'd like it to run, parallelized, in under 1 second.

(For readability, I've used Python "pseudocode" for any code snippets - all CPU times shown here are based on C++ implementations.)

## Naive Solution

Let's start with a simple solution as a baseline. We can iterate over the entire genome, and check each index to see if any guide is a match.

CPU Time: 3.25 hours

Ouch. Now, what are the constraints of our problem, and how can we take advantage of them?

- We have the genome in advance. Each species has only one or two reference genomes that scientists use.
- DNA strings are made up of only 4 different characters (A, C, G, and T).
- The length of a guide is 20 bases.
- The maximum edit distance (replaces only) is 4.

Let's start with the first constraint.

## Solution #1: Index the Genome

Since we have the genome beforehand, why don't we just index the entire genome? We can create a `hash_table`

that maps every 20 character string in the genome to the list of locations that it occurs.

For a 23 base genome `AAAAAAAAAAAAAAAAAAAAATG`

, `hash_table`

would be:

```
'AAAAAAAAAAAAAAAAAAAA': [0, 1]
'AAAAAAAAAAAAAAAAAAAT': [2]
'AAAAAAAAAAAAAAAAAATG': [3]
```

Since there are only 4 possible characters, we can represent each one with just two bits, using a simple encoding: `A => 0b00`

, `C => 0b01`

, `G => 0b10`

, and `T => 0b11`

. For example, `TGGA`

would be represented as `0b11101000`

. This means that a 20 base string can fit into one 64-bit int. That changes `hash_table`

to:

```
0b0000000000000000000000000000000000000000: [0, 1]
0b0000000000000000000000000000000000000011: [2]
0b0000000000000000000000000000000000001110: [3]
```

Given this map, how do we find all of the matches for a given guide? We have to allow an edit distance of up to 4, so we can't just do a single lookup for the guide. For a given guide, there are 424,996 strings^{4} within edit distance 4, so we could just search for each one...

CPU Time: 400 seconds

Unfortunately, the pseudocode hides a very important detail - each lookup in `hash_table`

is actually a disk seek. The table is too large to store in memory (around 36GB). Doing `200 * 424996`

disk reads takes quite a long time, even on an SSD.

## Solution #2: Preprocess everything

If the problem is the number of disk reads, maybe we should try preprocessing some more information from the genome.

One idea would be to store the matches for every possible guide - essentially, a full map from all possible inputs to their corresponding answer. That way, given a guide, we could do a single query into the hash table to find all of its matches.

CPU Time: Unknown

Assuming the genome is random, there should be about 1000 (32-bit int) matches for each guide. This would result in a storage requirement of at least 4^{20} * 1000 * 4 bytes/match, or more than 4 petabytes!

Even after playing around with this for a while, it was hard to come up with a solution that was both fast and used a reasonable amount of storage.

Let's take a step back, and look at the problem as a whole. We are trying to search for 200 guides over a 3 billion base genome. We can roughly divide the solution space into two groups:

- Process the genome, and do a separate search for each guide
- Process the guides, and do a separate search for each position in the genome

My intuition tells me that the second scheme should be much worse. We can't process the guides in advance (whereas we can process the genome). We will have to do at least 3 billion "operations," 1 per position in the genome (versus ~200 "operations," 1 per guide, by processing the genome). But let's give it a shot anyway.

Suppose we constrained the problem - let's search for an exact match for a single guide. How long does it actually take to naively iterate over the entire genome?

CPU Time (single guide): 15 seconds

Now, let's take advantage of the limits of DNA. Since we can store an entire 20 base string in a single 64-bit integer, we should be able to check equality in a single CPU instruction. As we go through the genome, we can keep a rolling hash of the last 20 bases we've seen. To add the next character, we can use simple bitwise operations to remove the highest 2 bits, and append on 2 bits to the end. Our "hash" is a bit of a misnomer - it's just the encoded bases.

Now, to check for an exact match, we can just compare our two 64-bit ints, the guide and the rolling hash.

CPU Time (single guide): 2.61 seconds

That's fast! It only takes 3 seconds to scan 3 billion bases and check equality at every index. Of course, we're not checking for exact matches. We're looking for the 424,996 guides that are within an edit distance of 4 from each of our 200 guides. Let's alter the loop to look for all of the guides simultaneously.

## Solution #3: Hash set for guides

Let's create a hash set `possible_matches`

with an entry for every possible match, for each guide. Remembering that we're using 200 guides, there will be `424,996 * 200 = 84,999,200`

^{5} such matches in `possible_matches`

. Then, we can simply iterate over the genome, and check at every index if a corresponding hash set entry exists.

CPU time: 773.8 seconds

Creating hash set: 41.8 seconds

Searching over genome: 732 seconds

That's weird. We just found that iterating over the entire genome to check equality only took 2.61 seconds. So why does this method take so much longer?

If we examine our loop, we are checking `rolling_hash in possible_matches`

in each iteration. Checking existence in a hash set requires computing the hash of `rolling_hash`

(pardon the naming). This means we need to do 3 billion hash operations as we iterate over the genome. We need to decrease the time taken at each position of the genome.

## Solution #4: Two boolean arrays

Since evaluating the hash function was the bottleneck, let's replace our hash set with a boolean array. `array[key] = True`

if and only if `key in hash_set`

. This means a set of `{1, 2, 5}`

would be an array of `[False, True, True, False, False, True]`

. We can avoid hashing, at the expense of additional memory usage.

Remember that we represent each guide as a number, using the two-bit encoding. In this case, we need a slot in the array for every possible guide. For length 20 guides, we need about 1 terabyte of memory (4^{20} 1-byte booleans)^{6}.

Instead, let's try using two boolean arrays (this is actually the key insight to solving our problem!)^{7}. The first array `first_half_matches`

will correspond to matches for the first 10 bases of any guide, and `second_half_matches`

will correspond to matches for the last 10 bases. Our storage needs drop from 1TB (4^{20} 1-byte booleans) to 2MB (2 * 4^{10} 1-byte booleans).

Remember we'd like to decide if `rolling_hash`

is within edit distance 4 of any guide. If `rolling_hash`

's first half is not in `first_half_matches`

, then its first half is already more than edit distance 4 from any guide and we can abort. The second half follows similarly.

If both halves of `rolling_hash`

are present in their respective arrays, we still need to verify that `rolling_hash`

matches overall. For example, `AAAAAAAAAAAAAAAAAAAA`

matches `AAAAAAACCC`

in the first half and `AAAAAAAGGG`

in the second half but does not match `AAAAAAACCCAAAAAAAGGG`

(edit distance of 6). For now, we'll simply check the `rolling_hash`

against all guides to verify if it's a match.

CPU time: 25.43 minutes

Creating arrays: 0.149 seconds

Searching over genome: 25.43 minutes

We're almost there! Preprocessing the queries is now extremely fast, and uses only 2MB of memory. But searching over the genome takes more than 25 minutes. Iterating over the genome to check equality only took 2.6 seconds, so what's taking so long now?

It turns out that our first `if`

statement (checking `first_half_matches`

and `second_half_matches`

) has too many false positives. On some test data, 2,898,421,200 indices pass the first `if`

check, but only 222,000 are real matches. That means that 96.6% of the 3 billion locations pass our first filter - not much of a filter, is it? The time taken to perform the expensive check takes up more than 99.999% of our CPU time.

Let's alter our approach slightly to fix this.

## Final Solution: Two Integer Arrays

Instead of using a boolean array to denote whether a 10 character string matches the first half or second half of the guide, let's also keep track of the minimum edit distance between the string and any guide. Now, we can change our filter to be more aggressive - if the summed min edit distances of the two halves exceeds 4, we can safely rule out a match.

Our false positive count drops from 2,898,421,200 to 34,706,100 (-98.8%), and our times reflect that:

CPU time: 28.5 seconds

Creating arrays: 0.16 seconds

Searching genome: 28.4 seconds

We now have a serial solution that takes only 29 seconds to complete, down from 3.25 hours!

### Parallelization

After even more optimizations, we were able to bring the serial version down to 8 seconds. This algorithm is also "embarrassingly parallelizable" - we chop the genome into 10 or so chunks, run the search over each chunk, and merge the results together. Empirically, we were able to easily achieve that final **sub-second runtime**.

Here's a quick overview of the key optimizations we made:

- Splitting an exponential space (all possible guides) into two halves
- Replacing hash tables with boolean arrays for faster lookup
- Operating on 64-bit ints instead of strings
- Using a rolling hash to remember the last 20 bases of the genome
- Parallelizing across machines

Our solution has one unexpected benefit as well - we don't need to process the genome beforehand. While not a huge priority, this gives us some flexibility in how we build our infrastructure and makes it easy to support more genomes.

## Conclusion

So there you have it - an analysis that used to take 10-15 minutes now takes a second on Benchling. We're pretty excited to be able to contribute these kinds of improvements to the scientific community, for free.

As far as infrastructure goes, we currently use a bunch of cores on AWS `c4.2xlarge`

boxes for the computations, but one of our interns is in the process of moving onto AWS Lambda for a super scalable (and cheap) setup. More on that in a future post!

## We're Hiring!

And yes, Benchling is hiring engineers to join our small-but-growing team - if you're interested in building high-quality tools to power a new generation of research, see our jobs page or contact us.

Hsu, Lander, Zhang 2014: http://www.cell.com/cell/pdf/S0092-8674(14)00604-7.pdf ↩

There are technically still some regions of the human genome which we are unsure about, and these regions are represented with a different set of letters. We ignore these regions when searching for matches. ↩

Technically, we also need to deal with PAM sites. This is just a minor addition to the final algorithm, though, so I've left it out of this blog post. ↩

The number of possible guides is computed as: (20 choose 4) * 3

^{4}+ (20 choose 3) * 3^{3}+ (20 choose 2) * 3^{2}+ (20 choose 1) * 3^{1}+ (20 choose 0) * 3^{0}= 424996 ↩We could represent this as a bitmap instead of a boolean array. This would cut our memory needs in 8, but we would still need 128GB. In addition, we would need at least 4 operations to check if a particular value exists - one to get the value in the array, two to extract the correct bit, and one to check if it was 0 or 1. ↩

This number actually depends on the guides themselves. Some values may overlap, so 84,999,200 is actually an upper bound on the size of possible_matches. In practice, the actual size will be very close to 84,999,200. ↩

I played around with the number and size of the arrays, and two arrays of 10 seems to be optimal for a 20 base pair guide. The tradeoffs are 1) amount of time to preprocess, 2) memory requirements, 3) number of CPU instructions in the filter, and 4) number of false positives. One interesting note is: even for a size 30 guide, we can ignore the last ten bases and use two arrays of size 10, since we have to check for false positives regardless. ↩