Modeling Science as a Directed Graph

Forest Tong
Benchling Engineering
Nov 11, 2017

--

Science is complex, messy, and beautiful. And it’s full of graphs. In evolutionary biology, we might study the graph of ancestry. The scientific process itself is a graph. And when a large organization collaborates towards a scientific goal, there evolves a kind of meta-graph of the relationships between teams. Graphs are a powerful way of extracting structure out of a chaotic system.

Benchling models the scientific experiment as a directed acyclic graph called a workflow. Each node in this graph represents a step of the experiment, and each edge represents a biological sample that was produced in the previous step and will be consumed by the next step. In other words, this is the graph of what you do and what you do it to.
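This node-and-edge model can be sketched in a few lines of Python. This is an illustrative toy, not Benchling's actual data model; the `Step` and `Sample` names are invented for the example:

```python
from dataclasses import dataclass

# A minimal sketch of a workflow DAG: nodes are experiment steps,
# edges are the samples that one step produces and a later step consumes.

@dataclass
class Step:
    name: str

@dataclass
class Sample:
    name: str
    produced_by: Step   # the step this sample came out of
    consumed_by: Step   # the step this sample goes into

mix = Step("Mix ingredients")
bake = Step("Bake")
taste = Step("Taste test")

# Each sample is an edge connecting two steps of the experiment.
dough = Sample("Dough batch 1", produced_by=mix, consumed_by=bake)
cookies = Sample("Cookies batch 1", produced_by=bake, consumed_by=taste)
```

Reading an edge end to end recovers the sentence from the text: what you do (the steps) and what you do it to (the sample).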

A Benchling workflow for antibody discovery. The direction of the edges is from left to right.

Later in this blog post, we’ll explore how workflows solve today’s problem of data fragmentation in the biotech industry and drive scientific innovation. But first, let’s start with a simple goal: to bake the best cookies in the world.

How to bake the best cookies in the world

If this were the Middle Ages, we would probably use our grandma’s grandma’s grandma’s recipe, chant an incantation, and pray for the best. Our graph could be modeled by a single node: just bake the cookies.

But it’s not the Middle Ages, and we want the best cookies in the world. So what creates the best cookies? Let’s investigate the effect of temperature with an experiment, where we try baking cookies at three different temperatures to see which one’s the best.

Now we’re getting serious. Which cookie tastes best is a highly personal judgment, so for statistical rigor, we can’t just taste the cookies ourselves; we need a team of cookie tasters. We hire Alice and Bob to help us.

Unfortunately, a problem has come up — the dough is too lumpy! John says the problem is that Jane isn’t sifting the flour. Jane says the problem is that John hasn’t sufficiently beaten the eggs. We need more granularity in our process. So let’s track the wet and dry ingredients separately, and measure their lumpiness individually before combining.

But mixing isn’t a single step; it’s a process of repeated stirring. To understand the effect of stirring on lumpiness, let’s model each stir as a separate step. Here’s where you might imagine that the graph would become cyclic, but because you can’t return to a past moment in time, the repetition unrolls into a linear path.
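The unrolling can be made concrete with a toy sketch (the step names here are invented): each stir becomes its own node, so "stir three times" is not a cycle back to one node but three successive nodes in time order.

```python
# Repetition unrolls into a linear path through time:
# Combine -> Stir #1 -> Stir #2 -> Stir #3 -> Measure lumpiness
stir_steps = [f"Stir #{i}" for i in range(1, 4)]
path = ["Combine ingredients", *stir_steps, "Measure lumpiness"]
```

The graph stays acyclic because each node represents a step *at a moment in time*, not a reusable procedure.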

A few of our customers got sick eating the cookies. So we decided to institute Quality Control. We need to test for bacteria to keep customers healthy, and so that, if we detect a bad batch early, we can save the cost of the downstream baking and tasting.

And this is just the beginning. We’ll need controls to calibrate the tasters and test for placebo effects. We’ll want to try out different recipes. As we grow, we’ll massively parallelize this process, so that thousands of cookies are baked at a time. What started off as a single node quickly blossoms into a mind-bogglingly complex graph.

What’s the point of the experimental graph?

The purpose of every experiment is to answer some question about the world. If we’re Galileo dropping balls from the Leaning Tower of Pisa, a two-dimensional table in our personal notebook is sufficient to answer the question of mass versus time to fall.

But as an experimental process scales up in complexity, variables tested, and parallelization, often the most important questions involve graph traversals. In particular, they involve traversing downstream to the biological samples produced from a given sample, or upstream to the samples that led to a given sample, and linking those samples back to the steps that operated on them. Here are a few examples:

  • What baking temperature produced the tastiest cookies, controlling for taster?
  • A shipment of eggs turned out to be rotten. What are all the cookies that we’re going to have to recall?
  • What’s the bottleneck in efficiency — mixing the wet ingredients or mixing the dry ingredients?

Put another way: these questions are database queries that JOIN across a graph of tables. Take the question of which cookies to recall. From the eggs, we need to find the wet batter made from those eggs; from the wet batter, we need to find the resulting dough; from the dough, we need to find the resulting cookies; from the cookies, we need to find the affected batches.
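The recall question is a reachability query: starting from the rotten eggs, follow the "derived from" edges downstream and collect everything you reach. Here is a minimal sketch using breadth-first search; the sample names and graph shape are invented, and a production system would run this as database queries rather than an in-memory dict:

```python
from collections import deque

# Downstream edges: each sample maps to the samples derived from it.
# The names and structure here are purely illustrative.
derived_from = {
    "eggs-042": ["wet-batter-7"],
    "wet-batter-7": ["dough-3"],
    "dough-3": ["cookies-A", "cookies-B"],
    "cookies-A": [],
    "cookies-B": [],
}

def downstream(graph, start):
    """Breadth-first traversal: every sample derived from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

recall = downstream(derived_from, "eggs-042")
# recall contains the batter, the dough, and both cookie batches
```

Each hop of the traversal corresponds to one of the JOINs described above: eggs to batter, batter to dough, dough to cookies.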

The ability to answer these questions is crucial in order to learn from our processes and get better and better results. These are the questions that Benchling can answer for you.

Science, not cookies

Our customers are not making graphs to bake cookies — they’re making life-changing scientific discoveries. One customer is developing cures for cancer by genetically engineering the body’s T cells to attack cancerous cells. Voyager is performing gene therapy to treat advanced Parkinson’s disease. Editas is harnessing CRISPR genome editing to cure rare eye disorders.

CRISPR has the power to correct fatal mutations in the human genome.

The experiments these scientists conduct are analogous in many ways to our example of baking cookies. Instead of varying the baking temperature, you might vary the concentration of an antibody. Instead of mixing wet and dry ingredients, you might mix the backbone and insert DNA, which together combine to form a plasmid.

And the questions you’ll need to ask are analogous as well:

  • Which screening method was most successful for producing antibodies that resulted in drug candidates?
  • What samples were used to produce the plasmid I’ve just been given to process in this experiment?
  • As a PI, which of this workflow’s steps is taking the longest amount of time, and what can I do to unblock my team?

Innovative science requires innovative processes. The more efficiently you can discover and release a therapy, the less costly it is to consumers, and the sooner you can start on the next challenge. Today’s biotech organizations are increasingly moving towards a world where the tedious manual labor in labs is fully automated.

Traversing a fragmented graph

The tragic thing is that scientists cannot answer these questions with today’s tools. The status quo of research tools is a dizzying combination of paper notebooks, Excel, legacy software, emails, memory, and past conversations. An Excel sheet works for the first iteration of our cookie process, even the second and the third — but by the time we have a complex graph, the tabular format of spreadsheets falls hopelessly short.

Data fragmentation obscures the big picture and makes the process tedious and error-prone.

Today’s trend is towards the collaboration of thousands of scientists, not only within a team or across teams, but across organizations around the world. Hours of painstaking manual labor are spent digging up the right data, cross-referencing, de-duplicating, extrapolating. Scientists are literally doing graph traversal by hand. These are extremely intelligent, skilled researchers stuck doing busy work. Just imagine the amount of intellectual capital the world loses by forcing scientists to spend 40% of their time on busy work rather than science.

Benchling powers effortless graph traversal.

And experiment analysis is only half the story. The other half is experiment execution. When the graph is sufficiently specified and autonomous instruments hook into our API, all the intellectual work has been done and the control flow of the program has been defined. All that’s left is to execute the experiment at the push of a button.

At Benchling, we believe that if a question can be answered with data, it should be answered — and it should be answered by a computer, not with the precious amount of time that a scientist has. Instead of typing, pipetting, copy-pasting, let scientists ideate, analyze, dream.

Excited to accelerate scientific innovation? We’re hiring.

Discuss on Hacker News
