Optimizing the Events API Delivery Pipeline

Rebecca Weinberger
Benchling Engineering
Aug 1, 2022


Last year, we released the Benchling Events API. The Events product is part of our Developer Platform, which gives developers a set of building blocks they can use to build custom extensions on top of Benchling. With Events, you can subscribe to (and subsequently take action on) certain events happening within Benchling. This is a useful pattern when your custom logic needs to be triggered by a particular user action.

As a simple example, if you wanted to build a custom extension that sends you a Slack message when you are assigned a Request within Benchling, your integration would likely subscribe to and act on our request.created event. To date, customers have built integrations using Events that allow Benchling to automatically interface with lab instruments, run advanced custom algorithms, or simply automate away some clicks.

In this post, we’ll dive into the technical architecture of the Events pipeline and walk through the details of how we improved the system’s performance.

Events System Architecture

The original architecture of the Events system is illustrated below:

Original Events pipeline architecture

There are two main components of the pipeline: the section owned by our monolith, and the section within our AWS infrastructure. At a high level, we store the events we emit as rows in a database table, and asynchronously process these event rows before publishing them out to AWS, where they eventually make their way onto an EventBridge event bus for delivery. We use the Celery task queueing system to perform the asynchronous processing and publishing steps.

In more detail, the journey of an event from creation to delivery generally looks like:

  1. A particular user action in Benchling, e.g. creating a Request, triggers the creation of an event to be sent out
  2. A row is created in our Events database table containing the minimum viable amount of information about this event (we call this a dehydrated event)
  3. Our Celery workers are triggered, and read off any new dehydrated events from the database. The dehydrated events are processed, or hydrated, meaning that we serialize them into the full data payload that is eventually sent to customers. The events are then published to our AWS infrastructure.
  4. In this original architecture, events are first published out to an SNS topic, which then forwards them to an SQS queue.
  5. Next, a Lambda function consumes events from the SQS queue, and finally publishes those events to EventBridge.

One thing to highlight here is the intentional design decision around this concept of event (de)hydration. This mechanism is intended to avoid slowing down the in-app user experience up front.

The events that we send out to users must, at some point, be serialized into standardized JSON payloads. If we were to perform this serialization inside the same transaction that created thousands of items, we would be adding cost to the transaction, and effectively causing the user to experience extra slowness for any action in Benchling that fires off events. Instead, we can have that serialization happen asynchronously after the transaction has committed. We implemented this by creating a lightweight version of the event in the original transaction, and leveraging asynchronous Celery tasks to subsequently perform the serialization.
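As a rough sketch of this pattern (the model, field, and helper names here are hypothetical placeholders, not Benchling's actual schema), the in-transaction work is limited to writing a small event row, and a Celery task later fills in the full payload:

```python
# Sketch of the dehydrate/hydrate flow. Event, Request, db, and
# serialize_event_payload are hypothetical placeholders.
from celery import shared_task

def create_request(request_data):
    request = Request(**request_data)
    db.session.add(request)
    # Inside the user's transaction, write only a lightweight "dehydrated"
    # event row; no expensive serialization happens here.
    db.session.add(Event(
        event_type="request.created",
        resource_id=request.id,
        status="dehydrated",
    ))
    db.session.commit()
    # Kick off asynchronous hydration after the transaction commits.
    process_events.delay()

@shared_task
def process_events():
    # Hydrate outside the user-facing transaction: serialize each dehydrated
    # row into the full JSON payload that will eventually be delivered.
    events = Event.query.filter_by(status="dehydrated").order_by(Event.created_at)
    for event in events:
        event.payload = serialize_event_payload(event)
        event.status = "processed"
    db.session.commit()
```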

Defining Events SLOs

By definition, an SLO (Service Level Objective) is an agreement between the service provider (i.e. Benchling) and the customer about a specific metric of a service. In the case of Events, we set out to define SLOs around event delivery latency: specifically, the amount of time between an event being triggered in Benchling and its delivery into the customer’s infrastructure.

When defining our target SLOs for Events, we wanted to find a balance between what we thought was reasonably attainable and what we thought customers would expect when interacting with the product. Intuitively, we knew that event delivery should feel snappy — a big selling point of the Events product was to enable real-time integrations, and so naturally, large amounts of latency would devalue that experience. At the same time, we weren’t looking to do a complete overhaul or re-architecture of the system.

Additionally, the existing system made some guarantees around event delivery that we needed to continue to maintain: (1) we guarantee a global, stable ordering of all events emitted, and (2) we guarantee at-least-once delivery of each event.

Keeping these constraints in mind, we came up with an initial definition of our target SLOs. At this stage of the project, these SLOs were more or less a best-guess definition. We defined our SLOs in terms of the volume of events flowing through the pipeline, as (somewhat expectedly) we had observed drastically higher latency for bulk batches of events being created in Benchling. The initial SLO definitions were as follows:

  • P95 delivery time for a single event: <1 second
  • P95 delivery time for a bulk batch of 1,000 events: <10 seconds
  • P95 delivery time for a bulk batch of 10,000 events: <20 seconds

Once we had defined our target to hit, the next step was to understand where we were starting from — for each of these SLOs, how far off the target were we? Measuring this differential was actually pretty difficult, especially in the larger bulk cases, because the system behaved so inconsistently. In some cases of our stress testing, we observed faulty logic causing latency spikes on the order of hours! With that said, our initial measurements were roughly as follows:

  • Delivering a single event took up to 5 seconds
  • Delivering a bulk batch of 1000 events took up to 5 minutes
  • Delivering a bulk batch of 10,000 events was very inconsistent, and could take anywhere from 15 minutes to a couple hours

Our work was cut out for us!

Profiling the Pipeline

Before making any code changes, the first action we took was to gather profiling data for the Events pipeline. At Benchling, we use SumoLogic and Datadog as platforms for aggregating data and creating dashboards to dig into it. For the Events work, it was extremely helpful to segment the pipeline into distinct steps, and profile each one. This way, we could quickly tell which steps were taking the longest and be intentional about focusing our efforts on mitigating those problem areas.

To do this, we instrumented the code at several points along the pipeline that roughly follow the steps outlined in the Events System Architecture above. The main segments into which the pipeline was divided were:

  1. The amount of time between event creation and the start of the asynchronous processing task — in other words, the time taken to kick off the pipeline
  2. The time taken by the task for processing events
  3. The time taken by the task for publishing events
  4. The total time spent in our AWS infrastructure before handing the event off to the customer
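One way to capture this kind of segment-level timing is a small context manager around each step. The sketch below assumes a statsd-style Datadog client and illustrative metric names, not Benchling's actual instrumentation:

```python
import time
from contextlib import contextmanager

from datadog import statsd  # assumes a configured Datadog statsd client

@contextmanager
def timed_segment(segment_name):
    """Record how long one segment of the events pipeline takes."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        # Metric name is illustrative, not the real dashboard metric.
        statsd.timing(f"events_pipeline.{segment_name}", elapsed_ms)

# Usage inside the pipeline, one block per segment:
# with timed_segment("process_events"):
#     process_events()
# with timed_segment("publish_events"):
#     publish_events()
```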

From our segmentation of the pipeline, we learned that the slowest parts of our pipeline were by far our two asynchronous tasks, process_events and publish_events, especially in bulk cases with thousands of events being created simultaneously. This wasn’t necessarily surprising, as these tasks did the bulk of the work needed to emit an event payload. The tasks did not scale well, and often struggled to keep up with the volume of events passing through. Nonetheless, this was a very helpful learning that gave us clarity on where to focus the bulk of our optimization efforts.

Addressing Bottlenecks

Process task

In order to optimize the process_events task, we began by using the Speedscope profiling tool to get a granular look at the task:

Flame graph profile of the process_events task

As a refresher, the process_events task is responsible for serializing events into the data payload that we deliver to customers. The section boxed in red on the flame graph represents the time taken by that serialization. The rest of the task’s time is primarily taken up by committing to the database, shown as the blue segments to the right of the red box.

It was clear to us that the data serialization step was far too slow, and that we would need to parallelize that work in order to hit our SLOs. The original architecture had a single Celery task responsible for processing all outstanding events in the database. In order to parallelize, we could leverage the asynchronous nature of the Celery system, and treat tasks more like individual workers.

As a first pass, we tried the most naive strategy for chunking up the work and assigning it to worker tasks: given a set of event rows that needed to be processed, just split them up evenly and pass each chunk of work to a worker.

Naive strategy for chunking up work
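In Celery terms, this naive approach looks roughly like the following (a minimal sketch; the chunk size, task, and helper names are illustrative):

```python
# Naive chunking: split the pending events evenly and fan out one task per chunk.
from celery import shared_task

CHUNK_SIZE = 500  # illustrative

@shared_task
def process_event_chunk(event_ids):
    hydrate_events(event_ids)  # hypothetical: serialize just this slice of events

def process_events_naive(pending_event_ids):
    for start in range(0, len(pending_event_ids), CHUNK_SIZE):
        process_event_chunk.delay(pending_event_ids[start:start + CHUNK_SIZE])
```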

This strategy had a number of problems and actually resulted in performance just as slow (or in some cases, even slower!) as our serial starting point:

  1. This strategy added an overhead cost of splitting the work into chunks and starting up a Celery task for each one
  2. Celery workers are shared between services, so we couldn’t guarantee that there would always be free workers to work on the chunked data in a timely manner
  3. Most significantly, with this strategy, delivery had to block on the first chunk of work being completed, because those events needed to be sent out first. It was possible for the whole pipeline to stall if the worker processing those first events was slow or deprioritized.

With these issues in mind, we designed a more thoughtful parallelization strategy. The core idea is to always be making progress at the front of the queue of events to be delivered. To do this, we have a single serial worker constantly working its way through the events in order from oldest to newest. The parallel workers then pick up chunks of work opportunistically from the back of the queue. When the serial worker sees a chunk of completed work, it can safely mark those events as done and skip over them.

Optimized strategy for chunking up work

With this strategy, the worst case is guaranteed to be at least as good as the original approach: if no async workers are available, or if they unexpectedly fail, there is still a single serial worker processing events. Additionally, the serial worker guarantees that we are always making progress on the most immediate set of events to be delivered. The async workers help out at the back of the queue, and those volatile workers no longer stand in the critical path.
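A simplified sketch of the idea follows; the query and chunk-claiming helpers are hypothetical, and a real implementation also has to lock the rows each worker claims and handle worker failures:

```python
# Serial worker plus opportunistic parallel workers.
from celery import shared_task

@shared_task
def serial_process_events():
    # A single worker always makes progress at the front of the queue,
    # walking events from oldest to newest.
    for event in events_oldest_first():  # hypothetical query helper
        if event.is_hydrated:
            # A parallel worker already handled this one; mark it done and skip ahead.
            continue
        hydrate(event)  # hypothetical serialization step

@shared_task
def opportunistic_process_chunk():
    # Parallel workers claim chunks from the *back* of the queue, so a slow
    # or unavailable worker never blocks delivery of the oldest events.
    for event in claim_unhydrated_chunk_from_back():  # hypothetical, must lock claimed rows
        hydrate(event)
```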

Our second parallelization strategy yielded much better results, and significantly improved our process task’s latency in production environments.

Publish task

We also started with a flame graph profile for the publishing task. This task is responsible for picking up events that have been processed by the previous task and publishing them — in order — to AWS SNS.

Flame graph profile of the publish_events task

The section boxed in red on the flame graph is the cumulative time taken up in this task by network calls to SNS. This box dominates the profile. Round trip network calls are slow, and when repeatedly making thousands of calls, the overall latency quickly becomes significant.

We had a few initial ideas for dealing with these slow network calls:

  1. Parallelize this task just like we did with the previous one
  2. Avoid waiting for the network response of each API call
  3. Batch network calls together

As it turns out, though, both 1 and 2 were made non-viable by our in-order delivery guarantee. Parallelization of these network calls would throw that ordering out the window. In a similar vein, if we were to make API calls without waiting for and properly handling each response (e.g. retrying if needed), we could get into a situation where a call fails but subsequent ones succeed. This, too, would violate the ordering guarantee.

So, by process of elimination, we pursued option 3. By sending multiple events per API call, we could reduce the total number of calls by an order of magnitude (depending on batch size), and implementing the code change to batch calls together seemed like a small lift. This hypothesis ended up being mostly true, with one hiccup along the way: SNS doesn’t actually support receiving messages in batches, so we were out of luck if we wanted to communicate directly with SNS.

In the original Events architecture, we planned to use SNS to fan out messages to other event-driven interfaces beyond just Events (e.g. webhooks). Upon further reflection, however, we realized two things: first, if we were to create a Benchling webhook interface, we’d likely build a separate pipeline for it rather than rely on the existing Events pipeline; second, for the specific purposes of the Events pipeline, this middleman SNS step was unnecessary. Furthermore, SQS does support receiving batches of up to 10 messages at a time. By removing SNS from the architecture and instead sending batches of events directly to SQS, we could both remove an extraneous step from the pipeline and cut the number of network calls by up to 10x. This is what we ended up implementing, and while the observed speedup in production didn’t quite reach a factor of 10 due to the fixed overhead of each API call, it came close!
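The batching itself is straightforward with boto3’s send_message_batch, which accepts up to 10 messages per call. A minimal sketch, with a placeholder queue URL and payload shape:

```python
# Batched publishing to SQS; the queue URL and payload shape are placeholders.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events-queue"  # placeholder

def publish_events_in_batches(event_payloads):
    # SQS accepts at most 10 messages per SendMessageBatch call.
    for start in range(0, len(event_payloads), 10):
        batch = event_payloads[start:start + 10]
        response = sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                # Id only needs to be unique within a single batch call.
                {"Id": str(i), "MessageBody": json.dumps(payload)}
                for i, payload in enumerate(batch)
            ],
        )
        # A real implementation would inspect the failed entries and retry
        # before moving on, to preserve the delivery guarantees.
        if response.get("Failed"):
            raise RuntimeError(f"Failed to publish {len(response['Failed'])} events")
```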

Conclusion and Next Steps

In this post, we took a detailed look at two of the main optimizations that we implemented as part of the larger initiative to define and attain SLOs for the Events product. There were many other smaller-scale optimizations that went into this effort as well. Ultimately, we ended up coming close to our initially-defined SLOs:

Event latency before and after optimizations

While there was a massive improvement over the original state of the system, we ended up settling in a state that didn’t quite hit the original goal. In the end, there were a few reasons for this.

  1. In retrospect, our initially-defined SLOs were too optimistic about the latency increase when going from 1000 → 10,000 events (10 seconds → 20 seconds). This would represent a sublinear rate of latency growth, which realistically was going to be very tough to achieve with our architecture.
  2. The optimizations were limited by the constraints of the existing Benchling system. Because the scope of this project did not extend to include massive system-wide overhaul, we had to operate within those constraints. One example is the Celery system — in an ideal world for parallelization, we’d have tons of dedicated workers available to immediately pick up work, but because we are limited to just a few workers, we had to get creative with how we parallelized.
  3. Ultimately, we embarked on this project to improve our end users’ experience of using the Events product. After a bit more customer research, we realized that after our improvements to the product, it actually held up to our customers’ usability expectations, even though we hadn’t quite hit our performance target.

The optimizations we implemented in this project have left our Events pipeline in a good state for our customers. However, as Events and overall Benchling usage continue to expand, we are aware that there will likely come a time to revisit the performance of the Events pipeline; with performance, there’s always room for improvement! When that time comes, perhaps it will be the right time to invest in an ambitious and complete overhaul of the system.

We’re hiring!

If you’re interested in working with us to build the future of biotech R&D platforms, check out our careers page or contact us!
