A visual guide to nucleotide chemical structures

Published in Benchling Engineering

10 min read · Nov 21, 2022

Row of 21 circles representing nucleotides in a modified RNA sequence. The circle labeled “U” has a large tooltip underneath with the text “Sugar 2-O-Methylribose”, “Base Uracil”, and “3' Phosphate Phosphate”. The tooltip also contains an image of the chemical structure of the modified nucleotide. — Passenger strand of patisiran (the first siRNA drug that got FDA approval!) with a tooltip showing the chemical structure of one of the modified nucleotides.

Introduction

Benchling’s mission is to unlock the power of biotechnology. Benchling provides a platform used by 200,000+ scientists to centralize data, improve collaboration and insights, and accelerate the path to scientific discoveries. This summer, I joined the Modified Biologics and Chemistry team at Benchling as a software engineering intern. One of the goals of the team is to make Benchling chemically aware, which includes adding chemical modifications to DNA and RNA sequences and creating a new small molecule entity type on Benchling.

For my intern project, I worked on making our product more user-friendly for chemists. Some of our customers want to be able to visualize the actual chemical structure of what they’re working with, as they often can’t recognize a molecule by its name or a string representation of its chemical structure. They can more easily identify and design molecules through their visual chemical structure.

To cater to chemists, we want our product to surface chemical structure images. This will allow chemists to work more easily in Benchling, whereas before they had to use multiple tools to design monomers, nucleotides, or sequences. For my project, I built out a feature that surfaces chemical structure for monomers and nucleotides in DNA and RNA sequences.

✨ Our final interactive, connected, consistently oriented, nucleotide image! ✨

Before diving into the technical details of how I generated chemical structure images for DNA and RNA sequences, here’s some background on modified biologics.

Background

What’s a nucleotide?

A nucleotide is a molecule consisting of a base, a sugar, and a phosphate monomer connected together. DNA and RNA are long strands of many nucleotides bonded together.

In Benchling, a monomer is a molecule with attachment points that can be bonded to other monomers. I like to think of monomers as puzzle pieces and attachment points as the tabs that determine where they can be joined. Connecting monomers together at their specific attachment points creates a nucleotide.

Labeled chemical structure of the natural RNA nucleotide guanosine monophosphate. The phosphate is highlighted in a box labeled “phosphate”. The sugar is highlighted in a box labeled “ribose”. The base is highlighted in a box labeled “guanine”. — A nucleotide (like this Guanosine monophosphate) is made up of a sugar, a base, and a phosphate.

From an engineering standpoint, monomers are interchangeable with other monomers of the same component type (sugar, base, or phosphate). Benchling models a monomer as an entity with a chemical structure and a monomer type, and a nucleotide as an entity composed of one monomer of each type.

Try swapping out the monomers!
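In code, swapping a monomer amounts to replacing one field of the nucleotide. The sketch below is a minimal illustration of that model, not Benchling's actual schema: the class and field names are hypothetical, and the uracil SMILES was written by hand for this example. The other three SMILES strings are the ones shown in the SMILES section of this post.

```python
from dataclasses import dataclass, replace
from enum import Enum


class MonomerType(Enum):
    SUGAR = "sugar"
    BASE = "base"
    PHOSPHATE = "phosphate"


@dataclass(frozen=True)
class Monomer:
    name: str
    monomer_type: MonomerType
    smiles: str  # chemical structure with labeled attachment points


@dataclass(frozen=True)
class Nucleotide:
    sugar: Monomer
    base: Monomer
    phosphate: Monomer

    def __post_init__(self):
        # Each slot only accepts a monomer of the matching component type.
        slots = {
            "sugar": MonomerType.SUGAR,
            "base": MonomerType.BASE,
            "phosphate": MonomerType.PHOSPHATE,
        }
        for slot, expected in slots.items():
            actual = getattr(self, slot).monomer_type
            if actual is not expected:
                raise ValueError(f"{slot} slot cannot hold a {actual.value} monomer")


ribose = Monomer(
    "ribose", MonomerType.SUGAR, "O[C@H]1[C@H]([OH:3])O[C@H](CO[H:1])[C@H]1O[H:2]"
)
guanine = Monomer("guanine", MonomerType.BASE, "Nc1nc2n([H:1])cnc2c(=O)[nH]1")
phosphate = Monomer("phosphate", MonomerType.PHOSPHATE, "OP([OH:1])([OH:2])=O")
# Assumption: hand-written SMILES for illustration only.
uracil = Monomer("uracil", MonomerType.BASE, "O=c1ccn([H:1])c(=O)[nH]1")

gmp = Nucleotide(sugar=ribose, base=guanine, phosphate=phosphate)
# Swap the base; the sugar and phosphate are reused unchanged.
ump = replace(gmp, base=uracil)
```

Because `replace` re-runs the type check in `__post_init__`, trying to put a sugar into the base slot raises a `ValueError` instead of producing an invalid nucleotide.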

SMILES 🙂

SMILES (simplified molecular-input line-entry system) is a string-based notation that represents the chemical structure of a molecule. The notation can encode atom labels and stereochemistry, but not the 2D rotation of a molecule.

SMILES for monomers also specify their attachment points by labeling a group of atoms with the R-group number. The three different types of monomers in Benchling have different numbers of attachment points. Sugar monomers have an R1, R2, and R3. Bases have just an R1, while phosphates have an R1 and R2.

Row of three monomers, each with a SMILES string and a chemical structure image underneath, both with highlighted R-groups. The SMILES for ribose is “O[C@H]1[C@H]([OH:3])O[C@H](CO[H:1])[C@H]1O[H:2]” with [OH:3], [H:1], and [H:2] highlighted. The SMILES for guanine is “Nc1nc2n([H:1])cnc2c(=O)[nH]1” with [H:1] highlighted. The SMILES for phosphate is “OP([OH:1])([OH:2])=O” with [OH:1] and [OH:2] highlighted. — SMILES and structures for three natural monomers with their capping groups highlighted.

The capping group of an attachment point is the functional group in the monomer (almost always a H or OH) that’s used if no other monomers are connected to the attachment point. In the figure above, the sugar’s OH labeled R3 is the sugar’s R3 capping group and the base’s H labeled R1 is its R1 capping group. When forming a nucleotide, we replace those capping groups with a single chemical bond between the monomers.
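To make the labeling concrete, here is a small sketch that reads the R-group labels and capping groups out of a monomer SMILES string with a regular expression. It is a simplification for illustration, not how RDKit or Benchling parse SMILES: the pattern and the function name `capping_groups` are assumptions, and the pattern only understands simple labeled atoms such as `[H:1]` or `[OH:3]` (no charges, isotopes, or aromatic atoms).

```python
import re
from typing import Dict

# A bracket atom that carries a label: element symbol, optional hydrogen
# count, then ":<R-group number>". Example matches: [H:1], [OH:3], [NH2:3].
LABELED_ATOM = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)(?P<hydrogens>H\d*)?:(?P<r_group>\d+)\]"
)


def capping_groups(smiles: str) -> Dict[int, str]:
    """Map each R-group number to the text of its capping group."""
    groups: Dict[int, str] = {}
    for match in LABELED_ATOM.finditer(smiles):
        r_group = int(match["r_group"])
        if r_group in groups:
            raise ValueError(f"R{r_group} is labeled more than once")
        groups[r_group] = match["symbol"] + (match["hydrogens"] or "")
    return groups


ribose_caps = capping_groups("O[C@H]1[C@H]([OH:3])O[C@H](CO[H:1])[C@H]1O[H:2]")
guanine_caps = capping_groups("Nc1nc2n([H:1])cnc2c(=O)[nH]1")
phosphate_caps = capping_groups("OP([OH:1])([OH:2])=O")
```

For the three natural monomers above, this recovers three attachment points on the sugar, one on the base, and two on the phosphate. Unlabeled bracket atoms such as `[C@H]` and `[nH]` are ignored because they have no `:<number>` suffix.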

Algorithm for generating nucleotide images

We use the RDKit, an open source cheminformatics library, to render chemical structure images.

For a single nucleotide, the base’s R1 attaches to the sugar’s R3 and the phosphate’s R1 attaches to the sugar’s R2. If we wanted to connect multiple nucleotides together, we would attach the first nucleotide’s phosphate’s R2 to the next nucleotide’s sugar’s R1.

Chemical structures of the base guanine, sugar ribose, and phosphate phosphate, with R-groups labeled. There is a highlighted box around the base’s R1 [H] and the sugar’s R3 [OH]. There is another box around the sugar’s R2 [H] and the phosphate’s R1 [OH]. — Base, sugar, and phosphate monomers with their attachment points labeled and the R-groups highlighted.

Connecting monomers together requires some simple graph operations:

1. Creating monomer molecular graphs from SMILES

The first step is to create molecular graphs for the monomers from their SMILES strings and label the monomers’ attachment points. RDKit represents molecules as a graph data structure with atoms as vertices and bonds as edges.

We use RDKit to convert each of the monomers’ SMILES into a molecular graph. RDKit also lets us combine two molecules into a disconnected graph by taking the disjoint union of their molecular graphs.

Our next steps are to make the nucleotide a connected graph by adding edges between its disconnected components.

2. Connect the nucleoside (base + sugar)

The nucleoside is the base and sugar bonded together. To create the nucleoside, we have to replace the capping groups on the base’s R1 and the sugar’s R3 attachment points with a single bond between the monomers.

By definition, monomer capping groups are attached to the rest of the monomer with a single bond. Since there’s exactly one bond (edge) to the capping group, we know the attachment point is the only neighboring atom (vertex) to the capping group atom.

Three structure images side-by-side. The first contains disconnected guanine, ribose, and phosphate monomers with the base’s R1 and sugar’s R3 groups crossed out. The second image is labeled “Connect nucleoside, Sugar + Base,” now with a dashed bond between the sugar’s R3 and base’s R1 groups and the sugar’s R2 and phosphate’s R1 groups crossed out. The last image is labeled “Connect nucleotide, Nucleoside + Phosphate” and now has a solid bond between the sugar’s R2 and phosphate’s R1. — The stages of connecting three monomers into a nucleotide. Left: all three monomers with base’s R1 and sugar’s R3 capping groups removed. Middle: nucleoside with R2 capping group and phosphate R1 capping group removed. Right: completely connected nucleotide.

What if the capping group contains more than one atom?
If the capping group is a hydroxyl group (OH), how do we know which of the two atoms to remove? Like other cheminformatics libraries, RDKit uses hydrogen suppression, meaning implicit hydrogens like the Hs in [OH:2] or [NH2:3] aren’t included in the molecular graph. The library can deterministically calculate the number of hydrogens attached to an atom when rendering. From a code perspective, the capping group atom is just the oxygen.
Also, the SMILES specification disallows multiple atoms with the same atom label, guaranteeing that there’s at most one atom labeled with any R-group number. For example, it’s legal to write [H:1] and [CH:3], which have only one non-suppressed atom. But it’s illegal to label constituents with multiple non-H atoms like [CC:2] or [C(=O)OH:1]. This constraint is acceptable because in practice, our customers’ monomers only have H and OH capping groups.
If hydrogens are suppressed, how do you find a sole hydrogen capping group?
RDKit doesn’t suppress explicit Hs. When we include a single labeled hydrogen atom in a monomer’s SMILES string with [H:1], RDKit does include that hydrogen in the molecular graph.

3. Connect the nucleotide (nucleoside + phosphate)

Finally, we connect the phosphate to the nucleoside with the same process. We remove the phosphate R1 and sugar R2 capping groups and add a single bond between them.
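The two connection steps can be sketched without any chemistry library. The code below is a toy model under stated assumptions, not the RDKit implementation: the `Atom` class, the helper names, and the atom ids are all hypothetical, and each monomer is cut down to the few atoms involved in bonding. An OH capping group is stored as just its oxygen (hydrogen suppression), while an H capping group is stored as an explicit hydrogen.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class Atom:
    monomer: str
    symbol: str
    r_group: Optional[int] = None  # set only on capping-group atoms


Atoms = Dict[str, Atom]
Bonds = Set[FrozenSet[str]]


def bond(a: str, b: str) -> FrozenSet[str]:
    return frozenset((a, b))


def find_cap(atoms: Atoms, monomer: str, r_group: int) -> str:
    """Return the id of the single atom labeled with this R-group."""
    matches = [
        atom_id
        for atom_id, atom in atoms.items()
        if atom.monomer == monomer and atom.r_group == r_group
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one R{r_group} on {monomer}")
    return matches[0]


def sole_neighbor(bonds: Bonds, atom_id: str) -> str:
    """A capping group has exactly one bond, so exactly one neighbor."""
    neighbors = [o for b in bonds if atom_id in b for o in b if o != atom_id]
    if len(neighbors) != 1:
        raise ValueError(f"{atom_id} is not a valid capping group")
    return neighbors[0]


def connect(atoms: Atoms, bonds: Bonds, cap_a: str, cap_b: str) -> Tuple[Atoms, Bonds]:
    """Replace two capping groups with one bond between their attachment points."""
    anchor_a = sole_neighbor(bonds, cap_a)
    anchor_b = sole_neighbor(bonds, cap_b)
    kept_atoms = {i: a for i, a in atoms.items() if i not in (cap_a, cap_b)}
    kept_bonds = {b for b in bonds if cap_a not in b and cap_b not in b}
    kept_bonds.add(bond(anchor_a, anchor_b))
    return kept_atoms, kept_bonds


def is_connected(atoms: Atoms, bonds: Bonds) -> bool:
    start = next(iter(atoms))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for b in bonds:
            for other in b:
                if current in b and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return len(seen) == len(atoms)


# Step 1: disjoint union of three heavily simplified monomer graphs.
atoms: Atoms = {
    "base.N": Atom("base", "N"),
    "base.cap1": Atom("base", "H", r_group=1),
    "sugar.C1": Atom("sugar", "C"),
    "sugar.cap3": Atom("sugar", "O", r_group=3),
    "sugar.O3": Atom("sugar", "O"),
    "sugar.cap2": Atom("sugar", "H", r_group=2),
    "phos.P": Atom("phosphate", "P"),
    "phos.cap1": Atom("phosphate", "O", r_group=1),
    "phos.cap2": Atom("phosphate", "O", r_group=2),
}
bonds: Bonds = {
    bond("base.N", "base.cap1"),
    bond("sugar.C1", "sugar.cap3"),
    bond("sugar.C1", "sugar.O3"),
    bond("sugar.O3", "sugar.cap2"),
    bond("phos.P", "phos.cap1"),
    bond("phos.P", "phos.cap2"),
}
initially_connected = is_connected(atoms, bonds)

# Step 2: nucleoside = base R1 joined to sugar R3.
atoms, bonds = connect(
    atoms, bonds, find_cap(atoms, "base", 1), find_cap(atoms, "sugar", 3)
)
# Step 3: nucleotide = phosphate R1 joined to sugar R2.
atoms, bonds = connect(
    atoms, bonds, find_cap(atoms, "phosphate", 1), find_cap(atoms, "sugar", 2)
)
```

The phosphate's R2 capping group is left in place, which is where the next nucleotide's sugar would attach in a longer sequence.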

Handling edge cases

This process works great for nucleotides with all three monomers, but in some special cases, a nucleotide doesn’t have all three monomers.

A degenerate base is how biologists represent a position that could be any one of several bases, rather than a single A/T/G/C. Since there’s no single structure for a degenerate base, we simply don’t render the base.

A null terminal phosphate is when the last nucleotide in a sequence has no attached phosphate. Most biological use cases for oligos do not include a terminal 3’ or 5’ phosphate. Sequences with a terminal phosphate require scientists to specially request phosphorylated oligos from their oligonucleotide manufacturer.

Benchling has a variety of customers with diverse scientific needs. Our goal is to be as flexible as possible to support them. When rendering nucleotides with a missing monomer, we display a partially rendered chemical structure along with a message that the structure isn’t complete.

Row of five circles representing nucleotides in a modified RNA sequence. The circle labeled “U” has a large tooltip underneath with the text “Sugar Ribose”, “Base Uracil”, and “3' Phosphate None”. The tooltip also contains an image of the nucleoside chemical structure, with no attached phosphate. — Chemical structure image with null terminal phosphate.
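One way to think about these edge cases is to decide up front which monomers can be drawn and what message to show. The sketch below assumes a missing monomer is passed as `None`; the function name `plan_render`, the `RenderPlan` type, and the message wording are hypothetical and only illustrate the behavior described above.

```python
from typing import List, NamedTuple, Optional


class RenderPlan(NamedTuple):
    monomers: List[str]  # monomers that will actually be drawn
    message: Optional[str]  # shown to the user when the structure is partial


def plan_render(
    sugar: Optional[str], base: Optional[str], phosphate: Optional[str]
) -> RenderPlan:
    slots = {"sugar": sugar, "base": base, "phosphate": phosphate}
    present = [name for name in slots.values() if name is not None]
    missing = [slot for slot, name in slots.items() if name is None]
    message = None
    if missing:
        message = "Structure is incomplete: no " + " or ".join(missing) + " to render."
    return RenderPlan(present, message)


# Null terminal phosphate: draw only the nucleoside.
terminal = plan_render("ribose", "uracil", None)
# Degenerate base: draw only the sugar and phosphate.
degenerate = plan_render("ribose", None, "phosphate")
# All three monomers present: no message needed.
complete = plan_render("ribose", "uracil", "phosphate")
```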

Rotating nucleotides

By default, RDKit orients the generated nucleotide to maximize the image’s aspect ratio. Depending on the size of the monomers and the sugar’s bond angles, the phosphate might be to the right of the base or on the left. This makes it harder for scientists to directly compare nucleotide structures because they have to mentally rotate the entire image.

In the table below, the three nucleotides are identical except for their base monomer. It’s easier to spot the difference in the second row, where the nucleotides are lined up in a consistent orientation.

Each row has the same nucleotide — with one monomer changed. It’s easier to find the difference when the nucleotide has a consistent orientation. Try changing the rotation of the monomers in the top row!

Why did you choose to orient it this way?
Consistency is important, but the specific orientation is just convention. We looked at the depictions of nucleotides in the literature to see what scientists are used to. Most textbooks position the 3' phosphate on the left of the base and the 5'-end phosphate on the right. Benchling models nucleotides with 3' phosphates, so we show them on the left.

Five distinct chemical structure images of nucleotides, titled “The phosphate is on the left of the base." In each structure, the phosphate and base are highlighted in boxes, with the phosphate always on the left of the base. — A literal textbook example of nucleotides.

Trigonometry 🤝 chemistry

Fun fact: this intern project was the first time I’ve used arctan since high school.

1. Finding centroids

Two chemical structure images of the same unrotated nucleotide. On the left, the nucleotide has its three monomers highlighted: guanine, ribose, and phosphate, with the phosphate on the right of the base. On the right, the structure shows the centroids of each monomer along with equations. The sugar, base, and phosphate centroids are the average of the position vectors of their constituent atoms. — Finding the centroids of each monomer.

Our first step is to find the centroid (the arithmetic mean of the atom positions) of each monomer. Luckily, RDKit has methods to find the (x, y) coordinates of each atom in the molecular graph. Since we keep track of which monomer each atom originally belonged to, we can find the centroid of each monomer in the connected nucleotide’s graph.

Note that the visual centroid of a monomer isn’t related to its physical center of mass because (1) this is only a 2D representation of a 3D structure and (2) the molecular graph doesn’t include hydrogens.
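As a rough sketch of the centroid step, assume we already have two dictionaries: one mapping each atom index to its 2D coordinates and one mapping each atom index to the monomer it came from. Both dictionaries, the function name, and the coordinates below are made up for illustration; in practice the coordinates would come from RDKit.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def monomer_centroids(
    coords: Dict[int, Point], owner: Dict[int, str]
) -> Dict[str, Point]:
    """Average the (x, y) positions of the atoms belonging to each monomer."""
    grouped: Dict[str, List[Point]] = defaultdict(list)
    for atom_idx, point in coords.items():
        grouped[owner[atom_idx]].append(point)
    return {
        monomer: (
            sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points),
        )
        for monomer, points in grouped.items()
    }


# Made-up coordinates: four base atoms, two sugar atoms, one phosphate atom.
coords = {
    0: (0.0, 0.0), 1: (2.0, 0.0), 2: (2.0, 2.0), 3: (0.0, 2.0),
    4: (5.0, 1.0), 5: (7.0, 1.0),
    6: (10.0, 0.0),
}
owner = {
    0: "base", 1: "base", 2: "base", 3: "base",
    4: "sugar", 5: "sugar",
    6: "phosphate",
}
centroids = monomer_centroids(coords, owner)
```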

2. Finding the counterclockwise angle between the base and phosphate vectors

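The signed angle from vector B to vector P is the arctangent of their cross product divided by their dot product. In 2D the cross product is a signed scalar, so a two-argument arctangent (atan2) of the cross product and the dot product gives an angle between −180° and 180°. We need the counterclockwise angle, not the shortest, so we add 360° to the angle if it’s negative.

Then, we find the “half angle”: the direction halfway between the base and phosphate vectors, measured from the +x-axis. It equals the base vector’s angle plus half of the counterclockwise angle.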

The equations show that the variable named “angle” equals the arctangent of the magnitude of the cross product of B and P divided by the dot product of the vectors. If angle is negative, it is incremented by tau radians. The “base angle” is the arctangent of its y-component divided by its x-component. The variable named “mid” equals the base angle plus half of angle. There are also two annotated chemical structure images showing angle and mid as a red arrow pointed southeast on the nucleotide. — Left: counterclockwise angle from base vector to phosphate vector. Right: half angle in red.
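The angle calculations can be written in a few lines with the standard `math` module. This is a sketch under two assumptions that the post does not state explicitly: B and P are 2D vectors given as (x, y) tuples, and angles follow the usual math convention (counterclockwise from the +x-axis, y pointing up). The function names are hypothetical. It uses `atan2` rather than a plain arctangent so that every quadrant is handled.

```python
import math
from typing import Tuple

Vector = Tuple[float, float]


def ccw_angle(b: Vector, p: Vector) -> float:
    """Counterclockwise angle from b to p, in radians, in [0, tau)."""
    cross = b[0] * p[1] - b[1] * p[0]  # signed 2D cross product
    dot = b[0] * p[0] + b[1] * p[1]
    angle = math.atan2(cross, dot)  # signed angle in [-pi, pi]
    if angle < 0:
        angle += math.tau  # convert a clockwise angle to counterclockwise
    return angle


def half_angle(b: Vector, p: Vector) -> float:
    """Direction halfway from b to p (going counterclockwise), from the +x-axis."""
    base_angle = math.atan2(b[1], b[0])
    return base_angle + ccw_angle(b, p) / 2


# Base pointing right, phosphate pointing down: the counterclockwise sweep
# from base to phosphate is three quarters of a turn.
angle = ccw_angle((1.0, 0.0), (0.0, -1.0))
mid = half_angle((1.0, 0.0), (0.0, -1.0))
```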

3. Rotating half angle up

Rotating the image so that the half angle points up (90°) makes sure that the phosphate is on the left of the base.

There is an equation where the variable “rotate by” equals one-fourth tau radians minus mid. On the left, the nucleotide structure image has a red arrow pointing straight up. On the right, the nucleotide is the same, with the original three monomers highlighted. Both nucleotides are now oriented so that the phosphate is on the left of the base. — Left: nucleotide rotated so that half angle points up. Right: final rotated nucleotide.

We pass the rotation in degrees to RDKit’s molecule drawing function, which handles rotating the atom labels correctly. Then we send the generated SVG (including the colored monomer highlights!) to the frontend.
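Putting the three steps together, the sketch below computes the rotation and checks the result on made-up centroids. It assumes the base and phosphate vectors are measured from the sugar centroid and that y points up. The function names are hypothetical, and the rotation is applied by hand here purely to verify the geometry; in the real feature the angle is handed to RDKit.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def rotation_to_point_up(sugar: Point, base: Point, phosphate: Point) -> float:
    """Radians to rotate by so that the half angle points straight up."""
    b = (base[0] - sugar[0], base[1] - sugar[1])
    p = (phosphate[0] - sugar[0], phosphate[1] - sugar[1])
    angle = math.atan2(b[0] * p[1] - b[1] * p[0], b[0] * p[0] + b[1] * p[1])
    if angle < 0:
        angle += math.tau  # counterclockwise angle from base to phosphate
    mid = math.atan2(b[1], b[0]) + angle / 2
    return math.tau / 4 - mid


def rotate(point: Point, origin: Point, theta: float) -> Point:
    """Rotate a point counterclockwise around origin by theta radians."""
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (origin[0] + dx * cos_t - dy * sin_t, origin[1] + dx * sin_t + dy * cos_t)


# Made-up centroids with the phosphate on the RIGHT of the base.
sugar, base, phosphate = (0.0, 0.0), (-1.0, 1.0), (1.0, 1.0)

theta = rotation_to_point_up(sugar, base, phosphate)
rotation_degrees = math.degrees(theta)
rotated_base = rotate(base, sugar, theta)
rotated_phosphate = rotate(phosphate, sugar, theta)
```

After rotation the base sits at an angle of 90° minus half the counterclockwise angle and the phosphate at 90° plus half of it, so the phosphate always ends up with the smaller x-coordinate.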

Rendering chemical structures on the frontend

Benchling uses React for all of our frontend code. There were a couple of interesting components we made to render an interactive structure.

Automatic SVG resizing

When rendering chemical structure SVGs, RDKit requires the client to specify the width and height of the generated image (auto-resizing was added in a later version of RDKit). When generating images, we set width=height=1000, which doesn’t work well for structures with non-square aspect ratios.

Table showing chemical structure images of two molecules with and without automatic resizing. The phosphate structure has a square ratio, so it looks the same before and after resizing. The N2-[(imidazol-4-yl)ethyl]guanine structure is small when constrained to square without resizing and is more legible with automatic resizing. — Automatic resizing maximizes our use of the available space. The blue border is the SVG’s bounding box.

We solved the aspect ratio problem by creating a React hook that iterates over the children of the <svg> and sets the <svg>'s viewBox attribute to be the smallest rectangle containing the BBox of all of them. In computational geometry, this is the minimum bounding rectangle problem!
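The hook itself is React code, but the geometry behind it is language-independent. As a sketch in Python, assume each child's bounding box arrives as an (x, y, width, height) tuple, which is the shape the SVG DOM's `getBBox()` reports. The function name `union_view_box` is hypothetical.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, width, height)


def union_view_box(bboxes: Iterable[BBox]) -> str:
    """Smallest axis-aligned rectangle containing every box, as a viewBox string."""
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("need at least one bounding box")
    min_x = min(x for x, _, _, _ in boxes)
    min_y = min(y for _, y, _, _ in boxes)
    max_x = max(x + w for x, _, w, _ in boxes)
    max_y = max(y + h for _, y, _, h in boxes)
    # viewBox is "min-x min-y width height"
    return f"{min_x} {min_y} {max_x - min_x} {max_y - min_y}"


# Two children of a 1000x1000 <svg> that only occupy a wide, short strip.
view_box = union_view_box([(10, 20, 30, 40), (50, 25, 50, 10)])
```

Setting the result as the `viewBox` lets the browser scale the drawing to fill whatever space the `<svg>` element is given, regardless of the 1000×1000 canvas it was generated on.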

Interactive hover highlighting

Another fun part of my intern project was to highlight the monomer substructures of the nucleotide image when a user hovers over them. This is the process for the <NucleotideImage> component:

1. Render the colored nucleotide SVG with a React component <ChemicalStructureImage>.
2. Use D3 to select all the colored path and ellipse elements in <ChemicalStructureImage>’s <svg>.
3. Use a mapping of color → monomer type to bind the selected elements to their monomer type.
4. For each selected element, set its color to gray and its opacity to 0.
5. Add mouseover/mouseout event listeners on each bound element. When an element is hovered over, its listener also sets every other element with the same monomer type to be opaque.

For performance, we memoize <ChemicalStructureImage> so that we only modify the DOM once for each nucleotide image. Since the <svg> is memoized, <NucleotideImage> will run the callback on the same <svg> every time <NucleotideImage> is re-rendered. This happens very frequently — every time the user mouses over a nucleotide component. Our useCallback hook has to be idempotent so that it doesn’t break the <svg> by modifying its contents multiple times or cause a memory leak with stale event listeners.

Performance

The RDKit code is pretty complex, so I was worried my intern project would be too slow. We considered several alternatives to improve performance including backend caching, RDKitJS in the frontend, and tooltip debouncing. But as it turned out, the performance was already great!

Flame graph explaining the time taken for a browser request for a nucleotide image, which takes 126 ms in total. Sending the request takes 25 ms, processing by Benchling’s backend takes 75 ms, and downloading the response takes 26 ms. Within the “Processing by Benchling backend” bar, “SQL query for monomer SMILES” takes 5 ms and “RDKit generation of nucleotide SVG” takes 8 ms. — Simplified flame graph showing the time spent on each major component of the network request.

From the client’s perspective, the whole browser request to get a nucleotide’s image is very fast, about 90–180 ms (25th–75th percentile). Thanks to RDKit, generating the actual SVG only takes 8 ms!

Summary

Throughout my internship, I had the opportunity to learn many new technologies, including the ones I had hoped to learn at Benchling. I came in with mostly frontend development experience and had expressed interest in learning SQLAlchemy and GraphQL, two technologies I hadn’t used before. I also got to dive deeply into RDKit, and found it rewarding to work to this extent with open source software. Finally, I got to work across many engineering verticals, including full stack development, investigating end-to-end performance, and thinking critically about analytics.
