A visual guide to nucleotide chemical structures
Introduction
Benchling’s mission is to unlock the power of biotechnology. Benchling provides a platform used by 200,000+ scientists to centralize data, improve collaboration and insights, and accelerate the path to scientific discoveries. This summer, I joined the Modified Biologics and Chemistry team at Benchling as a software engineering intern. One of the goals of the team is to make Benchling chemically aware, which includes adding chemical modifications to DNA and RNA sequences and creating a new small molecule entity type on Benchling.
For my intern project, I worked on making our product more user-friendly for chemists. Some of our customers want to be able to visualize the actual chemical structure of what they’re working with, as they often can’t recognize a molecule by its name or a string representation of its chemical structure. They can more easily identify and design molecules through their visual chemical structure.
To cater to chemists, we want our product to surface chemical structure images. This will allow chemists to work more easily in Benchling, whereas before they had to use multiple tools to design monomers, nucleotides, or sequences. For my project, I built out a feature that surfaces chemical structure for monomers and nucleotides in DNA and RNA sequences.
Before diving into the technical details of how I generated chemical structure images for DNA and RNA sequences, here’s some background on modified biologics.
Background
What’s a nucleotide?
A nucleotide is a molecule consisting of a base, a sugar, and a phosphate monomer connected together. DNA and RNA are long strands of many nucleotides bonded together.
In Benchling, a monomer is a molecule with attachment points that can be bonded to other monomers. I like to think of monomers as puzzle pieces and attachment points as the tabs that determine where they can be joined. Connecting monomers together at their specific attachment points creates a nucleotide.
From an engineering standpoint, monomers are interchangeable with other monomers of the same component type (sugar, base, or phosphate). Benchling elegantly models a monomer as an entity with a chemical structure and a monomer type; and a nucleotide as an entity composed of one monomer of each type.
SMILES 🙂
SMILES (simplified molecular-input line-entry system) is a string-based notation that represents the chemical structure of a molecule. The notation can encode atom labels and stereochemistry, but not the 2D rotation of a molecule.
SMILES for monomers also specify their attachment points by labeling a group of atoms with the R-group number. The three different types of monomers in Benchling have different numbers of attachment points. Sugar monomers have an R1, R2, and R3. Bases have just an R1, while phosphates have an R1 and R2.
The capping group of an attachment point is the functional group in the monomer (almost always an H or OH) that’s used if no other monomers are connected to the attachment point. In the figure above, the sugar’s OH labeled R3 is the sugar’s R3 capping group and the base’s H labeled R1 is its R1 capping group. When forming a nucleotide, we replace those capping groups with a single chemical bond between the monomers.
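To make the labeling concrete, here is a minimal sketch that pulls the R-group labels and capping groups out of a monomer SMILES string with a regular expression. It is not how Benchling or RDKit parses SMILES; the function name and the example SMILES strings are hypothetical and written only for illustration.

```python
import re

# Matches a bracket atom that carries a numeric label, e.g. "[OH:3]" or "[H:1]".
# Group 1 is the atom text (element plus any H count), group 2 is the label.
LABELED_ATOM = re.compile(r"\[([^\]:]+):(\d+)\]")


def attachment_points(smiles: str) -> dict[int, str]:
    """Map each R-group number to the capping group text found in the SMILES."""
    points: dict[int, str] = {}
    for atom_text, label in LABELED_ATOM.findall(smiles):
        number = int(label)
        if number in points:
            raise ValueError(f"R{number} is labeled more than once")
        points[number] = atom_text
    return points


# Hypothetical monomer SMILES (stereochemistry omitted), for illustration only.
sugar = "[OH:1]CC1OC([OH:3])CC1[OH:2]"
phosphate = "[OH:1]P(=O)(O)[OH:2]"
base = "Cc1cn([H:1])c(=O)[nH]c1=O"

for name, smiles in [("sugar", sugar), ("phosphate", phosphate), ("base", base)]:
    print(name, attachment_points(smiles))
```

Unlabeled bracket atoms such as [nH] are skipped because the pattern requires a colon followed by a number.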
Algorithm for generating nucleotide images
We use the RDKit, an open source cheminformatics library, to render chemical structure images.
For a single nucleotide, the base’s R1 attaches to the sugar’s R3 and the phosphate’s R1 attaches to the sugar’s R2. If we wanted to connect multiple nucleotides together, we would attach the first nucleotide’s phosphate’s R2 to the next nucleotide’s sugar’s R1.
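The connection rules can be written down as data. The sketch below lists the bonds that would be formed for a chain of nucleotides; the Attachment type and planned_bonds function are hypothetical names used for illustration, not part of Benchling’s code.

```python
from typing import NamedTuple


class Attachment(NamedTuple):
    nucleotide: int  # index of the nucleotide in the sequence
    monomer: str     # "base", "sugar", or "phosphate"
    r_group: int     # attachment point number (1 for R1, 2 for R2, 3 for R3)


def planned_bonds(length: int) -> list[tuple[Attachment, Attachment]]:
    """List the attachment-point pairs to bond for a chain of `length` nucleotides."""
    bonds = []
    for i in range(length):
        # Within a nucleotide: base R1 to sugar R3, phosphate R1 to sugar R2.
        bonds.append((Attachment(i, "base", 1), Attachment(i, "sugar", 3)))
        bonds.append((Attachment(i, "phosphate", 1), Attachment(i, "sugar", 2)))
        # Between nucleotides: phosphate R2 to the next nucleotide's sugar R1.
        if i + 1 < length:
            bonds.append((Attachment(i, "phosphate", 2), Attachment(i + 1, "sugar", 1)))
    return bonds


for left, right in planned_bonds(2):
    print(left, "<->", right)
```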
Connecting monomers together requires some simple graph operations:
1. Creating monomer molecular graphs from SMILES
The first step is to create molecular graphs for the monomers from their SMILES strings and label the monomers’ attachment points. RDKit represents molecules as a graph data structure with atoms as vertices and bonds as edges.
We use RDKit to convert each of the monomers’ SMILES into a molecular graph. RDKit also lets us combine two molecules into a disconnected graph by taking the disjoint union of their molecular graphs.
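As an illustration of the disjoint union, here is a toy graph model in plain Python. It is not RDKit’s data structure or API; MolGraph and disjoint_union are hypothetical names. The point it shows is that the second graph’s vertex indices have to be shifted so they don’t collide with the first graph’s.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    atoms: list[str] = field(default_factory=list)           # vertex i -> element symbol
    bonds: set[frozenset[int]] = field(default_factory=set)  # undirected edges


def disjoint_union(a: MolGraph, b: MolGraph) -> MolGraph:
    """Combine two graphs without adding any edge between them."""
    offset = len(a.atoms)
    # Shift every vertex index of `b` past the vertices of `a`.
    shifted = {frozenset(i + offset for i in bond) for bond in b.bonds}
    return MolGraph(a.atoms + b.atoms, a.bonds | shifted)


# Two tiny made-up fragments, each with a single bond.
left = MolGraph(["C", "O"], {frozenset({0, 1})})
right = MolGraph(["N", "C"], {frozenset({0, 1})})
combined = disjoint_union(left, right)
print(combined.atoms)
print(sorted(sorted(bond) for bond in combined.bonds))
```

The combined graph still has two connected components; nothing links vertices 0 and 1 to vertices 2 and 3 yet.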
Our next step is to make the nucleoside a connected graph by adding an edge between its disconnected components.
2. Connect the nucleoside (base + sugar)
The nucleoside is the base and sugar bonded together. To create the nucleoside, we have to replace the capping groups on the base’s R1 and the sugar’s R3 attachment points with a single bond between the monomers.
By definition, monomer capping groups are attached to the rest of the monomer with a single bond. Since there’s exactly one bond (edge) to the capping group, we know the attachment point is the only neighboring atom (vertex) to the capping group atom.
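The replacement step can be sketched on a toy graph, assuming atoms are stored in a dict keyed by id and bonds as a set of unordered pairs. This is a plain-Python model rather than RDKit code, and the helper names are hypothetical.

```python
def sole_neighbor(bonds: set[frozenset[int]], atom: int) -> int:
    """Return the one atom bonded to `atom`; a capping atom must have exactly one."""
    neighbors = [next(iter(bond - {atom})) for bond in bonds if atom in bond]
    if len(neighbors) != 1:
        raise ValueError(f"atom {atom} has {len(neighbors)} neighbors, expected 1")
    return neighbors[0]


def replace_caps_with_bond(
    atoms: dict[int, str],
    bonds: set[frozenset[int]],
    cap_a: int,
    cap_b: int,
) -> tuple[dict[int, str], set[frozenset[int]]]:
    """Drop both capping atoms and bond their attachment points together."""
    anchor_a = sole_neighbor(bonds, cap_a)
    anchor_b = sole_neighbor(bonds, cap_b)
    kept_atoms = {i: el for i, el in atoms.items() if i not in (cap_a, cap_b)}
    kept_bonds = {b for b in bonds if cap_a not in b and cap_b not in b}
    kept_bonds.add(frozenset({anchor_a, anchor_b}))
    return kept_atoms, kept_bonds


# Made-up fragment: atom 1 is the base's H cap (R1), atom 3 is the sugar's O cap (R3).
atoms = {0: "N", 1: "H", 2: "C", 3: "O"}
bonds = {frozenset({0, 1}), frozenset({2, 3})}
new_atoms, new_bonds = replace_caps_with_bond(atoms, bonds, cap_a=1, cap_b=3)
print(new_atoms, new_bonds)
```

Keying atoms by a stable id is a choice made for this sketch so that removing atoms doesn’t shift the ids of the ones that remain.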
What if the capping group contains more than one atom?
If the capping group is a hydroxyl group (OH), how do we know which of the two atoms to remove? Like other cheminformatics libraries, RDKit uses hydrogen suppression, meaning implicit hydrogens like the Hs in [OH:2] or [NH2:3] aren’t included in the molecular graph. The library can deterministically calculate the number of hydrogens attached to an atom when rendering. From a code perspective, the capping group atom is just the oxygen.
Also, the SMILES specification disallows multiple atoms with the same atom label, guaranteeing that there’s at most one atom labeled with any R-group number. For example, it’s legal to write [H:1] and [CH:3], which have only one non-suppressed atom. But it’s illegal to label constituents with multiple non-H atoms like [CC:2] or [C(=O)OH:1]. This constraint is acceptable because in practice, our customers’ monomers only have H and OH capping groups.
If hydrogens are suppressed, how do you find a sole hydrogen capping group?
RDKit doesn’t suppress explicit Hs. When we include a single labeled hydrogen atom in a monomer’s SMILES string with [H:1], RDKit does include that hydrogen in the molecular graph.
3. Connect the nucleotide (nucleoside + phosphate)
Finally, we connect the phosphate to the nucleoside with the same process. We remove the phosphate R1 and sugar R2 capping groups and add a single bond between them.
Handling edge cases
This process works great for nucleotides with all three monomers, but in some special cases, a nucleotide doesn’t have all three monomers.
A degenerate base is how biologists represent a position that could be any one of several bases rather than a single A, T, G, or C. Since there’s no single structure for a degenerate base, we simply don’t render the base.
A null terminal phosphate is when the last nucleotide in a sequence has no attached phosphate. Most biological use cases for oligos do not include a terminal 3’ or 5’ phosphate. Sequences with a terminal phosphate require scientists to specially request phosphorylated oligos from their oligonucleotide manufacturer.
Benchling has a variety of customers with diverse scientific needs. Our goal is to be as flexible as possible to support them. When rendering nucleotides with a missing monomer, we display a partially rendered chemical structure along with a message that the structure isn’t complete.
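One way to handle missing monomers is to form only the bonds whose two monomers are both present and to report whether the result is complete. The sketch below assumes that approach; the function name and the string identifiers are hypothetical, and the article does not describe the actual implementation.

```python
from typing import Optional


def bonds_for_nucleotide(
    base: Optional[str],
    sugar: Optional[str],
    phosphate: Optional[str],
) -> tuple[list[tuple[str, str]], bool]:
    """Return the bonds that can be formed and whether the nucleotide is complete.

    Each argument is a monomer SMILES string, or None when that monomer is
    missing (for example a degenerate base or a null terminal phosphate).
    """
    bonds = []
    if base is not None and sugar is not None:
        bonds.append(("base:R1", "sugar:R3"))
    if phosphate is not None and sugar is not None:
        bonds.append(("phosphate:R1", "sugar:R2"))
    is_complete = None not in (base, sugar, phosphate)
    return bonds, is_complete


# Placeholder strings stand in for real SMILES here.
bonds, is_complete = bonds_for_nucleotide(base=None, sugar="sugar", phosphate="phosphate")
if not is_complete:
    print("Partial structure, bonds formed:", bonds)
```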
Rotating nucleotides
By default, RDKit orients the generated nucleotide to maximize the image’s aspect ratio. Depending on the size of the monomers and the sugar’s bond angles, the phosphate might be to the right of the base or on the left. This makes it harder for scientists to directly compare nucleotide structures because they have to mentally rotate the entire image.
In the table below, the three nucleotides are identical except for their base monomer. It’s easier to spot the difference in the second row, where the nucleotides are lined up in a consistent orientation.
Why did you choose to orient it this way?
Consistency is important, but the specific orientation is just convention. We looked at the depictions of nucleotides in the literature to see what scientists are used to. Most textbooks position the 3' phosphate on the left of the base and the 5'-end phosphate on the right. Benchling models nucleotides with 3' phosphates, so we show them on the left.
Trigonometry 🤝 chemistry
Fun fact: this intern project was the first time I’ve used arctan since high school.
1. Finding centroids
Our first step is to find the centroids (arithmetic mean) of each monomer. Luckily, RDKit has methods to find the (x, y) coordinates of each atom in the molecular graph. Since we keep track of which monomer each atom originally belonged to, we can find the centroid of each monomer in the connected nucleotide’s graph.
Note that the visual centroid of a monomer isn’t related to its physical center of mass because (1) this is only a 2D representation of a 3D structure and (2) the molecular graph doesn’t include hydrogens.
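The centroid step is a per-monomer average of atom coordinates. Here is a minimal sketch, assuming the coordinates and the atom-to-monomer bookkeeping are already available as dictionaries; the names and sample coordinates are made up.

```python
from collections import defaultdict


def monomer_centroids(
    coords: dict[int, tuple[float, float]],
    owner: dict[int, str],
) -> dict[str, tuple[float, float]]:
    """Average the (x, y) positions of the atoms belonging to each monomer.

    coords maps atom index -> (x, y); owner maps atom index -> monomer name.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # monomer -> [sum_x, sum_y, count]
    for atom, (x, y) in coords.items():
        entry = sums[owner[atom]]
        entry[0] += x
        entry[1] += y
        entry[2] += 1
    return {name: (sx / n, sy / n) for name, (sx, sy, n) in sums.items()}


# Made-up 2D coordinates for four atoms.
coords = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (1.0, 3.0), 3: (4.0, 4.0)}
owner = {0: "sugar", 1: "sugar", 2: "sugar", 3: "base"}
centroids = monomer_centroids(coords, owner)
print(centroids)
```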
2. Finding the counterclockwise angle between the base and phosphate vectors
The signed angle from vector B to vector P is the arctangent of the 2D cross product divided by the dot product of the vectors (in code, atan2(cross, dot), which returns a value between -180° and 180°). We need the counterclockwise angle, not the shortest, so we add 360° to the angle if it’s negative.
Then, we find the “half angle” with respect to the +x-axis.
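Here is a small sketch of both angle computations. It assumes B and P are 2D vectors with a common origin, that the counterclockwise angle is measured from B to P, and that the half angle is the direction halfway along that counterclockwise sweep. The function names are hypothetical.

```python
import math

Vector = tuple[float, float]


def ccw_angle(b: Vector, p: Vector) -> float:
    """Counterclockwise angle from b to p in degrees, in the range [0, 360)."""
    cross = b[0] * p[1] - b[1] * p[0]  # signed z-component of the cross product
    dot = b[0] * p[0] + b[1] * p[1]
    angle = math.degrees(math.atan2(cross, dot))
    return angle + 360.0 if angle < 0 else angle


def half_angle(b: Vector, p: Vector) -> float:
    """Direction halfway between b and p, measured from the +x-axis in degrees."""
    start = math.degrees(math.atan2(b[1], b[0]))
    return (start + ccw_angle(b, p) / 2.0) % 360.0


print(ccw_angle((1.0, 0.0), (0.0, 1.0)))   # a quarter turn counterclockwise
print(ccw_angle((0.0, 1.0), (1.0, 0.0)))   # the long way around
print(half_angle((1.0, 0.0), (0.0, 1.0)))
```

Using atan2 with the signed cross product keeps the direction of rotation, which a plain arccosine of the dot product would lose.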
3. Rotating half angle up
Rotating the image so that the half angle points up (90°) makes sure that the phosphate is on the left of the base.
We pass the rotation in degrees to RDKit’s molecule drawing function, which handles rotating the atom labels correctly. Then we send the generated SVG (including the colored monomer highlights!) to the frontend.
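The effect of the rotation can be checked on plain coordinates. The sketch below assumes a y-up coordinate system and counterclockwise-positive rotation; a drawing library may use a different sign convention, so the sign may need flipping in practice. The function names and sample vectors are hypothetical.

```python
import math

Vector = tuple[float, float]


def rotation_to_point_up(half_angle_deg: float) -> float:
    """Counterclockwise rotation, in degrees, that moves the half angle to 90 degrees."""
    return (90.0 - half_angle_deg) % 360.0


def rotate(point: Vector, degrees: float) -> Vector:
    """Rotate a point counterclockwise about the origin."""
    r = math.radians(degrees)
    x, y = point
    return (x * math.cos(r) - y * math.sin(r), x * math.sin(r) + y * math.cos(r))


# Made-up vectors: base along +x, phosphate along +y, so the half angle is 45 degrees.
base, phosphate = (1.0, 0.0), (0.0, 1.0)
rotation = rotation_to_point_up(45.0)
rotated_base = rotate(base, rotation)
rotated_phosphate = rotate(phosphate, rotation)
print(rotation, rotated_base, rotated_phosphate)
```

After the rotation the phosphate has a negative x-coordinate and the base a positive one, which puts the phosphate to the left of the base.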
Rendering chemical structures on the frontend
Benchling uses React for all of our frontend code. There were a couple of interesting components we made to render an interactive structure.
Automatic SVG resizing
When rendering chemical structure SVGs, RDKit requires the client to specify the width and height of the generated image (auto-resizing was added in a later version of RDKit). When generating images, we set width=height=1000, which doesn’t work well with structures with non-square aspect ratios.
We solved the aspect ratio problem by creating a React hook that iterates over the children of the <svg> and sets the <svg>'s viewBox attribute to be the smallest rectangle containing the BBox of all of them. In computational geometry, this is the minimum bounding rectangle problem!
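The geometry behind the hook is a union of axis-aligned boxes. The actual hook is React code that the article does not show, so the sketch below only models the calculation in Python; BBox and enclosing_view_box are hypothetical names.

```python
from typing import Iterable, NamedTuple


class BBox(NamedTuple):
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def enclosing_view_box(boxes: Iterable[BBox]) -> str:
    """Return a viewBox string for the smallest rectangle containing every box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one bounding box")
    min_x = min(b.x for b in boxes)
    min_y = min(b.y for b in boxes)
    max_x = max(b.x + b.width for b in boxes)
    max_y = max(b.y + b.height for b in boxes)
    # viewBox is "min-x min-y width height".
    return f"{min_x:g} {min_y:g} {max_x - min_x:g} {max_y - min_y:g}"


# Made-up child bounding boxes.
view_box = enclosing_view_box([BBox(10, 20, 30, 40), BBox(5, 50, 10, 10)])
print(view_box)
```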
Interactive hover highlighting
Another fun part of my intern project was to highlight the monomer substructures of the nucleotide image when a user hovers over them. This is the process for the <NucleotideImage> component:
- Render the colored nucleotide SVG with a React component, <ChemicalStructureImage>.
- Use D3 to select all the colored path and ellipse elements in <ChemicalStructureImage>’s <svg>.
- Use a mapping of color → monomer type to bind the selected elements to their monomer type.
- For each selected element, set its color to gray and opacity to 0.
- Add mouseover/mouseout event listeners on each bound element. When an element is hovered over, it also sets every other element with the same monomer type to be opaque.
For performance, we memoize <ChemicalStructureImage> so that we only modify the DOM once for each nucleotide image. Since the <svg> is memoized, <NucleotideImage> will run the callback on the same <svg> every time <NucleotideImage> is re-rendered. This happens very frequently: every time the user mouses over a nucleotide component. Our useCallback hook has to be idempotent so that it doesn’t break the <svg> by modifying its contents multiple times or cause a memory leak with stale event listeners.
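The idempotency requirement can be illustrated with a language-agnostic model. The real code uses React and D3, which the article does not show; the Python sketch below only mimics the logic, and the Shape class, the color names, and bind_hover are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical highlight colors; the real palette is not given in the article.
COLOR_TO_MONOMER = {"red": "base", "green": "sugar", "blue": "phosphate"}


@dataclass
class Shape:
    color: str
    monomer: Optional[str] = None
    opacity: float = 1.0
    listeners: dict[str, Callable[[], None]] = field(default_factory=dict)


def bind_hover(shapes: list[Shape]) -> None:
    """Set up hover highlighting; safe to call repeatedly on the same shapes."""

    def set_group_opacity(monomer: str, opacity: float) -> None:
        for shape in shapes:
            if shape.monomer == monomer:
                shape.opacity = opacity

    for shape in shapes:
        if shape.monomer is None:
            # First run only: bind the monomer type, then gray out and hide.
            shape.monomer = COLOR_TO_MONOMER[shape.color]
            shape.color = "gray"
            shape.opacity = 0.0
        monomer = shape.monomer
        # Keyed assignment replaces an earlier listener instead of stacking a new one.
        shape.listeners["mouseover"] = lambda m=monomer: set_group_opacity(m, 1.0)
        shape.listeners["mouseout"] = lambda m=monomer: set_group_opacity(m, 0.0)


shapes = [Shape("red"), Shape("red"), Shape("blue")]
bind_hover(shapes)
bind_hover(shapes)  # a second call neither re-maps colors nor adds listeners
shapes[0].listeners["mouseover"]()
print([(s.monomer, s.opacity) for s in shapes])
```

Recording the bound monomer type on the element is what lets the second call skip the color lookup, which would otherwise fail once the colors have been replaced with gray.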
Performance
The RDKit code is pretty complex, so I was worried my intern project would be too slow. We considered several alternatives to improve performance including backend caching, RDKitJS in the frontend, and tooltip debouncing. But as it turned out, the performance was already great!
From the client’s perspective, the whole browser request to get a nucleotide’s image is very fast, about 90–180 ms (25–75th percentile). Thanks to RDKit, generating the actual SVG only takes 8 ms!
Summary
Throughout my internship, I had the opportunity to learn many new technologies, many of them aligned with what I wanted to learn during my time at Benchling. I came into this internship with more frontend development experience, and expressed interest in learning more about SQLAlchemy and GraphQL, two technologies I didn’t have experience with. I also got to dive deeply into RDKit, and found it rewarding to work to this extent with open source software. Finally, throughout my internship with Benchling, I got to work across many engineering verticals, including full stack development, investigating end to end performance, and thinking critically about analytics.
We’re hiring
If you’re interested in working with us to build the future of secure biotech platforms, check out our careers page or contact us!