Building a high-quality, reliable, and efficient bioinformatics pipeline

At Color we provide high-quality, physician-ordered genetic testing at a low cost. A core component of this service is the bioinformatics pipeline: the software framework that processes data from the DNA sequencers in the clinical lab, finds genomic variants, annotates them for variant classification, and performs quality control.

We’ve previously published a few posters and papers ([1], [2], [3]) outlining our novel solutions for detecting and managing hard-to-call variants. This is an exciting and fast-evolving area of research, both at Color and across the field of bioinformatics.

But until today, we haven’t shared many details about our pipeline itself, or the role that distributed systems engineering plays in the future of bioinformatics and Color. Jeremy Ginsberg recently shared an introduction to the topic (worth a quick read for background/context). Here, we discuss more specific technical details and optimizations which may be of interest to other bioinformatics teams and, we hope, to distributed systems engineers who are motivated to make an impact by solving hard computational problems at the center of precision medicine (we’re hiring!).

For those who are new to genetics and bioinformatics, here’s the typical “life of a sample” ordered as part of a clinical lab test like our Color Hereditary Cancer and Heart Health Test:

Life of a sample

The main stages of the bioinformatics pipeline are:

It sounds relatively straightforward, but as shown in this workflow, each sample actually runs through more than 20 distinct processes with complex dependencies:

Dependency graph for bioinformatics pipeline tasks
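
Conceptually, each of these processes is a node in a directed acyclic graph, and a task can start only once all of its dependencies have completed. The sketch below illustrates the idea in Python; the task names and dependency structure are hypothetical stand-ins, not our actual task list.

```python
# Minimal sketch of a pipeline dependency graph; task names are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    depends_on: List["Task"] = field(default_factory=list)


align = Task("align_reads")
mark_dups = Task("mark_duplicates", depends_on=[align])
call_snvs = Task("call_snvs_and_indels", depends_on=[mark_dups])
call_cnvs = Task("call_cnvs", depends_on=[mark_dups])
qc = Task("collect_qc_metrics", depends_on=[mark_dups])
annotate = Task("annotate_variants", depends_on=[call_snvs, call_cnvs])


def topological_order(tasks):
    """Order tasks so that each one runs only after its dependencies."""
    ordered, seen = [], set()

    def visit(task):
        if task.name in seen:
            return
        for dep in task.depends_on:
            visit(dep)
        seen.add(task.name)
        ordered.append(task)

    for task in tasks:
        visit(task)
    return ordered


for task in topological_order([annotate, qc]):
    print(task.name)
```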

In this blog post, we outline the guiding principles of building our bioinformatics pipeline: quality, reliability, performance, and cost efficiency. We offer some insights into how we implemented these principles at high throughput in a production clinical setting.

Quality

The most important attribute of any clinical bioinformatics pipeline is correctness: all variants in the biological samples have to be detected correctly; both false positives and false negatives have clinical implications. By comparison, research-grade pipelines often focus on easier-to-detect variants and are comfortable with limited sensitivity, especially for CNVs. For context, we note that a clinical bioinformatics pipeline is just one of many procedures and systems required to achieve Color’s overall accuracy/quality goals:

One novel approach we use on the software side is running extensive regression tests for all bioinformatics pipeline code changes. By running every new software release against thousands of previously processed samples, we have been able to increase confidence in software changes. Though this approach may seem fairly routine to engineers accustomed to working on distributed systems at large consumer internet companies, it’s very rare in this domain.

To even consider this approach, the pipeline itself must be exceedingly reliable, reproducible, and massively parallelizable with a low end-to-end runtime, even when processing many thousands of samples. Without these characteristics, a regression test of this scale is too cumbersome to manage.

How does a regression test work? The outputs of the bioinformatics pipeline — variant calls, annotations, and quality control metrics — are automatically compared between releases. Any changes in the output have to be reviewed and explained by the code changes going into the new release; unwanted changes trigger an investigation and prevent the release from being deployed to production until resolved. This allows us to deploy a new release of the bioinformatics pipeline every few weeks, while maintaining the high quality and necessary validation for each release. The faster iteration cycle means we can keep the pace of development high and continue delivering new genetic tests to our users.
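
In spirit, the comparison step boils down to diffing normalized outputs from the baseline and candidate releases. Below is a deliberately simplified sketch; the file names, the naive VCF parsing, and the idea of comparing only (chrom, pos, ref, alt) tuples are assumptions for illustration, not our actual comparison logic.

```python
# Simplified sketch of comparing variant calls between two pipeline releases.
# The VCF parsing here is deliberately naive, and paths/fields are hypothetical.
def load_calls(vcf_path):
    """Return the set of (chrom, pos, ref, alt) calls from a VCF file."""
    calls = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):  # skip header lines
                continue
            chrom, pos, _vid, ref, alt = line.rstrip("\n").split("\t")[:5]
            calls.add((chrom, int(pos), ref, alt))
    return calls


def compare_releases(baseline_vcf, candidate_vcf):
    baseline = load_calls(baseline_vcf)
    candidate = load_calls(candidate_vcf)
    return {
        "only_in_baseline": sorted(baseline - candidate),   # candidate misses these
        "only_in_candidate": sorted(candidate - baseline),  # candidate adds these
    }


diff = compare_releases("sample123.baseline.vcf", "sample123.candidate.vcf")
if any(diff.values()):
    print("Variant call differences need review before release:", diff)
```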

To ensure comprehensiveness of the regression integration tests, we run them against a large set of recent production samples and against known challenging datasets (samples with rare and hard-to-call variants); this makes sure we capture every critical code path and run tests with data that’s representative of the current conditions in the lab: assay, lab process, and sequencing hardware.

Reproducibility of the pipeline code is vital for running regression tests — to be able to compare different software releases, the same sequencing input has to produce the same output for every run. This is not generally the case for DNA alignment tools and variant callers; these are stochastic processes that often run in a multithreaded mode, both of which are sources of non-determinism. We took the following approaches to ensure determinism:

As a result, we have eliminated non-determinism as a source of spurious regressions during backtesting.
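
As one illustration of this kind of pinning (a sketch under our own assumptions, not Color's actual wrapper): bwa mem's -t flag fixes the thread count and its -K flag fixes the number of input bases processed per batch, so alignment output no longer depends on how work happens to be split across threads.

```python
# Sketch: building a deterministic alignment command. Pinning the thread count
# (-t) and the per-batch input size (-K) keeps bwa mem's output independent of
# how work is split across threads. Paths and sample names are hypothetical.
import subprocess

THREADS = 8                  # fixed, never derived from the machine's core count
BATCH_BASES = 100_000_000    # fixed -K batch size for reproducible batching


def align(sample_id, fastq_r1, fastq_r2, reference="ref/hg19.fa"):
    cmd = [
        "bwa", "mem",
        "-t", str(THREADS),
        "-K", str(BATCH_BASES),
        reference, fastq_r1, fastq_r2,
    ]
    with open(f"{sample_id}.sam", "w") as out:
        subprocess.run(cmd, stdout=out, check=True)
```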

The other key requirements for running large regression tests are cost, speed, and reliable automation:

The next sections outline the work required to achieve these goals.

Reliability

To avoid human error and increase our bioinformatics throughput, we invested heavily in automation. Some of this was by necessity: our bioinformatics engineering team is small and any operational overhead takes time away from research and development.

We started by making sure the entire pipeline workflow is automated:

  1. When the lab operations team loads the DNA sequencers with samples, an automated process detects new files being written to the network attached storage system in the lab.
  2. The process immediately starts uploading sequencer data to Amazon S3. Another process running in AWS EC2 monitors the new files and watches for sequencing completion markers. The moment a sequencing run is complete, a message is enqueued to trigger a new bioinformatics pipeline run for the new batch of samples (see the sketch after this list).
  3. AWS EC2 instances are automatically scaled up to provision for the computing requirements of the new workload. These EC2 instances are also automatically terminated when the computing is no longer needed (i.e. the pipeline is complete) in order to reduce cost. All data and logs are persisted in the Amazon S3 storage system.
  4. A pipeline run starts processing the data. Dependencies are automatically resolved and the workflow progresses until all required tasks successfully complete.
  5. Once the pipeline is complete, it automatically triggers downstream processes: quality control approval, variant import into the database for variant classifications, and sending out any necessary notifications to teams that need to interpret the data.
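
For a rough sense of step 2, here is a minimal sketch of a watcher that polls for a sequencing completion marker and enqueues a pipeline run. The bucket, queue URL, and marker file name are hypothetical (Illumina instruments typically write a marker such as RTAComplete.txt when a run finishes), and production code would also need error handling and deduplication.

```python
# Sketch of step 2: poll S3 for a sequencing completion marker, then enqueue a
# pipeline run. Bucket, queue URL, and marker name are hypothetical.
import json
import time

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "example-sequencer-uploads"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-runs"


def run_is_complete(run_prefix):
    """True once the sequencer's completion marker has been uploaded."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{run_prefix}/RTAComplete.txt")
    return resp.get("KeyCount", 0) > 0


def watch(run_prefix, poll_seconds=60):
    while not run_is_complete(run_prefix):
        time.sleep(poll_seconds)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"run_prefix": run_prefix}),
    )
```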

Of course, automation only works well if these processes are fault-tolerant and don’t require manual interventions. Our systems’ reliability has improved over time: as new failure modes are detected, we investigate and address the root cause. Root cause analysis is vital here; while it’s possible to improve the reliability of idempotent processes with automatic retries, relying on retries alone generally leads to brittle systems. Our team built a culture of zero runtime exceptions; every pipeline failure gets reported to our team’s Slack channel, so it’s very visible. The on-call engineer is responsible for investigating the exception and preventing repeat failures.

Slack notification for a pipeline run

The causes of intermittent failures in the bioinformatics pipeline are twofold: distributed system failure modes and the unpredictability of biological input. The former are familiar; like any large distributed system, we have to deal with hardware failures, network issues, memory or disk exhaustion, etc. The latter is specific to processing biological data: unlike purely digital input, the diversity of inputs — everyone’s DNA is unique — and the sensitive thermal and chemical processes in the lab produce a wide range of input parameters. For example, small fluctuations in temperature in the lab can negatively affect the PCR amplification process, leading to significantly reduced coverage of the sequenced data. The pipeline has to handle such outliers properly, either by attempting to make variant calls with the same confidence, or by rejecting the sample at QC review and requiring it to be processed in the lab again.
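
The QC gate for such outliers can be thought of as a set of thresholds on per-sample metrics. The sketch below is illustrative only, with made-up metric names and cutoffs rather than our validated clinical thresholds.

```python
# Sketch of a QC gate for outlier samples. Metric names and thresholds are
# made up for illustration; they are not Color's validated clinical cutoffs.
def review_sample_qc(metrics, min_mean_coverage=200, min_fraction_at_20x=0.99):
    failures = []
    if metrics["mean_coverage"] < min_mean_coverage:
        failures.append("mean coverage below threshold")
    if metrics["fraction_bases_at_20x"] < min_fraction_at_20x:
        failures.append("too few target bases covered at 20x")
    if failures:
        # Rejected samples go back to the lab for reprocessing instead of
        # being variant-called with reduced confidence.
        return {"status": "REJECTED", "reasons": failures}
    return {"status": "PASSED", "reasons": []}


print(review_sample_qc({"mean_coverage": 120, "fraction_bases_at_20x": 0.97}))
```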

A key performance indicator for our team is time between manual interventions. Over time, our emphasis on root cause analysis allowed us to keep the bioinformatics pipeline running autonomously for up to 30 days with no intervention, all the while processing thousands of clinical samples, comprising 30+TB of data and utilizing 100,000+ CPU-hours.

Performance

As mentioned above, being able to process a high volume of samples with low turnaround time allows us to run large regression tests. Similarly, research and development benefits from short iteration cycles — for example, when we develop a new version of our assay.

Turnaround time is also important for production clinical samples, to improve the client experience of genetic testing. Looking at Color’s overall client-facing turnaround time, the majority of days are spent either in the clinical lab or in our thorough analysis/classification of called variants. For example, an Illumina NovaSeq sequencer alone takes around 40 hours to sequence a batch of DNA samples. So why does the runtime/reliability of the bioinformatics pipeline matter? Cascading errors, for one: a slow pipeline that later requires manual intervention and a second attempt can easily add several days to processing. We’ve spent years making the pipeline run faster and more reliably, to the point that we now return results in less than 2 hours on average, regardless of the number of samples processed in parallel.

To achieve this level of performance, we first focused on efficient and highly parallel use of AWS EC2 resources. DNA alignment and variant calling are the most computationally intensive tasks, and may use up an entire EC2 instance per sample for a short amount of time (one of the instance types we use is the c5.9xlarge, with 36 CPU cores and 72GB of memory!).

There’s a large set of both third-party and homegrown bioinformatics tools involved in running our pipeline:

Each tool was individually profiled and tuned to make sure CPU, memory and storage resources are efficiently utilized given the size of our workload. What makes this harder is the fact that these tools weren’t necessarily designed for running in a parallel high-throughput environment. While there exist newer tools with better performance characteristics (e.g. minimap2 for alignment, or sambamba for processing aligned reads), our emphasis on quality and correctness requires us to first and foremost use industry-accepted and validated tools.

Here are two specific discoveries/insights which had noteworthy performance impacts:

Cost efficiency

To offer clinical-grade genetics at industry-leading price points, we must keep the costs of running the bioinformatics pipeline low. Some engineers are surprised to learn that, despite ongoing decreases in the overall cost of sequencing, the wetlab process itself (from DNA extraction to sequencing) is a much bigger cost than running software. So why do we care about pipeline compute costs? Apart from the regression tests mentioned above, we know that at some point in the future, storage and computational costs will match or exceed wetlab costs, especially as we move to whole-genome sequencing (WGS). WGS data for a single sample at 30x depth of coverage generates around 60GB of compressed data and can take several hundred CPU-hours to process.
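
A quick back-of-envelope calculation shows why this matters at WGS scale. Only the ~60GB and “several hundred CPU-hours” figures come from the paragraph above; the unit prices below are rough, hypothetical spot-compute and object-storage rates.

```python
# Back-of-envelope cost per WGS sample. Only the ~60GB and "several hundred
# CPU-hours" figures come from the text; the unit prices are rough assumptions.
CPU_HOURS_PER_SAMPLE = 300        # "several hundred CPU-hours"
PRICE_PER_CPU_HOUR = 0.02         # assumed $/vCPU-hour on spot instances
STORAGE_GB = 60                   # compressed 30x WGS data
PRICE_PER_GB_MONTH = 0.023        # assumed object-storage rate

compute_cost = CPU_HOURS_PER_SAMPLE * PRICE_PER_CPU_HOUR
storage_cost_per_year = STORAGE_GB * PRICE_PER_GB_MONTH * 12

print(f"compute ~${compute_cost:.0f}/sample, "
      f"storage ~${storage_cost_per_year:.0f}/sample/year")
```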

A few of the decisions which help keep our pipeline costs low:

Summary

Running a production clinical bioinformatics pipeline requires expertise both in computational biology and distributed systems. By leveraging our team of bioinformaticians and systems engineers, we’re able to achieve high sensitivity and specificity of variant calling, while keeping the operational overhead, runtime, and costs low.

We’re hoping some of these insights will help others in our field, and perhaps inspire the next generation of software engineers pursuing careers in health technology! Check out https://color.com/careers or reach out to bioinformatics@color.com if you have questions.
