A search engine for the human genome, Part I: the genome in software

Elasticsearch is an open-source, scalable search engine built on top of the Lucene core. First released eight years ago, it has since become one of the most popular choices for search infrastructure, with hundreds of companies having deployed it in consumer internet applications.

At Color, we apply search technology to solve critical quality control problems in clinical genomics. To the best of our knowledge, we are the first genetics laboratory to do this at scale. In this post and Part II, we share notes on our methodology and results, and discuss related scenarios where you might consider using search to answer a variety of large-population genetics queries.

Part I provides a high-level overview of bioinformatics and common encodings of DNA sequences. If you’ve always wondered how DNA sequencing works, this is for you! If you’re a computational biologist or otherwise well-versed in these topics, you can skip directly to Part II.

How human genomes are represented in software

DNA, as you might recall from biology class, is a long molecule found in (almost) every cell in a living organism. It carries instructions for building the proteins that perform the bulk of the organism’s functions, and is transmitted from parents to offspring. The protein-building instructions are encoded using an alphabet of four letters (or nucleotides, or bases): A, C, G, and T; in what is sometimes called the “code of life”, each triplet of these letters translates to one of the amino acids that make up a protein, or to instructions about the protein structure.
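To make the triplet idea concrete, here is a minimal sketch in Python. The `CODON_TABLE` below is a deliberately tiny slice of the standard genetic code (the full table has 64 entries), and the `translate` helper and the input string are made up for illustration only.

```python
# A small, deliberately incomplete slice of the standard genetic code.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start signal
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "TAA": "*",  # one of the stop signals
}


def translate(dna):
    """Translate a DNA string three letters at a time, stopping at a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial triplet
    for i in range(0, usable, 3):
        codon = dna[i:i + 3]
        amino = CODON_TABLE.get(codon, "?")  # "?" = not in this toy table
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(translate("ATGTTTGGCAAATAA"))  # prints MFGK
```

Each letter in the output stands for one amino acid; real tools use the complete table and handle details this sketch ignores.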

The human genome — the collection of all our DNA, organized into 23 chromosomes — is just over 3 billion characters long. Each cell in our body contains two copies of the genome: one that we inherited from Mom, and one from Dad. This adds up to a bit over 6 billion characters altogether.

However, these two copies are 99.9% similar. In fact, the genomes of any two humans are also 99.9% similar. A human genome can be thought of as a pair of 3-billion-character strings, but because of this similarity, it’s inefficient to store the entire genomic sequence for every individual. Instead, geneticists often use an agreed-upon, widely available “reference genome,” so that only the differences between a particular genome and the reference genome need to be stored.

For example, if the human genome were 10 characters long and sat on a single chromosome, instead of 3 billion characters spread over several chromosomes, we could agree on the following reference genome:

Then, let’s say the two copies of my genome are:

My two copies are mostly similar to the reference genome (and to one another), but one of my copies differs at position 3. So to represent my genome, instead of listing all 20 characters in it, we could just use something like 3:A, which would mean that at position 3, one of my copies has an A, and everything else is identical to the reference genome. These differences are referred to as “variants” or “mutations”. An entire genome would then just be a series of variants, and look something like:

3:A, 501:C, 3922:A, 3923:A, 40095:T, …
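As a rough sketch of this encoding, the Python below stores one copy of a genome as its differences from a reference and rebuilds it again. The 10-character sequences and the function names are invented for illustration, and the sketch only handles single-letter substitutions.

```python
def encode_variants(reference, copy):
    """Return {position: base} for 1-based positions where copy differs."""
    if len(reference) != len(copy):
        raise ValueError("this sketch only handles substitutions")
    return {
        pos: base
        for pos, (ref_base, base) in enumerate(zip(reference, copy), start=1)
        if ref_base != base
    }


def decode_variants(reference, variants):
    """Rebuild the full sequence from the reference plus its variants."""
    bases = list(reference)
    for pos, base in variants.items():
        bases[pos - 1] = base  # positions are 1-based
    return "".join(bases)


REFERENCE = "ACGTACGTAC"  # hypothetical 10-letter reference
copy_a = "ACATACGTAC"     # differs from the reference at position 3
copy_b = "ACGTACGTAC"     # identical to the reference

variants_a = encode_variants(REFERENCE, copy_a)
print(variants_a)                              # {3: 'A'}
print(encode_variants(REFERENCE, copy_b))      # {}
print(decode_variants(REFERENCE, variants_a))  # ACATACGTAC
```

The savings come from the fact that most positions match the reference, so the dictionary stays small no matter how long the sequence is.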

This approach will look familiar to software engineers and data scientists, as it is used in many situations: to represent sparse feature vectors in machine learning applications, to compress repetitive data using static encoders, and so on.

In this description we’ve left out many details for the sake of simplicity. For example, genomes differ in ways that are more complex than just “one letter is substituted by another” — some extra letters can be inserted, or some can be deleted, or both. There are also cases where both the maternal and the paternal copies differ from the reference genome, but each in different ways; so there’s additional data we need to store, such as the chromosome this variation is found in. But the core idea is clear: we rely on predefined data sets to allow very compact representations of individual genomic sequences.

Genomic similarity

Clinical laboratories like Color apply sophisticated methods to studying these genomes to assess genetic predisposition to cancer, heart disease, and other inherited conditions. During that process we’ll often need to compare two genomes: to check for errors, to identify samples that are related in a way that impacts analysis downstream, and so on.

One real-world example: what if, despite numerous controls and safeguards, a human working in a laboratory accidentally processes the same sample twice, thinking that it’s actually two different samples? To make sure such an error would be caught, we’d want to check if any sample being processed is similar to other recently processed samples, but do so in a way that doesn’t unnecessarily expose any PHI (personal health information, like name and birthdate).

There are a number of methods for comparing genomic sequences. At Color we’re mostly focused on Identity-by-State (IBS). This method checks for the similarity between two sequences in a manner similar to calculating edit distance to compare two strings. We’re particularly interested in an IBS system that would allow:

Many common IBS approaches are optimized for areas of the genome which are known to vary widely among individuals. These areas are commonly relied upon in the context of “genetic traits” or ancestry, since they help identify differences between humans who are, as we noted earlier, otherwise 99.9% similar. But clinical laboratories like Color, which focus on inherited disease, care most about what are called “highly-conserved” regions of the genome that have little variation between individuals. This makes intuitive sense: variations in the genes we study have a higher likelihood of causing fatal diseases, and therefore are less likely to be passed to offspring. For the purposes of IBS, this makes computing similarity even more challenging.
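For readers who want a feel for what an IBS comparison computes, here is a minimal, generic sketch in Python. It is not Color's implementation: the data layout (a dictionary from position to a pair of letters, one per copy), the function names, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def ibs_at_site(genotype_a, genotype_b):
    """Number of letters (0, 1 or 2) two genotypes share at one position."""
    # Counter intersection keeps the minimum count of each letter.
    return sum((Counter(genotype_a) & Counter(genotype_b)).values())


def ibs_similarity(sample_a, sample_b):
    """Fraction of shared letters over positions present in both samples."""
    sites = sample_a.keys() & sample_b.keys()
    if not sites:
        raise ValueError("no positions in common")
    shared = sum(ibs_at_site(sample_a[s], sample_b[s]) for s in sites)
    return shared / (2 * len(sites))


# Hypothetical samples: position -> (letter on one copy, letter on the other)
sample_a = {3: ("A", "G"), 501: ("C", "C"), 3922: ("A", "T")}
sample_b = {3: ("A", "G"), 501: ("C", "T"), 3922: ("T", "T")}

print(ibs_similarity(sample_a, sample_a))  # 1.0, a sample matches itself
print(ibs_similarity(sample_a, sample_b))  # 4 shared letters out of 6
```

A score near 1.0 between two supposedly different samples is the kind of signal that would flag an accidental duplicate. In highly conserved regions most samples share nearly every letter, so scores bunch up close to 1.0, which illustrates why the comparison gets harder there.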

In Part II of this post, we’ll discuss how Elasticsearch helps Color satisfy these requirements and constraints.
