In part 1 of this post, we discussed how genomes are represented in software, why comparing them is useful in a clinical lab setting, and some of the constraints around implementing Identity-by-State (IBS). For those who are new to genomics, it’s worth a quick read.
To recap, our goals for an efficient genomic search tool:
- Fast comparison of a new DNA sample to all existing samples. We’re only interested in the top-K similar samples, rather than the exact similarity estimate between two samples that are clearly unrelated
- Fast pairwise comparison of a given subset of samples. For example, all samples from the last 60 days, or all samples processed by a particular instrument in the lab
- “Interpretable” comparisons: a list of regions or positions that are identical between two genomes, in addition to an overall similarity score
- Capacity that scales linearly or better in the amount of samples, and which supports millions of genomes
The case for search
The most obvious approach, perhaps, would be to store our compact genomic representations in a database, relational or otherwise (and we do, for several use cases; see our work with Google’s BigQuery and Variant Transforms), and then query this database for sample comparisons. Query languages like SQL are the best choice for a variety of informational retrieval purposes, but writing a query to compute similarity (defined according to our specific needs) is often inefficient and inexact.
Another obvious approach, and one upon which we relied for years, is a brute-force batch compute job, running on a cluster of computers, which scans thousands of recent historic samples and looks for unexpectedly high similarity. The downside is clear: it’s slow and expensive to run such compute jobs.
So, in late 2017, we began exploring a search-based approach to the problem. Note that if we replace the term “DNA sample” with the term “document” in the problem statement above, we get a classic description of a full-text search engine. If DNA samples are documents, and variants are words in these documents, we have a wide variety of tools and know-how from the search engine world that allows us to efficiently index and rank samples, produce explainable top-K lists, scale up as needed, support faceting and drill-downs, have access to a variety of ranking strategies, and so on.
Elastic genomic search at Color
We use Elastic to index and store several collections of genomes: those that have been processed in our lab, but also sets of publicly-available genomes such as the Thousand Genomes Project, a widely-used diverse set of genomes. For each genome, we store metadata such as dates, instruments used to process the sample, version information, and so on; importantly, we also store the set of variants discovered in the sample, as a free text field in a sparsely-encoded format similar to the one described in the first post in this series. No personally-identifiable information is stored in the index. Once a sample has completed processing through our wet lab and our high-efficiency bioinformatics pipeline, we add it to the appropriate search index, a process that takes less than a second.
Searching and ranking samples
To find similar samples, we repurpose the set of variants discovered in a sample as a search query to the entire index, as a large OR boolean query of all terms. We can also utilize several common search capabilities implemented in Elastic and elsewhere, such as limiting the number of variants (terms) used for comparison, requiring a minimal overlap between the query and the document, and so on. Particularly appealing is Elastic’s more like this query, which allows efficient document comparison.
A benefit of using a search approach to genomic similarity is that many of the assumptions used to build effective search ranking are applicable to genomes. For example, in natural-language documents, common terms such as “the” tend to be less important to ranking, because they are not distinctive to a particular subset of documents. Search engines will therefore give higher weights to less common terms during the document scoring process; inverse document frequency, or IDF, is one of the core information retrieval concepts that translates this insight into more effective ranking.
As it turns out, the same observation holds for genomes; when comparing two DNA samples, a variation that is common in the overall population should be weighed less than a variant that’s very rare and is expected to only be shared by a small number of samples, possibly related by descent or even identical.
Out of the box, using a single Elasticsearch replica and with very few optimizations, we observe the following latencies:
Elasticsearch, as implied by the name, elastically scales search, so that performance is stable through increased document collection sizes if additional replicas are added (this can be done within seconds, without disrupting the existing serving replicas). In terms of accuracy, we’ve found that our search-driven approach produces ranked lists that are highly correlated with tools we used previously for sample comparison. Our new system correctly flagged every unexpectedly similar sample that our previous implementation also detected, but in a matter of seconds (using a few CPUs) rather than more than ten minutes (using many CPUs).
In addition to efficient indexing, scaling, and IDF ranking, there are many other benefits to using search technology on genomes. Faceted search allows quick exploration of our genomes’ properties and their aggregates; using the metadata indexed along the variants for each sample, we can easily zoom in and focus the search on particular time periods, instruments used, and so on. Lists of matching terms in a ranked result allow us to quickly view overlapping and non-overlapping variants between two samples.
But we’ve also found that once the infrastructure for genomic search is in place, additional use cases become easy to prototype and productionize. For example, it’s straightforward to use our search engine for a quick estimate of the high-level ancestry of a particular sample, by aggregating the self-reported ancestries of its top similar samples. This helps us ensure that the sample processed indeed belongs to the individual we think it belongs to. Whereas if someone self-reported that he is a male of African descent, and the sample appears to come from a female of broadly Asian descent, we’d want to investigate further.
Building on a decades of insights from the information retrieval world, and using highly-scalable, battle-tested open source packages, we’ve found that Elasticsearch is invaluable for exploring genetics in novel ways. We’ve built a fast and scalable search engine for genomes, allowing us to speed up components in our computational pipeline, test hypotheses quickly, and gain a better understanding of the data we process. This directly translates to making our hereditary tests for diseases like cancer and cardiovascular conditions more efficient and accessible, and thus saving more lives.
One final note: the approach outlined here, which is in use today in production at Color for multiple purposes, started out as a proof of concept implemented in a few days during our biannual hack week. If you want to hear more about the engineering, bioinformatics, and data science behind the scenes at Color, shoot us an email at firstname.lastname@example.org, or check our open positions.