Establishing a mapping between different single-cell data modalities
Some say they see Poetry in my paintings; I see only science.
- Georges Seurat
Academic software can be notoriously difficult to install, let alone use. It is often created and maintained by trainees moving between institutions, with wide-ranging levels of experience in software engineering. I’ve written about this at length recently for New Science. Despite this general situation, beautiful tools do emerge. The Seurat toolkit for single-cell genomics is one of the best examples of this.
The software package is named after the famous French painter Georges Seurat, who is primarily known for developing the pointillism painting technique, where complex scenes and landscapes are composed out of small dots of color on a canvas. Like the artist it was named after, the Seurat package provides a comprehensive toolkit for creating biological portraits based on the individual measurements present in single-cell genomics datasets.
The quality of Seurat is apparent in many ways. It is directly installable from CRAN (the main package repository for the R language), extensively documented and tested, and provides comprehensive tutorials for the majority of common use cases.1 Another important aspect is the general approach to creating and publishing computational methods in the Satija Lab, where the tool was created and is actively maintained. Rather than spinning up a separate package for every paper that describes a new method, the lab frequently folds impressive new computational approaches into the main Seurat package, resulting in a more coherent user experience and a lower maintenance burden for the developers.
The recent preprint “Dictionary learning for integrative, multimodal, and scalable single-cell analysis” describes a new computational approach for harmonizing single-cell datasets across different measurement modalities. It is implemented as an extension to the wonderful Seurat package. Here I’ll provide an overview of the problem being solved, and the solution they developed.
In genomics, reference genomes are crucially important for analyzing new sequencing data. The core idea of the Human Genome Project (HGP) was to create a reference map of the bases in a representative human genome. Reference genomes now exist for many species, and the human reference genome has been greatly improved as new sequencing technologies have been developed. New sequencing data is analyzed relative to the reference genome of the species it was collected from, by aligning each sequence against the reference to determine its global position. This is like a look-up operation in a dictionary: by aligning a sequencing read, you can determine the annotations at its position in the genome, such as whether it falls in a gene-coding region.
In the world of single-cell genomics, high-dimensional measurements such as RNA-seq are collected for each individual cell in a sample. The field has relied on clustering analysis to group populations of cells together, and then annotating them based on which genes they express. Until recently, this type of analysis has been unsupervised: there is no ground-truth label for the data, and no equivalent of the reference annotations that anchor DNA sequencing analysis.
This is beginning to change as single-cell genomics matures. There are new large-scale consortia such as the Human Cell Atlas (HCA) and HuBMAP that are generating massive and intensively curated reference data sets of single-cell measurements. The HCA has an incredible mission:
To create comprehensive reference maps of all human cells—the fundamental units of life—as a basis for both understanding human health and diagnosing, monitoring, and treating disease.
As these atlases have expanded in their scope and resolution, ML-based tools have been created to effectively query them with newly collected unlabeled cells—much like aligning a sequencing read to a reference genome.
One of the main shortcomings of current tools and reference sets is that they have been primarily focused on one measurement modality: single-cell RNA-seq (scRNA-seq). Most approaches for annotating alternative measurement types such as scATAC-seq, which measures chromatin accessibility in a cell, work by encoding biological assumptions about how accessibility might relate to RNA expression.
This paper describes a really clever new approach to this problem. It uses a dataset with a large number of multi-omic measurements to serve as a “bridge” between different types of technologies and the reference datasets available.
This technique relies on powerful ideas from a branch of machine learning called representation learning. The core idea is to use tools from “dictionary learning” where the goal is to find a set of numerical elements that can reconstruct the original dataset.2 Using these transformations, it is possible to query the exquisitely annotated reference atlases with data other than scRNA-seq—which is a big step forward.
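As a rough illustration of the dictionary-learning idea, here is a toy sketch with simulated data (not the paper's actual algorithm): the goal is to factor a data matrix into a small set of "atoms" and per-cell weights whose linear combination reconstructs the original measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "expression matrix": 100 cells x 20 genes, secretly generated
# from 5 underlying programs (the true "atoms").
true_atoms = rng.normal(size=(5, 20))
true_codes = rng.random(size=(100, 5))
X = true_codes @ true_atoms

# Dictionary learning sketch: find atoms D and per-cell weights W so
# that W @ D reconstructs X. Plain alternating least squares is used
# here; real dictionary-learning methods add constraints such as
# sparsity on the weights.
n_atoms = 5
D = rng.normal(size=(n_atoms, X.shape[1]))
for _ in range(10):
    W = np.linalg.lstsq(D.T, X.T, rcond=None)[0].T   # update weights
    D = np.linalg.lstsq(W, X, rcond=None)[0]         # update atoms

error = np.linalg.norm(X - W @ D) / np.linalg.norm(X)
print(f"relative reconstruction error: {error:.2e}")
```

Because the toy matrix is exactly low-rank, the factorization recovers it almost perfectly; on real single-cell data the atoms instead capture dominant shared structure across cells.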
With some intuition for the problem being solved and what bridge integration accomplishes, let’s look at an example of how it works. For the initial demonstration, the query was scATAC-seq data collected from human bone marrow mononuclear cells (BMMCs). The reference dataset was the Azimuth reference, and the “bridge” between them was a 10x Multiome dataset—meaning that it contained both chromatin and expression measurements.
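To make the bridge idea concrete, here is a toy numerical sketch (simulated data and a simple similarity-weighting scheme, not the paper's dictionary-based method or the Seurat API): a query ATAC-only cell is expressed as weights over the bridge cells using the shared ATAC modality, and those weights are then carried across the bridge into RNA space, where the annotated reference lives.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setup:
#   bridge : multiome cells measured in BOTH modalities
#   query  : an ATAC-only cell we want to place in RNA space
n_bridge, n_rna, n_atac = 200, 30, 40
bridge_rna = rng.normal(size=(n_bridge, n_rna))
bridge_atac = rng.normal(size=(n_bridge, n_atac))

# A query cell that closely resembles bridge cell 0 in ATAC space.
query_atac = bridge_atac[0] + 0.01 * rng.normal(size=n_atac)

# Step 1: express the query as weights over bridge cells via the
# shared ATAC modality (softmax over similarity, a crude stand-in
# for the paper's learned representation).
sims = bridge_atac @ query_atac
weights = np.exp(sims - sims.max())
weights /= weights.sum()

# Step 2: carry those weights across the bridge into RNA space,
# where an annotated reference could then transfer labels to it.
query_in_rna_space = weights @ bridge_rna

# Sanity check: the projected query should land nearest the RNA
# profile of the bridge cell it resembled in ATAC space.
nearest_bridge = int(np.argmax(bridge_rna @ query_in_rna_space))
print("query mapped nearest to bridge cell", nearest_bridge)
```

The point of the sketch is the information flow: the bridge's paired measurements are what let an ATAC profile be translated into coordinates that an RNA reference understands.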
Above you can see that the scATAC-seq query was annotated using the RNA reference. Encouragingly, the annotations broadly overlapped with the previous results from the authors—indicating that the technique is accurate.3 The real value of bridge integration could be seen in what else they were able to find beyond the original annotations.
To start, their analysis “found that bridge integration annotated additional rare and high-resolution subpopulations.” The intuition here is that leveraging a larger reference can provide resolution that unsupervised analysis on the raw dataset isn’t powered to match. Imagine trying to annotate the entire genome from scratch for each analysis! Same goes for cells.
The new representation offered another really interesting advantage: they could compare how variation in one modality impacted the other. How are gene expression and chromatin accessibility actually connected? As a simplistic model, when chromatin is accessible, regulators of gene expression can bind DNA and initiate transcription. Indeed, previous computational tools assumed exactly this relationship. But empirically, the story is more complex: the authors found many cases where there was a lag between chromatin becoming accessible and transcription starting. This type of resolution highlights how exciting this approach is for mapping between different types of biological measurements.
The last result that I’ll mention is that they also used bridge integration to tackle the problem of ‘community-wide’ integration, which they defined as the “challenge of harmonizing a broad swath (or the entirety) of publicly available single-cell datasets from a single organ.” This is a really hard computational problem.
One of the previously described approaches has been geometric sketching, which was developed in the Berger Lab at MIT. This has been a powerful integration approach, but still requires running PCA on all of the data being integrated. In this study, they combined ideas from geometric sketching and bridge integration to dramatically increase the efficiency of community-wide integration.
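The intuition behind geometric sketching can be shown with a toy example (a crude grid-based stand-in on simulated 2-D data, not the Berger Lab's actual algorithm): by sampling evenly over the data's geometry rather than its density, a small sketch retains rare populations that uniform subsampling would dilute.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: a huge dense cluster plus a tiny rare population.
common = rng.normal(loc=0.0, scale=1.0, size=(9900, 2))
rare = rng.normal(loc=8.0, scale=0.5, size=(100, 2))
X = np.vstack([common, rare])

def grid_sketch(X, n_bins=20):
    """Keep one point per occupied grid cell: a crude stand-in for
    geometric sketching, which covers the data's geometry instead of
    sampling proportionally to its density."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    bins = np.floor((X - mins) / (maxs - mins + 1e-9) * n_bins).astype(int)
    _, first_idx = np.unique(bins, axis=0, return_index=True)
    return first_idx

idx = grid_sketch(X)
rare_fraction_full = 100 / 10000
rare_fraction_sketch = np.mean(idx >= 9900)
print(f"rare population: {rare_fraction_full:.1%} of full data, "
      f"{rare_fraction_sketch:.1%} of sketch")
```

The rare cluster makes up 1% of the full data but a far larger share of the sketch, which is why downstream steps like PCA and integration can run on the sketch without losing rare cell types.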
The UMAP plot shown above represents the integration of 19 different single-cell genomics datasets characterizing human lung biology—which has been of intense interest during the COVID-19 pandemic. Amazingly, their “atomic sketch integration procedure performed all these steps (including preprocessing) in 55 minutes, using a single computational core.” The incredible computational efficiency made it possible to compose a cellular portrait that would make Georges Seurat proud.
Single-cell genomics is an enormously powerful paradigm that is beginning to rewrite many aspects of our understanding of biology. We are now creating comprehensive molecular atlases of all of the different cell types in the human body. The bridge integration approach described in this study unlocks the power of these reference datasets by making it possible to query them with a large number of data modalities beyond gene expression measurements.
In combination with geometric sketching, it is also a powerful tool for creating new integrated atlases across datasets. For this reason, the authors “anticipate that our methods will be valuable to both individual labs but also larger consortia that have already invested in constructing and annotating comprehensive scRNA-seq references.”
It can be dizzying to think about what the future may hold for single-cell genomics. As these measurements are combined with genome-scale CRISPR perturbations, we could enter a world with predictive maps between genotypes and phenotypes. What if the role of a specific gene in any cell type could be accurately resolved with a computational query? It’s hard to even reason about what this might unlock for biotechnology!
Thanks for reading this highlight of “Dictionary learning for integrative, multimodal, and scalable single-cell analysis.” If you’ve enjoyed this post and don’t want to miss the next one, you can sign up to have them automatically delivered to your inbox.
Until next time! 🧬
One noteworthy point here is that the quality of the R tooling for creating package docs/websites and testing is really phenomenal. Hadley Wickham and the tidyverse team have extended the level of design thinking that went into their famous ggplot2 graphics package to the entire set of tooling for R package development. This process is documented in the free book R Packages. While I enjoy many other languages including Python, I haven’t found an equivalent developer experience for packages.
For the machine learners, this is explicitly “a representation of the input data as a weighted linear combination of individual basic elements.” In some ways, the goal reminds me of how variational autoencoders are used, but instead of arriving at a latent space that is decoded, you have a set of “atoms” that are linearly combined to recreate the data. For more, there are diagrams in the paper showing the matrix transformations involved, and a comprehensive mathematical description in the supplementary material in the paper.
This is not the only way that accuracy was assessed. There is an entire section of the paper dedicated to “Robustness and benchmarking analysis” that is worth looking at if you are a scientist considering using this approach in your own work.