TotalVI: A transformative algorithm
As part of an effort to build a Human Cell Atlas (HCA), scientists around the world have turned their focus to mapping human organs and tissues at the single-cell level. When completed, a goal that could be decades away, the HCA will provide a massive database of detailed information about billions of cells, ultimately affecting nearly every area of biology and medicine.
Like the Human Genome Project, an ambitious collaboration to sequence all human genes that was completed in 2003, the HCA presents an extensive bioinformatics problem. Once all of the data, including millions of cells and thousands of genes and proteins, is collected, the challenge will be how to manage, analyze and distribute this information within a single organizational system.
To address this problem, UC Berkeley researchers in the labs of Aaron Streets, assistant professor of bioengineering, and Nir Yosef, associate professor of electrical engineering and computer sciences, have invented a computer algorithm called totalVI. Their approach — described in a study published in Nature Methods on Feb. 15 — uses deep learning to integrate gene and protein data about single cells that were gathered from different tissues and donors and that were processed in different labs.
Many researchers studying single cells use a technique known as CITE-seq to sequence RNA as well as to gather information on surface proteins, both of which relay very specific information about the cell. But when it comes to analysis, relying exclusively on one source or the other poses technical problems, and scientists have struggled to use both sources when analyzing a cell. TotalVI allows researchers to pair both sets of information while taking into account uncertainties in the measurement process, revealing new insights about the fundamentals of cells and enabling other biological discoveries.
“The combination of CITE-seq and totalVI allows us to estimate, from the same cell, not only its gene expression but also the expression of the cell membrane proteins,” said Yosef. “Those tell us a lot about the biology of the cells, since working with these proteins is kind of the standard in immunology.”
The algorithm is transformative for projects like the HCA, in which much of the data is provided by CITE-seq and similar techniques. Because separate experiments from such a broad coalition of researchers use different antibody panels, previous algorithms couldn’t accommodate all of these variations, which meant that some or all of the valuable information about the proteins would be lost. TotalVI will make collaborative efforts like the HCA not only possible, but also more accurate and efficient.
“It’s able to take data and ‘learn’ the statistical properties so that we can extract the interesting parts, like the differences between diseased and healthy cells, and ignore the uninteresting parts, like experimental noise,” said Streets.
In their study, the researchers conducted a series of CITE-seq experiments on spleen and lymph node cells. Then, using this data as well as data from several public databases, they showed that totalVI could perform many individual computational tasks as well as or better than other state-of-the-art methods, and could also scale to much larger datasets.
“The new algorithm enables us to integrate single-cell datasets from labs around the world and will aid the progression of global knowledge bases like the Human Cell Atlas,” said Ph.D. student Adam Gayoso, who was co-first author of the study, along with Ph.D. student Zoë Steier. “Ultimately, totalVI will help us refine our understanding of cellular identity.”