A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing

Journal Article (2021)
Author(s)

Brian Aevermann (J. Craig Venter Institute)

Yun Zhang (J. Craig Venter Institute)

Mark Novotny (J. Craig Venter Institute)

Mohamed Keshk (J. Craig Venter Institute)

Trygve Bakken (Allen Institute)

Jeremy Miller (Allen Institute)

Rebecca Hodge (Allen Institute)

Boudewijn Lelieveldt (TU Delft - Electrical Engineering, Mathematics and Computer Science, Leiden University Medical Center)

Ed Lein (Allen Institute)

Richard H. Scheuermann (La Jolla Institute for Immunology, University of California, J. Craig Venter Institute)

Research Group
Pattern Recognition and Bioinformatics
DOI related publication
https://doi.org/10.1101/gr.275569.121 (final published version)
Publication Year
2021
Language
English
Issue number
10
Volume number
31
Pages (from-to)
1767-1780
Collections
Institutional Repository
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Single-cell genomics is rapidly advancing our knowledge of the diversity of cell phenotypes, including both cell types and cell states. Driven by single-cell/-nucleus RNA sequencing (scRNA-seq), comprehensive cell atlas projects characterizing a wide range of organisms and tissues are currently underway. As a result, it is critical that the transcriptional phenotypes discovered are defined and disseminated in a consistent and concise manner. Molecular biomarkers have historically played an important role in biological research, from defining immune cell types by surface protein expression to defining diseases by their molecular drivers. Here, we describe a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, which leverages the nonlinear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that optimally capture the cell type identity represented in complete scRNA-seq transcriptional profiles. The marker genes selected provide an expression barcode that serves as both a useful tool for downstream biological investigation and the necessary and sufficient characteristics for semantic cell type definition. The use of NS-Forest to identify marker genes for human brain middle temporal gyrus cell types reveals the importance of cell signaling and noncoding RNAs in neuronal cell type identity.
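The idea of a minimal marker gene combination can be illustrated with a small, hedged sketch. It is not the NS-Forest implementation: the random forest ranking step is omitted and replaced here by a simple ranking on a binary expression score. The toy data, the function names, the exact form of the score, the expression threshold, and the use of an F-beta score with beta = 0.5 are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import median

# Toy data, invented for illustration: (cluster_label, {gene: expression}).
CELLS = [
    ("A", {"g1": 5.0, "g2": 0.0, "g3": 2.0}),
    ("A", {"g1": 4.0, "g2": 0.0, "g3": 3.0}),
    ("A", {"g1": 6.0, "g2": 1.0, "g3": 0.0}),
    ("B", {"g1": 0.0, "g2": 4.0, "g3": 2.0}),
    ("B", {"g1": 0.0, "g2": 5.0, "g3": 3.0}),
    ("C", {"g1": 0.0, "g2": 0.0, "g3": 4.0}),
    ("C", {"g1": 1.0, "g2": 0.0, "g3": 5.0}),
]


def cluster_medians(cells):
    """Median expression of every gene within every cluster."""
    by_cluster = {}
    for label, expr in cells:
        for gene, value in expr.items():
            by_cluster.setdefault(label, {}).setdefault(gene, []).append(value)
    return {c: {g: median(v) for g, v in genes.items()}
            for c, genes in by_cluster.items()}


def binary_score(gene, target, medians):
    """Assumed score in [0, 1]: high when the gene is 'on' in the target
    cluster and 'off' in the other clusters."""
    target_med = medians[target][gene]
    if target_med <= 0:
        return 0.0
    others = [max(0.0, 1.0 - medians[c][gene] / target_med)
              for c in medians if c != target]
    return sum(others) / len(others)


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score; beta < 1 weights precision above recall."""
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def combo_f_beta(cells, target, combo, threshold=0.0):
    """Call a cell 'target' only if every gene in the combo is expressed."""
    tp = fp = fn = 0
    for label, expr in cells:
        predicted = all(expr[g] > threshold for g in combo)
        if predicted and label == target:
            tp += 1
        elif predicted:
            fp += 1
        elif label == target:
            fn += 1
    return f_beta(tp, fp, fn)


def select_markers(cells, target, max_genes=2, top_n=3):
    """Rank genes by binary score, then search small combinations.
    A larger combination replaces a smaller one only if it scores
    strictly higher, so the result stays minimal."""
    medians = cluster_medians(cells)
    genes = sorted(medians[target])
    ranked = sorted(genes, key=lambda g: -binary_score(g, target, medians))
    best_combo, best_score = (), 0.0
    for size in range(1, max_genes + 1):
        for combo in combinations(ranked[:top_n], size):
            score = combo_f_beta(cells, target, combo)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score


for cluster in ("A", "B"):
    print(cluster, select_markers(CELLS, cluster))
```

On this toy data a single gene is enough for cluster A, while cluster B needs two genes together to exclude a cell from cluster A that expresses one of them at a low level. This shows why a combination can classify better than any individual marker.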
