Since the advent of single-cell RNA sequencing, the number of available datasets has increased enormously. In these datasets, cell identification is mainly done manually, which is subjective and time-consuming. As a consequence, most datasets are annotated at different resolutions. This is not surprising, as cell types form a hierarchy, but it can be problematic for downstream analysis or for the comparison of datasets. Several supervised methods have already been developed to overcome the drawbacks of unsupervised learning. None of these, however, combines the information found in multiple datasets while preserving the definition of the cell populations in each dataset, even though this consistency is necessary for downstream analysis. Furthermore, a supervised classifier should be able to detect new cell populations in an unlabeled dataset. Here, we introduce a hierarchical progressive learning pipeline with a one-class classifier to address these challenges. Using this pipeline, it is possible to construct a hierarchical classification tree by combining the information of multiple datasets. If datasets are annotated at different resolutions, their cell populations end up at different levels of the tree, so all definitions are preserved. Using a one-class classifier for each cell population also provides a working rejection option, which makes it possible to discover new cell populations. In this paper, we show that it is possible to construct a classification tree for simulated data and for immune cells. Comparing the pipeline with a one-class classifier to the pipeline with a linear classifier, we show that a one-class classifier can indeed improve the rejection option. Using a linear classifier, on the other hand, results in higher accuracy. Choosing between a one-class and a linear classifier is therefore a trade-off between the ability to discover new cell populations and higher classification accuracy.
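To make the idea of a classification tree with a per-population rejection option concrete, the following is a minimal sketch and not the pipeline itself. It assumes a toy one-class model (a centroid with a distance threshold) in place of whichever one-class classifier the pipeline uses, a hand-built tree instead of one learned progressively from multiple datasets, and invented two-dimensional "expression" values. The names `Node`, `fit_node`, `predict` and the `slack` parameter are hypothetical.

```python
import math
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Node:
    """One cell population in the tree, with a toy one-class model."""
    name: str
    centroid: tuple = None          # None only for the root, which is never queried
    radius: float = math.inf
    children: list = field(default_factory=list)

    def accepts(self, x):
        # Toy one-class decision: accept a cell if it lies within the radius.
        return math.dist(x, self.centroid) <= self.radius


def fit_node(name, cells, slack=1.5, children=()):
    """Fit the centroid-and-radius model on the cells of one population.

    `slack` (an assumption of this sketch) widens the radius beyond the
    farthest training cell.
    """
    centroid = tuple(mean(dim) for dim in zip(*cells))
    radius = slack * max(math.dist(c, centroid) for c in cells)
    return Node(name, centroid, radius, list(children))


def predict(root, x):
    """Walk down the tree and stop where no child population accepts the cell."""
    node = root
    while True:
        accepted = [c for c in node.children if c.accepts(x)]
        if not accepted:
            break
        # If several children accept, follow the closest one.
        node = min(accepted, key=lambda c: math.dist(x, c.centroid))
    return "rejected" if node is root else node.name


# Invented toy data: one dataset labels T cells coarsely, another splits them.
cd4 = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
cd8 = [(2.0, 1.0), (2.1, 1.1), (1.9, 0.9)]
b_cells = [(6.0, 5.0), (6.2, 5.1), (5.9, 4.9)]

t_node = fit_node(
    "T cell",
    cd4 + cd8,
    children=[fit_node("CD4 T cell", cd4), fit_node("CD8 T cell", cd8)],
)
root = Node("root", children=[t_node, fit_node("B cell", b_cells)])

print(predict(root, (1.05, 1.0)))   # CD4 T cell
print(predict(root, (1.5, 1.0)))    # T cell (no subpopulation accepts it)
print(predict(root, (10.0, 10.0)))  # rejected
```

Because each population has its own accept-or-reject decision, a cell can be labeled at an intermediate level of the tree or rejected altogether, which is what allows a new population to surface. A linear classifier that always assigns one of the known classes offers no such stopping point.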