Key Fragmentomics Features for Cancer Detection

An Analytical Approach to Identifying Essential Characteristics for Cancer Detection and Classification Using DNA Fragments from Blood Samples

Bachelor Thesis (2024)
Author(s)

D. Peţa (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

S. Makrodimitris – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

I.B. Pronk – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Daan Hazelaar – Mentor (Erasmus MC)

Marcel J. T. Reinders – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

JA Pouwelse – Graduation committee member (TU Delft - Data-Intensive Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
28-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Cancer remains a major challenge in medicine, and early detection is essential for improving treatment outcomes. Fragmentomics has emerged as a promising basis for efficient, non-invasive cancer diagnostics. By analysing the differences between cell-free DNA (cfDNA) fragments in blood samples from healthy individuals and from patients with cancer, this study aims to determine the most important fragmentomics features for cancer detection. The method consists of four steps: extracting features from the cfDNA fragments in the experimental dataset; applying a pipeline of feature selection techniques that removes redundant features; training and evaluating a logistic regression classifier and a random forest classifier to differentiate between healthy and diseased samples; and extracting the feature weights from the trained models to understand which features contributed most to the classification. Filter-based variance thresholding and Correlation-based Feature Selection (CFS) are employed to refine the dataset. The independent t-test and the Mann-Whitney U test are used to test for differences between the cancer and healthy samples. The Pearson correlation coefficient measures the correlation between each pair of features. The classification performance of the two models is assessed using a train/test split and nested cross-validation. The evaluation shows that logistic regression consistently outperforms the random forest and that removing redundant features improves the performance of both classifiers. Certain genomic bins, mostly on chromosomes 1, 7 and 8, contain features that are significant for the classification task. These findings suggest that understanding the importance of fragmentomics features can lead to improved diagnostic tools, such as cancer detection based on blood tests.
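
The two filtering steps named in the abstract (variance thresholding followed by correlation-based removal of redundant features) can be illustrated with a minimal sketch. This is not the thesis's implementation: the greedy redundancy filter below is a simplified stand-in for CFS, and the function names, thresholds, feature names and toy values are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant sequence has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(columns, labels, var_threshold=1e-3, corr_threshold=0.9):
    """Return the feature names that survive both filters.

    columns: dict mapping feature name -> list of values (one per sample)
    labels:  list of 0/1 class labels (hypothetical: 0 = healthy, 1 = cancer)
    """
    # Step 1: variance thresholding drops near-constant features.
    kept = [name for name, values in columns.items()
            if pvariance(values) > var_threshold]

    # Step 2: rank by absolute correlation with the label, then greedily
    # keep a feature only if it is not strongly correlated with one that
    # was already selected (simplified redundancy filter, not full CFS).
    kept.sort(key=lambda n: abs(pearson(columns[n], labels)), reverse=True)
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) < corr_threshold
               for s in selected):
            selected.append(name)
    return selected


# Toy data (invented): four per-bin features over six samples.
columns = {
    "bin_a": [0.1, 0.2, 0.1, 0.8, 0.9, 0.7],
    "bin_a_copy": [0.2, 0.4, 0.2, 1.6, 1.8, 1.4],  # scaled copy of bin_a
    "bin_const": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],   # zero variance
    "bin_noise": [0.3, 0.9, 0.5, 0.4, 0.8, 0.6],
}
labels = [0, 0, 0, 1, 1, 1]

selected = select_features(columns, labels)
print(selected)
```

On this toy input the constant feature is removed by the variance filter, and only one of the two perfectly correlated features is retained. In practice the thresholds would need tuning on the real dataset, and the selected features would then be passed to the classifiers.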
