Electropherograms (EPGs) are produced in forensic laboratories when DNA traces are collected and processed. EPGs produced under good conditions with a single DNA contributor are relatively easy to interpret. With more contributors, EPGs become difficult to interpret correctly because of overlapping peaks, variability across loci and runs, and the proliferation of stutter and other artefacts. These challenges are amplified in DNA mixtures with low-template or imbalanced contributors, where the noise-to-signal ratio is high. Here, traditional rule-based and statistical approaches fall short on their own and depend on human analysts for the final calls. Recent deep learning methods show promise but remain limited by scarce verified training data, ambiguous ground-truth labels, and poor generalization across kits and genotypes.
This thesis investigates how data augmentation and preprocessing can improve the robustness of deep learning models for allele detection from EPGs. First, a conditional generative adversarial network (cGAN) is employed to generate synthetic yet realistic electropherograms, extending training diversity beyond the limited set of verified contributors. Second, preprocessing techniques are introduced to standardize EPG data. The key preprocessing technique, Scan Point Standardization (SPS), aligns signals across runs using the internal lane standard, complemented by dye-wise fluorescence (RFU) standardization and dynamic-range compression (DRC) to mitigate intensity imbalance and baseline noise. Third, a genotype-aware data-splitting protocol is developed using spectral clustering to prevent genotype leakage between train and test sets, ensuring fair generalization assessment. Lastly, the models developed in this research are applied to NFI's R&D data, which uses a different kit, to investigate the cross-kit compatibility of deep learning models.
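As a rough illustration of the preprocessing steps named above, the following minimal sketch resamples a dye channel onto a shared base-pair grid using internal lane standard (ILS) peaks, then shows one possible form of RFU standardization and dynamic-range compression. All function names, the piecewise-linear calibration, the zero-mean/unit-variance scaling, and the log-style compression are assumptions made for illustration; the procedures actually used in the thesis may differ.

```python
import math
from bisect import bisect_right


def interp(x, xs, ys):
    """Piecewise-linear interpolation of y at x, clamped to the end values."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def standardize_scan_points(signal, ils_scans, ils_sizes_bp, bp_grid):
    """Resample one dye channel onto a base-pair grid shared by all runs.

    ils_scans: scan indices of the ILS peaks in this run (increasing).
    ils_sizes_bp: known fragment sizes of those peaks (increasing).
    """
    # Invert the ILS calibration: base-pair size -> scan position in this run.
    scans = [interp(bp, ils_sizes_bp, ils_scans) for bp in bp_grid]
    idx = list(range(len(signal)))
    # Read the signal at those (possibly fractional) scan positions.
    return [interp(s, idx, signal) for s in scans]


def standardize_rfu(signal):
    """Zero-mean, unit-variance scaling of one dye channel (assumed form)."""
    mean = sum(signal) / len(signal)
    var = sum((v - mean) ** 2 for v in signal) / len(signal)
    std = math.sqrt(var) or 1.0  # avoid division by zero on flat signals
    return [(v - mean) / std for v in signal]


def compress_dynamic_range(signal):
    """Sign-preserving log compression of RFU values (assumed form)."""
    return [math.copysign(math.log1p(abs(v)), v) for v in signal]


# Hypothetical example: the same 100 bp fragment migrates to scan 20 in
# run A and to scan 26 in run B; the ILS peaks (80 bp and 120 bp) shift too.
grid = [80, 90, 100, 110, 120]
run_a = [0.0] * 40
run_a[20] = 1000.0
run_b = [0.0] * 40
run_b[26] = 500.0

aligned_a = standardize_scan_points(run_a, [10, 30], [80, 120], grid)
aligned_b = standardize_scan_points(run_b, [14, 38], [80, 120], grid)
```

After resampling, the peak sits at the same grid position in both runs even though it appeared at different scan points, which is the kind of cross-run alignment SPS aims for. The standardization and compression functions then address the remaining difference in peak height.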
Experimental results show that introducing synthetic data alone did not improve performance on our dataset. Likewise, none of the preprocessing techniques (standardization, DRC, or SPS) improved performance on pixel metrics. Genotype-aware splitting reveals that conventional random file-level splits significantly overestimate model performance by allowing train-test leakage. Once SPS and genotype-aware, leakage-free splitting were incorporated, however, synthetic data resulted in significant performance improvements. Collectively, these findings demonstrate that preprocessing, realistic synthetic augmentation, and leakage-free evaluation are essential for achieving robust, generalizable allele calling in forensic DNA profiling. Finally, models trained on purely synthetic data, as well as on combined synthetic and real data, using the GlobalFiler kit were then trained and tested on NFI's R&D data using the PPF6C kit, showing that transfer learning is possible between different PCR kits.
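To make the leakage problem concrete, the sketch below groups files that share any contributor and assigns whole groups to either the train or the test set, so that no contributor appears on both sides. It is a simplified stand-in: it uses connected components (union-find) rather than the spectral clustering used in this thesis, and the file names, contributor labels, and function names are hypothetical.

```python
from collections import defaultdict


def contributor_groups(file_contributors):
    """Group files that share at least one contributor (union-find)."""
    parent = {f: f for f in file_contributors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    first_file_for = {}
    for f, contributors in file_contributors.items():
        for c in contributors:
            if c in first_file_for:
                # Merge this file's group with the group that first saw c.
                parent[find(f)] = find(first_file_for[c])
            else:
                first_file_for[c] = f

    groups = defaultdict(list)
    for f in file_contributors:
        groups[find(f)].append(f)
    return sorted(sorted(g) for g in groups.values())


def leakage_free_split(file_contributors, test_fraction=0.3):
    """Assign whole contributor groups to test until the budget is used."""
    budget = test_fraction * len(file_contributors)
    train, test = [], []
    for group in sorted(contributor_groups(file_contributors),
                        key=len, reverse=True):
        if len(test) + len(group) <= budget:
            test.extend(group)
        else:
            train.extend(group)
    return train, test


def shared_contributors(train, test, file_contributors):
    """Contributors present on both sides of the split (should be empty)."""
    in_train = set().union(*(file_contributors[f] for f in train))
    in_test = set().union(*(file_contributors[f] for f in test))
    return in_train & in_test


# Hypothetical mixture files and their contributors.
files = {
    "mix_01": {"A", "B"},
    "mix_02": {"B", "C"},
    "mix_03": {"D"},
    "mix_04": {"D", "E"},
    "mix_05": {"F"},
}
train_files, test_files = leakage_free_split(files, test_fraction=0.4)
leaked = shared_contributors(train_files, test_files, files)
```

A random file-level split of the same data could place `mix_01` in train and `mix_02` in test, leaking contributor B. When contributors are reused across many mixtures, exact connected components tend to merge into one large group, which is one reason a softer grouping such as clustering can be preferable in practice.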