Representations of DNA Sequence Context and Mutational Spectra for Prediction of Repair Deficiencies

Abstract

Double-strand break (DSB) repair is a critical cellular process that repairs breaks in both strands of the DNA double helix, and several distinct repair mechanisms carry it out. Predicting deficiencies in these mechanisms is widely used for therapeutic purposes, such as targeting cancer cells that have specific DNA repair deficiencies. DSB repair, however, is not error-free and results in mutations, which are also influenced by the DNA sequence surrounding the break site. To the best of our knowledge, sequence representations have not been considered when predicting DNA repair deficiencies. We hypothesise that higher-order information can be extracted from sequence representations. In this study, we address the problem of predicting Non-Homologous End Joining (NHEJ) repair deficiencies. We first evaluate how accurately NHEJ repair deficiency can be predicted using only the mutational outcome frequencies (mutational spectra). We then examine how combining mutational spectra with representations of the sequence surrounding the break site can improve this prediction. We demonstrate that adding DNABERT sequence representations to mutational spectra features significantly improves prediction accuracy, from 94.44% to 96.12%. We also show that even simple sequence representations, such as 1-mer frequencies, can lead to significant improvements. Our findings highlight the importance of including sequence representations alongside mutational spectra in repair deficiency prediction.
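
To make the simplest case concrete, the following is a minimal sketch of how 1-mer frequencies of the sequence around a break site could be appended to a mutational spectrum to form one feature vector. The function names (`kmer_frequencies`, `combine_features`), the choice of plain concatenation, and the toy inputs are illustrative assumptions, not the pipeline used in this study.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=1, alphabet="ACGT"):
    """Relative frequency of every k-mer over `alphabet`, in lexicographic order.

    Windows containing characters outside the alphabet (e.g. 'N') are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def combine_features(spectrum, seq, k=1):
    """Concatenate a mutational spectrum with k-mer frequencies of the context.

    `spectrum` is assumed to be a sequence of mutational outcome frequencies;
    concatenation is an assumed (hypothetical) way of combining the two views.
    """
    return list(spectrum) + kmer_frequencies(seq.upper(), k)


# Toy example: two outcome frequencies plus a four-base context.
features = combine_features([0.7, 0.3], "aacg")
```

The resulting vector could then be passed to any standard classifier; with `k=1` it adds only four features (the frequencies of A, C, G and T) to the spectrum.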