Constrained Codes for DNA-Based Storage Systems

Bachelor Thesis (2020)
Author(s)

C.J. van Leeuwen (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J.H. Weber – Mentor (TU Delft - Discrete Mathematics and Optimization)

JAM de Groot – Graduation committee member (TU Delft - Mathematical Physics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2020 Lot van Leeuwen
More Info
expand_more
Publication Year
2020
Language
English
Copyright
© 2020 Lot van Leeuwen
Graduation Date
27-05-2020
Awarding Institution
Delft University of Technology
Programme
['Applied Mathematics']
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Every day we produce an extremely high amount of data and a significant portion of this data is archival data. It is an important challenge to save this data in a cheap and environmentally friendly way. The current methods to save archival data are sufficient, but improvements can be made. Synthetic DNA based data storage is a great choice to do so. DNA is (roughly) made out of four nucleotides, Adine (A), Cytosine (C), Guanine (G) and Thymine (T). To save data using DNA the binary data is encoded to quaternary data and later DNA strands are created using this quaternary data. During the process of saving the data errors can occur. Previous research [3] has found that some errors can be prevented by taking two constraints into account.The no run-length constraint, which states that no DNA word can have two repeated symbolsand GC-weight constraint, which states that every DNA word must have a fixed number of G and C nucleotides. These constraints reduce the number of quaternary data sequences that can be used to save data. The aim of this thesis is to improve the lower bound on the maximum number of quaternary data sequences that satisfy the no run-length constraint, the fixed GC-weight and a minimum distance. Limbachiya et. al. [1] gave an algorithm to compute a set with a given minimum distance of quaternary data sequences that satisfies the constraints. The size of this set is the lower bound that we aim to improve. We do so by introducing two other algorithms that also compute a quaternary data sequences that satisfy the constraints and a minimum distance. The size of each computed set gives a lower bound on the maximum size. The lower bound computed with these two algorithms is always better or equal to the lower bound computed by Limbachiya et. al. for certain parameters. When the minimum distance of a code is 2 we know this maximum size. The formula for this maximum size is given in this thesis together with a proof for this formula.

Files

License info not available