Genre
- Honours
This study addresses the reliance of the DeLUCS pipeline on synthetic data, and investigates replacing its artificially generated mimic sequences with biologically relevant data. It introduces a novel pipeline, DeLUCS-F (Deep Learning of Unsupervised Clusters of DNA Sequences using Sequence Fragments), which builds upon the DeLUCS pipeline by increasing both the trustworthiness of its predictions and its performance. This is accomplished by using sequence fragments, in place of artificial data, for data augmentation. DeLUCS is considered the state of the art in the application of unsupervised deep learning to the problem of taxonomic assignment. Its main advantages are that 1) it can be applied to a variety of genomic data, including DNA, mitochondrial DNA, DNA segments, and individual genes, and was tested on vertebrate, bacterial, and viral data; 2) it is alignment-free, and thus can overcome obstacles faced by alignment-based solutions (namely, that they are prohibitively slow); 3) it involves deep learning, which can learn more complex patterns from big datasets than classical machine learning algorithms can; and 4) it is unsupervised, meaning that it can be used when labels are unavailable or cannot be trusted, due to either the potential presence of errors or the ever-changing nature of taxonomy. However, it relies on data augmentation to achieve its high performance (relative to other unsupervised methods). This is problematic because its use of both deep neural networks and a majority voting scheme makes it a "black box" model, raising the question of whether the patterns it learns come from the true data or from the artificial data.
This is especially concerning in the field of genomics, where it is still largely unknown how mutations can change the function of entire sequences, and thus biologically viable artificial data cannot be reliably generated. Thus, while DeLUCS achieves good results, its predictions are called into question because it could be learning from biologically irrelevant data. This study describes relevant background on taxonomic assignment and the use of synthetic data, the contributions made by this novel pipeline, and a comparison of the results of classical unsupervised methods, DeLUCS, and DeLUCS-F, and presents potential future research.
Language
- English
ETD Degree Name
- Bachelor of Science
ETD Degree Level
- Bachelor
ETD Degree Discipline
- Faculty of Science. Honours in Computer Science