Maximal Independent Sets of a K-mer Space
Open Access
- Author:
- Ma, Leran
- Area of Honors:
- Computer Science
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Mingfu Shao, Thesis Supervisor
Jesse Louis Barlow, Thesis Honors Advisor - Keywords:
- k-mer space
maximal independent set
MIS
k-mer clustering
k-mer hashing - Abstract:
- In computational biology, k-mer based algorithms are playing a central role. Various state-of-the-art tools involve k-mer based approaches in solving fundamental problems like sequence alignment, sequence assembly, gene finding, etc. Hence, conducting specific research on k-mers can inspire computational biologists to develop new and innovative ideas. In this thesis, the author aims to understand the structure of the k-mer space by modeling the space as graphs with k-mers as vertices and studying maximal independent sets (MIS) of the graphs. Because a MIS is a sparse and small subset of k-mers selected from the dense and large k-mer space, it can have potential application in hashing and clustering problems. This task is challenging since the size of a k-mer space growth geometrically with respect to 𝑘𝑘. For 15-mer space, there is more than 1 billion vertices in the graph. The author proposes three methods for finding maximal independent sets in a k-mer space. The first and most intuitive method involves pairwise comparison between k-mers. The second method implements two heuristics to avoid redundant pairwise comparison: One considers the nearest neighbors of a k-mer, and the other takes advantage of the triangle inequality property of the k-mer space. The last method extends the breadth-first search on a different graph. These three methods are used to find maximal independent sets of small, medium, and large sizes respectively. Experimental results are reported for k-mer spaces with k ranging from 10 to 15. The most memory- and time-consuming experiment is of a medium-sized maximal independent set of the 15-mer space. The peak memory usage is about 2 GB, and the running time is approximately 15 hours. Source code is available at https://github.com/Shao-Group/kmerspace.