Skip to main content
Login | Suomeksi | På svenska | In English

Browsing by Author "Ma, Jun"

Sort by: Order: Results:

  • Ma, Jun (2021)
    Sequence alignment by exact or approximate string matching is one of the fundamental problems in bioinformatics. As the volume of sequenced genomes grows rapidly, pairwise sequence alignment becomes inefficient for pan-genomic analyses involving multiple sequences. The graph representation of multiple genomes has been an increasingly useful tool in pan-genomics research. Therefore, sequence-to-graph alignment becomes an important and challenging problem. For pairwise approximate sequence alignment under Levenshtein (edit) distance, subquadratic algorithms for finding an optimal solution are unknown. As a result, aligning sequences of millions of characters optimally is too challenging and impractical. Thus, many heuristics and techniques are developed for possibly suboptimal alignments. Among them, co-linear chaining (CLC) is a powerful and popular technique that approximates the alignment by finding a chain of short aligned fragments that may come from exact matching. The optimal solution to CLC on sequences can be found efficiently in subquadratic time. For sequence-to-graph alignment, the CLC problem has been solved theoretically on a special class of graphs that are narrow and have no cycles, i.e. directed acyclic graphs (DAGs) with small width, by Mäkinen et al. (ACM Transactions on Algorithms, 2019). Pan-genome graphs such as variation graphs satisfy these restrictions but allowing cycles may enable more general applications of the algorithm. In this thesis, we introduce an efficient algorithm to solve the CLC problem on general graphs with small width that may have cycles, by reducing it to a slightly modified CLC problem on DAGs. We implemented an initial version of the new algorithm on DAGs as a sequence-to-graph aligner GraphChainer. The aligner is evaluated and compared to an existing state-of-the-art aligner GraphAligner (Genome Biology, 2020) in experiments using both simulated and real genome assembly data on variation graphs. Our method improves the quality of alignments significantly in the task of aligning real human PacBio data. GraphChainer is freely available as an open source tool at https://github.com/algbio/GraphChainer.