Item Details

FastGSK: Fast and Efficient Sequence Analysis Using Gapped String Kernels

Blakely, Derrick
Format
Thesis/Dissertation; Online
Author
Blakely, Derrick
Advisor
Blakely, Derrick
Abstract
String kernel methods achieve strong classification performance on DNA, protein, and natural language data using modestly-sized training sets. However, existing kernel function algorithms suffer from slow kernel computation times, as they depend exponentially on the sub-sequence feature length and the task's alphabet size. In this work, we introduce a new string kernel algorithm using gapped k-mers called FastGSK. Compared to previous string kernels, it uses a simplified kernel algorithm that decomposes into a set of independent counting operations over the possible mismatch positions. This algorithm is naturally parallelizable, is performant on any sequence analysis task, and allows us to devise a fast Monte Carlo approximation method to scale to much greater feature lengths. We evaluate FastGSK on 10 DNA transcription factor binding site (TFBS) prediction tasks, 10 protein remote homology detection tasks, and 7 English-language medical named entity recognition tasks. FastGSK consistently matches or outperforms the state-of-the-art string kernel algorithms in AUC, while achieving average speedups in kernel computation of ~100x and speedups of ~800x for large feature lengths. We further show that FastGSK consistently outperforms character-level recurrent neural networks across these sequence analysis tasks. Our algorithm is available as a Python package and as C++ source code.
Language
English
Published
University of Virginia, Computer Science - School of Engineering and Applied Science, MS (Master of Science), 2019
Published Date
2019-12-08
Degree
MS (Master of Science)
Collection
Libra ETD Repository
Related Resources
https://github.com/QData/FastSK
Logo for Creative Commons Attribution LicenseCreative Commons Attribution License

Availability

Read Online