Kernels for Protein Homology Detection
Spalding, John Dylan
Thesis or dissertation
University of Exeter
Determining protein sequence similarity is an important task for protein classification and homology detection, which is typically performed using sequence alignment algorithms. Fast and accurate alignment-free kernel based classifiers exist, that treat protein sequences as a “bag of words”. Kernels implicitly map the sequences to a high dimensional feature space, and can be thought of as an inner product between two vectors in that space. This allows an algorithm that can be expressed purely in terms of inner products to be ‘kernelised’, where the algorithm implicitly operates in the kernel’s feature space. A weighted string kernel, where the weighting is derived using probabilistic methods, is implemented using a binary data representation, and the results reported. Alternative forms of data representation, such as Ising and frequency forms, are implemented and the results discussed. These results are then used to inform the development of a variety of novel kernels for protein sequence comparison. Alternative forms of classifier are investigated, such as nearest neighbour, support vector machines, and multiple kernel learning. A kernelized Gaussian classifier is derived and tested, which is informative as it returns a score related to the probability of a sequence belonging to a particular classification. Support vector machines are tested with the introduced kernels, and the results compared to alternate classifiers. As similarity can be thought of as having different components, such as composition and position, multiple kernel learning is investigated with the novel kernels developed here. The results show that a support vector machine, using either single or multiple kernels, is the best classifier for remote protein homology detection out of all the classifiers tested in this thesis.
PhD in Computer Science