dc.description.abstract | Determining protein sequence similarity is an important task for protein classification
and homology detection, which is typically performed using sequence
alignment algorithms. Fast and accurate alignment-free, kernel-based classifiers
exist that treat protein sequences as a “bag of words”. Kernels implicitly map
the sequences to a high-dimensional feature space, and can be thought of as an
inner product between two vectors in that space. This allows an algorithm that
can be expressed purely in terms of inner products to be ‘kernelised’, where the
algorithm implicitly operates in the kernel’s feature space.
A weighted string kernel, where the weighting is derived using probabilistic methods,
is implemented using a binary data representation, and the results are reported.
Alternative forms of data representation, such as Ising and frequency forms, are
implemented and the results discussed. These results are then used to inform the
development of a variety of novel kernels for protein sequence comparison.
Alternative forms of classifier are investigated, such as nearest neighbour, support
vector machines, and multiple kernel learning. A kernelised Gaussian classifier
is derived and tested, which is informative as it returns a score related to the
probability of a sequence belonging to a particular class. Support vector
machines are tested with the introduced kernels, and the results are compared with alternative
classifiers. As similarity can be thought of as having different components,
such as composition and position, multiple kernel learning is investigated with the
novel kernels developed here.
The results show that a support vector machine, using either single or multiple
kernels, is the best classifier for remote protein homology detection out of all the
classifiers tested in this thesis. | en_GB |