Learning from Data Streams with Randomized Forests
Seah, T
Date: 7 February 2022
Publisher
University of Exeter
Degree Title
PhD in Computer Science
Abstract
Non-stationary streaming data poses a familiar challenge in machine learning: the need to
obtain fast and accurate predictions. A data stream is a continuously generated sequence of
data, typically arriving rapidly. Such streams are often characterised by a non-stationary
generative process, with concept drift occurring as the process changes. These processes are
commonly seen in the real world, for example in advertising, shopping trends, environmental
conditions, electricity monitoring and traffic monitoring.
Algorithms designed for stationary data are ill-suited to concept-drifting data, necessitating
more targeted methods. Tree-based methods are a popular approach to this problem,
traditionally focussing on the use of the Hoeffding bound to guarantee performance
relative to a stationary scenario. However, few single learners are available for
regression scenarios, and those that do exist often struggle to choose between similarly
discriminative splits, leading to longer training times and worse performance. This limited
pool of single learners in turn hampers the performance of ensemble approaches in which
they act as base learners.
In this thesis we seek to remedy this gap in the literature, developing methods which
focus on increasing randomization to both improve predictive performance and reduce the
training times of tree-based ensemble methods. In particular, we have chosen to investigate
the use of randomization as it is known to be able to improve generalization error in
ensembles, and is also expected to lead to fast training times, thus being a natural method
of handling the problems typically experienced by single learners.
We begin in a regression scenario, introducing the Adaptive Trees for Streaming with
Extreme Randomization (ATSER) algorithm, a partially randomized approach based on
the concept of Extremely Randomized (extra) trees. The ATSER algorithm incrementally
trains trees, using the Hoeffding bound to select the best of a random selection of splits.
Simultaneously, the trees also detect and adapt to changes in the data stream. Unlike many
traditional streaming algorithms, ATSER trees can easily be extended to include nominal
features. We find that, compared to other contemporary methods, ensembles of ATSER
trees lead to improved predictive performance whilst also reducing run times.
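The split test described above can be illustrated with a short sketch. This is not the thesis implementation; the function names (`hoeffding_bound`, `try_split`) and the scoring details are hypothetical. The idea is that the Hoeffding bound yields an epsilon such that, once the observed gap between the best and second-best randomly drawn split candidates exceeds epsilon, the best candidate can be committed to with high confidence.

```python
import math

def hoeffding_bound(value_range, confidence, n):
    """Epsilon such that, with probability 1 - confidence, the true mean of a
    variable with the given range lies within epsilon of the sample mean
    after n observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / confidence) / (2.0 * n))

def try_split(candidate_scores, value_range, confidence, n):
    """Commit to a split only once the best randomly drawn candidate is
    distinguishable from the runner-up under the Hoeffding bound.
    (Hypothetical sketch; scores are assumed to be sample-mean merit values.)"""
    ranked = sorted(candidate_scores, reverse=True)
    epsilon = hoeffding_bound(value_range, confidence, n)
    best, second = ranked[0], ranked[1]
    if best - second > epsilon:
        return best   # confident the best candidate really is best
    return None       # not enough evidence yet; keep accumulating examples
```

Because epsilon shrinks as more examples arrive, a node that cannot yet separate two similar candidates simply waits, rather than committing to an arbitrary choice.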
We then demonstrate the Adaptive Categorisation Trees for Streaming with Extreme
Randomization (ACTSER) algorithm, an adaptation of the ATSER algorithm to the more
traditional categorisation scenario, again showing improved predictive performance and
reduced run times. The inclusion of nominal features is particularly novel in this setting,
since typical categorisation approaches struggle to handle them.
Finally, we examine a completely randomized scenario, in which an ensemble of trees is
generated prior to having access to the data stream, while also considering multivariate splits
in addition to the traditional axis-aligned approach. We find that, by combining
a forgetting mechanism in linear models with dynamic weighting of ensemble members,
we are able to avoid explicitly testing for concept drift. This leads to fast ensembles
with strong predictive performance, whilst also requiring fewer parameters than other
contemporary methods.
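The combination described above can be sketched in minimal form. This is an illustration only, not the thesis method: the forgetting factor here is recursive least squares with exponential discounting, and the ensemble weights members by inverse discounted squared error; the class names and parameter values are hypothetical.

```python
import numpy as np

class ForgettingLinearModel:
    """Online linear least squares with an exponential forgetting factor,
    so older examples gradually lose influence (recursive least squares)."""
    def __init__(self, dim, forgetting=0.99, delta=1.0):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) / delta   # inverse-covariance estimate
        self.lam = forgetting

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)           # gain vector
        self.w = self.w + k * (y - self.w @ x)
        self.P = (self.P - np.outer(k, Px)) / self.lam

class DynamicallyWeightedEnsemble:
    """Weights members by exponentially discounted squared error, so members
    fitting the current concept dominate without an explicit drift test."""
    def __init__(self, members, decay=0.95):
        self.members = members
        self.losses = np.ones(len(members))    # discounted loss per member
        self.decay = decay

    def predict(self, x):
        weights = 1.0 / (self.losses + 1e-12)
        weights /= weights.sum()
        preds = np.array([m.predict(x) for m in self.members])
        return float(weights @ preds)

    def update(self, x, y):
        for i, m in enumerate(self.members):
            err = (m.predict(x) - y) ** 2
            self.losses[i] = self.decay * self.losses[i] + (1 - self.decay) * err
            m.update(x, y)
```

After a concept change, both mechanisms work in the same direction: the forgetting factor lets each linear model drift toward the new concept, while the discounted losses shift ensemble weight toward whichever members adapt fastest.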
For each of the methods proposed in this thesis, we demonstrate empirically that it is
effective over a variety of non-stationary data streams, including multiple types of
concept drift. Furthermore, in comparison to other contemporary data streaming
algorithms, we find the biggest improvements in performance are on noisy data streams.