Diversity and generalisation error in classification ensembles
Ivascu, C
Date: 22 April 2024
Thesis or dissertation
Publisher
University of Exeter
Degree Title
PhD in Computer Science
Abstract
Ensembles are important tools in machine learning because they are often more accurate
than single predictors. Although it has been shown that an accurate ensemble would
benefit from having both accurate and diverse predictors, some studies in the literature
could not support the influence that diversity has on the overall accuracy of an ensemble.
In this thesis we investigate the influence that diversity has on improving accuracy
or, equivalently, on reducing the generalisation error.
Many diversity measures have been introduced in the literature; however, as outlined
in [1], the only one that had a strong negative correlation with generalisation error was
a diversity measure called ambiguity. The ambiguity measure was obtained from the
bias-variance decomposition of classifiers under the 0-1 loss. As a result, our first
set of experiments focuses on this type of diversity measure. We analyse the effect that
the ambiguity measure has on decreasing the generalisation error of forests created by
bootstrapping. We compare the effect of the ambiguity by having bootstrapping with or
without replacement, by varying the number of trees, by varying the patterns or features
used in building each tree. Our results show that bootstrapping without replacement
yields lower test errors. A similar effect has been seen on bigger ensembles or by providing
more data to the classifiers. We propose pruning approaches that involve ambiguity and
compare their effect on the generalisation error versus a pruning method that promotes
randomness. Our results show that there is no significant difference between the two types
of approaches.
Next, we define two new ambiguity measures derived from the cross entropy and hinge
loss. We analyse their properties and find that, of the three ambiguity measures defined
for classifiers (including the one derived from the 0-1 loss introduced earlier), the only
one that achieves all the desired properties of a diversity measure is the one obtained
from the cross entropy: it is always non-negative, and zero if and only if all the classifiers
agree. We build ensembles using bagging with varying sampling rates and find a negative
correlation between generalisation error and diversity at high sampling rates; conversely,
generalisation error is positively correlated with diversity when the sampling rate is low
and the diversity is high. We use an evolutionary algorithm to maximise ambiguity
and we find that the evolved ensemble in general has lower generalisation error than the
initial ensemble. We define the term “ambiguous ensembles” as ensembles with high values
of ambiguity. Additionally, we investigate the effect of pruning on larger ensembles and
propose several pruning methods that prioritise ambiguity, as well as others that promote
less ambiguous ensembles. Our results show that the approaches that prefer ambiguous
ensembles reduce the generalisation error. Hence, our overall results support the influence
that diversity has on minimising generalisation error.
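The exact cross-entropy ambiguity used in the thesis is defined in the text; as a hedged sketch of the general form, the classic ambiguity decomposition sets ambiguity equal to the average member loss minus the loss of the combined ensemble. With cross-entropy loss and arithmetic averaging of member probabilities, Jensen's inequality (concavity of the logarithm) makes this quantity non-negative, and it is zero exactly when the members agree, matching the properties stated above.

```python
# Hedged sketch (the thesis's exact definition may differ): ambiguity as
# mean member cross-entropy minus the cross-entropy of the averaged
# ensemble prediction. Non-negative by Jensen's inequality; zero iff the
# members agree on the true-class probabilities.
import numpy as np

def cross_entropy(p, y):
    """Mean cross-entropy of predicted probabilities p (n, classes) vs labels y."""
    eps = 1e-12
    return -np.mean(np.log(p[np.arange(len(y)), y] + eps))

def ambiguity(member_probs, y):
    """member_probs: array of shape (n_members, n_samples, n_classes)."""
    mean_member_loss = np.mean([cross_entropy(p, y) for p in member_probs])
    ensemble_loss = cross_entropy(member_probs.mean(axis=0), y)
    return mean_member_loss - ensemble_loss  # >= 0

# Two members that disagree on the first example:
y = np.array([0, 1])
m1 = np.array([[0.9, 0.1], [0.2, 0.8]])
m2 = np.array([[0.6, 0.4], [0.2, 0.8]])
print(ambiguity(np.stack([m1, m2]), y))  # positive; zero if m1 == m2
```

An evolutionary algorithm as described above would treat a quantity like this as (part of) its fitness function when evolving the ensemble.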
Finally, we define diverse forests by building trees with different impurities. We choose
families of impurities characterised by different parameters and analyse the effect that
the choice of parameters has on generalisation performance. By tuning the parameters
we can define symmetric or asymmetric impurities. For imbalanced datasets, the use of
asymmetric impurities has been shown to be beneficial in predicting the minority class,
which is usually of primary interest. We contrast the behaviour of forests built with
symmetric or asymmetric impurities against forests whose trees are built with different
impurities (different parameters). Our results do not show a significant difference
in performance.
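To make the idea of a parameterised impurity family concrete, here is a purely hypothetical example (the families studied in the thesis are not specified on this page): a Gini-like binary impurity whose parameter `w` skews the impurity curve, so that choosing `w != 0.5` yields an asymmetric impurity whose peak moves away from p = 0.5, which can favour splits isolating the minority class.

```python
# Hypothetical parameterised impurity family (illustrative only; not the
# family used in the thesis). p is the proportion of the positive class.
def asymmetric_gini(p, w=0.5):
    """Binary impurity: zero at p = 0 and p = 1; w = 0.5 recovers a
    symmetric, Gini-like curve, w != 0.5 gives an asymmetric one."""
    return p * (1 - p) / (w * p + (1 - w) * (1 - p))

# Symmetric setting: impurity is the same at p and 1 - p.
print(asymmetric_gini(0.3, w=0.5), asymmetric_gini(0.7, w=0.5))
# Asymmetric setting: the two values differ.
print(asymmetric_gini(0.3, w=0.7), asymmetric_gini(0.7, w=0.7))
```

A forest of "different impurities" in the sense above would then assign a different `w` (or a different family member) to each tree.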