Machine Learning for Classification and Clustering of Dementia Data

Guest, F

Abstract

Dementia is a term used to describe heterogeneous diseases that can generally be characterised by a decline in cognitive ability that affects daily living. The prevalence of dementia is predicted to increase significantly over the coming years, making it a worldwide priority. This thesis discusses research conducted with two primary aims: to investigate the use of machine learning for distinguishing between people with and without dementia, as well as differentiating between key dementia subtypes where appropriate; and to gain an understanding of the inherent structure of dementia data, to ultimately investigate disease signatures. Data was acquired from the National Alzheimer's Coordinating Center in the United States, and a data set comprising 32,573 observations and 260 features of mixed type was utilised. It included features whose values were constrained by relations with others, as well as two types of missingness: genuinely missing values, which arose when data was unexpectedly not recorded, and conditionally missing values, which arose when the information was irrelevant or unobtainable for a known reason. Genuinely missing values were imputed where possible, whilst conditionally missing values were handled. An imputation approach was developed that builds a random forest classifier while handling conditionally missing values, and that maintains the known relations in the data set so far as possible. A clustering approach was also developed that measures the similarity of observations based on the similarity of their paths through the trees of an isolation forest, and then employs spectral clustering on those similarities. Crucially, it can naturally draw on variables of mixed type.
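
The clustering idea described above can be illustrated with a minimal sketch. It is not the method developed in the thesis: the abstract does not define the path similarity, so this sketch assumes a Jaccard overlap of the nodes two observations visit in each tree, averaged over the forest. The function name `path_similarity`, the synthetic data and all parameter values are hypothetical, and scikit-learn's isolation forest needs numeric input, so mixed-type features are assumed to be numerically encoded beforehand.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import IsolationForest


def path_similarity(X, n_trees=50, seed=0):
    """Average, over the trees of an isolation forest, the Jaccard overlap
    of the root-to-leaf paths of every pair of observations.

    Assumption: Jaccard overlap stands in for the path similarity, which
    the abstract does not specify.
    """
    forest = IsolationForest(n_estimators=n_trees, random_state=seed).fit(X)
    n = X.shape[0]
    sim = np.zeros((n, n))
    for tree, feats in zip(forest.estimators_, forest.estimators_features_):
        # Binary (n_samples x n_nodes) matrix: 1 where a sample visits a node.
        paths = tree.decision_path(X[:, feats]).astype(np.int32)
        shared = (paths @ paths.T).toarray()          # nodes visited by both
        lengths = np.asarray(paths.sum(axis=1)).ravel()
        union = lengths[:, None] + lengths[None, :] - shared
        sim += shared / union                          # per-tree Jaccard
    return sim / n_trees


# Synthetic stand-in data: two well-separated groups of 60 observations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 4)), rng.normal(6, 1, (60, 4))])

S = path_similarity(X)
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(S)
```

Every pair of observations shares at least the root node of each tree, so the similarity matrix is strictly positive and can be passed to spectral clustering as a precomputed affinity.
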
A dementia classifier with an area under the receiver operating characteristic curve (AUC) of 0.99 and 10 pairwise dementia subtype classifiers with AUCs ranging from 0.88 to 1.0 (rounded) were produced, suggesting that machine learning could be a useful tool for diagnosing dementia and differentiating between the main subtypes. Key features were identified using these classifiers and were markedly different for the two types of diagnosis. Furthermore, preliminary experiments conducted using the clustering approach suggested that mild cognitive impairment may be a mild form of dementia as opposed to a clinical entity, a question over which there is much debate, and that there could be evidence for the current subtypes. Ultimately, these findings have the potential to transform the way dementia is diagnosed.
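
As an illustration of how pairwise classifiers can be scored by AUC, the following is a minimal sketch and not the thesis pipeline. It uses synthetic data with five hypothetical classes, chosen only because five classes give ten pairs, plain random forests and a single train/test split. The function name `pairwise_aucs` and all parameter values are assumptions, and the resulting AUCs say nothing about the dementia data.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def pairwise_aucs(X, y, seed=0):
    """Train one binary random forest per pair of classes and return the
    held-out AUC of each, keyed by the (class_a, class_b) pair."""
    aucs = {}
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        # Binary target: True for class b, False for class a.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask] == b,
            test_size=0.3, stratify=y[mask], random_state=seed,
        )
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X_tr, y_tr)
        # Column 1 of predict_proba is the probability of the True class.
        scores = clf.predict_proba(X_te)[:, 1]
        aucs[(int(a), int(b))] = roc_auc_score(y_te, scores)
    return aucs


# Synthetic stand-in data with five classes (hypothetical, for illustration).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
demo_aucs = pairwise_aucs(X_demo, y_demo)
```

Each entry of `demo_aucs` is the AUC of one pairwise classifier on its held-out split; a cross-validated estimate would be a more robust choice than the single split used here.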

Doctoral Theses

Doctoral College