A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator
dc.contributor.author | Doherty, T | |
dc.contributor.author | Dempster, E | |
dc.contributor.author | Hannon, E | |
dc.contributor.author | Mill, J | |
dc.contributor.author | Poulton, R | |
dc.contributor.author | Corcoran, D | |
dc.contributor.author | Sugden, K | |
dc.contributor.author | Williams, B | |
dc.contributor.author | Caspi, A | |
dc.contributor.author | Moffitt, TE | |
dc.contributor.author | Delany, SJ | |
dc.contributor.author | Murphy, TM | |
dc.date.accessioned | 2023-07-07T13:04:09Z | |
dc.date.issued | 2023-05-01 | |
dc.date.updated | 2023-07-07T11:43:50Z | |
dc.description.abstract | BACKGROUND: The field of epigenomics holds great promise in understanding and treating disease, with advances in machine learning (ML) and artificial intelligence being vitally important in this pursuit. Increasingly, research now utilises DNA methylation measures at cytosine-guanine dinucleotides (CpG) to detect disease and estimate biological traits such as aging. Given the challenge of high dimensionality of DNA methylation data, feature-selection techniques are commonly employed to reduce dimensionality and identify the most important subset of features. In this study, our aim was to test and compare a range of feature-selection methods and ML algorithms in the development of a novel DNA methylation-based telomere length (TL) estimator. We utilised both nested cross-validation and two independent test sets for the comparisons. RESULTS: We found that principal component analysis in advance of elastic net regression led to the overall best performing estimator when evaluated using a nested cross-validation analysis and two independent test cohorts. This approach achieved a correlation between estimated and actual TL of 0.295 (83.4% CI [0.201, 0.384]) on the EXTEND test data set. In contrast, the baseline model of elastic net regression with no prior feature reduction stage performed less well in general, suggesting that a prior feature-selection stage may have important utility. A previously developed TL estimator, DNAmTL, achieved a correlation of 0.216 (83.4% CI [0.118, 0.310]) on the EXTEND data. Additionally, we observed that different DNA methylation-based TL estimators, which have few common CpGs, are associated with many of the same biological entities. CONCLUSIONS: The variance in performance across tested approaches shows that estimators are sensitive to data set heterogeneity and the development of an optimal DNA methylation-based estimator should benefit from the robust methodological approach used in this study. Moreover, our methodology, which utilises a range of feature-selection approaches and ML algorithms, could be applied to other biological markers and disease phenotypes, to examine their relationship with DNA methylation and predictive value. | en_GB |
dc.description.sponsorship | Science Foundation Ireland | en_GB |
dc.description.sponsorship | Brain and Behaviour Research Foundation (BBF) | en_GB |
dc.description.sponsorship | National Institute for Health and Care Research (NIHR) | en_GB |
dc.description.sponsorship | New Zealand Health Research Council | en_GB |
dc.description.sponsorship | New Zealand Ministry of Business, Innovation and Employment | en_GB |
dc.description.sponsorship | National Institutes of Health, National Institute on Aging | en_GB |
dc.description.sponsorship | Medical Research Council (MRC) | en_GB |
dc.description.sponsorship | Jacobs Foundation | en_GB |
dc.identifier.citation | Vol. 24(1), article 178 | en_GB |
dc.identifier.doi | https://doi.org/10.1186/s12859-023-05282-4 | |
dc.identifier.grantnumber | 18/CRT/6183 | en_GB |
dc.identifier.grantnumber | R01AG032282 | en_GB |
dc.identifier.grantnumber | MR/P005918/1 | en_GB |
dc.identifier.uri | http://hdl.handle.net/10871/133566 | |
dc.identifier | ORCID: 0000-0001-6840-072X (Hannon, Eilis) | |
dc.language.iso | en | en_GB |
dc.publisher | BMC | en_GB |
dc.relation.url | https://www.ncbi.nlm.nih.gov/pubmed/37127563 | en_GB |
dc.relation.url | https://github.com/trevordoherty/DNA-methylation-based-Telomere-Length-estimator | en_GB |
dc.relation.url | https://moffittcaspi.trinity.duke.edu/research-topics/dunedin | en_GB |
dc.rights | © The Author(s) 2023, corrected publication 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data | en_GB |
dc.subject | Aging | en_GB |
dc.subject | DNA Methylation | en_GB |
dc.subject | Feature Reduction | en_GB |
dc.subject | Feature Selection | en_GB |
dc.subject | Machine Learning | en_GB |
dc.subject | Telomere Length | en_GB |
dc.title | A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator | en_GB |
dc.type | Article | en_GB |
dc.date.available | 2023-07-07T13:04:09Z | |
dc.identifier.issn | 1471-2105 | |
exeter.article-number | 178 | |
exeter.place-of-publication | England | |
dc.description | This is the final version. Available on open access from BMC via the DOI in this record | en_GB |
dc.description | Availability of data and materials: Source code and scripts are available in the GitHub repository https://github.com/trevordoherty/DNA-methylation-based-Telomere-Length-estimator. The Dunedin Study datasets reported in the current article are not publicly available due to a lack of informed consent and ethical approval for public data sharing. The Dunedin study datasets are available on request by qualified scientists. Requests require a concept paper describing the purpose of data access, ethical approval at the applicant’s university and provision for secure data access (https://moffittcaspi.trinity.duke.edu/research-topics/dunedin). We offer secure access on the Duke, Otago and King’s College campuses. For the TWIN study, data is freely available in the supplemental files of the previously published article [51]. The EXTEND study data is deposited in the Gene Expression Omnibus (GEO) database (accession number: GSE113725). For further information on data availability, please contact the corresponding author. | en_GB |
dc.identifier.eissn | 1471-2105 | |
dc.identifier.journal | BMC Bioinformatics | en_GB |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_GB |
dcterms.dateAccepted | 2023-04-11 | |
rioxxterms.version | VoR | en_GB |
rioxxterms.licenseref.startdate | 2023-05-01 | |
rioxxterms.type | Journal Article/Review | en_GB |
refterms.dateFCD | 2023-07-07T12:59:39Z | |
refterms.versionFCD | VoR | |
refterms.dateFOA | 2023-07-07T13:04:10Z | |
refterms.panel | A | en_GB |
refterms.dateFirstOnline | 2023-05-01 |
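The abstract reports each correlation together with an 83.4% confidence interval, for example 0.295 (83.4% CI [0.201, 0.384]). This record does not state how those intervals were computed. One common approach for a Pearson correlation is the Fisher z-transformation, sketched below under that assumption. The function name `fisher_ci` and the sample size `n=200` are hypothetical and chosen only for illustration; the size of the EXTEND test set is not given in this record.

```python
import math
from statistics import NormalDist
from typing import Tuple


def fisher_ci(r: float, n: int, level: float = 0.834) -> Tuple[float, float]:
    """Confidence interval for a Pearson correlation via the Fisher z-transformation.

    Assumption: this is a generic textbook method, not necessarily the one
    used by the authors of the article described in this record.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")

    z = math.atanh(r)                    # Fisher transform of the correlation
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal critical value
    # Build the interval on the z scale, then map back to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Hypothetical usage: r is taken from the abstract, n is an illustrative guess.
lower, upper = fisher_ci(0.295, n=200)
print(f"83.4% CI: [{lower:.3f}, {upper:.3f}]")
```

Because the interval is built on the transformed scale and mapped back with `tanh`, it always stays within (-1, 1) and is generally asymmetric around a non-zero correlation, which matches the shape of the intervals quoted in the abstract.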