Comparison of Statistical Methods for Brain Age Prediction Using Neuroimaging Data

Pinamonti, M.; Sammassimo, V.; Moretto, M.; Veronese, M.

doi:10.1109/EMBC58623.2025.11254883

The growing global aging population underscores the need for reliable biomarkers of brain aging to inform early interventions for age-related diseases. Brain age estimation, which relies on machine learning models trained on neuroimaging data, has emerged as a promising biomarker for assessing brain health. This study evaluates the performance of multiple machine learning models, both kernel-based (i.e., Support Vector Machines, Relevance Vector Machines, Gaussian Process Regression) and ensemble-based (i.e., Random Forest, Extreme Gradient Boosting), for brain age prediction using anatomical features derived from T1-weighted Magnetic Resonance Imaging (MRI) scans. A total of 25 models (including ensemble and kernel-based models, with linear and non-linear kernels) were trained on the Cam-CAN dataset (n=627) using a robust cross-validation scheme and evaluated on the HCP-Aging dataset (n=607) to assess generalization. Results indicate that non-linear models outperformed linear models; in particular, the Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel achieved a mean absolute error (MAE) of 5.89 years and an explained variance on unseen data (Prediction R²) of 0.84. Validation on the external HCP-Aging dataset revealed that Extreme Gradient Boosting (XGB) performed best on non-harmonized data, achieving an MAE of 7.45 years and a Prediction R² of 0.64. However, after the data were harmonized across sites with the ComBat pipeline, the SVM with an RBF kernel achieved the highest accuracy, with an MAE of 7.05 years and a Prediction R² of 0.63. These findings highlight the robustness of XGB to inter-dataset variability and the critical role of data harmonization for kernel-based models such as SVM. This study demonstrates the effectiveness of combining non-linear models and data harmonization techniques to improve the accuracy and generalizability of brain age prediction tools, enabling more reliable assessments of neurological health across heterogeneous datasets.
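The two metrics reported above can be illustrated with a minimal sketch. It assumes that Prediction R² is the usual coefficient of determination, 1 - SS_res / SS_tot, computed on held-out subjects; the abstract does not spell out the formula, so this definition is an assumption. The function names and the example ages are hypothetical and are not taken from the study.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float],
                        predicted_ages: Sequence[float]) -> float:
    """Average absolute gap, in years, between chronological and predicted age."""
    if len(true_ages) == 0 or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def prediction_r2(true_ages: Sequence[float],
                  predicted_ages: Sequence[float]) -> float:
    """Assumed definition: 1 - SS_res / SS_tot on unseen (held-out) subjects."""
    if len(true_ages) == 0 or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_age = sum(true_ages) / len(true_ages)
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((t - p) ** 2 for t, p in zip(true_ages, predicted_ages))
    # Total sum of squares: spread of the true ages around their mean.
    ss_tot = sum((t - mean_age) ** 2 for t in true_ages)
    if ss_tot == 0:
        raise ValueError("true ages have zero variance; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical ages in years, invented purely for illustration.
true_ages = [20.0, 40.0, 60.0, 80.0]
predicted_ages = [25.0, 38.0, 55.0, 82.0]

mae = mean_absolute_error(true_ages, predicted_ages)
r2 = prediction_r2(true_ages, predicted_ages)
print(f"MAE = {mae:.2f} years, Prediction R2 = {r2:.3f}")
```

Under this definition, a model that always predicts the mean age of the test set scores 0, and a model can score below 0 on an external dataset if its errors exceed the spread of the true ages. Because the two metrics capture different aspects of fit, model rankings by MAE and by Prediction R² need not coincide.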