Evaluation of Different Machine Learning Approaches to Predict Antigenic Distance Among Newcastle Disease Virus (NDV) Strains

Franzo, Giovanni; Fusaro, Alice; Snoeck, Chantal J.; Dodovski, Aleksandar; Van Borm, Steven; Steensels, Mieke; Christodoulou, Vasiliki; Onita, Iuliana; Burlacu, Raluca; Sánchez, Azucena Sánchez; Chvala, Ilya A.; Torchetti, Mia Kim; Shittu, Ismaila; Olabode, Mayowa; Pastori, Ambra; Schivo, Alessia; Salomoni, Angela; Maniero, Silvia; Zambon, Ilaria; Bonfante, Francesco; Monne, Isabella; Cecchinato, Mattia; Bortolami, Alessio

doi:10.3390/v17040567

Newcastle disease virus (NDV) continues to present a significant challenge for vaccination due to its rapid evolution and the emergence of new variants. Although molecular and sequence data can now be produced quickly and inexpensively, genetic distance rarely serves as a good proxy for cross-protection, while experimental studies to assess antigenic differences are time-consuming and resource-intensive. In response to these challenges, this study explores and compares several machine learning (ML) methods to predict the antigenic distance between NDV strains as determined by hemagglutination-inhibition (HI) assays. Using F and HN gene sequences together with the corresponding amino acid features, we developed predictive models to estimate these antigenic distances. Among the models evaluated, the random forest (RF) approach outperformed traditional linear models, achieving an R² of 0.723 compared with only 0.051 for linear models based on genetic distance alone. This substantial improvement demonstrates the usefulness of flexible ML approaches as a rapid and reliable tool for vaccine selection, minimizing the need for labor-intensive experimental trials. Moreover, the flexibility of this ML framework holds promise for application to other infectious diseases in both animals and humans, particularly in scenarios where the need for a rapid response or ethical constraints limit conventional experimental approaches.
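To make the two quantities compared above concrete, the following is a minimal sketch of a pairwise genetic distance and of the coefficient of determination used to score a model. It is illustrative only: the function names `p_distance` and `r_squared` are hypothetical, the uncorrected p-distance is an assumed stand-in for whatever genetic distance measure the study used, and the sequences are toy values rather than NDV data.

```python
from typing import Sequence


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption: an uncorrected p-distance; the study may use another measure.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Toy aligned fragments (hypothetical, not real F or HN sequences).
genetic_distance = p_distance("ATGGCA", "ATGACA")
```

An R² near 1 means the predicted antigenic distances track the HI-derived ones closely, while a value near 0 means the model does little better than always predicting the mean observed distance.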