MOV&RSim: computational modelling of cancer-specific variants and sequencing reads characteristics for realistic tumoral sample simulation
Baruzzo, Giacomo; Hazizaj, Enidia; Di Camillo, Barbara
2025
Abstract
Background: Bioinformatics pipelines for variant calling have undergone significant advancements due to the decreasing costs of next-generation sequencing. Accurate mutation detection is crucial for personalised medicine in cancer, particularly for therapy assignment. Somatic variant calling, however, remains challenging due to diverse cancer types, heterogeneity, complex mutational profiles, and unpredictable sequencing errors. A dataset of fully characterised tumoral genomes and sequencing reads, large enough to represent the variability inherent in different cancer types, is still lacking, even considering synthetic data. The lack of such datasets hampers rigorous evaluation, benchmarking and optimization of variant callers for specific cancer types.
Results: The contribution of this work is twofold. First, we conducted a comprehensive analysis of nine somatic sample simulators (Synggen, BAMSurgeon, SVEngine, VarSim, Xome-Blender, tHapMix, Pysim-sv, SCNVSim, HeteroGenesis), assessing their ability to control biological parameters, including variant characteristics (type, number, position, length, content, zygosity) and sample characteristics (clonality, contamination), and technical parameters, including read characteristics (sequencing errors, coverage, base qualities). No single simulator provided complete control over both biological and technical parameters, nor guidance on tuning biological parameters for cancer-specific simulations. Consequently, we developed MOV&RSim, a novel simulator that leverages data-driven information to set variant and read characteristics, producing realistic tumoral samples and providing full control over biological and technical parameters. Additionally, we leveraged well-annotated variant databases to create cancer-specific presets that inform the simulator's parameters for 21 cancer types.
Conclusion: This new simulator, containerised with Docker and freely available for academic use, empowers users to define each biological parameter of a tumoral genome and faithfully replicates the variability of technical noise observed in real sequencing reads. The proposed simulator and presets represent the most adaptable and comprehensive framework currently available for generating tumor samples, enabling thorough benchmarking and, ultimately, the optimization of somatic variant callers across diverse cancer types.