quickSparseM: a library for memory- and time-efficient computation on large, sparse matrices with application to omics data
Baruzzo, Giacomo; Cesaro, Giulia; Camillo, Barbara Di
2025
Abstract
Omics data have revolutionized molecular biology by introducing large-scale data analysis, pushing the field into the realm of big data and presenting substantial challenges in data storage and analysis. Although they describe distinct aspects of molecular biology, most omics data share common characteristics: they can be represented as large, sparse matrices, and they require similar computational approaches, mainly embarrassingly parallel tasks across rows or columns. While R is a popular choice for omics analysis, it encounters performance bottlenecks when handling large datasets due to its reliance on dense data formats and constraints like 32-bit indexing in some structures. Even when sparse representations are used, the inherent limitations of R lead to inefficiencies. Additionally, its lack of native support for shared-memory parallelism prevents it from fully utilizing modern parallel computing architectures. Many other data-intensive fields that rely on R face the same challenges, with large, sparse data requiring fast and memory-efficient row-wise and column-wise operations. To address these challenges, we introduce quickSparseM, a time- and memory-efficient library for storing and processing large, sparse matrices, available as an R package. Developed in C++ with OpenMP for parallelism, quickSparseM achieves efficient performance while remaining compatible with existing R-based workflows. The library uses the R dgCMatrix format to represent sparse matrices in a compressed, column-oriented format and provides functions to compute basic statistics and operations commonly used in omics analyses. Experiments varying dataset sizes and core counts, as well as two case studies using omics data, demonstrate the library’s efficiency and scalability. The results indicate that quickSparseM outperforms state-of-the-art R packages for sparse matrix computation in terms of time, memory usage, and scalability.
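To make the compressed, column-oriented layout concrete, the following is a minimal C++ sketch of a column-wise statistic computed over that layout with an OpenMP loop. It is an illustration under stated assumptions, not quickSparseM's code: the `CscMatrix` struct, `column_means`, and `example_matrix` are hypothetical names introduced here. The only thing taken from the dgCMatrix format is its three-array layout (column pointers `p`, 0-based row indices `i`, stored values `x`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container mirroring the three slots of an R dgCMatrix
// (p, i, x) plus its dimensions. Not quickSparseM's actual types.
struct CscMatrix {
    int nrow;
    int ncol;
    std::vector<int> p;     // column pointers, length ncol + 1
    std::vector<int> i;     // 0-based row index of each stored value
    std::vector<double> x;  // stored non-zero values, column by column
};

// Mean of every column, counting the implicit zeros in the denominator.
std::vector<double> column_means(const CscMatrix& m) {
    assert(m.p.size() == static_cast<std::size_t>(m.ncol) + 1);
    std::vector<double> means(m.ncol, 0.0);
    // Columns are independent and each iteration writes only means[j],
    // so the loop is embarrassingly parallel. Without OpenMP enabled at
    // compile time the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < m.ncol; ++j) {
        double sum = 0.0;
        // Stored values of column j occupy x[p[j]] .. x[p[j + 1] - 1].
        for (int k = m.p[j]; k < m.p[j + 1]; ++k) {
            sum += m.x[k];
        }
        means[j] = m.nrow > 0 ? sum / m.nrow : 0.0;
    }
    return means;
}

// Small 3 x 3 example with an empty middle column:
//   [1 0 2]
//   [0 0 3]
//   [4 0 0]
CscMatrix example_matrix() {
    return CscMatrix{3, 3, {0, 2, 2, 4}, {0, 2, 0, 1}, {1.0, 4.0, 2.0, 3.0}};
}
```

Calling `column_means(example_matrix())` visits only the four stored values rather than all nine cells, which is where the memory and time savings of a sparse layout come from. Row-wise statistics over the same layout are less direct, because the entries of a row are scattered across columns; a parallel version would need per-thread accumulators or a reduction.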