USTAR-CR: Efficient and Compact Compression of k-Mer Sets Through Colored de Bruijn Graphs
Rossignolo, Enrico; Comin, Matteo
2026
Abstract
A core task in computational genomics is transforming input sequences into their constituent k-mers. Efficiently storing these k-mer collections is crucial for scaling bioinformatics workflows. A common strategy involves representing the k-mers as a de Bruijn graph (dBG) and deriving a compact plain text form through a minimum path cover. In this article, we introduce USTAR-CR (Unitig STitch Advanced constRuction with Colors Reordering), a fast and space-efficient algorithm for compressing multiple k-mer sets. USTAR-CR exploits the structural properties of colored dBGs to construct a succinct plain text representation while also incorporating an effective scheme for encoding k-mer color information. We evaluate USTAR-CR on real sequencing datasets and benchmark it against the state-of-the-art tool GGCAT. USTAR-CR achieves superior compression ratios, significantly reduces memory usage, and offers substantial speed improvements (up to 64x faster), highlighting its effectiveness for large-scale genomic data processing.
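
To make the terms in the abstract concrete, the following minimal sketch shows what "multiple k-mer sets" with "color information" can look like as data: each k-mer is mapped to the set of inputs (its colors) in which it occurs. This is only an illustration of the input that a tool of this kind compresses, not the USTAR-CR algorithm itself. The function names `canonical` and `colored_kmers`, the sample names, and the use of canonical k-mers (a k-mer and its reverse complement treated as one) are assumptions made for the example and are not taken from the article.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Using canonical k-mers is an assumption of this sketch.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def colored_kmers(samples: dict[str, list[str]], k: int) -> dict[str, frozenset[str]]:
    """Map each canonical k-mer to the set of sample names (colors) containing it."""
    colors = defaultdict(set)
    for name, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Skip windows containing ambiguous bases such as N.
                if set(kmer) <= set("ACGT"):
                    colors[canonical(kmer)].add(name)
    return {kmer: frozenset(names) for kmer, names in colors.items()}


# Hypothetical toy input: two samples with one short sequence each.
samples = {"s1": ["ACGTT"], "s2": ["CGTTA"]}
table = colored_kmers(samples, k=3)
for kmer in sorted(table):
    print(kmer, sorted(table[kmer]))
```

In this toy input most k-mers carry the same color set, which hints at why storing the sequences and the colors jointly can be more compact than storing each k-mer set separately.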




