A core task in computational genomics is transforming input sequences into their constituent k-mers. Efficiently storing these k-mer collections is crucial for scaling bioinformatics workflows. A common strategy involves representing the k-mers as a de Bruijn graph (dBG) and deriving a compact plain text form through a minimum path cover. In this article, we introduce USTAR-CR (Unitig STitch Advanced constRuction with Colors Reordering), a fast and space-efficient algorithm for compressing multiple k-mer sets. USTAR-CR exploits the structural properties of colored dBGs to construct a succinct plain text representation while also incorporating an effective scheme for encoding k-mer color information. We evaluate USTAR-CR on real sequencing datasets and benchmark it against the state-of-the-art tool GGCAT. USTAR-CR achieves superior compression ratios, significantly reduces memory usage, and offers substantial speed improvements-up to 64x faster-highlighting its effectiveness for large-scale genomic data processing.

USTAR-CR: Efficient and Compact Compression of k-Mer Sets Through Colored de Bruijn Graphs

Rossignolo, Enrico;Comin, Matteo
2026

Abstract

A core task in computational genomics is transforming input sequences into their constituent k-mers. Efficiently storing these k-mer collections is crucial for scaling bioinformatics workflows. A common strategy involves representing the k-mers as a de Bruijn graph (dBG) and deriving a compact plain text form through a minimum path cover. In this article, we introduce USTAR-CR (Unitig STitch Advanced constRuction with Colors Reordering), a fast and space-efficient algorithm for compressing multiple k-mer sets. USTAR-CR exploits the structural properties of colored dBGs to construct a succinct plain text representation while also incorporating an effective scheme for encoding k-mer color information. We evaluate USTAR-CR on real sequencing datasets and benchmark it against the state-of-the-art tool GGCAT. USTAR-CR achieves superior compression ratios, significantly reduces memory usage, and offers substantial speed improvements-up to 64x faster-highlighting its effectiveness for large-scale genomic data processing.
2026
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3582101
Citazioni
  • ???jsp.display-item.citation.pmc??? 0
  • Scopus ND
  • ???jsp.display-item.citation.isi??? 0
  • OpenAlex 0
social impact