Removing newlines in FASTA file increases ZSTD compression ratio by 10x

Zstandard's Long Range Mode and Genome Sequence Compression

Date: September 12, 2025

Zstandard (zstd) introduced a long range match finder (--long mode) in version 1.3.2 (2017) that increases the compressor’s search window to at least 128 MiB, improving deduplication for large files. Initially, --long mode caused substantial performance overhead but has since been optimized to approach the speed of zstd’s fast defaults.

---

Application to Genome Sequence Compression

Genome sequences are challenging to compress efficiently due to their size and structure. Typical general-purpose compressors like zstd offer high speed but lower compression ratios (CRs), whereas specialized DNA compressors achieve higher CRs but at slower speeds.

Benchmark Dataset

- 661k Dataset: Grace Blackwell’s 2.6 Tbp microbial genome collection includes 661,405 bacterial genome assemblies in FASTA format.
- Size: 2.46 TiB uncompressed
- Specialized method: Karel Břinda’s MiniPhy reduces this to 27 GiB (CR of 91) by clustering related genomes.
- Zstandard defaults compress to 777 GiB (CR ~3), much faster but less space-efficient.

Key Findings with Zstandard --long Mode

- Using zstd --long on the default multiline (60 chars per line) FASTA file slightly improves compression from 777 GiB to 641 GiB (CR 3.8).
- The presence of newline characters (0x0A) every 60 bases interrupts pattern matching, reducing long range effectiveness.
- Removing newlines to make each sequence a single uninterrupted line with seqtk seq -l 0 triples the compression ratio to 11 (232 GiB) with only ~20% slowdown relative to default zstd.

Further Improvements

Increasing the long-range window to 2 GiB using --long=31 on single-line FASTA reduces size to 80 GiB (CR of 31), tripling compression again. This results in ~80% slower compression time compared to zstd defaults, with more memory usage and reduced decompression compatibility (decompression also needs --long=31).
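
As a rough illustration of the newline-removal step, the Python sketch below joins each record's wrapped sequence lines into a single line, which is the effect the text attributes to seqtk seq -l 0. The function name unwrap_fasta and the tiny in-memory example are illustrative assumptions, not part of seqtk or zstd; for inputs at the scale of the 661k dataset, a dedicated streaming tool is likely the more practical choice.

```python
import io
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA records with each sequence joined onto one line.

    Hypothetical helper for illustration: header lines (starting with '>')
    are kept as-is, and the wrapped sequence lines that follow are
    concatenated without the intervening newlines. Lines that appear
    before the first header are ignored.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header + "\n" + "".join(chunks) + "\n"
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header + "\n" + "".join(chunks) + "\n"


# Example: two short records wrapped at 4 characters per line.
wrapped = ">seq1 example\nACGT\nACGT\n>seq2\nTTTT\nGG\n"
single_line = "".join(unwrap_fasta(io.StringIO(wrapped)))
print(single_line, end="")
```

The unwrapped output is what would then be passed to zstd --long or zstd --long=31, so that the match finder sees each sequence without a newline byte every 60 bases.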
Results can vary by dataset, but the trade-off between compression and speed is often worthwhile.

---

Summary Table of Compression Results (661k dataset)

| Compression         | Line Length     | Size (GiB) | Compression Ratio |
|---------------------|-----------------|------------|-------------------|
| Uncompressed        | 60              | 2460       | 1                 |
| Gzip (pigz)         | 60              | 751        | 3.3               |
| Zstandard (default) | 60              | 777        | 3.2               |
| Zstandard --long    | 60              | 641        | 3.8               |
| Zstandard --long    | 0 (single line) | 232        | 11                |
| Zstandard --long=31 | 0 (single line) | 80         | 31                |

---

Practical Advice

For compressing large genome FASTA files, removing newlines inside sequences before compression is critical to leverage --long mode effectively. Using --long=31 delivers a compelling balance between compression ratio and speed, achieving results close (within an order of magnitude) to slower, state-of-the-art specialist compressors. Be sure to pass the same --long parameter on decompression for compatibility.

---

Additional Context

The --long mode greatly benefits assemblies and similar highly redundant genome data. Its usage narrows the gap between fast general compressors and slower, specialized genomic compression tools. Compression gains come with increased memory and CPU resources during compression and decompression.

---

Quote from Bede Constantinides:

"Zstandard's --long range mode works wonders for assemblies, but needs uninterrupted single line sequences."

AllTheBacteria 661k multiline fasta: gzip (pigz) 751GB, zstd --long 641GB (30% original)

Single line fasta: gzip (pigz) 700GB, zstd --long 232GB (10% original)

---

This exploration illustrates that **Zstandard with optimized usage of long range mode is a