How To Compress A File In Linux


Mastering File Compression in Linux: Practical Techniques with tar, gzip, xz, zstd, and More

Years of managing backups and server storage teach you quickly: defaulting to gzip often leaves performance or savings untapped. Compression is about trade-offs—speed, ratio, compatibility. Hardware, sample data, and even your workflow will dictate which tool fits.


Why File Compression Really Matters

Large Postgres dumps, rotating logs in /var/log, artifact storage for CI runs—raw files accumulate quickly. Compression becomes critical when:

  • Transfer times or bandwidth cost are non-negligible (cloud migration, offsite sync).
  • Storage quotas are tight (home directories on shared hosts, cloud object stores).
  • You need to bundle heterogeneous file types into a single transportable unit.

But which tool? And why one over another?


Canonical Tools: tar, gzip, bzip2, xz

tar—The Archiver, Not the Compressor

If you’re only archiving (no compression), keep it simple:

tar -cvf backup_2024-06-18.tar /srv/media/

This creates a plain archive: one stream, minimal overhead. Note that file permissions, ownership, and symlinks are preserved, which is critical for system restores.
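
To verify what went into the archive, including the preserved permission bits, list it without extracting:

tar -tvf backup_2024-06-18.tar | head -n 20   # -t lists entries, -v shows permissions and ownership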

gzip—Balanced for General Use

Default on nearly every system, fast, widely supported.

tar -czvf archive.tar.gz /var/www/html/

  • z pipes the archive through gzip.
  • Compression ratios: a 2-4× reduction is typical on text-heavy content.

To extract:

tar -xzvf archive.tar.gz

Single file? Direct:

gzip -9 error.log  # Highest ratio (but slower)

Gotcha: gzip is single-threaded. For >5GB files, consider pigz.
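
Two more gzip conveniences worth knowing (both standard on GNU systems): -k keeps the original file, and zcat inspects compressed data without writing anything to disk. The grep pattern below is just an example:

gzip -k -9 error.log                    # keep error.log alongside error.log.gz
zcat error.log.gz | grep -c 'timeout'   # count matches without decompressing to disk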

bzip2—Higher Ratio, Higher Cost

For workloads where speed is non-essential.

tar -cjvf archive_20240618.tar.bz2 /opt/datasets/

Observations: archives are often 10–20% smaller than gzip's, but compression typically runs 2–4× slower.
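
If you are locked into the bzip2 format, pbzip2 (a parallel implementation, packaged separately) recovers most of the lost speed. A sketch, assuming pbzip2 is installed:

tar -cf - /opt/datasets | pbzip2 -p8 > archive_20240618.tar.bz2   # -p8: use 8 processors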

xz—If Every Byte Counts

tar -cJvf archive.tar.xz /home/images/

  • J triggers xz backend (LZMA2 algorithm).
  • Practical: 15–30% smaller than gzip, but can take minutes or hours for very large archives.
  • Memory use is significant for large sets (seen >1GB resident for large tarballs).

Extract:

tar -xJvf archive.tar.xz

Tip: use xz -T0 to spawn one thread per core when invoking xz directly.
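
For instance, a direct pipeline (mirroring the example above) keeps both threads and level explicit. Note that multi-threaded xz splits the input into independent blocks, which can cost a little ratio:

tar -cf - /home/images | xz -T0 -6 > archive.tar.xz   # -T0: one thread per core; -6: default level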


Modern Alternatives: zstd, pigz, and lz4

zstd—Speed Without Compromising Ratio

Developed by Yann Collet at Facebook. Natively multi-core, with tunable compression levels.
Install:

sudo apt-get install zstd

Compress tar with zstd:

tar --use-compress-program="zstd -T0" -cvf site_backup_2024-06-18.tar.zst /srv/sites/

  • -T0 uses all CPU cores.
  • Compression levels: -1 (fastest) to -19 (max compression); -3 is usually a sweet spot for most data sets.

Extract:

tar --use-compress-program="zstd -d -T0" -xvf site_backup_2024-06-18.tar.zst

An error to watch for on extraction:

tar: Archive is compressed. Use -a or --auto-compress option

It means tar was not told which compression program to use.

zstd often matches xz on size, but runs in a fraction of the time—especially on modern CPUs (tested with zstd 1.5.2 on Debian 12).

pigz—Parallel gzip

Drop-in replacement for gzip when speed matters and cores are available.

tar -cf - /data | pigz -p 8 > archive-20240618.tar.gz

To restore, decompress and unpack in one pipeline:

pigz -dc -p 8 archive-20240618.tar.gz | tar -xf -

Note: pigz is not always installed by default (apt install pigz).
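
GNU tar's -I flag (short for --use-compress-program) also lets pigz stand in for the z flag without an explicit pipe; a sketch, assuming a reasonably recent GNU tar that accepts arguments inside -I:

tar -I 'pigz -p 8' -cf archive-20240618.tar.gz /data   # compress using 8 threads
tar -I pigz -xf archive-20240618.tar.gz                # on extraction, tar invokes pigz -d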

lz4—When Time is the Only Concern

Lightning fast, but ratios lag behind gzip.

lz4 /var/log/journal.log
# Output: /var/log/journal.log.lz4

Good for interim processing or cache layers, not for final archival.
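
Because lz4 acts as a filter in a pipeline, it shines for fast host-to-host copies where the network, not the CPU, should be the bottleneck. A sketch; backup-host and both paths are placeholders:

tar -cf - /var/cache/build | lz4 | ssh backup-host 'lz4 -d | tar -xf - -C /restore'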


Practical Comparison Table

Tool    Threads   Typical Ratio   Speed       Use Case
gzip    1         medium          fast        Legacy, interoperability
bzip2   1         high            slow        Backup where time is irrelevant
xz      multi     very high       very slow   Immutable archives, release assets
zstd    multi     high            very fast   Snapshot/backup, modern default
pigz    multi     medium          very fast   Large logs, frequent rotates
lz4     multi     low             fastest     Cache, temporary data

Note: Actual results vary wildly by file type. Plain text compresses well; already-compressed formats such as JPEG barely shrink.
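
A rough way to measure this yourself; the sample path is a placeholder, and the timing is coarse wall-clock seconds, so run it on an idle machine:

#!/bin/sh
SAMPLE=/var/log/syslog.1                  # placeholder: point at data representative of your workload
for c in "gzip -6" "bzip2 -9" "xz -6" "zstd -3"; do
    start=$(date +%s)
    $c -c "$SAMPLE" > /tmp/bench.out      # $c is intentionally unquoted so the flag splits off
    end=$(date +%s)
    printf '%-10s %10s bytes  %ss\n' "$c" "$(wc -c < /tmp/bench.out)" "$((end - start))"
done
rm -f /tmp/bench.out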


Real-World Scenario: Backup and Restore of Web Root

  1. Measure the uncompressed data:

    du -sh /srv/www/
    
  2. Compress with zstd for a good balance of ratio and speed:

    tar --use-compress-program='zstd -T0 -19' -cvf www-backup-20240618.tar.zst /srv/www/
    

    Compression at -19 can take noticeably more CPU (benchmark and adjust as needed).

  3. Compare the compressed size with the figure from step 1:

    ls -lh www-backup-20240618.tar.zst
    
  4. Restoration:

    tar --use-compress-program="zstd -T0 -d" -xvf www-backup-20240618.tar.zst
    
  5. Validate the archive:

    tar --diff --use-compress-program="zstd -d" -vf www-backup-20240618.tar.zst
    

    Any mismatches between the archive and the on-disk files will be reported.
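
  6. Optionally, record a checksum. --diff compares against the live filesystem, which drifts over time; a stored checksum (sha256sum, from GNU coreutils) detects corruption of the archive itself:

    sha256sum www-backup-20240618.tar.zst > www-backup-20240618.tar.zst.sha256
    sha256sum -c www-backup-20240618.tar.zst.sha256   # run later to detect bit rot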


Non-Obvious Considerations

  • Compression destroys random access. Large tarballs become a liability if only a single small file is needed. For multi-file datasets, consider alternative layouts or per-file compression (see the sketch after this list).
  • Encryption: Inline with gpg:
    tar -czf - /etc | gpg --symmetric --cipher-algo AES256 -o /secure/backups/etc-backup.tar.gz.gpg
    
  • Multi-core scaling isn't uniform: bzip2 remains single-threaded (unless you use pbzip2), while xz and zstd adapt well to high core-count systems.
  • File compatibility: Some legacy systems may lack zstd or modern xz support (notably RHEL/CentOS 7 or earlier).
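
For the per-file approach mentioned above, zstd can walk a directory tree and compress each file individually, keeping every file independently accessible. A sketch; both dataset paths are hypothetical:

zstd -r -T0 --rm /srv/datasets/              # -r recurses; --rm deletes originals only after success
zstd -d /srv/datasets/2024/results.csv.zst   # later: decompress just the one file you need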

Known issue: Some versions of tar fail with --use-compress-program if the program path includes spaces. Use absolute paths or symlinks as a workaround.
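
A minimal wrapper script is one such workaround (the path containing a space below is hypothetical):

cat > /usr/local/bin/zstd-t0 <<'EOF'
#!/bin/sh
exec "/opt/build tools/zstd" -T0 "$@"   # quoting keeps the space intact
EOF
chmod +x /usr/local/bin/zstd-t0
tar --use-compress-program=zstd-t0 -cvf archive.tar.zst /data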


Summary Table: Tool Selection by Requirement

Scenario                    Recommended Tool
Fastest, for logs           lz4, pigz
Archival, disk constraint   xz, zstd (high level)
Best all-around default     zstd
Maximum portability         gzip

Final point: always benchmark against representative data. Theoretical maximums rarely match field results.


Side note: Tools and underlying compression libraries evolve. For long-term retention, re-test your approach on new OS releases—expected compression ratios (and bugs) do shift over time.