How To Gzip A File

Reading time: 1 min
#Compression #CommandLine #Performance #Gzip #Linux #FileCompression

Mastering File Compression: Efficient Gzip Usage on Linux Systems

Ever-growing logs, regular application dumps, and recurring data exports: over time, uncompressed files quietly consume terabytes. In most DevOps and data engineering workflows, using gzip directly at the CLI offers control, predictability, and integration that graphical tools simply can't match.


Why gzip? Practical Outcomes

Text-heavy logs (e.g., Nginx, MySQL, or custom app output) see reductions beyond 70%. This isn’t trivial: smaller files mean shorter backups, reduced S3 costs, and accelerated transfers, especially across slow or congested links. Most UNIX-like systems bundle gzip by default; on RHEL 8.7, for example, it’s included in the gzip-1.10-7.el8 package.

Some might ask, “Why not use bzip2 or xz?” They do provide better ratios, but at a significant cost in time and CPU. For most real-world pipelines, gzip lands at the sweet spot of compression speed and compatibility.


Basic Compression: The One-Liner

Compress a file and replace the original:

gzip error.log

Result: error.log.gz exists; error.log is gone.

Preserve the original? Use -c:

gzip -c access.log > access.log.gz

Useful in automated scripts, particularly to avoid race conditions where the uncompressed version is still needed downstream.
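
A simpler alternative, if you can assume GNU gzip 1.6 or newer: the -k (--keep) flag compresses the file while leaving the original in place, with no stdout redirection needed.

gzip -k access.log

On older or non-GNU systems -k may be missing, so check gzip --version before relying on it in portable scripts.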


Tuning Compression: Speed vs. Ratio

gzip supports levels -1 (fast, less compressed) to -9 (slow, max compression).

Level   Compression   Relative Speed
-1      Lower         Fastest
-6      Balanced      Default
-9      Max           Slowest

Example: compress at a higher ratio when archiving historical logs.

gzip -9 backup-2024-06-01.json

Anecdotal: on a 500 MB rotated log, gzip -9 shaved only about 10% more bytes than -6 but took 40% longer. For bulk work, stick with -6, or drop to -3 when speed is critical.
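
If you want numbers for your own data rather than anecdotes, a quick side-by-side timing works; sample.log below is just a placeholder for a representative file.

time gzip -6 -c sample.log > /dev/null
time gzip -9 -c sample.log > /dev/null

Writing to /dev/null keeps output I/O out of the measurement, so the comparison mostly reflects CPU cost.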


Batch Compression

Wildcards speed up repetitive work:

gzip *.csv

Caveat: Each file becomes a standalone .gz. For a single archive, combine with tar:

tar -czf logs-archive.tar.gz *.log

This approach preserves directory structure and file attributes; plain gzip does not.
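
Restoring the bundle later is the mirror image:

tar -xzf logs-archive.tar.gz

Add -C /path/to/target if the contents should land somewhere other than the current directory.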


Keeping Originals: Looping Technique

Need to compress multiple files while retaining the originals? Bash handles it cleanly:

for file in *.json; do
  gzip -c "$file" > "$file.gz"
done

Note: For a few thousand files, this is fast. At massive scale (millions), consider GNU parallel or a find-based batch to avoid fork-bombing your shell.
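
For reference, a rough sketch of both alternatives, assuming GNU findutils, GNU parallel, and a gzip that supports -k:

find . -maxdepth 1 -name '*.json' -exec gzip -k {} +

parallel gzip -k ::: *.json

The find variant batches file names into as few gzip invocations as possible; parallel spreads the work across CPU cores.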


Decompression and Data Inspection

Standard decompression:

gunzip error.log.gz

Or, to preview content on the fly:

gzip -dc access.log.gz | head -20

Here, -d decompresses, -c streams to stdout—no temp files needed.

On many distributions, less is configured via lesspipe to display .gz files directly; try less error.log.gz.
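
The z-prefixed wrappers that ship with gzip (zcat, zgrep, zless, zdiff) cover most inspection tasks without an explicit -dc; the pattern below is only illustrative.

zgrep -c ' 500 ' access.log.gz

zgrep searches the compressed file directly, here counting lines containing the string ' 500 '.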


Compression Metrics and Diagnostics

Immediately check efficiency with:

gzip -l backup.sql.gz

Output:

         compressed        uncompressed  ratio uncompressed_name
        16214388          85533875     81.0%  backup.sql

Tip: If the ratio looks unexpectedly low, check for already compressed inputs (JPG, MP4, ZIP); gzip gains almost nothing on data that is already compressed or encrypted.
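
One quick way to spot such cases before wasting CPU is the file utility; the file names here are purely illustrative.

file report.pdf camera.mp4 dump.sql

Anything that file reports as compressed or archive data will not shrink meaningfully under gzip.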


Troubleshooting and Gotchas

  • Gzipping a file twice (*.gz as input) results in negligible savings and can disrupt automated decompressors.
  • Not all tools auto-detect .gz files; scripts may require explicit decompression.
  • For in-use files (such as logs rotated by logrotate), schedule compression after rotation completes so you don't capture partial data; a logrotate sketch follows this list.
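
As a sketch under assumptions (the path and rotation policy are hypothetical), logrotate can own the compression step itself and defer it by one cycle so the previous file is no longer being written:

/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
}

compress gzips rotated logs, and delaycompress postpones that until the next rotation, which avoids racing a process that still holds the old file open.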

Common Error

Trying to gzip a directory directly:

gzip mydir/

Error:

gzip: mydir/ is a directory -- ignored

Solution: Use tar -czf as shown above.


Real-World Scenario: Compressing a 10GB Audit Log

Objective: Compress /var/log/audit/audit.log.1 with reasonable speed and detail.

time gzip -v -6 audit.log.1

-v reports the file name and compression ratio as each file completes:

audit.log.1:     100.0% -- replaced with audit.log.1.gz

The wall-clock figure reported by time is what you feed back into pipeline tuning.

Non-obvious tip: If disk space is tight and you can't risk intermediate files, combine gzip with an upload or copy in a single stream:

gzip -c largefile.sql | aws s3 cp - s3://bucket/largefile.sql.gz
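
The reverse direction streams just as well; a sketch assuming the same bucket and a local PostgreSQL database named mydb:

aws s3 cp s3://bucket/largefile.sql.gz - | gunzip -c | psql mydb

Nothing touches local disk at any point, which matters on hosts sized for compute rather than storage.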

Summary and Recommendations

  • Use gzip -c to avoid data loss when scripting.
  • Choose compression level based on actual timing benchmarks; don’t default to -9 unless you measure a clear advantage.
  • Combine tar and gzip for directories or file bundles.
  • Always check compression ratio for unexpected results—compressed or encrypted files don’t benefit.
  • Automate batch compressions with shell loops or tools like parallel; avoid brute-force approaches on massive datasets.

Note: Some cloud-native backup pipelines are shifting to zstd for even better compression and speed. But until it's everywhere, gzip remains the common denominator for lightweight, scriptable compression.


Questions, new edge-cases, or better workflows? Knowledge evolves—contributions welcome.