How To Make Tar File In Linux

Reading time: 1 min
#Linux#Backup#Compression#Tar#Archiving#Encryption

Mastering tar: Efficient, Secure Archival on Linux

The tar utility remains fundamental for Linux archiving, whether you’re consolidating nightly backups, moving application state between hosts, or integrating with a CI pipeline. Creating robust, lean tarballs isn’t about memorizing tar cvf; it’s about knowing what each switch does and when to reach for advanced options for integrity, portability, and security.


Why Not Just tar cvf?

Consider a scenario: you need to archive /srv/app-config to move it between servers. Basic usage:

tar cvf app-config-$(date +%F).tar /srv/app-config

This is functional, but:

  • Generates uncompressed output (poor for bandwidth or large datasets)
  • May skip subtle-but-critical metadata
  • Lacks integrity mechanisms
  • Leaves sensitive configuration exposed in transit

Compression: Gzip, Bzip2, XZ — Which and When

Gzip remains the practical default for speed and broad compatibility. Add the z flag:

tar czvf archive.tar.gz /srv/app-config
# On extraction: tar xzvf archive.tar.gz
  • On an Ubuntu 22.04 system, gzip compresses a 1GB unstructured log set to roughly 400MB in ~25s on mid-tier storage.
  • [Known issue] Compression ratio is moderate compared to modern algorithms.

Bzip2: use j for better compression, slower speeds.

tar cjvf archive.tar.bz2 /srv/app-config

XZ (J) yields the smallest archives, especially for repetitive data (think: source code snapshots, Docker contexts):

tar cJvf archive.tar.xz /srv/app-config

Trade-off: both compression and extraction are CPU-bound. For >10GB archives in scripts, expect noticeably longer run times, sometimes to the point that a simple scp + gzip pipeline is faster for remote transfers on commodity hardware.
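
As a rough, hedged comparison, you can time both approaches yourself; backup@remote-host and the /backups path below are placeholders, not part of the original example:

# Option 1: compress with xz locally, then copy the finished archive.
time tar cJf app-config.tar.xz /srv/app-config
scp app-config.tar.xz backup@remote-host:/backups/

# Option 2: gzip on the fly and stream straight to the remote host.
time tar czf - /srv/app-config | ssh backup@remote-host 'cat > /backups/app-config.tar.gz'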


Permissions, Timestamps, xattrs: Getting Meta

GNU tar’s defaults preserve most file attributes, but not always all. For system backups, or when moving ACLs and file capabilities, preserve permissions with p and consider --xattrs (extended attributes) on modern kernels; builds with ACL support also offer --acls:

tar czvpf archive.tar.gz --xattrs /etc /home/myuser
  • p: preserves exact permissions on extraction; ownership is only reinstated when extracting as root.
  • --xattrs: extended attributes, e.g., SELinux labels (Fedora/RHEL).

Note: extracting as a non-root user will not reinstate ownership data; if tar is explicitly asked to (e.g. with --same-owner), it emits tar: ...: Cannot change ownership warnings instead.
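
For a full restore of such an archive, a minimal sketch (assuming a hypothetical /restore/target directory) is to extract as root so ownership, permissions, and extended attributes all come back:

sudo mkdir -p /restore/target
# Extract as root: p restores permissions, --xattrs restores extended attributes,
# and ownership is reinstated because the process runs as the superuser.
sudo tar xzvpf archive.tar.gz --xattrs -C /restore/target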


Selective Exclusion: Keep Archives Clean

Archiving Node.js projects? No need for node_modules bloat or CI logs. Exclude with:

tar czvf project.tar.gz --exclude='node_modules' --exclude='*.log' /srv/www/project

Multiple --exclude statements can dramatically shrink archive size and backup/restore time—review what actually needs to persist.
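
If the exclusion list grows, GNU tar can also read patterns from a file with --exclude-from; the .tarignore file name below is purely illustrative, not a convention tar knows about:

# One pattern per line; matched the same way as --exclude patterns.
cat > .tarignore <<'EOF'
node_modules
*.log
.cache
EOF

tar czvf project.tar.gz --exclude-from=.tarignore /srv/www/project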


Integrity: Checksum Every Archive

tar only reports internal read/write errors—it won’t protect you against bitrot or incomplete transfers. For critical backups, always store a checksum next to every archive:

sha256sum archive.tar.gz > archive.tar.gz.sha256

Later, verify with:

sha256sum -c archive.tar.gz.sha256
# Output:
# archive.tar.gz: OK

Note: On large datasets, hash calculation can take time. For rapid CI artifact validation, consider doing this in parallel with archive creation.
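
One way to fold the hash into archive creation is to tee the compressed stream through sha256sum; this is a sketch, and the sed step only rewrites the stdin placeholder ("-") to the real archive name:

# Write the archive and compute its checksum in a single pass over the data.
# sha256sum labels the stdin hash as "-", so sed substitutes the archive name.
tar czf - /srv/app-config \
  | tee archive.tar.gz \
  | sha256sum \
  | sed 's|-$|archive.tar.gz|' > archive.tar.gz.sha256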


Splitting: Overcoming Filesystem and Transfer Limits

Pushing to S3, USB sticks, or legacy FAT32 partitions? Split at the stream level:

tar czf - /data | split -b 1G - backup.tar.gz.part-
# Generates backup.tar.gz.part-aa, ab, etc.

On destination, combine and extract:

cat backup.tar.gz.part-* | tar xzf -

Needed because filesystems like FAT32 cap individual file size at 4GB.

Gotcha: Quick interruptions in transfer can corrupt one segment—sha256sum each part before reassembly for high-value data.
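
A minimal sketch of that guard: checksum every segment on the source host, then verify before reassembly on the destination:

# On the source: record a checksum for every segment.
sha256sum backup.tar.gz.part-* > backup.parts.sha256

# On the destination: only reassemble and extract if every part verifies.
sha256sum -c backup.parts.sha256 && cat backup.tar.gz.part-* | tar xzf -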


Encryption: Securing Data-in-Transit

tar has no native encryption, so use OpenPGP (gpg) for symmetric protection:

tar cJvf - /srv/secrets | gpg --cipher-algo AES256 -c -o secrets-backup.tar.xz.gpg

Decryption and extraction is just the reverse pipeline:

gpg -d secrets-backup.tar.xz.gpg | tar xJvf -
  • Avoid leaving decrypted output on disk; always stream where possible.
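
In the same spirit, the ciphertext itself can be streamed straight to a remote host so nothing intermediate touches the local disk; backup@remote-host and /backups are placeholders:

# Encrypt on the fly and ship the ciphertext over ssh; -o - sends gpg's output
# to stdout so no intermediate file is written locally.
tar cJf - /srv/secrets | gpg -c -o - | ssh backup@remote-host 'cat > /backups/secrets.tar.xz.gpg'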

Alternative: age or openssl enc can also encrypt the archive stream, but for compatibility and scriptability, gpg -c remains a reliable default.
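
A hedged sketch of the openssl enc route (the AES-256-CBC plus PBKDF2 combination is just one reasonable choice; -pbkdf2 needs OpenSSL 1.1.1 or newer):

# Symmetric encryption of the compressed tar stream with a passphrase.
tar czf - /srv/secrets | openssl enc -aes-256-cbc -pbkdf2 -salt -out secrets.tar.gz.enc

# Decrypt and extract in one pipeline.
openssl enc -d -aes-256-cbc -pbkdf2 -in secrets.tar.gz.enc | tar xzf -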


Practical Example: Project Backup, Optimized

Required: Back up /home/dev/project as a compressed, encrypted tar archive, excluding build and logs, preserving all metadata, and ready for long-term offsite storage.

tar cJvpf - --exclude=build --exclude='*.log' /home/dev/project | gpg -c -o project-$(date +%F).tar.xz.gpg
  • cJvpf -: create, XZ compress, verbose, preserve permissions, write the archive to stdout (f -) so it can be piped into gpg.
  • --exclude: filters non-essential data.
  • gpg -c: prompt for passphrase, AES-256 by default.
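
Restoring is the reverse pipeline; the archive name and /restore/project directory below are placeholders for whatever the backup produced:

mkdir -p /restore/project
# Decrypt, decompress, and extract in one pass; run as root if ownership
# needs to be reinstated along with permissions (p).
gpg -d project-YYYY-MM-DD.tar.xz.gpg | tar xJvpf - -C /restore/project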

Reference Table: Common tar Workflows

| Requirement | Command Snippet | Note |
| --- | --- | --- |
| Basic archive | `tar cvf backup.tar /data` | No compression |
| Fast compression | `tar czvf backup.tar.gz /data` | gzip; trade-off: moderate ratio |
| Best compression | `tar cJvf backup.tar.xz /data` | Slower, smaller file |
| Preserve permissions | Add `p`; e.g., `tar czvpf ...` | Essential for root/system files |
| Exclude temp files | `--exclude='*.tmp' --exclude='cache/'` | Pattern matching |
| Archive verification | `sha256sum backup.tar.gz > backup.sha256` | Integrity check |
| Split for transport | `tar czf - data \| split -b 2G - data.tar.gz.part-` | Reassemble with `cat` before extraction |
| Encrypt backup | `tar czf - data \| gpg -c -o data.tar.gz.gpg` | Decrypt with `gpg -d` before extraction |

Side Notes and Non-Obvious Details

  • For maximum portability (BSD systems, old rescue shells, or Docker images), avoid nonstandard flags like --xattrs; check which implementation and version you have with tar --version.
  • Extraction recreates relative path trees. To avoid surprises, change to the destination directory first, or use -C /target/path on extraction.
  • Error: “tar: Removing leading '/' from member names” — by design, to make safe relative archives.
  • GNU tar can hand compression off to an external program with --use-compress-program (short form -I), e.g. pigz for parallel gzip or pbzip2 for parallel bzip2; this can accelerate big backups on multicore hardware (see the sketch after this list).
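
A hedged sketch of the pigz route, assuming pigz is installed (it ships as a separate package on most distributions); /data and /restore/target are placeholders:

# Let pigz spread gzip compression across all cores.
tar --use-compress-program=pigz -cvf backup.tar.gz /data

# Extraction works the same way (tar passes -d to pigz for decompression).
tar --use-compress-program=pigz -xvf backup.tar.gz -C /restore/target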

Summary

Mastering tar isn’t about memorizing single commands—it’s about composing the right flags and auxiliary tools for security, reproducibility, and long-term reliability. Store checksums. Encrypt what matters. Exclude clutter. Documentation is in man tar, but as always, the real world brings edge cases. Test restores regularly—restorability is the only backup that counts.