Mastering 'tar' for Reliable, Efficient File Archiving
A corrupt backup on a customer’s NFS share kept me up late once: ownerships gone, timestamps mangled, SELinux contexts lost. “You should have used tar --xattrs --selinux,” the storage vendor said. They weren’t wrong.
tar (Tape ARchive) remains a default tool for aggregating and compressing files in Linux/Unix environments. Despite origins in streaming to tape, it's the backbone for modern filesystem snapshots, deployable artifacts, and backup pipelines. Its versatility, and occasional sharp edges, demand attention to detail.
Syntax, At a Glance
tar [flags] [archive-file.tar] [sources]
Typical create operation:
tar czf /srv/backup-2024-06-10.tar.gz /etc/ /opt/app/
- c: Create archive.
- z: Compress using gzip.
- f: File name to write/read.
Swapping z for j triggers bzip2; J selects xz.
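To make the flags concrete, here is a minimal round trip. It is a sketch that assumes GNU tar and a throwaway directory from mktemp; the file names are made up for the demo, and the t (list) flag is added to show what went into the archive.

```shell
# Sketch: create, list, and extract a small gzip archive in a scratch directory.
# All paths are hypothetical and live under a temporary directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

mkdir -p "$workdir/src/conf" "$workdir/restore"
printf 'port=8080\n' > "$workdir/src/conf/app.ini"

# c = create, z = gzip, f = archive file; -C changes directory before reading src.
tar czf "$workdir/demo.tar.gz" -C "$workdir" src

# t = list members without extracting anything.
listing=$(tar tzf "$workdir/demo.tar.gz")
printf '%s\n' "$listing"

# x = extract; -C sets the destination directory.
tar xzf "$workdir/demo.tar.gz" -C "$workdir/restore"
```

Using -C on the create side keeps member names relative (src/...), which makes the archive easier to restore somewhere else.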
Table: Common Flags & Their Uses
Flag | Function | Gotcha |
---|---|---|
c | Create archive | N/A |
x | Extract archive | Risky as root when member paths are unexpected |
v | Verbose output | Noisy on large trees |
z / j / J | gzip / bzip2 / xz compression | xz compresses slowest but yields the smallest archives |
p | Preserve permissions | Needs root to fully honor |
--exclude | Exclude paths/globs | Quote globs so the shell does not expand them |
--xattrs | Preserve extended attributes | Not available on every tar build |
Beyond Simple Compression: Archival Integrity in Practice
A daily tar czf job works for most home directories. Production and regulated environments (compliance archiving, root filesystem backups) require more rigor.
Preserve Everything Worth Preserving
Whether the target is system files, SELinux-labeled trees, or Docker volumes, GNU tar does not store ACLs, extended attributes, or SELinux contexts unless the corresponding options are set explicitly.
Full-preservation example (GNU tar 1.30+):
tar cpf backup.tar \
--xattrs --acls --selinux \
/etc /var/lib/app
Note: Not all file systems or tar builds on minimal containers ship with full feature support. Always test restore against a scratch VM.
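Before relying on those flags in a script, it can help to check what the local tar build advertises. The sketch below assumes the build lists its options in tar --help (GNU tar does); supports_flag is a hypothetical helper name. Treat it as a first-pass check only, since an option can be listed and still be ineffective on a given file system.

```shell
# Sketch: build the metadata flag list from what the local tar advertises.
# Assumption: supported long options appear in `tar --help` output.
supports_flag() {
    # $1 = long option to look for, e.g. --xattrs
    tar --help 2>/dev/null | grep -q -e "$1"
}

meta_flags=""
for flag in --xattrs --acls --selinux; do
    if supports_flag "$flag"; then
        meta_flags="$meta_flags $flag"
    else
        echo "warning: tar does not advertise $flag; that metadata will not be saved" >&2
    fi
done

echo "metadata flags:${meta_flags:- (none)}"
```

The result can be passed unquoted so the shell splits it into separate options, for example tar cpf backup.tar $meta_flags /etc /var/lib/app.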
Exclude Intelligently
Backups balloon quickly. Excluding build artifacts, node_modules, or temp folders saves bandwidth and disk:
tar czf project-src.tar.gz \
--exclude='*.log' \
--exclude='build' \
src/
For complex rules, store excludes in a file:
tar czf home.tar.gz --exclude-from=exclude.list /home/user/
Contents of exclude.list:
*.cache
Downloads/tmp
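Exclude rules are easy to get subtly wrong, so it is worth listing the archive afterwards instead of trusting them. The following is a sketch assuming GNU tar; the directory tree and file names are invented for the demo, and the directory pattern is written without a trailing slash because a trailing slash may fail to match in some GNU tar versions.

```shell
# Sketch: apply an exclude file, then list the archive to confirm what was skipped.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

mkdir -p "$workdir/home/user/Downloads/tmp" "$workdir/home/user/docs"
echo keep > "$workdir/home/user/docs/notes.txt"
echo drop > "$workdir/home/user/thumbs.cache"
echo drop > "$workdir/home/user/Downloads/tmp/partial.bin"

# One pattern per line; directory patterns carry no trailing slash.
printf '%s\n' '*.cache' 'Downloads/tmp' > "$workdir/exclude.list"

tar czf "$workdir/home.tar.gz" --exclude-from="$workdir/exclude.list" \
    -C "$workdir/home" user

# List what actually went in.
members=$(tar tzf "$workdir/home.tar.gz")
printf '%s\n' "$members"
```

Only docs/notes.txt and the surviving directories should appear in the listing; anything else means a pattern did not match the way it was intended.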
Compression: Performance vs. Size
Gzip (the default in most scripts) is a balanced choice. For large archives on multi-core systems, consider pigz:
tar cf - ./dataset | pigz -9 > dataset-$(date +%F).tar.gz
- pigz: Parallel gzip; ~4–8x faster on modern CPUs
- Downside: requires an external package (apt install pigz)
- Trade-off: maximal -9 is slowest but smallest; adjust per I/O budget
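Since pigz is not installed everywhere, a backup script may want a fallback. The sketch below picks pigz when it is on the PATH and plain gzip otherwise; pick_gzip is a hypothetical helper name, and the dataset directory is created only for the demo.

```shell
# Sketch: use pigz when available, otherwise fall back to gzip.
pick_gzip() {
    if command -v pigz >/dev/null 2>&1; then
        echo pigz
    else
        echo gzip
    fi
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/dataset"
printf 'sample row\n' > "$workdir/dataset/numbers.txt"

compressor=$(pick_gzip)
out="$workdir/dataset-$(date +%F).tar.gz"

# Level 6 is the gzip default; raise toward 9 only if CPU time is cheap.
tar cf - -C "$workdir" dataset | "$compressor" -6 > "$out"

# pigz output is ordinary gzip, so stock tools can test and read it.
gzip -t "$out" && tar tzf "$out"
```

GNU tar can also call the compressor itself via --use-compress-program=pigz, which avoids the explicit pipe.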
xz compression:
tar cJf logs.tar.xz /var/log/
- Smallest size of the three and slowest to compress; decompression is slower than gzip but typically faster than bzip2.
Extraction: Minimize Damage, Maximize Safety
Typical restore (with permission preservation):
tar xzvpf backup.tar.gz -C /tmp/recovery/
- -C: Explicit extraction path (never restore directly to / unless required)
- p: Restores original permission bits; original ownership is restored only when extracting as root
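If the archive carries a leading directory you do not want recreated (e.g., backup/etc/passwd), set --strip-components=1 to drop the first path component of every member. Absolute paths are a separate concern: GNU tar strips the leading / from member names by default, so an entry created from /etc/passwd extracts as etc/passwd under the -C directory instead of overwriting the system file.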
tar xzpf backup.tar.gz --strip-components=1 -C /my/chroot
Known issue: Extraction with --strip-components may lose parent structure required by some relative links.
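Listing member names before extracting shows exactly where files will land. The sketch below assumes GNU tar and builds a tiny archive with a hypothetical top-level backup/ directory, then extracts it with one component stripped.

```shell
# Sketch: inspect member names, then extract with the first path component removed.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

mkdir -p "$workdir/backup/etc" "$workdir/restore"
echo 'demo' > "$workdir/backup/etc/app.conf"
tar czf "$workdir/backup.tar.gz" -C "$workdir" backup

# Member names as stored: this is what would be recreated under the -C directory.
tar tzf "$workdir/backup.tar.gz"

# Drop the leading "backup/" while extracting into the scratch restore directory.
tar xzpf "$workdir/backup.tar.gz" --strip-components=1 -C "$workdir/restore"

# Show the resulting layout.
find "$workdir/restore" -type f
```

After the extraction, the file sits at restore/etc/app.conf with no backup/ level in between.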
Real-World Problem: Incomplete Backups
tar will skip files it cannot read; it reports each skipped file on stderr and exits with a non-zero status. Always inspect for errors:
tar: ./ssl/private.key: Warning: Cannot open: Permission denied
Solution: Run under sufficient privileges and redirect error logs for audit:
tar czpf /srv/snapshot.tar.gz /srv/app > /var/log/backup.log 2>&1
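A log nobody reads does not help much, so an automated job should also act on tar's exit status. The sketch below wraps the call in a hypothetical run_backup function. Because a permission error cannot be reproduced reliably (root can read everything), a missing source path stands in for the unreadable file; tar treats both as errors.

```shell
# Sketch: log tar's stderr and propagate its exit status.
# run_backup and backup_log are hypothetical names for this demo.
run_backup() {
    archive=$1
    shift
    status=0
    tar czpf "$archive" "$@" 2>"$backup_log" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "backup incomplete (tar exit $status), see $backup_log" >&2
    fi
    return "$status"
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
backup_log="$workdir/backup.log"
mkdir -p "$workdir/app"
echo data > "$workdir/app/file.txt"

# Clean run: every source is readable.
ok_status=0
run_backup "$workdir/ok.tar.gz" -C "$workdir" app || ok_status=$?

# Failing run: one source does not exist, so tar reports it and exits non-zero.
bad_status=0
run_backup "$workdir/bad.tar.gz" -C "$workdir" app no-such-dir || bad_status=$?

echo "ok=$ok_status bad=$bad_status"
cat "$backup_log"
```

In a cron job or pipeline, the non-zero return is the hook for alerting or for refusing to rotate out the previous good backup.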
Ensuring Archive Integrity
Post-backup, record a checksum and verify it after every transfer and before every restore:
sha256sum archive.tar.gz > archive.tar.gz.sha256
sha256sum -c archive.tar.gz.sha256
For critical volumes, consider running tar --diff (GNU tar only) post-restore to confirm that content and permissions were restored correctly.
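A checksum only proves the file has not changed since it was hashed, so it pairs well with checks that read the archive itself. The sketch below strings four checks together, assuming GNU tar, gzip, and sha256sum from coreutils; the app directory and its contents are made up for the demo.

```shell
# Sketch: layered integrity checks on a freshly written archive.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/app"
echo 'v1' > "$workdir/app/settings.conf"

archive="$workdir/archive.tar.gz"
tar czf "$archive" -C "$workdir" app

# 1. Is the gzip stream intact?
gzip -t "$archive"
# 2. Can tar walk every member header?
tar tzf "$archive" > /dev/null
# 3. Record a checksum so truncated copies or later corruption are detectable.
sha256sum "$archive" > "$archive.sha256"
sha256sum -c "$archive.sha256"

# 4. GNU tar only: compare archive members against the files on disk.
diff_before=0
tar --diff -zf "$archive" -C "$workdir" || diff_before=$?

# Change a file and compare again: --diff reports the mismatch and exits non-zero.
echo 'v2 edited' > "$workdir/app/settings.conf"
diff_after=0
tar --diff -zf "$archive" -C "$workdir" || diff_after=$?

echo "diff before edit: $diff_before, after edit: $diff_after"
```

For a restore check, point -C at the restored tree instead of the source tree.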
Shortcuts, But Carefully
Avoid archiving with absolute paths unless the archive will be restored to its original location; otherwise you risk overwriting unrelated files. Add --absolute-names explicitly only if it is genuinely needed; it is rarely recommended.
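The default behavior is easy to confirm on a scratch file. This sketch assumes GNU tar and passes an absolute source path without --absolute-names; the stored member name should come back without the leading slash.

```shell
# Sketch: show that an absolute source path is stored as a relative member name.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
echo data > "$workdir/file.txt"

# Absolute path in, no -P/--absolute-names: tar drops the leading "/" and
# prints a notice about it on stderr (silenced here).
tar cf "$workdir/abs.tar" "$workdir/file.txt" 2>/dev/null

stored=$(tar tf "$workdir/abs.tar")
printf '%s\n' "$stored"
```

The printed name is the temporary path minus its leading slash, so extracting with -C recreates the tree under the chosen directory.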
Tip: Archive for Portability
If sharing tarballs between hosts with different user/group structures, or with older tar versions (e.g., RHEL 6’s tar 1.23), preempt issues with ownership mismatches and missing xattrs. Prefer --numeric-owner for cross-system transfer, and verify tar version compatibility:
tar --version
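To see which IDs a restore would apply, list the archive with numeric owners. The sketch below assumes GNU tar and uses a throwaway directory; the file names are invented for the demo.

```shell
# Sketch: archive with numeric owners, then list the stored uid/gid values.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/app"
echo data > "$workdir/app/file.txt"

# Record numeric IDs instead of relying on user and group names.
tar --numeric-owner -cf "$workdir/portable.tar" -C "$workdir" app

# List with numeric IDs: the second column is uid/gid as a root restore would apply it.
owners=$(tar --numeric-owner -tvf "$workdir/portable.tar")
printf '%s\n' "$owners"
```

If the numbers map to a different account on the target host, decide up front whether to restore as root with those IDs or extract as an ordinary user and fix ownership afterwards.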
In Summary
“tar” is only as good as its arguments—backup reliability hinges on details: metadata preservation, error monitoring, compression trade-offs. Test both creation and extraction paths before trusting your data to automation.
Hidden complexity? Absolutely. But once mastered, tar is a rock-solid building block for filesystem backup, container image build chains, and artifact delivery.
Not perfect; no tool is. For an internal repo, use checksum manifests; for large datasets, consider piping tar through zstd, or filesystem snapshots, as alternatives.
Gotcha: Always test your restores. An untested backup is one step away from data loss.