How To Tar A File In Linux


Mastering Efficient File Archiving: How to Tar a File in Linux Like a Pro

Experienced Linux administrators rarely use tar the same way twice. When compressing backups, shipping code to staging, or preserving permissions across heterogeneous environments, one-size-fits-all commands fall short. Nuance matters.


Context: Why Bother With tar?

File archiving isn't just about packing files together. When you need to back up /etc on RHEL 8 with custom SELinux policies, or migrate a legacy application’s assets from CentOS 7, a generic tar -cvf can quietly drop extended attributes. Overlook that, and restores can fail or permissions silently change.

Tar (tape archive) aggregates files, directories, and, critically, metadata. By default it preserves timestamps, symlinks, hard links, Unix permissions, and user/group ownership. Extended attributes, POSIX ACLs, and SELinux contexts are stored only when you pass --xattrs, --acls, and --selinux. Nobody relies on tar for performance; they rely on it for fidelity.
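
To see what a plain archive records, here is a minimal sketch that builds a throwaway directory (the names demo, config.ini, current, and config.hard are made up for illustration) and prints the verbose listing. It assumes GNU tar and a filesystem that allows hard links.

```shell
# Build a scratch directory containing a restricted file, a symlink and a hard link.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/demo"
echo "secret" > "$work/demo/config.ini"
chmod 640 "$work/demo/config.ini"
ln -s config.ini "$work/demo/current"                  # symbolic link
ln "$work/demo/config.ini" "$work/demo/config.hard"    # hard link

# Archive it with no extra flags, then print the verbose listing.
tar -cf "$work/demo.tar" -C "$work" demo
listing=$(tar -tvf "$work/demo.tar")
echo "$listing"
```

The listing should show the 640 mode, the symlink target, and one of the two hard-linked names recorded as a link to the other. It shows nothing about xattrs, ACLs, or SELinux labels.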


Beyond the Basics: Real-World Tar Usage

Let’s skip generic demos. Here’s a production backup scenario:

Compressing a project folder, excluding noisy temp files, ensuring consistent permissions:

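tar -cJpvf /srv/backup/devsite_$(date +%Y-%m-%d).tar.xz \
  --exclude='*.tmp' --exclude='node_modules' /srv/www/devsite
  • -c: create an archive
  • -J: compress with xz (smallest output, slowest of the common compressors)
  • -p: preserve permissions; this matters mostly on extraction, because modes and ownership are always recorded at creation
  • -v: verbose; list each file as it is processed
  • -f: archive file to write
  • --exclude: skip paths matching the pattern (leave off the trailing slash, or GNU tar may not match the directory)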
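
Exclude patterns are easy to get subtly wrong, so it helps to rehearse them on a scratch tree before pointing them at production data. The sketch below assumes GNU tar; the devsite layout is invented, and gzip stands in for xz only to keep the run quick.

```shell
# Scratch tree with one wanted file, one temp file and a dependency directory.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/devsite/src" "$work/devsite/node_modules/pkg"
touch "$work/devsite/src/app.js" \
      "$work/devsite/src/scratch.tmp" \
      "$work/devsite/node_modules/pkg/index.js"

# Excludes go before the paths they apply to; patterns are quoted so the
# shell does not expand them first.
tar -czf "$work/devsite.tar.gz" \
    --exclude='*.tmp' --exclude='node_modules' \
    -C "$work" devsite

# List the result to confirm what was actually stored.
listing=$(tar -tzf "$work/devsite.tar.gz")
echo "$listing"
```

Only the directory entries and src/app.js should appear in the listing.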

Note: On multi-core servers, parallel compressors such as pigz (for gzip) or multithreaded xz speed up compression considerably. tar -J runs xz with its default settings, which on many systems means a single thread. To use every core, pass -T0 to xz yourself, for example through a pipe:

tar -cpf - /data | xz -T0 > /backup/data_$(date +%F).tar.xz

This shortcut is popular for large datasets on modern hardware.
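
A pipe is not the only route. GNU tar's -I (--use-compress-program) option accepts a command with arguments, so the thread flag can travel with the tar invocation. This is a sketch under that assumption; the data directory and file names are placeholders, and the script falls back to gzip when xz is not installed so that it still runs.

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/data"
head -c 1048576 /dev/zero > "$work/data/blob.bin"   # 1 MiB of sample data

# Prefer multithreaded xz; fall back to gzip if xz is unavailable.
if command -v xz >/dev/null 2>&1; then
    compressor='xz -T0'
    ext=xz
else
    compressor='gzip'
    ext=gz
fi

# -I hands the compression step to the given command.
tar -I "$compressor" -cpf "$work/data.tar.$ext" -C "$work" data

# GNU tar detects the compression format by itself when reading.
listing=$(tar -tf "$work/data.tar.$ext")
echo "$listing"
```

Setting XZ_OPT=-T0 in the environment before running tar -cJf is another way to reach the same result.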


Key Options: Pick What Actually Matters

Here are frequently overlooked flags with specific consequences:

| Option | Purpose | Notes |
| --- | --- | --- |
| --exclude=<pat> | Omit matching paths | Accepts wildcards |
| --same-owner | Retain user/group mapping on extract | Requires root |
| --acls | Preserve POSIX ACLs | Not default |
| --selinux | Save SELinux labels (Fedora/RHEL) | RHEL-tested |
| --sparse | Compact sparse files | E.g. VM disk images |
| --xattrs | Save extended attributes | For custom metadata |
| -C <dir> | Change to directory before add/extract | Useful in restore |
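
Of these, -C is the one that most affects what a restore looks like, because it decides which path prefix ends up in the archive. A small sketch with an invented srv/www/devsite tree, assuming GNU tar:

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/srv/www/devsite" "$work/restore"
echo "hello" > "$work/srv/www/devsite/index.html"

# -C on create: members are stored as devsite/..., not srv/www/devsite/...
tar -cpf "$work/site.tar" -C "$work/srv/www" devsite
first_member=$(tar -tf "$work/site.tar" | head -n 1)

# -C on extract: unpack under a directory of your choosing.
tar -xpf "$work/site.tar" -C "$work/restore"
restored=$(cat "$work/restore/devsite/index.html")
echo "first member: $first_member, restored content: $restored"
```

Storing short relative names this way keeps the archive independent of where the source tree happened to live.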

Extraction: Avoid Common Hazards

Archive extraction is where most mishaps occur—files can overwrite unintended system locations if paths aren’t controlled. Always review archive content first:

tar -tvf archive.tar.xz | less

For controlled restores to a specific location:

tar -xJpf archive.tar.xz -C /tmp/test_restore
  • -x: extract
  • -J: xz
  • -p: preserve file modes exactly, ignoring the umask (already the default when run as root)
  • -C: target directory

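Gotcha: GNU tar strips a leading / from member names by default, so an archive cannot overwrite /etc/passwd unless you extract with -P (--absolute-names). Archives that contain absolute paths or ../ components are questionable, but they exist; never use -P on an archive you have not inspected.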
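
The inspection step can be scripted. The helper below is hypothetical (archive_paths_look_safe is not a tar feature) and only flags member names that are absolute or contain a .. component. It assumes GNU tar and file names without embedded newlines, and it treats an unreadable archive as having no suspicious names, so check tar's own errors separately.

```shell
# Hypothetical helper: succeed only if no member name is absolute or uses "..".
archive_paths_look_safe() {
    # -P makes tar print member names exactly as stored, leading "/" included.
    if tar -tPf "$1" | grep -Eq '^/|(^|/)\.\.(/|$)'; then
        return 1
    fi
    return 0
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
echo "data" > "$work/file.txt"
tar -cf "$work/relative.tar" -C "$work" file.txt    # stored as file.txt
tar -cPf "$work/absolute.tar" "$work/file.txt"      # -P keeps the leading /

for archive in "$work/relative.tar" "$work/absolute.tar"; do
    if archive_paths_look_safe "$archive"; then
        echo "$(basename "$archive"): paths look safe"
    else
        echo "$(basename "$archive"): suspicious paths, inspect before extracting"
    fi
done
```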


Sparse and Large Files: Avoid Bloat

Virtual machine disk images and sparse database files balloon to their full apparent size when archived naively. GNU tar stores only the regions that actually hold data if you use:

tar --sparse -cvf vm.img.tar vm.qcow2

Otherwise, that 50GB thin-provisioned image suddenly fills your NAS.
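
The difference is easy to measure on a small scale. This sketch creates a 32 MiB sparse file with truncate (disk.img is a stand-in, not a real VM image), archives it both ways, and compares the archive sizes. It assumes GNU tar and GNU coreutils.

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# 32 MiB apparent size, but only a few bytes of real data.
truncate -s 32M "$work/disk.img"
printf 'payload' | dd of="$work/disk.img" bs=1 seek=4096 conv=notrunc status=none

tar -cf "$work/plain.tar" -C "$work" disk.img            # holes written out as zeros
tar --sparse -cf "$work/sparse.tar" -C "$work" disk.img  # holes recorded, not stored

plain_size=$(wc -c < "$work/plain.tar")
sparse_size=$(wc -c < "$work/sparse.tar")
echo "plain: $plain_size bytes, sparse: $sparse_size bytes"
```

Extract with GNU tar as well, so the holes are recreated rather than written out in full.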


Practical Automation

A cron-friendly script:

#!/bin/bash
set -euo pipefail

src='/var/www'
dst="/mnt/nas/backups/www_$(date +%F).tar.gz"
log='/var/log/backup.log'

# Quote the patterns so the shell cannot expand them; no trailing slash on directory names.
if tar -czpf "$dst" --xattrs --exclude='cache' --exclude='*.log' "$src" >> "$log" 2>&1; then
  echo "Backup succeeded: $(date)" >> "$log"
else
  echo "Backup failed (tar exit $?): $(date)" >> "$log"
  exit 1
fi

Observed Issue: GNU tar exit code 1 means "some files differ", which during creation usually means a file changed while it was being read. Exit code 2 is a fatal error. Decide whether status 1 is critical for your use case. For live MySQL data, it is. For static app servers, usually not.
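
One way to encode that decision is a small wrapper that separates warnings from fatal errors. run_tar_checked is a hypothetical name, and the mapping assumes GNU tar's documented exit statuses (0 success, 1 some files differ, 2 fatal error).

```shell
# Hypothetical wrapper: run tar with the given arguments and classify the result.
run_tar_checked() {
    local status=0
    tar "$@" || status=$?
    case "$status" in
        0) echo "ok" ;;
        1) echo "warning: some files changed while being read" ;;
        *) echo "fatal: tar exited with status $status"
           return "$status" ;;
    esac
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/site"
echo "hello" > "$work/site/index.html"

# A normal run, then a run against a path that does not exist.
good=$(run_tar_checked -czf "$work/site.tar.gz" -C "$work" site)
bad=$(run_tar_checked -czf "$work/missing.tar.gz" -C "$work" no-such-dir 2>/dev/null) || true
echo "$good"
echo "$bad"
```

For a database backup you would likely promote status 1 to a failure as well; for static content, logging it as a warning may be enough.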


The Overlooked: Restore Validity, Compression Efficiency

Compression algorithms in tar are a trade-off:

  • gzip (-z): Fast, widely compatible (use for ephemeral or intermediate backups).
  • bzip2 (-j): Stronger compression, much slower (rarely justified today).
  • xz (-J): Highest compression, CPU-intensive (for archival).

Non-obvious tip: Always verify archive integrity before deleting source data:

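tar -tzf backup.tar.gz | head   # lists the first few entries; a quick structure check only

or, to read the entire stream so gzip can verify its CRC:

gunzip -c backup.tar.gz | tar -tv > /dev/null   # any gunzip or tar error means a damaged archive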

In production, consider pairing with sha256sum:

sha256sum backup.tar.gz > backup.tar.gz.sha256
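
The checksum file only helps if something checks it later. A minimal round trip, assuming GNU coreutils' sha256sum and a throwaway archive, looks like this; the last step deliberately corrupts the archive to show that the check catches it.

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
echo "payload" > "$work/file.txt"
tar -czf "$work/backup.tar.gz" -C "$work" file.txt

# Record the checksum next to the archive (relative name, so the pair can move together).
(cd "$work" && sha256sum backup.tar.gz > backup.tar.gz.sha256)

# Verify: exit status 0 means the archive matches its recorded checksum.
if (cd "$work" && sha256sum --check --quiet backup.tar.gz.sha256); then
    verified=yes
else
    verified=no
fi

# Simulate corruption and verify again.
printf 'x' >> "$work/backup.tar.gz"
if (cd "$work" && sha256sum --check --quiet backup.tar.gz.sha256 >/dev/null 2>&1); then
    corruption_detected=no
else
    corruption_detected=yes
fi
echo "verified=$verified corruption_detected=$corruption_detected"
```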

Side Note: Tar and CI/CD

If you package deployment artifacts for a CI/CD pipeline (e.g., GitHub Actions runners, GitLab CI), archives with absolute paths or mixed permissions can cause deployment failures or silent overwrites on target systems. Strip path prefixes or rewrite them with --transform:

tar -czvf webapp.tar.gz --transform='s|^dist/|release/|' dist/

This stores the files under release/ instead of dist/, so they extract into release/.
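
Because --transform takes a sed-style expression, it is worth listing the result before shipping the artifact. A sketch with an invented dist/app.js, assuming GNU tar:

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/dist"
echo "console.log('hi')" > "$work/dist/app.js"

# Rewrite the leading dist/ to release/ in stored member names.
tar -czf "$work/webapp.tar.gz" --transform='s|^dist/|release/|' -C "$work" dist/

# Inspect the stored names rather than trusting the expression.
listing=$(tar -tzf "$work/webapp.tar.gz")
echo "$listing"
```

Read the whole listing, including the top-level directory entry, to confirm every member was renamed the way you intended.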


Summary Table: Tar Usage Patterns

| Scenario | Command Example | Notes |
| --- | --- | --- |
| Fast temp backup, no compression | tar -cpf /tmp/b.tar /data | For throw-away snapshots |
| Compressed, meta-preserving web assets | tar -czpf assets.tar.gz --xattrs /srv/web | Ownership, xattrs saved |
| Avoid archiving logs in prod replica | tar --exclude='*.log' -cf prod.tar /app | Reduces archive size |
| Sparse file archive | tar --sparse -cf vm.tar /var/lib/libvirt/images/disk.qcow2 | Shrinks virtual disks |
| SELinux/ACL-respecting rootfs snapshot | tar --acls --selinux -czpf etc.tar.gz /etc | RHEL/Fedora only |

Observations & Recommendations

  • Tar isn’t perfect for massive datasets—consider alternatives (rsync, BorgBackup, restic) if deduplication or incremental backups are required.
  • Run a current GNU tar release (newer than v1.29) if you handle xattrs or SELinux contexts; tar --version shows what is installed.
  • Never trust archives without testing extraction. Restoration is the only proof.
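
A restore test can be as small as extracting into a scratch directory and comparing it with the source. The sketch below uses an invented src tree and diff -r for the comparison; it assumes GNU tar and checks file contents only, not ownership or extended attributes.

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src/conf" "$work/restore"
echo "port=8080" > "$work/src/conf/app.conf"

# Back up, then restore somewhere harmless.
tar -czpf "$work/src.tar.gz" -C "$work" src
tar -xzpf "$work/src.tar.gz" -C "$work/restore"

# diff -r exits 0 only when both trees have the same files and contents.
if diff -r "$work/src" "$work/restore/src" >/dev/null; then
    verdict="restore matches source"
else
    verdict="restore differs from source"
fi
echo "$verdict"
```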

Avoid surprises. Tar carefully.