Precision Install: NVIDIA Driver Deployment on Linux
Optimizing compute workloads, accelerating ML training, or stabilizing graphical environments: each demands direct control over how NVIDIA drivers interface with the Linux kernel. A mismatched or outdated driver isn’t just suboptimal; it’s a source of kernel panics, DKMS build failures, or black screens. After dozens of production deployments and developer workstations, a few things consistently matter: cleaning out cruft, explicit version pinning, and understanding the trade-offs hidden behind a simple apt install.
Proprietary Versus Nouveau: Why It Matters
Default Linux installations typically load the open-source nouveau driver stack for NVIDIA hardware. Functional for desktops, yes, but with performance ceilings, lacking CUDA, and with shaky support for modern GPGPU tasks or advanced HDMI signaling.
Driver | CUDA Support | Performance | Kernel Compatibility |
---|---|---|---|
nouveau | No | Moderate | High* |
nvidia | Yes | Optimal | Kernel-dependent |
*Note: Nouveau tracks kernel updates well but quickly falls behind on new hardware generations.
Recommendation: Default to proprietary drivers for all CUDA, gaming, deep learning, or multi-monitor workflows.
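Not sure which stack is currently driving the card? A quick check before changing anything (a minimal sketch using standard tools):
# Show which kernel driver is bound to the GPU ("Kernel driver in use: ...")
lspci -k | grep -EA3 'VGA|3D'
# List loaded nouveau/NVIDIA modules; only one stack should appear
lsmod | grep -E 'nouveau|nvidia'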
Preparation: System State and Dependency Hygiene
Residual drivers and mismatched DKMS modules routinely cause silent boot failures. It’s not hypothetical; the logfiles will prove it (/var/log/Xorg.0.log will be littered with unresolved symbol errors if you skip cleanup).
Purge first. On Ubuntu/Debian:
sudo apt-get purge '^nvidia-.*' '^libnvidia-.*' 'xserver-xorg-video-nouveau'
sudo apt autoremove --purge
sudo apt autoclean
Update your package indexes (and take this moment to snapshot your system if it’s a shared workstation):
sudo apt update && sudo apt upgrade -y
On Fedora:
sudo dnf remove '*nvidia*' 'xorg-x11-drv-nouveau'
sudo dnf upgrade --refresh
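A quick sanity check that the purge actually removed everything (a sketch; pick the line matching your package manager):
# Ubuntu/Debian: ideally lists nothing, or only packages in the "rc" (config-only) state
dpkg -l | grep -i nvidia
# Fedora: should return no installed packages
rpm -qa | grep -i nvidia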
Disabling Nouveau: Required for Clean Installs
The open-source stack will load before NVIDIA modules unless explicitly blacklisted. This is a common root cause behind the notorious nvidia-modeset: module not found error.
Blacklist nouveau:
cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
Regenerate your initramfs:
sudo update-initramfs -u
or, on Fedora:
sudo dracut --force
Finalize by rebooting. Partial steps here account for >70% of failed installations I’ve seen in project onboarding scripts.
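Once rebooted, it’s worth confirming the blacklist actually took effect; a minimal check:
# Should produce no output if nouveau stayed unloaded
lsmod | grep nouveau
# Confirm the blacklist file is in place
cat /etc/modprobe.d/blacklist-nouveau.conf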
Sourcing the Correct Driver: Pinning Release Channels
For stability and timely bugfixes, rely on the graphics-drivers PPA (Ubuntu) or RPM Fusion (Fedora) instead of direct .run files from NVIDIA’s site, except when exact version pinning is essential (e.g., legacy compute environments or kernel version lockstep).
Ubuntu/Debian:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
Fedora (using RPM Fusion):
sudo dnf install https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
Known issue: Mixing installer sources (PPA, local .run, RPM Fusion) often leads to incomplete removal or parallel driver stacks, causing transient Xorg failures or XWayland mismatches.
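If a machine may have been touched by NVIDIA’s .run installer in the past, check for its uninstaller before layering packaged drivers on top. The path below is where the .run installer normally drops it; treat that as an assumption and verify on your system:
# The .run installer typically leaves an uninstaller behind
ls /usr/bin/nvidia-uninstall 2>/dev/null && echo ".run install detected"
# If present, remove that stack first
sudo /usr/bin/nvidia-uninstall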
Identification and Version Selection
Too often, users skip hardware identification, install the “latest” driver, and hit regression bugs with Turing/Ampere cards. Always confirm:
lspci -nn | grep -i nvidia
# Or, for detail:
sudo ubuntu-drivers devices # Ubuntu only
This returns a list like:
model : GA104 [GeForce RTX 3070]
driver : nvidia-driver-535 - distro non-free recommended
Pin that version:
sudo apt install nvidia-driver-535
Apt’s metapackage mechanism (nvidia-driver-XXX) prevents dependency drift. For kernel upgrades, DKMS builds new modules automatically, unless header packages are missing (see below).
On Fedora:
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda
Arch Linux:
sudo pacman -S nvidia nvidia-utils nvidia-settings
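On DKMS-based installs (Ubuntu/Debian; Fedora’s akmods behave similarly), it is worth confirming the module is actually built for the running kernel, especially around kernel upgrades. A minimal check:
# Each registered nvidia module should show as "installed" for the running kernel
dkms status
# Headers for the running kernel must be present or DKMS cannot rebuild
sudo apt install linux-headers-$(uname -r)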
Reboot and Validate — Don’t Assume, Confirm
A reboot is required to load the new modules. Skipping it sometimes leaves a mismatched userland/kernel state in which nvidia-smi reports No devices were found.
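A quick way to diagnose that hybrid state is to compare the module loaded in the kernel against the one installed on disk (a sketch; the /proc file only exists while a module is loaded):
# Version of the NVIDIA module currently loaded, if any
cat /proc/driver/nvidia/version
# Version of the module installed on disk; the two should match after a reboot
modinfo -F version nvidia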
Post-reboot, check driver and card status:
nvidia-smi
Sample output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
|--------------------------+----------------------+--------------------------|
| GPU Name ... | Processes | |
+--------------------------+----------------------+--------------------------+
Or, verify that Xorg is using NVIDIA:
glxinfo | grep "OpenGL renderer"
# Should return: "OpenGL renderer string: NVIDIA ..."
Installing CUDA Toolkit (Optional, but Often Required)
Most ML frameworks require the toolkit to build against headers and link CUDA binaries. Installing only the driver is insufficient. Note: CUDA versions are not always backward-compatible. Check framework (e.g. PyTorch, TensorFlow) requirements first.
Ubuntu Example:
- Download from https://developer.nvidia.com/cuda-downloads (select the OS and version to match your driver).
- Install with the package manager (not the .run installer, unless you are building from-scratch environments).
- Append to your PATH and LD_LIBRARY_PATH in ~/.bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Reload:
source ~/.bashrc
Validate install:
nvcc --version
If you see command not found, either the environment variables weren’t set or the toolkit install is incomplete.
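Keep in mind that the CUDA version in the nvidia-smi header is the maximum the driver supports, not the installed toolkit. Comparing the two catches most framework mismatches; a minimal check:
# Maximum CUDA version the driver supports
nvidia-smi | grep "CUDA Version"
# Version of the installed toolkit (nvcc)
nvcc --version | grep release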
Troubleshooting — Lessons From the Field
- Black Screen or Login Loop: Switch to a TTY (Ctrl+Alt+F3) and check the logs. Most frequent root cause: stale nouveau modules or a header mismatch.
modprobe: ERROR: could not insert 'nvidia': No such device
Solution: purge again, rebuild the initramfs, and verify kernel headers:
sudo apt install linux-headers-$(uname -r)
- DKMS Build Failures: Common after kernel upgrades without a corresponding DKMS/driver module rebuild. Inspect /var/lib/dkms/nvidia/ for error logs (see the rebuild sketch after this list).
- Multiple GPUs: Use nvidia-settings or prime-select to choose the active GPU on hybrid laptops. Multi-GPU isn’t just plug-and-play; some BIOS settings (e.g., "Hybrid Graphics") must be disabled.
- Common Gotcha: If installing via a remote SSH session, do not run X display update commands; this can crash remote desktops.
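For the DKMS failure case above, a manual rebuild usually surfaces the real error. A sketch, assuming the packaged nvidia module is registered with DKMS; substitute the version string that dkms status reports:
# Show registered modules and their build state per kernel
dkms status
# Rebuild and install for the running kernel (535.86.05 is a placeholder version)
sudo dkms install nvidia/535.86.05 -k $(uname -r)
# Build logs end up under /var/lib/dkms/nvidia/<version>/build/make.log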
Summary Table: Key Steps
Step | Command/Action | Critical Notes |
---|---|---|
Purge old drivers | apt-get purge '^nvidia-.*' | Also remove nouveau |
Blacklist nouveau | cat > /etc/modprobe.d/blacklist-nouveau.conf | Regenerate initramfs, then reboot |
Add driver repo | add-apt-repository ppa:graphics-drivers/ppa | Pin versions afterwards |
Install driver | apt install nvidia-driver-XXX | Use recommended version, confirm with lspci |
Validate | nvidia-smi, glxinfo | Check for correct version, active GPU |
CUDA toolkit | Download and set PATH/LD_LIBRARY_PATH | Match toolkit/driver to framework reqs |
Practical Example:
A researcher deploying PyTorch with an RTX 4090 on Ubuntu 22.04, kernel 6.5, required nvidia-driver-545 and cuda-toolkit-12-3. The initial DKMS build failed due to outdated kernel headers. After installing linux-headers-6.5.0-XX-generic, DKMS succeeded, and both nvidia-smi and nvcc --version produced valid outputs.
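Condensed, the working sequence looked roughly like this (a sketch; exact header and toolkit package names depend on the kernel and on how the NVIDIA CUDA repository was added):
sudo apt install linux-headers-$(uname -r)   # fix the DKMS prerequisite first
sudo apt install nvidia-driver-545           # pin the 545 series
sudo apt install cuda-toolkit-12-3           # assumes the CUDA apt repo is configured
sudo reboot
nvidia-smi && nvcc --version                 # both should report valid versions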
Alternative: For true reproducibility in CI pipelines, consider running GPU workloads in container images with pinned CUDA toolkit versions. The kernel driver stays on the host, so host kernel and driver compatibility remain an external constraint.
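A minimal sketch of that approach, assuming Docker with the NVIDIA Container Toolkit configured on the host and that the referenced nvidia/cuda tag is available:
# The host driver is injected into the container; only the toolkit is pinned in the image
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi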
No deployment is perfect, but disciplined cleanup, explicit version selection, and verification eliminate 90% of “black screen” or module failure cases before they start.