Precision Install: NVIDIA Driver Deployment on Linux
Optimizing compute workloads, accelerating ML training, or stabilizing graphical environments: each demands direct control over how NVIDIA drivers interface with the Linux kernel. A mismatched or outdated driver isn’t just suboptimal; it’s a source of kernel panics, DKMS build failures, or black screens. After dozens of production deployments and developer workstations, a few things consistently matter: cleaning out cruft, explicit version pinning, and understanding the trade-offs hidden behind a simple apt install.
Proprietary Versus Nouveau: Why It Matters
Default Linux installations typically load the open-source nouveau driver stack for NVIDIA hardware. Functional for desktops, yes, but with performance ceilings, lacking CUDA, and with shaky support for modern GPGPU tasks or advanced HDMI signaling.
Driver | CUDA Support | Performance | Kernel Compatibility |
---|---|---|---|
nouveau | No | Moderate | High* |
nvidia | Yes | Optimal | Kernel-dependent |
*Note: Nouveau tracks kernel updates well but quickly falls behind on new hardware generations.
Recommendation: Default to proprietary drivers for all CUDA, gaming, deep learning, or multi-monitor workflows.
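Not sure which stack is currently driving the card? A quick check before changing anything (a minimal sketch using standard tools):
# Show which kernel driver is bound to the GPU ("Kernel driver in use: ...")
lspci -k | grep -EA3 'VGA|3D'
# List loaded nouveau/NVIDIA modules; only one stack should appear
lsmod | grep -E 'nouveau|nvidia'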
Preparation: System State and Dependency Hygiene
Residual drivers and mismatched DKMS modules routinely cause silent boot failures. It’s not hypothetical; the logfiles will prove it (/var/log/Xorg.0.log will be littered with unresolved symbol errors if you skip cleanup).
Purge first. On Ubuntu/Debian:
sudo apt-get purge '^nvidia-.*' '^libnvidia-.*' 'xserver-xorg-video-nouveau'
sudo apt autoremove --purge
sudo apt autoclean
Update your package indexes (and take this moment to snapshot your system if it’s a shared workstation):
sudo apt update && sudo apt upgrade -y
On Fedora:
sudo dnf remove '*nvidia*' 'xorg-x11-drv-nouveau'
sudo dnf upgrade --refresh
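A quick sanity check that the purge actually removed everything (a sketch; pick the line matching your package manager):
# Ubuntu/Debian: ideally lists nothing, or only packages in the "rc" (config-only) state
dpkg -l | grep -i nvidia
# Fedora: should return no installed packages
rpm -qa | grep -i nvidia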
Disabling Nouveau: Required for Clean Installs
The open-source stack will load before NVIDIA modules unless explicitly blacklisted. This is a common root cause behind the notorious nvidia-modeset: module not found error.
Blacklist nouveau:
cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
Regenerate your initramfs:
sudo update-initramfs -u
or, on Fedora:
sudo dracut --force
Finalize by rebooting. Partial steps here account for >70% of failed installations I’ve seen in project onboarding scripts.
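Once rebooted, it’s worth confirming the blacklist actually took effect; a minimal check:
# Should produce no output if nouveau stayed unloaded
lsmod | grep nouveau
# Confirm the blacklist file is in place
cat /etc/modprobe.d/blacklist-nouveau.conf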
Sourcing the Correct Driver: Pinning Release Channels
For stability and timely bugfixes, rely on the graphics-drivers PPA (Ubuntu) or RPM Fusion (Fedora) instead of direct .run files from NVIDIA’s site, except when exact version pinning is essential (e.g., legacy compute environments or kernel version lockstep).
Ubuntu/Debian:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
Fedora (using RPM Fusion):
sudo dnf install https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
Known issue: Mixing installer sources (PPA, local .run, RPM Fusion) often leads to incomplete removal or parallel driver stacks, causing transient Xorg failures or XWayland mismatches.
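If a machine may have been touched by NVIDIA’s .run installer in the past, check for its uninstaller before layering packaged drivers on top. The path below is where the .run installer normally drops it; treat that as an assumption and verify on your system:
# The .run installer typically leaves an uninstaller behind
ls /usr/bin/nvidia-uninstall 2>/dev/null && echo ".run install detected"
# If present, remove that stack first
sudo /usr/bin/nvidia-uninstall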
Identification and Version Selection
Too often, users skip hardware identification, install the “latest” driver, and hit regression bugs with Turing/Ampere cards. Always confirm:
lspci -nn | grep -i nvidia
# Or, for detail:
sudo ubuntu-drivers devices # Ubuntu only
This returns a list like:
model : GA104 [GeForce RTX 3070]
driver : nvidia-driver-535 - distro non-free recommended
Pin that version:
sudo apt install nvidia-driver-535
Apt’s metapackage mechanism (nvidia-driver-XXX) prevents dependency drift. For kernel upgrades, DKMS builds new modules automatically, unless header packages are missing (see below).
On Fedora:
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda
Arch Linux:
sudo pacman -S nvidia nvidia-utils nvidia-settings
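On DKMS-based installs (Ubuntu/Debian; Fedora’s akmods behave similarly), it is worth confirming the module is actually built for the running kernel, especially around kernel upgrades. A minimal check:
# Each registered nvidia module should show as "installed" for the running kernel
dkms status
# Headers for the running kernel must be present or DKMS cannot rebuild
sudo apt install linux-headers-$(uname -r)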
Reboot and Validate — Don’t Assume, Confirm
A reboot is required to load the new modules. Skipping it sometimes leaves a mismatched userland/kernel state in which nvidia-smi reports No devices were found.
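A quick way to diagnose that hybrid state is to compare the module loaded in the kernel against the one installed on disk (a sketch; the /proc file only exists while a module is loaded):
# Version of the NVIDIA module currently loaded, if any
cat /proc/driver/nvidia/version
# Version of the module installed on disk; the two should match after a reboot
modinfo -F version nvidia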
Post-reboot, check driver and card status:
nvidia-smi
Sample output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
|--------------------------+----------------------+--------------------------|
| GPU Name ... | Processes | |
+--------------------------+----------------------+--------------------------+
Or, verify that Xorg is using NVIDIA:
glxinfo | grep "OpenGL renderer"
# Should return: "OpenGL renderer string: NVIDIA ..."
Installing CUDA Toolkit (Optional, but Often Required)
Most ML frameworks require the toolkit to build against headers and link CUDA binaries. Installing only the driver is insufficient. Note: CUDA versions are not always backward-compatible. Check framework (e.g. PyTorch, TensorFlow) requirements first.
Ubuntu Example:
- Download from https://developer.nvidia.com/cuda-downloads (select the OS and version to match your driver).
- Install with the package manager (not the .run installer, unless you are building from-scratch environments).
- Append to your PATH and LD_LIBRARY_PATH in ~/.bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Reload:
source ~/.bashrc
Validate install:
nvcc --version
If you see command not found, either the environment variables weren’t set or the toolkit install is incomplete.
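Keep in mind that the CUDA version in the nvidia-smi header is the maximum the driver supports, not the installed toolkit. Comparing the two catches most framework mismatches; a minimal check:
# Maximum CUDA version the driver supports
nvidia-smi | grep "CUDA Version"
# Version of the installed toolkit (nvcc)
nvcc --version | grep release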
Troubleshooting — Lessons From the Field
- Black Screen or Login Loop: Switch to a TTY (Ctrl+Alt+F3) and check the logs. Most frequent root cause: stale nouveau modules or a header mismatch.
modprobe: ERROR: could not insert 'nvidia': No such device
Solution: purge again, rebuild the initramfs, and verify kernel headers:
sudo apt install linux-headers-$(uname -r)
- DKMS Build Failures: Common after kernel upgrades without a corresponding DKMS/driver module rebuild. Inspect /var/lib/dkms/nvidia/ for error logs (see the rebuild sketch after this list).
- Multiple GPUs: Use nvidia-settings or prime-select to choose the active GPU on hybrid laptops. Multi-GPU isn’t just plug-and-play; some BIOS settings (e.g., "Hybrid Graphics") must be disabled.
- Common Gotcha: If installing via a remote SSH session, do not run X display update commands; this can crash remote desktops.
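For the DKMS failure case above, a manual rebuild usually surfaces the real error. A sketch, assuming the packaged nvidia module is registered with DKMS; substitute the version string that dkms status reports:
# Show registered modules and their build state per kernel
dkms status
# Rebuild and install for the running kernel (535.86.05 is a placeholder version)
sudo dkms install nvidia/535.86.05 -k $(uname -r)
# Build logs end up under /var/lib/dkms/nvidia/<version>/build/make.log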
Summary Table: Key Steps
Step | Command/Action | Critical Notes |
---|---|---|
Purge old drivers | apt-get purge '^nvidia-.*' | Also remove nouveau |
Blacklist nouveau | cat > /etc/modprobe.d/blacklist-nouveau.conf | Regenerate initramfs, then reboot |
Add driver repo | add-apt-repository ppa:graphics-drivers/ppa | Pin versions afterwards |
Install driver | apt install nvidia-driver-XXX | Use recommended version, confirm with lspci |
Validate | nvidia-smi, glxinfo | Check for correct version, active GPU |
CUDA toolkit | Download and set PATH/LD_LIBRARY_PATH | Match toolkit/driver to framework reqs |
Practical Example:
A researcher deploying PyTorch with an RTX 4090 on Ubuntu 22.04, kernel 6.5, required nvidia-driver-545 and cuda-toolkit-12-3. The initial DKMS build failed due to outdated kernel headers. After installing linux-headers-6.5.0-XX-generic, DKMS succeeded, and both nvidia-smi and nvcc --version produced valid outputs.
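Condensed, the working sequence looked roughly like this (a sketch; exact header and toolkit package names depend on the kernel and on how the NVIDIA CUDA repository was added):
sudo apt install linux-headers-$(uname -r)   # fix the DKMS prerequisite first
sudo apt install nvidia-driver-545           # pin the 545 series
sudo apt install cuda-toolkit-12-3           # assumes the CUDA apt repo is configured
sudo reboot
nvidia-smi && nvcc --version                 # both should report valid versions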
Alternative: For true reproducibility in CI pipelines, consider running GPU workloads in container images with pinned CUDA toolkit versions. The kernel driver stays on the host, so host kernel and driver compatibility remain an external constraint.
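A minimal sketch of that approach, assuming Docker with the NVIDIA Container Toolkit configured on the host and that the referenced nvidia/cuda tag is available:
# The host driver is injected into the container; only the toolkit is pinned in the image
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi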
No deployment is perfect, but disciplined cleanup, explicit version selection, and verification eliminate 90% of “black screen” or module failure cases before they start.