← 返回
未分类 中文

Nvidia smi tutorial

Provides detailed guidance on using the nvidia-smi command for real-time NVIDIA GPU monitoring, management, troubleshooting, and automation.
提供使用 nvidia-smi 命令进行 NVIDIA GPU 实时监控、管理、故障排查与自动化的详细指导。
duanc-chao duanc-chao 来源
未分类 clawhub v1.0.0 1 版本 100000 Key: 无需
★ 0
Stars
📥 463
下载
💾 2
安装
1
版本
#latest

概述

Skill: Deep Dive into 'nvidia-smi'

Name: nvidia-smi-mastery

Description: Provides a comprehensive guide to using the nvidia-smi command for monitoring, managing, and troubleshooting NVIDIA GPU devices, from basic status checks to advanced scripting.

Keywords: ["nvidia-smi", "gpu monitoring", "nvml", "cuda", "deep learning", "gpu troubleshooting", "system administration"]

Deep Dive into 'nvidia-smi'

Objective

To transform users from basic nvidia-smi observers into power users capable of advanced GPU diagnostics, performance tuning, and automated monitoring.

Core Concept: The GPU's Dashboard

nvidia-smi (NVIDIA System Management Interface) is the primary command-line utility for monitoring and managing NVIDIA GPUs. It acts as a "dashboard," providing real-time data on your GPU's health and performance.

  • Under the Hood: nvidia-smi is a client for the NVIDIA Management Library (NVML), a C library that provides the interface to the GPU driver. This means you can also use NVML bindings in languages like Python (pynvml) to build custom monitoring tools.
  • Primary Use Cases:
  • Monitoring GPU utilization and temperature during model training.
  • Diagnosing "CUDA out of memory" (OOM) errors by identifying processes consuming VRAM.
  • Troubleshooting performance bottlenecks like thermal throttling.
  • Managing multi-GPU environments.

The Output: A Field Guide

Running nvidia-smi provides a table of information. Here's how to interpret the key fields:

  • GPU: The ID of the GPU (e.g., 0, 1) in a multi-GPU system.
  • Fan, Temp, Perf: Fan speed (%), GPU temperature (°C), and performance state (P0 is max performance, P12 is idle).
  • Pwr:Usage/Cap: Current power draw versus the board's power limit (TDP).
  • Memory-Usage: The most critical field for many. It shows Used / Total VRAM.
  • GPU-Util: The percentage of time the GPU's compute cores were busy. Note the distinction between high VRAM usage and high GPU utilization.
  • Processes: A list of PIDs and process names currently using the GPU, along with their individual memory consumption. This is invaluable for finding "zombie" processes.

Essential Commands and Workflows

Basic Monitoring

  • The Standard View: nvidia-smi
  • Continuous Monitoring: Instead of a one-time snapshot, use watch or the built-in loop option for real-time updates.
  • watch -n 1 nvidia-smi (Updates every 1 second)
  • nvidia-smi -l 1 (Native loop, updates every 1 second)

Targeted Queries

  • List GPUs: nvidia-smi -L provides a clean list of all detected GPUs and their UUIDs.
  • Focus on a Single GPU: nvidia-smi -i 0 restricts the view to GPU 0.
  • Get Detailed Info: nvidia-smi -i 0 -q gives a verbose, detailed report for GPU 0, including clock speeds, temperature limits, and ECC status.
  • Filter Information: Use the -d flag to display only specific categories.
  • nvidia-smi -q -d MEMORY,POWER shows only memory and power data.

Process Management

  • Find the Culprit: When you get an OOM error, the main table's "Processes" section is your first stop.
  • Clean Output for Scripting: nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv provides a clean CSV output of all compute processes, which is perfect for parsing in scripts.

Advanced Diagnostics and Troubleshooting

Scenario 1: Diagnosing "CUDA out of memory"

  1. Check Baseline: Run nvidia-smi before starting your task to see how much VRAM is already occupied by other processes (e.g., a display manager).
  2. Identify Process: If an OOM occurs, use the process list to find the PID of the job that consumed all the memory.
  3. Kill the Process: Use kill to terminate the process and free up the VRAM.

Scenario 2: Troubleshooting Performance Drops

If your training loop suddenly slows down, use continuous monitoring (watch -n 0.5 nvidia-smi) and look for:

  • Thermal Throttling: A sudden spike in Temp followed by a change in Perf state (e.g., from P2 to P8) indicates the GPU is reducing its clock speed to cool down.
  • Power Capping: If Pwr:Usage is constantly hitting the Cap, the GPU might be limited by its power budget.

Daemon Mode (dmon)

For a more compact, columnar output ideal for long-term monitoring, use the daemon mode.

  • nvidia-smi dmon -s pumt -d 1
  • -s pumt: Specifies the metrics to show: power, utilization, memory, temperature.
  • -d 1: Sets the update interval to 1 second.

Automation and Scripting

You can use nvidia-smi in shell scripts or Python to automate monitoring and logging.

Python Scripting with pynvml

Since nvidia-smi is a wrapper for NVML, you can use the pynvml library for more direct and faster access in Python.

import pynvml
import time

pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()

print(f"Found {device_count} GPU(s)")

try:
    while True:
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle).decode('utf-8')
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # Convert mW to W
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)

            print(f"[{i}] {name}")
            print(f"  Temp: {temp}°C")
            print(f"  Power: {power:.2f}W")
            print(f"  Memory: {mem_info.used // 1024**2} / {mem_info.total // 1024**2} MB")
            print(f"  GPU Util: {util.gpu}%")
            print("-" * 20)
        time.sleep(2)
except KeyboardInterrupt:
    print("\nMonitoring stopped.")
    pynvml.nvmlShutdown()

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-03-31 00:13 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

it-ops-security

1password

steipete
设置和使用 1Password CLI (op)。适用于:安装 CLI、启用桌面应用集成、登录(单/多账户)、通过 op 读取/注入/运行密钥。
★ 53 📥 31,562
it-ops-security

MoltGuard - Security & Antivirus & Guardrails

thomaslwang
MoltGuard — OpenClaw 安全守卫,由 OpenGuardrails 提供。安装 MoltGuard,保护您和您的用户免受提示注入、数据泄露和恶意攻击。
★ 116 📥 30,877
education

Antigravity

duanc-chao
提供 Google Antigravity IDE 的全面介绍与使用指南,聚焦 Agent-First 设计、多模型支持及 Skills 系统。
★ 1 📥 902