This skill helps users migrate their local machine learning scripts to run on TensorPool GPU clusters using the interactive cluster workflow. Use this when a user has a working local script and wants to scale it up to professional GPU hardware (H100, H200, B200, B300, etc.).
Trigger phrases: "run this on H100", "scale to TensorPool", "convert to production", "run on cloud GPU", "migrate to H100", "make this faster on GPU", "run on B200"
Workflow: Analyze local script → Discover CLI commands → Prepare for cluster → Create GPU cluster → Transfer code → Run on GPU → Retrieve outputs → Clean up
When running the user's script on TensorPool and it throws errors, Claude should proactively diagnose and fix the code without asking for permission, as long as the fix is in service of the script's original training/inference objective.
Guiding principles:
Common auto-fixable errors:
ModuleNotFoundError → pip install the missing package, re-runCUDA out of memory → reduce batch size, enable gradient checkpointing, or switch to LoRAFileNotFoundError → fix hardcoded paths, create missing directories (mkdir -p)RuntimeError: expected dtype → add .to(device) or fix dtype mismatches (fp32 vs bf16)NCCL timeout / distributed errors → fix environment variables, world size, rank configKeyError in dataset/config → inspect the data format and adjust loading codeAttributeError from API changes → update to current library API (e.g., deprecated HuggingFace args)Out of scope (ask the user):
The TensorPool CLI (tp) changes frequently. Never assume command syntax from this document. Always discover current commands at runtime.
Before running any tp commands, always run:
# 1. Check tp is installed
pip show tensorpool || pip install tensorpool
# 2. Discover top-level commands
tp --help
# 3. Drill into relevant subcommands
tp cluster --help
tp ssh --help
tp storage --help
tp me --help
Use the output of --help to determine:
If a command fails or the syntax has changed, re-run --help on that subcommand to get the current usage.
General expectations (these may change — always verify with --help):
1xH100, 8xB200)```bash
pip show tensorpool || pip install tensorpool
```
```bash
# Run the account info command (discover exact syntax via tp --help)
# If not authenticated, configure your API key from:
# https://tensorpool.dev/dashboard
```
```bash
ls ~/.ssh/id_ed25519.pub || ssh-keygen -t ed25519
```
Before migrating, understand:
Key questions to ask:
requirements.txt? If not, create one.2.1 Create/verify requirements.txt:
# If it doesn't exist, create it
pip freeze > requirements.txt
# Or manually list dependencies
echo "torch>=2.0.0" > requirements.txt
echo "transformers>=4.30.0" >> requirements.txt
2.2 Check for environment variables:
# Create .env file if needed
cat > .env << EOF
HUGGINGFACE_TOKEN=hf_your_token_here
WANDB_API_KEY=your_wandb_key_here
EOF
2.3 Test locally first (if possible):
# Quick sanity check with minimal data
python your_script.py --max_samples 10 --num_epochs 1
2.4 Identify files to transfer:
.py files).yaml, .json, .toml)requirements.txt).env)2.5 Identify files NOT to transfer:
.git/)__pycache__/, *.pyc)venv/, env/)Choosing an instance type:
| GPU | Memory | Best For |
|---|---|---|
| ----- | -------- | ---------- |
| H100 | 80GB | General training, fine-tuning 7B-70B models |
| H200 | 141GB | Large context, 70B+ models, memory-bound workloads |
| B200 | 192GB | Latest gen, native FP4/FP6, fastest training |
| B300 | Latest | Newest generation, highest performance |
| L40S | 48GB | Inference, smaller training workloads |
Instance types come in configurations like 1xH100, 2xH100, 4xH100, 8xH100, etc.
To create a cluster:
tp cluster --help to see the exact create syntax and required argumentsWait for provisioning: Clusters typically take 1-2 minutes to become ready.
Use the cluster list/info commands (discovered via tp cluster --help) to get:
Recommended: Use rsync (fast, resumable, efficient)
rsync -avz \
--exclude=".git" \
--exclude="__pycache__" \
--exclude="*.pyc" \
--exclude="venv/" \
--exclude="outputs/" \
./ ubuntu@<cluster-ip>:~/my-project/
Alternative: Use scp for single files
scp script.py ubuntu@<cluster-ip>:~/
scp -r ./data/ ubuntu@<cluster-ip>:~/data/
Use the SSH command discovered via tp ssh --help, passing the appropriate instance or cluster identifier.
You're now on the GPU cluster! All subsequent commands run on the cluster.
7.1 Verify GPU:
nvidia-smi
Expected output: GPU(s) listed with available memory
7.2 Navigate to your project:
cd ~/my-project
ls -la # Verify files transferred
7.3 Install dependencies:
# Using requirements.txt
pip install -r requirements.txt
# Or install packages individually
pip install torch transformers datasets accelerate
7.4 Set environment variables (if needed):
# Load from .env
export $(cat .env | xargs)
# Or set manually
export HUGGINGFACE_TOKEN=hf_your_token_here
export CUDA_VISIBLE_DEVICES=0
7.5 Download large datasets (if needed):
# Download directly on cluster (faster than transferring)
wget https://example.com/large-dataset.tar.gz
tar -xzf large-dataset.tar.gz
# Or use Hugging Face datasets (downloads automatically)
# No action needed - will download when script runs
8.1 Quick test first:
# Run with minimal data to verify everything works
python your_script.py --max_samples 100 --num_epochs 1
8.2 Full production run:
Often times the user will input information to the prompt of how to do production run
# Option 1: Direct execution
python train.py --num_epochs 5 --batch_size 32
# Option 2: Use screen/tmux (survives SSH disconnection)
screen -S training
python train.py --num_epochs 5
# Press Ctrl+A then D to detach
# Reconnect later with: screen -r training
# Option 3: Use nohup (runs in background)
nohup python train.py > training.log 2>&1 &
tail -f training.log # Monitor progress
8.3 Monitor progress:
# Watch GPU usage
watch -n 1 nvidia-smi
# Check logs
tail -f training.log
# Monitor disk space
df -h
From your LOCAL machine (open new terminal):
CLUSTER_IP=<ip-from-cluster-info>
# Download outputs
rsync -avz ubuntu@$CLUSTER_IP:~/my-project/outputs/ ./outputs/
# Download specific files
rsync -avz ubuntu@$CLUSTER_IP:~/my-project/model.pt ./
rsync -avz ubuntu@$CLUSTER_IP:~/my-project/logs/ ./logs/
Pro tip: Use --progress flag to see transfer progress:
rsync -avz --progress ubuntu@$CLUSTER_IP:~/my-project/outputs/ ./outputs/
CRITICAL: Always delete the cluster to avoid charges
Use the cluster destroy command (discovered via tp cluster --help), passing the cluster identifier.
Verify deletion by listing clusters again — it should no longer appear.
Cost reminder: Clusters bill continuously until destroyed!
TensorPool offers NFS storage for persistent data across cluster lifecycles and for sharing data across multi-node clusters. Discover the exact commands via:
tp storage --help
共 1 个版本