nanogpt-training

Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, tokenized dataset loading (e.g., HuggingFace Hub shard...
Author: lnj22 | Source: clawhub | Version: v0.1.0 | API key: not required

NanoGPT Training

Overview

This skill covers training GPT-2 scale models (~124M parameters) efficiently on a single GPU. It provides:

  • GPT-124M Architecture: Standard transformer with RoPE and modern optimizations
  • Tokenized Datasets: Loading pre-tokenized shards from HuggingFace Hub or local files
  • Modern Optimizers: Muon optimizer with Newton-Schulz orthogonalization
  • Mixed Precision: bfloat16 training on A100 for 2x speedup
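The Newton-Schulz orthogonalization named in the Muon bullet can be sketched in a few lines. This is a minimal NumPy sketch rather than the skill's own implementation: the function name `newton_schulz_orthogonalize` is hypothetical, and the quintic coefficients are the values commonly quoted for Muon in modded-nanogpt, used here as an assumption. The iteration pushes the singular values of a gradient matrix toward 1 without computing an SVD; with these coefficients they land near 1 rather than exactly on it.

```python
import numpy as np

# Quintic coefficients commonly quoted for Muon (assumption, not stated in this document).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximate the orthogonal factor U @ V.T of G = U @ S @ V.T without an SVD."""
    a, b, c = NS_COEFFS
    X = np.asarray(G, dtype=np.float64)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
    X = X / (np.linalg.norm(X) + eps)  # Frobenius norm bounds every singular value by 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # odd polynomial applied to the singular values of X
    return X.T if transposed else X


rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))  # stand-in for a 2D weight gradient
update = newton_schulz_orthogonalize(G)
print(np.linalg.svd(update, compute_uv=False).round(2))
```

In a Muon-style optimizer this step would be applied to the momentum buffer of each 2D weight matrix before the update is scaled by the learning rate; embeddings and 1D parameters are usually left to a standard optimizer such as AdamW.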

Training options:

  • Baseline GPT: Standard residual connections
  • Experimental residual variants: Optional alternative residual schemes for stability/efficiency

Quick Reference

| Topic | Reference |
|-------|-----------|
| Model Architecture | GPT Architecture |
| Data Loading | Tokenized Data |
| Optimizers | Optimizers |
| Training Loop | Training Loop |
| Hyperparameters | Hyperparameters |

Installation

pip install torch einops numpy huggingface_hub
pip install modal  # only needed to launch the Modal-based Minimal Example

Minimal Example

import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    # Download data, build model, train
    # ... (see references for full implementation)
    final_loss = float("nan")  # placeholder; the training loop should overwrite this

    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
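As a sanity check on the `GPTConfig` defaults above, the parameter count can be worked out by hand. The sketch below assumes a nanoGPT-style layout (fused QKV projection, 4x MLP expansion, two LayerNorms per block, output head tied to the token embedding); `count_parameters` is a hypothetical helper, and the `learned_pos_emb` flag covers the case where RoPE replaces the learned position table.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = False


def count_parameters(cfg: GPTConfig, learned_pos_emb: bool = True) -> int:
    """Count weights for an assumed nanoGPT-style layout with a tied output head."""
    d = cfg.n_embd
    b = 1 if cfg.bias else 0
    attn = d * 3 * d + d * d + b * (3 * d + d)     # fused QKV + output projection
    mlp = d * 4 * d + 4 * d * d + b * (4 * d + d)  # up- and down-projection
    norms = 2 * d * (1 + b)                        # two LayerNorms per block
    total = cfg.vocab_size * d                     # token embedding (shared with the head)
    if learned_pos_emb:
        total += cfg.block_size * d                # absent when RoPE is used instead
    total += cfg.n_layer * (attn + mlp + norms)
    total += d * (1 + b)                           # final LayerNorm
    return total


print(f"{count_parameters(GPTConfig()) / 1e6:.1f}M")  # prints 124.3M
```

Under these assumptions the embedding table alone accounts for about 38.6M of the total, which is why changing `vocab_size` moves the count as much as adding several layers.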

Common Imports

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated in recent PyTorch releases
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math
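To show how these imports typically fit together, here is a minimal sketch of one bfloat16 training step. The `train_step` helper and the toy model are hypothetical stand-ins, not the skill's training loop. Because bfloat16 keeps float32's exponent range, loss scaling with `GradScaler` is generally only needed for float16, so the sketch omits it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_step(model, optimizer, x, y, max_grad_norm=1.0):
    """Run one optimizer step under bfloat16 autocast; returns (loss, grad_norm)."""
    device_type = "cuda" if x.is_cuda else "cpu"
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits = model(x)
    # Compute the loss in float32 for numerical stability.
    loss = F.cross_entropy(logits.float().reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # clip_grad_norm_ returns the total norm measured before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item(), grad_norm.item()


torch.manual_seed(0)
vocab = 64
model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))  # toy stand-in for a GPT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
tokens = torch.randint(0, vocab, (4, 17))
x, y = tokens[:, :-1], tokens[:, 1:]  # targets are the inputs shifted by one position
loss, grad_norm = train_step(model, optimizer, x, y)
print(f"loss={loss:.3f} grad_norm={grad_norm:.3f}")
```

The same function runs on CPU for quick checks and on a CUDA device once the model and batch are moved there.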

When to Use What

| Scenario | Approach |
|----------|----------|
| Standard GPT training | Use baseline model with standard residuals |
| Stability experiments | Try alternative residual variants or extra streams |
| Small experiments | Use T4/A10G GPU |
| Full training | Use A100 with bfloat16 |
| Custom data | Modify the dataset loader class |
| Different model size | Adjust GPTConfig parameters |
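For the custom-data case, a dataset loader can be as small as the sketch below. `TokenShardLoader` is a hypothetical class, not the one shipped with the skill, and it assumes a shard is a flat 1D array of token IDs (GPT-2 IDs fit in `uint16` because the vocabulary has 50257 entries). The on-disk shard format is not specified here, so the example uses a synthetic array.

```python
import numpy as np


class TokenShardLoader:
    """Hypothetical loader: samples (input, target) windows from a flat array of token IDs."""

    def __init__(self, tokens, block_size, batch_size, seed=0):
        if len(tokens) <= block_size:
            raise ValueError("shard must hold more than block_size tokens")
        self.tokens = tokens
        self.block_size = block_size
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)

    def next_batch(self):
        # The exclusive upper bound leaves room for block_size inputs plus one shifted target.
        starts = self.rng.integers(0, len(self.tokens) - self.block_size, size=self.batch_size)
        x = np.stack([self.tokens[s : s + self.block_size] for s in starts])
        y = np.stack([self.tokens[s + 1 : s + 1 + self.block_size] for s in starts])
        return x.astype(np.int64), y.astype(np.int64)


tokens = np.arange(10_000, dtype=np.uint16)  # synthetic stand-in for a tokenized shard
loader = TokenShardLoader(tokens, block_size=128, batch_size=4)
x, y = loader.next_batch()
print(x.shape, y.shape)  # prints (4, 128) (4, 128)
```

If the shards were stored as `.npy` files (an assumption), passing `np.load(path, mmap_mode="r")` as `tokens` would avoid reading the whole file into memory; the arrays can then be wrapped with `torch.from_numpy` before being moved to the GPU.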

Metrics to Monitor

| Metric | Typical Signal | Notes |
|--------|----------------|-------|
| Validation loss | Steady decrease | Absolute value depends on dataset/tokenizer |
| Grad norm | Moderate, stable range | Large spikes indicate instability |
| Training stability | Smooth curves | Frequent spikes suggest LR/batch issues |
| Throughput | Consistent tokens/sec | Use for comparing configs |
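Two of these signals, throughput and gradient-norm spikes, are easy to track inside the loop. The sketch below is one possible approach in plain Python: `TrainingMonitor` is a hypothetical helper, and the spike threshold (3x a running average) is an arbitrary assumption that should be tuned per run.

```python
import time


class TrainingMonitor:
    """Hypothetical helper: tracks tokens/sec and flags gradient-norm spikes."""

    def __init__(self, spike_factor=3.0, ema_decay=0.9):
        self.spike_factor = spike_factor
        self.ema_decay = ema_decay
        self.grad_norm_ema = None
        self.last_time = None

    def update(self, grad_norm, tokens_in_step, now=None):
        now = time.perf_counter() if now is None else now
        tokens_per_sec = None
        if self.last_time is not None and now > self.last_time:
            tokens_per_sec = tokens_in_step / (now - self.last_time)
        self.last_time = now

        is_spike = (
            self.grad_norm_ema is not None
            and grad_norm > self.spike_factor * self.grad_norm_ema
        )
        if self.grad_norm_ema is None:
            self.grad_norm_ema = grad_norm
        elif not is_spike:
            # Keep spikes out of the running average so one outlier does not mask the next.
            d = self.ema_decay
            self.grad_norm_ema = d * self.grad_norm_ema + (1 - d) * grad_norm
        return {"tokens_per_sec": tokens_per_sec, "grad_norm_spike": is_spike}


monitor = TrainingMonitor()
grad_norms = [1.0, 1.1, 0.9, 5.0, 1.0]  # made-up values; step 3 is a deliberate spike
history = [
    monitor.update(g, tokens_in_step=8192, now=float(step))
    for step, g in enumerate(grad_norms)
]
print([h["grad_norm_spike"] for h in history])  # prints [False, False, False, True, False]
```

In a real loop `now` would be left unset so the monitor reads the wall clock; on GPU, calling `torch.cuda.synchronize()` before each update gives more reliable timings.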

External Resources

  • nanoGPT: https://github.com/karpathy/nanoGPT
  • build-nanogpt: https://github.com/karpathy/build-nanogpt
  • modded-nanogpt: https://github.com/KellerJordan/modded-nanogpt
  • FineWeb-Edu token shards: https://huggingface.co/datasets/karpathy/fineweb-edu-100B-gpt2-token-shards

Version History

1 version in total

  • v0.1.0 (current)
    2026-05-07 17:06