
SageMaker Training Job

Submit ML training jobs to AWS SageMaker: package code, upload to S3, launch on GPU/CPU instances, poll status, and download artifacts. Use when training machine learning models.
Author: zyyhhxx · Source: clawhub · Category: uncategorized · Version: v1.0.2 (latest; 1 version) · Key: required · Stars: 1 · Downloads: 356 · Installs: 0

Overview

SageMaker Training

Submit ML training jobs to AWS SageMaker from the command line. Supports PyTorch, TensorFlow, scikit-learn, and XGBoost, with managed spot training for cost savings.

Prerequisites

  • boto3 Python package installed (pip install boto3). The sagemaker package is also recommended.
  • AWS credentials available: EC2 instance profile (recommended), or aws configure / env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  • S3 bucket for training artifacts
  • Two IAM roles configured (see references/setup.md for exact policies):
      • Role A (Caller): SageMaker job management + S3 access + ECR image pull
      • Role B (Execution): S3 data access + CloudWatch logs + ECR images

Security Notes

  • AWS credentials are never logged, embedded in scripts, or uploaded to S3. boto3 resolves credentials from its standard provider chain (environment variables, then shared credentials and config files, then the instance profile).
  • Source packaging excludes .git, .env, venv, __pycache__, and other non-essential files. Use --source-dir to explicitly scope what gets packaged. Always review --dry-run output before submitting to production.
  • IAM scope: Both caller and execution role policies should be scoped to your specific S3 bucket and SageMaker execution role ARN. See references/setup.md.
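The packaging exclusions described above can be illustrated with a short sketch. This is not the code from scripts/sagemaker_train.py; the names `EXCLUDED_NAMES`, `is_excluded`, and `package_source` are hypothetical, and the exclusion set only covers the four names listed in this section.

```python
import tarfile
from pathlib import Path

# Hypothetical exclusion set taken from the names listed above;
# the real script may exclude additional files.
EXCLUDED_NAMES = {".git", ".env", "venv", "__pycache__"}


def is_excluded(relative_path: Path) -> bool:
    """Return True if any component of the path is an excluded name."""
    return any(part in EXCLUDED_NAMES for part in relative_path.parts)


def package_source(source_dir: str, archive_path: str) -> list[str]:
    """Write a gzipped tarball of source_dir and return the packaged relative paths."""
    root = Path(source_dir).resolve()
    archive = Path(archive_path).resolve()
    packaged = []
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            # Skip directories, excluded paths, and the archive itself
            # (in case it is written inside source_dir).
            if not path.is_file() or is_excluded(relative) or path == archive:
                continue
            tar.add(path, arcname=relative.as_posix())
            packaged.append(relative.as_posix())
    return packaged
```

Printing the returned list before uploading serves the same purpose as reviewing --dry-run output: it shows exactly which files would leave the machine.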

Quick Start

1. Write a training script

Follow the SageMaker training script contract: read data from SM_CHANNEL_TRAIN, save model to SM_MODEL_DIR. See references/training-scripts.md for templates.
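As a rough illustration of that contract, the sketch below reads input files from SM_CHANNEL_TRAIN and writes an output file to SM_MODEL_DIR. It is not one of the templates in references/training-scripts.md: the `train` function is a placeholder that only counts input files, and the fallback paths are the conventional SageMaker container locations.

```python
import json
import os
from pathlib import Path


def train(train_dir: Path, model_dir: Path) -> dict:
    """Placeholder 'training': list the input files and save a JSON stand-in for a model."""
    files = sorted(p.name for p in train_dir.iterdir() if p.is_file())
    model = {"num_training_files": len(files), "files": files}
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.json").write_text(json.dumps(model))
    return model


def main() -> None:
    # SageMaker framework containers set these variables; the fallbacks are
    # the conventional paths inside the training container.
    train_dir = Path(os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    model_dir = Path(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    print(train(train_dir, model_dir))


# Only run when the contract variables are present, i.e. inside a training
# container or when they have been exported locally for a dry run.
if __name__ == "__main__" and "SM_MODEL_DIR" in os.environ:
    main()
```

To try it locally, export SM_CHANNEL_TRAIN and SM_MODEL_DIR to two scratch directories before running the script.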

2. Submit a training job

python3 scripts/sagemaker_train.py \
  --job-name my-experiment-001 \
  --script ./train.py \
  --role arn:aws:iam::ACCOUNT:role/SageMakerRole \
  --bucket my-sagemaker-bucket \
  --instance-type ml.g5.xlarge \
  --spot \
  --framework pytorch \
  --input-data s3://my-bucket/data/train/ \
  --hyperparameters '{"epochs":"50","lr":"0.001"}' \
  --output-dir ./results

The script packages your code, uploads to S3, submits the job, polls until complete, and downloads model artifacts to --output-dir.
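For readers curious what the submission step might look like underneath, here is a sketch that assembles a request for the SageMaker CreateTrainingJob API. How sagemaker_train.py actually builds its request is not documented here, so `build_training_request`, the `output/` prefix, the 50 GB volume, and the doubled wait time for spot jobs are all assumptions. It also shows why hyperparameters are passed as strings: the API expects string values.

```python
import json


def build_training_request(job_name, role_arn, image_uri, bucket, instance_type,
                           train_s3_uri, hyperparameters, max_runtime=3600, spot=False):
    """Assemble keyword arguments for the CreateTrainingJob API (sketch only)."""
    request = {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {"TrainingImage": image_uri, "TrainingInputMode": "File"},
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # Assumed output prefix; the real script may use a different layout.
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        # Assumed single instance with a 50 GB volume.
        "ResourceConfig": {"InstanceType": instance_type, "InstanceCount": 1, "VolumeSizeInGB": 50},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime},
        # The API requires hyperparameter values to be strings.
        "HyperParameters": {key: str(value) for key, value in hyperparameters.items()},
    }
    if spot:
        request["EnableManagedSpotTraining"] = True
        # Spot jobs need a max wait time no shorter than the max runtime;
        # doubling the runtime is an arbitrary choice for this sketch.
        request["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_runtime * 2
    return request


example = build_training_request(
    job_name="my-experiment-001",
    role_arn="arn:aws:iam::ACCOUNT:role/SageMakerRole",
    image_uri="IMAGE_URI_PLACEHOLDER",  # placeholder, not a real image
    bucket="my-sagemaker-bucket",
    instance_type="ml.g5.xlarge",
    train_s3_uri="s3://my-bucket/data/train/",
    hyperparameters={"epochs": 50, "lr": 0.001},
    spot=True,
)
print(json.dumps(example, indent=2))
```

With boto3 installed, a dictionary like this could be passed as `boto3.client("sagemaker").create_training_job(**example)` once the placeholder values are replaced.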

3. Check cost

# Estimate before running
python3 scripts/sagemaker_cost.py --instance-type ml.g5.xlarge --duration 3600 --spot

# Check actual cost after job completes
python3 scripts/sagemaker_cost.py --job-name my-experiment-001
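The pre-run estimate is simple arithmetic: hourly rate times duration in hours, reduced by the spot discount if applicable. The sketch below is not the logic of sagemaker_cost.py; it uses the hourly rates quoted in the Instance Selection section of this document, and the 50% default spot discount is an assumed midpoint of the 30-70% range quoted there.

```python
# Hourly rates as quoted in the Instance Selection section of this document.
# Actual prices vary by region and change over time.
HOURLY_RATE_USD = {
    "ml.m5.2xlarge": 0.54,
    "ml.g4dn.xlarge": 0.74,
    "ml.g5.xlarge": 1.41,
    "ml.p3.2xlarge": 4.28,
}


def estimate_cost(instance_type: str, duration_seconds: float,
                  spot: bool = False, spot_discount: float = 0.5) -> float:
    """Estimate the job cost in USD; spot_discount is an assumed fraction saved."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    hours = duration_seconds / 3600
    cost = HOURLY_RATE_USD[instance_type] * hours
    if spot:
        cost *= 1.0 - spot_discount
    return round(cost, 4)


# One hour on ml.g5.xlarge, on demand and with the assumed spot discount.
print(estimate_cost("ml.g5.xlarge", 3600))
print(estimate_cost("ml.g5.xlarge", 3600, spot=True))
```

Because spot savings fluctuate, it can help to compute the estimate at both ends of the range (spot_discount=0.3 and spot_discount=0.7) to get a cost bracket rather than a single figure.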

4. List recent jobs

python3 scripts/sagemaker_list.py --max 5
python3 scripts/sagemaker_list.py --status Failed

Key Options

| Flag | Purpose | Default |
| --- | --- | --- |
| --spot | Managed spot training (up to 70% savings) | off |
| --instance-type | Compute instance | ml.g5.xlarge |
| --max-runtime | Kill job after N seconds | 3600 |
| --framework | pytorch, tensorflow, sklearn, xgboost | pytorch |
| --image-uri | Custom Docker image (overrides framework) | auto |
| --requirements | requirements.txt for extra deps | none |
| --dry-run | Print config without submitting | off |
| --no-wait | Submit and exit without polling | off |
| --resume JOB | Reconnect to a running/completed job (skip submission) | |
| --source-dir | Directory with all training code | script's parent |
| --input-data | S3 input(s), format: channel:s3://... | none |
| --env | JSON environment variables | {} |
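The channel:s3://... format for --input-data can be parsed as sketched below. The function names are hypothetical, and treating a bare s3:// URI as the train channel is an assumption prompted by the Quick Start example, which passes a URI without a channel prefix. The SM_CHANNEL_<NAME> naming follows the SageMaker convention used in the training script contract.

```python
def parse_input_data(spec: str, default_channel: str = "train") -> tuple[str, str]:
    """Split 'channel:s3://bucket/prefix' into (channel, s3_uri)."""
    if spec.startswith("s3://"):
        # Assumed fallback: a bare URI is mapped to the default channel.
        return default_channel, spec
    channel, separator, uri = spec.partition(":")
    if not separator or not channel or not uri.startswith("s3://"):
        raise ValueError(f"expected channel:s3://..., got {spec!r}")
    return channel, uri


def channel_env_var(channel: str) -> str:
    """Name of the environment variable that points at the channel's data directory."""
    return f"SM_CHANNEL_{channel.upper()}"


channel, uri = parse_input_data("validation:s3://my-bucket/data/val/")
print(channel, uri, channel_env_var(channel))
```

Under this scheme, a training script would find the data for a channel named validation in the directory given by SM_CHANNEL_VALIDATION.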

Instance Selection

For tabular/Kaggle workloads:

  • Gradient boosting (LightGBM/XGBoost): ml.m5.2xlarge (CPU, $0.54/hr)
  • Small neural nets: ml.g4dn.xlarge (T4, $0.74/hr) — cheapest GPU
  • Standard deep learning: ml.g5.xlarge (A10G, $1.41/hr) — best price/performance
  • Heavy training: ml.p3.2xlarge (V100, $4.28/hr)

Always use --spot for non-urgent training — typical savings of 30-70%.

Workflow Integration

For autonomous agents running training jobs in a loop:

  1. Prepare data locally or upload to S3
  2. Write training script following the contract in references/training-scripts.md
  3. Use --dry-run first to validate config
  4. Submit with sagemaker_train.py — it blocks until completion by default
  5. Results download automatically to --output-dir
  6. Parse metrics from the output for experiment tracking

For parallel experiments, use --no-wait and poll with sagemaker_list.py.
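An agent that prefers to poll from Python rather than through sagemaker_list.py could use a loop like the sketch below. `wait_for_jobs` and its parameters are hypothetical; the status lookup is injected as a callable so the loop can be tested without AWS access. The terminal status names match the TrainingJobStatus values returned by SageMaker.

```python
import time
from typing import Callable, Iterable

# Terminal values of TrainingJobStatus; "InProgress" and "Stopping" are transient.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(job_names: Iterable[str],
                  get_status: Callable[[str], str],
                  poll_seconds: float = 30.0,
                  timeout_seconds: float = 3600.0,
                  sleep: Callable[[float], None] = time.sleep) -> dict[str, str]:
    """Poll every pending job until all reach a terminal status; return name -> status."""
    pending = set(job_names)
    final: dict[str, str] = {}
    deadline = time.monotonic() + timeout_seconds
    while pending:
        for name in sorted(pending):
            status = get_status(name)
            if status in TERMINAL_STATUSES:
                final[name] = status
        pending.difference_update(final)
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"jobs still running: {sorted(pending)}")
        sleep(poll_seconds)
    return final
```

With boto3, `get_status` could be something like `lambda name: client.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]`, where `client = boto3.client("sagemaker")`.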

Smoke Test

Verify the entire pipeline works end-to-end (~$0.01, takes ~3 min):

python3 scripts/sagemaker_smoke_test.py \
  --role arn:aws:iam::ACCOUNT:role/SageMakerTrainingExecutionRole \
  --bucket my-sagemaker-bucket

This runs a local pre-flight, submits a minimal job to SageMaker, verifies the downloaded model artifact, and checks cost. Use --keep to preserve output files.
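What the artifact verification involves is not specified here, so the following is only a guess at a reasonable check: SageMaker delivers model output as a model.tar.gz archive, and the sketch confirms that the downloaded file is a readable tar archive containing at least one non-empty file. The function name `verify_model_artifact` is hypothetical.

```python
import tarfile
from pathlib import Path


def verify_model_artifact(artifact_path: str) -> list[str]:
    """Return the names of non-empty files in the archive, or raise if the check fails."""
    path = Path(artifact_path)
    if not path.is_file():
        raise FileNotFoundError(f"artifact not found: {path}")
    if not tarfile.is_tarfile(path):
        raise ValueError(f"not a tar archive: {path}")
    # Default mode transparently handles gzip-compressed archives.
    with tarfile.open(path) as tar:
        members = [m.name for m in tar.getmembers() if m.isfile() and m.size > 0]
    if not members:
        raise ValueError(f"archive contains no non-empty files: {path}")
    return members
```

A call such as `verify_model_artifact("./results/model.tar.gz")` assumes the artifact was downloaded under that name; adjust the path to wherever --output-dir placed it.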

Version History

1 version in total

  • v1.0.2 (current)
    2026-05-07 11:44, security: safe

Security Check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
