
Alibabacloud Pai Dlc Job

Alibaba Cloud PAI-DLC (Deep Learning Containers) job management skill. Covers distributed training job CRUD, monitoring (logs and events), and GPU status checks.
Source: sdk-team · clawhub · v0.0.1 · 3 versions · Key: required

PAI-DLC Deep Learning Job Management

Manage deep learning training jobs on Alibaba Cloud PAI-DLC (Platform for AI - Deep
Learning Containers) service.

Scenario Description

PAI-DLC is a distributed training service provided by Alibaba Cloud's AI Platform PAI,
supporting:

  • Job Creation and Execution: Create distributed training jobs for TensorFlow, PyTorch, XGBoost, and other frameworks
  • Job Monitoring: Get job status, logs, events, and monitoring metrics
  • Compute Health Check: Check health status of GPU and other compute devices
  • Job Management: Update and stop jobs

Architecture: PAI Workspace + DLC Job + Computing Resources (ECS public pay-as-you-go
or Lingjun dedicated quota) + AIWorkSpace catalog (images / datasets / code sources /
quotas / workspaces).

Installation Requirements

> Pre-check: Aliyun CLI >= 3.3.1 required
> Run aliyun version to verify version >= 3.3.1. If not installed or version is too low,
> see references/cli-installation-guide.md for installation instructions.
> Then [Required] run aliyun configure set --auto-plugin-install true to enable
> automatic plugin installation.
>
> Note on --user-agent: Every API-invoking aliyun command in this skill MUST
> include --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job. Client-side helpers
> (aliyun version, aliyun configure ..., aliyun plugin ..., aliyun --help) do not invoke
> remote APIs and therefore do not require the flag.
>
> Network timeout & retry (rule --help doesn't enforce): aliyun CLI
> defaults to 10s connect / 10s read with no retry. For long-running flows
> (large list, slow region) raise them explicitly via the global flags
> --connect-timeout 15 --read-timeout 30 --retry-count 2. Never rely on the
> default for user-confirmed high-risk calls (stop-job / delete-*).

aliyun version
aliyun configure set --auto-plugin-install true
aliyun pai-dlc --help
aliyun aiworkspace --help >/dev/null 2>&1 || aliyun plugin install --names aliyun-cli-aiworkspace
aliyun plugin update

aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job"
# After session: aliyun configure ai-mode disable
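
The version pre-check above can be scripted. The following is a minimal sketch, assuming aliyun version prints a bare dotted version such as 3.3.1; the helper name version_ge is illustrative and not part of the CLI.

```bash
# version_ge HAVE WANT: succeed when dotted version HAVE >= WANT.
# Compares up to three numeric fields; missing fields count as 0.
# The body runs in a subshell so IFS and the positional parameters stay local.
version_ge() (
  have=$1 want=$2
  set -f                      # no globbing while splitting on dots
  IFS=.
  set -- $have; h1=${1:-0}; h2=${2:-0}; h3=${3:-0}
  set -- $want; w1=${1:-0}; w2=${2:-0}; w3=${3:-0}
  if [ "$h1" -ne "$w1" ]; then [ "$h1" -gt "$w1" ]; exit; fi
  if [ "$h2" -ne "$w2" ]; then [ "$h2" -gt "$w2" ]; exit; fi
  [ "$h3" -ge "$w3" ]
)

# Usage (assumption: the command output is only the version string):
#   version_ge "$(aliyun version)" 3.3.1 ||
#     echo "Aliyun CLI too old; see references/cli-installation-guide.md" >&2
```

Fields are compared as integers, so 3.10.0 correctly ranks above 3.3.1, which a plain string comparison would get wrong.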

Environment Variables

This skill does not require any custom environment variables. Credentials are handled
by the Alibaba Cloud CLI configuration (see Authentication below). Optionally:

| Variable | Required | Purpose |
| --- | --- | --- |
| ALIBABA_CLOUD_PROFILE | Optional | Selects a non-default aliyun configure profile |
| ALIBABA_CLOUD_REGION_ID | Optional | Default region when --region is omitted (still recommended to pass --region explicitly) |

Do NOT export ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET from
within this session; configure them outside (aliyun configure or shell profile).

Authentication Configuration

> Pre-check: Alibaba Cloud Credentials Required
>
> Security Rules:
> - NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)
> - NEVER ask the user to input AK/SK directly in the conversation or command line
> - NEVER use aliyun configure set with literal credential values
> - ONLY use aliyun configure list to check credential status
>
> ```bash
> aliyun configure list
> ```
> Check the output for a valid profile (AK, STS, or OAuth identity).
>
> If no valid profile exists, STOP here.
> 1. Obtain credentials from Alibaba Cloud Console
> 2. Configure credentials outside of this session (via aliyun configure in terminal
>    or environment variables in shell profile)
> 3. Return and re-run after aliyun configure list shows a valid profile

RAM Permissions

> [MUST] Permission Failure Handling: When any command or API call fails due to
> permission errors at any point during execution, follow this process:
> 1. Read references/ram-policies.md to get the full list of permissions required by this SKILL
> 2. Use ram-permission-diagnose skill to guide the user through requesting the necessary permissions
> 3. Pause and wait until the user confirms that the required permissions have been granted

For the detailed permission list, see references/ram-policies.md.

Required Permissions Overview:

| Operation | Required Permission |
| --- | --- |
| Create Job | pai:CreateJob |
| List Jobs | pai:ListJobs |
| Get Job Details | pai:GetJob |
| Get Pod Logs | pai:GetPodLogs |
| Get Job Events | pai:GetJobEvents |
| Update Job | pai:UpdateJob |
| Stop Job | pai:StopJob |
| AIWorkSpace Resource Discovery | paiworkspace:ListWorkspaces / paiimage:ListImages,GetImage / paidataset:ListDatasets,GetDataset / paicodesource:ListCodeSources,GetCodeSource |

> AIWorkSpace authorization note: Image / DataSourceId / CodeSourceId /
> WorkspaceId field values for create-job come from the
> AIWorkSpace resource-discovery APIs. --resource-id (QuotaId) is manually provided by the user.
> RAM users MUST hold the corresponding
> AIWorkSpace-namespaced permissions listed above (do not abbreviate as aiworkspace:*).

Parameter Confirmation

> Authoritative parameter reference is aliyun pai-dlc --help (must-read
> before every call). This skill only documents what --help does not tell
> you: cross-field rules, cross-product dependencies, hidden behaviors, business
> labels, and reject patterns. Whenever a rule below contradicts --help, the
> reason is stated inline.
>
> Confirm before call: all user-customizable values (region, names, CIDR,
> specs, etc.) MUST be confirmed with the user. Never assume defaults.

Hard rules that override --help

| Rule | Why this skill overrides --help |
| --- | --- |
| --workspace-id is always required | --help marks it optional, but server silently falls back to the user's default workspace if omitted → job often lands in the wrong workspace. Always confirm with user. |
| --job-specs[].Image MUST be a verbatim ImageUri from aiworkspace list-images | Cross-product contract; --help only describes the field type. See §7.6 red line. |
| --data-sources[].DataSourceId from aiworkspace list-datasets; --code-source.CodeSourceId from list-code-sources | Cross-product discovery; --help cannot point you to the source product. |
| --resource-id (QuotaId) is manually supplied | No CLI discovery step. |

Cross-field mutual exclusion (--help cannot catch these)

  • EcsSpec / ResourceConfig: within a single TaskSpec, pick exactly one.
  • Uri / DataSourceId: within --data-sources[].
  • Uri / CodeSourceId: within --code-source.
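
Because --help cannot catch these conflicts, a pre-flight check before create-job may help. The following is a rough sketch for the EcsSpec versus ResourceConfig rule. It matches key names textually in a single TaskSpec object rather than parsing JSON, so a value that happens to contain a quoted key name would fool it. The function name has_exactly_one is hypothetical.

```bash
# has_exactly_one JSON KEY_A KEY_B: succeed when exactly one of the two
# quoted keys occurs in the JSON text. Textual match only, not a JSON parse.
has_exactly_one() (
  n=0
  case $1 in *"\"$2\""*) n=$((n + 1)) ;; esac
  case $1 in *"\"$3\""*) n=$((n + 1)) ;; esac
  [ "$n" -eq 1 ]
)

# Usage: reject a TaskSpec that sets both keys or neither.
#   has_exactly_one "$task_spec" EcsSpec ResourceConfig ||
#     echo "TaskSpec must set exactly one of EcsSpec / ResourceConfig" >&2
```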

--job-type — Worker Type hints per framework

--help lists the 9 legal enum values verbatim. What --help doesn't tell you
is which JobSpecs[].Type roles each framework expects:

| --job-type | Valid JobSpecs[].Type roles |
| --- | --- |
| TFJob | Chief / PS / Worker / Evaluator / GraphLearn |
| PyTorchJob | Worker (+ optional Master, auto-promoted) |
| MPIJob | Worker + Master |
| XGBoostJob / OneFlowJob / ElasticBatchJob | Worker + optional Master |
| RayJob | Worker |
| SlurmJob / DataJuicerJob | framework-specific roles |

> Case-sensitive, no aliases. tensorflow, pytorch, tf-job, Pytorch,
> PYTORCH_JOB, Custom, CustomJob are all rejected.
>
> No Custom enum. For single-container custom workloads, map to
> PyTorchJob (most permissive role set).
>
> Locked after create: JobType cannot be changed via update-job.
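
Since the enum is case-sensitive and locked after creation, it may be worth validating the value locally before calling create-job. A small sketch, using the nine values from the table above; the function name is_valid_job_type is illustrative.

```bash
# is_valid_job_type NAME: succeed only for an exact, case-sensitive match
# against the nine --job-type values listed in this section.
is_valid_job_type() {
  case $1 in
    TFJob|PyTorchJob|MPIJob|XGBoostJob|OneFlowJob|ElasticBatchJob|RayJob|SlurmJob|DataJuicerJob)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Usage:
#   is_valid_job_type "$job_type" || echo "rejected --job-type: $job_type" >&2
```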

Full field reference: see references/related-apis.md.

Core Workflows

7.1 Resource Selection Decision Guide

Before calling create-job, determine the resource path:

  • Public pay-as-you-go → Use EcsSpec in TaskSpec; do NOT pass --resource-id.
    • Use cases: quick start, testing, no dedicated quota.
    • Example: "EcsSpec": "ecs.gn6i-c4g1.xlarge"
  • Dedicated quota (Lingjun / enterprise quota) → Use ResourceConfig in TaskSpec
    AND pass --resource-id <QuotaId>.
    • Use cases: dedicated resource group, Lingjun smart compute, Spot bidding.
    • Example: --resource-id quotaXXX + "ResourceConfig": {"CPU": "4", "Memory": "8Gi", "GPU": "1"}

> EcsSpec and ResourceConfig MUST NOT both appear in the same TaskSpec.
>
> Also required before create-job: --job-specs[].Image MUST come from
> aliyun aiworkspace list-images; --data-sources[].DataSourceId from
> list-datasets; --code-source.CodeSourceId from list-code-sources.
> Full discovery flow → see §7.6.
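
One way to keep the two resource paths apart is to build the --job-specs value with a separate helper per path, so a single TaskSpec never carries both keys. The sketch below uses hypothetical helper names (job_specs_ecs, job_specs_quota) and does no JSON escaping, so it assumes the inputs contain no double quotes or backslashes.

```bash
# job_specs_ecs IMAGE_URI ECS_SPEC [POD_COUNT]
# Public pay-as-you-go path: EcsSpec only; do not pass --resource-id.
job_specs_ecs() {
  printf '[{"Type":"Worker","PodCount":%d,"Image":"%s","EcsSpec":"%s"}]\n' \
    "${3:-1}" "$1" "$2"
}

# job_specs_quota IMAGE_URI CPU MEMORY GPU [POD_COUNT]
# Dedicated quota path: ResourceConfig only; the caller must also pass --resource-id.
job_specs_quota() {
  printf '[{"Type":"Worker","PodCount":%d,"Image":"%s","ResourceConfig":{"CPU":"%s","Memory":"%s","GPU":"%s"}}]\n' \
    "${5:-1}" "$1" "$2" "$3" "$4"
}

# Usage:
#   --job-specs "$(job_specs_ecs "$image_uri" ecs.gn6i-c4g1.xlarge)"
#   --resource-id "$quota_id" --job-specs "$(job_specs_quota "$image_uri" 4 8Gi 1)"
```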

Distributed architecture choices:

| Topology | JobSpecs shape |
| --- | --- |
| Single-node | One Worker only |
| TFJob PS-Worker | Both PS (CPU) and Worker (GPU) roles |
| PyTorch multi-node | One Worker with PodCount > 1 |

Optional flags: --enable-gang-scheduling true (all-or-nothing scheduling),
Settings.EnableRDMA: true (high-performance network for multi-node GPU),
Settings.EnableSanityCheck: true (GPU health verification).

> All commands below require --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job (omitted in snippets for brevity — see Installation Requirements).
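
To avoid forgetting the flag, the user agent and the raised timeout values from Installation Requirements can be attached in one place. A minimal sketch follows; the wrapper name dlc and the ALIYUN_BIN override are hypothetical conveniences, not CLI features, and it assumes the global flags are accepted after the subcommand arguments.

```bash
# ALIYUN_BIN is a hypothetical override so the wrapper can be dry-run
# (for example ALIYUN_BIN=echo); it defaults to the real CLI.
: "${ALIYUN_BIN:=aliyun}"
SKILL_UA="AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job"

# dlc SUBCOMMAND [ARGS...]: run `aliyun pai-dlc` with the mandatory user
# agent plus the connect/read timeout and retry flags named in this skill.
dlc() {
  "$ALIYUN_BIN" pai-dlc "$@" \
    --user-agent "$SKILL_UA" \
    --connect-timeout 15 --read-timeout 30 --retry-count 2
}

# Usage:
#   dlc list-jobs --region "$region" --status Running
```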

7.2 Create Training Job

Minimal single-node PyTorch job (public pay-as-you-go) parameter combination:

aliyun pai-dlc create-job --region <region> --workspace-id <ws-id> \
  --display-name "my-pytorch-training" --job-type PyTorchJob \
  --job-specs '[{"Type":"Worker","PodCount":1,"Image":"<ImageUri>","EcsSpec":"ecs.gn6i-c4g1.xlarge"}]' \
  --user-command 'python train.py' \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job

Multi-node / Spot / RDMA / data mounting — use create-job --help.

Subsequent snippets omit --user-agent for brevity — always include it.
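
The "confirm before call" and "--workspace-id is always required" rules can be enforced by refusing to run with a missing or empty value. The sketch below wraps the minimal command above; create_pytorch_job and ALIYUN_BIN are hypothetical names, and the values are assumed to contain no double quotes.

```bash
: "${ALIYUN_BIN:=aliyun}"   # hypothetical override for dry runs, e.g. ALIYUN_BIN=echo

# create_pytorch_job REGION WORKSPACE_ID DISPLAY_NAME IMAGE_URI ECS_SPEC USER_COMMAND
# Refuses to run unless every value is supplied, so nothing falls back to a default.
create_pytorch_job() {
  if [ "$#" -ne 6 ]; then
    echo "usage: create_pytorch_job REGION WORKSPACE_ID DISPLAY_NAME IMAGE_URI ECS_SPEC USER_COMMAND" >&2
    return 2
  fi
  for value in "$@"; do
    if [ -z "$value" ]; then
      echo "create_pytorch_job: empty argument; confirm every value with the user first" >&2
      return 2
    fi
  done
  "$ALIYUN_BIN" pai-dlc create-job --region "$1" --workspace-id "$2" \
    --display-name "$3" --job-type PyTorchJob \
    --job-specs "[{\"Type\":\"Worker\",\"PodCount\":1,\"Image\":\"$4\",\"EcsSpec\":\"$5\"}]" \
    --user-command "$6" \
    --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
}
```

Running it once with ALIYUN_BIN=echo prints the command that would be sent, which gives the user something concrete to confirm.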

7.3 List / Get Job

Use --cli-query to project specific fields (essential for log/event flows):

aliyun pai-dlc list-jobs --region <region> --status Running
aliyun pai-dlc get-job  --region <region> --job-id <id>
aliyun pai-dlc get-job  --region <region> --job-id <id> --cli-query "Pods[0].PodId"

7.4 Logs and Events

> Always cap return size: --max-lines 100 (logs), --max-events-num 50 (events).

Get PodId first, then query logs/events:

# Capture the first pod's ID, then reuse it (quoted) for the pod-level queries.
POD_ID=$(aliyun pai-dlc get-job --region <r> --job-id <id> --cli-query "Pods[0].PodId")
aliyun pai-dlc get-pod-logs    --region <r> --job-id <id> --pod-id "$POD_ID" --max-lines 100
aliyun pai-dlc get-pod-events  --region <r> --job-id <id> --pod-id "$POD_ID" --max-events-num 20
aliyun pai-dlc get-job-events  --region <r> --job-id <id> --max-events-num 50

Diagnosis order: get-job (status) → get-job-events → get-pod-logs → get-pod-events.
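
The diagnosis order can be wrapped in one function so the caps are always applied. This is a sketch under assumptions: the get-job response is assumed to expose a Status field, the query result may be printed as a JSON-quoted string (hence the tr call), and diagnose_job and ALIYUN_BIN are hypothetical names. Add --user-agent as described in Installation Requirements.

```bash
: "${ALIYUN_BIN:=aliyun}"   # hypothetical override so the flow can be stubbed

# diagnose_job REGION JOB_ID: status, then job events, then pod logs and pod events.
diagnose_job() {
  region=$1 job_id=$2
  "$ALIYUN_BIN" pai-dlc get-job --region "$region" --job-id "$job_id" --cli-query "Status"
  "$ALIYUN_BIN" pai-dlc get-job-events --region "$region" --job-id "$job_id" --max-events-num 50
  # Strip JSON quotes in case the query result is printed as a quoted string.
  pod_id=$("$ALIYUN_BIN" pai-dlc get-job --region "$region" --job-id "$job_id" \
    --cli-query "Pods[0].PodId" | tr -d '"')
  if [ -z "$pod_id" ]; then
    echo "no pod found for job $job_id" >&2
    return 1
  fi
  "$ALIYUN_BIN" pai-dlc get-pod-logs --region "$region" --job-id "$job_id" \
    --pod-id "$pod_id" --max-lines 100
  "$ALIYUN_BIN" pai-dlc get-pod-events --region "$region" --job-id "$job_id" \
    --pod-id "$pod_id" --max-events-num 50
}
```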

7.5 Compute Health Check

aliyun pai-dlc list-job-sanity-check-results --region <r> --job-id <id>
aliyun pai-dlc get-job-sanity-check-result   --region <r> --job-id <id> --sanity-check-number 1

7.6 Pre-Create Resource Discovery (AIWorkSpace)

Discovery flow: list-workspaces → list-image-labels → list-images → list-datasets →
list-code-sources → pai-dlc create-job.

> Quota (--resource-id): user-supplied. No CLI discovery step.

aliyun aiworkspace list-workspaces     --region <r>                        # → --workspace-id
aliyun aiworkspace list-image-labels   --region <r>                        # → valid label Key=Value pairs
aliyun aiworkspace list-images         --region <r> --labels "K1=V1,K2=V2" # → --job-specs[].Image (use ImageUri verbatim)
aliyun aiworkspace list-datasets       --region <r> --workspace-id <ws>    # → DataSources[].DataSourceId
aliyun aiworkspace list-code-sources   --region <r> --workspace-id <ws>    # → CodeSource.CodeSourceId

> Labels rules (not in --help): comma-separated Key=Value pairs, no
> JSON / no spaces. Values MUST come from list-image-labels; never invent them.
> Do not pass --workspace-id to list-images when discovering **official public images**
> (they are global). Pass --workspace-id only when filtering
> custom / private images scoped to a specific workspace.
>
> RED LINE: --job-specs[].Image MUST be a verbatim ImageUri (not
> Name / ImageId).
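
The --labels syntax rule lends itself to a local check before calling list-images. The sketch below validates shape only (comma-separated Key=Value, no spaces, no JSON punctuation); it cannot tell whether the values really came from list-image-labels. The function name is_valid_labels is illustrative, and treating "=" as forbidden inside keys and values is an assumption.

```bash
# is_valid_labels STRING: succeed when STRING is one or more Key=Value pairs
# joined by commas, with no spaces, braces, brackets, or double quotes.
is_valid_labels() {
  printf '%s\n' "$1" |
    grep -Eq '^[^][{}" ,=]+=[^][{}" ,=]+(,[^][{}" ,=]+=[^][{}" ,=]+)*$'
}

# Usage:
#   is_valid_labels "$labels" || echo "bad --labels value: $labels" >&2
```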

Field-mapping, full parameters, and error codes: see
references/related-apis.md and references/verification-method.md.

7.7 Job Lifecycle Management (Stop / Update / Web Terminal)

Stop is a high-risk operation. Before proceeding, query status with
get-job, present the result to the user, and require explicit confirmation.

> Rules --help doesn't tell you (update-job silent-no-op family):
>
> - Stop Job applies only when status is Running or Queuing.
> - update-job --priority takes effect only when (a) the job uses
>   quota resources (--resource-id) AND (b) status is Creating,
>   Queuing, or EnvPreparing. Once the job enters Running or later,
>   priority cannot be modified: the API returns 200 OK but the change
>   is silently NOT applied. Always pre-check status with get-job.
> - update-job --accessibility takes effect immediately in any status.
> - update-job does NOT expose --display-name (--help lists only
>   --job-id, --accessibility, --description, --job-specs, --priority).
>   To rename a job, recreate it.
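
Because a no-op update still returns 200 OK, the status rules above are worth encoding as local predicates that run after get-job and before the write call. A minimal sketch; can_stop and priority_update_applies are hypothetical helper names.

```bash
# can_stop STATUS: stop-job applies only to Running or Queuing jobs.
can_stop() {
  case $1 in
    Running|Queuing) return 0 ;;
    *) return 1 ;;
  esac
}

# priority_update_applies STATUS QUOTA_ID: update-job --priority only takes
# effect for quota jobs (non-empty QUOTA_ID) that have not started running.
priority_update_applies() {
  [ -n "$2" ] || return 1
  case $1 in
    Creating|Queuing|EnvPreparing) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   priority_update_applies "$status" "$quota_id" ||
#     echo "update-job --priority would be silently ignored" >&2
```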

For the full pre-check + confirmation + execution templates, plus the
update-job low-risk path and get-web-terminal / get-token sharing
commands, see references/job-management.md.

7.8 Ecs Spec Discovery

Discover available instance types; the returned EcsSpec value goes
verbatim into --job-specs[].EcsSpec.

aliyun pai-dlc list-ecs-specs --region <r> --accelerator-type GPU --resource-type ECS --page-size 20
# Lingjun dedicated: --quota-id <id> (whitelisted users only)

> list-ecs-specs does not support --sort-by: even values shown as
> valid in --help (e.g. CPU / GPU / Memory / GmtCreateTime) are
> rejected by the server. Always omit --sort-by here and sort the JSON
> output client-side with jq, e.g.
> ... | jq '.EcsSpecs | sort_by(-.AcceleratorNumber)'.

Success Verification Method

For step-by-step end-to-end verification scripts (resource discovery →
CreateJob → log query → cleanup), see references/verification-method.md.

Quick verification:

  • get-job → Status should be Creating / Queuing / Running shortly after
    create-job returns.
  • list-jobs --status Running → Should return the freshly created Job once it
    reaches Running, until it finishes or is stopped.
  • get-pod-logs → Should return non-empty log content once the Pod is past
    EnvPreparing.
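
The first quick check can be automated with a bounded polling loop. The sketch below assumes the get-job response exposes a Status field and that the query result may be JSON-quoted. wait_for_start, ALIYUN_BIN, and the default of 10 tries at 6-second intervals are illustrative choices, not values from the skill. Add --user-agent as described in Installation Requirements.

```bash
: "${ALIYUN_BIN:=aliyun}"   # hypothetical override so the loop can be stubbed

# wait_for_start REGION JOB_ID [TRIES] [INTERVAL_SECONDS]
# Polls get-job until Status is Creating, Queuing, or Running, then prints it.
wait_for_start() {
  tries=${3:-10} interval=${4:-6}
  status=
  while [ "$tries" -gt 0 ]; do
    status=$("$ALIYUN_BIN" pai-dlc get-job --region "$1" --job-id "$2" \
      --cli-query "Status" | tr -d '"')
    case $status in
      Creating|Queuing|Running) echo "$status"; return 0 ;;
    esac
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then sleep "$interval"; fi
  done
  echo "job $2 not started; last status: ${status:-unknown}" >&2
  return 1
}
```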

Command Tables

The full command index (5 categories × ~40 commands, with plugin
attribution) is consolidated in references/related-apis.md §1.

Best Practices

> Items below are decision rules and operational habits, not parameter
> values (those live in --help).

  1. Job naming: use meaningful, sortable names of the form project-model-date
     (e.g. resnet50-imagenet-20260320). Recreating the job (not update-job) is
     the only way to rename it.
  2. Resource sizing: pick GPU type / count by model & dataset size. Verify
     availability with list-ecs-specs --accelerator-type GPU before picking
     EcsSpec (see §7.8).
  3. Diagnose early: follow the order get-job → get-job-events →
     get-pod-logs → get-pod-events. Cap responses (--max-lines 100,
     --max-events-num 50) to keep agent context lean.
  4. Priority adjustment: prefer setting --priority at create-job time.
     Post-creation update-job --priority only works for quota jobs in the
     Creating / Queuing / EnvPreparing phase (§7.7); once Running,
     priority cannot be modified.
  5. Cost control: use --job-max-running-time-minutes as an auto-stop guard
     for every long-running experiment. Spot via SpotSpec reduces cost at the
     risk of preemption.
  6. Health check: enable Settings.EnableSanityCheck: true for GPU
     training to catch faulty devices before training starts.
  7. Resource cleanup: run stop-job on jobs you no longer need so they release
     quota (stop applies only to Running or Queuing jobs, see §7.7).
  8. Idempotency on writes: PAI-DLC create-* APIs do NOT expose
     --client-token (verified via aliyun pai-dlc create-job --help). Network
     retries can therefore create duplicate Jobs. Mitigation: before re-issuing
     a failed create-*, run list-jobs --display-name <name> to detect a
     half-committed prior attempt.
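
The idempotency mitigation can be sketched as a check that runs before any retry of create-job. It assumes the list-jobs response carries a numeric TotalCount field and that --workspace-id is accepted as a filter; job_name_in_use and ALIYUN_BIN are hypothetical names. If the display-name filter matches more than the exact name, review the returned jobs by hand.

```bash
: "${ALIYUN_BIN:=aliyun}"   # hypothetical override so the check can be stubbed

# job_name_in_use REGION WORKSPACE_ID DISPLAY_NAME
# Returns 0 when list-jobs already reports a job with this display name,
# 1 when it reports none, and 2 when the count could not be read.
job_name_in_use() {
  count=$("$ALIYUN_BIN" pai-dlc list-jobs --region "$1" --workspace-id "$2" \
    --display-name "$3" --cli-query "TotalCount" | tr -d '"')
  case $count in
    ''|*[!0-9]*)
      echo "could not read TotalCount; inspect list-jobs output manually" >&2
      return 2 ;;
  esac
  [ "$count" -gt 0 ]
}

# Usage before retrying a failed create-job (retry only on exit status 1):
#   rc=0; job_name_in_use "$region" "$ws" "$name" || rc=$?
#   [ "$rc" -eq 1 ] && echo "no earlier job found; retry is reasonable"
```

Keeping "unknown" (2) distinct from "none found" (1) matters here: treating an unreadable response as "no job" is exactly how a duplicate would slip through.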

Reference Links

| Reference Document | Description |
| --- | --- |
| references/related-apis.md | Command index, cross-product field map, lifecycle, red lines, error catalog |
| references/ram-policies.md | RAM permission policy details |
| references/verification-method.md | End-to-end verification scripts |
| references/job-management.md | High-risk Stop/Delete/Update flow + Web Terminal |
| references/acceptance-criteria.md | Skill testing acceptance criteria |
| references/cli-installation-guide.md | CLI installation guide |

Version History

3 versions in total

  • v0.0.1 (current)
    2026-05-29 21:14, both security scans: safe
  • v0.0.1-beta.2
    2026-05-26 18:09, both security scans: safe
  • v0.0.1-beta.1
    2026-05-21 23:54, both security scans: safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
