← 返回
未分类 Key 中文

Alibabacloud Pai Rec Diagnosis

Alibaba Cloud PAI-Rec Engine Diagnostic and Configuration Validation Skill. Use for diagnosing PAI-Rec engine interface issues and validating engine configur...
阿里云PAI-Rec引擎诊断与配置校验技能,用于诊断引擎接口问题并校验配置。
sdk-team sdk-team 来源
未分类 clawhub v0.0.2 2 版本 100000 Key: 需要
★ 0
Stars
📥 401
下载
💾 0
安装
2
版本
#latest

概述

PAI-Rec Engine Diagnosis and Configuration Validation

This skill provides comprehensive diagnostic and validation capabilities for Alibaba Cloud PAI-Rec (Programmable Recommendation System) engines, including interface troubleshooting and configuration analysis.

Scenario Description

PAI-Rec is Alibaba Cloud's programmable recommendation system that provides intelligent recommendation capabilities. This skill helps users:

  1. Diagnose PAI-Rec Engine Interface Issues: When engine API returns errors or unexpected results, trace the request through EAS service logs and engine configurations to identify root causes.
  1. Validate Engine Configurations: Analyze engine configuration files for potential issues, inconsistencies, or misconfigurations before deployment.

Architecture: PAI-EAS Service + PAI-Rec Engine + Engine Configuration Management

Key Components

  • PAI-EAS Service: Elastic Algorithm Service hosting the recommendation engine
  • PAI-Rec Engine: The recommendation engine processing requests
  • Engine Configuration: Configuration files defining engine behavior
  • Service Logs: EAS service logs containing request traces

Installation

Pre-check: Aliyun CLI >= 3.3.3 required

> Run aliyun version to verify >= 3.3.3. If not installed or version too low,

> run curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash to install/update,

> or see references/cli-installation-guide.md for installation instructions.

Pre-check: Aliyun CLI plugin update required

> [MUST] run aliyun configure set --auto-plugin-install true to enable automatic plugin installation.

> [MUST] run aliyun plugin update to ensure that any existing plugins are always up-to-date.


Authentication

> Pre-check: Alibaba Cloud Credentials Required

>

> Security Rules:

> - NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)

> - NEVER ask the user to input AK/SK directly in the conversation or command line

> - NEVER use aliyun configure set with literal credential values

> - ONLY use aliyun configure list to check credential status

>

> ```bash

> aliyun configure list

> ```

> Check the output for a valid profile (AK, STS, or OAuth identity).

>

> If no valid profile exists, STOP here.

> 1. Obtain credentials from Alibaba Cloud Console

> 2. Configure credentials outside of this session (via aliyun configure in terminal or environment variables in shell profile)

> 3. Return and re-run after aliyun configure list shows a valid profile


RAM Policy

> [MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:

> 1. Read references/ram-policies.md to get the full list of permissions required by this SKILL

> 2. Use ram-permission-diagnose skill to guide the user through requesting the necessary permissions

> 3. Pause and wait until the user confirms that the required permissions have been granted

For detailed RAM policies required by this skill, see references/ram-policies.md.


Parameter Confirmation

> IMPORTANT: Parameter Confirmation — Before executing any command or API call,

> ALL user-customizable parameters (e.g., RegionId, service names, instance IDs,

> environment names, config names, etc.) MUST be confirmed with the user.

> Do NOT assume or use default values without explicit user approval.

Required Parameters

Parameter NameRequired/OptionalDescriptionDefault Value
---------------------------------------------------------------
service-nameRequiredPAI-EAS service nameN/A
cluster-idRequiredEAS cluster/region IDN/A
instance-idRequiredPAI-Rec instance IDN/A
request-idRequired (for diagnosis)Request ID from engine API responseN/A
config-nameOptionalEngine configuration nameN/A
environmentOptionalEngine environment (Prod/Pre)N/A
engine-config-idOptionalSpecific engine config version IDN/A
regionOptionalAlibaba Cloud regionN/A

Core Workflow

> At the start of the Core Workflow (before any CLI invocation):

> [MUST] Enable AI-Mode — AI-mode is required for Agent Skill execution.

> Run the following commands before any CLI invocation:

> ```bash

> aliyun configure ai-mode enable

> aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"

> ```

> [MUST] Disable AI-Mode at EVERY exit point — Before delivering the final response for ANY reason, always disable AI-mode first. This applies to ALL exit paths: workflow success, workflow failure, error/exception, user cancellation, session end, or any other scenario where no further CLI commands will be executed.

> AI-mode is only used for Agent Skill invocation scenarios and MUST NOT remain enabled after the skill stops running.

> ```bash

> aliyun configure ai-mode disable

> ```

> [MUST] Allocate a per-session work directory — All transient artifacts

> (raw config dumps, log dumps, sanitized output) MUST be written under a unique

> per-session directory to avoid concurrent overwrite between parallel skill

> invocations. Run the following at the start of the workflow, before any

> artifact-producing CLI invocation, and reuse $WORKDIR for the whole session:

> ```bash

> # Create an isolated working directory for this session, pinned under /tmp.

> # Pass the full template as a positional arg (works on both BSD/macOS and

> # GNU/Linux mktemp); do NOT use -t prefix, which falls back to $TMPDIR

> # (e.g. /var/folders on macOS) and may even fail under sandboxed shells.

> export WORKDIR=$(mktemp -d /tmp/pairec-diag-XXXXXX)

> ```

> All file paths shown in the steps below ($WORKDIR/engine_configs_list.json,

> $WORKDIR/raw_engine_config.json, etc.) live inside this directory and MUST NOT

> be replaced with hard-coded /tmp/... paths.

Workflow 1: PAI-Rec Engine Interface Diagnosis

This workflow helps diagnose issues when a PAI-Rec engine API returns errors or unexpected results.

Input Example:

Service Name: embedding_recall
API Response:
{
    "code": 299,
    "msg": "items size not enough",
    "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
    "size": 0,
    "experiment_id": "",
    "items": []
}

Step 1: Retrieve EAS Service Information

Get the service details to find the EAS service ID and configuration:

aliyun eas describe-service \
  --cluster-id <cluster-id> \
  --service-name <service-name>

What to extract:

  • Resource: EAS service resource ID (e.g., eas-r-1v4qb1yan3qmnjwxqe)
  • ServiceConfig.envs: Environment variables containing:
  • REGION: The region
  • INSTANCE_ID: PAI-Rec instance ID
  • CONFIG_NAME: Engine configuration name
  • PAIREC_ENVIRONMENT: Environment (product/prepub)

Step 2: Extract Request ID from API Response

Parse the API response JSON to get the request_id field. This will be used to search service logs.

Step 3: Query EAS Service Logs

Use the request ID as the sole filter to search service logs. Do NOT pass --start-time / --end-time when searching PAI-Rec business logs:

aliyun eas describe-service-log \
  --cluster-id <cluster-id> \
  --service-name <service-name> \
  --keyword <request-id> \
  --page-size 500

[CRITICAL] --keyword is MANDATORY — local post-processing is FORBIDDEN:

  • You MUST pass --keyword to the CLI command. This is a server-side filter; the API only returns log lines matching the keyword.
  • You MUST NOT omit --keyword and then filter locally (e.g., piping through head, grep, python3, jq, or any script to search for the request_id in the output).
  • You MUST NOT make multiple describe-service-log calls without --keyword hoping to find relevant lines by scanning the full log stream.
  • If a call with --keyword returns empty results, report that no matching logs were found — do NOT fall back to fetching unfiltered logs.

> ❌ WRONG (fetches ALL logs, filters locally — FORBIDDEN):

> ```bash

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | head -300

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | python3 filter.py

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | grep "request_id"

> ```

>

> ✅ RIGHT (server-side keyword filter — REQUIRED):

> ```bash

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall --keyword 0c6cbd91-5618-4705-8e08-9126bf4600f7 --page-size 500

> ```

[CRITICAL] Known CLI pitfall — keyword-only lookup is required for business logs:

  • When only --keyword is supplied (no time range), the CLI returns the full PAI-Rec application trace (controller.go / feed.go / recall.go / rank_service.go etc.) matching the request_id.
  • As soon as --start-time / --end-time are added — even if the window covers the real log timestamp — the CLI silently drops business logs and only returns infrastructure noise (/bin/sh wrapper heartbeats, 502 Bad Gateway retries, postgres.go dbstat).
  • Therefore: for request-level diagnosis, always omit the time range and rely on --keyword alone.

Notes:

  • --keyword: Use the full request_id extracted from the API response (case-sensitive exact match).
  • --page-size: Raise to 500 to capture the entire trace in a single page; total matched entries for one request is usually < 30.
  • --start-time / --end-time: Only use these for broad time-window scans without --keyword (e.g., when investigating non-request-specific issues). Required format is yyyy-MM-dd HH:mm:ss in UTC (space separator, no T / no Z). ISO-8601 forms like 2025-04-28T00:00:00Z will be rejected with InvalidParameter.

Step 4: List Engine Configurations

Map the environment and list matching configurations:

Environment Mapping:

  • productProd
  • prepubPre
aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --status Released \
  --name <config-name> > "$WORKDIR/engine_configs_list.json" 2>&1

[MUST] Always pass --name for server-side filtering:

  • is already known from Step 1 (ServiceConfig.envs.CONFIG_NAME); it MUST be forwarded to this call as --name.
  • Omitting --name returns the entire instance's config inventory (often hundreds of unrelated entries), forces client-side filtering, wastes tokens, and risks hitting CLI default pagination so the target row is silently dropped.
  • --name is an exact-match filter on the server; do NOT substitute with grep / jq select post-processing.
  • The same rule applies to every list-engine-configs invocation in this skill (Workflow 2 Step 1 included).

What to extract:

  • Find the configuration with Status: Released
  • Get EngineConfigId and Version

Step 5: Get Engine Configuration Details

aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id> > "$WORKDIR/raw_engine_config.json" 2>&1

[MUST] Sanitize before display — Config may contain plaintext passwords or

access keys. Always pipe through the sanitizer before printing to terminal:

python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json"

Only the sanitized output (with credentials replaced by REDACTED) should

appear in the terminal. The raw file at $WORKDIR/raw_engine_config.json can be

passed directly to scripts/validate.py for validation (validate.py does not

print credential values).

What to extract:

  • ConfigValue: The actual engine configuration (JSON/YAML)

Step 5.5 (Optional): Static Config Sanity Check

Optionally run scripts/validate.py against the retrieved ConfigValue to quickly

rule out structural / reference / naming errors in the engine configuration

before diving into the log trace. See Workflow 2 § Step 3 and

references/config-validation.md for usage,

exit codes, and the full rule list.

printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin

When to run: when the log trace points at a specific configuration element

(e.g. a RecallConfs / FilterConfs / SceneConfs entry), or when the

configuration is being diagnosed for the first time in this skill session.

When to skip: when the log trace already shows a decisive non-config root

cause (e.g. a scene_id not present in SceneConfs, a 5xx from an upstream

EAS dependency, a missing feature table). validate.py is a static checker and

cannot detect request-time mismatches between client input and configuration.

[MUST] Scoping rule for the final report:

  • validate.py findings may enter the final diagnosis ONLY when they are

directly tied to the log evidence for the current request_id

(e.g. the log blames a RecallConf name that validate.py flags as

duplicated or dangling).

  • Findings unrelated to the current request_id trace MUST NOT be added to the

final conclusion. They remain an internal sanity-check signal only. This

preserves the evidence-only reporting rule in Step 6.

Step 6: Comprehensive Analysis

Analyze the following components together:

  1. API Response: Error code, message, and returned data
  2. Service Logs: Trace logs for the request_id showing processing flow
  3. Engine Configuration: Settings that may affect the behavior

Common Issues to Check:

  • Configuration mismatches (e.g., recall settings, filtering rules)
  • Resource limitations (e.g., insufficient items, timeout settings)
  • Data source issues (e.g., table access, feature availability)
  • Environment inconsistencies (e.g., prod config in prepub environment)

[MUST] Evidence-only reporting rule:

The final diagnosis delivered to the user MUST be grounded strictly in what the EAS service logs and the engine configuration directly show. Apply the following constraints:

  • Report only what is observed. Quote the exact log line (file:line, level, message) and the exact config fragment that proves each claim.
  • State the direct causal chain from log evidence to the API response, and stop there.
  • Do NOT add any of the following unless the user explicitly asks:
  • Speculative root causes not visible in logs/config (e.g., "client probably sent wrong X")
  • Fix recommendations or remediation steps
  • Conditional "if X then Y" scenarios
  • Tangential best-practice advice (security, fallback design, naming, etc.)
  • Guesses about upstream systems, client code, or data sources not covered by the logs/config
  • If the evidence is insufficient to reach a conclusion, state explicitly what additional data (specific log lines, other config versions, other environments) is needed, instead of guessing.
  • Recommendations are opt-in only. Provide fixes/suggestions only when the user explicitly requests them in a follow-up.

Workflow 2: PAI-Rec Engine Configuration Validation

This workflow validates engine configurations for potential issues.

Input: Configuration name and environment (Prod/Pre)

Step 1: List Configuration Versions

If user doesn't provide engine-config-id, list available versions:

aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --name <config-name>

Display to user:

  • Version: Version number
  • Status: Configuration status (Released/Draft/Archived)
  • GmtCreateTime: Creation timestamp
  • EngineConfigId: Version ID

Ask user to select a version or provide the engine-config-id.

Step 2: Retrieve Configuration Details

aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id> > "$WORKDIR/raw_engine_config.json" 2>&1

[MUST] Sanitize before display — Always sanitize before printing to terminal:

python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json"

Step 3: Run Schema + Rule Validation

[MUST] Feed the extracted ConfigValue JSON into scripts/validate.py. The script

enforces JSON Schema (references/schema.json) + reference-consistency rules and exits

with status 0 on pass, 1 on failure.

# From stdin (recommended when ConfigValue is already in memory)
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin

# From a saved JSON file
python3 scripts/validate.py "$WORKDIR/raw_engine_config.json"

# From an inline JSON string
python3 scripts/validate.py '{"RunMode":"product","RecallConfs":[...]}'

Requires jsonschema (pip install jsonschema); if missing the script falls back to

rule-only validation without Schema checks.

What the script checks (summary):

  1. Structure — JSON well-formedness, required fields, types (RunMode,

RecallConfs, FilterConfs, SortConfs, AlgoConfs, SceneConfs, RankConf,

FeatureConfs, UserFeatureConfs, DebugConfs, FeatureLogConfs,

CallBackConfs, PipelineConfs, etc.)

  1. Enum valuesRecallType / FilterType / SortType / RunMode /

DebugConfs.OutputType / GeneralRankConfs.ActionConfs[].ActionType

  1. Reference consistencySceneConfs.RecallNamesRecallConfs;

FilterNamesFilterConfs; SortNamesSortConfs;

RankConf.RankAlgoListAlgoConfs; any DaoConf.AdapterType +

Name → the corresponding Confs (Hologres / Redis / MySQL / TableStore /

FeatureStore / …)

  1. Business rules
    • User2ItemExposureFilter with WriteLog=true + FeatureStore adapter: must set

TimeInterval > 0

  • PriorityAdjustCountFilter in accumulator mode: Count must be strictly

increasing (use Type="fix" for independent per-recall caps)

  • PipelineConfs.*.Name must be globally unique
  • DebugConfs.Rate must be an integer in [0, 100]
  1. Duplicate name detection within RecallConfs, FilterConfs, SortConfs,

AlgoConfs

Detailed usage, exit codes, example outputs and the full rule list live in

references/config-validation.md.

Step 4: Evidence-Grounded Report

Report to the user based strictly on the script's output plus any additional

inspection of ConfigValue:

  • ✅ Checks passed (Schema clean, references resolved, no rule violations)
  • ⚠️ Warnings reported by the script (severity=warning) or inconsistencies

observed in ConfigValue (e.g. naming collisions between RankScore variables

and model output fields, env/region mismatches)

  • ❌ Errors reported by the script (severity=error) or missing required fields
  • Missing-evidence notes — what extra data (other config versions, model

signatures, etc.) would be needed to turn a warning into a confirmed error

Do not add speculative fixes or best-practice tangents; suggestions are provided

only when the user explicitly asks for them.


Success Verification Method

For detailed verification steps, see references/verification-method.md.

Quick Verification:

  1. For Diagnosis Workflow:
    • Service information retrieved successfully
    • Logs found containing the request_id
    • Configuration loaded correctly
    • Root cause identified
  1. For Validation Workflow:
    • Configuration retrieved successfully
    • All validation checks executed
    • Issues clearly reported
    • Recommendations provided (if applicable)

Cleanup

This skill performs read-only Alibaba Cloud API calls (no remote resources are

created). Transient artifacts are written to a per-session local working

directory $WORKDIR under /tmp (see Core Workflow preamble). The skill does

NOT delete $WORKDIR automatically — the OS-level temporary file policy is

relied on for eventual reclamation (macOS reaps /tmp periodically; most Linux

distros reap on reboot or via systemd-tmpfiles).

If an operator wants to free disk space sooner, they may manually run

rm -rf /tmp/pairec-diag-* outside the workflow.


Best Practices

  1. Always capture request_id: When reporting API issues, include the full response with request_id for accurate log correlation.
  1. Log queries — keyword only, no time range, no local filtering: For request-level diagnosis, pass --keyword to aliyun eas describe-service-log and leave --start-time / --end-time unset. NEVER omit --keyword and post-process locally (e.g., | head, | grep, | python3) — this defeats server-side filtering, wastes tokens, and may miss logs beyond the first page. Combining keyword with a time range filters out business logs due to a CLI quirk (see Workflow 1, Step 3). Only use time ranges for broad non-request scans, and only with the yyyy-MM-dd HH:mm:ss UTC format (no T / no Z).
  1. Environment awareness: Always verify that configurations match the target environment (Prod vs Pre).
  1. Version control: When validating configurations, check multiple versions if issues persist across deployments.
  1. Log retention: EAS service logs are retained for limited periods; diagnose issues promptly after occurrence.
  1. Configuration backup: Before applying changes based on validation results, ensure current configurations are backed up.
  1. Cross-reference: Compare working configurations with problematic ones to identify differences.
  1. Service status: Check EAS service status before diagnosing; service-level issues may mask configuration problems.
  1. Evidence-only conclusions: Ground every statement in the diagnosis on a specific log line or config fragment. Do not speculate, do not propose fixes, and do not volunteer best-practice advice unless the user explicitly asks. If the evidence is insufficient, say what is missing rather than inferring.
  1. Structured analysis: Follow the systematic workflow rather than jumping to conclusions based on error messages alone.
  1. Document findings: Keep track of recurring issues and their resolutions for faster future diagnosis.

Reference Links

Reference DocumentDescription
---------------------------------
RAM PoliciesRequired RAM permissions for PAI-Rec and EAS APIs
Related CommandsComplete CLI command reference
Verification MethodDetailed verification procedures
CLI Installation GuideAlibaba Cloud CLI installation instructions
Configuration ExamplesSample engine configurations and common patterns
Config Validationscripts/validate.py usage, exit codes, rule catalogue
Troubleshooting GuideCommon issues and solutions
Config SanitizationCredential redaction before LLM analysis

版本历史

共 2 个版本

  • v0.0.2 当前
    2026-05-29 13:42
  • v0.0.1
    2026-05-09 17:08 安全 安全

安全检测

腾讯云安全 (Keen)

队列中

腾讯云安全 (Sanbu)

队列中

🔗 相关推荐

Alibabacloud Rds Copilot

sdk-team
阿里云RDS Copilot智能运维助手技能。用于RDS相关智能问答、SQL优化、实例运维和故障排查。
★ 1 📥 826

Alibabacloud Lindorm Agent Skill

sdk-team
阿里云Lindorm云原生多模型数据库技能,涵盖实例管理、监控、性能、存储、连接、备份、迁移等。
★ 1 📥 614

Alibabacloud Pds Intelligent Workspace

sdk-team
阿里云 PDS(智能云盘/网盘)文件操作技能。支持:文件搜索、文件上传、文件下载、文档/音视频分析、打包下载、图像编辑(缩放、裁剪、旋转、分割、移除、水印等)、以图搜图、挂载网盘、文件分享链接管理。 当用户提到 PDS、网盘、云盘、个人空间
★ 0 📥 622