This skill provides comprehensive diagnostic and validation capabilities for Alibaba Cloud PAI-Rec (Programmable Recommendation System) engines, including interface troubleshooting and configuration analysis.
PAI-Rec is Alibaba Cloud's programmable recommendation system that provides intelligent recommendation capabilities. This skill helps users:
Architecture: PAI-EAS Service + PAI-Rec Engine + Engine Configuration Management
Pre-check: Aliyun CLI >= 3.3.3 required
> Run aliyun version to verify >= 3.3.3. If not installed or version too low,
> run curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash to install/update,
> or see references/cli-installation-guide.md for installation instructions.
Pre-check: Aliyun CLI plugin update required
> [MUST] run aliyun configure set --auto-plugin-install true to enable automatic plugin installation.
> [MUST] run aliyun plugin update to ensure that any existing plugins are always up-to-date.
> Pre-check: Alibaba Cloud Credentials Required
>
> Security Rules:
> - NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)
> - NEVER ask the user to input AK/SK directly in the conversation or command line
> - NEVER use aliyun configure set with literal credential values
> - ONLY use aliyun configure list to check credential status
>
> ```bash
> aliyun configure list
> ```
> Check the output for a valid profile (AK, STS, or OAuth identity).
>
> If no valid profile exists, STOP here.
> 1. Obtain credentials from Alibaba Cloud Console
> 2. Configure credentials outside of this session (via aliyun configure in terminal or environment variables in shell profile)
> 3. Return and re-run after aliyun configure list shows a valid profile
> [MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
> 1. Read references/ram-policies.md to get the full list of permissions required by this SKILL
> 2. Use ram-permission-diagnose skill to guide the user through requesting the necessary permissions
> 3. Pause and wait until the user confirms that the required permissions have been granted
For detailed RAM policies required by this skill, see references/ram-policies.md.
> IMPORTANT: Parameter Confirmation — Before executing any command or API call,
> ALL user-customizable parameters (e.g., RegionId, service names, instance IDs,
> environment names, config names, etc.) MUST be confirmed with the user.
> Do NOT assume or use default values without explicit user approval.
| Parameter Name | Required/Optional | Description | Default Value |
|---|---|---|---|
| service-name | Required | PAI-EAS service name | N/A |
| cluster-id | Required | EAS cluster/region ID | N/A |
| instance-id | Required | PAI-Rec instance ID | N/A |
| request-id | Required (for diagnosis) | Request ID from engine API response | N/A |
| config-name | Optional | Engine configuration name | N/A |
| environment | Optional | Engine environment (Prod/Pre) | N/A |
| engine-config-id | Optional | Specific engine config version ID | N/A |
| region | Optional | Alibaba Cloud region | N/A |
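Before asking the user to confirm values, it can help to check which required parameters are still missing. The following is a minimal sketch, assuming the collected parameters are held in a plain dictionary keyed by the names in the table; the function name `missing_parameters` is hypothetical and not part of this skill's scripts.

```python
# Names taken from the "Required" rows of the parameter table.
REQUIRED = ("service-name", "cluster-id", "instance-id")
REQUIRED_FOR_DIAGNOSIS = ("request-id",)


def missing_parameters(params, diagnosing=True):
    """List required parameter names that are absent or blank in params."""
    names = REQUIRED + (REQUIRED_FOR_DIAGNOSIS if diagnosing else ())
    return [name for name in names if not str(params.get(name) or "").strip()]


collected = {"service-name": "embedding_recall", "cluster-id": "cn-beijing"}
print(missing_parameters(collected))
```

A non-empty result means the user still has to supply those values; a complete set still needs explicit confirmation before any command runs.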
> At the start of the Core Workflow (before any CLI invocation):
> [MUST] Enable AI-Mode — AI-mode is required for Agent Skill execution.
> Run the following commands before any CLI invocation:
> ```bash
> aliyun configure ai-mode enable
> aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"
> ```
> [MUST] Disable AI-Mode at EVERY exit point — Before delivering the final response for ANY reason, always disable AI-mode first. This applies to ALL exit paths: workflow success, workflow failure, error/exception, user cancellation, session end, or any other scenario where no further CLI commands will be executed.
> AI-mode is only used for Agent Skill invocation scenarios and MUST NOT remain enabled after the skill stops running.
> ```bash
> aliyun configure ai-mode disable
> ```
> [MUST] Allocate a per-session work directory — All transient artifacts
> (raw config dumps, log dumps, sanitized output) MUST be written under a unique
> per-session directory to avoid concurrent overwrite between parallel skill
> invocations. Run the following at the start of the workflow, before any
> artifact-producing CLI invocation, and reuse $WORKDIR for the whole session:
> ```bash
> # Create an isolated working directory for this session, pinned under /tmp.
> # Pass the full template as a positional arg (works on both BSD/macOS and
> # GNU/Linux mktemp); do NOT use -t prefix, which falls back to $TMPDIR
> # (e.g. /var/folders on macOS) and may even fail under sandboxed shells.
> export WORKDIR=$(mktemp -d /tmp/pairec-diag-XXXXXX)
> ```
> All file paths shown in the steps below ($WORKDIR/engine_configs_list.json,
> $WORKDIR/raw_engine_config.json, etc.) live inside this directory and MUST NOT
> be replaced with hard-coded /tmp/... paths.
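The three preamble rules (enable AI-mode, allocate a work directory, disable AI-mode on every exit path) could be tied together in one wrapper. The sketch below is an illustration only, assuming the workflow is driven from Python; `diagnosis_session` and `run_cli` are hypothetical names, and the `run` parameter exists so the CLI call can be swapped out.

```python
import contextlib
import subprocess
import tempfile

USER_AGENT = "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"


def run_cli(args):
    """Run one aliyun CLI command and raise if it exits non-zero."""
    subprocess.run(["aliyun", *args], check=True)


@contextlib.contextmanager
def diagnosis_session(run=run_cli):
    """Enable AI-mode, yield a fresh work directory, always disable AI-mode."""
    try:
        run(["configure", "ai-mode", "enable"])
        run(["configure", "ai-mode", "set-user-agent", "--user-agent", USER_AGENT])
        # Comparable to: mktemp -d /tmp/pairec-diag-XXXXXX
        workdir = tempfile.mkdtemp(prefix="pairec-diag-", dir="/tmp")
        yield workdir
    finally:
        # Reached on success, failure, and cancellation alike.
        run(["configure", "ai-mode", "disable"])
```

Every step of the workflow would then run inside `with diagnosis_session() as workdir:`, so the disable command is issued even when a step raises.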
This workflow helps diagnose issues when a PAI-Rec engine API returns errors or unexpected results.
Input Example:
Service Name: embedding_recall
API Response:
{
"code": 299,
"msg": "items size not enough",
"request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
"size": 0,
"experiment_id": "",
"items": []
}
Get the service details to find the EAS service ID and configuration:
aliyun eas describe-service \
--cluster-id <cluster-id> \
--service-name <service-name>
What to extract:
- Resource: EAS service resource ID (e.g., eas-r-1v4qb1yan3qmnjwxqe)
- ServiceConfig.envs: Environment variables containing:
  - REGION: The region
  - INSTANCE_ID: PAI-Rec instance ID
  - CONFIG_NAME: Engine configuration name
  - PAIREC_ENVIRONMENT: Environment (product/prepub)

Parse the API response JSON to get the request_id field. This will be used to search service logs.
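Extracting the request ID is a plain JSON lookup. The sketch below uses the sample response from the input example; the helper name `extract_request_id` is hypothetical.

```python
import json


def extract_request_id(response_text):
    """Return the request_id from an engine API response, or raise ValueError."""
    payload = json.loads(response_text)
    if not isinstance(payload, dict):
        raise ValueError("engine response is not a JSON object")
    request_id = payload.get("request_id", "")
    if not isinstance(request_id, str) or not request_id.strip():
        raise ValueError("engine response carries no request_id")
    return request_id.strip()


sample = """{
  "code": 299,
  "msg": "items size not enough",
  "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
  "size": 0,
  "experiment_id": "",
  "items": []
}"""
print(extract_request_id(sample))  # 941b4e14-d1c5-489f-a184-b2b17f8b4fdb
```

Failing loudly when the field is missing keeps the workflow from issuing a log search with an empty keyword.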
Use the request ID as the sole filter to search service logs. Do NOT pass --start-time / --end-time when searching PAI-Rec business logs:
aliyun eas describe-service-log \
--cluster-id <cluster-id> \
--service-name <service-name> \
--keyword <request-id> \
--page-size 500
[CRITICAL] --keyword is MANDATORY — local post-processing is FORBIDDEN:
- ALWAYS pass --keyword to the CLI command. This is a server-side filter; the API only returns log lines matching the keyword.
- NEVER omit --keyword and then filter locally (e.g., piping through head, grep, python3, jq, or any script to search for the request_id in the output).
- NEVER issue describe-service-log calls without --keyword hoping to find relevant lines by scanning the full log stream.
- If --keyword returns empty results, report that no matching logs were found; do NOT fall back to fetching unfiltered logs.

> ❌ WRONG (fetches ALL logs, filters locally; FORBIDDEN):
> ```bash
> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | head -300
> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | python3 filter.py
> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | grep "request_id"
> ```
>
> ✅ RIGHT (server-side keyword filter — REQUIRED):
> ```bash
> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall --keyword 0c6cbd91-5618-4705-8e08-9126bf4600f7 --page-size 500
> ```
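When the command is assembled programmatically, the keyword-only rule can be enforced at build time. This is a sketch under that assumption; `build_log_query` is a hypothetical helper, and it only uses flags that appear in the commands shown in this step.

```python
def build_log_query(cluster_id, service_name, request_id, page_size=500):
    """Build argv for a keyword-only `aliyun eas describe-service-log` call."""
    keyword = request_id.strip()
    if not keyword:
        # Refuse to build an unfiltered query; the workflow forbids it.
        raise ValueError("request_id is required as --keyword")
    return [
        "aliyun", "eas", "describe-service-log",
        "--cluster-id", cluster_id,
        "--service-name", service_name,
        "--keyword", keyword,
        "--page-size", str(page_size),
    ]


argv = build_log_query(
    "cn-beijing", "embedding_recall", "0c6cbd91-5618-4705-8e08-9126bf4600f7"
)
print(" ".join(argv))
```

The helper deliberately exposes no time-range arguments, so a request-specific lookup cannot accidentally combine a keyword with a time window.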
[CRITICAL] Known CLI pitfall — keyword-only lookup is required for business logs:
- When only --keyword is supplied (no time range), the CLI returns the full PAI-Rec application trace (controller.go / feed.go / recall.go / rank_service.go etc.) matching the request_id.
- When --start-time / --end-time are added, even if the window covers the real log timestamp, the CLI silently drops business logs and only returns infrastructure noise (/bin/sh wrapper heartbeats, 502 Bad Gateway retries, postgres.go dbstat).
- For request-specific lookups, always query with --keyword alone.

Notes:
- --keyword: Use the full request_id extracted from the API response (case-sensitive exact match).
- --page-size: Raise to 500 to capture the entire trace in a single page; the total number of matched entries for one request is usually < 30.
- --start-time / --end-time: Only use these for broad time-window scans without --keyword (e.g., when investigating non-request-specific issues). The required format is yyyy-MM-dd HH:mm:ss in UTC (space separator, no T / no Z). ISO-8601 forms like 2025-04-28T00:00:00Z will be rejected with InvalidParameter.

Map the environment and list matching configurations:
Environment Mapping:
- product → Prod
- prepub → Pre

aliyun pairecservice list-engine-configs \
--instance-id <instance-id> \
--environment <Prod|Pre> \
--status Released \
--name <config-name> > "$WORKDIR/engine_configs_list.json" 2>&1
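The environment mapping and the mandatory name filter can be expressed together when the call is built in code. A minimal sketch, assuming the PAIREC_ENVIRONMENT and CONFIG_NAME values from Step 1 are already in hand; `build_list_configs_args` is a hypothetical helper and the instance ID in the example is a placeholder.

```python
# Mapping from PAIREC_ENVIRONMENT to the --environment value.
ENVIRONMENT_MAP = {"product": "Prod", "prepub": "Pre"}


def build_list_configs_args(instance_id, pairec_environment, config_name):
    """Build argv for a server-side filtered list-engine-configs call."""
    if pairec_environment not in ENVIRONMENT_MAP:
        raise ValueError(f"unmapped PAIREC_ENVIRONMENT: {pairec_environment!r}")
    if not config_name:
        raise ValueError("config name is required for --name")
    return [
        "aliyun", "pairecservice", "list-engine-configs",
        "--instance-id", instance_id,
        "--environment", ENVIRONMENT_MAP[pairec_environment],
        "--status", "Released",
        "--name", config_name,
    ]


# "<instance-id>" is a placeholder, not a real instance.
print(build_list_configs_args("<instance-id>", "prepub", "embedding_recall"))
```

Rejecting an unknown environment value, rather than passing it through, avoids silently querying the wrong environment.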
[MUST] Always pass --name for server-side filtering:
- The config name is already known from Step 1 (ServiceConfig.envs.CONFIG_NAME); it MUST be forwarded to this call as --name.
- Omitting --name returns the entire instance's config inventory (often hundreds of unrelated entries), forces client-side filtering, wastes tokens, and risks hitting CLI default pagination so the target row is silently dropped.
- --name is an exact-match filter on the server; do NOT substitute with grep / jq select post-processing.
- This rule applies to every list-engine-configs invocation in this skill (Workflow 2 Step 1 included).

What to extract:
- The configuration with Status: Released
- Its EngineConfigId and Version

aliyun pairecservice get-engine-config \
--instance-id <instance-id> \
--engine-config-id <engine-config-id> > "$WORKDIR/raw_engine_config.json" 2>&1
[MUST] Sanitize before display — Config may contain plaintext passwords or
access keys. Always pipe through the sanitizer before printing to terminal:
python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json"
Only the sanitized output (with credentials replaced by REDACTED) should
appear in the terminal. The raw file at $WORKDIR/raw_engine_config.json can be
passed directly to scripts/validate.py for validation (validate.py does not
print credential values).
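To make the redaction step concrete, here is a sketch of the general technique: walk the parsed config and replace values under sensitive-looking keys. It is not the implementation of scripts/sanitize_config.py; the key pattern and the sample config keys are assumptions chosen for illustration.

```python
import re

# Hypothetical key pattern; the real sanitizer may use a different rule set.
SENSITIVE_KEY = re.compile(r"password|passwd|secret|token|access_?key", re.IGNORECASE)


def redact(node):
    """Return a copy of a parsed config with sensitive scalar values replaced."""
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            is_scalar = not isinstance(value, (dict, list))
            if is_scalar and SENSITIVE_KEY.search(key):
                cleaned[key] = "REDACTED"
            else:
                cleaned[key] = redact(value)
        return cleaned
    if isinstance(node, list):
        return [redact(item) for item in node]
    return node


# Illustrative structure only; key names are not taken from a real config.
config = {"Stores": {"main": {"User": "rec", "Password": "p@ss"}}, "RecallConfs": [{"Name": "r1"}]}
print(redact(config))
```

This sketch inspects key names only, so a credential embedded inside a connection string would pass through; that is one reason to rely on the provided sanitizer script rather than an ad hoc filter.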
What to extract:
- ConfigValue: The actual engine configuration (JSON/YAML)

Optionally run scripts/validate.py against the retrieved ConfigValue to quickly
rule out structural / reference / naming errors in the engine configuration
before diving into the log trace. See Workflow 2 § Step 3 and
references/config-validation.md for usage,
exit codes, and the full rule list.
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin
When to run: when the log trace points at a specific configuration element
(e.g. a RecallConfs / FilterConfs / SceneConfs entry), or when the
configuration is being diagnosed for the first time in this skill session.
When to skip: when the log trace already shows a decisive non-config root
cause (e.g. a scene_id not present in SceneConfs, a 5xx from an upstream
EAS dependency, a missing feature table). validate.py is a static checker and
cannot detect request-time mismatches between client input and configuration.
[MUST] Scoping rule for the final report:
- validate.py findings may enter the final diagnosis ONLY when they are directly tied to the log evidence for the current request_id (e.g. the log blames a RecallConf name that validate.py flags as duplicated or dangling).
- Findings unrelated to the current request_id trace MUST NOT be added to the final conclusion. They remain an internal sanity-check signal only. This preserves the evidence-only reporting rule in Step 6.
Analyze the following components together:
Common Issues to Check:
[MUST] Evidence-only reporting rule:
The final diagnosis delivered to the user MUST be grounded strictly in what the EAS service logs and the engine configuration directly show. Apply the following constraints:
This workflow validates engine configurations for potential issues.
Input: Configuration name and environment (Prod/Pre)
If user doesn't provide engine-config-id, list available versions:
aliyun pairecservice list-engine-configs \
--instance-id <instance-id> \
--environment <Prod|Pre> \
--name <config-name>
Display to user:
- Version: Version number
- Status: Configuration status (Released/Draft/Archived)
- GmtCreateTime: Creation timestamp
- EngineConfigId: Version ID

Ask user to select a version or provide the engine-config-id.
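The version list could be rendered for the user along these lines. This is a sketch only: it assumes the entries have already been parsed into a list of dictionaries carrying the four fields above, and that GmtCreateTime values sort chronologically as strings. The sample entries are made up, and ordering newest first is a presentation choice rather than a requirement of the workflow.

```python
def format_versions(configs):
    """Render one line per config version, newest first by GmtCreateTime."""
    ordered = sorted(configs, key=lambda c: c["GmtCreateTime"], reverse=True)
    return [
        f"{c['Version']}  {c['Status']}  {c['GmtCreateTime']}  {c['EngineConfigId']}"
        for c in ordered
    ]


# Hypothetical entries for illustration; not real API output.
versions = [
    {"Version": "1", "Status": "Archived", "GmtCreateTime": "2025-01-01T00:00:00Z", "EngineConfigId": "id-1"},
    {"Version": "2", "Status": "Released", "GmtCreateTime": "2025-02-01T00:00:00Z", "EngineConfigId": "id-2"},
]
for line in format_versions(versions):
    print(line)
```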
aliyun pairecservice get-engine-config \
--instance-id <instance-id> \
--engine-config-id <engine-config-id> > "$WORKDIR/raw_engine_config.json" 2>&1
[MUST] Sanitize before display — Always sanitize before printing to terminal:
python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json"
[MUST] Feed the extracted ConfigValue JSON into scripts/validate.py. The script
enforces JSON Schema (references/schema.json) + reference-consistency rules and exits
with status 0 on pass, 1 on failure.
# From stdin (recommended when ConfigValue is already in memory)
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin
# From a saved JSON file
python3 scripts/validate.py "$WORKDIR/raw_engine_config.json"
# From an inline JSON string
python3 scripts/validate.py '{"RunMode":"product","RecallConfs":[...]}'
Requires jsonschema (pip install jsonschema); if missing the script falls back to
rule-only validation without Schema checks.
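If the validator is driven from Python rather than a shell, the exit status can be read as follows. This is a sketch, assuming the script path and `--stdin` flag shown above and treating any non-zero status as a failure; `run_validator` is a hypothetical wrapper, and the `cmd` parameter is there so a different interpreter or path can be substituted.

```python
import subprocess
import sys

VALIDATOR_CMD = (sys.executable, "scripts/validate.py", "--stdin")


def run_validator(config_value, cmd=VALIDATOR_CMD):
    """Feed ConfigValue to the validator on stdin; return (passed, output)."""
    result = subprocess.run(
        list(cmd), input=config_value, capture_output=True, text=True
    )
    # Exit status 0 means pass; any other status is treated as failure here.
    return result.returncode == 0, result.stdout + result.stderr
```

A call such as `passed, report = run_validator(config_value)` keeps the raw ConfigValue off the command line, which matters because the unsanitized config may contain credentials.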
What the script checks (summary):
- Top-level sections (RunMode, RecallConfs, FilterConfs, SortConfs, AlgoConfs, SceneConfs, RankConf, FeatureConfs, UserFeatureConfs, DebugConfs, FeatureLogConfs, CallBackConfs, PipelineConfs, etc.)
- Enumerated values: RecallType / FilterType / SortType / RunMode / DebugConfs.OutputType / GeneralRankConfs.ActionConfs[].ActionType
- Name references: SceneConfs.RecallNames → RecallConfs; FilterNames → FilterConfs; SortNames → SortConfs; RankConf.RankAlgoList → AlgoConfs; any DaoConf.AdapterType + Name → the corresponding Confs (Hologres / Redis / MySQL / TableStore / FeatureStore / …)
- User2ItemExposureFilter with WriteLog=true + FeatureStore adapter: must set TimeInterval > 0
- PriorityAdjustCountFilter in accumulator mode: Count must be strictly increasing (use Type="fix" for independent per-recall caps)
- PipelineConfs.*.Name must be globally unique
- DebugConfs.Rate must be an integer in [0, 100]
- Duplicate names within RecallConfs, FilterConfs, SortConfs, AlgoConfs
Detailed usage, exit codes, example outputs and the full rule list live in
references/config-validation.md.
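As an illustration of what a reference-consistency rule looks like, the sketch below checks recall names only. It is not the code in scripts/validate.py. It assumes a simplified shape in which RecallConfs is a list of objects with a Name field and SceneConfs maps scene, then category, to an object holding RecallNames; the message wording is also made up.

```python
from collections import Counter


def check_recall_references(config):
    """Return messages for duplicated and dangling recall names."""
    names = [conf.get("Name", "") for conf in config.get("RecallConfs", [])]
    problems = [
        f"duplicate RecallConfs name: {name}"
        for name, count in Counter(names).items()
        if count > 1
    ]
    defined = set(names)
    for scene, categories in config.get("SceneConfs", {}).items():
        for category, scene_conf in categories.items():
            for name in scene_conf.get("RecallNames", []):
                if name not in defined:
                    problems.append(
                        f"SceneConfs.{scene}.{category} references unknown recall: {name}"
                    )
    return problems


demo = {
    "RecallConfs": [{"Name": "a"}, {"Name": "a"}],
    "SceneConfs": {"home": {"default": {"RecallNames": ["a", "b"]}}},
}
print(check_recall_references(demo))
```

An empty list means every referenced recall is defined exactly once under these assumptions; the same pattern would extend to filter, sort, and algorithm names.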
Report to the user based strictly on the script's output plus any additional
inspection of ConfigValue:
- Issues observed in ConfigValue (e.g. naming collisions between RankScore variables and model output fields, env/region mismatches)
- What additional evidence (signatures, etc.) would be needed to turn a warning into a confirmed error
Do not add speculative fixes or best-practice tangents; suggestions are provided
only when the user explicitly asks for them.
For detailed verification steps, see references/verification-method.md.
Quick Verification:
This skill performs read-only Alibaba Cloud API calls (no remote resources are
created). Transient artifacts are written to a per-session local working
directory $WORKDIR under /tmp (see Core Workflow preamble). The skill does
NOT delete $WORKDIR automatically — the OS-level temporary file policy is
relied on for eventual reclamation (macOS reaps /tmp periodically; most Linux
distros reap on reboot or via systemd-tmpfiles).
If an operator wants to free disk space sooner, they may manually run
rm -rf /tmp/pairec-diag-* outside the workflow.
ALWAYS pass --keyword to aliyun eas describe-service-log and leave --start-time / --end-time unset. NEVER omit --keyword and post-process locally (e.g., | head, | grep, | python3): this defeats server-side filtering, wastes tokens, and may miss logs beyond the first page. Combining a keyword with a time range filters out business logs due to a CLI quirk (see Workflow 1, Step 3). Only use time ranges for broad non-request scans, and only with the yyyy-MM-dd HH:mm:ss UTC format (no T / no Z).

| Reference Document | Description |
|---|---|
| RAM Policies | Required RAM permissions for PAI-Rec and EAS APIs |
| Related Commands | Complete CLI command reference |
| Verification Method | Detailed verification procedures |
| CLI Installation Guide | Alibaba Cloud CLI installation instructions |
| Configuration Examples | Sample engine configurations and common patterns |
| Config Validation | scripts/validate.py usage, exit codes, rule catalogue |
| Troubleshooting Guide | Common issues and solutions |
| Config Sanitization | Credential redaction before LLM analysis |