PAI-Rec Engine Diagnosis and Configuration Validation

This skill provides comprehensive diagnostic and validation capabilities for Alibaba Cloud PAI-Rec (Programmable Recommendation System) engines, including interface troubleshooting and configuration analysis.

Scenario Description

PAI-Rec is Alibaba Cloud's programmable recommendation system that provides intelligent recommendation capabilities. This skill helps users:

Diagnose PAI-Rec Engine Interface Issues: When engine API returns errors or unexpected results, trace the request through EAS service logs and engine configurations to identify root causes.

Validate Engine Configurations: Analyze engine configuration files for potential issues, inconsistencies, or misconfigurations before deployment.

Architecture: PAI-EAS Service + PAI-Rec Engine + Engine Configuration Management

Key Components

PAI-EAS Service: Elastic Algorithm Service hosting the recommendation engine
PAI-Rec Engine: The recommendation engine processing requests
Engine Configuration: Configuration files defining engine behavior
Service Logs: EAS service logs containing request traces

Installation

Pre-check: Aliyun CLI >= 3.3.3 required

> Run aliyun version to verify >= 3.3.3. If not installed or version too low,

> run curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash to install/update,

> or see references/cli-installation-guide.md for installation instructions.

Pre-check: Aliyun CLI plugin update required

> [MUST] run aliyun configure set --auto-plugin-install true to enable automatic plugin installation.

> [MUST] run aliyun plugin update to ensure that any existing plugins are always up-to-date.

Authentication

> Pre-check: Alibaba Cloud Credentials Required

> Security Rules:

> - NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)

> - NEVER ask the user to input AK/SK directly in the conversation or command line

> - NEVER use aliyun configure set with literal credential values

> - ONLY use aliyun configure list to check credential status

> ```bash

> aliyun configure list

> ```

> Check the output for a valid profile (AK, STS, or OAuth identity).

> If no valid profile exists, STOP here.

> 1. Obtain credentials from Alibaba Cloud Console

> 2. Configure credentials outside of this session (via aliyun configure in terminal or environment variables in shell profile)

> 3. Return and re-run after aliyun configure list shows a valid profile

RAM Policy

> [MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:

> 1. Read references/ram-policies.md to get the full list of permissions required by this SKILL

> 2. Use ram-permission-diagnose skill to guide the user through requesting the necessary permissions

> 3. Pause and wait until the user confirms that the required permissions have been granted

For detailed RAM policies required by this skill, see references/ram-policies.md.

Parameter Confirmation

> IMPORTANT: Parameter Confirmation — Before executing any command or API call,

> ALL user-customizable parameters (e.g., RegionId, service names, instance IDs,

> environment names, config names, etc.) MUST be confirmed with the user.

> Do NOT assume or use default values without explicit user approval.

Required Parameters

Parameter Name	Required/Optional	Description	Default Value
----------------	-------------------	-------------	---------------
`service-name`	Required	PAI-EAS service name	N/A
`cluster-id`	Required	EAS cluster/region ID	N/A
`instance-id`	Required	PAI-Rec instance ID	N/A
`request-id`	Required (for diagnosis)	Request ID from engine API response	N/A
`config-name`	Optional	Engine configuration name	N/A
`environment`	Optional	Engine environment (Prod/Pre)	N/A
`engine-config-id`	Optional	Specific engine config version ID	N/A
`region`	Optional	Alibaba Cloud region	N/A

Core Workflow

> At the start of the Core Workflow (before any CLI invocation):

> [MUST] Enable AI-Mode — AI-mode is required for Agent Skill execution.

> Run the following commands before any CLI invocation:

> ```bash

> aliyun configure ai-mode enable

> aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"

> ```

> [MUST] Disable AI-Mode at EVERY exit point — Before delivering the final response for ANY reason, always disable AI-mode first. This applies to ALL exit paths: workflow success, workflow failure, error/exception, user cancellation, session end, or any other scenario where no further CLI commands will be executed.

> AI-mode is only used for Agent Skill invocation scenarios and MUST NOT remain enabled after the skill stops running.

> ```bash

> aliyun configure ai-mode disable

> ```

> [MUST] Allocate a per-session work directory — All transient artifacts

> (raw config dumps, log dumps, sanitized output) MUST be written under a unique

> per-session directory to avoid concurrent overwrite between parallel skill

> invocations. Run the following at the start of the workflow, before any

> artifact-producing CLI invocation, and reuse $WORKDIR for the whole session:

> ```bash

> # Create an isolated working directory for this session, pinned under /tmp.

> # Pass the full template as a positional arg (works on both BSD/macOS and

> # GNU/Linux mktemp); do NOT use -t prefix, which falls back to $TMPDIR

> # (e.g. /var/folders on macOS) and may even fail under sandboxed shells.

> export WORKDIR=$(mktemp -d /tmp/pairec-diag-XXXXXX)

> ```

> All file paths shown in the steps below ($WORKDIR/engine_configs_list.json,

> $WORKDIR/raw_engine_config.json, etc.) live inside this directory and MUST NOT

> be replaced with hard-coded /tmp/... paths.

Workflow 1: PAI-Rec Engine Interface Diagnosis

This workflow helps diagnose issues when a PAI-Rec engine API returns errors or unexpected results.

Input Example:

Service Name: embedding_recall
API Response:
{
    "code": 299,
    "msg": "items size not enough",
    "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
    "size": 0,
    "experiment_id": "",
    "items": []
}

Step 1: Retrieve EAS Service Information

Get the service details to find the EAS service ID and configuration:

aliyun eas describe-service \
  --cluster-id <cluster-id> \
  --service-name <service-name>

What to extract:

Resource: EAS service resource ID (e.g., eas-r-1v4qb1yan3qmnjwxqe)
ServiceConfig.envs: Environment variables containing:
REGION: The region
INSTANCE_ID: PAI-Rec instance ID
CONFIG_NAME: Engine configuration name
PAIREC_ENVIRONMENT: Environment (product/prepub)

Step 2: Extract Request ID from API Response

Parse the API response JSON to get the request_id field. This will be used to search service logs.

Step 3: Query EAS Service Logs

Use the request ID as the sole filter to search service logs. Do NOT pass --start-time / --end-time when searching PAI-Rec business logs:

aliyun eas describe-service-log \
  --cluster-id <cluster-id> \
  --service-name <service-name> \
  --keyword <request-id> \
  --page-size 500

[CRITICAL] --keyword is MANDATORY — local post-processing is FORBIDDEN:

You MUST pass --keyword to the CLI command. This is a server-side filter; the API only returns log lines matching the keyword.
You MUST NOT omit --keyword and then filter locally (e.g., piping through head, grep, python3, jq, or any script to search for the request_id in the output).
You MUST NOT make multiple describe-service-log calls without --keyword hoping to find relevant lines by scanning the full log stream.
If a call with --keyword returns empty results, report that no matching logs were found — do NOT fall back to fetching unfiltered logs.

> ❌ WRONG (fetches ALL logs, filters locally — FORBIDDEN):

> ```bash

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | head -300

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | python3 filter.py

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | grep "request_id"

> ```

> ✅ RIGHT (server-side keyword filter — REQUIRED):

> ```bash

> aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall --keyword 0c6cbd91-5618-4705-8e08-9126bf4600f7 --page-size 500

> ```

[CRITICAL] Known CLI pitfall — keyword-only lookup is required for business logs:

When only --keyword is supplied (no time range), the CLI returns the full PAI-Rec application trace (controller.go / feed.go / recall.go / rank_service.go etc.) matching the request_id.
As soon as --start-time / --end-time are added — even if the window covers the real log timestamp — the CLI silently drops business logs and only returns infrastructure noise (/bin/sh wrapper heartbeats, 502 Bad Gateway retries, postgres.go dbstat).
Therefore: for request-level diagnosis, always omit the time range and rely on --keyword alone.

Notes:

--keyword: Use the full request_id extracted from the API response (case-sensitive exact match).
--page-size: Raise to 500 to capture the entire trace in a single page; total matched entries for one request is usually < 30.
--start-time / --end-time: Only use these for broad time-window scans without --keyword (e.g., when investigating non-request-specific issues). Required format is yyyy-MM-dd HH:mm:ss in UTC (space separator, no T / no Z). ISO-8601 forms like 2025-04-28T00:00:00Z will be rejected with InvalidParameter.

Step 4: List Engine Configurations

Map the environment and list matching configurations:

Environment Mapping:

product → Prod
prepub → Pre

aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --status Released \
  --name <config-name> > "$WORKDIR/engine_configs_list.json" 2>&1

[MUST] Always pass --name for server-side filtering:

is already known from Step 1 (ServiceConfig.envs.CONFIG_NAME); it MUST be forwarded to this call as --name.
Omitting --name returns the entire instance's config inventory (often hundreds of unrelated entries), forces client-side filtering, wastes tokens, and risks hitting CLI default pagination so the target row is silently dropped.
--name is an exact-match filter on the server; do NOT substitute with grep / jq select post-processing.
The same rule applies to every list-engine-configs invocation in this skill (Workflow 2 Step 1 included).

What to extract:

Find the configuration with Status: Released
Get EngineConfigId and Version

Step 5: Get Engine Configuration Details

aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id> > "$WORKDIR/raw_engine_config.json" 2>&1

[MUST] Sanitize before display — Config may contain plaintext passwords or

access keys. Always pipe through the sanitizer before printing to terminal:

python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json"

Only the sanitized output (with credentials replaced by REDACTED) should

appear in the terminal. The raw file at $WORKDIR/raw_engine_config.json can be

passed directly to scripts/validate.py for validation (validate.py does not

print credential values).

What to extract:

ConfigValue: The actual engine configuration (JSON/YAML)

Step 5.5 (Optional): Static Config Sanity Check

Optionally run scripts/validate.py against the retrieved ConfigValue to quickly

rule out structural / reference / naming errors in the engine configuration

before diving into the log trace. See Workflow 2 § Step 3 and

references/config-validation.md for usage,

exit codes, and the full rule list.

printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin

When to run: when the log trace points at a specific configuration element

(e.g. a RecallConfs / FilterConfs / SceneConfs entry), or when the

configuration is being diagnosed for the first time in this skill session.

When to skip: when the log trace already shows a decisive non-config root

cause (e.g. a scene_id not present in SceneConfs, a 5xx from an upstream

EAS dependency, a missing feature table). validate.py is a static checker and

cannot detect request-time mismatches between client input and configuration.

[MUST] Scoping rule for the final report:

validate.py findings may enter the final diagnosis ONLY when they are

directly tied to the log evidence for the current request_id

(e.g. the log blames a RecallConf name that validate.py flags as

duplicated or dangling).

Findings unrelated to the current request_id trace MUST NOT be added to the

final conclusion. They remain an internal sanity-check signal only. This

preserves the evidence-only reporting rule in Step 6.

Step 6: Comprehensive Analysis

Analyze the following components together:

API Response: Error code, message, and returned data
Service Logs: Trace logs for the request_id showing processing flow
Engine Configuration: Settings that may affect the behavior

Common Issues to Check:

Configuration mismatches (e.g., recall settings, filtering rules)
Resource limitations (e.g., insufficient items, timeout settings)
Data source issues (e.g., table access, feature availability)
Environment inconsistencies (e.g., prod config in prepub environment)

[MUST] Evidence-only reporting rule:

The final diagnosis delivered to the user MUST be grounded strictly in what the EAS service logs and the engine configuration directly show. Apply the following constraints:

Report only what is observed. Quote the exact log line (file:line, level, message) and the exact config fragment that proves each claim.
State the direct causal chain from log evidence to the API response, and stop there.
Do NOT add any of the following unless the user explicitly asks:
Speculative root causes not visible in logs/config (e.g., "client probably sent wrong X")
Fix recommendations or remediation steps
Conditional "if X then Y" scenarios
Tangential best-practice advice (security, fallback design, naming, etc.)
Guesses about upstream systems, client code, or data sources not covered by the logs/config
If the evidence is insufficient to reach a conclusion, state explicitly what additional data (specific log lines, other config versions, other environments) is needed, instead of guessing.
Recommendations are opt-in only. Provide fixes/suggestions only when the user explicitly requests them in a follow-up.

Workflow 2: PAI-Rec Engine Configuration Validation

This workflow validates engine configurations for potential issues.

Input: Configuration name and environment (Prod/Pre)

Step 1: List Configuration Versions

If user doesn't provide engine-config-id, list available versions:

aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --name <config-name>

Display to user:

Version: Version number
Status: Configuration status (Released/Draft/Archived)
GmtCreateTime: Creation timestamp
EngineConfigId: Version ID

Ask user to select a version or provide the engine-config-id.

Step 2: Retrieve Configuration Details

aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id> > "$WORKDIR/raw_engine_config.json" 2>&1

[MUST] Sanitize before display — Always sanitize before printing to terminal:

python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json"

Step 3: Run Schema + Rule Validation

[MUST] Feed the extracted ConfigValue JSON into scripts/validate.py. The script

enforces JSON Schema (references/schema.json) + reference-consistency rules and exits

with status 0 on pass, 1 on failure.

# From stdin (recommended when ConfigValue is already in memory)
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin

# From a saved JSON file
python3 scripts/validate.py "$WORKDIR/raw_engine_config.json"

# From an inline JSON string
python3 scripts/validate.py '{"RunMode":"product","RecallConfs":[...]}'

Requires jsonschema (pip install jsonschema); if missing the script falls back to

rule-only validation without Schema checks.

What the script checks (summary):

Structure — JSON well-formedness, required fields, types (RunMode,

RecallConfs, FilterConfs, SortConfs, AlgoConfs, SceneConfs, RankConf,

FeatureConfs, UserFeatureConfs, DebugConfs, FeatureLogConfs,

CallBackConfs, PipelineConfs, etc.)

Enum values — RecallType / FilterType / SortType / RunMode /

DebugConfs.OutputType / GeneralRankConfs.ActionConfs[].ActionType

Reference consistency — SceneConfs.RecallNames → RecallConfs;

FilterNames → FilterConfs; SortNames → SortConfs;

RankConf.RankAlgoList → AlgoConfs; any DaoConf.AdapterType +

Name → the corresponding Confs (Hologres / Redis / MySQL / TableStore /

FeatureStore / …)

Business rules

User2ItemExposureFilter with WriteLog=true + FeatureStore adapter: must set

TimeInterval > 0

PriorityAdjustCountFilter in accumulator mode: Count must be strictly

increasing (use Type="fix" for independent per-recall caps)

PipelineConfs.*.Name must be globally unique
DebugConfs.Rate must be an integer in [0, 100]

Duplicate name detection within RecallConfs, FilterConfs, SortConfs,

AlgoConfs

Detailed usage, exit codes, example outputs and the full rule list live in

references/config-validation.md.

Step 4: Evidence-Grounded Report

Report to the user based strictly on the script's output plus any additional

inspection of ConfigValue:

✅ Checks passed (Schema clean, references resolved, no rule violations)
⚠️ Warnings reported by the script (severity=warning) or inconsistencies

observed in ConfigValue (e.g. naming collisions between RankScore variables

and model output fields, env/region mismatches)

❌ Errors reported by the script (severity=error) or missing required fields
Missing-evidence notes — what extra data (other config versions, model

signatures, etc.) would be needed to turn a warning into a confirmed error

Do not add speculative fixes or best-practice tangents; suggestions are provided

only when the user explicitly asks for them.

Success Verification Method

For detailed verification steps, see references/verification-method.md.

Quick Verification:

For Diagnosis Workflow:

Service information retrieved successfully
Logs found containing the request_id
Configuration loaded correctly
Root cause identified

For Validation Workflow:

Configuration retrieved successfully
All validation checks executed
Issues clearly reported
Recommendations provided (if applicable)

Cleanup

This skill performs read-only Alibaba Cloud API calls (no remote resources are

created). Transient artifacts are written to a per-session local working

directory $WORKDIR under /tmp (see Core Workflow preamble). The skill does

NOT delete $WORKDIR automatically — the OS-level temporary file policy is

relied on for eventual reclamation (macOS reaps /tmp periodically; most Linux

distros reap on reboot or via systemd-tmpfiles).

If an operator wants to free disk space sooner, they may manually run

rm -rf /tmp/pairec-diag-* outside the workflow.

Best Practices

Always capture request_id: When reporting API issues, include the full response with request_id for accurate log correlation.

Log queries — keyword only, no time range, no local filtering: For request-level diagnosis, pass --keyword to aliyun eas describe-service-log and leave --start-time / --end-time unset. NEVER omit --keyword and post-process locally (e.g., | head, | grep, | python3) — this defeats server-side filtering, wastes tokens, and may miss logs beyond the first page. Combining keyword with a time range filters out business logs due to a CLI quirk (see Workflow 1, Step 3). Only use time ranges for broad non-request scans, and only with the yyyy-MM-dd HH:mm:ss UTC format (no T / no Z).

Environment awareness: Always verify that configurations match the target environment (Prod vs Pre).

Version control: When validating configurations, check multiple versions if issues persist across deployments.

Log retention: EAS service logs are retained for limited periods; diagnose issues promptly after occurrence.

Configuration backup: Before applying changes based on validation results, ensure current configurations are backed up.

Cross-reference: Compare working configurations with problematic ones to identify differences.

Service status: Check EAS service status before diagnosing; service-level issues may mask configuration problems.

Evidence-only conclusions: Ground every statement in the diagnosis on a specific log line or config fragment. Do not speculate, do not propose fixes, and do not volunteer best-practice advice unless the user explicitly asks. If the evidence is insufficient, say what is missing rather than inferring.

Structured analysis: Follow the systematic workflow rather than jumping to conclusions based on error messages alone.

Document findings: Keep track of recurring issues and their resolutions for faster future diagnosis.

Reference Links

Reference Document	Description
--------------------	-------------
RAM Policies	Required RAM permissions for PAI-Rec and EAS APIs
Related Commands	Complete CLI command reference
Verification Method	Detailed verification procedures
CLI Installation Guide	Alibaba Cloud CLI installation instructions
Configuration Examples	Sample engine configurations and common patterns
Config Validation	`scripts/validate.py` usage, exit codes, rule catalogue
Troubleshooting Guide	Common issues and solutions
Config Sanitization	Credential redaction before LLM analysis

Alibabacloud Pai Rec Diagnosis

概述

PAI-Rec Engine Diagnosis and Configuration Validation

Scenario Description

Key Components

Installation

Authentication

RAM Policy

Parameter Confirmation

Required Parameters

Core Workflow

Workflow 1: PAI-Rec Engine Interface Diagnosis

Step 1: Retrieve EAS Service Information

Step 2: Extract Request ID from API Response

Step 3: Query EAS Service Logs

Step 4: List Engine Configurations

Step 5: Get Engine Configuration Details

Step 5.5 (Optional): Static Config Sanity Check

Step 6: Comprehensive Analysis

Workflow 2: PAI-Rec Engine Configuration Validation

Step 1: List Configuration Versions

Step 2: Retrieve Configuration Details

Step 3: Run Schema + Rule Validation

Step 4: Evidence-Grounded Report

Success Verification Method

Cleanup

Best Practices

Reference Links

版本历史

安全检测

腾讯云安全 (Keen)

腾讯云安全 (Sanbu)

🔗 相关推荐

Alibabacloud Rds Copilot

Alibabacloud Lindorm Agent Skill

Alibabacloud Pds Intelligent Workspace