Generate comprehensive chaos engineering and high availability testing scenarios for a
specific AWS service. Uses a Scenario-Library-first approach: read the latest FIS
Scenario Library documentation for pre-built composite scenarios first, then query
individual FIS actions via list-actions, and finally supplement with deep documentation
research.
Detect the language of the user's conversation and use the same language for all output.
Required tools (at least one of each group):
FIS Scenario Library (Group A — documentation-based, always available):
- aws___read_documentation: read FIS Scenario Library pages directly (scenarios are
  console-only and cannot be queried via CLI, so reading the latest docs is the only way
  to discover them)

FIS Actions Discovery (Group B — use in order of preference):
- aws fis list-actions: definitive, real-time list of FIS actions from the user's region

Documentation Research (Group C):
- aws___search_documentation: search AWS official docs
- aws___read_documentation: read full doc pages
- aws___recommend: discover related pages

All documentation research uses only the AWS Knowledge MCP tools above.
Do NOT use SearXNG or other web search tools for documentation research.
CRITICAL — Sequential execution of all AWS Knowledge MCP calls:
All calls to aws___search_documentation, aws___read_documentation, and
aws___recommend MUST be executed one at a time, sequentially. NEVER send
multiple MCP requests in parallel — the aws-knowledge-mcp-server has strict rate
limits and will reject concurrent requests with "Too many requests" errors.
Wait for each request to return a complete response before sending the next one.
This applies to ALL steps below (Step 2, 4b, 4c, 5a, 5b).
Retry on failure: If any MCP call (especially aws___read_documentation) returns
a rate limit error ("Too many requests") or any other transient error, **retry up to
10 times** with a 5-second wait between retries. Only skip the request after all 10
retries have failed.
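
The retry policy above can be sketched as a small shell helper. This is only an illustration of the policy: the MCP tools are invoked by the agent rather than from a shell, and `retry_call`, `RETRY_MAX`, and `RETRY_WAIT` are hypothetical names introduced here, not part of any AWS tool.

```shell
# Hypothetical helper: run a command, then retry it on failure.
# RETRY_MAX (default 10) is the number of retries after the first attempt;
# RETRY_WAIT (default 5) is the pause in seconds between attempts.
retry_call() {
  local max_retries="${RETRY_MAX:-10}"
  local wait_s="${RETRY_WAIT:-5}"
  local retries=0
  until "$@"; do
    if [ "$retries" -ge "$max_retries" ]; then
      return 1                  # all retries failed: the caller may skip the request
    fi
    retries=$((retries + 1))
    sleep "$wait_s"             # wait before the next attempt
  done
  return 0
}
```

For example, `retry_call some_command arg` runs `some_command arg` once and then retries it up to ten more times, five seconds apart, before giving up.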
Multi-service requests: When the user asks about multiple services (e.g.,
"EKS, RDS, MSK, and ElastiCache"), process them one service at a time. Complete
all research steps (Steps 2-5) for one service before starting the next. Do NOT
launch parallel research for multiple services — this will trigger rate limiting.
The Scenario Library fetch (Step 2) only needs to run once since it covers all
services; the per-service steps (3-5) must be repeated sequentially for each service.
Extract the target AWS service from the user's message and determine the target region.
FIS actions can differ across AWS regions — some actions may be available in
us-east-1 but not yet in ap-southeast-1. Always determine the target region first,
because service keyword resolution depends on it.
Detection order (use the first one that applies):
- Run aws configure get region to get the configured default.
- If no default is configured, ask the user: "Which AWS region are you targeting? FIS actions and scenarios may vary by region."

Store the resolved region as TARGET_REGION for use in subsequent steps.
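
A minimal sketch of the detection order, assuming the AWS CLI is installed. The function name `resolve_region` and its optional first argument (a region the user has already stated) are assumptions made for illustration; a non-zero return means the caller should ask the user.

```shell
# Hypothetical helper: print the target region, or fail if none can be found.
# $1 (optional) is a region the user already named; this parameter is an
# assumption of this sketch.
resolve_region() {
  local region=""
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
    return 0
  fi
  # Fall back to the default region configured for the AWS CLI.
  region="$(aws configure get region 2>/dev/null)" || region=""
  if [ -n "$region" ]; then
    printf '%s\n' "$region"
    return 0
  fi
  return 1    # nothing found: ask the user which region they are targeting
}
```

Typical use would be `TARGET_REGION="$(resolve_region)" || echo "Which AWS region are you targeting?"`.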
FIS action IDs follow the pattern aws:<service>:<action>. To map the user's input
to the correct FIS service keyword, use dynamic discovery from the live FIS action list:
aws fis list-actions --region TARGET_REGION | jq '.actions[].id' | awk -F':' '{print $2}' | sort -u
This returns the definitive list of FIS-supported service keywords in that region
(e.g., ebs, ec2, ecs, eks, elasticache, fis, network, rds, s3, ssm...).
Match the user's service name against this list. For example, if the user says
"Aurora", match it to rds; if "Kubernetes", match to eks.
If the AWS CLI is not available, derive the keyword by lowercasing the AWS service name
and removing spaces/hyphens (e.g., "ElastiCache" -> elasticache).
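
The fallback derivation can be sketched in one line of shell. The function name `derive_keyword` is hypothetical, and the result is only a best guess: without the live action list there is no guarantee that FIS uses the derived keyword.

```shell
# Hypothetical helper: lowercase a service name and delete spaces and hyphens.
derive_keyword() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' -'
}
```

For example, `derive_keyword "ElastiCache"` prints `elasticache`.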
If the service is ambiguous, ask the user to clarify (e.g., "RDS MySQL or Aurora MySQL?").
Also determine the deployment architecture if the user mentions it:
Step 2 has the highest priority. The FIS Scenario Library provides AWS-curated
composite scenarios that orchestrate multiple fault injection actions into realistic
failure simulations. These are the most valuable starting point because they represent
AWS's own recommendations for how to test resilience.
Scenario Library scenarios are console-only — they cannot be listed or queried via
AWS CLI or API. The only way to discover them is by reading the latest documentation.
Fetch the Scenario Library pages listed in references/search-queries.md under
"FIS Scenario Library Pages (Always Fetch)". Read both the overview and detailed scenario
pages relevant to the target service. Read pages one at a time, sequentially —
wait for each aws___read_documentation call to complete before starting the next one.
After reading the documentation, classify each scenario's relevance:
| Relevance | Criteria |
|---|---|
| Directly relevant | Scenario includes sub-actions that explicitly target the service (e.g., "Failover RDS" in AZ Power Interruption) |
| Indirectly relevant | Scenario affects infrastructure the service depends on (e.g., network disruption affects any VPC-based service) |
| Not relevant | Scenario has no meaningful impact on the target service |
Include both directly and indirectly relevant scenarios in the output.
After the Scenario Library research, query individual FIS actions to discover
service-specific fault injection capabilities that may not be covered by composite
scenarios.
Step 3a: Fetch ALL FIS actions in the target region:
aws fis list-actions --region TARGET_REGION --query 'actions[].{id:id, description:description}' --output json
Replace TARGET_REGION with the region resolved in Step 1 (e.g., us-east-1).
If no region was determined, omit --region to use the CLI default, but **warn
the user** that results reflect their default region and may differ in other regions.
Step 3b: Filter for target service — from the full list, find actions whose id
contains the search keyword(s) from Step 1:
aws fis list-actions --region TARGET_REGION --query 'actions[?starts_with(id, `aws:KEYWORD:`)].{id:id, description:description}' --output json
Also scan the description field for the service name, because some actions may
reference a service in their description even if the action prefix is different.
Step 3c (Optional): Collect cross-cutting actions — these affect services
indirectly. Include them if the user's service would benefit from network, API, or
infrastructure-level fault injection testing:
aws fis list-actions --region TARGET_REGION --query 'actions[?starts_with(id, `aws:network:`) || starts_with(id, `aws:fis:inject`) || starts_with(id, `aws:ssm:`) || starts_with(id, `aws:ec2:stop`) || starts_with(id, `aws:ec2:terminate`)].{id:id, description:description}' --output json
Cross-cutting actions and when they're useful:
- aws:network:disrupt-connectivity: useful for any VPC-based service
- aws:network:disrupt-vpc-endpoint: useful for services accessed via PrivateLink
- aws:fis:inject-api-internal-error: useful to test app handling of AWS API failures
- aws:fis:inject-api-throttle-error: useful to test backoff/retry logic
- aws:fis:inject-api-unavailable-error: useful to test graceful degradation
- aws:ec2:stop-instances / terminate-instances: useful for services running on EC2
- aws:ssm:send-command / start-automation-execution: useful for custom fault scripts

Whether to include cross-cutting actions depends on context:
- Include them when the user is interested in infrastructure-level failure testing
- Skip them when the service is fully managed with no user-accessible infrastructure layer

Search the FIS actions reference documentation:
aws___search_documentation(
search_phrase="AWS FIS actions [SERVICE_NAME] fault injection",
topics=["reference_documentation"],
limit=10
)
Then read the FIS actions reference page:
aws___read_documentation(
url="https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html",
max_length=10000
)
Count the number of service-specific actions found (exclude cross-cutting actions).
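
As a rough sketch, the count can be taken from a plain list of action IDs (one per line), for instance the `id` values returned in Step 3b. The function name `count_service_actions` is hypothetical.

```shell
# Hypothetical helper: read FIS action IDs on stdin, one per line, and print
# how many start with "aws:KEYWORD:". Cross-cutting actions such as
# aws:network:* are excluded because their prefix does not match.
count_service_actions() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps the exit status clean for callers running under "set -e".
  grep -c "^aws:$1:" || true
}
```

A count above zero points to the FIS-enriched path; a count of zero points to the documentation-only path.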
When FIS has native actions for the target service, combine Scenario Library findings
with FIS-action-specific details.
Map each FIS action to a testing scenario. Use the "FIS Native Fault Injection
Scenarios" table format from references/output-template.md.
IMPORTANT — Scenario Library deduplication (must apply before building the table):
Before listing any FIS action in the per-service table, check whether that exact
action ID appeared as a sub-action in any Scenario Library composite scenario
discovered in Step 2. Common examples of overlap:
- aws:rds:failover-db-cluster: sub-action of AZ Power Interruption
- aws:elasticache:replicationgroup-interrupt-az-power: sub-action of AZ Power Interruption
- aws:eks:pod-network-latency: sub-action of AZ Application Slowdown
- aws:eks:pod-network-packet-loss: sub-action of Cross-AZ Traffic Slowdown
- aws:ec2:stop-instances: sub-action of AZ Power Interruption

Rules:
- If an action also appears as a sub-action of a Scenario Library scenario, keep it in the
per-service table but append to the "HA Verification Purpose" column:
"(Also sub-action of {Scenario Name} — see Scenario Library section)".
- If every service-specific action is a Scenario Library sub-action (e.g.,
ElastiCache has only replicationgroup-interrupt-az-power, which is covered by
AZ Power Interruption), omit the "FIS Native Fault Injection Scenarios"
sub-section entirely and replace with:
> All FIS native actions for {SERVICE} are covered by Scenario Library composite
> scenarios. See the Scenario Library and Cross-Cutting section for details.
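
The deduplication check can be sketched as follows. The file formats are assumptions made for illustration: an actions file with one action ID per line, and a sub-actions file with tab-separated `ACTION_ID` and scenario name pairs collected while reading the Scenario Library pages in Step 2. Both function names are hypothetical.

```shell
# covering_scenario ACTION_ID SUBACTIONS_FILE
# Prints the name of the first Scenario Library scenario that lists ACTION_ID
# as a sub-action, or nothing when the action is not covered.
covering_scenario() {
  awk -F '\t' -v id="$1" '$1 == id { print $2; exit }' "$2"
}

# all_covered ACTIONS_FILE SUBACTIONS_FILE
# Succeeds only when every action ID in ACTIONS_FILE has a covering scenario.
all_covered() {
  local id
  while IFS= read -r id; do
    [ -n "$(covering_scenario "$id" "$2")" ] || return 1
  done < "$1"
  return 0
}
```

When `covering_scenario` prints a name, that name fills the `{Scenario Name}` placeholder in the appended note; when `all_covered` succeeds, the sub-section is replaced by the blockquote above.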
Group scenarios by failure domain:
Scenario Library cross-reference: For each FIS action, check whether it also
appears as a sub-action in any Scenario Library composite scenario discovered in
Step 2. If it does, append a note in the "HA Verification Purpose" column (e.g.,
"Also a sub-action of AZ Power Interruption — see Scenario Library section"). If
all service-specific FIS actions are sub-actions of Scenario Library scenarios,
omit the "FIS Native Fault Injection Scenarios" sub-section entirely and replace
it with a note: "All FIS native actions for this service are covered by Scenario
Library composite scenarios — see the Scenario Library and Cross-Cutting section."
Some services have built-in fault injection beyond FIS. Search for these
(sequentially — wait for the search to complete before reading any result pages):
aws___search_documentation(
search_phrase="[SERVICE_NAME] fault injection testing failover simulation",
topics=["general", "reference_documentation"],
limit=10
)
If found, add a "Service Built-in Fault Injection" section using the table format from
references/output-template.md.
Use the search queries from references/search-queries.md under "FIS-Enriched Path".
Run all 5 queries sequentially (one at a time). After searches, read the top 3-5
most relevant pages one at a time and use aws___recommend on the most relevant
page for discovery. Never send multiple read or recommend requests in parallel.
When FIS has no native actions for the target service, fall back to comprehensive
documentation research. Note that Scenario Library findings from Step 2 still apply.
Use the search queries from references/search-queries.md under "Documentation-Only Path".
Run all 6 queries sequentially (one at a time, wait for each to complete).
From the combined search results, read the top 5 most relevant pages following the
priority order in references/search-queries.md. Read pages one at a time — wait
for each aws___read_documentation call to complete before the next. Then use
aws___recommend on the service's main documentation page to discover related content.
Extract from all pages:
Use the "Testing Methods (No Native FIS Actions)" section format from references/output-template.md,
including both indirect FIS actions and AWS API/Console methods.
Write the report directly to a local markdown file instead of outputting the full
content to the terminal. Use the following file naming convention:
TIMESTAMP=$(TZ=Asia/Shanghai date +%Y-%m-%d-%H-%M-%S)
SERVICE_SLUG=$(echo "{SERVICE_NAME}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-')
# File name: ${TIMESTAMP}-${SERVICE_SLUG}-chaos-research.md
For multi-service requests, generate one file per service:
- ${TIMESTAMP}-rds-chaos-research.md
- ${TIMESTAMP}-eks-chaos-research.md

Compile the report content using the exact format defined in references/output-template.md
and save it to the file. The report must include all sections in this order:
{SVC}-# test IDs, e.g., EKS-1, Redis-1), built-in methods, recommended testing scenario matrix, environment observations, and stop conditions

After saving, print a brief summary to the terminal listing only:
The FIS Scenario Library provides AWS-curated resilience testing scenarios. Always read the latest Scenario Library
documentation before anything else. These are documentation-based (console-only),
not CLI-queryable.
FIS action availability varies by region. Always pass --region to the AWS CLI and
clearly state the region in the output.
fallback path exists precisely for services FIS doesn't cover.
documentation pages you've read.
service, its HA mechanisms, and its specific metrics.
but focus on service-specific actions and Scenario Library scenarios first.
Use aws___search_documentation, aws___read_documentation, and aws___recommend.
topics values (general, reference_documentation, troubleshooting) sequentially.
aws___recommend to find related content that keyword search may miss.
aws___search_documentation, aws___read_documentation, and aws___recommend MUST be executed one at a time.
Wait for each response before sending the next request. Parallel calls will trigger
"Too many requests" errors from the aws-knowledge-mcp-server. This is the single
most common cause of failures — enforce strictly in every step.
If an MCP call returns a rate limit error or any other transient error, wait 5 seconds and retry. Repeat up to 10 times before skipping.