12-Factor Agents Compliance Analysis

Input Parameters

Parameter	Description	Required
-----------	-------------	----------
`docs_path`	Path to documentation directory (for existing analyses)	Optional
`codebase_path`	Root path of the codebase to analyze	Required

Analysis Framework

Factor 1: Natural Language to Tool Calls

Principle: Convert natural language inputs into structured, deterministic tool calls using schema-validated outputs.

Search Patterns:

# Look for Pydantic schemas
grep -r "class.*BaseModel" --include="*.py"
grep -r "TaskDAG\|TaskResponse\|ToolCall" --include="*.py"

# Look for JSON schema generation
grep -r "model_json_schema\|json_schema" --include="*.py"

# Look for structured output generation
grep -r "output_type\|response_model" --include="*.py"

File Patterns: **/agents/*.py, **/schemas/*.py, **/models/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	All LLM outputs use Pydantic/dataclass schemas with validators
Partial	Some outputs typed, but dict returns or unvalidated strings exist
Weak	LLM returns raw strings parsed manually or with regex

Anti-patterns:

json.loads(llm_response) without schema validation
output.split() or regex parsing of LLM responses
dict[str, Any] return types from agents
No validation between LLM output and handler execution
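
A minimal sketch of the Strong level above, using only standard-library dataclasses in place of Pydantic. The `ToolCall` shape, the tool names in `ALLOWED_TOOLS`, and `parse_tool_call` are illustrative assumptions, not names from any analyzed codebase.

```python
import json
from dataclasses import dataclass

ALLOWED_TOOLS = {"create_issue", "search_code"}  # hypothetical tool names


@dataclass(frozen=True)
class ToolCall:
    """Schema for one LLM-proposed tool call (hypothetical shape)."""
    tool: str
    arguments: dict

    def __post_init__(self):
        # Validators run on construction, so an invalid call never reaches a handler.
        if self.tool not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")
        if not isinstance(self.arguments, dict):
            raise ValueError("arguments must be an object")


def parse_tool_call(llm_response: str) -> ToolCall:
    """Parse raw LLM text into a validated ToolCall, or raise ValueError."""
    try:
        payload = json.loads(llm_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or set(payload) != {"tool", "arguments"}:
        raise ValueError("expected exactly the keys 'tool' and 'arguments'")
    return ToolCall(tool=payload["tool"], arguments=payload["arguments"])
```

The contrast with the first anti-pattern is that `json.loads` alone accepts any well-formed JSON, while here nothing downstream sees the payload until it has passed the schema.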

Factor 2: Own Your Prompts

Principle: Treat prompts as first-class code you control, version, and iterate on.

Search Patterns:

# Look for embedded prompts
grep -r "SYSTEM_PROMPT\|system_prompt" --include="*.py"
grep -r '""".*You are' --include="*.py"

# Look for template systems
grep -r "jinja\|Jinja\|render_template" --include="*.py"
find . -name "*.jinja2" -o -name "*.j2"

# Look for prompt directories
find . -type d -name "prompts"

File Patterns: **/prompts/**, **/templates/**, **/agents/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	Prompts in separate files, templated (Jinja2), versioned
Partial	Prompts as module constants, some parameterization
Weak	Prompts hardcoded inline in functions, f-strings only

Anti-patterns:

f"You are a {role}..." inline in agent methods
Prompts mixed with business logic
No way to iterate on prompts without code changes
No prompt versioning or A/B testing capability
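
One way the Strong level could look, sketched with the standard library's `string.Template` standing in for Jinja2. The `<name>.<version>.txt` file naming and both function names are assumptions made for illustration.

```python
from pathlib import Path
from string import Template


def load_prompt(prompts_dir: Path, name: str, version: str) -> Template:
    """Load <prompts_dir>/<name>.<version>.txt (hypothetical naming scheme)."""
    path = prompts_dir / f"{name}.{version}.txt"
    return Template(path.read_text(encoding="utf-8"))


def render_prompt(template: Template, **variables: str) -> str:
    # substitute() raises KeyError on a missing variable instead of
    # silently sending a half-filled prompt to the model.
    return template.substitute(**variables)
```

With prompts stored as files, a prompt change is a reviewable diff to a text file, and two versions can sit side by side for comparison.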

Factor 3: Own Your Context Window

Principle: Control how history, state, and tool results are formatted for the LLM.

Search Patterns:

# Look for context/message management
grep -r "AgentMessage\|ChatMessage\|messages" --include="*.py"
grep -r "context_window\|context_compiler" --include="*.py"

# Look for custom serialization
grep -r "to_xml\|to_context\|serialize" --include="*.py"

# Look for token management
grep -r "token_count\|max_tokens\|truncate" --include="*.py"

File Patterns: **/context/*.py, **/state/*.py, **/core/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	Custom context format, token optimization, typed events, compaction
Partial	Basic message history with some structure
Weak	Raw message accumulation, standard OpenAI format only

Anti-patterns:

Unbounded message accumulation
Large artifacts embedded inline (diffs, files)
No agent-specific context filtering
Same context for all agent types
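
A rough sketch of a custom context format that addresses the first two anti-patterns: history is bounded and large artifacts are truncated. The `Event` type, the tag layout, and the 200-character budget are illustrative assumptions.

```python
from dataclasses import dataclass

MAX_ARTIFACT_CHARS = 200  # hypothetical per-event budget


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "user_message", "tool_result", "error"
    content: str


def to_context(events: list[Event], keep_last: int = 20) -> str:
    """Render the most recent events as compact tagged blocks."""
    blocks = []
    for event in events[-keep_last:]:
        body = event.content
        if len(body) > MAX_ARTIFACT_CHARS:
            omitted = len(body) - MAX_ARTIFACT_CHARS
            body = f"{body[:MAX_ARTIFACT_CHARS]}... [{omitted} chars omitted]"
        blocks.append(f"<{event.kind}>\n{body}\n</{event.kind}>")
    return "\n".join(blocks)
```

A real implementation would more likely budget by tokens than by characters and filter events per agent type; this version only shows where that control lives.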

Factor 4: Tools Are Structured Outputs

Principle: Tools produce schema-validated JSON that triggers deterministic code, not magic function calls.

Search Patterns:

# Look for tool/response schemas
grep -r "class.*Response.*BaseModel" --include="*.py"
grep -r "ToolResult\|ToolOutput" --include="*.py"

# Look for deterministic handlers
grep -r "def handle_\|def execute_" --include="*.py"

# Look for validation layer
grep -r "model_validate\|parse_obj" --include="*.py"

File Patterns: **/tools/*.py, **/handlers/*.py, **/agents/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	All tool outputs schema-validated, handlers type-safe
Partial	Most tools typed, some loose dict returns
Weak	Tools return arbitrary dicts, no validation layer

Anti-patterns:

Tool handlers that directly execute LLM output
eval() or exec() on LLM-generated code
No separation between decision (LLM) and execution (code)
Magic method dispatch based on string matching
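
The separation between decision (LLM) and execution (code) can be sketched as an explicit registry. The two tool schemas and their handlers are hypothetical. Plain dataclasses only reject missing or unexpected fields; checking field types would need added validators or Pydantic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AddLabel:          # hypothetical tool schema
    issue_id: int
    label: str


@dataclass(frozen=True)
class CloseIssue:        # hypothetical tool schema
    issue_id: int


def handle_add_label(call: AddLabel) -> str:
    return f"labelled #{call.issue_id} with {call.label}"


def handle_close_issue(call: CloseIssue) -> str:
    return f"closed #{call.issue_id}"


# Explicit registry: tool name -> (schema, handler). No getattr, eval, or exec.
TOOLS: dict[str, tuple[type, Callable]] = {
    "add_label": (AddLabel, handle_add_label),
    "close_issue": (CloseIssue, handle_close_issue),
}


def execute(tool: str, arguments: dict) -> str:
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    schema, handler = TOOLS[tool]
    try:
        call = schema(**arguments)   # rejects missing or unexpected fields
    except TypeError as exc:
        raise ValueError(f"bad arguments for {tool}: {exc}") from exc
    return handler(call)
```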

Factor 5: Unify Execution State

Principle: Merge execution state (step, retries) with business state (messages, results).

Search Patterns:

# Look for state models
grep -r "ExecutionState\|WorkflowState\|Thread" --include="*.py"

# Look for dual state systems
grep -r "checkpoint\|MemorySaver" --include="*.py"
grep -r "sqlite\|database\|repository" --include="*.py"

# Look for state reconstruction
grep -r "load_state\|restore\|reconstruct" --include="*.py"

File Patterns: **/state/*.py, **/models/*.py, **/database/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	Single serializable state object with all execution metadata
Partial	State exists but split across systems (memory + DB)
Weak	Execution state scattered, requires multiple queries to reconstruct

Anti-patterns:

Retry count stored separately from task state
Error history in logs but not in state
LangGraph checkpoints + separate database storage
No unified event thread
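
A minimal sketch of a single serializable state object, assuming a hypothetical `Thread` class whose event list is the only stored state. Step count and consecutive-error count are derived from the events rather than kept in a second system.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Thread:
    """Single serializable record of everything that happened (hypothetical shape)."""
    thread_id: str
    events: list = field(default_factory=list)   # each: {"type": ..., "data": ...}

    def append(self, event_type: str, data: dict) -> None:
        self.events.append({"type": event_type, "data": data})

    @property
    def step(self) -> int:
        # Execution metadata is derived from the events, not stored elsewhere.
        return sum(1 for e in self.events if e["type"] == "tool_call")

    @property
    def consecutive_errors(self) -> int:
        count = 0
        for event in reversed(self.events):
            if event["type"] != "error":
                break
            count += 1
        return count

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Thread":
        return cls(**json.loads(raw))
```

Because the whole thread round-trips through one JSON string, reconstructing state takes one read instead of several queries.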

Factor 6: Launch/Pause/Resume

Principle: Agents support simple APIs for launching, pausing at any point, and resuming.

Search Patterns:

# Look for REST endpoints
grep -r "@router.post\|@app.post" --include="*.py"
grep -r "start_workflow\|pause\|resume" --include="*.py"

# Look for interrupt mechanisms
grep -r "interrupt_before\|interrupt_after" --include="*.py"

# Look for webhook handlers
grep -r "webhook\|callback" --include="*.py"

File Patterns: **/routes/*.py, **/api/*.py, **/orchestrator/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	REST API + webhook resume, pause at any point including mid-tool
Partial	Launch/pause/resume exists but only at coarse-grained points
Weak	CLI-only launch, no pause/resume capability

Anti-patterns:

Blocking input() or confirm() calls
No way to resume after process restart
Approval only at plan level, not per-tool
No webhook-based resume from external systems
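
A sketch of launch/pause/resume that survives a process restart, assuming state is persisted as one JSON file per thread. `FileStore` and the three functions are hypothetical helpers; in practice `resume` would be called from a REST or webhook handler.

```python
import json
from pathlib import Path


class FileStore:
    """Persists thread state as JSON files so a restart loses nothing."""

    def __init__(self, root: Path):
        self.root = root

    def save(self, thread_id: str, state: dict) -> None:
        (self.root / f"{thread_id}.json").write_text(json.dumps(state), encoding="utf-8")

    def load(self, thread_id: str) -> dict:
        return json.loads((self.root / f"{thread_id}.json").read_text(encoding="utf-8"))


def launch(store: FileStore, thread_id: str, task: str) -> dict:
    state = {"task": task, "status": "running", "events": []}
    store.save(thread_id, state)
    return state


def pause(store: FileStore, thread_id: str, reason: str) -> dict:
    state = store.load(thread_id)
    state["status"] = "paused"
    state["events"].append({"type": "paused", "reason": reason})
    store.save(thread_id, state)
    return state


def resume(store: FileStore, thread_id: str, payload: dict) -> dict:
    """Resume a paused thread; the external payload becomes part of its history."""
    state = store.load(thread_id)
    if state["status"] != "paused":
        raise RuntimeError(f"thread {thread_id} is not paused")
    state["status"] = "running"
    state["events"].append({"type": "resumed", "payload": payload})
    store.save(thread_id, state)
    return state
```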

Factor 7: Contact Humans with Tools

Principle: Human contact is a tool call with question, options, and urgency.

Search Patterns:

# Look for human input mechanisms
grep -r "typer.confirm\|input(\|prompt(" --include="*.py"
grep -r "request_human_input\|human_contact" --include="*.py"

# Look for approval patterns
grep -r "approval\|approve\|reject" --include="*.py"

# Look for structured question formats
grep -r "question.*options\|HumanInputRequest" --include="*.py"

File Patterns: **/agents/*.py, **/tools/*.py, **/orchestrator/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	`request_human_input` tool with question/options/urgency/format
Partial	Approval gates exist but hardcoded in graph structure
Weak	Blocking CLI prompts, no tool-based human contact

Anti-patterns:

typer.confirm() in agent code
Human contact hardcoded at specific graph nodes
No way for agents to ask clarifying questions
Single response format (yes/no only)
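
A possible shape for the `request_human_input` tool named in the Strong criterion. The four fields come from that criterion; the allowed urgency and format values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, asdict

URGENCIES = ("low", "medium", "high")                      # assumed values
FORMATS = ("free_text", "yes_no", "multiple_choice")       # assumed values


@dataclass(frozen=True)
class HumanInputRequest:
    question: str
    urgency: str = "medium"
    format: str = "free_text"
    options: tuple = ()

    def __post_init__(self):
        if self.urgency not in URGENCIES:
            raise ValueError(f"urgency must be one of {URGENCIES}")
        if self.format not in FORMATS:
            raise ValueError(f"format must be one of {FORMATS}")
        if self.format == "multiple_choice" and len(self.options) < 2:
            raise ValueError("multiple_choice needs at least two options")


def request_human_input(request: HumanInputRequest) -> dict:
    """Return a tool-call event for the orchestrator to deliver; never blocks."""
    return {
        "type": "tool_call",
        "tool": "request_human_input",
        "arguments": asdict(request),
    }
```

Unlike `typer.confirm()`, nothing here waits on a terminal: the event can be routed to whatever channel reaches the person, and the thread can pause until the answer arrives.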

Factor 8: Own Your Control Flow

Principle: Custom control flow, not framework defaults. Full control over routing, retries, compaction.

Search Patterns:

# Look for routing logic
grep -r "add_conditional_edges\|route_\|should_continue" --include="*.py"

# Look for custom loops
grep -r "while True\|for.*in.*range" --include="*.py" | grep -v test

# Look for execution mode control
grep -r "execution_mode\|agentic\|structured" --include="*.py"

File Patterns: **/orchestrator/*.py, **/graph/*.py, **/core/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	Custom routing functions, conditional edges, execution mode control
Partial	Framework control flow with some customization
Weak	Default framework loop with no custom routing

Anti-patterns:

Single path through graph with no branching
No distinction between tool types (all treated the same)
Framework-default error handling only
No rate limiting or resource management
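
A sketch of an owned control loop with a custom routing function and a step limit. The intent names and the set of high-stakes tools are hypothetical. `determine_next_step` stands in for the LLM call and `execute` for the tool layer.

```python
def route(next_step: dict) -> str:
    """Decide what the loop does with the model's proposed step."""
    intent = next_step["intent"]
    if intent == "done":
        return "finish"
    if intent == "request_human_input":
        return "pause"            # break the loop and wait for a human
    if intent in {"deploy", "delete_branch"}:
        return "needs_approval"   # high-stakes tools get a gate
    return "execute"              # low-stakes tools run immediately


def run_loop(determine_next_step, execute, events: list, max_steps: int = 10) -> str:
    """Run until the router says to stop or the step limit is reached."""
    for _ in range(max_steps):
        step = determine_next_step(events)
        decision = route(step)
        events.append({"type": "step", "data": step})
        if decision == "execute":
            events.append({"type": "tool_result", "data": execute(step)})
            continue
        return decision
    return "step_limit"
```

The routing decision is ordinary code, so distinguishing tool types or adding a rate limit means editing `route`, not working around a framework default.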

Factor 9: Compact Errors into Context

Principle: Errors in context enable self-healing. Track consecutive errors, escalate after threshold.

Search Patterns:

# Look for error handling
grep -r "except.*Exception\|error_history\|consecutive_errors" --include="*.py"

# Look for retry logic
grep -r "retry\|backoff\|max_attempts" --include="*.py"

# Look for escalation
grep -r "escalate\|human_escalation" --include="*.py"

File Patterns: **/agents/*.py, **/orchestrator/*.py, **/core/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	Errors in context, retry with threshold, automatic escalation
Partial	Errors logged and returned, no automatic retry loop
Weak	Errors logged only, not fed back to LLM, task fails immediately

Anti-patterns:

logger.error() without adding to context
No retry mechanism (fail immediately)
No consecutive error tracking
No escalation to humans after repeated failures
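
A sketch of the Strong level: a failure is appended to the context in compact form, and a run of consecutive errors triggers escalation. The threshold of 3 and the event shape are assumptions. The function appends to the `events` list it is given.

```python
MAX_CONSECUTIVE_ERRORS = 3  # hypothetical threshold


def run_tool_with_recovery(events: list, tool, arguments: dict) -> str:
    """Run one tool call; return the next action: 'continue' or 'escalate'."""
    try:
        result = tool(**arguments)
    except Exception as exc:
        # Compact the error: type and message only, no full traceback.
        events.append({"type": "error", "data": f"{type(exc).__name__}: {exc}"})
    else:
        events.append({"type": "tool_result", "data": result})
        return "continue"

    # Count the unbroken run of errors at the end of the history.
    consecutive = 0
    for event in reversed(events):
        if event["type"] != "error":
            break
        consecutive += 1
    return "escalate" if consecutive >= MAX_CONSECUTIVE_ERRORS else "continue"
```

A successful call resets the count, because only the trailing run of errors is measured.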

Factor 10: Small, Focused Agents

Principle: Each agent has narrow responsibility, 3-10 steps max.

Search Patterns:

# Look for agent classes
grep -r "class.*Agent\|class.*Architect\|class.*Developer" --include="*.py"

# Look for step definitions
grep -r "steps\|tasks" --include="*.py" | head -20

# Count methods per agent
grep -c "def " agents/*.py 2>/dev/null

File Patterns: **/agents/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	3+ specialized agents, each with single responsibility, step limits
Partial	Multiple agents but some have broad scope
Weak	Single "god" agent that handles everything

Anti-patterns:

Single agent with 20+ tools
Agent with unbounded step count
Mixed responsibilities (planning + execution + review)
No step or time limits on agent execution

Factor 11: Trigger from Anywhere

Principle: Workflows triggerable from CLI, REST, WebSocket, Slack, webhooks, etc.

Search Patterns:

# Look for entry points
grep -r "@cli.command\|@router.post\|@app.post" --include="*.py"

# Look for WebSocket support
grep -r "WebSocket\|websocket" --include="*.py"

# Look for external integrations
grep -r "slack\|discord\|webhook" --include="*.py" -i

File Patterns: **/routes/*.py, **/cli/*.py, **/main.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	CLI + REST + WebSocket + webhooks + chat integrations
Partial	CLI + REST API available
Weak	CLI only, no programmatic access

Anti-patterns:

Only if __name__ == "__main__" entry point
No REST API for external systems
No event streaming for real-time updates
Trigger logic tightly coupled to execution
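
A sketch of decoupling trigger logic from execution: one core entry point with thin adapters per channel. `start_workflow` appears in the search patterns above, but its signature and the two adapters here are assumptions.

```python
import argparse


def start_workflow(task: str, source: str) -> dict:
    """Single entry point; every trigger channel funnels into this."""
    return {"task": task, "source": source, "status": "queued"}


def from_cli(argv: list[str]) -> dict:
    # Thin adapter: parse arguments, then delegate.
    parser = argparse.ArgumentParser(prog="agent")
    parser.add_argument("task")
    args = parser.parse_args(argv)
    return start_workflow(args.task, source="cli")


def from_webhook(payload: dict) -> dict:
    # Thin adapter: the same core call, fed from an HTTP or chat payload.
    return start_workflow(payload["task"], source=payload.get("source", "webhook"))
```

Adding a new channel then means writing another adapter, with no change to the workflow itself.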

Factor 12: Stateless Reducer

Principle: Agents as pure functions: (state, input) -> (state, output). No side effects in agent logic.

Search Patterns:

# Look for state mutation patterns
grep -r "\.status = \|\.field = " --include="*.py"

# Look for immutable updates
grep -r "model_copy\|\.copy(\|with_" --include="*.py"

# Look for side effects in agents
grep -r "write_file\|subprocess\|requests\." agents/*.py 2>/dev/null

File Patterns: **/agents/*.py, **/nodes/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	Immutable state updates, side effects isolated to tools/handlers
Partial	Mostly immutable, some in-place mutations
Weak	State mutated in place, side effects mixed with agent logic

Anti-patterns:

state.field = new_value (mutation)
File writes inside agent methods
HTTP calls inside agent decision logic
Shared mutable state between agents
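
The `(state, input) -> (state, output)` signature can be sketched with a frozen dataclass and `dataclasses.replace`. The `State` fields, the `plan_approved` event, and the output format are hypothetical. Side effects are returned as descriptions for a handler to carry out, not performed inside the function.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    status: str = "pending"
    events: tuple = ()


def agent_step(state: State, event: dict) -> tuple[State, list]:
    """Pure step: returns (new_state, output) and never touches the old state."""
    new_state = replace(state, events=state.events + (event,))
    if event["type"] == "plan_approved":
        new_state = replace(new_state, status="running")
        # Describe the side effect; a tool or handler outside performs it.
        return new_state, [{"effect": "write_file", "path": event["path"]}]
    return new_state, []
```

Because `State` is frozen, an accidental `state.status = ...` raises an error instead of silently mutating shared state.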

Factor 13: Pre-fetch Context

Principle: Fetch likely-needed data upfront rather than mid-workflow.

Search Patterns:

# Look for context pre-fetching
grep -r "pre_fetch\|prefetch\|fetch_context" --include="*.py"

# Look for RAG/embedding systems
grep -r "embedding\|vector\|semantic_search" --include="*.py"

# Look for related file discovery
grep -r "related_tests\|similar_\|find_relevant" --include="*.py"

File Patterns: **/context/*.py, **/retrieval/*.py, **/rag/*.py

Compliance Criteria:

Level	Criteria
-------	----------
Strong	Automatic pre-fetch of related tests, files, docs before planning
Partial	Manual context passing, design doc support
Weak	No pre-fetching, LLM must request all context via tools

Anti-patterns:

Architect starts with issue only, no codebase context
No semantic search for similar past work
Related tests/files discovered only during execution
No RAG or document retrieval system
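
A small sketch of pre-fetching related tests before planning starts, assuming tests follow the common `test_<name>.py` or `<name>_test.py` naming conventions. Both function names are hypothetical, and a fuller system might add semantic search on top.

```python
from pathlib import Path


def find_related_tests(repo_root: Path, source_file: str) -> list[str]:
    """Return repo-relative paths of test files named after source_file's stem."""
    stem = Path(source_file).stem
    matches = set(repo_root.rglob(f"test_{stem}.py")) | set(repo_root.rglob(f"{stem}_test.py"))
    return sorted(p.relative_to(repo_root).as_posix() for p in matches)


def prefetch_context(repo_root: Path, changed_files: list[str]) -> dict:
    """Assemble context before planning starts, keyed by changed file."""
    return {f: find_related_tests(repo_root, f) for f in changed_files}
```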

Output Format

Gate order: Do not assign Strong / Partial / Weak, or treat recommendations as observed facts, until the Hard gates (listed after Analysis Workflow) are satisfied for the factors in scope.

Executive Summary Table

| Factor | Status | Notes |
|--------|--------|-------|
| 1. Natural Language -> Tool Calls | **Strong/Partial/Weak** | [Key finding] |
| 2. Own Your Prompts | **Strong/Partial/Weak** | [Key finding] |
| ... | ... | ... |
| 13. Pre-fetch Context | **Strong/Partial/Weak** | [Key finding] |

**Overall**: X Strong, Y Partial, Z Weak
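
If the table is generated and not typed by hand, a helper along these lines could produce it. The `summary_table` function and its input format are assumptions. It rejects any status outside the three levels so the counts always add up.

```python
from collections import Counter

LEVELS = ("Strong", "Partial", "Weak")


def summary_table(results: list[tuple[str, str, str]]) -> str:
    """Build the table from (factor, status, note) triples in factor order."""
    for factor, status, _ in results:
        if status not in LEVELS:
            raise ValueError(f"{factor}: unknown status {status!r}")
    lines = ["| Factor | Status | Notes |", "|--------|--------|-------|"]
    lines += [f"| {factor} | **{status}** | {note} |" for factor, status, note in results]
    counts = Counter(status for _, status, _ in results)
    overall = ", ".join(f"{counts[level]} {level}" for level in LEVELS)
    return "\n".join(lines) + f"\n\n**Overall**: {overall}"
```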

Per-Factor Analysis

For each factor, provide:

Current Implementation

Evidence with file:line references
Code snippets showing patterns

Compliance Level

Strong/Partial/Weak with justification

Gaps

What's missing vs. 12-Factor ideal

Recommendations

Actionable improvements with code examples

Analysis Workflow

1. Initial Scan

Run search patterns for all factors
Identify key files for each factor
Note any existing compliance documentation

2. Deep Dive (per factor)

Read identified files
Evaluate against compliance criteria
Document evidence with file paths

3. Gap Analysis

Compare current vs. 12-Factor ideal
Identify anti-patterns present
Prioritize by impact

4. Recommendations

Provide actionable improvements
Include before/after code examples
Reference roadmap if exists

5. Summary

Compile executive summary table
Highlight strengths and critical gaps
Suggest priority order for improvements

Hard gates (evidence before scores)

Run these in order. Do not skip ahead: each Pass is an objective condition you can check (paths on disk, citations present), not internal certainty.

1. Scan gate. After the initial scan (workflow step 1), Pass: for every factor (1–13) you have either (a) ≥1 repo-relative path or glob hit to inspect, or (b) a one-line note with rationale (e.g. search command/output, or “no matches; codebase may omit this concern”). Empty hand-waving (“looks fine”) fails this gate.
2. Evidence gate (per factor). Before writing Strong / Partial / Weak for that factor, Pass: “Current Implementation” includes ≥1 citation with file path plus line range or short quoted snippet from `codebase_path`, or an explicit no evidence located statement after targeted reads. If evidence is missing after search, default that factor to Weak unless the criterion is clearly N/A (say why).
3. Synthesis gate. Executive summary table and per-factor analysis sections, Pass: only after gates 1–2 are satisfied for the factors in scope. Recommendations may name new files or patterns only as proposals; they must not be presented as observed facts without matching citations from step 2.
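
The evidence gate's defaulting rule can be expressed as a small check. `FactorFinding` and `apply_evidence_gate` are hypothetical names; the logic follows the rule above: a level stands only with at least one citation, otherwise the factor is N/A with a stated reason, or Weak.

```python
from dataclasses import dataclass, field


@dataclass
class FactorFinding:
    factor: int
    proposed_level: str                              # "Strong", "Partial", or "Weak"
    citations: list = field(default_factory=list)    # file path plus line range or snippet
    not_applicable_reason: str = ""


def apply_evidence_gate(finding: FactorFinding) -> str:
    """Return the level that may be reported for this factor."""
    if finding.citations:
        return finding.proposed_level
    if finding.not_applicable_reason:
        return "N/A"
    return "Weak"   # no evidence located, and no stated reason for N/A
```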

Quick Reference: Compliance Scoring

Score	Meaning	Action
-------	---------	--------
Strong	Fully implements principle	Maintain, minor optimizations
Partial	Some implementation, significant gaps	Planned improvements
Weak	Minimal or no implementation	High priority for roadmap

When to Use This Skill

Evaluating new LLM-powered systems
Reviewing agent architecture decisions
Auditing production agentic applications
Planning improvements to existing agents
Comparing frameworks or implementations
