Extract structured profile data from a content creator's public page.
Returns a normalized JSON object regardless of platform.
Use this skill for any request that involves:
If the user provides a URL, jump straight to Step 2. If they only give a
name or handle, start at Step 1.
If the user gave a username/handle without a URL:
hints), then ask the user to confirm.
| Platform | URL Pattern | Example |
|---|---|---|
| -------------- | ------------------------------------------ | -------------------------------------------- |
| YouTube | https://www.youtube.com/@{handle} | https://www.youtube.com/@mkbhd |
https://www.instagram.com/{handle}/ | https://www.instagram.com/natgeo/ | |
| TikTok | https://www.tiktok.com/@{handle} | https://www.tiktok.com/@charlidamelio |
| Twitter / X | https://x.com/{handle} | https://x.com/sama |
https://www.linkedin.com/in/{handle}/ | https://www.linkedin.com/in/satyanadella/ | |
| Twitch | https://www.twitch.tv/{handle} | https://www.twitch.tv/shroud |
| Substack | https://{handle}.substack.com | https://astralcodexten.substack.com |
| GitHub | https://github.com/{handle} | https://github.com/torvalds |
| Patreon | https://www.patreon.com/{handle} | https://www.patreon.com/kurzgesagt |
@ prefix with short handle → likely Twitter/X or Instagramweb_fetch(url)
Check if the response contains useful profile data (bio, follower count, etc.).
If the content is mostly JS placeholders, empty move to 2b. Platforms that almost always require browser mode: Browser steps: counts, bio text). For persistent blocks, use the Apify actor for that platform. See Parse the fetched content and populate the following fields. Mark any missing field as > Privacy rule: Only capture fields that are explicitly public on the > profile page. Do not infer, deduce, or cross-reference private information. > Do not store or relay phone numbers even if visible. See per platform. Quick reference: Convert abbreviated counts to integers before storing: Use the helper script: Present a clean summary in chat: Return or save the full normalized JSON object from Step 3. To save to disk: If the user supplies multiple handles or URLs (comma-separated, line-separated, or a list): requests, default 1500 ms). technically accessible. email, phone numbers). accepts responsibility. targeted harassment infrastructure.2b. Use browser (managed mode) for JS-heavy sites
2c. Apify fallback (if APIFY_API_TOKEN is set)
references/apify-actors.md for actor IDs and call patterns.Step 3 — Extract Structured Fields
null — do not guess.{
"platform": "string", // youtube | instagram | tiktok | twitter | linkedin | twitch | substack | github | patreon | other
"handle": "string", // @-prefixed username
"display_name": "string", // Full display name
"verified": true | false,
"bio": "string", // Profile description / about text
"profile_url": "string", // Canonical URL used to scrape
"avatar_url": "string | null",
"external_links": ["string"], // Any links in bio or link-in-bio
"stats": {
"followers": "number | null",
"following": "number | null",
"subscribers": "number | null",
"total_views": "number | null",
"total_posts": "number | null",
"monthly_listeners": "number | null", // Spotify-style, if applicable
"engagement_rate": "number | null" // Percentage, if computable
},
"recent_content": [
{
"title": "string | null",
"url": "string",
"published_at": "ISO 8601 string | null",
"views": "number | null",
"likes": "number | null",
"comments": "number | null"
}
// Up to 5 most recent items
],
"contact_info": {
"email": "string | null", // Only if publicly listed in bio/links
"website": "string | null"
},
"scraped_at": "ISO 8601 UTC timestamp"
}
Platform-specific extraction hints
references/platform-selectors.md for CSS selectors and JSON-LD paths#subscriber-count or meta itemprop=interactionCount; description in #description-inner[data-testid="UserProfileHeader_Items"]; bio in [data-testid="UserDescription"]window._sharedData or itemprop attributes: name, description, follows, worksForStep 4 — Normalize Numbers
"12.3K" → 12300"4.5M" → 4500000"1B" → 1000000000"1,234" → 1234python3 ~/.openclaw/workspace/skills/scrape-creator-profile/scripts/normalize_count.py "12.3K"
Step 5 — Output
Default output (conversational)
**[Display Name]** (@handle) · Platform
✅ Verified | 👥 X followers | 📝 Y posts
Bio: "..."
Top links: url1, url2
Recent content:
1. "Video Title" — X views (date)
2. ...
[Full JSON available on request]
Structured output (when user asks for JSON, export, or data)
python3 ~/.openclaw/workspace/skills/scrape-creator-profile/scripts/save_profile.py \
--data '<json>' \
--output ~/creator-profiles/{handle}_{platform}.json
Step 6 — Multi-Profile Mode
CREATOR_SCRAPE_DELAY_MS betweenpython3 ~/.openclaw/workspace/skills/scrape-creator-profile/scripts/compare_profiles.py \
--profiles '<json array>' \
--format table # or csv
Error Handling
Situation Action ----------- -------- Login wall / auth required Report which fields were blocked; return partial data Rate limited (429) Wait CREATOR_SCRAPE_DELAY_MS × 3, retry once, then reportProfile not found (404) Inform the user; suggest alternate handle spellings JavaScript-only page, no browser Suggest enabling browser mode in OpenClaw settings Ambiguous handle across platforms Ask user to confirm platform before scraping Legal & Ethical Guardrails
robots.txt disallow rules unless the user explicitly overrides andReference Files
references/platform-selectors.md — CSS selectors and JSON-LD paths per platformreferences/apify-actors.md — Apify actor IDs and call patterns for fallback scrapingscripts/normalize_count.py — Converts "12.3K" → 12300scripts/save_profile.py — Saves profile JSON to diskscripts/compare_profiles.py — Builds comparison table or CSV from multiple profiles
共 1 个版本