← 返回
内容创作 中文

Instagram Scraper

Browser-based tool to discover Instagram profiles by location/category and scrape their public info, stats, images, and engagement with export options.
基于浏览器的工具,可按位置/类别发现Instagram账号,抓取公开信息、统计数据、图片及互动数据,并支持导出。
arulmozhiv
内容创作 clawhub v1.0.7 2 版本 99900.4 Key: 无需
★ 8
Stars
📥 3,854
下载
💾 529
安装
2
版本
#latest

概述

Instagram Profile Scraper

A browser-based Instagram profile discovery and scraping tool.

> Part of ScrapeClaw — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, and Facebook built with Python & Playwright, no API keys required.

---
name: instagram-scraper
description: Discover and scrape Instagram profiles from your browser.
emoji: 📸
version: 1.0.6
author: influenza
tags:
  - instagram
  - scraping
  - social-media
  - influencer-discovery
metadata:
  clawdbot:
    requires:
      bins:
        - python3
        - chromium

    config:
      stateDirs:
        - data/output
        - data/queue
        - thumbnails
      outputFormats:
        - json
        - csv
---

Overview

This skill provides a two-phase Instagram scraping system:

  1. Profile Discovery
  2. Browser Scraping

Features

  • 🔍 - Discover Instagram profiles by location and category
  • 🌐 - Full browser simulation for accurate scraping
  • 🛡️ - Browser fingerprinting, human behavior simulation, and stealth scripts
  • 📊 - Profile info, stats, images, and engagement data
  • 💾 - JSON/CSV export with downloaded thumbnails
  • 🔄 - Resume interrupted scraping sessions
  • ⚡ - Auto-skip private accounts, low followers, empty profiles
  • 🌍 - Built-in residential proxy support with 4 providers

Getting Google API Credentials (Optional)

  1. Go to Google Cloud Console
  2. Create a new project or select existing
  3. Enable "Custom Search API"
  4. Create API credentials → API Key
  5. Go to Programmable Search Engine
  6. Create a search engine with instagram.com as the site to search
  7. Copy the Search Engine ID

Usage

Agent Tool Interface

For OpenClaw agent integration, the skill provides JSON output:

# Discover profiles (returns JSON)
discover --location "Miami" --category "fitness" --output json

# Scrape single profile (returns JSON)
scrape --username influencer123 --output json

Output Data

Profile Data Structure

{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Fashion blogger | NYC",
  "followers": 125000,
  "following": 1500,
  "posts_count": 450,
  "is_verified": false,
  "is_private": false,
  "influencer_tier": "mid",
  "category": "fashion",
  "location": "New York",
  "profile_pic_local": "thumbnails/example_user/profile_abc123.jpg",
  "content_thumbnails": [
    "thumbnails/example_user/content_1_def456.jpg",
    "thumbnails/example_user/content_2_ghi789.jpg"
  ],
  "post_engagement": [
    {"post_url": "https://instagram.com/p/ABC123/", "likes": 5420, "comments": 89}
  ],
  "scrape_timestamp": "2025-02-09T14:30:00"
}

Influencer Tiers

TierFollower Range
--------------------------
nano< 1,000
micro1,000 - 10,000
mid10,000 - 100,000
macro100,000 - 1M
mega> 1,000,000

File Outputs

  • Queue files: data/queue/{location}_{category}_{timestamp}.json
  • Scraped data: data/output/{username}.json
  • Thumbnails: thumbnails/{username}/profile_.jpg, thumbnails/{username}/content_.jpg
  • Export files: data/export_{timestamp}.json, data/export_{timestamp}.csv

Configuration

Edit config/scraper_config.json:

{
  "proxy": {
    "enabled": false,
    "provider": "brightdata",
    "country": "",
    "sticky": true,
    "sticky_ttl_minutes": 10
  },
  "google_search": {
    "enabled": true,
    "api_key": "",
    "search_engine_id": "",
    "queries_per_location": 3
  },
  "scraper": {
    "headless": false,
    "min_followers": 1000,
    "download_thumbnails": true,
    "max_thumbnails": 6
  },
  "cities": ["New York", "Los Angeles", "Miami", "Chicago"],
  "categories": ["fashion", "beauty", "fitness", "food", "travel", "tech"]
}

Filters Applied

The scraper automatically filters out:

  • ❌ Private accounts
  • ❌ Accounts with < 1,000 followers (configurable)
  • ❌ Accounts with no posts
  • ❌ Non-existent/removed accounts
  • ❌ Already scraped accounts (deduplication)

Troubleshooting

Login Issues

  • Ensure credentials are correct
  • Handle verification codes when prompted
  • Wait if rate limited (the script will auto-retry)

No Profiles Discovered

  • Check Google API key and quota
  • Verify Search Engine ID is configured for instagram.com
  • Try different location/category combinations

Rate Limiting

  • Reduce scraping speed (increase delays in config)
  • Run during off-peak hours
  • Use a residential proxy (see below)

🌐 Residential Proxy Support

Why Use a Residential Proxy?

Running a scraper at scale without a residential proxy will get your IP blocked fast. Here's why proxies are essential for long-running scrapes:

AdvantageDescription
------------------------
Avoid IP BansResidential IPs look like real household users, not data-center bots. Instagram is far less likely to flag them.
Automatic IP RotationEach request (or session) gets a fresh IP, so rate-limits never stack up on one address.
Geo-TargetingRoute traffic through a specific country/city so scraped content matches the target audience's locale.
Sticky SessionsKeep the same IP for a configurable window (e.g. 10 min) — critical for maintaining a consistent browsing session.
Higher Success RateRotating residential IPs deliver 95%+ success rates compared to ~30% with data-center proxies on Instagram.
Long-Running ScrapesScrape thousands of profiles over hours or days without interruption.
Concurrent ScrapingRun multiple browser instances across different IPs simultaneously.

Recommended Proxy Providers

We have affiliate partnerships with top residential proxy providers. Using these links supports continued development of this skill:

ProviderBest ForSign Up
-----------------------------
Bright DataWorld's largest network, 72M+ IPs, enterprise-grade👉 Get Bright Data
IProyalPay-as-you-go, 195+ countries, no traffic expiry👉 Get IProyal
Storm ProxiesFast & reliable, developer-friendly API, competitive pricing👉 Get Storm Proxies
NetNutISP-grade network, 52M+ IPs, direct connectivity👉 Get NetNut

Setup Steps

1. Get Your Proxy Credentials

Sign up with any provider above, then grab:

  • Username (from your provider dashboard)
  • Password (from your provider dashboard)
  • Host and Port are pre-configured per provider (or use custom)

2. Configure via Environment Variables

export PROXY_ENABLED=true
export PROXY_PROVIDER=brightdata    # brightdata | iproyal | stormproxies | netnut | custom
export PROXY_USERNAME=your_user
export PROXY_PASSWORD=your_pass
export PROXY_COUNTRY=us             # optional: two-letter country code
export PROXY_STICKY=true            # optional: keep same IP per session

3. Provider-Specific Host/Port Defaults

These are auto-configured when you set the provider name:

ProviderHostPort
----------------------
Bright Databrd.superproxy.io22225
IProyalproxy.iproyal.com12321
Storm Proxiesrotating.stormproxies.com9999
NetNutgw-resi.netnut.io5959

Override with PROXY_HOST / PROXY_PORT env vars if your plan uses a different gateway.

4. Custom Proxy Provider

For any other proxy service, set provider to custom and supply host/port manually:

{
  "proxy": {
    "enabled": true,
    "provider": "custom",
    "host": "your.proxy.host",
    "port": 8080,
    "username": "user",
    "password": "pass"
  }
}

Running the Scraper with Proxy

Once configured, the scraper picks up the proxy automatically — no extra flags needed:

# Discover and scrape as usual — proxy is applied automatically
python main.py discover --location "Miami" --category "fitness"
python main.py scrape --username influencer123

# The log will confirm proxy is active:
# INFO - Proxy enabled: <ProxyManager provider=brightdata enabled host=brd.superproxy.io:22225>
# INFO - Browser using proxy: brightdata → brd.superproxy.io:22225

Using the Proxy Manager Programmatically

from proxy_manager import ProxyManager

# From config (auto-reads config/scraper_config.json)
pm = ProxyManager.from_config()

# From environment variables
pm = ProxyManager.from_env()

# Manual construction
pm = ProxyManager(
    provider="brightdata",
    username="your_user",
    password="your_pass",
    country="us",
    sticky=True
)

# For Playwright browser context
proxy = pm.get_playwright_proxy()
# → {"server": "http://brd.superproxy.io:22225", "username": "user-country-us-session-abc123", "password": "pass"}

# For requests / aiohttp
proxies = pm.get_requests_proxy()
# → {"http": "http://user:pass@host:port", "https": "http://user:pass@host:port"}

# Force new IP (rotates session ID)
pm.rotate_session()

# Debug info
print(pm.info())

Best Practices for Long-Running Scrapes

  1. Use sticky sessions — Instagram requires consistent IPs during a browsing session. Set "sticky": true.
  2. Target the right country — Set "country": "us" (or your target region) so Instagram serves content in the expected locale.
  3. Combine with existing anti-detection — This scraper already has fingerprinting, stealth scripts, and human behavior simulation. The proxy is the final layer.
  4. Rotate sessions between batches — Call pm.rotate_session() between large batches of profiles to get a fresh IP.
  5. Use delays — Even with proxies, respect delay_between_profiles in config to avoid aggressive patterns.
  6. Monitor your proxy dashboard — All providers have dashboards showing bandwidth usage and success rates.

版本历史

共 2 个版本

  • v1.0.7 当前
    2026-03-28 12:17 安全
  • v1.0.5
    2026-03-07 11:36

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

suspicious
查看报告

🔗 相关推荐

content-creation

Baidu Wenku AIPPT

ide-rea
使用百度文库 AI 智能生成 PPT,自动根据内容选择模板。
★ 66 📥 46,152
content-creation

AdMapix

fly0pants
广告情报与应用数据分析助手,支持搜索广告素材、分析应用排名、下载量、收入及市场洞察,用于广告素材和竞品分析。
★ 295 📥 136,442
content-creation

Humanizer

biostartechnology
消除AI写作痕迹,使文本更自然真实。基于维基百科"AI写作特征"指南,识别并修正夸张象征、宣传用语、肤浅-ing分析、模糊归因、破折号滥用、三项排比、AI词汇、负面平行结构及冗长连接词等模式。
★ 858 📥 199,487