Overview

Pinterest Search Scraper

Scrapes pins from Pinterest search results using Crawlee's PlaywrightCrawler with infinite scroll support. Collects pin URLs, image URLs, and descriptions without requiring login.

Requirements

pip install 'crawlee[playwright]'
playwright install chromium

Usage

python pinterest_scraper.py "search query"
python pinterest_scraper.py "minimalist home decor" 30
python pinterest_scraper.py "brutalist architecture" 100

Arguments:

query (required): Search term (use quotes for multi-word queries)
max_pins (optional): Maximum pins to collect, default 30
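
The two arguments above map naturally onto a positional parser. The following is a minimal sketch using the standard library's argparse, not the script's actual parsing code; the function name parse_args is an assumption made for illustration.

```python
import argparse
from typing import Optional


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    """Parse `query` (required) and `max_pins` (optional, default 30).

    Hypothetical helper: the real script may parse sys.argv differently.
    """
    parser = argparse.ArgumentParser(
        description="Scrape pins from Pinterest search results."
    )
    parser.add_argument("query", help="Search term (quote multi-word queries)")
    parser.add_argument(
        "max_pins",
        nargs="?",      # makes the positional argument optional
        type=int,
        default=30,
        help="Maximum pins to collect (default: 30)",
    )
    return parser.parse_args(argv)
```

Calling parse_args(["brutalist architecture", "100"]) yields a namespace with query set to the full quoted string and max_pins set to the integer 100; omitting the second value falls back to 30.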

What It Returns

Each pin object contains:

| Field | Description |
| --- | --- |
| `id` | Pinterest pin ID |
| `url` | Full Pinterest pin URL (`https://www.pinterest.com/pin//`) |
| `image_url` | Highest-resolution image URL from srcset |
| `description` | Image alt text / pin description |

Output

Saved to ./storage/pinterest/.json:

[
  {
    "id": "123456789",
    "url": "https://www.pinterest.com/pin/123456789/",
    "image_url": "https://i.pinimg.com/736x/ab/cd/ef/...",
    "description": "Minimalist living room with white walls"
  },
  ...
]
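
To consume the saved file downstream, something like the following sketch may help. It assumes the file is a JSON array of objects with the four fields listed above; the Pin type and the load_pins helper are hypothetical names, and the path is left to the caller.

```python
import json
from pathlib import Path
from typing import TypedDict, Union


class Pin(TypedDict):
    """Shape of one record, following the field table above."""
    id: str
    url: str
    image_url: str
    description: str


REQUIRED_FIELDS = ("id", "url", "image_url", "description")


def load_pins(path: Union[str, Path]) -> list[Pin]:
    """Load a saved output file and keep only records that have all four fields."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return [
        record
        for record in records
        if isinstance(record, dict)
        and all(field in record for field in REQUIRED_FIELDS)
    ]
```

Dropping incomplete records silently is a design choice for this sketch; a stricter pipeline might raise an error instead.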

How It Works

The scraper uses Crawlee's PlaywrightCrawler to:

1. Navigate to the Pinterest search URL: https://www.pinterest.com/search/pins/?q=
2. Wait for pin elements to appear ([data-test-id="pin"] or a[href*="/pin/"])
3. Iteratively scroll to the bottom to trigger infinite load
4. Extract pins after each scroll via page.evaluate() JavaScript injection
5. Deduplicate by pin ID and collect until max_pins is reached or scrolling stalls
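
The control flow of these steps can be sketched independently of the browser. The code below is an illustration under stated assumptions, not the scraper's actual implementation: build_search_url and collect_pins are hypothetical names, and the extract and scroll callables stand in for whatever the scraper does with the Playwright page.

```python
import asyncio
from typing import Awaitable, Callable
from urllib.parse import urlencode

SEARCH_URL = "https://www.pinterest.com/search/pins/"


def build_search_url(query: str) -> str:
    """URL-encode the query and append it as the q parameter."""
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


async def collect_pins(
    extract: Callable[[], Awaitable[list[dict]]],
    scroll: Callable[[], Awaitable[None]],
    max_pins: int = 30,
    max_scroll_attempts: int = 10,
    scroll_delay: float = 1.5,
) -> list[dict]:
    """Extract, deduplicate by pin ID, and scroll until done or stalled."""
    pins: dict[str, dict] = {}  # keyed by pin ID, insertion order preserved
    stalled = 0                 # consecutive rounds that found no new pins
    while len(pins) < max_pins:
        before = len(pins)
        for pin in await extract():
            if len(pins) >= max_pins:
                break
            pins.setdefault(pin["id"], pin)  # keep the first copy of each ID
        if len(pins) >= max_pins:
            break
        stalled = stalled + 1 if len(pins) == before else 0
        if stalled >= max_scroll_attempts:
            break
        await scroll()
        await asyncio.sleep(scroll_delay)  # give new content time to load
    return list(pins.values())
```

With Playwright, scroll could plausibly wrap page.evaluate("window.scrollTo(0, document.body.scrollHeight)") and extract could wrap the page.evaluate() call that returns the pin objects. Passing both as callables keeps the loop testable without launching a browser.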

The JavaScript extractor resolves srcset attributes to select the highest-resolution image available:

async def _extract_pins(self, page) -> None:
    """Extract pin data from the current page state."""
    # Runs JS in the browser context to walk the DOM and extract pin data
    # Handles multiple Pinterest DOM structures (data-test-id variants)
    # Resolves srcset to get highest resolution image
    ...
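
The method body is elided above, and the real extractor runs as JavaScript inside the page. To illustrate only the srcset step, here is a Python sketch of the same idea: pick the candidate with the largest width (w) or pixel-density (x) descriptor. The function name best_from_srcset is hypothetical, and the sketch assumes candidate URLs contain no commas.

```python
from typing import Optional


def best_from_srcset(srcset: str) -> Optional[str]:
    """Return the URL with the largest `w` or `x` descriptor, or None if empty."""
    best_url: Optional[str] = None
    best_score = -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty segments such as a trailing comma
        url = parts[0]
        score = 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1:
            try:
                score = float(parts[1][:-1])  # strip the trailing "w" or "x"
            except ValueError:
                continue  # ignore malformed descriptors
        if score > best_score:
            best_url, best_score = url, score
    return best_url
```

A srcset uses either width or density descriptors throughout, so comparing the raw numbers within one attribute is sufficient for this purpose.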

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `max_pins` | 30 | Maximum pins to collect |
| `headless` | `True` | Run browser headlessly (set `False` for debugging) |
| `max_scroll_attempts` | 10 | Stop after N consecutive scrolls with no new pins |
| `scroll_delay` | 1.5 s | Wait between scrolls for content to load |

To run in headed mode for debugging:

scraper = PinterestScraper(max_pins=10, headless=False)

Integrating Into a Pipeline

import asyncio
from pinterest_scraper import PinterestScraper

async def collect_inspiration(query: str, limit: int = 50) -> list[dict]:
    scraper = PinterestScraper(max_pins=limit, headless=True)
    pins = await scraper.scrape_search(query)
    return pins

pins = asyncio.run(collect_inspiration("editorial fashion photography", 50))

Troubleshooting

No pins found: Pinterest occasionally changes its DOM structure. Set headless=False to watch the page load and inspect it visually. The scraper tries two selector strategies ([data-test-id="pin"] and a[href*="/pin/"]); if neither matches, the selectors need updating.

Fewer pins than expected: Pinterest's infinite scroll depends on scroll velocity and network speed. Increase max_scroll_attempts in scrape_search() or lengthen the asyncio.sleep() after each scroll (see scroll_delay under Configuration).

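Playwright install error: Run playwright install chromium to download the browser binary. If you are behind a corporate proxy, set HTTPS_PROXY before running the install. If the default browser cache directory is not writable, set PLAYWRIGHT_BROWSERS_PATH to a writable directory.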

Rate limiting / CAPTCHA: Pinterest may show a CAPTCHA after many rapid requests. Add delays between scraper runs or rotate residential IPs.

Rate Limiting Guidelines

Wait at least 5 seconds between search queries when running several in a row
Avoid scraping more than 300-500 pins per hour from a single IP
Pinterest does not require login for search, but aggressive scraping triggers bot detection
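
One way to apply these guidelines when running several queries is a small sequential runner. This is a sketch under stated assumptions: scrape_queries, delay_seconds, and pin_budget are hypothetical names, and the budget is counted per run rather than against a real one-hour clock.

```python
import asyncio
from typing import Awaitable, Callable


async def scrape_queries(
    queries: list[str],
    scrape: Callable[[str], Awaitable[list[dict]]],
    delay_seconds: float = 5.0,
    pin_budget: int = 300,
) -> dict[str, list[dict]]:
    """Run queries one at a time, pausing between them and stopping at the budget."""
    results: dict[str, list[dict]] = {}
    collected = 0
    for index, query in enumerate(queries):
        if collected >= pin_budget:
            break  # budget spent; remaining queries are left for a later run
        if index > 0:
            await asyncio.sleep(delay_seconds)  # pause between searches
        pins = await scrape(query)
        results[query] = pins
        collected += len(pins)
    return results
```

With the scraper from this document, the scrape argument could be the bound method PinterestScraper(max_pins=50).scrape_search, for example asyncio.run(scrape_queries(["editorial fashion photography", "brutalist architecture"], scraper.scrape_search)).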

Use Cases

Visual trend research: Collect images around a topic to identify visual patterns
Dataset creation: Build image datasets for computer vision or aesthetic scoring models
Content planning: Find top-performing visuals in a niche to guide creative direction
Competitive research: Scrape brand-related queries to see what imagery dominates a category
Mood board generation: Automate collection of reference images for design projects

Version History

1 version in total

v1.0.0 (current)

2026-03-30 22:24
