
Brand DNA Extractor

Extract brand identity (colors, typography, visual style, imagery) from any website URL. The tool scrapes the site, analyzes CSS and images with K-means clustering and a vision-language model (VLM), and returns a complete brand style report.
Author: phy041
Category: Data analysis · clawhub v1.0.0 · API key: required
Stars: 0 · Downloads: 311 · Installs: 17 · Versions: 1
Tags: #analysis #branding #latest

Overview

Extract a structured brand identity profile from any website URL. Analyzes colors, typography, and visual style to produce a reusable brand profile for on-brand content generation.

Environment Variables

export OPENAI_API_KEY="your_openai_key"          # for VLM visual analysis (fallback)
export GOOGLE_GENAI_API_KEY="your_gemini_key"    # for VLM visual analysis (primary)
export SUPABASE_URL="your_supabase_url"          # optional: for caching results
export SUPABASE_KEY="your_supabase_key"          # optional: service role key

What It Extracts

Component | Details
--------- | -------
Color palette | Primary, secondary, accent, background, and text colors, sourced from CSS variables, computed styles, and K-means image clustering
Typography | Heading and body fonts, weights, sources (Google Fonts, Adobe Fonts, system)
Visual style | Mood descriptors, photography styles, composition notes, lighting characterization, brand personality, target audience signals
Imagery | Logo, favicon, hero images, product images, other images, classified and ranked

Python Usage

import asyncio
from brand_dna_extractor.extractor import BrandDNAExtractor, extract_brand_dna

# Quick extraction
async def main():
    result = await extract_brand_dna(
        url="https://example.com",
        user_id="optional-user-id",
        force_refresh=False,
    )

    if result.success:
        dna = result.brand_dna
        print(dna.color_palette.dominant_color)       # "#2563EB"
        print(dna.typography.primary_font.family)     # "Inter"
        print(dna.visual_style.moods)                 # ["warm minimalism", "approachable"]
        print(dna.visual_style.brand_personality)     # "Confident and calm..."
    else:
        print(result.error)

asyncio.run(main())

# Full control: await is only valid inside a coroutine, so wrap the calls
async def extract_with_options():
    extractor = BrandDNAExtractor(
        vlm_provider="gemini",      # "gemini" (default) or "openai"
        enable_storage=True,        # cache results in Supabase
        enable_embeddings=False,    # skip CLIP embedding generation
    )

    return await extractor.extract(
        url="https://example.com",
        include_subpages=True,      # also scrape about/product pages
        max_subpages=5,
        force_refresh=False,
    )

result = asyncio.run(extract_with_options())

5-Step Extraction Pipeline

Step 1: Website Scraping

Uses a two-tier scraping strategy:

Primary — DOM Structure Scraper (SimpleScraper)

  • Fast HTTP requests with structured HTML parsing
  • Extracts CSS variables, computed styles, stylesheets, JSON-LD data
  • Optimized for Shopify stores (reads product JSON-LD)
  • Follows include_subpages to crawl up to max_subpages additional URLs

Fallback — Playwright Scraper (PlaywrightScraper)

  • Activates when the simple scraper yields fewer than 3 gallery/product images
  • Handles JavaScript-rendered content
  • Optional dependency: pip install playwright && playwright install
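
The two-tier strategy above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: ScrapeResult, scrape_with_fallback, and the two stand-in scrapers are hypothetical names, and only the "fewer than 3 images" threshold comes from the description above.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List

MIN_IMAGES = 3  # threshold from the text: fall back when fewer than 3 images are found


@dataclass
class ScrapeResult:
    """Hypothetical container; the real scraper output is not shown in this README."""
    url: str
    image_urls: List[str] = field(default_factory=list)
    scraper: str = "simple"


Scraper = Callable[[str], Awaitable[ScrapeResult]]


async def scrape_with_fallback(url: str, simple: Scraper, rendered: Scraper) -> ScrapeResult:
    """Try the fast HTTP scraper first; use the browser-based one if it finds too few images."""
    result = await simple(url)
    if len(result.image_urls) >= MIN_IMAGES:
        return result
    return await rendered(url)


# Stand-in scrapers so the sketch runs without network access.
async def fake_simple(url: str) -> ScrapeResult:
    return ScrapeResult(url, ["a.jpg"], scraper="simple")


async def fake_rendered(url: str) -> ScrapeResult:
    return ScrapeResult(url, ["a.jpg", "b.jpg", "c.jpg", "d.jpg"], scraper="playwright")


chosen = asyncio.run(scrape_with_fallback("https://example.com", fake_simple, fake_rendered))
print(chosen.scraper, len(chosen.image_urls))
```

Passing the scrapers in as callables keeps the fallback rule testable without a browser installed.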

Step 2: Image Extraction and Classification

Images are classified into types:

Type | Description
---- | -----------
logo | Site logo (detected by position, alt text, size)
favicon | Site favicon
hero | Large above-the-fold banner images
product | Product photography
lifestyle | Contextual/lifestyle imagery
other | Remaining UI images

Up to 100 images are extracted; the top 30 product images and the top 30 other images are retained.
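
As a rough sketch of the retention limits, assuming each image carries a ranking score: the Image tuple, its score field, and retain_images are hypothetical, and the README does not say how images are actually ranked.

```python
from typing import Dict, List, NamedTuple

MAX_EXTRACTED = 100  # cap on images taken from the page
MAX_PER_GROUP = 30   # cap applied separately to product and other images


class Image(NamedTuple):
    url: str
    kind: str     # "logo", "favicon", "hero", "product", "lifestyle", or "other"
    score: float  # hypothetical ranking score; higher is better


def retain_images(images: List[Image]) -> Dict[str, List[Image]]:
    """Keep the first 100 images, then the 30 best-scoring per retained group."""
    extracted = images[:MAX_EXTRACTED]
    ranked = sorted(extracted, key=lambda image: image.score, reverse=True)
    product = [image for image in ranked if image.kind == "product"][:MAX_PER_GROUP]
    other = [image for image in ranked if image.kind == "other"][:MAX_PER_GROUP]
    return {"product": product, "other": other}
```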

Step 3: Color Analysis

Multi-source color extraction and classification:

CSS custom properties (--primary-color, --brand-color, etc.)
    +
Computed element styles (headerBackground, ctaBackground, linkColor, etc.)
    +
K-means clustering on logo pixels (3 colors)
    +
K-means clustering on hero/product images (3 colors each, up to 5 images)
    ↓
Deduplicate (Euclidean distance threshold = 30)
    ↓
Classify by lightness/saturation:
  L > 0.9  → background
  L < 0.15 → text
  S > 0.6  → accent
  source=primary → primary
  else     → secondary
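
The deduplication and classification stages of the diagram can be approximated with the standard library alone. The sketch below makes several assumptions: distance is measured in 0-255 RGB space, a color closer than 30 to an already kept color is dropped, lightness and saturation are the HLS values from colorsys, and the "source=primary" rule is modeled as a hypothetical boolean flag. The rules are applied in the order listed above.

```python
import colorsys
import math
from typing import List, Tuple

RGB = Tuple[int, int, int]
DEDUP_THRESHOLD = 30.0  # Euclidean distance; RGB space is an assumption


def hex_to_rgb(value: str) -> RGB:
    value = value.lstrip("#")
    return (int(value[0:2], 16), int(value[2:4], 16), int(value[4:6], 16))


def deduplicate(colors: List[RGB], threshold: float = DEDUP_THRESHOLD) -> List[RGB]:
    """Keep a color only if it is at least `threshold` away from every color kept so far."""
    kept: List[RGB] = []
    for color in colors:
        if all(math.dist(color, other) >= threshold for other in kept):
            kept.append(color)
    return kept


def classify(rgb: RGB, from_primary_source: bool = False) -> str:
    """Apply the lightness/saturation rules top to bottom; the first match wins."""
    r, g, b = (channel / 255 for channel in rgb)
    _, lightness, saturation = colorsys.rgb_to_hls(r, g, b)  # colorsys returns H, L, S
    if lightness > 0.9:
        return "background"
    if lightness < 0.15:
        return "text"
    if saturation > 0.6:
        return "accent"
    if from_primary_source:
        return "primary"
    return "secondary"


palette = deduplicate([hex_to_rgb(h) for h in ("#2563EB", "#2864E6", "#FFFFFF", "#111111")])
print([(color, classify(color)) for color in palette])
```

Because the saturation rule is checked before the source rule, a vivid brand color lands in the accent group under this ordering even when it comes from a primary source.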

ColorPalette output:

palette.dominant_color        # "#2563EB" (hex string)
palette.primary_colors        # List[ColorInfo] (up to 3)
palette.secondary_colors      # List[ColorInfo] (up to 3)
palette.accent_colors         # List[ColorInfo] (up to 2)
palette.background_colors     # List[ColorInfo] (up to 2)
palette.text_colors           # List[ColorInfo] (up to 2)

ColorInfo fields: hex, rgb, hsl, role, source, name, frequency, css_property

Step 4: Typography Analysis

Font detection from three sources:

CSS Computed Fonts

  • Parses font-family declarations from computed element styles
  • Classifies by role: heading, body, cta, nav
  • Identifies system fonts vs custom fonts

Google Fonts (detected from stylesheet URLs)

  • Parses both old (/css?family=) and new (/css2?family=) API formats
  • Extracts family names and weight variants

Adobe Fonts / Typekit (detected from stylesheet URLs)

  • Flags usage of use.typekit.net or use.adobe.com
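
A minimal sketch of stylesheet-URL detection, assuming the two Google Fonts query formats named above. The function names are hypothetical, and only numeric weights are collected (keywords such as "italic" or "regular" are skipped).

```python
import re
from typing import Dict, List
from urllib.parse import unquote_plus, urlsplit

ADOBE_HOSTS = {"use.typekit.net", "use.adobe.com"}  # hosts named in the text above


def _weights(text: str) -> List[int]:
    """Collect numeric weights such as 400 or 400italic; keyword variants are ignored."""
    return [int(match) for match in re.findall(r"\d{3,4}", text)]


def parse_google_fonts_url(url: str) -> Dict[str, List[int]]:
    """Return {family: sorted weights} for /css?family= and /css2?family= URLs."""
    parts = urlsplit(url)
    if parts.hostname != "fonts.googleapis.com":
        return {}
    # Split the query by hand so ';' inside css2 axis lists is never treated as a separator.
    specs = [
        unquote_plus(param[len("family="):])
        for param in parts.query.split("&")
        if param.startswith("family=")
    ]
    fonts: Dict[str, List[int]] = {}
    for spec in specs:
        if parts.path.startswith("/css2"):
            # New format: Family:axis,axis@v,v;v,v with one family per parameter
            name, _, axes = spec.partition(":")
            axis_names, _, tuples = axes.partition("@")
            names = axis_names.split(",")
            found: List[int] = []
            if "wght" in names:
                column = names.index("wght")
                for entry in tuples.split(";"):
                    values = entry.split(",")
                    if column < len(values):
                        found.extend(_weights(values[column]))
            fonts[name] = sorted(set(found))
        else:
            # Old format: Family:400,700|Other+Family:300
            for family in spec.split("|"):
                name, _, variants = family.partition(":")
                fonts[name] = sorted(set(_weights(variants)))
    return fonts


def uses_adobe_fonts(url: str) -> bool:
    return urlsplit(url).hostname in ADOBE_HOSTS
```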

Typography output:

typography.primary_font         # FontInfo — main body font
typography.secondary_font       # FontInfo — heading font (if different)
typography.heading_fonts        # List[FontInfo]
typography.body_fonts           # List[FontInfo]
typography.accent_fonts         # List[FontInfo]
typography.google_fonts_urls    # List[str]
typography.detected_from_google_fonts  # bool
typography.detected_from_adobe_fonts   # bool

FontInfo fields: family, weight, role, source, fallbacks, url

Step 5: Visual Style Analysis (VLM)

Up to 5 representative images (prioritized: hero > product > lifestyle) are analyzed by a VLM using a structured creative director prompt.

Analysis dimensions:

  1. Visual mood and atmosphere (3-5 compound descriptors)
  2. Photography/visual style (2-3 technical descriptors)
  3. Composition analysis (negative space, focal point, depth)
  4. Lighting characterization (quality, direction, color temperature)
  5. Texture and material language
  6. Dominant subjects
  7. Brand personality inference
  8. Target audience signals

VLM provider selection:

  • Default: Gemini (gemini-3-flash-preview or env GEMINI_MODEL)
  • Fallback: OpenAI Vision (env OPENAI_MODEL)
  • Automatic retry with exponential backoff (3 attempts)
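
The retry and provider-fallback behavior could look something like this. It is a sketch under assumptions: the README does not state the delay values or whether the fallback provider is tried only after all retries fail, so the 1-second base delay, the doubling schedule, and the ordering below are illustrative, and the two fake providers stand in for real API calls.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Tuple

Call = Callable[[], Awaitable[Any]]


async def call_with_retry(call: Call, attempts: int = 3, base_delay: float = 1.0) -> Any:
    """Retry a coroutine with exponentially growing delays: base, 2*base, 4*base, ..."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            await asyncio.sleep(base_delay * (2 ** attempt))


async def analyze_with_fallback(providers: List[Tuple[str, Call]], base_delay: float = 1.0):
    """Try each provider in order, moving on once its retries are exhausted."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, await call_with_retry(call, base_delay=base_delay)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all VLM providers failed: " + "; ".join(errors))


# Stand-in providers: the first always fails, the second succeeds on its second attempt.
calls = {"gemini": 0, "openai": 0}


async def fake_gemini():
    calls["gemini"] += 1
    raise RuntimeError("quota exhausted")


async def fake_openai():
    calls["openai"] += 1
    if calls["openai"] < 2:
        raise RuntimeError("temporary error")
    return {"moods": ["warm minimalism"]}


provider, analysis = asyncio.run(
    analyze_with_fallback([("gemini", fake_gemini), ("openai", fake_openai)], base_delay=0.0)
)
print(provider, calls)
```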

VisualStyle output:

style.moods                  # List[str] — top 5 by frequency across images
style.photography_styles     # List[str] — top 3
style.composition_notes      # str — aggregated composition analysis
style.lighting_style         # str
style.texture_notes          # str
style.dominant_subjects      # List[str] — top 5
style.brand_personality      # str — 2-3 sentences
style.target_audience_hint   # str — 2-3 sentences
style.confidence_score       # float — 0.0-1.0 (higher with more images analyzed)
style.images_analyzed        # int

BrandDNA Object

@dataclass
class BrandDNA:
    url: str
    domain: str
    logo: Optional[ExtractedImage]
    favicon: Optional[ExtractedImage]
    hero_images: List[ExtractedImage]
    product_images: List[ExtractedImage]
    other_images: List[ExtractedImage]
    color_palette: ColorPalette
    typography: Typography
    visual_style: VisualStyle
    id: Optional[str]                 # UUID if stored in database
    style_embedding: Optional[List[float]]  # CLIP embedding if enabled

Caching

When enable_storage=True and Supabase credentials are configured, results are automatically cached by domain.

# Force re-extraction (ignore cache)
result = await extractor.extract(url, force_refresh=True)

# Retrieve cached result by domain
from brand_dna_extractor.extractor import get_brand_dna_by_domain
dna = await get_brand_dna_by_domain("example.com")

# Retrieve by stored ID
from brand_dna_extractor.extractor import get_brand_dna
dna = await get_brand_dna("uuid-string")
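
Caching by domain implies that different URLs on the same site map to one cache key. The sketch below shows one way that could work, using an in-memory dict in place of Supabase; domain_key, cached_extract, and the choice to strip a leading "www." are assumptions, not the package's documented behavior.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Reduce a URL to a lowercase host without a leading 'www.' (an assumed normalization)."""
    with_scheme = url if "://" in url else "https://" + url
    host = urlsplit(with_scheme).hostname or ""
    return host[4:] if host.startswith("www.") else host


async def cached_extract(
    url: str,
    extract: Callable[[str], Awaitable[Any]],
    cache: Dict[str, Any],
    force_refresh: bool = False,
) -> Tuple[Any, bool]:
    """Return (value, from_cache); force_refresh skips the lookup and overwrites the entry."""
    key = domain_key(url)
    if not force_refresh and key in cache:
        return cache[key], True
    value = await extract(url)
    cache[key] = value
    return value, False
```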

Error Handling

Extraction calls always return a BrandDNAResponse result object:

result = await extractor.extract(url)

result.success         # bool
result.brand_dna       # BrandDNA | None
result.error           # str | None — human-readable error description
result.from_cache      # bool — True if returned from cache

Common failure modes:

  • Both scrapers fail (site blocks bots, requires login)
  • VLM API quota exhausted
  • URL is not a public website
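
Callers that prefer exceptions over checking result.success can wrap the response. The sketch below mirrors the four fields listed above in a stand-in dataclass; the class definition, BrandDNAError, and unwrap are hypothetical and exist only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BrandDNAResponse:
    """Stand-in with the fields listed above; the real class lives in the package."""
    success: bool
    brand_dna: Optional[Any] = None
    error: Optional[str] = None
    from_cache: bool = False


class BrandDNAError(RuntimeError):
    """Hypothetical exception raised when extraction did not succeed."""


def unwrap(result: BrandDNAResponse) -> Any:
    """Return the brand DNA on success, otherwise raise with the reported error text."""
    if result.success and result.brand_dna is not None:
        return result.brand_dna
    raise BrandDNAError(result.error or "extraction failed without an error message")
```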

Installation

pip install aiohttp Pillow numpy scikit-learn openai google-generativeai
# Optional for JS-heavy sites:
pip install playwright && playwright install chromium

Example Output

BrandDNA(
    domain="allbirds.com",
    color_palette=ColorPalette(
        dominant_color="#2B2B2B",
        primary_colors=[ColorInfo(hex="#2B2B2B", role="primary"), ...],
        accent_colors=[ColorInfo(hex="#E8D5C0", role="accent"), ...],
    ),
    typography=Typography(
        primary_font=FontInfo(family="Flanders Sans", weight="400", source="css"),
        detected_from_google_fonts=False,
    ),
    visual_style=VisualStyle(
        moods=["warm minimalism", "earthy authenticity", "understated confidence"],
        photography_styles=["lifestyle documentary", "naturalistic color treatment"],
        brand_personality="Calm and purposeful, with a commitment to sustainability...",
        target_audience_hint="Environmentally conscious millennials and Gen Z...",
        confidence_score=0.83,
        images_analyzed=5,
    ),
)

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-30 18:38, security status: safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
