Duplicate company records fragment contacts, deals, and engagement history across multiple records for the same real-world company. This leads to inaccurate reporting, broken associations, sales confusion, and workflow failures. This skill identifies duplicates by domain and by name, exports prioritized audit CSVs, and guides the user through merging.
crm.objects.companies.read scope
uv for package management
.env file containing HUBSPOT_ACCESS_TOKEN
HubSpot has no bulk merge API. Merging must happen one pair at a time through the HubSpot UI or via third-party tools. The API is used for discovery, analysis, and audit trail generation.
HubSpot's built-in Duplicates tool is NOT available on all plan tiers. Check whether the account has access to Settings > Data Management > Duplicates before relying on it.
This skill follows a 4-stage execution pattern: Plan -> Before State -> Execute -> After State.
Before writing any code, confirm with the user:
Fetch all companies, identify duplicate groups by domain and name, and export audit CSVs.
"""
Before State: Identify duplicate companies by domain and by name.
Creates CSV audit logs for review before merging.
"""
import os
import csv
import time
import requests
from collections import defaultdict
from dotenv import load_dotenv
load_dotenv()
TOKEN = os.environ["HUBSPOT_ACCESS_TOKEN"]
BASE = "https://api.hubapi.com"
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}
# --- Step 1: Fetch all companies ---
print("Fetching all companies...")
all_companies = []
after = None
while True:
    params = {
        "limit": 100,
        "properties": "name,domain,lifecyclestage,num_associated_contacts,"
                      "num_associated_deals,hubspot_owner_id,createdate",
    }
    if after:
        params["after"] = after
    resp = requests.get(
        f"{BASE}/crm/v3/objects/companies",
        headers=headers, params=params,
    )
    if resp.status_code != 200:
        # Any non-200 response ends the fetch; the analysis then runs on partial data
        print(f"Stopped at {len(all_companies)} (status {resp.status_code})")
        break
    data = resp.json()
    for company in data.get("results", []):
        props = company.get("properties", {})
        all_companies.append({
            "id": company["id"],
            "name": (props.get("name") or "").strip(),
            "domain": (props.get("domain") or "").strip().lower(),
            "lifecycle_stage": props.get("lifecyclestage", ""),
            "associated_contacts": props.get("num_associated_contacts", "0"),
            "associated_deals": props.get("num_associated_deals", "0"),
            "owner_id": props.get("hubspot_owner_id", ""),
            "createdate": props.get("createdate", ""),
        })
    # Follow the paging cursor until there is no next page
    paging = data.get("paging", {})
    after = paging.get("next", {}).get("after")
    if not after:
        break
    time.sleep(0.05)
print(f"Total companies fetched: {len(all_companies)}")
# --- Step 2: Find duplicates by domain ---
print("\nAnalyzing duplicates by domain...")
domain_groups = defaultdict(list)
for c in all_companies:
    if c["domain"]:
        domain_groups[c["domain"]].append(c)
dup_domain_groups = {d: cs for d, cs in domain_groups.items() if len(cs) > 1}
dup_domain_records = sum(len(cs) for cs in dup_domain_groups.values())
print(f"Unique domains with duplicates: {len(dup_domain_groups)}")
print(f"Total records in duplicate domain groups: {dup_domain_records}")
# Top offenders
sorted_domains = sorted(dup_domain_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate domains:")
for domain, companies in sorted_domains[:15]:
    print(f" {domain}: {len(companies)} records")
# --- Step 3: Find duplicates by name ---
print("\nAnalyzing duplicates by name...")
name_groups = defaultdict(list)
for c in all_companies:
    if c["name"]:
        name_groups[c["name"].lower()].append(c)
dup_name_groups = {n: cs for n, cs in name_groups.items() if len(cs) > 1}
dup_name_records = sum(len(cs) for cs in dup_name_groups.values())
print(f"Unique names with duplicates: {len(dup_name_groups)}")
print(f"Total records in duplicate name groups: {dup_name_records}")
sorted_names = sorted(dup_name_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate names:")
for name_lower, companies in sorted_names[:15]:
    print(f" {companies[0]['name']}: {len(companies)} records")
# --- Step 4: Save CSV audit logs ---
os.makedirs("data/audit-logs", exist_ok=True)
# Domain duplicates CSV
domain_csv = "data/audit-logs/duplicate-companies-by-domain.csv"
# Write as UTF-8 so company names with non-ASCII characters export cleanly on any platform
with open(domain_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "domain", "duplicate_count", "id", "name", "lifecycle_stage",
        "associated_contacts", "associated_deals", "owner_id", "createdate",
    ])
    writer.writeheader()
    for domain, companies in sorted_domains:
        for c in companies:
            writer.writerow({
                "domain": domain,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "lifecycle_stage", "associated_contacts",
                    "associated_deals", "owner_id", "createdate",
                ]},
            })
print(f"\nDomain duplicates CSV: {domain_csv}")
# Name duplicates CSV
name_csv = "data/audit-logs/duplicate-companies-by-name.csv"
# Write as UTF-8 so company names with non-ASCII characters export cleanly on any platform
with open(name_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "duplicate_name", "duplicate_count", "id", "name", "domain",
        "lifecycle_stage", "associated_contacts", "associated_deals",
        "owner_id", "createdate",
    ])
    writer.writeheader()
    for name_lower, companies in sorted_names:
        for c in companies:
            writer.writerow({
                "duplicate_name": name_lower,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "domain", "lifecycle_stage",
                    "associated_contacts", "associated_deals",
                    "owner_id", "createdate",
                ]},
            })
print(f"Name duplicates CSV: {name_csv}")
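The script above groups on the lowercased domain and the lowercased name exactly as stored, so www.example.com and example.com, or Acme Inc and Acme, Inc., land in different groups. A minimal sketch of stricter grouping keys follows. The helper names (normalize_domain, normalize_name, group_duplicates) and the LEGAL_SUFFIXES list are illustrative assumptions, not part of the original script, and the suffix list should be tuned to the account's data before anyone trusts the matches.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately short list of legal suffixes to ignore in names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def normalize_domain(raw):
    """Reduce a domain or URL to a bare lowercase host such as 'example.com'."""
    d = (raw or "").strip().lower()
    d = re.sub(r"^[a-z][a-z0-9+.-]*://", "", d)  # drop a scheme like https://
    d = d.split("/", 1)[0]                       # drop any path
    d = d.split(":", 1)[0]                       # drop any port
    if d.startswith("www."):
        d = d[4:]
    return d.rstrip(".")


def normalize_name(raw):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    words = re.sub(r"[^\w\s]", " ", (raw or "").lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def group_duplicates(companies, field, key_fn):
    """Group company dicts by a normalized key; keep groups with 2+ records."""
    groups = defaultdict(list)
    for c in companies:
        key = key_fn(c.get(field))
        if key:  # skip records with an empty key
            groups[key].append(c)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

With all_companies from the script, group_duplicates(all_companies, "domain", normalize_domain) could stand in for the domain grouping in Step 2. Looser keys produce more candidate groups and more false positives, so the human review of the CSVs matters even more.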
Present findings to the user. Key data points:
This stage is primarily manual. Guide the user through the merging process.
Option A: HubSpot Built-In Duplicates Tool (if available)
Prioritization order:
Option B: Manual search-and-merge for top offenders
For companies with many duplicates (4+ records):
Option C: Third-party deduplication tools
For large-scale merging, recommend:
These tools can automate bulk merges that would take hours manually.
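Whichever option is chosen, it helps to hand the user an ordered worklist with a suggested record to keep in each group. The sketch below is one way to build that list from dup_domain_groups or dup_name_groups. It only ranks and suggests; it performs no merges. The ranking rules are assumptions: groups containing a customer or opportunity record come first, then larger groups, and the suggested primary is the record with the most associated contacts and deals, with the earliest createdate as a tie-breaker. STAGE_PRIORITY, suggest_primary, and prioritize_groups are hypothetical names.

```python
# Assumed review order: lower number = review first. "customer" and
# "opportunity" are HubSpot's default lifecyclestage internal values;
# any other or custom stage falls to the end.
STAGE_PRIORITY = {"customer": 0, "opportunity": 1}


def to_int(value):
    """Parse HubSpot's string counts; treat missing or bad values as 0."""
    try:
        return int(value or 0)
    except (TypeError, ValueError):
        return 0


def suggest_primary(group):
    """Suggest the record to keep: most associations, then earliest createdate."""
    return min(
        group,
        key=lambda c: (
            -(to_int(c["associated_contacts"]) + to_int(c["associated_deals"])),
            c["createdate"] or "9999",  # records without a date sort last
        ),
    )


def prioritize_groups(dup_groups):
    """Return (key, suggested_primary, group) tuples in suggested review order."""
    def rank(item):
        key, group = item
        best_stage = min(
            STAGE_PRIORITY.get(c["lifecycle_stage"] or "", 99) for c in group
        )
        return (best_stage, -len(group), key)

    return [
        (key, suggest_primary(group), group)
        for key, group in sorted(dup_groups.items(), key=rank)
    ]
```

The suggestion is only a starting point. The user should still confirm the primary record in the HubSpot UI before each merge, since merges cannot be undone.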
Prevention: Configure auto-association after merging
Settings > Data Management > Companies (or Settings > Objects > Companies)
Enable: "Create and associate companies with contacts"
Set unique identifier: Company domain name
This prevents future duplicates by using domain-based matching instead of name-based.
Re-run the Before State analysis and compare duplicate counts.
"""
After State: Verify duplicate reduction.
"""
# Re-fetch all companies and re-run duplicate analysis
# Compare:
# - Number of duplicate domain groups (should decrease)
# - Number of duplicate name groups (should decrease)
# - Top offenders (should be resolved)
# Also verify merged records:
# For each known duplicate that was merged, search for the company
# and confirm only one record exists with all expected associations.
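The comparison described in the comments above can be sketched as a diff of two audit CSVs. This assumes the Before State CSV was copied to a separate file before the analysis was re-run, because the script writes to the same path each time, and that both files are UTF-8. The function names and the example file names are hypothetical.

```python
import csv


def count_groups(csv_path, key_column):
    """Return {group_key: record_count} from an audit CSV."""
    counts = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = row[key_column]
            counts[key] = counts.get(key, 0) + 1
    return counts


def compare_audits(before_csv, after_csv, key_column="domain"):
    """Summarize which duplicate groups were resolved, remain, or are new."""
    before = count_groups(before_csv, key_column)
    after = count_groups(after_csv, key_column)
    return {
        "groups_before": len(before),
        "groups_after": len(after),
        "resolved": sorted(set(before) - set(after)),
        "remaining": sorted(set(before) & set(after)),
        "new": sorted(set(after) - set(before)),
    }
```

For example, compare_audits("before-by-domain.csv", "data/audit-logs/duplicate-companies-by-domain.csv") would report the domain groups; passing key_column="duplicate_name" does the same for the name CSVs. Anything listed under "new" appeared after the first audit and suggests duplicates are still being created.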
Manual verification:
| Mechanism | Detail |
|---|---|
| CSV audit trail | Complete export of all companies with duplicate group annotations before any merging. |
| Prioritized approach | Customer and Opportunity companies merged first to protect highest-value data. |
| Review before merge | CSVs enable team review before any irreversible merges happen. |
| Confirmation prompt | Present duplicate analysis to the user and wait for explicit confirmation before instructing merges. |
| No auto-merge | This skill never merges automatically. All merges require manual human decision. |
Use GET /crm/v3/objects/companies with pagination, not the Search API. The Search API works too but is slower for full exports.
Compare domains case-insensitively: Example.com and example.com are the same company.
uv init hubspot-cleanup
cd hubspot-cleanup
uv add requests python-dotenv
Create a .env file:
HUBSPOT_ACCESS_TOKEN=pat-na1-xxxxxxxx