← 返回
未分类 中文

A/B Interpreter

Interpret A/B test results for ecommerce campaigns and pages by checking statistical significance, practical effect size, and next-step recommendations.
对电商活动和页面的A/B测试结果进行解读,检查统计显著性、实际效应大小并给出下一步建议。
leooooooow
未分类 clawhub v1.1.0 1 版本 100000 Key: 无需
★ 0
Stars
📥 374
下载
💾 1
安装
1
版本
#latest

概述

A/B Interpreter

Most ecommerce A/B tests get called too early, too late, or on the wrong metric — and the team ships whichever variant "looks like" it won without a clean read on whether the lift was real, large enough to matter, or durable past the novelty window. This skill turns raw test results into a disciplined go or no-go verdict with statistical significance, practical effect size, segment checks, and a prescribed next step so every test actually moves the business.

Quick Reference

DecisionStrong signalAcceptableWeak / Redesign
------------
Statistical significancep < 0.05, two-tailed, sample powered to 80%p between 0.05 and 0.10 with large samplep > 0.10 or sample underpowered
Practical effect sizeLift exceeds MDE and the lower bound of the 95% CI is positiveLift exceeds MDE but CI barely crosses zeroLift below MDE even if "significant"
Test durationRan across at least two full weekly cyclesRan 10 to 14 days, no major anomaliesUnder 7 days or overlaps a holiday/promo
Sample ratioObserved split within 1% of plannedSplit within 2%SRM > 2% — suspect tracking
Guardrail metricsAll guardrails flat or positiveOne guardrail flat, one minor regressionAny revenue, refund, or AOV guardrail regressed
Segment stabilityLift positive across all key segmentsLift positive in 3 of 4 segmentsContradicting segments (new vs returning, device, geo)
Novelty riskLift holds in final week of testLift decays mildly over timeLift front-loaded and decays past week two

Solves

  • Teams ship winners that were noise, then can't reproduce the lift in production
  • Stakeholders disagree on whether a 3% lift with p = 0.08 is a "win"
  • Growth managers call tests after 3 days on a heavy-traffic weekend and miss weekday reality
  • Novelty-effect lifts decay in production and no one notices until the quarter closes
  • Segment-level contradictions (tablet users regress) get averaged away in the topline
  • Guardrail metrics like refund rate silently offset the headline conversion win
  • Test readouts lack a prescribed next step, so learnings do not compound

Workflow

  1. Ingest the test setup. Read the hypothesis, variants, traffic split, start and end dates, primary metric, minimum detectable effect (MDE), and guardrail list. Flag any required field that is missing and stop if the primary metric is not defined in advance.
  2. Sanity-check the data. Confirm the observed traffic split matches the planned split (SRM check). Look for outlier days, tracking gaps, or bot spikes. If the split is off by more than 2%, mark the test inconclusive and recommend re-running.
  3. Run the core statistical check. Use a two-proportion z-test for conversion rate, Welch's t-test for continuous revenue metrics. Compute the p-value, the observed lift, and a 95% confidence interval around the lift.
  4. Compare lift against MDE. A significant result that falls below MDE is practically meaningless; a non-significant result that exceeds MDE suggests under-powering and an extension. The lower bound of the CI must exceed zero for a confident ship call.
  5. Check segments and guardrails. Split by new vs. returning, device, traffic source, and geography. Any segment that contradicts the aggregate direction should be called out. Any guardrail regression (bounce rate, refund rate, add-to-cart) should be weighed against the primary lift.
  6. Assess novelty and seasonality. If the test ran under 14 days, flag novelty risk. Compare week one to week two. If lift concentrated in week one, recommend extending or running a repeat test after a cooldown.
  7. Deliver the verdict and next step. Write a plain-language verdict (ship, kill, extend, redesign), the reasoning a skeptical analyst would demand, and the prescribed follow-up test or monitoring metric. Every readout ends with one concrete action.

Example 1: Shopify PDP button copy test

Inputs. Hypothesis: "Buy Now" outperforms "Add to Cart" on the product page. 14-day test, 50/50 split, 48,200 visitors each arm. Control conversions: 1,446 (3.00%). Variant conversions: 1,640 (3.40%). MDE set at 0.25 percentage points. Guardrails: refund rate and AOV.

Analysis. Two-proportion z-test gives p = 0.02 with a 95% CI on the lift of +0.08 to +0.72 pp. Observed lift of 0.40 pp exceeds MDE of 0.25. Sample split was 50.1 / 49.9. New vs. returning both show directional lift. Refund rate flat. AOV down 1.2% (not significant, CI crosses zero). Week one lift 0.45, week two 0.35 — mild decay, within tolerance.

Verdict. Ship. The result is statistically and practically significant; guardrails are intact. Next step: monitor AOV weekly for four weeks post-rollout and re-test "Buy Now" against a bundled offer CTA in Q3.

Example 2: Email subject line test with under-powered readout

Inputs. Two subject lines sent to 8,000 recipients each. Opens: A = 1,440 (18.0%), B = 1,560 (19.5%). MDE of 2 percentage points was required to justify the winner for list-wide rollout.

Analysis. Two-proportion z-test p = 0.022 — "significant". Observed lift of 1.5 pp is below the 2.0 pp MDE. 95% CI on the lift is +0.2 to +2.8, straddling the MDE. No segment breakdown provided.

Verdict. Do not ship yet. The result is statistically significant but falls short of the practical threshold the team pre-committed to. Extend the test or re-run with a larger list to narrow the CI. Do not rationalize the "win" just because p < 0.05 — the pre-committed MDE exists precisely to prevent that.

Common mistakes

  • Peeking. Calling the test early the moment p drops below 0.05 inflates false positive rates dramatically. Pre-commit to a sample size and honor it.
  • Ignoring MDE. A statistically significant but practically tiny lift does not justify a rollout. The MDE is the business's pre-committed threshold.
  • No SRM check. A broken randomizer can fabricate lifts. Always verify the observed split matches the planned split.
  • Averaging away segments. A +2% topline with a -5% tablet regression is not a clean win. Pull the segment cuts before celebrating.
  • One-week tests. Weekday vs. weekend behavior differs, and novelty effects dominate week one. Run at least two full business cycles.
  • Mixing up absolute and relative lift. "20% lift" on a 3% base rate means +0.6 pp, not +20 pp. Always state which you are reporting.
  • Treating p-values as probability the hypothesis is true. A p-value is the probability of the observed data under the null, not the probability the variant is better.
  • No guardrails declared upfront. Retrofitting guardrails after seeing results invites cherry-picking. Name them before the test starts.
  • Ignoring post-ship drift. Novelty decay and seasonality can erode a shipped win. Always prescribe a post-ship monitoring window.

Resources

  • references/output-template.md — The four-block readout structure
  • references/statistical-tests.md — Choosing the right test for the metric type
  • references/segment-playbook.md — Standard segments to cut and what each contradiction signals
  • assets/ab-readout-checklist.md — Pre-flight checklist for every readout you deliver

版本历史

共 1 个版本

  • v1.1.0 当前
    2026-05-07 11:45 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

content-creation

Ecommerce Image Asset Generator

leooooooow
规划和生成能够促进转化的电商图片素材。适用于团队需要决定产品图、产品详情页图、商城图、促销图或广告图等场景。
★ 2 📥 1,796
data-analysis

Review Analysis

leooooooow
分析客户评论、投诉和反馈,识别重复模式、潜在根本原因并确定行动优先级。适用于团队需要将投诉分类聚合的场景。
★ 1 📥 1,781
developer-tools

Shipping Cost Calculator Ecommerce

leooooooow
根据重量、配送区域、承运商规则和免运费政策估算电商订单运费,帮助团队优化定价、利润和促销方案。
★ 0 📥 1,516