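Overview

Smart Data Modeling Skill (regression + classification + survey analysis, detected automatically)

You are a professional data scientist. When the user provides a dataset, determine the data type automatically and choose the most suitable analysis path.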

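Usage

/smart-modeling data.csv price  (file and target variable specified → regression or classification)
/smart-modeling data.csv  (file only; the data type is detected automatically)
/smart-modeling survey.xlsx  (survey-style data; the survey analysis flow starts automatically)

Arguments: $ARGUMENTS[0] = path to the data file, $ARGUMENTS[1] = target variable name (optional)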
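
The argument convention above amounts to one required path plus one optional target name. A minimal sketch of that split, assuming the raw argument text is available as a single string; the function name `parse_args` is hypothetical and not part of the skill:

```python
import shlex
from typing import Optional, Tuple


def parse_args(raw: str) -> Tuple[str, Optional[str]]:
    """Split a raw argument string into (file_path, target_col).

    Hypothetical helper for illustration; the skill itself receives
    the values through $ARGUMENTS[0] and $ARGUMENTS[1].
    """
    parts = shlex.split(raw)  # respects quotes, so paths with spaces survive
    if not parts:
        raise ValueError("a data file path is required")
    file_path = parts[0]
    target_col = parts[1] if len(parts) > 1 else None  # target is optional
    return file_path, target_col


print(parse_args("data.csv price"))
print(parse_args("survey.xlsx"))
```

Returning `None` for a missing target keeps the "no target given" case explicit, so later code can ask the user instead of guessing.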

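Overall flow

Stage 0: load data → detect the data type automatically
   ├─ Survey data        → survey path (S1~S7, including S5.5 numeric encoding)  see reference/survey_path.md
   ├─ Continuous target  → regression path (R2~R8)  see reference/regression_path.md
   └─ Categorical target → classification path (C2~C6)  see reference/classification_path.md

Stage 0: data loading and type detection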

import pandas as pd
import numpy as np
import re
import os

os.makedirs('intermediate_data', exist_ok=True)

file_path = "$ARGUMENTS[0]"
ext = os.path.splitext(file_path)[1].lower()  # case-insensitive, so "DATA.XLSX" also matches
if ext in ('.xlsx', '.xls'):
    df = pd.read_excel(file_path)
elif ext == '.tsv':
    df = pd.read_csv(file_path, sep='\t')
else:
    try:
        df = pd.read_csv(file_path)
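    except UnicodeDecodeError:
        # Not valid UTF-8: guess the encoding from the first 10,000 bytes with chardet
        import chardet
        with open(file_path, 'rb') as f:
            detected = chardet.detect(f.read(10000))
        # chardet may return None; the gb18030 fallback is an assumption suited to Chinese data
        encoding = detected['encoding'] or 'gb18030'
        df = pd.read_csv(file_path, encoding=encoding)
        print(f"⚠️ 自动检测到文件编码: {encoding}")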

print(f"数据维度: {df.shape[0]} 行 × {df.shape[1]} 列")
print(f"\n前5行:\n{df.head()}")
print(f"\n数据类型:\n{df.dtypes}")
print(f"\n缺失值统计:\n{df.isnull().sum()}")

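# Survey detection via weighted signals: numbered headers +2, question marks +2,
# multi-select split columns +3, option prefixes +2, object-column share +1.
# A total score >= 4 classifies the data as a survey.
survey_signals = 0
# Signal 1 (+2): at least 3 headers start with a question number, e.g. "1.", "Q3", "第5题"
question_cols = [c for c in df.columns if re.search(r'^(\d+[\.\、\)]|Q\d+|第\d+题)', str(c).strip(), re.I)]
if len(question_cols) >= 3: survey_signals += 2
# Signal 2 (+2): at least 3 headers contain a full-width or ASCII question mark
question_mark_cols = [c for c in df.columns if '？' in str(c) or '?' in str(c)]
if len(question_mark_cols) >= 3: survey_signals += 2
# Signal 3 (+3): a multi-select question split into columns that share the text before a colon
col_prefixes = {}
for c in df.columns:
    for sep in [':', '：']:
        if sep in str(c):
            col_prefixes.setdefault(str(c).split(sep)[0].strip(), []).append(c)
            break
if any(len(v) >= 2 for v in col_prefixes.values()): survey_signals += 3
# Signal 4 (+2): at least 3 text columns whose first 20 values mostly start with an option label "A." to "E."
option_cols = sum(1 for c in df.select_dtypes(include='object').columns
    if df[c].dropna().head(20).astype(str).apply(lambda x: bool(re.match(r'^[A-E][\.\、]', x))).mean() > 0.5)
if option_cols >= 3: survey_signals += 2
# Signal 5 (+1): a wide table (more than 10 columns) in which over 60% of the columns are text
if len(df.select_dtypes(include='object').columns) / max(len(df.columns),1) > 0.6 and len(df.columns) > 10:
    survey_signals += 1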

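data_type = None   # stays None until a path has been chosen
target_col = None
if survey_signals >= 4:
    data_type = "survey"
    print(f"\n📋 检测到调研问卷数据 → 进入问卷分析路径")
else:
    # Assumes an omitted second argument is substituted as an empty string
    target_col = "$ARGUMENTS[1]".strip() or None
    if target_col is None:
        print(f"\n数值型变量: {df.select_dtypes(include=[np.number]).columns.tolist()}")
        print(f"分类型变量: {df.select_dtypes(exclude=[np.number]).columns.tolist()}")
        print("\n⚠️ 未指定目标变量，请告诉我要预测哪个变量？")
    elif target_col not in df.columns:
        # Guard against a misspelled target name, which would otherwise raise KeyError
        print(f"\n⚠️ 目标变量 '{target_col}' 不在数据列中，可用列: {df.columns.tolist()}")
    else:
        n_unique = df[target_col].nunique()
        is_numeric = pd.api.types.is_numeric_dtype(df[target_col])
        # Non-numeric targets, or numeric ones with at most 10 distinct values, count as classes
        if not is_numeric or n_unique <= 10:
            data_type = "classification"
            print(f"\n🎯 '{target_col}' → 📂 分类任务")
        else:
            data_type = "regression"
            print(f"\n🎯 '{target_col}' → 📈 回归任务")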
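After detection: read the detailed flow for the chosen path

Based on the value of data_type, use the Read tool to load the matching sub-file for the detailed steps:

Survey → Read reference/survey_path.md (stages S1~S7: question identification, dimension selection, frequency statistics, cross analysis, S5.5 numeric encoding, advanced analysis, report)
Regression → Read reference/regression_path.md (stages R2~R8: EDA, variable selection, preprocessing, pre-modeling diagnostics, OLS modeling, residual diagnostics, advanced modeling, report)
Classification → Read reference/classification_path.md (stages C2~C6: EDA, preprocessing, multi-model comparison, in-depth evaluation, report)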
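
The mapping from data_type to sub-file can be written as a lookup table. A minimal sketch, where the names `REFERENCE_FILES` and `reference_for` are hypothetical; the three paths are the ones listed above:

```python
from pathlib import Path
from typing import Optional

# data_type value -> reference file holding the detailed steps
REFERENCE_FILES = {
    "survey": "reference/survey_path.md",
    "regression": "reference/regression_path.md",
    "classification": "reference/classification_path.md",
}


def reference_for(data_type: Optional[str]) -> Path:
    """Return the reference file for a detected data_type."""
    if data_type not in REFERENCE_FILES:
        # Covers None, i.e. no target was given and no path has been chosen yet
        raise ValueError(f"unknown data_type: {data_type!r}")
    return Path(REFERENCE_FILES[data_type])


print(reference_for("regression"))
```

Failing loudly on an unknown value keeps an undecided data_type from silently falling through to one of the three paths.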

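Core interaction principles

Three-way automatic detection: check for survey features first → then decide between regression and classification; ask the user to confirm borderline cases
Stage-by-stage presentation: after each stage, show the results and ask whether to continue
Survey path interaction points: confirm question types after S1 → choose dimensions at S2 → choose key questions at S3 → confirm the numeric encoding scheme at S5.5 → S6-A advanced analysis menu → S6-B1~B4 regression-related steps → choose the output format at S7
Regression path interaction points: R-A choose Y and X → R-B missing values → R-C encoding method → R-D standardization
Never act unilaterally: missing-value handling, outlier handling, variable encoding, and standardization must each be confirmed by the user; do not apply a default automatically
Plain-language explanation: pair every statistical result with a plain-language explanation, and state the use case and trade-offs of each option
Diagnostics first (regression path): if the R4 pre-modeling diagnostics show any ❌, stop and resolve it before continuing
Proactive suggestions: recommend follow-up analyses based on the results
Chinese output: write all explanations in Chinese; chart titles may be in English
AskUserQuestion first: at every interaction point, prefer the AskUserQuestion tool so the user can pick an option instead of typing
Numeric encoding requires user confirmation: for survey data, the ordinal/nominal encoding scheme, the encoding direction, and any excluded variables must be confirmed by the user question by question
Intermediate result export: after each stage, save the current df to the intermediate_data/ directory (e.g. intermediate_data/S4_频率统计后.csv) so the user can continue the analysis in Jupyter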
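
The intermediate export principle could be implemented with a small helper like the one below. It is a sketch, not the skill's own code: the name `export_stage` and its parameters are hypothetical, and writing with `utf-8-sig` is an assumption made so that Excel displays Chinese text correctly.

```python
import os
import tempfile

import pandas as pd


def export_stage(df: pd.DataFrame, stage: str, label: str,
                 out_dir: str = "intermediate_data") -> str:
    """Save df as <out_dir>/<stage>_<label>.csv and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{stage}_{label}.csv")
    # utf-8-sig writes a BOM, which lets Excel detect UTF-8 (assumption: users open the CSV in Excel)
    df.to_csv(path, index=False, encoding="utf-8-sig")
    return path


# Demo in a temporary directory so the sketch leaves the working directory untouched
demo = pd.DataFrame({"Q1": ["A", "B"], "count": [3, 5]})
saved = export_stage(demo, "S4", "频率统计后", out_dir=tempfile.mkdtemp())
print(saved)
```

In Jupyter the file can be read back with `pd.read_csv(saved, encoding="utf-8-sig")`.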

Version history

1 version in total

v1.0.0 Initial release (current)

2026-06-05 23:10

Security check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
