A comprehensive, interactive data analysis workflow that transforms CSV data into actionable business insights. This skill guides you through professional data analysis from initial exploration to final deliverable, with quality gates and user confirmations at key decision points.
This skill follows a 7-step methodology with 3 interaction points:
Input → Business Understanding → Data Inspection → Cleaning Strategy → EDA → Deep Analysis → Insights → Output
↓ ↓ ↓ ↓ ↓ ↓ ↓
Required Interaction 1 Quality Gate Interaction 2 Auto-run Auto-run Interaction 3
Required:
Optional:
--quick (basic stats), --standard (default), or --deep (advanced modeling)--auto (skip interactions, use recommended strategies)--format=markdown|html|dashboardUsage examples:
"Analyze sales_data.csv - I want to know which channels have the best conversion rates"
"Help me understand customer_behavior.csv, specifically looking at retention patterns"
"Quick analysis of Q4_results.csv --quick --auto"
```markdown
## Analysis Plan
Core Question: [Restate the user's goal in one sentence]
Key Metrics to Calculate:
Analysis Dimensions:
Expected Deliverables:
```
Present your analysis plan and ask:
Does this match what you're looking for?
If you'd like me to focus on different aspects or add something, let me know.
If the business goal is unclear, offer templates:
I can help with common scenarios:
1. Sales Analysis (channel comparison, trend forecasting, top products)
2. User Behavior (funnel analysis, retention cohorts, churn prediction)
3. Operations (ROI calculation, campaign effectiveness, resource allocation)
Which best describes what you need, or would you like to describe it differently?
Wait for user confirmation before proceeding.
```python
import pandas as pd
import numpy as np
# Try UTF-8 first, fall back to other encodings if needed
try:
df = pd.read_csv(file_path, encoding='utf-8')
except UnicodeDecodeError:
df = pd.read_csv(file_path, encoding='latin-1')
```
```
## Data Overview
📊 Dimensions: {rows:,} rows × {cols} columns
💾 Memory: {size} MB
📋 Columns:
| Column Name | Data Type | Sample Value |
|-------------|-----------|--------------|
| ... | ... | ... |
🔍 Preview (first 5 rows):
[Display formatted table]
```
```
Score = 100 - (missing_penalty + duplicate_penalty + type_mismatch_penalty)
Where:
```
Based on the quality score, present findings:
Score ≥ 80 (Good):
✅ Data quality looks good (Score: {score}/100)
Minor issues found: [list if any]
Proceeding to analysis...
Score 60-79 (Fair):
⚠️ Data has some quality issues (Score: {score}/100)
Issues found: [list]
I can still analyze this, but results may be limited. Continue?
Score < 60 (Poor):
🚨 Data quality is concerning (Score: {score}/100)
Major issues:
- [Issue 1 with impact]
- [Issue 2 with impact]
Recommendation: Contact the data provider or provide a data dictionary.
Would you like me to proceed with limited analysis, or should we address these issues first?
If quality score ≥ 80 and issues are minor, apply automatic fixes and report:
🧹 Applied automatic cleaning:
- Standardized date format in 'OrderDate' column
- Trimmed whitespace from text fields
Ready to analyze!
If quality score < 80, present issues with specific strategy options:
## Data Cleaning Recommendations
### Issue 1: Missing Values in 'Age' Column (20% missing)
**Strategy options:**
A. Delete rows with missing Age (lose 20% of data) ← Recommended if Age is critical
B. Fill with median age (35 years)
C. Fill with group average (median by Gender)
D. Keep as-is and exclude Age from analysis
### Issue 2: Outliers in 'Price' Column (3 negative values)
**Strategy options:**
A. Remove the 3 rows ← Recommended
B. Set negative values to 0
C. Set to the minimum valid price
### Issue 3: Date Format Inconsistency in 'PurchaseDate'
**Strategy options:**
A. Standardize to YYYY-MM-DD format ← Recommended (automatic)
Ask the user:
Please choose a strategy for each issue (e.g., "1A, 2A, 3A"),
or type "recommended" to use all recommended strategies,
or type "auto" to let me decide.
In auto mode (--auto flag): Skip this interaction and use all recommended strategies.
```
✅ Cleaning completed:
📊 Final dataset: {final_rows:,} rows × {cols} columns (was {original_rows:,} rows)
```
```python
cleaned_path = output_dir / 'cleaned_data.csv'
df_clean.to_csv(cleaned_path, index=False)
```
Extract Top 3-5 preliminary findings, such as:
Save all visualizations as PNG files:
import matplotlib.pyplot as plt
import seaborn as sns
# Example
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(data=df, x='Age', bins=30, ax=ax)
plt.title('Age Distribution')
plt.savefig(output_dir / 'age_distribution.png', dpi=300, bbox_inches='tight')
plt.close()
Report the findings in a structured format:
## Exploratory Findings
### Distribution Overview
- [Key observation about data distribution]
### Preliminary Insights
1. **[Insight 1]:** [Data supporting it]
2. **[Insight 2]:** [Data supporting it]
3. **[Insight 3]:** [Data supporting it]
📊 Visualizations saved: age_distribution.png, correlation_heatmap.png, channel_comparison.png
Based on the business question type identified in Step 1, automatically choose and apply the appropriate analysis method:
| Business Question Type | Analysis Method |
|---|---|
| ------------------------ | ----------------- |
| Trend over time | Time series analysis with moving averages, seasonality detection |
| Attribution/cause | Grouped comparison, contribution breakdown (e.g., which factor drives 80% of variance) |
| User behavior | Funnel analysis (conversion at each step), cohort retention analysis |
| Customer value | RFM model (Recency, Frequency, Monetary), clustering into segments |
| Forecasting | Simple linear regression or exponential smoothing for trend extrapolation |
# Time series with moving average
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df['7d_MA'] = df['Revenue'].rolling(window=7).mean()
# Detect trend
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(
df['Date'].map(pd.Timestamp.toordinal),
df['Revenue']
)
if p_value < 0.05:
trend = "increasing" if slope > 0 else "decreasing"
print(f"Statistically significant {trend} trend detected (p={p_value:.4f})")
# Calculate RFM scores
current_date = df['PurchaseDate'].max()
rfm = df.groupby('CustomerID').agg({
'PurchaseDate': lambda x: (current_date - x.max()).days, # Recency
'OrderID': 'count', # Frequency
'Amount': 'sum' # Monetary
}).rename(columns={
'PurchaseDate': 'Recency',
'OrderID': 'Frequency',
'Amount': 'Monetary'
})
# Score customers (1-5 scale)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5, 4, 3, 2, 1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1, 2, 3, 4, 5])
# Segment customers
rfm['Segment'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
Report deep analysis results with specific numbers:
## Deep Analysis Results
### [Analysis Type: e.g., "Channel Performance Attribution"]
**Key Metric Calculated:** [e.g., Conversion Rate by Channel]
| Channel | Orders | Conversion Rate | Contribution to Revenue |
|---------|--------|-----------------|-------------------------|
| A | 5,234 | 8.5% | 45% |
| B | 3,102 | 3.7% | 28% |
| C | 1,876 | 2.1% | 27% |
**Statistical Finding:**
Channel A's conversion rate is 2.3x higher than Channel B (p < 0.001), indicating significantly better targeting or user experience.
📊 Visualization saved: channel_performance.png
Synthesize all findings into a structured narrative following the What → So What → Now What framework:
## Analysis Report
### 🔍 Core Findings (What)
Objective facts from the data:
1. **[Finding 1]:** [Specific numbers and context]
2. **[Finding 2]:** [Specific numbers and context]
3. **[Finding 3]:** [Specific numbers and context]
### 💡 Business Insights (So What)
Interpretation and implications:
1. **[Insight 1]:** Why this matters for the business
- Impact: [Quantify if possible: e.g., "Could increase revenue by 15%"]
- Root cause hypothesis: [Why you think this is happening]
2. **[Insight 2]:** Why this matters for the business
- Impact: [...]
- Root cause hypothesis: [...]
### 🎯 Action Recommendations (Now What)
Prioritized next steps:
1. **[Action 1 - High Priority]:** What to do, expected outcome
- Timeline: [Immediate / This quarter / Long-term]
- Owner: [Which team should handle this]
2. **[Action 2 - Medium Priority]:** What to do, expected outcome
- Timeline: [...]
- Owner: [...]
For each finding, choose the most effective visualization:
Ensure all charts have:
If user specified format in initial request (e.g., --format=html), skip this and use that format.
Otherwise, present options:
## Analysis Complete! 🎉
Your analysis is ready. Please choose an output format:
1. **Quick Report** - Markdown document with embedded PNG charts
- Best for: Sharing via email, documentation, GitHub
- Generation time: ~10 seconds
2. **Interactive Report** - Single-page HTML with embedded Chart.js
- Best for: Presentations, exploring data interactively
- Generation time: ~30 seconds
3. **Full Dashboard** - Multi-page web application (Tailwind CSS + Chart.js)
- Best for: Sharing with stakeholders, ongoing monitoring
- Generation time: ~2 minutes
Enter 1, 2, or 3:
import shutil
report_path = output_dir / 'analysis_report.md'
with open(report_path, 'w', encoding='utf-8') as f:
f.write(f"# Data Analysis Report\n\n")
f.write(f"**Dataset:** {csv_filename}\n")
f.write(f"**Analysis Date:** {datetime.now().strftime('%Y-%m-%d')}\n\n")
f.write(f"---\n\n")
f.write(f"## Business Question\n\n{business_question}\n\n")
f.write(f"## Data Overview\n\n{data_overview}\n\n")
f.write(f"## Findings\n\n{findings}\n\n")
f.write(f"## Insights\n\n{insights}\n\n")
f.write(f"## Recommendations\n\n{recommendations}\n\n")
f.write(f"---\n\n")
f.write(f"## Appendix: Visualizations\n\n")
for chart in chart_files:
f.write(f"\n\n")
# Copy all chart PNGs to the output directory
for chart in chart_files:
shutil.copy(chart, output_dir)
Use this template structure:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Data Analysis Report</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<script src="https://cdn.tailwindcss.com"></script>
</head>
<body class="bg-gray-50 p-8">
<div class="max-w-6xl mx-auto bg-white shadow-lg rounded-lg p-8">
<h1 class="text-3xl font-bold text-gray-800 mb-4">Data Analysis Report</h1>
<!-- Business Question Section -->
<section class="mb-8">
<h2 class="text-2xl font-semibold text-gray-700 mb-3">Business Question</h2>
<p class="text-gray-600">[Insert question]</p>
</section>
<!-- Key Findings Section -->
<section class="mb-8">
<h2 class="text-2xl font-semibold text-gray-700 mb-3">🔍 Key Findings</h2>
<div class="grid grid-cols-1 md:grid-cols-3 gap-4">
<div class="bg-blue-50 p-4 rounded-lg">
<div class="text-3xl font-bold text-blue-600">[Metric 1]</div>
<div class="text-sm text-gray-600">[Description]</div>
</div>
<!-- Repeat for other metrics -->
</div>
</section>
<!-- Charts Section -->
<section class="mb-8">
<h2 class="text-2xl font-semibold text-gray-700 mb-3">📊 Visualizations</h2>
<div class="mb-6">
<canvas id="chart1"></canvas>
</div>
<!-- Repeat for other charts -->
</section>
<!-- Recommendations Section -->
<section class="mb-8">
<h2 class="text-2xl font-semibold text-gray-700 mb-3">🎯 Recommendations</h2>
<ol class="list-decimal list-inside space-y-2 text-gray-700">
<li>[Recommendation 1]</li>
<li>[Recommendation 2]</li>
</ol>
</section>
</div>
<script>
// Chart.js configurations
const ctx1 = document.getElementById('chart1').getContext('2d');
new Chart(ctx1, {
type: 'bar',
data: {
labels: [labels_from_analysis],
datasets: [{
label: 'Metric Name',
data: [data_from_analysis],
backgroundColor: 'rgba(59, 130, 246, 0.5)',
borderColor: 'rgb(59, 130, 246)',
borderWidth: 1
}]
},
options: {
responsive: true,
plugins: {
title: {
display: true,
text: 'Chart Title'
}
}
}
});
</script>
</body>
</html>
Create a file structure:
dashboard/
├── index.html (overview page)
├── data.html (detailed data explorer)
├── insights.html (findings and recommendations)
└── assets/
├── data.json (exported data for interactivity)
└── styles.css (custom styling)
Present the deliverables to the user:
## ✅ Analysis Complete!
Your analysis has been saved to: `{output_directory}`
### Deliverables:
- 📄 Analysis Report: `analysis_report.{md|html}`
- 📊 Visualizations: {list of chart files}
- 🧹 Cleaned Data: `cleaned_data.csv`
- 💾 Analysis Code: `analysis_notebook.ipynb` (optional)
### Key Takeaways:
1. [One-sentence summary of finding 1]
2. [One-sentence summary of finding 2]
3. [One-sentence summary of finding 3]
[If HTML/Dashboard]
You can open `{report_filename}` in your browser to explore the interactive report.
Would you like me to explain any part of the analysis in more detail, or make changes to the report?
Detect large files early and present options:
This dataset has {rows:,} rows. For faster analysis, I can:
1. Analyze a random sample (10%, ~{sample_size:,} rows) ← Recommended
2. Aggregate by {suggested_dimension} before analyzing
3. Process the full dataset (will take longer)
Which would you prefer?
This dataset has {cols} columns. To focus the analysis, I can:
1. Use only columns relevant to your question: {relevant_cols}
2. Let you select which columns to include
3. Analyze all columns (report will be lengthy)
Which approach works best?
If the user just says "analyze this CSV" without a specific question:
I can analyze this data for you! To make it most useful, what would you like to understand?
Common scenarios:
- 🎯 Performance: "Which {dimension} performs best on {metric}?"
- 📈 Trends: "How has {metric} changed over time?"
- 🔍 Patterns: "What factors correlate with {outcome}?"
- 📊 Overview: "Give me a general summary of this data"
Or describe what you're trying to figure out in your own words.
⚠️ This data has significant quality issues (Score: {score}/100):
- {issue_1}
- {issue_2}
Recommendations:
1. Contact the data source to get a cleaner version
2. Provide a data dictionary so I can better interpret the fields
3. I can proceed with a limited analysis, but findings will be marked as "low confidence"
How would you like to proceed?
Data Processing:
pandas - Primary data manipulationnumpy - Numerical operationsscipy - Statistical testsVisualization:
matplotlib - Static charts (PNG output)seaborn - Statistical visualizationsOutput Generation:
Always install required packages if missing:
import subprocess
import sys
def ensure_package(package):
try:
__import__(package)
except ImportError:
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
# Ensure core packages
for pkg in ['pandas', 'numpy', 'scipy', 'matplotlib', 'seaborn']:
ensure_package(pkg)
All outputs are saved to: ./data-analysis-results/{timestamp}/
Structure:
data-analysis-results/
└── 2024-03-29_14-30-45/
├── analysis_report.md (or .html)
├── cleaned_data.csv
├── charts/
│ ├── age_distribution.png
│ ├── correlation_heatmap.png
│ └── channel_comparison.png
└── analysis_notebook.ipynb (if requested)
Create the timestamped directory at the start:
from datetime import datetime
from pathlib import Path
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
output_dir = Path(f'./data-analysis-results/{timestamp}')
output_dir.mkdir(parents=True, exist_ok=True)
(output_dir / 'charts').mkdir(exist_ok=True)
共 1 个版本