Data Analysis Skill
Analyze data files (CSV, Excel) and produce actionable insights.
Quick Start
- Read the file - Use appropriate library:
- CSV:
csv module or pandas.read_csv() - Excel:
pandas.read_excel() with openpyxl engine
- Explore the data - Get shape, columns, dtypes, missing values
- Generate insights - Calculate:
- Descriptive stats (mean, median, mode, std, min, max)
- Correlations between numeric columns
- Value counts for categorical columns
- Trends over time if date column exists
- Create visualizations - Use matplotlib:
- Bar charts for categorical data
- Line charts for time series
- Histograms for distributions
- Scatter plots for correlations
- Summarize - Write findings in plain English
Common Patterns
Sales Data
import pandas as pd
df = pd.read_csv('sales.csv')
summary = {
'total_revenue': df['amount'].sum(),
'avg_order': df['amount'].mean(),
'top_products': df['product'].value_counts().head(5),
'monthly_trend': df.groupby(pd.to_datetime(df['date']).dt.month)['amount'].sum()
}
Customer Data
demographics = df.groupby('segment').agg({
'age': ['mean', 'median'],
'income': ['mean', 'std'],
'id': 'count'
})
Time Series
df['date'] = pd.to_datetime(df['date'])
monthly = df.resample('M', on='date')['value'].sum()
Output Format
Always include:
- Overview - What the data contains (rows, columns, date range)
- Key Metrics - Top 5-10 actionable numbers
- Insights - 3-5 bullet points of what the data reveals
- Visualizations - At least 2 charts for any dataset with 100+ rows
- Recommendations - Suggested next steps based on findings
Error Handling
- Handle missing values:
df.fillna(0) or df.dropna() - Handle date parsing: Use
pd.to_datetime(..., errors='coerce') - Handle large files: Process in chunks for files >100MB