← 返回
开发者工具 中文

ML Pipeline

Complete machine learning pipeline for trading: feature engineering, AutoML, deep learning, and financial RL. Use for automated parameter sweeps, feature cre...
完整的机器学习交易流水线:特征工程、AutoML、深度学习与金融强化学习。用于自动化参数搜索、特征构建...
ahuserious
开发者工具 clawhub v0.1.0 1 版本 100000 Key: 无需
★ 0
Stars
📥 1,033
下载
💾 88
安装
1
版本
#latest

概述

ML Pipeline

Unified skill for the complete ML pipeline within a quant trading research system.

Consolidates eight prior skills into a single authoritative reference covering

the full lifecycle: data validation, feature creation, selection,

transformation, anti-leakage checks, pipeline automation, deep learning optimization, and deployment.


1. When to Use

Activate this skill when the task involves any of the following:

  • Creating, selecting, or transforming features for an ML-driven strategy.
  • Auditing an existing feature pipeline for data leakage or overfitting risk.
  • Automating an end-to-end ML pipeline (data prep through model export).
  • Evaluating feature importance, scaling, encoding, or interaction effects.
  • Integrating features with a feature store (Feast, Tecton, custom Parquet store).
  • Explaining core ML concepts (bias-variance, cross-validation, regularisation)

in the context of feature engineering decisions.


2. Inputs to Gather

Before starting work, collect or confirm:

InputDetails
----------------
ObjectiveTarget metric (Sharpe, accuracy, RMSE ...), constraints, time horizon.
DataSymbols / instruments, timeframe, bar type, sampling frequency, data sources.
Leakage risksPoint-in-time concerns, survivorship bias, look-ahead in labels or features.
Compute budgetCPU/GPU limits, wall-clock budget for AutoML search.
LatencyOnline vs. offline inference, acceptable prediction latency.
InterpretabilityRegulatory or research need for explainable features / models.
Deployment targetWhere the model will run (notebook, backtest harness, live engine).

3. Feature Creation Patterns

3.1 Numerical Features

  • Interaction terms: price * volume, high / low, close - open.
  • Rolling statistics: mean, std, skew, kurtosis over configurable windows.
  • Polynomial / log transforms: log(volume + 1), spread^2.
  • Binning / discretisation: equal-width, quantile-based, or domain-driven bins.

3.2 Categorical Features

  • One-hot encoding: for low-cardinality categoricals (sector, exchange).
  • Target encoding: mean-target per category with smoothing (careful of leakage -- use only in-fold means).
  • Ordinal encoding: when categories have a natural order (credit rating).

3.3 Time-Series Specific

  • Lag features: return_{t-1}, return_{t-5}, etc.
  • Calendar features: day-of-week, month, quarter, options-expiry flag.
  • Rolling z-score: (x - rolling_mean) / rolling_std for stationarity.
  • Fractional differentiation: preserve memory while achieving stationarity (Lopez de Prado).

3.4 Feature Selection Techniques

  • Filter methods: mutual information, variance threshold, correlation pruning.
  • Wrapper methods: recursive feature elimination (RFE), forward/backward selection.
  • Embedded methods: L1 regularisation, tree-based importance, SHAP values.
  • Permutation importance: model-agnostic; run on out-of-fold predictions.

4. Anti-Leakage Checks

Data leakage is the single most common cause of inflated backtest results.

Apply these checks at every pipeline stage:

4.1 Label Leakage

  • Labels must be computed from future returns relative to the feature

timestamp. Verify that the label window does not overlap the feature window.

  • Use purging and embargo when labels span multiple bars.

4.2 Feature Leakage

  • No feature may use information from time t+1 or later at prediction time t.
  • Rolling statistics must use a closed left window: df['feat'].rolling(20).mean().shift(1).
  • Target-encoded categoricals must be computed on the training fold only.

4.3 Cross-Validation Leakage

  • Use purged k-fold or walk-forward CV for time-series. Never use random

k-fold on ordered data.

  • Insert an embargo gap between train and test folds to prevent bleed-through

from autocorrelation.

4.4 Survivorship & Selection Bias

  • Ensure the universe of instruments at time t reflects what was actually

tradable at that time (delisted stocks, halted symbols removed later).

  • Backfill from point-in-time databases where available.

4.5 Validation Checklist

Run before every backtest:

[ ] Labels computed strictly from future returns (no overlap with features)
[ ] All rolling features shifted by at least 1 bar
[ ] Target encoding uses in-fold means only
[ ] Walk-forward or purged CV used (no random shuffle on time-series)
[ ] Embargo gap >= max(label_horizon, autocorrelation_lag)
[ ] Universe is point-in-time (no survivorship bias)
[ ] No global scaling fitted on full dataset (fit on train, transform test)

5. Pipeline Automation (AutoML)

5.1 Prerequisites

  • Python environment with one or more AutoML libraries:

Auto-sklearn, TPOT, H2O AutoML, PyCaret, Optuna, or custom Optuna pipelines.

  • Training data in CSV / Parquet / database.
  • Problem type identified: classification, regression, or time-series forecasting.

5.2 Pipeline Steps

StepAction
--------------
1. Define requirementsProblem type, evaluation metric, time/resource budget, interpretability needs.
2. Data infrastructureLoad data, quality assessment, train/val/test split strategy, define feature transforms.
3. Configure AutoMLSelect framework, define algorithm search space, set preprocessing steps, choose tuning strategy (Bayesian, random, Hyperband).
4. Execute trainingRun automated feature engineering, model selection, hyperparameter optimisation, cross-validation.
5. Analyse & exportCompare models, extract best config, feature importance, visualisations, export for deployment.

5.3 Pipeline Configuration Template

pipeline_config = {
    "task_type": "classification",        # or "regression", "time_series"
    "time_budget_seconds": 3600,
    "algorithms": ["rf", "xgboost", "catboost", "lightgbm"],
    "preprocessing": ["scaling", "encoding", "imputation"],
    "tuning_strategy": "bayesian",        # or "random", "hyperband"
    "cv_folds": 5,
    "cv_type": "purged_kfold",            # or "walk_forward"
    "embargo_bars": 10,
    "early_stopping_rounds": 50,
    "metric": "sharpe_ratio",             # domain-specific metric
}

5.4 Output Artifacts

  • automl_config.py -- pipeline configuration.
  • best_model.pkl / .joblib / .onnx -- serialised model.
  • feature_pipeline.pkl -- fitted preprocessing + feature transforms.
  • evaluation_report.json -- metrics, confusion matrix / residuals, feature rankings.
  • deployment/ -- prediction API code, input validation, requirements.txt.

6. Core ML Fundamentals (Feature-Engineering Context)

6.1 Bias-Variance Trade-off

  • More features increase model capacity (lower bias) but risk overfitting (higher variance).
  • Use regularisation (L1/L2), feature selection, or dimensionality reduction to manage.

6.2 Evaluation Strategy

  • Walk-forward validation: the gold standard for time-series strategies.

Roll a fixed-width training window forward; test on the next out-of-sample period.

  • Monte Carlo permutation tests: shuffle labels and re-evaluate to estimate

the probability that observed performance is due to chance.

  • Combinatorial purged CV (CPCV): generate many train/test combinations with

purging for more robust performance estimates.

6.3 Feature Scaling

  • Fit scalers (StandardScaler, MinMaxScaler, RobustScaler) on the training set only.
  • Apply the same fitted scaler to validation and test sets.
  • RobustScaler is often preferred for financial data due to heavy tails.

6.4 Handling Missing Data

  • Forward-fill then backward-fill for price data (be aware of leakage on backfill).
  • Indicator column for missingness can itself be informative.
  • Tree-based models can handle NaN natively; linear models cannot.

7. Workflow

For any feature engineering task, follow this sequence:

  1. Restate the task in measurable terms (metric, constraints, deadline).
  2. Enumerate required artifacts: datasets, feature definitions, configs, scripts, reports.
  3. Propose a default approach and 1-2 alternatives with trade-offs.
  4. Implement feature pipeline with anti-leakage checks built in.
  5. Validate with walk-forward CV, Monte Carlo, and the leakage checklist above.
  6. Deliver repo-ready code, documentation, and a run command.

8. Deep Learning Optimization

8.1 Optimizer Selection

OptimizerBest ForLearning Rate
------------------------------------
AdamMost cases, adaptive1e-3 to 1e-4
AdamWTransformers, weight decay1e-4 to 1e-5
SGD + MomentumLarge batches, fine-tuning1e-2 to 1e-3
RAdamStability without warmup1e-3

8.2 Learning Rate Scheduling

  • OneCycleLR: Best for short training, fast convergence
  • CosineAnnealing: Smooth decay, good generalization
  • ReduceOnPlateau: Adaptive when validation loss plateaus
  • Warmup + Decay: Standard for transformers

8.3 Regularization Techniques

  • Dropout: 0.1-0.5 for fully connected layers
  • L2 (Weight Decay): 1e-4 to 1e-2
  • Batch Normalization: Stabilizes training
  • Early Stopping: Monitor validation loss, patience 5-10 epochs

8.4 PyTorch Lightning Integration

import pytorch_lightning as pl

class TradingModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-4)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
        )
        return [optimizer], [scheduler]

8.5 Financial Reinforcement Learning

  • State: Market features, portfolio state, position
  • Action: Buy/Sell/Hold, position sizing
  • Reward: Risk-adjusted returns (Sharpe, Sortino)
  • Frameworks: Stable-Baselines3, RLlib, FinRL

9. Error Handling

ProblemCauseFix
---------------------
AutoML search finds no good modelInsufficient time budget or poor featuresIncrease budget, engineer better features, expand algorithm search space.
Out of memory during trainingDataset too large for available RAMDownsample, use incremental learning, simplify feature engineering.
Model accuracy below thresholdWeak signal or overfittingCollect more data, add domain-driven features, regularise, adjust metric.
Feature transforms produce NaN/InfDivision by zero, log of negativeAdd guards: np.where(denom != 0, ...), np.log1p(np.abs(x)).
Optimiser fails to convergeBad hyperparameter rangesTighten search bounds, increase iterations, exclude unstable algorithms.

10. Bundled Scripts

All scripts live in scripts/ within this skill directory.

ScriptPurpose
-----------------
data_validation.pyValidate input data quality before pipeline execution.
model_evaluation.pyEvaluate trained model performance and generate reports.
pipeline_deployment.pyDeploy a trained pipeline to a target environment with rollback support.
feature_engineering_pipeline.pyEnd-to-end feature engineering: load, clean, transform, select, train.
feature_importance_analyzer.pyAnalyse feature importance (permutation, SHAP, tree-based).
data_visualizer.pyVisualise feature distributions and relationships to target.
feature_store_integration.pyIntegrate with feature stores (Feast, Tecton) for online/offline serving.

11. Resources

Frameworks

  • scikit-learn -- preprocessing, feature selection, pipelines.
  • Auto-sklearn / TPOT / H2O AutoML / PyCaret -- automated pipeline search.
  • Optuna -- flexible hyperparameter optimisation.
  • SHAP -- model-agnostic feature importance.
  • Feast / Tecton -- feature store management.
  • PyTorch Lightning -- https://lightning.ai/docs/pytorch/stable/
  • Stable-Baselines3 -- https://stable-baselines3.readthedocs.io/
  • FinRL -- https://github.com/AI4Finance-Foundation/FinRL

Key References

  • Lopez de Prado, Advances in Financial Machine Learning (2018) -- purged CV, fractional differentiation, meta-labelling.
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning -- bias-variance, regularisation, model selection.
  • scikit-learn user guide: feature extraction, preprocessing, model selection.

Best Practices

  • Always start with a simple baseline before running AutoML.
  • Balance automation with domain knowledge -- blind search rarely beats informed priors.
  • Monitor resource consumption; set hard timeouts.
  • Validate on true out-of-sample holdout data, not just cross-validation.
  • Document every pipeline decision for reproducibility.

版本历史

共 1 个版本

  • v0.1.0 当前
    2026-03-29 22:22 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

developer-tools

CodeConductor.ai

larsonreever
AI驱动平台,提供快速全栈开发、智能体、工作流自动化及低代码AI集成的可扩展产品创建。
★ 68 📥 180,260
data-analysis

Tearsheet Generator

ahuserious
利用QuantStats库生成包含自定义SVG可视化的专业业绩报告,制作带MAE分析和杠杆建议的表现报表。
★ 0 📥 747
developer-tools

Gog

steipete
Google Workspace 命令行工具,支持 Gmail、日历、云端硬盘、通讯录、表格和文档。
★ 921 📥 185,809