
expflow Pipeline HPO

PDEBench competition workflow orchestration with expflow: three pipeline modes (full/fast/skip), distributed hyperparameter optimization (HPO), pruner integration, and ClearML hyperparameter management.
Source: diamond2nv · clawhub v0.5.0
#clearml #competition #hpo #latest #mlops #optuna #pde #pipeline

Overview

expflow PDEBench Pipeline & HPO

Orchestrate experiment workflows for the AI4S PDE competition using expflow.

Three modes for three competition phases.

Triggers

  • User says "run HPO", "submit pipeline", "distributed experiment"
  • User says "competition sprint" or "fast iterate"
  • User asks about automating the train→eval→submit loop
  • User mentions needing to find best hyperparams

Installation

pip install "expflow-pde[pipeline]"

Available Pipeline Modes

Three pipeline modes, each mapped to a CLI command:

Mode A — Full (HPO → Train → Eval)

For the exploration phase of a competition task. Optuna finds the best parameters via distributed clearml-agent trials, the pipeline trains with those parameters, and then evaluates the result.

expflow pipeline submit-full train_task1.py \
    --queue default \
    --trials 50 --parallel 4 \
    --eval-script eval_task1.py \
    --metric seg_total --direction maximize

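Flags:

  • --trials N: total number of HPO trials
  • --parallel M: maximum number of concurrent trials (use the GPU node count)
  • --metric: name of the objective metric, read from lines prefixed with METRIC: in the script's stdout
  • --direction maximize|minimize: whether the metric should be maximized or minimized
  • --pruner hyperband|median|percentile: early-stop poorly performing trials
  • --study-name: Optuna study name (generated automatically if omitted; the study persists to SQLite)
  • --skip hpo --skip eval: run only the train step within the full pipeline skeleton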

Mode B — Fast (Train → Eval)

For the competition sprint phase, when the best parameters are already known. Skip HPO and run directly with fixed arguments.

expflow pipeline submit train_task1.py \
    --queue default \
    --train-param lr=0.001 --train-param epochs=80 \
    --eval-script eval_task1.py \
    --eval-param sub_step=5

Flags:

  • --skip eval: train-only (just submit checkpoint)
  • --train-param key=val: injected as --key=val to training script
  • --eval-param key=val: injected as --key=val to eval script
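The key=val injection described above can be pictured as a small transformation. This is a minimal sketch, not expflow's actual code: the helper name `params_to_cli_args` is hypothetical, and only the `--key=val` output shape is taken from the flag descriptions.

```python
def params_to_cli_args(pairs):
    """Turn ["lr=0.001", "epochs=80"] into ["--lr=0.001", "--epochs=80"].

    Hypothetical helper: it mirrors the documented behaviour of
    --train-param / --eval-param, not expflow's real implementation.
    """
    args = []
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=val, got {pair!r}")
        args.append(f"--{key}={value}")
    return args


train_args = params_to_cli_args(["lr=0.001", "epochs=80"])
print(train_args)  # ['--lr=0.001', '--epochs=80']
```

Because the values arrive as plain strings, the training script is responsible for converting them to the right types, for example with argparse.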

Mode C — Flexible Skip

Override which steps run in either mode with --skip:

expflow pipeline submit-full train_task1.py \
    --skip hpo --skip eval          # = train only
expflow pipeline submit-full train_task1.py \
    --skip train --skip eval         # = HPO only
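One way to read the --skip combinations above is as a filter over an ordered step list. The sketch below is an illustration under that assumption; `PIPELINE_STEPS` and `resolve_steps` are hypothetical names, and expflow's internal logic may differ.

```python
# Step order taken from Mode A: HPO -> Train -> Eval
PIPELINE_STEPS = ("hpo", "train", "eval")


def resolve_steps(skip):
    """Return the steps that remain after applying repeated --skip values.

    Hypothetical helper for illustration only.
    """
    unknown = set(skip) - set(PIPELINE_STEPS)
    if unknown:
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    return [step for step in PIPELINE_STEPS if step not in skip]


print(resolve_steps(["hpo", "eval"]))    # ['train']
print(resolve_steps(["train", "eval"]))  # ['hpo']
```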

HPO: Three Execution Modes

HPO (expflow optuna run) has three backends:

| Mode | Flag | Description | Best for |
|------|------|-------------|----------|
| Local | (default) | Serial subprocesses on CPU | ≤20 trials, quick test |
| Distributed | --distributed | ask/tell + ClearML Task clone | Multi-GPU, custom control |
| Optimizer | --optimizer -O | ClearML HyperParameterOptimizer | Production, 50-200+ trials |

Key flags across all HPO modes:

  • --pruner hyperband|median|percentile|none: ASHA pruner saves ~40% GPU time
  • --metric <name>: reads METRIC:<name>=<value> from script stdout
  • --direction maximize|minimize
  • --timeout <value>: safety cutoff

Script Requirements

The training/eval script must:

  1. Accept hyperparams as --key=value CLI arguments
  2. Output METRIC:<name>=<value> to stdout for objective capture (local mode)
  3. Report clearml scalars for distributed/optimizer mode:

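```python
from clearml import Task

# report_scalar is a Logger method, so fetch the task's logger first
Task.current_task().get_logger().report_scalar("Score", "seg_total", value, iteration=epoch)
```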
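Requirements 1 and 2 can be sketched together as a minimal script skeleton. The exact metric line format is an assumption here (`METRIC:<name>=<value>`), and `format_metric` and `read_metric` are hypothetical helpers. The score formula is a placeholder, not a real training loop.

```python
import argparse
import re

# Assumed line format: METRIC:<name>=<value>
METRIC_RE = re.compile(r"^METRIC:(?P<name>[^=\s]+)=(?P<value>\S+)$")


def parse_args(argv=None):
    """Accept hyperparameters as --key=value CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=80)
    return parser.parse_args(argv)


def format_metric(name, value):
    """Build the stdout line that carries the objective value."""
    return f"METRIC:{name}={value:.6f}"


def read_metric(stdout_text, name):
    """Return the last reported value for `name`, or None if absent."""
    found = None
    for line in stdout_text.splitlines():
        match = METRIC_RE.match(line.strip())
        if match and match["name"] == name:
            try:
                found = float(match["value"])
            except ValueError:
                continue  # ignore lines whose value is not a number
    return found


# Simulates: python train_task1.py --lr=0.001 --epochs=80
args = parse_args(["--lr=0.001", "--epochs=80"])
score = 1.0 / (1.0 + args.lr * args.epochs)  # placeholder for real training
line = format_metric("seg_total", score)
print(line)
print(read_metric(line, "seg_total"))
```

In a real script, `parse_args()` would be called without arguments so that it reads `sys.argv`.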

Pitfalls

  • The pruner needs trial.report() calls during training. If the script reports only at the end, the pruner has no intermediate values to act on. Call trial.report(val_loss, epoch) at least every 10 epochs.
  • HyperParameterOptimizer needs the metric name in Title/Series format. A bare name such as seg_total is interpreted as title=seg_total, series=seg_total. If your script calls report_scalar("Score", "seg_total", v), pass --metric Score/seg_total.
  • clearml-agent must be running on the GPU nodes before you submit. Verify with expflow clearml workers or check the Web UI.
  • _collect_one_trial polls every 5 s and waits up to 60 min per trial. If trials are expected to run longer, increase timeout_minutes.

Architecture Reference

Key files in expflow_pde/:

  • hpo.py: 3-mode HPO runner (local/distributed/optimizer)
  • pipeline.py: ExperimentPipeline class (fast/full modes)
  • cli_pipeline.py: pipeline submit + pipeline submit-full
  • cli_optuna.py: optuna run with all three backends

Related

  • experiment-lifecycle-governance — PIN, metrics registry, compare-scores, competition rules audit
  • pde-experiment-hyperparameters — PDEBench-specific hyperparameter reference
  • multi-agent-distributed-experiment-workflow — Hermes → OpenCode → clearml

Version history

1 version in total

  • v0.5.0 (current)
    2026-05-21 15:39
