Overview

expflow PDEBench Pipeline & HPO

Orchestrate experiment workflows for the AI4S PDE competition using expflow.

Three modes for three competition phases.

Triggers

User says "run HPO", "submit pipeline", "distributed experiment"
User says "competition sprint" or "fast iterate"
User asks about automating the train→eval→submit loop
User mentions needing to find best hyperparams

Installation

pip install "expflow-pde[pipeline]"

Available Pipeline Modes

Three pipeline modes, each mapped to a CLI command:

Mode A — Full (HPO → Train → Eval)

For the exploration phase of a competition task. Optuna finds the best params via distributed clearml-agent trials, then the pipeline trains with the best params and evaluates.

expflow pipeline submit-full train_task1.py \
    --queue default \
    --trials 50 --parallel 4 \
    --eval-script eval_task1.py \
    --metric seg_total --direction maximize

Flags used:

--trials N: total HPO trials
--parallel M: max concurrent trials (use GPU node count)
--metric: name of the objective metric, read from the METRIC:-prefixed line in script stdout
--pruner hyperband|median|percentile: early-stop poor trials
--study-name: Optuna study name (auto if omitted; persists to SQLite)
--skip hpo --skip eval: run train only within full skeleton

Mode B — Fast (Train → Eval)

For the competition sprint phase. You already know the best params, so skip HPO and run directly with fixed args.

expflow pipeline submit train_task1.py \
    --queue default \
    --train-param lr=0.001 --train-param epochs=80 \
    --eval-script eval_task1.py \
    --eval-param sub_step=5

Flags:

--skip eval: train-only (just submit checkpoint)
--train-param key=val: injected as --key=val to training script
--eval-param key=val: injected as --key=val to eval script
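
The key=val injection described by these flags can be pictured with a small helper. This is only a sketch: `params_to_cli_args` is a hypothetical name, not part of expflow, and the real implementation may treat quoting or duplicate keys differently.

```python
from typing import List


def params_to_cli_args(pairs: List[str]) -> List[str]:
    """Turn ["lr=0.001", "epochs=80"] into ["--lr=0.001", "--epochs=80"]."""
    args = []
    for pair in pairs:
        # Split on the first "=" only, so a value may itself contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=val, got {pair!r}")
        args.append(f"--{key}={value}")
    return args


if __name__ == "__main__":
    print(params_to_cli_args(["lr=0.001", "epochs=80"]))
```

Under this assumption, the Mode B command above would start the training script with `--lr=0.001 --epochs=80`.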

Mode C — Flexible Skip

Override step inclusion on either mode:

expflow pipeline submit-full train_task1.py \
    --skip hpo --skip eval          # = train only
expflow pipeline submit-full train_task1.py \
    --skip train --skip eval         # = HPO only
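
One way to read the --skip semantics is as set subtraction from the full step list. The sketch below is illustrative: `FULL_STEPS` and `resolve_steps` are hypothetical names, and the step order is assumed from the "HPO → Train → Eval" description of Mode A.

```python
from typing import Iterable, List

# Assumed step order, taken from the "HPO -> Train -> Eval" description.
FULL_STEPS = ("hpo", "train", "eval")


def resolve_steps(skip: Iterable[str]) -> List[str]:
    """Return the steps that remain after removing the skipped ones."""
    skipped = set(skip)
    unknown = skipped - set(FULL_STEPS)
    if unknown:
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    # Preserve pipeline order regardless of the order of the --skip flags.
    return [step for step in FULL_STEPS if step not in skipped]


if __name__ == "__main__":
    print(resolve_steps(["hpo", "eval"]))    # train only
    print(resolve_steps(["train", "eval"]))  # HPO only
```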

HPO: Three Execution Modes

HPO (expflow optuna run) has three backends:

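local: the objective is read from the METRIC: line in script stdout
distributed: trials run on clearml-agent workers; the objective comes from ClearML scalars
optimizer: ClearML HyperParameterOptimizer; the metric is named in Title/Series form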

Key flags across all HPO modes:

--pruner hyperband|median|percentile|none: stops poor trials early; ASHA-style pruning saves roughly 40% of GPU time
--metric NAME: reads METRIC:NAME=VALUE from script stdout
--direction maximize|minimize
--timeout LIMIT: safety cutoff
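
To illustrate how an objective could be captured from stdout, here is a minimal sketch. It assumes the metric line has the form `METRIC:<name>=<value>` and that the last reported value wins; `read_metric` is a hypothetical helper, not expflow's actual parser.

```python
import re
from typing import Optional


def read_metric(stdout: str, name: str) -> Optional[float]:
    """Return the last value reported for `name`, or None if it never appears."""
    # Assumed line format: METRIC:<name>=<value>, one report per line.
    pattern = re.compile(rf"^METRIC:{re.escape(name)}=(\S+)\s*$", re.MULTILINE)
    matches = pattern.findall(stdout)
    return float(matches[-1]) if matches else None


if __name__ == "__main__":
    log = "epoch 1\nMETRIC:seg_total=0.71\nepoch 2\nMETRIC:seg_total=0.83\n"
    print(read_metric(log, "seg_total"))
```

Taking the last match is a design choice for scripts that print the metric every epoch; a runner could equally take the best value according to --direction.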

Script Requirements

The training/eval script must:

Accept hyperparams as --key=value CLI arguments
Print METRIC:NAME=VALUE to stdout for objective capture (local mode)
Report ClearML scalars for distributed/optimizer mode:

```python
from clearml import Task

Task.current_task().get_logger().report_scalar("Score", "seg_total", value, iteration=epoch)
```
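
The first two requirements can be sketched as a toy script. The argument names `--lr` and `--epochs` mirror the Mode B example; the `METRIC:<name>=<value>` line format and the placeholder training loop are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Toy training script")
    # argparse accepts both "--lr 0.001" and "--lr=0.001".
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=80)
    return parser.parse_args(argv)


def metric_line(name, value):
    # Assumed format: METRIC:<name>=<value>
    return f"METRIC:{name}={value:.6f}"


def main(argv=None):
    args = parse_args(argv)
    score = 0.0
    for epoch in range(args.epochs):
        # Placeholder for a real training and validation step.
        score = 1.0 - 1.0 / (epoch + 2)
    line = metric_line("seg_total", score)
    print(line)
    return line


if __name__ == "__main__":
    main()
```

A script shaped like this could then be driven with `--metric seg_total --direction maximize`.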

Pitfalls

The pruner needs trial.report() calls during training. If the script reports only at the end, the pruner has no intermediate values to act on. Call trial.report(val_loss, epoch) at least every 10 epochs.
HyperParameterOptimizer needs the metric name in Title/Series format. A bare name such as seg_total is interpreted as title=seg_total, series=seg_total. If the script calls report_scalar("Score", "seg_total", v), pass --metric Score/seg_total.
clearml-agent must be running on the GPU nodes before you submit. Verify with expflow clearml workers or check the Web UI.
_collect_one_trial polls every 5 s and waits up to 60 min per trial. If trials are expected to run longer, increase timeout_minutes.
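
The Title/Series rule from the second pitfall can be written out as a short function. This is a sketch of the behavior as described, with `split_metric_name` as a hypothetical name; the actual parsing in expflow may differ.

```python
from typing import Tuple


def split_metric_name(metric: str) -> Tuple[str, str]:
    """Split "Title/Series" into (title, series); a bare name is used for both."""
    title, sep, series = metric.partition("/")
    if not sep:
        # No slash: the same name serves as title and series.
        return metric, metric
    if not title or not series:
        raise ValueError(f"expected Title/Series, got {metric!r}")
    return title, series


if __name__ == "__main__":
    print(split_metric_name("Score/seg_total"))
    print(split_metric_name("seg_total"))
```

Seen this way, passing only seg_total looks for a scalar whose title is seg_total, which does not match a script that reports under the title Score.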

Architecture Reference

Key files in expflow_pde/:

hpo.py — 3-mode HPO runner (local/distributed/optimizer)
pipeline.py — ExperimentPipeline class (fast/full modes)
cli_pipeline.py — pipeline submit + pipeline submit-full
cli_optuna.py — optuna run with all three backends

Version History

1 version in total

v0.5.0 (current)

2026-05-21 15:39
