Orchestrate experiment workflows for the AI4S PDE competition using expflow.
Three modes for three competition phases.
pip install "expflow-pde[pipeline]"
Three pipeline modes, each mapped to a CLI command:
For the exploration phase of a competition task. Optuna finds best params
via distributed clearml-agent trials, trains with best, then evaluates.
expflow pipeline submit-full train_task1.py \
--queue default \
--trials 50 --parallel 4 \
--eval-script eval_task1.py \
--metric seg_total --direction maximize
Flags used:
--trials N: total HPO trials
--parallel M: max concurrent trials (use GPU node count)
--metric: objective metric name prefixed METRIC: in script stdout
--pruner hyperband|median|percentile: early-stop poor trials
--study-name: Optuna study name (auto if omitted; persists to SQLite)
--skip hpo --skip eval: run train only within full skeleton
For the competition sprint phase. You already know best params. Skip HPO,
run directly with fixed args.
expflow pipeline submit train_task1.py \
--queue default \
--train-param lr=0.001 --train-param epochs=80 \
--eval-script eval_task1.py \
--eval-param sub_step=5
Flags:
--skip eval: train-only (just submit checkpoint)
--train-param key=val: injected as --key=val to training script
--eval-param key=val: injected as --key=val to eval script
Override step inclusion on either mode:
expflow pipeline submit-full train_task1.py \
--skip hpo --skip eval # = train only
expflow pipeline submit-full train_task1.py \
--skip train --skip eval # = HPO only
HPO (expflow optuna run) has three backends:
| Mode | Flag | Description | Best for |
|------|------|-------------|----------|
| Local | (default) | subprocess serial on CPU | ≤20 trials, quick test |
| Distributed | --distributed | ask/tell + clearml Task clone| Multi-GPU, custom control|
| Optimizer | --optimizer -O | Clearml HyperParameterOptimizer | Production, 50-200+ trials |
--pruner hyperband|median|percentile|none: ASHA pruner saves ~40% GPU time
--metric : reads METRIC:= from script stdout
--direction maximize|minimize
--timeout : safety cutoff
The training/eval script must:
--key=value CLI arguments
METRIC:= to stdout for objective capture (local mode)
```python
Task.current_task().report_scalar("Score", "seg_total", value, iteration=epoch)
```
trial.report() calls during training. If the script only reports at the end, the pruner has nothing to prune on. Call trial.report(val_loss, epoch) at least every 10 epochs.
Title/Series format. If your metric is seg_total, it becomes title=seg_total, series=seg_total. If your clearml report_scalar is report_scalar("Score", "seg_total", v), pass --metric Score/seg_total.
expflow clearml workers or check Web UI.
_collect_one_trial polls every 5s — waits up to 60min per trial. If trials are expected to run longer, increase timeout_minutes.
Key files in expflow_pde/:
hpo.py — 3-mode HPO runner (local/distributed/optimizer)
pipeline.py — ExperimentPipeline class (fast/full modes)
cli_pipeline.py — pipeline submit + pipeline submit-full
cli_optuna.py — optuna run with all three backends
experiment-lifecycle-governance — PIN, metrics registry, compare-scores, competition rules audit
pde-experiment-hyperparameters — PDEBench-specific hyperparameter reference
multi-agent-distributed-experiment-workflow — Hermes → OpenCode → clearml
共 1 个版本