ClearML Metrics Logging Pattern

Standardized ClearML metrics logging patterns for PDEBench experiment scripts — train loss, validation metrics, competition scores, PDE residual, and TensorB...
Source: diamond2nv
clawhub, v0.5.0 (1 version), Key: not required

Overview

When to Use

  • Creating or modifying PDEBench training/evaluation scripts
  • Adding clearml logging to train_task1.py, train_task1_phys.py, train_task1_ft.py, train_task1_unroll.py
  • Ensuring expflow (single-node + distributed) can auto-capture metrics
  • Standardizing metric naming for compare-scores and gating

Installation

pip install "expflow-pde[clearml]"

Standardized Metric Naming Convention

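All ClearML metrics use Group/Metric naming: the first argument of report_scalar (its title) is the group and the second (its series) is the metric. The names are compatible with expflow clearml compare-scores: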

# Loss group — error/cost related scalars
clearml_logger.report_scalar('Loss', 'Train MSE',     float_val, iteration=epoch)
clearml_logger.report_scalar('Loss', 'Val MSE',       float_val, iteration=epoch)
clearml_logger.report_scalar('Loss', 'Val RelMSE',    float_val, iteration=epoch)
clearml_logger.report_scalar('Loss', 'Physics',       float_val, iteration=epoch)
clearml_logger.report_scalar('Loss', 'Commut',        float_val, iteration=epoch)
clearml_logger.report_scalar('Loss', 'Stability',     float_val, iteration=epoch)

# Score group — competition segment scores (100-point scale)
clearml_logger.report_scalar('Score', 'Seg Total',    float_val, iteration=epoch)
clearml_logger.report_scalar('Score', 'Seg1',         float_val, iteration=epoch)
clearml_logger.report_scalar('Score', 'Seg2',         float_val, iteration=epoch)
clearml_logger.report_scalar('Score', 'Seg3',         float_val, iteration=epoch)

# PDE group — PDE residuals (per-segment)
clearml_logger.report_scalar('PDE', 'Mean Residual',  float_val, iteration=epoch)
clearml_logger.report_scalar('PDE', 'Seg1 Residual',  float_val, iteration=epoch)
clearml_logger.report_scalar('PDE', 'Seg2 Residual',  float_val, iteration=epoch)
clearml_logger.report_scalar('PDE', 'Seg3 Residual',  float_val, iteration=epoch)

# System group — system monitoring
clearml_logger.report_scalar('System', 'GPU Alloc MB',   float_val, iteration=epoch)
clearml_logger.report_scalar('System', 'GPU Reserved MB', float_val, iteration=epoch)
clearml_logger.report_scalar('System', 'LR',              float_val, iteration=epoch)

# Kfold group — k-fold cross-validation results
clearml_logger.report_scalar('Kfold', 'Mean Seg',    float_val, iteration=0)
clearml_logger.report_scalar('Kfold', 'Std Seg',     float_val, iteration=0)
clearml_logger.report_scalar('Kfold', 'CV Seg%',     float_val, iteration=0)
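
One way to keep these names from drifting is to check every group and metric pair against a single table before reporting. The sketch below is only an illustration: ALLOWED_METRICS and check_metric_name are hypothetical names, not part of clearml or expflow, and the table simply mirrors the listing above.

```python
# Hypothetical helper (not part of clearml or expflow).
# The table mirrors the Group/Metric listing in this document.
ALLOWED_METRICS = {
    'Loss': {'Train MSE', 'Val MSE', 'Val RelMSE', 'Physics', 'Commut', 'Stability'},
    'Score': {'Seg Total', 'Seg1', 'Seg2', 'Seg3'},
    'PDE': {'Mean Residual', 'Seg1 Residual', 'Seg2 Residual', 'Seg3 Residual'},
    'System': {'GPU Alloc MB', 'GPU Reserved MB', 'LR'},
    'Kfold': {'Mean Seg', 'Std Seg', 'CV Seg%'},
}


def check_metric_name(group, name):
    """Return True if (group, name) is one of the standardized pairs."""
    return name in ALLOWED_METRICS.get(group, set())
```

Calling check_metric_name before report_scalar, for instance inside an assert, would catch a misspelling such as Score/Seg_Total at the point where it is introduced.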

Code Templates

Template A: Add clearml logging to training loop

Insert into existing train_task1.py / train_task1_phys.py / train_task1_ft.py / train_task1_unroll.py:

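# After Task.init(), get the logger (clearml_task is the object returned by Task.init(), or None)
clearml_logger = None
if clearml_task is not None:
    try:
        clearml_logger = clearml_task.get_logger()
    except Exception:
        # ClearML logging is optional: continue training without it
        clearml_logger = None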

# At end of epoch loop (after avg_loss is computed)
if clearml_logger is not None:
    clearml_logger.report_scalar('Loss', 'Train MSE', avg_loss, iteration=epoch + 1)
    clearml_logger.report_scalar('System', 'LR', scheduler.get_last_lr()[0], iteration=epoch + 1)
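    if DEVICE.type == 'cuda':
        # gpu_alloc is the allocated GPU memory in MB, computed earlier in the script
        # (for example torch.cuda.memory_allocated() / 1024**2)
        clearml_logger.report_scalar('System', 'GPU Alloc MB', round(gpu_alloc, 1), iteration=epoch + 1)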

# After validation (after val_mse, val_rel, seg are computed)
if clearml_logger is not None:
    clearml_logger.report_scalar('Loss', 'Val MSE', val_mse, iteration=epoch + 1)
    clearml_logger.report_scalar('Loss', 'Val RelMSE', val_rel, iteration=epoch + 1)
    clearml_logger.report_scalar('Score', 'Seg Total', seg['total_segmented_score'], iteration=epoch + 1)
    clearml_logger.report_scalar('Score', 'Seg1', seg['seg1_score'], iteration=epoch + 1)
    clearml_logger.report_scalar('Score', 'Seg2', seg['seg2_score'], iteration=epoch + 1)
    clearml_logger.report_scalar('Score', 'Seg3', seg['seg3_score'], iteration=epoch + 1)

# For physics loss (train_task1_phys.py)
if clearml_logger is not None and phys_loss is not None:
    clearml_logger.report_scalar('Loss', 'Physics', phys_loss.item(), iteration=epoch + 1)
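
Template A repeats the same None check before each group of calls and converts phys_loss with .item() by hand. A small wrapper can do both in one place. The following is a minimal sketch, not part of the original scripts: to_float and report_if_finite are hypothetical names, and skipping NaN or infinite values is an assumption about what is desirable rather than something clearml requires.

```python
import math


def to_float(value):
    """Convert a tensor or NumPy scalar (anything with .item()) to a Python float."""
    if hasattr(value, 'item'):
        value = value.item()
    return float(value)


def report_if_finite(logger, group, name, value, iteration):
    """Report one scalar; return True if it was sent.

    Nothing is sent when logger is None or the value is NaN/inf
    (assumption: non-finite points are not wanted on the plots).
    """
    if logger is None:
        return False
    val = to_float(value)
    if not math.isfinite(val):
        return False
    logger.report_scalar(group, name, val, iteration=iteration)
    return True
```

With this in place, a call such as report_if_finite(clearml_logger, 'Loss', 'Physics', phys_loss, epoch + 1) would replace the explicit if clearml_logger is not None block.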

Template B: Eval script clearml logging

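def run_eval_and_log(model, val_data, cl_task, tag):
    """Evaluate once and log the scores to ClearML when a task is available.

    Note: tag is accepted but not used inside this template.
    """
    clearml_logger = cl_task.get_logger() if cl_task is not None else None
    val_mse, val_rel, seg_scores = evaluate_autoregressive(model, val_data)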

    if clearml_logger is not None:
        clearml_logger.report_scalar('Score', 'Seg Total', seg_scores['total_segmented_score'], iteration=1)
        clearml_logger.report_scalar('Score', 'Seg1', seg_scores['seg1_score'], iteration=1)
        clearml_logger.report_scalar('Score', 'Seg2', seg_scores['seg2_score'], iteration=1)
        clearml_logger.report_scalar('Score', 'Seg3', seg_scores['seg3_score'], iteration=1)
        clearml_logger.report_scalar('Loss', 'Val MSE', val_mse, iteration=1)
        clearml_logger.report_scalar('Loss', 'Val RelMSE', val_rel, iteration=1)

    return val_mse, val_rel, seg_scores

Template C: Double Logger (TensorBoardX + ClearML)

class DoubleLogger:
    def __init__(self, tb_writer=None, cl_logger=None):
        self.tb = tb_writer
        self.cl = cl_logger

    def scalar(self, group, name, value, iteration):
        if self.tb is not None:
            self.tb.add_scalar(f'{group}/{name}', value, iteration)
        if self.cl is not None:
            self.cl.report_scalar(group, name, value, iteration=iteration)
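
A usage sketch for Template C follows. To stay runnable without TensorBoardX or a ClearML server, it uses two recording stand-ins; RecordingTB and RecordingCL are hypothetical and exist only for this illustration. In a real script, tb_writer would be a TensorBoardX SummaryWriter and cl_logger the logger obtained from the ClearML task.

```python
class DoubleLogger:
    """Same class as Template C, repeated so the example is self-contained."""

    def __init__(self, tb_writer=None, cl_logger=None):
        self.tb = tb_writer
        self.cl = cl_logger

    def scalar(self, group, name, value, iteration):
        if self.tb is not None:
            self.tb.add_scalar(f'{group}/{name}', value, iteration)
        if self.cl is not None:
            self.cl.report_scalar(group, name, value, iteration=iteration)


class RecordingTB:
    """Stand-in exposing add_scalar(tag, value, step) like a TensorBoard writer."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append((tag, value, step))


class RecordingCL:
    """Stand-in exposing report_scalar(title, series, value, iteration)."""

    def __init__(self):
        self.records = []

    def report_scalar(self, title, series, value, iteration):
        self.records.append((title, series, value, iteration))


tb, cl = RecordingTB(), RecordingCL()
dlog = DoubleLogger(tb_writer=tb, cl_logger=cl)
dlog.scalar('Loss', 'Train MSE', 0.25, 1)

# Either backend may be omitted; the missing one is skipped.
cl_only = DoubleLogger(cl_logger=cl)
cl_only.scalar('Score', 'Seg Total', 71.5, 1)
```

The TensorBoard side receives the joined tag Loss/Train MSE while the ClearML side receives group and metric separately, so both backends show the same Group/Metric name.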

Consistency with expflow

  • Group names match compare-scores display names
  • Metric names match STANDARD_METRICS keys (via underscore)
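  • iteration must increase monotonically within each metric (it is the x-axis of the ClearML plot); the training templates use epoch + 1
  • Single-value eval metrics use iteration=1 (the Kfold summary scalars in the naming listing use iteration=0)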
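
The monotonic-iteration rule can be enforced in code rather than by convention. The guard below is a sketch under one assumption: "monotonically" is read as strictly increasing per (group, metric) pair. MonotonicIterationGuard is a hypothetical helper, not part of clearml or expflow.

```python
class MonotonicIterationGuard:
    """Remember the last iteration per (group, name) and reject non-increasing ones."""

    def __init__(self):
        self._last = {}

    def check(self, group, name, iteration):
        key = (group, name)
        last = self._last.get(key)
        if last is not None and iteration <= last:
            raise ValueError(
                f'{group}/{name}: iteration {iteration} does not follow {last}'
            )
        self._last[key] = iteration
        return iteration


guard = MonotonicIterationGuard()
guard.check('Loss', 'Train MSE', 1)
guard.check('Loss', 'Train MSE', 2)
guard.check('Loss', 'Val MSE', 1)  # a different metric keeps its own counter
```

Passing iteration=guard.check(group, name, epoch + 1) to report_scalar would make a repeated or reset epoch counter fail loudly instead of producing a confusing plot.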

Known Pitfalls

  1. get_logger() must be called on the task object returned by Task.init(); before Task.init() runs there is no task (Task.current_task() returns None), so no logger can be obtained
  2. With capture_tensorboard=True, dual-writing through TensorBoardX and ClearML works, but ClearML adds the TensorBoard path as a prefix to group names
  3. Distributed metrics are stored per trial: the parent Optuna study stores only user_objective, not aggregated trial metrics
  4. Group and metric names must be spelled consistently: always Score/Seg Total, never Score/Seg_Total

Version History

1 version in total

  • v0.5.0 (current)
    2026-05-21 15:46
