Incorrect opt_end_learning_rate for GBS 64 in gpt-oss-20b RCPs #458

@ramgandikota

Description

While collecting submission logs and comparing them against the RCPs, we found that the GBS 64 RCPs use an LR of "1e-05" together with a MIN_LR of "4e-05". This would violate the hyperparameter rule for this benchmark, which ties the end learning rate to the base learning rate:
https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#91-hyperparameters

gpt_oss_20b | adamw | opt_end_learning_rate | opt_base_learning_rate * 0.1 

Logs from reference RCPs:
https://github.com/mlcommons/training/blob/master/small_llm_moe_pretraining/primus/rcp_logs/gbs64/run_0.log#L30

Log snippet:

:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": "opt_adamw_epsilon", "value": 1e-05, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": "opt_adamw_weight_decay", "value": 0.1, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": "opt_gradient_clip_norm", "value": 1.0, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": "opt_end_learning_rate", "value": 4e-05, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
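
For anyone who wants to check other RCP logs against the same rule, here is a minimal sketch of how it could be scripted. It assumes the base LR is logged under the key `opt_base_learning_rate` (the name used in the rule) and that each record is a single `:::MLLOG {json}` line. The helper names `parse_mllog_values` and `end_lr_follows_rule` are hypothetical, and the sample lines are trimmed, synthetic records built from the values quoted above, not copied from the RCP log.

```python
import json
import math

MLLOG_PREFIX = ":::MLLOG "


def parse_mllog_values(lines):
    """Collect {key: value} from every :::MLLOG line that parses as a JSON object."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line.startswith(MLLOG_PREFIX):
            continue
        try:
            record = json.loads(line[len(MLLOG_PREFIX):])
        except json.JSONDecodeError:
            continue  # skip malformed records instead of failing the whole scan
        if isinstance(record, dict) and "key" in record:
            values[record["key"]] = record.get("value")
    return values


def end_lr_follows_rule(values, ratio=0.1):
    """True/False if both keys are present, None if either one is missing."""
    base = values.get("opt_base_learning_rate")  # assumed key name, taken from the rule
    end = values.get("opt_end_learning_rate")
    if base is None or end is None:
        return None
    # Compare with a tolerance: base * 0.1 is not exact in floating point.
    return math.isclose(end, base * ratio, rel_tol=1e-9)


# Synthetic, trimmed records using the values quoted in this issue.
sample = [
    ':::MLLOG {"key": "opt_base_learning_rate", "value": 1e-05}',
    ':::MLLOG {"key": "opt_end_learning_rate", "value": 4e-05}',
]
print(end_lr_follows_rule(parse_mllog_values(sample)))
```

With the values quoted in this issue the check reports a mismatch. A result of `None` means one of the two keys was not found in the lines that were scanned.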
