Model trains for only one epoch, then training aborts #35

@GaryGao99

Description

As the title says.
Data: the training data is AISHELL.
Model: the LLM is Qwen2.5 1.5B; the encoder is Paraformer.
Training uses 2 GPUs.

The model can only be trained for one epoch; running the second epoch raises an error, shown below:

[Image: screenshot of the error]

When the output shows:
“2025-06-09 16:34:19 | INFO | mooer.utils.checpoint_io | checpoint_io.py:10 | Rank 1--> saving model ...”
training stalls for a long time, with GPU utilization at 100%; it then raises an error and exits.
