This repository was archived by the owner on Nov 19, 2025. It is now read-only.

use lightning or pytorch-lightning #438

@better629

Describe the bug

I installed pytorch-lightning==2.4.0 and nemo_toolkit==2.1.0rc0 alongside NeMo-Aligner 0.5.0.

Running examples/nlp/gpt/train_reward_model.py then fails with this traceback:

[rank0]:   File "/tf/NeMo-Aligner/./examples/nlp/gpt/train_reward_model.py", line 137, in main
[rank0]:     init_using_ptl(trainer, ptl_model, train_dataloader, train_ds)
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/nemo_aligner/utils/train_script_utils.py", line 107, in init_using_ptl
[rank0]:     call._call_configure_model(ptl_trainer)
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 111, in _call_configure_model
[rank0]:     if is_overridden("configure_sharded_model", trainer.lightning_module):
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/utilities/model_helpers.py", line 42, in is_overridden
[rank0]:     raise ValueError("Expected a parent")
[rank0]: ValueError: Expected a parent

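It seems that resolve_and_create_trainer creates a Trainer from the lightning package (lightning.pytorch), while init_using_ptl calls helpers from the pytorch_lightning package. The is_overridden check in pytorch_lightning then does not recognize the module as one of its own classes and raises "Expected a parent".
Has anyone else run into this problem?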
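
The suspected mechanism can be illustrated without either package installed. The following is only a toy sketch: `LightningModuleA`, `LightningModuleB`, and `is_overridden_sketch` are hypothetical stand-ins, not the real Lightning classes or the real `is_overridden` implementation. It shows how a helper that only knows one base class rejects an object built on a look-alike base class from another package.

```python
# Hypothetical stand-ins for two separately packaged base classes.
class LightningModuleA:  # plays the role of pytorch_lightning.LightningModule
    def configure_model(self):
        pass


class LightningModuleB:  # plays the role of lightning.pytorch.LightningModule
    def configure_model(self):
        pass


def is_overridden_sketch(method_name, instance):
    """Simplified check that only knows about LightningModuleA (an assumption)."""
    if not isinstance(instance, LightningModuleA):
        # The instance comes from a different class hierarchy,
        # so no parent class can be determined.
        raise ValueError("Expected a parent")
    parent_method = getattr(LightningModuleA, method_name)
    return getattr(type(instance), method_name) is not parent_method


class MyModel(LightningModuleB):  # built on the "other" package's base class
    def configure_model(self):
        self.ready = True


def check(instance):
    """Return the check result, or the error message if the check fails."""
    try:
        return is_overridden_sketch("configure_model", instance)
    except ValueError as err:
        return str(err)
```

Calling `check(MyModel())` returns the error message rather than a boolean, because `MyModel` inherits from the base class the helper does not know about.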

After I changed the imports in nemo_aligner/utils/train_script_utils.py as shown here, the script runs past this point:

# Original imports in train_script_utils.py:
# from pytorch_lightning.trainer import call
# from pytorch_lightning.trainer.states import TrainerFn
# Replacement imports:
from lightning.pytorch.trainer import call
from lightning.pytorch.trainer.states import TrainerFn
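
An alternative to hard-coding one package might be to pick the helper modules from whichever package the Trainer object actually came from. This is only a sketch under assumptions: `lightning_namespace_for` and `load_trainer_helpers` are hypothetical helper names that do not exist in NeMo-Aligner, and the approach relies on `trainer.call` and `trainer.states` being importable under both package names, as the two import variants in this issue suggest.

```python
import importlib


def lightning_namespace_for(obj):
    """Guess which Lightning package an object's class was defined in.

    Hypothetical helper: inspects the top-level package of the class module.
    """
    root = type(obj).__module__.split(".")[0]
    return "lightning.pytorch" if root == "lightning" else "pytorch_lightning"


def load_trainer_helpers(trainer):
    """Import `call` and `TrainerFn` from the same package as the trainer.

    Requires the corresponding Lightning package to be installed.
    """
    namespace = lightning_namespace_for(trainer)
    call = importlib.import_module(f"{namespace}.trainer.call")
    states = importlib.import_module(f"{namespace}.trainer.states")
    return call, states.TrainerFn
```

With this, `init_using_ptl` could call `load_trainer_helpers(ptl_trainer)` instead of importing from a fixed package, so the helpers and the Trainer always share one class hierarchy.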

Steps/Code to reproduce bug

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo-Aligner install: pip
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version: Ubuntu 22.04
  • PyTorch version: 2.5.1
  • Python version: 3.10

