DPO using step level loses the ordering of steps information

thank you very much for open-sourcing the code. I noticed dpo data preparation method [formalise_dpo_data](https://github.com/lifan-yuan/ImplicitPRM/blob/352f1cd8f9b0e7d4245a2e3fa148da97bd745259/train/openrlhf/cli/train_dpo.py#L17) creates chosen, rejected samples using the step-level labels, although in the paper you show that this doesn't bring any gain in performance.

Moreover, the way that training samples are generated loses the ordering of steps, so I assume the reward model only learns if a step is valid at all for a given problem, and not evaluating if a given steps is taken at a correct stage of reasoning. is that correct?

For example, let's say these are the responses and correctness_lists
responses ["aaa", "bbb", "ccc", "ddd", "eee"]
correctness = [False, False, True, True, False]
then the preference datapoints looks something like this

```python
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "aaa"
        }
    ],
    "id": "0-0",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "bbb"
        }
    ],
    "id": "0-1",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "eee"
        }
    ],
    "id": "0-2",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "aaa"
        }
    ],
    "id": "0-3",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "bbb"
        }
    ],
    "id": "0-4",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "eee"
        }
    ],
    "id": "0-5",
    "dataset": "dataset_name_here"
}
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DPO using step level loses the ordering of steps information #4

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

DPO using step level loses the ordering of steps information #4

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions