Skip to content

DPO using step level loses the ordering of steps information #4

@kevinNejad

Description

@kevinNejad

thank you very much for open-sourcing the code. I noticed dpo data preparation method formalise_dpo_data creates chosen, rejected samples using the step-level labels, although in the paper you show that this doesn't bring any gain in performance.

Moreover, the way that training samples are generated loses the ordering of steps, so I assume the reward model only learns if a step is valid at all for a given problem, and not evaluating if a given steps is taken at a correct stage of reasoning. is that correct?

For example, let's say these are the responses and correctness_lists
responses ["aaa", "bbb", "ccc", "ddd", "eee"]
correctness = [False, False, True, True, False]
then the preference datapoints looks something like this

{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "aaa"
        }
    ],
    "id": "0-0",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "bbb"
        }
    ],
    "id": "0-1",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "eee"
        }
    ],
    "id": "0-2",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "aaa"
        }
    ],
    "id": "0-3",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "bbb"
        }
    ],
    "id": "0-4",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "eee"
        }
    ],
    "id": "0-5",
    "dataset": "dataset_name_here"
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions