bug: AV_MossFormer2 audio shift for detection beyond the beginning of the file #160

@stranger-games

Hi,

AV_MossFormer2_TSE_16K is awesome. I did the following, and it generally works, but the output needs some cleanup.

    myClearVoice = ClearVoice(task='target_speaker_extraction', model_names=['AV_MossFormer2_TSE_16K'])

    # 1st calling method: process an input video and write the output videos to 'separate_audio'
    output_wav = myClearVoice(input_path='input.mp4', online_write=True, output_path='separate_audio')

The issue is that when the model detects a speaker later in the video (not at the first frame), the extracted audio is placed starting at the first frame, which causes a large desync in the video_est_x.mp4 files.

For example, if a speaker is detected at the 00:05 mark, the audio in the corresponding video_est_x.mp4 file is shifted 5 seconds earlier than it should be.
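To make the expected alignment concrete, here is a minimal sketch of the offset arithmetic. It is not ClearVoice code: `pad_leading_silence` is a hypothetical helper, and it assumes the detection start time is known and that the audio is 16 kHz mono (as the model name suggests), held as a plain list of samples.

```python
def pad_leading_silence(samples, start_seconds, sample_rate=16000):
    """Prepend silence so audio detected at start_seconds lines up with the video.

    Hypothetical helper for illustration only; not part of ClearVoice.
    """
    if start_seconds < 0:
        raise ValueError("start_seconds must be non-negative")
    # Number of samples the audio has to be delayed by.
    offset = int(round(start_seconds * sample_rate))
    return [0.0] * offset + list(samples)


# A speaker first detected at 00:05 needs 5 * 16000 = 80000 samples of silence.
extracted = [0.1, -0.2, 0.3]
aligned = pad_leading_silence(extracted, start_seconds=5.0)
print(len(aligned))  # 80003
```

With the audio padded this way, the extracted speech would begin at 00:05 in the output instead of at the first frame.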

Thank you in advance.
