Skip to content

Cap DiffSinger vocoder num_mel_bins at 255#2198

Open
KakaruHayate wants to merge 1 commit into
openutau:masterfrom
KakaruHayate:fix/ds-vocoder-block-non-vocoder-models
Open

Cap DiffSinger vocoder num_mel_bins at 255#2198
KakaruHayate wants to merge 1 commit into
openutau:masterfrom
KakaruHayate:fix/ds-vocoder-block-non-vocoder-models

Conversation

@KakaruHayate

Copy link
Copy Markdown
Contributor

Problem

The DiffSinger vocoder currently loads any onnx model that exposes a mel input plus optional f0, with no bounds checking on its declared configuration. As a result, a malformed voicebank can ship a vocoder slot that points at a non-vocoder onnx model (for example, a DiffSinger acoustic model masquerading as a vocoder, or some other graph that requires an extra-large mel tensor to drive its internal logic).

Real DiffSinger vocoders in the wild use 80 or 128 mel bins.

Fix

Add a hard-coded upper bound DsVocoderConfig.MaxMelBins = 255 and reject both the vocoder and the acoustic-model dsconfig when num_mel_bins falls outside [1, 255].

The check fires in two places:

  • DsVocoder constructor: rejects the voicebank at load time with the same MessageCustomizableException other vocoder load failures use, so the entry never reaches the onnx session.
  • DiffSingerRenderer.InvokeDiffsinger: redundant runtime check on both the vocoder and the acoustic model cfg, alongside the existing mel_base / mel_scale validations, as defense in depth.

The threshold is intentionally hard-coded rather than a yaml field: lowering it is a deliberate code change, not something a voicebank author can override.

The DiffSinger vocoder currently loads any onnx model that exposes a mel
input plus optional f0, with no bounds checking on its declared
configuration. As a result, a malformed voicebank can ship a vocoder
slot that points at a non-vocoder onnx model (e.g. a DiffSinger
acoustic model masquerading as a vocoder, or some other graph that
requires an extra-large mel tensor to drive its internal logic).

Real DiffSinger vocoders in the wild use 80 or 128 mel bins. Add a
hard-coded upper bound DsVocoderConfig.MaxMelBins = 255 and reject
both the vocoder and the acoustic-model dsconfig when num_mel_bins
falls outside [1, 255]. The check fires in two places:

- DsVocoder constructor: rejects the voicebank at load time with the
  same MessageCustomizableException other vocoder load failures use,
  so the entry never reaches the onnx session.
- DiffSingerRenderer.InvokeDiffsinger: redundant runtime check on
  both the vocoder and the acoustic model cfg, alongside the existing
  mel_base / mel_scale validations, as defense in depth.

The threshold is intentionally hard-coded rather than a yaml field:
lowering it is a deliberate code change, not something a voicebank
author can override.

@HuanLinOTO HuanLinOTO left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: LGTM with nits

整体评价:功能目标清晰,安全加固方向正确,双层检查设计合理,阈值选择恰当(255)。同意合并,但提几个小建议:

建议修复

1. 错误消息措辞不统一

  • DsVocoder 构造器:got {config.num_mel_bins}
  • InvokeDiffsingerbut got {vocoder.num_mel_bins}
    建议统一为 but got,方便 grep。

2. 边界条件可改为 <= 0
当前用 < 1,逻辑等价但语义可以更清晰。num_mel_bins 是 int 且默认 128,正常情况下不会为 0,<= 0 更直观:

if (config.num_mel_bins <= 0 || config.num_mel_bins > MaxMelBins)

建议补充

3. 建议补充单元测试
此类安全边界检查非常适合单元测试。建议覆盖:

  • num_mel_bins = 0 或负数 → 应拒绝
  • num_mel_bins = 256 → 应拒绝
  • num_mel_bins = 1255 → 应接受(边界合法值)

Commit: 0ed2251
Files: DiffSingerVocoder.cs / DiffSingerRenderer.cs (+27 lines)

内容由 AI 生成,请仔细甄别

@yqzhishen

Copy link
Copy Markdown
Collaborator

This check also prevents vulnerabilities like extremely high mel bins (the only configurable dynamic dimension) that will explode the user's disk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants