Cap DiffSinger vocoder num_mel_bins at 255#2198
Open
KakaruHayate wants to merge 1 commit into
Open
Conversation
The DiffSinger vocoder currently loads any onnx model that exposes a mel input plus optional f0, with no bounds checking on its declared configuration. As a result, a malformed voicebank can ship a vocoder slot that points at a non-vocoder onnx model (e.g. a DiffSinger acoustic model masquerading as a vocoder, or some other graph that requires an extra-large mel tensor to drive its internal logic). Real DiffSinger vocoders in the wild use 80 or 128 mel bins. Add a hard-coded upper bound DsVocoderConfig.MaxMelBins = 255 and reject both the vocoder and the acoustic-model dsconfig when num_mel_bins falls outside [1, 255]. The check fires in two places: - DsVocoder constructor: rejects the voicebank at load time with the same MessageCustomizableException other vocoder load failures use, so the entry never reaches the onnx session. - DiffSingerRenderer.InvokeDiffsinger: redundant runtime check on both the vocoder and the acoustic model cfg, alongside the existing mel_base / mel_scale validations, as defense in depth. The threshold is intentionally hard-coded rather than a yaml field: lowering it is a deliberate code change, not something a voicebank author can override.
HuanLinOTO
reviewed
Jun 15, 2026
There was a problem hiding this comment.
Review: LGTM with nits
整体评价:功能目标清晰,安全加固方向正确,双层检查设计合理,阈值选择恰当(255)。同意合并,但提几个小建议:
建议修复
1. 错误消息措辞不统一
DsVocoder构造器:got {config.num_mel_bins}InvokeDiffsinger:but got {vocoder.num_mel_bins}
建议统一为but got,方便 grep。
2. 边界条件可改为 <= 0
当前用 < 1,逻辑等价但语义可以更清晰。num_mel_bins 是 int 且默认 128,正常情况下不会为 0,<= 0 更直观:
if (config.num_mel_bins <= 0 || config.num_mel_bins > MaxMelBins)建议补充
3. 建议补充单元测试
此类安全边界检查非常适合单元测试。建议覆盖:
num_mel_bins = 0或负数 → 应拒绝num_mel_bins = 256→ 应拒绝num_mel_bins = 1和255→ 应接受(边界合法值)
Commit: 0ed2251
Files: DiffSingerVocoder.cs / DiffSingerRenderer.cs (+27 lines)
内容由 AI 生成,请仔细甄别
yqzhishen
approved these changes
Jun 15, 2026
Collaborator
|
This check also prevents vulnerabilities like extremely high mel bins (the only configurable dynamic dimension) that will explode the user's disk. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The DiffSinger vocoder currently loads any onnx model that exposes a
melinput plus optionalf0, with no bounds checking on its declared configuration. As a result, a malformed voicebank can ship a vocoder slot that points at a non-vocoder onnx model (for example, a DiffSinger acoustic model masquerading as a vocoder, or some other graph that requires an extra-large mel tensor to drive its internal logic).Real DiffSinger vocoders in the wild use 80 or 128 mel bins.
Fix
Add a hard-coded upper bound
DsVocoderConfig.MaxMelBins = 255and reject both the vocoder and the acoustic-model dsconfig whennum_mel_binsfalls outside[1, 255].The check fires in two places:
DsVocoderconstructor: rejects the voicebank at load time with the sameMessageCustomizableExceptionother vocoder load failures use, so the entry never reaches the onnx session.DiffSingerRenderer.InvokeDiffsinger: redundant runtime check on both the vocoder and the acoustic model cfg, alongside the existingmel_base/mel_scalevalidations, as defense in depth.The threshold is intentionally hard-coded rather than a yaml field: lowering it is a deliberate code change, not something a voicebank author can override.