[StyleTTS2] feat: add demo and inference integration basics#804
roedoejet wants to merge 7 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #804      +/-   ##
==========================================
- Coverage   87.55%   84.98%   -2.57%
==========================================
  Files          45       46       +1
  Lines        4033     4217     +184
  Branches      605      632      +27
==========================================
+ Hits         3531     3584      +53
- Misses        365      494     +129
- Partials      137      139       +2
To resolve the conflict, keep |
joanise left a comment
Not really tested yet, but I'm done for today. Generally good but see suggestions below.
    return demo


def create_demo_app_styletts2(
Is there any code at all that is shared between the styletts2 demo and the fs2+hfgl one? As I scan, it seems like it's completely separate code, and in that case I would think that this should be placed in styletts2/cli/demo.py instead of here.
That would also facilitate adding it to the styletts2 CLI as styletts2 demo, actually.
I agree that this would be good, but I might leave it for another PR, since it requires refactoring code that isn't part of this PR's feature. For now, I'll leave this demo code where the other demo code lives, but I'll also open an issue to refactor it in the way you suggest.
@demo_group.command(
    name="text-to-spec",
Hmm, this name isn't intuitive to me: this demo variant does text->spec->wav, while the other does text->wav, so both generate wav from text.
I just discussed it with Sam, and he came up with text-to-spec-to-wav for this one vs text-to-wav for the other. It's longer, but with autocomplete you don't have to type it all. Sam also found that we can create aliases if we don't want to always use the long names.
E.g.,
@demo_group.command(name="text-to-spec-to-wav", short_help="...")
@demo_group.command(name="t2s2w", hidden=True)
def demo...()
would define an alias command everyvoice demo t2s2w if we want a shorter version available. The alias could even use the model short names: maybe everyvoice demo fs2-hfgl vs everyvoice demo styletts2 could be our aliases for the two variants?
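For illustration, the alias idea above can be sketched as a small standalone script. This is only a sketch under assumptions: it assumes demo_group is a typer.Typer() instance (whose command decorator returns the function unchanged, which is what allows the decorators to be stacked), and the function names, help strings, and the checkpoint argument are hypothetical, not taken from the PR.

```python
import typer
from typer.testing import CliRunner

# Assumption: demo_group is a Typer app; the real object in the PR may differ.
demo_group = typer.Typer()


# The long name is shown in --help; the short alias is registered but hidden.
@demo_group.command(name="text-to-spec-to-wav", short_help="Demo: text -> spec -> wav")
@demo_group.command(name="t2s2w", hidden=True)
def demo_text_to_spec_to_wav(checkpoint: str) -> None:
    """Hypothetical stand-in for the FS2 + vocoder demo."""
    typer.echo(f"text->spec->wav demo for {checkpoint}")


@demo_group.command(name="text-to-wav", short_help="Demo: text -> wav")
def demo_text_to_wav(checkpoint: str) -> None:
    """Hypothetical stand-in for the StyleTTS2 demo."""
    typer.echo(f"text->wav demo for {checkpoint}")


# Both the long name and the alias reach the same function.
runner = CliRunner()
long_result = runner.invoke(demo_group, ["text-to-spec-to-wav", "model.ckpt"])
alias_result = runner.invoke(demo_group, ["t2s2w", "model.ckpt"])
```

Both invocations call the same callback, so the alias costs one extra decorator line and stays out of the help listing.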
Hmm, yes, I totally see what you mean. I do like this better than what I have, but I'm also thinking that maybe we can sidestep this whole issue by reading the metadata in the checkpoint, since EV declares which type of model it is. That way, we could just run everyvoice demo path/to/ckpt and let the command figure out which of these commands to run. That would be cleaner from a UX point of view, because there isn't really any reason the user needs to know exactly what type of model it is when they want to demo it. If you agree with this, I can implement and push.
Absolutely, that would be great! We already have all the logic needed for inspect, so why not use it here?
Obviously, it should be an error to provide a vocoder together with a StyleTTS2 model, and a different error to provide an FS2 model without a vocoder, but that's pretty simple logic.
The other advantage of your suggestion: current demo startup commands keep working unchanged.
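The validation rules described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the metadata key "model_type", the labels "styletts2" and "fs2", and the function name select_demo_variant are all hypothetical, and the real checkpoint metadata may be structured differently.

```python
from typing import Mapping, Optional

# Hypothetical labels; the real checkpoint metadata may use other names.
STYLETTS2 = "styletts2"
FS2 = "fs2"


def select_demo_variant(
    metadata: Mapping[str, str], vocoder_path: Optional[str] = None
) -> str:
    """Pick a demo variant from checkpoint metadata and validate the vocoder."""
    model_type = metadata.get("model_type")  # hypothetical key
    if model_type == STYLETTS2:
        # StyleTTS2 goes text -> wav, so a vocoder makes no sense here.
        if vocoder_path is not None:
            raise ValueError("StyleTTS2 produces audio directly; do not pass a vocoder.")
        return "text-to-wav"
    if model_type == FS2:
        # FS2 goes text -> spec, so a vocoder is needed to reach wav.
        if vocoder_path is None:
            raise ValueError("FS2 produces spectrograms; a vocoder checkpoint is required.")
        return "text-to-spec-to-wav"
    raise ValueError(f"Unsupported model type: {model_type!r}")
```

With a helper like this, everyvoice demo path/to/ckpt would only need to load the metadata and branch on the returned variant, and the two misuse cases get distinct error messages.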
OK - I'll give this a go
a5b2a20 to e8c84a4
1ac2571 to cca19c5
PR Goal?
I previously connected the everyvoice train functionality with StyleTTS2. This PR integrates StyleTTS2 with the everyvoice demo, everyvoice synthesize, and everyvoice checkpoint inspect commands.
Fixes?
Part of #686
Feedback sought?
Testing, but also sanity. I think there are actually quite a few places where we tied ourselves a bit too closely to FS2 and its architecture. I think I need some space to be able to tell how to refactor, but any insight is helpful.
I'm mostly looking for high-level analysis about whether the approach to combine repos in this way is reasonable.
Priority?
high
Tests added?
none so far
How to test?
Try running everyvoice synthesize, everyvoice demo, and everyvoice checkpoint inspect. Note that I don't think this will work on the model you just trained. I'm not adding backwards compatibility support for that, although the hooks are in place for us to handle this in the future in the same way as FS2.
Confidence?
medium
Version change?
n/a, already bumped to 0.5.0
Related PRs?
EveryVoiceTTS/FastSpeech2_lightning#142
EveryVoiceTTS/StyleTTS2#4