A Variational Autoencoder (VAE) for audio mel spectrograms, featuring a causal Convformer encoder and a DiT-based Conditional Flow Matching (CFM) decoder.
- modules/: Core model components.
  - VAE.py: Main model wrapper.
  - encoder/: Causal Convformer architecture.
  - decoder/: DiT-based CFM decoder.
  - configs.py: Dataclass-based configurations.
- configs/: Hydra configuration system.
  - defaults/: Base configurations for encoder, decoder, and training.
  - settings/: Experiment-specific overrides.
- data/: Dataset loaders (LibriTTS, MLS, etc.).
- train.py: Main training script using Hugging Face Trainer and Hydra.
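As a rough illustration of what "dataclass-based configurations" can look like, the sketch below nests encoder and decoder settings inside a top-level model config. Every class name, field name, and default value here is hypothetical and chosen only for the example; the actual definitions live in modules/configs.py and may differ.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class EncoderConfig:
    # Hypothetical fields, not taken from the repository.
    hidden_dim: int = 512
    num_layers: int = 8
    causal: bool = True


@dataclass
class DecoderConfig:
    # Hypothetical fields, not taken from the repository.
    hidden_dim: int = 512
    num_layers: int = 12


@dataclass
class VAEConfig:
    # Hypothetical top-level config that groups the sub-configs.
    n_mels: int = 80
    latent_dim: int = 64
    # default_factory gives each VAEConfig its own sub-config instances.
    encoder: EncoderConfig = field(default_factory=EncoderConfig)
    decoder: DecoderConfig = field(default_factory=DecoderConfig)


config = VAEConfig(latent_dim=32)
print(asdict(config))  # nested dict view of the whole configuration
```

Nesting dataclasses this way keeps each component's settings typed and grouped, and `asdict` produces a plain nested dictionary that is easy to log or serialize.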
pip install -r requirements.txt

Note: For optimal performance, Flash Attention is recommended.
The project uses Hydra for configuration management.
python train.py experiment=your_experiment_name

accelerate launch train.py experiment=your_experiment_name

accelerate launch --config_file configs/deepspeed/ds_config.yaml train.py experiment=your_experiment_name

Overrides can be applied directly via CLI:
python train.py training.learning_rate=1e-4 training.per_device_train_batch_size=8
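
To make the dotted override syntax concrete, here is a minimal standard-library sketch of how `key.subkey=value` arguments map onto a nested configuration. It is not Hydra's implementation and it ignores Hydra's full override grammar. The `apply_overrides` helper and the base values are assumptions made for illustration only.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            # Interpret numbers, booleans, etc.; "1e-4" becomes a float.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything that is not a Python literal stays a plain string.
            value = raw_value
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nested dict, creating levels that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Hypothetical base values standing in for the training defaults.
base = {"training": {"learning_rate": 3e-4, "per_device_train_batch_size": 4}}
merged = apply_overrides(
    base,
    ["training.learning_rate=1e-4", "training.per_device_train_batch_size=8"],
)
print(merged)
```

The base dictionary is left unchanged, which mirrors the idea that command-line overrides adjust a single run without editing the configuration files.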