This project explores the generation of emotion-aware images from Hindi speech prompts using a novel pipeline. The pipeline integrates speech recognition, emotion detection, and text-to-image synthesis to generate contextually relevant images that reflect the emotional tone of the input speech.
Overview of the project pipeline and architecture.
Demo.mp4
Existing text-to-image models are predominantly trained on English datasets and often lack the ability to incorporate emotional context, which limits their effectiveness for low-resource languages like Hindi and for tasks requiring emotional expressiveness. This project addresses the following questions:
- How can speech-to-image generation be enabled for multilingual inputs, specifically Hindi, using state-of-the-art models?
- How can generated images be made emotionally aware by integrating speech emotion recognition into the generation process?
The proposed pipeline consists of the following stages:
- Speech Input: The user provides Hindi speech as input.
- Fine-Tuned Whisper: OpenAI's Whisper model is fine-tuned to transcribe the Hindi audio and translate it into English text.
- Emotion Detection: In parallel, a separate speech emotion recognition module analyzes the Hindi speech to determine its emotional tone. The detected emotion is concatenated with the translated text.
- CLIP-Guided Diffusion Model: A CLIP-guided diffusion model generates an image based on the translated text and the associated emotion.
Workflow:
- The Hindi speech input is analyzed for its emotional tone.
- The detected emotion is combined with the translated text output from the fine-tuned Whisper model.
- The CLIP-guided diffusion model generates emotion-aware and contextually relevant images.
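
The step that combines the detected emotion with the translated text can be sketched as a small helper. This is a minimal illustration only: the function name `build_prompt` and the prompt template (`"<text>, <emotion> mood"`) are assumptions made for this example, not the project's actual implementation.

```python
# Hypothetical helper: the name and the prompt template are illustrative
# assumptions, not taken from the project code.
def build_prompt(translated_text: str, emotion: str) -> str:
    """Concatenate the English translation with the detected emotion label."""
    text = translated_text.strip().rstrip(".")
    emotion = emotion.strip().lower()
    if not emotion:
        # No emotion detected: fall back to the plain translation.
        return text
    return f"{text}, {emotion} mood"


prompt = build_prompt("A child playing in the rain.", "Happy")
print(prompt)
```

The resulting string is what would be passed to the image generator as its text prompt.
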
1. Whisper Model Architecture:
- Raw audio inputs are converted into a log-Mel spectrogram using a feature extractor.
- A Transformer encoder encodes the spectrogram into a sequence of hidden states.
- A decoder autoregressively predicts text tokens.
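
To make the tensor sizes in these steps concrete, the sketch below works through the arithmetic for one audio chunk. The constants are assumptions based on the standard Whisper configuration (16 kHz audio, 30-second chunks, a 10 ms hop, 80 Mel bins, and a stride-2 convolution in front of the encoder); some checkpoints use a different number of Mel bins, so treat the values as illustrative.

```python
# Assumed standard Whisper preprocessing constants (not taken from this project).
SAMPLE_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30     # audio is padded or trimmed to this length
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
N_MELS = 80            # Mel frequency bins
CONV_STRIDE = 2        # downsampling applied before the Transformer encoder


def whisper_shapes() -> dict:
    """Return the sizes of the intermediate representations for one chunk."""
    n_samples = SAMPLE_RATE * CHUNK_SECONDS
    n_frames = n_samples // HOP_LENGTH
    n_positions = n_frames // CONV_STRIDE
    return {
        "samples": n_samples,
        "spectrogram": (N_MELS, n_frames),
        "encoder_positions": n_positions,
    }


shapes = whisper_shapes()
print(shapes)
```
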
2. CLIP Shared Embedding Space:
- CLIP maps text and images into a shared embedding space using jointly trained encoders, one per modality.
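
In a shared embedding space, a caption and a matching image should point in similar directions, which is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional toy vectors in place of real CLIP encoder outputs, so the numbers are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for encoder outputs (assumed values, not real embeddings).
text_emb = [0.9, 0.1, 0.0]
matching_image_emb = [0.8, 0.2, 0.1]
unrelated_image_emb = [0.0, 0.1, 0.9]

print(cosine_similarity(text_emb, matching_image_emb))
print(cosine_similarity(text_emb, unrelated_image_emb))
```

The matching pair scores close to 1 and the unrelated pair close to 0, which is the property that guidance and CLIP-based evaluation rely on.
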
3. Diffusion Model Architecture:
- CLIP text embeddings condition the diffusion model through its cross-attention layers during training.
- This conditioning steers generation toward images that match the prompt semantically, including its emotional context.
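
For reference, text-conditioned latent diffusion models are usually trained with a noise-prediction objective, which is a mean squared error. The README does not state the loss used here, so the formula below is an assumption based on that standard formulation. Here $z_t$ is the noised latent at timestep $t$, $c$ is the prompt (translated text plus emotion), $\tau$ is the CLIP text encoder, and $\epsilon_\theta$ is the denoising network.

```latex
\mathcal{L} = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta\big(z_t,\, t,\, \tau(c)\big) \right\rVert_2^2 \right]
```
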
- Whisper was fine-tuned on Hindi data from Common Voice.
- An emotion classifier was trained using the RAVDESS dataset.
- Images were generated using a CLIP-guided Stable Diffusion model fine-tuned on Flickr8K, using translated text and inferred emotions.
Fine-Tuning Hyperparameters (Whisper):
| Parameter | Value | Description |
|---|---|---|
| learning_rate | 1e-5 | Fine-tuning learning rate |
| max_steps | 4000 | Total training steps |
| gradient_accumulation_steps | 4 | Gradients are accumulated over 4 batches before each weight update |
| per_device_train_batch_size | 4 | Training batch size per device |
| Evaluation metric | WER | Word Error Rate |
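
The interaction between batch size and gradient accumulation is easiest to see with a little arithmetic. The sketch below copies the values from the table into a plain dictionary. The dictionary layout, the helper names, and the `num_devices` parameter are illustrative assumptions, and it assumes each training step corresponds to one optimizer update.

```python
# Values copied from the hyperparameter table; the dict layout is illustrative.
whisper_finetune_config = {
    "learning_rate": 1e-5,
    "max_steps": 4000,
    "gradient_accumulation_steps": 4,
    "per_device_train_batch_size": 4,
}


def effective_batch_size(config: dict, num_devices: int = 1) -> int:
    """Number of examples contributing to each weight update."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


def examples_seen(config: dict, num_devices: int = 1) -> int:
    """Total examples processed, assuming one optimizer update per step."""
    return effective_batch_size(config, num_devices) * config["max_steps"]


print(effective_batch_size(whisper_finetune_config))
print(examples_seen(whisper_finetune_config))
```

On a single device, this gives an effective batch size of 16 while only four examples need to fit in GPU memory at a time.
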
Results:
| Component | Metric | Result |
|---|---|---|
| Fine-Tuned Whisper | WER | 32.0% |
| Emotion Classifier | F1-Score | Pending |
| Diffusion Model | CLIP Score, MSE | 0.023 MSE (recorded before training hit a CUDA out-of-memory error); CLIP score not reported |
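
Word Error Rate, the metric reported for Whisper, is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The following is a minimal self-contained sketch of that definition. It is not the evaluation code used in this project, and it applies no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 32.0% therefore means roughly one word-level error for every three reference words.
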
- Multimodal Integration: The project successfully integrates speech, text, emotion, and image modalities.
- Fine-Tuning Large Models: Fine-tuning state-of-the-art models like Whisper and Stable Diffusion was a significant learning experience.
- Compute Limitations: Access to high-end GPUs and cloud computing resources is crucial for such projects.
- Transfer Learning: Leveraging pre-trained models reduces training costs and enables experimentation in constrained environments.
- Resource Management & Optimization: Insights were gained into the impact of GPU memory, batch sizes, and learning rates on training.
- Complete the training and evaluation of the Stable Diffusion model.
- For emotion detection, either integrate Hume.AI or train a BiLSTM or Transformer model on the RAVDESS dataset.
- Build a full-stack application that integrates the whole pipeline, and deploy it so a wider audience can try it out.