Emotion-Aware Multilingual Audio-to-Image Generation

Overview

This project explores the generation of emotion-aware images from Hindi speech prompts using a novel pipeline. The pipeline integrates speech recognition, emotion detection, and text-to-image synthesis to generate contextually relevant images that reflect the emotional tone of the input speech.

Project Poster

Overview of the project pipeline and architecture.

Demo Video

Demo.mp4

Demo Drive Link

Problem Statement

Existing text-to-image models are predominantly trained on English datasets and often lack the ability to incorporate emotional context, which limits their effectiveness for low-resource languages like Hindi and for tasks requiring emotional expressiveness. This project addresses the following questions:

How can speech-to-image generation be enabled for multilingual inputs, specifically Hindi, using state-of-the-art models?
How can generated images be made emotionally aware by integrating speech emotion recognition into the generation process?

Model Pipeline

The proposed pipeline consists of the following stages:

Speech Input: The user provides Hindi speech as input.
Fine-Tuned Whisper: OpenAI's Whisper model is fine-tuned to transcribe the Hindi audio and translate it into English text.
Emotion Detection: In parallel, a separate speech emotion recognition module analyzes the Hindi speech to determine its emotional tone. The detected emotion is concatenated with the translated text.
CLIP-Guided Diffusion Model: A CLIP-guided diffusion model generates an image from the translated text and the associated emotion.

Workflow:

The Hindi speech input is analyzed for its emotional tone.
The detected emotion is combined with the translated text produced by the fine-tuned Whisper model.
The CLIP-guided diffusion model generates emotion-aware and contextually relevant images.
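
The step that combines the detected emotion with the translated text can be sketched as below. This is only an illustration: the function name `build_prompt`, the `"<text>, <emotion> mood"` template, and the use of the eight RAVDESS emotion labels as the allowed set are assumptions, not the format used in the project notebooks.

```python
# Hypothetical label set: the eight emotion classes of the RAVDESS speech data.
EMOTIONS = {"neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"}


def build_prompt(translated_text: str, emotion: str) -> str:
    """Concatenate the translated English text with the detected emotion.

    The template "<text>, <emotion> mood" is an assumption made for this sketch.
    """
    label = emotion.strip().lower()
    if label not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {emotion!r}")
    # Drop a trailing period so the emotion reads as part of the same phrase.
    text = translated_text.strip().rstrip(".")
    return f"{text}, {label} mood"


prompt = build_prompt("A child playing in the rain.", "Happy")
print(prompt)
```

The resulting string is what would be passed to the image generator as its text prompt.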

Architectural Diagrams

1. Whisper Model Architecture:

Raw audio inputs are converted into a log-Mel spectrogram using a feature extractor.
A Transformer encoder encodes the spectrogram into a sequence of hidden states.
A decoder autoregressively predicts text tokens.
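
To make the three steps concrete, the sketch below works out the tensor shapes for one 30-second window using Whisper's published audio settings (16 kHz audio, 10 ms hop, 80 mel bins) and shows a toy greedy decoding loop. The `next_token` callable is a hypothetical stand-in for the decoder; no model is loaded here.

```python
SAMPLE_RATE = 16_000   # Hz; Whisper expects 16 kHz audio
CHUNK_SECONDS = 30     # audio is padded or trimmed to 30-second windows
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
N_MELS = 80            # mel bins used by the original Whisper checkpoints


def log_mel_shape(seconds: int = CHUNK_SECONDS) -> tuple:
    """Shape (mel bins, frames) of the log-Mel spectrogram for one window."""
    n_frames = seconds * SAMPLE_RATE // HOP_LENGTH
    return (N_MELS, n_frames)


def encoder_positions(n_frames: int) -> int:
    """The encoder's convolutional stem has stride 2, halving the time axis."""
    return n_frames // 2


def greedy_decode(next_token, start_token, end_token, max_len=10):
    """Autoregressive loop: each new token is predicted from all previous ones."""
    tokens = [start_token]
    while len(tokens) < max_len:
        token = next_token(tokens)
        tokens.append(token)
        if token == end_token:
            break
    return tokens


mel_shape = log_mel_shape()
positions = encoder_positions(mel_shape[1])
# Toy decoder: emits 1, 2, then the end token 99.
decoded = greedy_decode(lambda toks: len(toks) if len(toks) < 3 else 99, 0, 99)
print(mel_shape, positions, decoded)
```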

2. CLIP Shared Embedding Space:

CLIP maps text and images into a shared embedding space using a text encoder and an image encoder that are trained jointly.
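
Because both modalities land in the same space, a caption and an image can be compared directly with cosine similarity. The sketch below illustrates this with made-up 3-dimensional vectors; real CLIP embeddings have hundreds of dimensions and come from the trained encoders, which are not loaded here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_emb, caption_embs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(caption_embs, key=lambda c: cosine_similarity(image_emb, caption_embs[c]))


# Toy embeddings, invented for illustration only.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a happy dog": [1.0, 0.0, 0.1],
    "a sad street": [0.0, 1.0, 0.3],
}
print(best_caption(image_embedding, caption_embeddings))
```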

3. Diffusion Model Architecture:

CLIP embeddings of the prompt condition the diffusion model through its cross-attention layers during training.
This conditioning lets the model generate images that are semantically aligned with the prompt, including its emotional context.
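
For readers unfamiliar with diffusion training, the sketch below shows the standard DDPM-style forward noising step and the mean-squared-error objective on the predicted noise. It is a generic illustration, not the project's training code: the linear schedule values are common defaults chosen for the example, the vectors are toy data, and the denoising network with its cross-attention conditioning is omitted entirely.

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars


def add_noise(x0, noise, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
x0 = [0.5, -0.2, 0.1, 0.8]                 # toy "clean" sample
noise = [rng.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise to be predicted
schedule = alpha_bars()
x_t = add_noise(x0, noise, schedule[500])  # noisy sample at step 500
# A real model would predict `noise` from (x_t, step, text embedding);
# a perfect prediction gives a loss of zero.
print(mse(noise, noise))
```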

Implementation Results

Whisper was fine-tuned on Hindi data from Common Voice.
An emotion classifier was trained using the RAVDESS dataset.
Images were generated using a CLIP-guided Stable Diffusion model fine-tuned on Flickr8K, using translated text and inferred emotions.

Fine-Tuning Hyperparameters (Whisper):

Parameter	Value	Description
learning_rate	1e-5	Fine-tuning learning rate
max_steps	4000	Total training steps
gradient_accumulation_steps	4	Gradients are accumulated over 4 steps before each weight update
per_device_train_batch_size	4	Training batch size per device
Evaluation metric	Word Error Rate (WER)	Metric computed during evaluation
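
The batch size and gradient accumulation settings interact, so it can help to compute the effective batch size explicitly. The sketch below does this with plain Python; the dictionary keys follow the Hugging Face `TrainingArguments` naming convention, and the single-device default and the reading of `max_steps` as optimizer updates are assumptions rather than details confirmed by the project.

```python
# Values from the table above; key names follow the Hugging Face convention.
whisper_finetune_config = {
    "learning_rate": 1e-5,
    "max_steps": 4000,
    "gradient_accumulation_steps": 4,
    "per_device_train_batch_size": 4,
}


def effective_batch_size(config, num_devices=1):
    """Samples contributing to each weight update (num_devices is assumed)."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


def samples_seen(config, num_devices=1):
    """Total training samples processed if max_steps counts weight updates."""
    return effective_batch_size(config, num_devices) * config["max_steps"]


print(effective_batch_size(whisper_finetune_config))
print(samples_seen(whisper_finetune_config))
```

Under these assumptions each update averages gradients over 16 samples, which is how a small per-device batch can fit in limited GPU memory while still training with a larger effective batch.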

Results:

Component	Metric	Result
Fine-Tuned Whisper	WER	32.0%
Emotion Classifier	F1-Score	Pending
Diffusion Model	CLIP Score, MSE	MSE 0.023 (before a CUDA out-of-memory error)
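
For reference, word error rate is the word-level edit distance between the reference and the model output, divided by the number of reference words. The function below is a minimal standalone sketch of that definition using whitespace tokenization; it is not the evaluation code used in the notebooks, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(example_wer, 3))
```

A WER of 32.0% therefore corresponds to roughly 32 word-level errors per 100 reference words.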

Key Takeaways

Multimodal Integration: The project successfully integrates speech, text, emotion, and image modalities.
Fine-Tuning Large Models: Fine-tuning state-of-the-art models like Whisper and Stable Diffusion was a significant learning experience.
Compute Limitations: Access to high-end GPUs and cloud computing resources is crucial for such projects.
Transfer Learning: Leveraging pre-trained models reduces training costs and enables experimentation in constrained environments.
Resource Management & Optimization: Insights were gained into the impact of GPU memory, batch sizes, and learning rates on training.

Future Work

Complete training and evaluation of the Stable Diffusion model.
Either integrate Hume.AI for emotion detection, or develop and train a BiLSTM or Transformer model on the RAVDESS dataset.
Build a full-stack application that integrates the whole pipeline and deploy it so a wider audience can try it.
