ananyd36/EMOGEN

Emotion-Aware Multilingual Audio-to-Image Generation

Overview

This project explores the generation of emotion-aware images from Hindi speech prompts using a novel pipeline. The pipeline integrates speech recognition, emotion detection, and text-to-image synthesis to generate contextually relevant images that reflect the emotional tone of the input speech.

Project Poster

Overview of the project pipeline and architecture.

Demo Video

Demo.mp4

Demo Drive Link

Problem Statement

Existing text-to-image models are predominantly trained on English datasets and often lack the ability to incorporate emotional context, which limits their effectiveness for low-resource languages like Hindi and for tasks requiring emotional expressiveness. This project addresses the following questions:

  • How can speech-to-image generation be enabled for multilingual inputs, specifically Hindi, using state-of-the-art models?
  • How can generated images be made emotionally aware by integrating speech emotion recognition into the generation process?

Model Pipeline

The proposed pipeline consists of the following stages:

  1. Speech Input: The user provides Hindi speech as input.
  2. Fine-Tuned Whisper: OpenAI's Whisper model is fine-tuned to transcribe the Hindi audio and translate it into English text.
  3. Emotion Detection: In parallel, a separate speech emotion recognition module analyzes the same Hindi speech to determine its emotional tone. The detected emotion is concatenated with the translated text.
  4. CLIP-Guided Diffusion Model: A CLIP-guided diffusion model generates an image based on the translated text and the associated emotion.

Workflow:

  • Hindi speech input is analyzed for its emotional tone.
  • The detected emotion is combined with the translated English text produced by the fine-tuned Whisper model.
  • The CLIP-guided diffusion model generates emotion-aware and contextually relevant images.
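
The step that combines the detected emotion with the translated text can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `build_prompt` and the "text, emotion mood" prompt format are assumptions made for the example, since the README only states that the two are concatenated.

```python
def build_prompt(translated_text: str, emotion: str) -> str:
    """Combine the English translation with the detected emotion label.

    Hypothetical helper: the exact prompt format used by the project
    is not documented, so a simple comma-separated suffix is assumed.
    """
    text = translated_text.strip().rstrip(".")
    label = emotion.strip().lower()
    if not label:
        # No emotion detected: fall back to the plain translation.
        return text
    return f"{text}, {label} mood"


prompt = build_prompt("A child playing in the rain.", "Happy")
print(prompt)  # A child playing in the rain, happy mood
```

The resulting string is what would be passed to the text-to-image model as its prompt.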

Architectural Diagrams

1. Whisper Model Architecture:

  • Raw audio inputs are converted into a log-Mel spectrogram using a feature extractor.
  • A Transformer encoder encodes the spectrogram into a sequence of hidden states.
  • A decoder autoregressively predicts text tokens.

2. CLIP Shared Embedding Space:

  • CLIP maps text and images into a shared embedding space using a dedicated trained encoder for each modality, so that matching text and images lie close together.
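
To illustrate what a shared embedding space makes possible, the sketch below compares one text embedding against two image embeddings using cosine similarity, the measure that underlies CLIP-style scoring. The three-dimensional vectors are made-up toy values, not real CLIP outputs, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values, for illustration only).
text_embedding = [0.2, 0.9, 0.1]
image_embeddings = {
    "match": [0.25, 0.8, 0.05],
    "mismatch": [0.9, -0.1, 0.3],
}

scores = {
    name: cosine_similarity(text_embedding, emb)
    for name, emb in image_embeddings.items()
}
best = max(scores, key=scores.get)
print(best)  # match
```

In a real setup the vectors would come from the CLIP text and image encoders, and would have several hundred dimensions.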

3. Diffusion Model Architecture:

  • CLIP text embeddings condition the diffusion model through its cross-attention layers during training.
  • This conditioning steers generation toward images that are semantically aligned with the prompt and reflect its emotional context.
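
The cross-attention mechanism can be sketched in a few lines. The example below is a simplified, assumption-laden illustration: it omits the learned query, key, and value projections that a real diffusion model applies, uses tiny hand-written vectors, and the names `cross_attention`, `latents`, and `text_tokens` are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Queries stand in for image latent positions; keys and values stand in
    for text token embeddings. Learned projections are omitted.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(wj * v[d] for wj, v in zip(w, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


latents = [[1.0, 0.0], [0.0, 1.0]]                   # toy image latents
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy text embeddings
outputs, weights = cross_attention(latents, text_tokens, text_tokens)
```

Each row of `weights` sums to 1, so every output vector is a weighted mix of the text token embeddings. This is how the text prompt influences each image position.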

Implementation Results

  • Whisper was fine-tuned on Hindi data from Common Voice.
  • An emotion classifier was trained using the RAVDESS dataset.
  • Images were generated using a CLIP-guided Stable Diffusion model fine-tuned on Flickr8K, using translated text and inferred emotions.

Fine-Tuning Hyperparameters (Whisper):

| Parameter | Value | Description |
| --- | --- | --- |
| Learning_rate | 1e-5 | fine-tuning learning rate |
| max_steps | 4000 | total training steps |
| gradient_accumulation_step | 4 | gradients are accumulated over 4 steps before each update |
| per_device_train_batch_size | 4 | training batch size per device |
| Evaluation_strategy | Word Error Rate | metric used for evaluation |
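
As a quick sanity check on these settings, the effective batch size can be worked out as below. Two things are assumptions rather than facts from this README: that training ran on a single device (`num_devices = 1`), and that `max_steps` counts optimizer updates rather than individual forward passes, as is the convention in common training frameworks.

```python
# Values taken from the hyperparameter table above.
per_device_batch_size = 4
accumulation_steps = 4
max_steps = 4000

# Assumption: single-GPU training (not stated in the README).
num_devices = 1

# Samples contributing to each optimizer update.
effective_batch_size = per_device_batch_size * accumulation_steps * num_devices

# Assumption: max_steps counts optimizer updates.
samples_seen = effective_batch_size * max_steps

print(effective_batch_size)  # 16
print(samples_seen)          # 64000
```

Gradient accumulation gives the optimization behavior of a batch of 16 while only holding 4 samples in GPU memory at a time, which matters under the compute limits noted later.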

Results:

| Component | Metric | Result |
| --- | --- | --- |
| Whisper Fine Tuned | WER | 32.0% |
| Emotion Classifier | F1-Score | [Pending] |
| Diffusion Model | CLIP Score, MSE | 0.023 MSE (before CUDA OOM) |
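
For reference, Word Error Rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal standalone implementation for illustration. It is not the evaluation code used in this project, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the child is playing in the rain",
    "the child playing in rain",
)
print(round(wer, 4))  # 0.2857
```

Here two of the seven reference words are missing from the hypothesis, giving 2/7. A WER of 32.0% means roughly one word-level error for every three reference words.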

Key Takeaways

  • Multimodal Integration: The project successfully integrates speech, text, emotion, and image modalities.
  • Fine-Tuning Large Models: Fine-tuning state-of-the-art models like Whisper and Stable Diffusion was a significant learning experience.
  • Compute Limitations: Access to high-end GPUs and cloud computing resources is crucial for such projects.
  • Transfer Learning: Leveraging pre-trained models reduces training costs and enables experimentation in constrained environments.
  • Resource Management & Optimization: Insights were gained into the impact of GPU memory, batch sizes, and learning rates on training.

Future Work

  • Complete training and evaluation of the Stable Diffusion model.
  • Either integrate Hume.AI for emotion detection, or develop and train a BiLSTM or Transformer model on the RAVDESS dataset.
  • Build a full-stack application that integrates the whole pipeline and deploy it for a wider audience to try out.
