https://drive.google.com/drive/folders/1J0A9taJg-t7_dRZdi8k8j4CDILQ9oZwl?usp=sharing
https://huggingface.co/spaces/Simi2407/VisionScript
🗂️ VisionScript-ImageCaptioning-BLIP/
│
├── 📁 Milestone 1/
│
├── 📁 Milestone 2/
│
├── 📁 Milestone 3/
│
├── 📁 Milestone 4/
│
├── 📄 Group5_Report.pdf
│
└── 📄 README.md
Milestone 1 focuses on building the baseline pipeline for automatic image captioning using BLIP. The system uses CLIP to extract image embeddings and BLIP to generate a single, fluent natural-language caption per image. This milestone establishes the baseline caption generation before fine-tuning in later milestones.
- Flickr30k dataset selection and preprocessing
- Caption filtering and dataset subset creation (1k pairs)
- CLIP embedding extraction (512-dim, L2-normalized)
- GPT-2 tokenizer setup for caption analysis
- BLIP baseline caption generation (unconditional)
- Test runs on 5 sample image-caption pairs
- Repository setup and documentation
Image → CLIP Encoder → 512-dim Embedding
Image → BLIP Processor → BLIP Decoder → Generated Caption
Models Used:
- CLIP ViT-B/32 (Image Encoder)
- BLIP blip-image-captioning-base (Caption Generation)
- GPT-2 Tokenizer (Caption Preprocessing Only)
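For reference, a minimal sketch of this baseline, assuming the Hugging Face `transformers` checkpoints `openai/clip-vit-base-patch32` and `Salesforce/blip-image-captioning-base`; the file name and variable names are illustrative, not the notebook's exact code:

```python
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    BlipProcessor, BlipForConditionalGeneration,
    GPT2Tokenizer,
)

image = Image.open("example.jpg").convert("RGB")  # illustrative file name

# CLIP ViT-B/32: extract a 512-dim image embedding and L2-normalize it
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_inputs = clip_processor(images=image, return_tensors="pt")
    clip_embedding = clip_model.get_image_features(**clip_inputs)       # (1, 512)
    clip_embedding = clip_embedding / clip_embedding.norm(dim=-1, keepdim=True)

# BLIP: unconditional baseline caption generation (no text prompt)
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
with torch.no_grad():
    blip_inputs = blip_processor(images=image, return_tensors="pt")
    output_ids = blip_model.generate(**blip_inputs, max_new_tokens=30)
caption = blip_processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)

# GPT-2 tokenizer, used in this milestone only to analyze caption length
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(len(gpt2_tokenizer.encode(caption)), "GPT-2 tokens")
```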
- Baseline captions generated successfully using BLIP
- Captions are fluent and image-grounded out of the box
- CLIP embeddings extracted and cached (512-dim)
- GPT-2 tokenizer used to analyze caption length distribution
- Fine-tuning on Flickr30k will be implemented in Milestone 2
🗂️ Milestone 1/
│
├── 📄 Milestone_1_Prj2.ipynb
│
├── 📄 Project_Proposal.pdf
│
├── 🖼️ baseline_captions.png
│
└── 📄 README.md
Milestone 2 focuses on integrating CLIP image embeddings directly into the GPT-2 caption generation process. Instead of only using style prompts, this milestone injects visual embeddings into the language model to improve image grounding and caption relevance.
This milestone transitions the project from prompt-based captioning to embedding-conditioned caption generation.
- CLIP embedding extraction for images
- Embedding projection layer implementation
- Injection of CLIP embeddings into GPT-2 input embeddings
- Modified caption generation pipeline
- Comparison with baseline captions from Milestone 1
- Evaluation of caption relevance and grounding
- Documentation and result analysis
Image → CLIP Encoder → Embedding Projection → GPT-2 → Generated Caption
Models Used:
- CLIP ViT-B/32 (Image Encoder)
- GPT-2 Base (Language Model)
- Projection Layer (Embedding Alignment)
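As a rough illustration of the injection step, the sketch below maps the CLIP vector to a single pseudo-token prefix in GPT-2's input-embedding space. The single `nn.Linear` projection, the one-token prefix, the `generate_caption` helper, and the default prompt are assumptions for illustration rather than the notebook's exact design; `clip_embedding` is the normalized (1, 512) vector from the Milestone 1 sketch.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

# Learned alignment layer: CLIP's 512-dim space -> GPT-2's input space (n_embd = 768)
projection = nn.Linear(512, gpt2.config.n_embd)

def generate_caption(clip_embedding, prompt="A photo of"):
    # Project the image embedding into one pseudo-token in GPT-2's input space
    prefix = projection(clip_embedding).unsqueeze(1)              # (1, 1, 768)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = gpt2.get_input_embeddings()(prompt_ids)      # (1, T, 768)
    inputs_embeds = torch.cat([prefix, prompt_embeds], dim=1)    # (1, 1+T, 768)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    # Recent transformers versions accept inputs_embeds in generate()
    output_ids = gpt2.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        max_new_tokens=30,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Prepending the projected vector lets the language model attend to image content at every decoding step without any change to GPT-2's architecture.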
- Captions became more image-relevant
- Reduced generic caption generation
- Visual grounding improved compared to Milestone 1
- Style prompts still influence caption tone
- Embedding injection successfully implemented
- Model ready for fine-tuning and evaluation in Milestone 3
🗂️ Milestone 2/
│
├── 📄 Milestone_2_Pri_2.ipynb → CLIP + GPT-2 caption generation notebook
│
├── 📄 Milestone_2_Group_5.pdf → Milestone report
│
├── 🖼️ demo_Test_1_(1).png → Caption output example
│
├── 🖼️ demo_Test_4.png → Caption output example
│
├── 🖼️ demo_test3.png → Caption output example
│
├── 🖼️ training_loss_v2.png → Training loss plot
│
└── 📄 README.md → Documentation
Milestone 3 focuses on fine-tuning the caption generation pipeline and analyzing how different styles affect caption generation. This milestone evaluates the final system by comparing captions generated in multiple styles and analyzing model performance and caption quality.
This milestone represents the final system evaluation and comparison stage of the project.
- Fine-tuning GPT-2 with image-conditioned captions
- Style-conditioned caption generation experiments
- Parameter tuning and caption comparison
- Caption quality evaluation
- Style comparison analysis
- Result visualization and documentation
- Final pipeline testing
Image → CLIP Encoder → Embedding Injection → Fine-Tuned GPT-2 → Styled Caption
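One image-conditioned fine-tuning step might look like the sketch below. It reuses `gpt2`, `projection`, and `tokenizer` from the Milestone 2 sketch; the `<single>`/`<paragraph>` style tag, the AdamW learning rate, and the label masking are illustrative assumptions, not the project's actual training configuration.

```python
import torch

# Optimize the language model and the projection layer jointly
optimizer = torch.optim.AdamW(
    list(gpt2.parameters()) + list(projection.parameters()), lr=5e-5
)

def training_step(clip_embedding, caption, style="single"):
    prefix = projection(clip_embedding).unsqueeze(1)              # (1, 1, 768)
    text = f"<{style}> {caption}{tokenizer.eos_token}"
    token_ids = tokenizer(text, return_tensors="pt").input_ids    # (1, T)
    token_embeds = gpt2.get_input_embeddings()(token_ids)
    inputs_embeds = torch.cat([prefix, token_embeds], dim=1)      # (1, 1+T, 768)
    # Labels align with the inputs; -100 at the image-prefix position is
    # excluded from the cross-entropy loss
    labels = torch.cat(
        [torch.full((1, 1), -100, dtype=torch.long), token_ids], dim=1
    )
    loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```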
- Fine-tuned model produced more descriptive captions
- Style differences clearly visible in generated captions (single/paragraph)
- Final system successfully generates style-conditioned captions
🗂️ Milestone 3/
│
├── 📁 Results/
│   │
│   ├── 🖼️ Test_1_comparison.png
│   │
│   ├── 🖼️ Test_2_comparison.png
│   │
│   ├── 🖼️ Test_3_comparison.png
│   │
│   └── 🖼️ Test_4_comparison.png
│
├── 📄 Milestone3_Prj2.ipynb
│
└── 📄 README.md
Milestone 4 extends the image captioning system by introducing multi-mode caption generation and visual question answering (VQA) using the BLIP model. This milestone enhances the system's ability to generate both concise and detailed captions, while also enabling interactive question-based understanding of images.
This milestone represents the transition from static caption generation to a more flexible and interactive vision-language system.
- Implementation of BLIP-based caption generation
- Multi-mode captioning (Single + Paragraph)
- Visual Question Answering (VQA) integration
- Caption comparison across modes
- Qualitative evaluation of generated outputs
- Result visualization and documentation
- Final system testing and validation
Image → BLIP Encoder-Decoder → Caption / Answer Generation
Modes Implemented:
- Single Caption (Concise Output)
- Paragraph Caption (Detailed Output)
- Visual Question Answering (Q&A Output)
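A minimal sketch of the three modes, assuming the checkpoints `Salesforce/blip-image-captioning-base` and `Salesforce/blip-vqa-base`. BLIP has no built-in paragraph mode, so the sketch approximates it with a larger generation budget and beam search, which may differ from how the notebook achieves detailed output:

```python
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering,
)

cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def caption(image, mode="single"):
    inputs = cap_processor(images=image, return_tensors="pt")
    if mode == "single":   # concise output
        ids = cap_model.generate(**inputs, max_new_tokens=20)
    else:                  # "paragraph": longer, beam-searched output
        ids = cap_model.generate(**inputs, max_new_tokens=100, num_beams=5)
    return cap_processor.decode(ids[0], skip_special_tokens=True)

def answer(image, question):
    inputs = vqa_processor(images=image, text=question, return_tensors="pt")
    ids = vqa_model.generate(**inputs)
    return vqa_processor.decode(ids[0], skip_special_tokens=True)

img = Image.open("example.jpg").convert("RGB")  # illustrative file name
print(caption(img, "single"))
print(caption(img, "paragraph"))
print(answer(img, "How many people are in the picture?"))
```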
- Paragraph mode generated richer and more descriptive captions
- Single mode produced short and efficient summaries
- VQA successfully answered context-based questions from images
- Model demonstrated strong understanding of objects, scenes, and spatial relationships
- System supports flexible output formats for different use cases
- Final system successfully integrates captioning and reasoning capabilities
🗂️ Milestone 4/
│
├── 📁 Hugging Face/
│   ├── 📄 app.py
│   │
│   └── 📄 requirements.txt
│
├── 📁 Results/
│   ├── 🖼️ VisionScript_paragraph_20260409_160452.png
│   │
│   ├── 🖼️ VisionScript_paragraph_20260410_003347.png
│   │
│   ├── 🖼️ VisionScript_paragraph_20260410_003722.png
│   │
│   ├── 🖼️ VisionScript_single_20260409_160448.png
│   │
│   └── 🖼️ VisionScript_vqa_20260410_003730.png
│
├── 📄 Final_Prj2NNDL.ipynb
│
├── 📄 Final_Prj2.ipynb
│
├── 📊 Group5_Presentation.pptx
│
├── 🎬 Group5_Recording.mp4
│
└── 📄 README.md
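The deployed `app.py` is not reproduced here; a minimal Gradio sketch wrapping the Milestone 4 `caption` and `answer` helpers might look like the following (the `run` dispatcher and the UI layout are assumptions, not the actual Space code):

```python
import gradio as gr

def run(image, mode, question):
    # Dispatch to the captioning or VQA helpers sketched in Milestone 4
    if mode == "vqa":
        return answer(image, question)
    return caption(image, mode)

demo = gr.Interface(
    fn=run,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Radio(["single", "paragraph", "vqa"], value="single", label="Mode"),
        gr.Textbox(label="Question (VQA mode only)"),
    ],
    outputs=gr.Textbox(label="Output"),
    title="VisionScript",
)

demo.launch()
```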