akshaygande/VisualIntelligence

🎬 VELA — Video Engine for Language & Analysis

A production-grade multimodal AI system that understands video by combining visual and audio analysis. Upload a video file or paste a YouTube, Vimeo, or TikTok URL. VELA extracts keyframes, describes them with a vision LLM, transcribes the audio with Whisper, and answers natural-language questions with timestamped citations.


Architecture

Video File or URL (YouTube, Vimeo, TikTok, etc.)
           ↓
  Frame Extraction Agent    ← OpenCV scene detection + fallback sampling
           ↓
  Vision Description Agent  ← GPT-4o vision / LLaVA via Ollama (BYOK)
           ↓
  Transcript Agent          ← OpenAI Whisper (runs locally, free)
           ↓
  Fusion Agent              ← Merges visual + audio into timestamped chunks
           ↓
  Indexing Agent            ← Embeds chunks → Qdrant vector DB
           ↓
  Retrieval + Synthesis     ← Semantic search + reranking + LLM answer
           ↓
  Streamlit UI              ← Q&A with frame previews + timestamps
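
One way to picture the Fusion Agent step is as bucketing frame descriptions and Whisper segments into fixed time windows. The sketch below is illustrative only: the function name `fuse`, the 30-second window, and the tuple layouts are assumptions made for this example, not the contents of fusion_agent.py.

```python
from collections import defaultdict


def fuse(frames, segments, window=30.0):
    """Group frame descriptions and transcript segments into time windows.

    frames:   list of (timestamp_seconds, description)    # assumed layout
    segments: list of (start_seconds, end_seconds, text)  # assumed layout
    """
    buckets = defaultdict(lambda: {"visual": [], "audio": []})
    for ts, description in frames:
        buckets[int(ts // window)]["visual"].append(description)
    for start, _end, text in segments:
        # A segment is filed under the window in which it starts.
        buckets[int(start // window)]["audio"].append(text.strip())

    chunks = []
    for index in sorted(buckets):
        chunks.append({
            "start": index * window,
            "end": (index + 1) * window,
            "visual": " ".join(buckets[index]["visual"]),
            "audio": " ".join(buckets[index]["audio"]),
        })
    return chunks


frames = [(2.0, "Title slide"), (41.5, "Bar chart of results")]
segments = [(0.0, 6.0, " Welcome to the talk."), (38.0, 45.0, " Here are the results.")]
chunks = fuse(frames, segments)
```

Each chunk carries its own start and end time, which is what lets a later answer cite a timestamp.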

Quickstart

Step 1 — Install system dependencies

Mac:

brew install ffmpeg

Ubuntu:

sudo apt install ffmpeg

ffmpeg is required for Whisper audio extraction and yt-dlp video merging. Without it, URL processing will fail.

Step 2 — Install Python dependencies

cd vela
pip install -r requirements.txt
pip install yt-dlp      # for YouTube/URL support

Step 3 — Configure environment

cp .env.example .env
nano .env

Minimum required config (use one of the two blocks below, not both):

# For local-only (free, no API key needed):
LLM_PROVIDER=ollama
VISION_PROVIDER=ollama        # ← use ollama, not openai, unless you have credits
OLLAMA_VISION_MODEL=llava
EMBEDDING_PROVIDER=ollama

# For faster responses (requires paid OpenAI credits):
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key
VISION_PROVIDER=openai
VISION_MODEL=gpt-4o
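
To show how these variables might be read, here is a minimal sketch of a settings loader. It is not the project's config.py: `load_settings` and the early check for `OPENAI_API_KEY` are assumptions made for illustration. The defaults are the ones listed in the Configuration Reference.

```python
import os

# Defaults copied from the Configuration Reference in this README.
DEFAULTS = {
    "LLM_PROVIDER": "ollama",
    "VISION_PROVIDER": "ollama",
    "OLLAMA_VISION_MODEL": "llava",
    "VISION_MODEL": "gpt-4o",
}


def load_settings(env=None):
    """Return provider settings, failing early if an OpenAI key is missing."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    uses_openai = "openai" in (settings["LLM_PROVIDER"], settings["VISION_PROVIDER"])
    if uses_openai and not env.get("OPENAI_API_KEY"):
        # Assumed behaviour: the real config.py may handle this differently.
        raise ValueError("OPENAI_API_KEY must be set when a provider is 'openai'")
    return settings


local = load_settings({})  # empty environment gives the local-only defaults
```

Failing at load time rather than on the first API call makes a missing key visible before any video is processed.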

Step 4 — Pull Ollama models

ollama pull llava              # vision model (describes frames)
ollama pull llama3.2           # text LLM (answers questions)
ollama pull nomic-embed-text   # embeddings

Step 5 — Start Qdrant

Option A — Docker (recommended):

docker compose up -d

Option B — Binary (if Docker isn't running):

curl -L https://github.com/qdrant/qdrant/releases/latest/download/qdrant-aarch64-apple-darwin.tar.gz -o qdrant.tar.gz
tar -xzf qdrant.tar.gz
./qdrant

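Leave this terminal open. Qdrant listens on port 6333. The archive above is the Apple Silicon (aarch64) macOS build; on other platforms, download the matching archive from the Qdrant releases page.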

Step 6 — Start backend

cd backend
uvicorn main:app --reload --port 8000

Step 7 — Start frontend (new terminal)

cd frontend
streamlit run app.py

Open http://localhost:8501


Common Errors & Fixes

ffmpeg is not installed. Aborting due to --abort-on-error

brew install ffmpeg    # Mac
sudo apt install ffmpeg  # Ubuntu

VISION_PROVIDER=openai — Incorrect API key / quota exceeded

Either the OpenAI API key is wrong or the account has no credits. Check OPENAI_API_KEY, or switch to local LLaVA:

# In .env:
VISION_PROVIDER=ollama
OLLAMA_VISION_MODEL=llava

# Pull the model:
ollama pull llava

Connection refused on port 6333 (Qdrant)

Qdrant is not running. Start it:

# Docker:
docker compose up -d

# Or binary:
./qdrant

yt-dlp not installed

pip install yt-dlp

❌ Backend shows ollama/gpt-4o instead of ollama/llava after editing .env

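The backend reads .env at startup and keeps the old values in memory. By default, --reload watches Python source files, so editing .env alone does not trigger a reload. Restart it manually: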

# Kill the uvicorn process (Ctrl+C) then:
uvicorn main:app --reload --port 8000

No such file or directory: ./qdrant

You're in the wrong folder or the binary wasn't extracted. Re-download:

curl -L https://github.com/qdrant/qdrant/releases/latest/download/qdrant-aarch64-apple-darwin.tar.gz -o qdrant.tar.gz
tar -xzf qdrant.tar.gz
./qdrant

❌ URL processing fails with Unknown error

Check the FastAPI backend terminal; the underlying error is usually printed there. Common causes:

  • Qdrant not running
  • Vision API key invalid
  • ffmpeg not installed
  • yt-dlp not installed

❌ Answer is empty after querying

The most likely cause is that the OpenAI key used for the text LLM has no credits. Switch to Ollama:

# In .env:
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.2

FP16 is not supported on CPU; using FP32 instead

This is just a warning, not an error. Whisper runs fine on CPU with FP32. Ignore it.


Startup Checklist

Run through this every time before using VELA:

□ Qdrant running on port 6333
□ Ollama running (ollama serve)
□ llava, llama3.2, nomic-embed-text pulled in Ollama
□ ffmpeg installed (ffmpeg -version)
□ yt-dlp installed (pip show yt-dlp)
□ .env configured (VISION_PROVIDER=ollama for local-only use)
□ Backend running: uvicorn main:app --reload --port 8000
□ Frontend running: streamlit run app.py
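
Several of these items can be probed from a short script. The sketch below is one possible preflight check, not part of the repository: the function names are made up for this example, and port 11434 is Ollama's default port rather than something configured in this README. It does not verify that the three Ollama models have been pulled.

```python
import importlib.util
import shutil
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(host="localhost"):
    """Map each checklist item that can be probed locally to True/False."""
    return {
        "qdrant (port 6333)": port_open(host, 6333),
        "ollama (port 11434)": port_open(host, 11434),  # Ollama's default port
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        "yt-dlp importable": importlib.util.find_spec("yt_dlp") is not None,
        "backend (port 8000)": port_open(host, 8000),
        "frontend (port 8501)": port_open(host, 8501),
    }


def report(results):
    """Render the results as one checkbox line per item."""
    return "\n".join(f"[{'x' if ok else ' '}] {name}" for name, ok in results.items())
```

Calling `print(report(preflight()))` prints one line per item, with an `x` next to each check that passed.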

Start Everything (copy-paste)

# Terminal 1 — Qdrant
./qdrant

# Terminal 2 — Ollama (if not running as a service)
ollama serve

# Terminal 3 — Backend
cd /path/to/vela/backend
uvicorn main:app --reload --port 8000

# Terminal 4 — Frontend
cd /path/to/vela/frontend
streamlit run app.py

Configuration Reference

| Variable             | Default | Options                          | Description                        |
|----------------------|---------|----------------------------------|------------------------------------|
| LLM_PROVIDER         | ollama  | ollama, openai, anthropic        | Text LLM for answering questions   |
| VISION_PROVIDER      | ollama  | ollama, openai                   | Vision LLM for frame description   |
| OLLAMA_VISION_MODEL  | llava   | llava, llava-llama3              | Local vision model                 |
| VISION_MODEL         | gpt-4o  | gpt-4o, gpt-4o-mini              | OpenAI vision model                |
| WHISPER_MODEL        | base    | tiny, base, small, medium, large | Transcription quality vs speed     |
| SCENE_THRESHOLD      | 27.0    | Lower = more frames              | Scene detection sensitivity        |
| MAX_FRAMES_PER_VIDEO | 300     | Any int                          | Frame extraction cap               |
| TOP_K                | 10      | Any int                          | Chunks retrieved before reranking  |
| RERANK_TOP_N         | 4       | Any int                          | Chunks kept after reranking        |
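
TOP_K and RERANK_TOP_N describe a two-stage funnel: a wide vector search followed by a narrower rerank. The sketch below illustrates that pattern in plain Python. It is not the code in retriever.py: the keyword-overlap scorer is a stand-in for whatever reranker the project uses, and the function names are hypothetical.

```python
import math

TOP_K = 10        # default from the Configuration Reference
RERANK_TOP_N = 4  # default from the Configuration Reference


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Stand-in reranker: fraction of query words that appear in the text."""
    words = set(query.lower().split())
    return len(words & set(text.lower().split())) / len(words) if words else 0.0


def retrieve(query, query_vec, chunks, top_k=TOP_K, top_n=RERANK_TOP_N):
    """chunks: list of (text, embedding) pairs. Returns the top_n texts."""
    # Stage 1: vector similarity keeps the top_k candidates.
    candidates = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:top_k]
    # Stage 2: a second scorer reorders the candidates and keeps top_n.
    reranked = sorted(candidates, key=lambda c: keyword_overlap(query, c[0]), reverse=True)
    return [text for text, _ in reranked[:top_n]]


chunks = [
    ("presenter shows the results chart", [0.9, 0.1]),
    ("opening title slide", [0.8, 0.2]),
    ("audience applause", [0.0, 1.0]),
]
top = retrieve("results chart", [1.0, 0.0], chunks, top_k=2, top_n=1)
```

Raising TOP_K widens the candidate pool at the cost of more reranking work. RERANK_TOP_N bounds how much context reaches the answering LLM.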

Example Queries

After processing a video:

  • "What is shown at the 2-minute mark?"
  • "Summarize the key points discussed."
  • "What diagrams or slides are visible?"
  • "At what timestamp does the presenter show the results?"
  • "What is written on the whiteboard?"
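
These questions can also be sent to the backend without the Streamlit UI. The sketch below builds a request for the /query route on port 8000. The JSON field name `question` and both helper functions are assumptions; check backend/main.py for the actual request schema before relying on it.

```python
import json
import urllib.request


def build_query_request(question, base_url="http://localhost:8000"):
    """Build a POST request for the backend's /query route.

    The JSON field name "question" is a guess, not a documented schema.
    """
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(question), timeout=120) as resp:
        return json.load(resp)


request = build_query_request("What is shown at the 2-minute mark?")
```

`ask(...)` needs the backend to be running and a video to have been processed. `build_query_request(...)` alone is enough to inspect what would be sent.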

Project Structure

vela/
├── backend/
│   ├── main.py                  # FastAPI — /process, /process/url, /query
│   ├── config.py                # Central config + BYOK routing
│   ├── agents/
│   │   ├── frame_agent.py       # Scene detection + keyframe extraction
│   │   ├── vision_agent.py      # Vision LLM frame description
│   │   ├── transcript_agent.py  # Whisper transcription
│   │   ├── fusion_agent.py      # Visual + audio fusion
│   │   ├── synthesis_agent.py   # Timestamped answer generation
│   │   └── url_agent.py         # yt-dlp YouTube/URL downloader
│   ├── graph/
│   │   └── orchestrator.py      # LangGraph state graph
│   ├── rag/
│   │   ├── embedder.py
│   │   ├── indexer.py
│   │   └── retriever.py
│   └── utils/
│       └── video_utils.py
├── frontend/
│   └── app.py                   # Streamlit UI
├── frames/                      # Extracted keyframes (auto-created)
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

License

MIT
