A production-grade multimodal AI system that understands video through combined visual and audio analysis. Upload any video or paste a YouTube/Vimeo/TikTok URL — VELA extracts keyframes, describes them with a vision LLM, transcribes audio with Whisper, and answers natural language questions with timestamped citations.
Video File or URL (YouTube, Vimeo, TikTok, etc.)
↓
Frame Extraction Agent ← OpenCV scene detection + fallback sampling
↓
Vision Description Agent ← GPT-4o vision / LLaVA via Ollama (BYOK)
↓
Transcript Agent ← OpenAI Whisper (runs locally, free)
↓
Fusion Agent ← Merges visual + audio into timestamped chunks
↓
Indexing Agent ← Embeds chunks → Qdrant vector DB
↓
Retrieval + Synthesis ← Semantic search + reranking + LLM answer
↓
Streamlit UI ← Q&A with frame previews + timestamps
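The fusion step in the pipeline above can be pictured as grouping frame descriptions and transcript segments into shared time windows. The sketch below is a minimal illustration under stated assumptions, not the code in `fusion_agent.py`: the `fuse` function, the 30-second window, and the input and output shapes are all hypothetical.

```python
from collections import defaultdict


def fuse(frames, segments, window=30.0):
    """Group frame descriptions and transcript segments into time windows.

    frames:   list of (timestamp_seconds, description) tuples
    segments: list of (start_seconds, end_seconds, text) tuples
    Both input shapes and the window size are illustrative assumptions.
    """
    buckets = defaultdict(lambda: {"visual": [], "audio": []})

    for ts, description in frames:
        buckets[int(ts // window)]["visual"].append(description)

    for start, _end, text in segments:
        # A segment is assigned to the window that contains its start time.
        buckets[int(start // window)]["audio"].append(text.strip())

    chunks = []
    for index in sorted(buckets):
        bucket = buckets[index]
        chunks.append({
            "start": index * window,
            "end": (index + 1) * window,
            "visual": " ".join(bucket["visual"]),
            "audio": " ".join(bucket["audio"]),
        })
    return chunks
```

Each resulting chunk carries its own start and end time, which is what would let an answer cite a timestamp after retrieval.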
Mac:
brew install ffmpeg
Ubuntu:
sudo apt install ffmpeg
ffmpeg is required for Whisper audio extraction and yt-dlp video merging. Without it, URL processing will fail.
cd vela
pip install -r requirements.txt
pip install yt-dlp # for YouTube/URL support
cp .env.example .env
nano .env
Minimum required config:
# For local-only (free, no API key needed):
LLM_PROVIDER=ollama
VISION_PROVIDER=ollama # ← use ollama, not openai, unless you have credits
OLLAMA_VISION_MODEL=llava
EMBEDDING_PROVIDER=ollama
# For faster responses (requires paid OpenAI credits):
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key
VISION_PROVIDER=openai
VISION_MODEL=gpt-4o
ollama pull llava # vision model (describes frames)
ollama pull llama3.2 # text LLM (answers questions)
ollama pull nomic-embed-text # embeddings
Option A (Docker, recommended):
docker compose up -d
Option B (binary, if Docker isn't running):
curl -L https://github.com/qdrant/qdrant/releases/latest/download/qdrant-aarch64-apple-darwin.tar.gz -o qdrant.tar.gz
tar -xzf qdrant.tar.gz
./qdrant
Leave this terminal open. Qdrant runs on port 6333.
cd backend
uvicorn main:app --reload --port 8000
cd frontend
streamlit run app.py
Open http://localhost:8501
brew install ffmpeg # Mac
sudo apt install ffmpeg # Ubuntu
Your OpenAI account has no credits. Switch to local LLaVA:
# In .env:
VISION_PROVIDER=ollama
OLLAMA_VISION_MODEL=llava
# Pull the model:
ollama pull llava
Qdrant is not running. Start it:
# Docker:
docker compose up -d
# Or binary:
./qdrant
pip install yt-dlp
The backend cached the old config. Restart it:
# Kill the uvicorn process (Ctrl+C) then:
uvicorn main:app --reload --port 8000
You're in the wrong folder or the binary wasn't extracted. Re-download:
curl -L https://github.com/qdrant/qdrant/releases/latest/download/qdrant-aarch64-apple-darwin.tar.gz -o qdrant.tar.gz
tar -xzf qdrant.tar.gz
./qdrant
Check your FastAPI backend terminal; the real error is always printed there. Common causes:
- Qdrant not running
- Vision API key invalid
- ffmpeg not installed
- yt-dlp not installed
The OpenAI LLM key has no credits. Switch to Ollama:
# In .env:
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.2
This is just a warning, not an error. Whisper runs fine on CPU with FP32. Ignore it.
Run through this every time before using VELA:
□ Qdrant running on port 6333
□ Ollama running (ollama serve)
□ llava, llama3.2, nomic-embed-text pulled in Ollama
□ ffmpeg installed (ffmpeg -version)
□ yt-dlp installed (pip show yt-dlp)
□ .env configured with correct VISION_PROVIDER=ollama
□ Backend running: uvicorn main:app --reload --port 8000
□ Frontend running: streamlit run app.py
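Part of this checklist can be automated. The script below is a rough sketch, not something shipped with VELA: the helper names are hypothetical, it covers only the ffmpeg, yt-dlp, Qdrant, and Ollama items, and it assumes Ollama listens on its default port 11434. A successful port check only shows that something is listening, not that the service is healthy.

```python
import shutil
import socket


def tool_on_path(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight():
    """Collect pass/fail results for a few of the checklist items."""
    return {
        "ffmpeg on PATH": tool_on_path("ffmpeg"),
        "yt-dlp on PATH": tool_on_path("yt-dlp"),
        "Qdrant port 6333 reachable": port_open("localhost", 6333),
        # 11434 is Ollama's default port; adjust if yours differs.
        "Ollama port 11434 reachable": port_open("localhost", 11434),
    }


if __name__ == "__main__":
    for item, ok in preflight().items():
        print("[ok]  " if ok else "[FAIL]", item)
```

Pulled models, the `.env` contents, and the backend and frontend processes still need to be checked by hand.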
# Terminal 1 — Qdrant
./qdrant
# Terminal 2 — Ollama (if not running as a service)
ollama serve
# Terminal 3 — Backend
cd /path/to/vela/backend
uvicorn main:app --reload --port 8000
# Terminal 4 — Frontend
cd /path/to/vela/frontend
streamlit run app.py

| Variable | Default | Options | Description |
|---|---|---|---|
| `LLM_PROVIDER` | `ollama` | `ollama`, `openai`, `anthropic` | Text LLM for answering questions |
| `VISION_PROVIDER` | `ollama` | `ollama`, `openai` | Vision LLM for frame description |
| `OLLAMA_VISION_MODEL` | `llava` | `llava`, `llava-llama3` | Local vision model |
| `VISION_MODEL` | `gpt-4o` | `gpt-4o`, `gpt-4o-mini` | OpenAI vision model |
| `WHISPER_MODEL` | `base` | `tiny`, `base`, `small`, `medium`, `large` | Transcription quality vs speed |
| `SCENE_THRESHOLD` | `27.0` | Lower = more frames | Scene detection sensitivity |
| `MAX_FRAMES_PER_VIDEO` | `300` | Any int | Frame extraction cap |
| `TOP_K` | `10` | Any int | Chunks retrieved before reranking |
| `RERANK_TOP_N` | `4` | Any int | Chunks kept after reranking |
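
To show how these variables and defaults fit together, here is a minimal sketch of reading them from the environment. It is not the contents of `config.py`: the `Settings` class and `load_settings` function are hypothetical, and the check that `RERANK_TOP_N` does not exceed `TOP_K` is an added sanity check, not documented behavior.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    vision_provider: str
    ollama_vision_model: str
    vision_model: str
    whisper_model: str
    scene_threshold: float
    max_frames_per_video: int
    top_k: int
    rerank_top_n: int


def load_settings(env=None):
    """Build Settings from an environment mapping, using the table's defaults."""
    env = os.environ if env is None else env
    settings = Settings(
        llm_provider=env.get("LLM_PROVIDER", "ollama"),
        vision_provider=env.get("VISION_PROVIDER", "ollama"),
        ollama_vision_model=env.get("OLLAMA_VISION_MODEL", "llava"),
        vision_model=env.get("VISION_MODEL", "gpt-4o"),
        whisper_model=env.get("WHISPER_MODEL", "base"),
        scene_threshold=float(env.get("SCENE_THRESHOLD", "27.0")),
        max_frames_per_video=int(env.get("MAX_FRAMES_PER_VIDEO", "300")),
        top_k=int(env.get("TOP_K", "10")),
        rerank_top_n=int(env.get("RERANK_TOP_N", "4")),
    )
    # Assumed sanity check: reranking cannot keep more chunks than were retrieved.
    if settings.rerank_top_n > settings.top_k:
        raise ValueError("RERANK_TOP_N cannot exceed TOP_K")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to inspect, for example `load_settings({"WHISPER_MODEL": "small"})`.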
After processing a video:
- "What is shown at the 2-minute mark?"
- "Summarize the key points discussed."
- "What diagrams or slides are visible?"
- "At what timestamp does the presenter show the results?"
- "What is written on the whiteboard?"
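
Questions like these can also be sent to the backend without the Streamlit UI. The sketch below assumes the `/query` route accepts a POST with a JSON body of the form `{"question": "..."}`; that schema and the helper names are guesses, so check `backend/main.py` for the real request format before relying on it.

```python
import json
import urllib.request


def build_query_request(question, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the backend's /query route.

    The JSON body shape {"question": ...} is an assumption, not a
    documented schema.
    """
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000", timeout=120):
    """Send the question and return the raw response text."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the backend running on port 8000, `print(ask("What is written on the whiteboard?"))` would print whatever the endpoint returns.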
vela/
├── backend/
│ ├── main.py # FastAPI — /process, /process/url, /query
│ ├── config.py # Central config + BYOK routing
│ ├── agents/
│ │ ├── frame_agent.py # Scene detection + keyframe extraction
│ │ ├── vision_agent.py # Vision LLM frame description
│ │ ├── transcript_agent.py # Whisper transcription
│ │ ├── fusion_agent.py # Visual + audio fusion
│ │ ├── synthesis_agent.py # Timestamped answer generation
│ │ └── url_agent.py # yt-dlp YouTube/URL downloader
│ ├── graph/
│ │ └── orchestrator.py # LangGraph state graph
│ ├── rag/
│ │ ├── embedder.py
│ │ ├── indexer.py
│ │ └── retriever.py
│ └── utils/
│ └── video_utils.py
├── frontend/
│ └── app.py # Streamlit UI
├── frames/ # Extracted keyframes (auto-created)
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md
MIT