🎬 VELA — Video Engine for Language & Analysis

A production-grade multimodal AI system that understands video through combined visual and audio analysis. Upload any video or paste a YouTube/Vimeo/TikTok URL — VELA extracts keyframes, describes them with a vision LLM, transcribes audio with Whisper, and answers natural language questions with timestamped citations.

Architecture

Video File or URL (YouTube, Vimeo, TikTok, etc.)
           ↓
  Frame Extraction Agent    ← OpenCV scene detection + fallback sampling
           ↓
  Vision Description Agent  ← GPT-4o vision / LLaVA via Ollama (BYOK)
           ↓
  Transcript Agent          ← OpenAI Whisper (runs locally, free)
           ↓
  Fusion Agent              ← Merges visual + audio into timestamped chunks
           ↓
  Indexing Agent            ← Embeds chunks → Qdrant vector DB
           ↓
  Retrieval + Synthesis     ← Semantic search + reranking + LLM answer
           ↓
  Streamlit UI              ← Q&A with frame previews + timestamps
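
The Fusion Agent step in the diagram above can be pictured roughly as follows. This is a minimal sketch, not the code in fusion_agent.py: the `fuse` function, the tuple layouts, and the 10-second window are assumptions made for illustration. It files frame descriptions and transcript segments into fixed time windows so that each chunk carries both what was seen and what was said.

```python
from collections import defaultdict


def fuse(frames, segments, window=10.0):
    """Group frame descriptions and transcript segments into time windows.

    frames:   list of (timestamp_seconds, description)   (hypothetical layout)
    segments: list of (start_seconds, end_seconds, text) (hypothetical layout)
    Returns a list of chunk dicts sorted by start time.
    """
    buckets = defaultdict(lambda: {"visual": [], "audio": []})
    for ts, description in frames:
        buckets[int(ts // window)]["visual"].append(description)
    for start, _end, text in segments:
        # A segment is filed under the window in which it starts.
        buckets[int(start // window)]["audio"].append(text)

    chunks = []
    for index in sorted(buckets):
        chunks.append({
            "start": index * window,
            "end": (index + 1) * window,
            "visual": " ".join(buckets[index]["visual"]),
            "audio": " ".join(buckets[index]["audio"]),
        })
    return chunks


frames = [(2.0, "Title slide"), (14.5, "Bar chart of results")]
segments = [(0.0, 4.0, "Welcome to the talk."), (12.0, 18.0, "Here are the results.")]
chunks = fuse(frames, segments)
for chunk in chunks:
    print(chunk)
```

Keeping `start` and `end` on every chunk is what would let an answer cite a timestamp later in the pipeline.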

Quickstart

Step 1 — Install system dependencies

Mac:

brew install ffmpeg

Ubuntu:

sudo apt install ffmpeg

ffmpeg is required for Whisper audio extraction and yt-dlp video merging. Without it, transcription and URL processing will fail.

Step 2 — Install Python dependencies

cd vela
pip install -r requirements.txt
pip install yt-dlp      # for YouTube/URL support

Step 3 — Configure environment

cp .env.example .env
nano .env

Minimum required config (pick one of the two blocks below, not both):

# For local-only (free, no API key needed):
LLM_PROVIDER=ollama
VISION_PROVIDER=ollama        # ← use ollama, not openai, unless you have credits
OLLAMA_VISION_MODEL=llava
EMBEDDING_PROVIDER=ollama

# For faster responses (requires paid OpenAI credits):
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key
VISION_PROVIDER=openai
VISION_MODEL=gpt-4o
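
The two blocks above assign `LLM_PROVIDER` and `VISION_PROVIDER` differently, so a .env that contains both is ambiguous. The sketch below flags keys that are assigned more than once. `find_duplicate_keys` is a hypothetical helper, not part of VELA, and its parser is deliberately naive (it treats everything after `#` as a comment).

```python
def find_duplicate_keys(env_text):
    """Return keys that are assigned more than once in .env-style text."""
    seen = set()
    duplicates = []
    for raw_line in env_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments (naive)
        if "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


# Both example blocks pasted into one file, as a worst case.
sample = """\
LLM_PROVIDER=ollama
VISION_PROVIDER=ollama        # use ollama unless you have credits
OLLAMA_VISION_MODEL=llava
EMBEDDING_PROVIDER=ollama
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key
VISION_PROVIDER=openai
VISION_MODEL=gpt-4o
"""
duplicates = find_duplicate_keys(sample)
print(duplicates)
```

To run it against a real file, pass the contents of .env read with `open(".env").read()`.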

Step 4 — Pull Ollama models

ollama pull llava              # vision model (describes frames)
ollama pull llama3.2           # text LLM (answers questions)
ollama pull nomic-embed-text   # embeddings

Step 5 — Start Qdrant

Option A — Docker (recommended):

docker compose up -d

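Option B — Binary (if Docker isn't running; the download below is the Apple Silicon macOS build, so pick the matching asset from the Qdrant releases page on other platforms):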

curl -L https://github.com/qdrant/qdrant/releases/latest/download/qdrant-aarch64-apple-darwin.tar.gz -o qdrant.tar.gz
tar -xzf qdrant.tar.gz
./qdrant

Leave this terminal open. Qdrant runs on port 6333.

Step 6 — Start backend

cd backend
uvicorn main:app --reload --port 8000

Step 7 — Start frontend (new terminal)

cd frontend
streamlit run app.py

Open http://localhost:8501

Common Errors & Fixes

❌ `ffmpeg is not installed. Aborting due to --abort-on-error`

brew install ffmpeg    # Mac
sudo apt install ffmpeg  # Ubuntu

❌ `VISION_PROVIDER=openai` — Incorrect API key / quota exceeded

Either the API key is wrong or your OpenAI account has no credits. Fix the key, or switch to local LLaVA:

# In .env:
VISION_PROVIDER=ollama
OLLAMA_VISION_MODEL=llava

# Pull the model:
ollama pull llava

❌ `Connection refused` on port 6333 (Qdrant)

Qdrant is not running. Start it:

# Docker:
docker compose up -d

# Or binary:
./qdrant

❌ `yt-dlp not installed`

pip install yt-dlp

❌ Backend shows `ollama/gpt-4o` instead of `ollama/llava` after editing .env

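The backend read its config at startup and is still using the old values. By default, --reload only watches Python files, so editing .env does not trigger a restart. Restart it manually: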

# Kill the uvicorn process (Ctrl+C) then:
uvicorn main:app --reload --port 8000

❌ `No such file or directory: ./qdrant`

You're in the wrong folder or the binary wasn't extracted. Change to the folder where you extracted it, or re-download (Apple Silicon macOS build shown):

curl -L https://github.com/qdrant/qdrant/releases/latest/download/qdrant-aarch64-apple-darwin.tar.gz -o qdrant.tar.gz
tar -xzf qdrant.tar.gz
./qdrant

❌ URL processing fails with `Unknown error`

Check your FastAPI backend terminal — the real error is always printed there. Common causes:

Qdrant not running
Vision API key invalid
ffmpeg not installed
yt-dlp not installed

❌ Answer is empty after querying

The most likely cause is that the OpenAI key used for the text LLM has no credits. Switch to Ollama:

# In .env:
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.2

❌ `FP16 is not supported on CPU; using FP32 instead`

This is just a warning, not an error. Whisper runs fine on CPU with FP32. Ignore it.
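
If the warning clutters your logs, it can be filtered by message. The sketch below uses a stand-in function (`transcribe_stub`, hypothetical) that emits the same warning text, so it runs without Whisper installed; in real code the filter would be set before the Whisper call. Passing `fp16=False` to Whisper's `transcribe` should also avoid the warning, though that is not shown here.

```python
import warnings


def transcribe_stub():
    """Stand-in for a Whisper call on CPU; emits the same warning text."""
    warnings.warn("FP16 is not supported on CPU; using FP32 instead")
    return "transcript text"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only this message; other warnings still get through.
    warnings.filterwarnings("ignore", message="FP16 is not supported on CPU")
    text = transcribe_stub()

print(text, len(caught))
```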

Startup Checklist

Run through this every time before using VELA:

□ Qdrant running on port 6333
□ Ollama running (ollama serve)
□ llava, llama3.2, nomic-embed-text pulled in Ollama
□ ffmpeg installed (ffmpeg -version)
□ yt-dlp installed (pip show yt-dlp)
□ .env configured (VISION_PROVIDER=ollama unless you have OpenAI credits)
□ Backend running: uvicorn main:app --reload --port 8000
□ Frontend running: streamlit run app.py
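
Most of the checklist above can be checked from a script. The following is a minimal sketch, not something shipped with VELA: `port_open` and `preflight` are hypothetical helpers, port 11434 is Ollama's default port (this README does not state it), and the yt-dlp check assumes the pip install put a `yt-dlp` executable on your PATH. It does not verify which Ollama models are pulled or what .env contains.

```python
import shutil
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight():
    """Return a dict mapping each check to True (ok) or False (missing)."""
    return {
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        "yt-dlp on PATH": shutil.which("yt-dlp") is not None,
        "Qdrant on port 6333": port_open("localhost", 6333),
        "Ollama on port 11434": port_open("localhost", 11434),  # assumed default
        "Backend on port 8000": port_open("localhost", 8000),
        "Frontend on port 8501": port_open("localhost", 8501),
    }


results = preflight()
for name, ok in results.items():
    print(("OK   " if ok else "FAIL ") + name)
```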

Start Everything (copy-paste)

# Terminal 1 — Qdrant
./qdrant

# Terminal 2 — Ollama (if not running as a service)
ollama serve

# Terminal 3 — Backend
cd /path/to/vela/backend
uvicorn main:app --reload --port 8000

# Terminal 4 — Frontend
cd /path/to/vela/frontend
streamlit run app.py

Configuration Reference

| Variable | Default | Options | Description |
|---|---|---|---|
| `LLM_PROVIDER` | `ollama` | `ollama`, `openai`, `anthropic` | Text LLM for answering questions |
| `VISION_PROVIDER` | `ollama` | `ollama`, `openai` | Vision LLM for frame description |
| `OLLAMA_VISION_MODEL` | `llava` | `llava`, `llava-llama3` | Local vision model |
| `VISION_MODEL` | `gpt-4o` | `gpt-4o`, `gpt-4o-mini` | OpenAI vision model |
| `WHISPER_MODEL` | `base` | `tiny`, `base`, `small`, `medium`, `large` | Transcription quality vs speed |
| `SCENE_THRESHOLD` | `27.0` | Lower = more frames | Scene detection sensitivity |
| `MAX_FRAMES_PER_VIDEO` | `300` | Any int | Frame extraction cap |
| `TOP_K` | `10` | Any int | Chunks retrieved before reranking |
| `RERANK_TOP_N` | `4` | Any int | Chunks kept after reranking |
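
To make the interaction between `SCENE_THRESHOLD` and `MAX_FRAMES_PER_VIDEO` concrete, here is a minimal sketch of threshold-based keyframe selection with fallback sampling. It is not the code in frame_agent.py: `select_keyframes`, the `fallback_step` parameter, and the score scale are assumptions, and the per-frame difference scores are supplied as a plain list rather than computed with OpenCV.

```python
def select_keyframes(diff_scores, scene_threshold=27.0, max_frames=300, fallback_step=30):
    """Pick frame indices from per-frame difference scores.

    diff_scores[i] is how much frame i differs from frame i-1 (hypothetical scale).
    """
    picked = [i for i, score in enumerate(diff_scores) if score > scene_threshold]
    if not picked:
        # Fallback sampling: no scene cuts found, so take every Nth frame.
        picked = list(range(0, len(diff_scores), fallback_step))
    if len(picked) > max_frames:
        # Thin evenly so the kept frames still span the whole video.
        step = len(picked) / max_frames
        picked = [picked[int(k * step)] for k in range(max_frames)]
    return picked


scores = [0.0, 5.0, 40.0, 3.0, 31.0, 2.0, 28.5, 1.0]
default_pick = select_keyframes(scores)                      # [2, 4, 6]
lower_threshold = select_keyframes(scores, scene_threshold=4.0)  # [1, 2, 4, 6]
capped = select_keyframes(scores, max_frames=2)              # [2, 4]
static_video = select_keyframes([1.0] * 90)                  # [0, 30, 60]
print(default_pick, lower_threshold, capped, static_video)
```

Lowering the threshold admits more frames, and the cap then bounds how many are sent to the vision model.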

Example Queries

After processing a video:

"What is shown at the 2-minute mark?"
"Summarize the key points discussed."
"What diagrams or slides are visible?"
"At what timestamp does the presenter show the results?"
"What is written on the whiteboard?"

Project Structure

vela/
├── backend/
│   ├── main.py                  # FastAPI — /process, /process/url, /query
│   ├── config.py                # Central config + BYOK routing
│   ├── agents/
│   │   ├── frame_agent.py       # Scene detection + keyframe extraction
│   │   ├── vision_agent.py      # Vision LLM frame description
│   │   ├── transcript_agent.py  # Whisper transcription
│   │   ├── fusion_agent.py      # Visual + audio fusion
│   │   ├── synthesis_agent.py   # Timestamped answer generation
│   │   └── url_agent.py         # yt-dlp YouTube/URL downloader
│   ├── graph/
│   │   └── orchestrator.py      # LangGraph state graph
│   ├── rag/
│   │   ├── embedder.py
│   │   ├── indexer.py
│   │   └── retriever.py
│   └── utils/
│       └── video_utils.py
├── frontend/
│   └── app.py                   # Streamlit UI
├── frames/                      # Extracted keyframes (auto-created)
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md
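
The rag/ modules listed above, together with `TOP_K` and `RERANK_TOP_N` from the Configuration Reference, suggest a two-stage retrieval: a broad vector search followed by a narrower rerank. The sketch below illustrates that idea only. It is not the code in retriever.py: `retrieve`, the in-memory chunk list, the toy two-dimensional vectors, and the word-overlap reranker are all stand-ins for the real embeddings, Qdrant search, and reranking model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, query_text, chunks, top_k=10, rerank_top_n=4):
    """Stage 1: vector similarity keeps top_k. Stage 2: rerank keeps rerank_top_n."""
    by_similarity = sorted(
        chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )
    candidates = by_similarity[:top_k]

    query_words = set(query_text.lower().split())

    def overlap(chunk):
        # Stand-in reranker: count of query words present in the chunk text.
        return len(query_words & set(chunk["text"].lower().split()))

    reranked = sorted(candidates, key=overlap, reverse=True)
    return reranked[:rerank_top_n]


chunks = [
    {"start": 0, "text": "welcome and agenda", "vector": [0.9, 0.1]},
    {"start": 60, "text": "method overview on the whiteboard", "vector": [0.6, 0.4]},
    {"start": 120, "text": "results chart shown by the presenter", "vector": [0.2, 0.8]},
    {"start": 180, "text": "closing remarks", "vector": [0.1, 0.9]},
]
hits = retrieve([0.1, 0.9], "presenter results", chunks, top_k=3, rerank_top_n=1)
print([(h["start"], h["text"]) for h in hits])
```

In this toy run the chunk at 180 seconds is closest by vector similarity, but the rerank stage promotes the chunk at 120 seconds because it shares words with the query.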

License

MIT
