This repository contains a video analysis system with capabilities for keyframe extraction, facial expression recognition, and video question answering evaluation.
video_agent/
├── data_process/
│ ├── complex_question_selection.py
│ ├── error_case_saved.py
│ ├── error_complex.py
│ └── separation_complex_question.py
├── evaluation/
│ ├── complex_question_2API.ipynb
│ ├── complex_question_agent_gpt4v.ipynb
│ ├── complex_question_agent_llama.ipynb
│ ├── complex_question_agent_local.ipynb
│ ├── VQA_gpt4v.ipynb
│ ├── eval_mvbench_0.py
│ └── eval_mvbench_1.py
└── local_model/
  ├── Facial_Expression_Recognition/
  ├── KeyFrame_Extraction/
  ├── PaddleVideo/
  ├── VideoCaptioningTransformer/
  └── requirement.txt
Install the required dependencies:
cd local_model
pip install -r requirement.txt

The evaluation script eval_mvbench_0.py supports multiple video formats and applies the following preprocessing:
- Video Reading: Uses the decord library for efficient video loading
- Frame Sampling: Configurable number of segments (default: 8-16 frames)
- Resolution: Standardized to 224x224 pixels
- Data Transformations:
  - GroupScale: Rescales images while maintaining aspect ratio
  - GroupCenterCrop: Center-crops to the target size
  - GroupNormalize: Normalizes pixel values
  - ToTorchFormatTensor: Converts to PyTorch tensor format
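The frame sampling and resizing steps listed above can be illustrated with a small sketch. This is not the code from eval_mvbench_0.py: the function names `sample_frame_indices`, `scaled_size`, and `center_crop_box` are hypothetical, and the choice of the middle frame of each segment is an assumption about how segment-based sampling is typically done.

```python
from typing import List, Tuple


def sample_frame_indices(num_frames: int, num_segments: int = 8) -> List[int]:
    """Pick the middle frame index of each of `num_segments` equal segments.

    Assumption: one frame per segment, taken at the segment centre.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    seg_len = num_frames / num_segments
    # Centre of each segment, clamped to the last valid frame index.
    return [
        min(num_frames - 1, int(seg_len * i + seg_len / 2))
        for i in range(num_segments)
    ]


def scaled_size(width: int, height: int, short_side: int = 224) -> Tuple[int, int]:
    """Size after rescaling so the shorter side equals `short_side`,
    keeping the aspect ratio (the idea behind GroupScale)."""
    if width <= height:
        return short_side, round(height * short_side / width)
    return round(width * short_side / height), short_side


def center_crop_box(width: int, height: int, size: int = 224) -> Tuple[int, int, int, int]:
    """(left, top, right, bottom) of a centred size x size crop
    (the idea behind GroupCenterCrop)."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


if __name__ == "__main__":
    print(sample_frame_indices(80, 8))
    w, h = scaled_size(640, 480)
    print((w, h), center_crop_box(w, h))
```

With decord, indices like these could be passed to `VideoReader.get_batch` to load only the sampled frames rather than decoding the whole video.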
Evaluate video understanding capabilities on the MVBench dataset:
# Run MVBench evaluation
python evaluation/eval_mvbench_1.py

Evaluation Process:
- Load video samples with question-answer pairs
- Process videos through the model pipeline
- Generate predictions for multiple-choice questions
- Calculate accuracy metrics by task type
- Save results to JSON file
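The last two steps (per-task accuracy and saving to JSON) can be sketched as follows. The record fields `task_type`, `pred`, and `gt`, the output keys, and the function names are hypothetical choices for illustration; the actual scripts may structure their results differently.

```python
import json
from collections import defaultdict
from typing import Dict, List


def accuracy_by_task(records: List[dict]) -> Dict[str, float]:
    """Percentage of correct predictions for each task type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task_type"]] += 1
        correct[r["task_type"]] += int(r["pred"] == r["gt"])
    return {task: 100.0 * correct[task] / total[task] for task in total}


def summarize(records: List[dict]) -> dict:
    """Overall accuracy, per-task accuracy, and the individual records."""
    n_correct = sum(r["pred"] == r["gt"] for r in records)
    overall = 100.0 * n_correct / len(records) if records else 0.0
    return {
        "overall_acc": overall,
        "task_acc": accuracy_by_task(records),
        "results": records,
    }


def save_results(summary: dict, path: str) -> None:
    """Write the summary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Toy multiple-choice predictions (assumed format).
    demo = [
        {"task_type": "Action Sequence", "pred": "A", "gt": "A"},
        {"task_type": "Action Sequence", "pred": "B", "gt": "C"},
        {"task_type": "Object Existence", "pred": "D", "gt": "D"},
    ]
    save_results(summarize(demo), "demo_results.json")
```

Keeping the individual records alongside the aggregate numbers makes it possible to inspect error cases later without re-running the model.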
The system provides an agent interface:
# Run the notebook for the desired base model:
complex_question_agent_gpt4v.ipynb
complex_question_agent_llama.ipynb
complex_question_agent_local.ipynb
VQA_gpt4v.ipynb

The evaluation system tracks:
- Overall Accuracy: Percentage of correct predictions
- Task-specific Accuracy: Performance breakdown by question type
- Detailed Results: Individual predictions with ground truth comparisons
Results are automatically saved in structured formats (JSON/CSV) for further analysis.
- Multi-modal Processing: Handles video, audio, and text inputs
- Flexible Keyframe Extraction: Three different extraction methods
- Comprehensive Evaluation: Support for multiple benchmark datasets
- GPU Acceleration: Optimized for CUDA-enabled systems
- Modular Design: Easy to extend with new models and evaluation metrics