Efficient vector search combining semantic similarity with metadata filtering.
- Zhangding Liu - Project Integration, Experiment Framework, Baselines, Evaluation, and README
- Yao-Ting Huang - Data Module
- Zaowei Dai - Indexing Module
- Yichang Xu - Search Module and Experimental Results Analysis
This section maps the main components described in our report to their code implementations.
- data/loader.py: Wikipedia embedding dataset loader (OpenAI ada-002, d=1536)
- data/preprocessor.py: Synthetic metadata generation (category, importance, year, region); see the sketch after this list
- data/query_generator.py: Query workload generation with controllable selectivity
- scripts/build_data.py: Dataset generation script (100k samples by default)
- scripts/build_query.py: Query generation script (500 queries)
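A minimal sketch of the kind of synthetic metadata data/preprocessor.py produces, using the four fields listed above; the value ranges and category/region vocabularies here are assumptions for illustration, not the project's actual generator settings:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000  # matches the default dataset size

# Hypothetical value ranges; the real generator in data/preprocessor.py may differ.
metadata = pd.DataFrame({
    "category": rng.choice(["science", "history", "sports", "arts"], size=n),
    "importance": rng.uniform(0.0, 1.0, size=n),
    "year": rng.integers(1990, 2024, size=n),
    "region": rng.choice(["NA", "EU", "ASIA", "OTHER"], size=n),
})
metadata.to_parquet("metadata.parquet")  # requires pyarrow or fastparquet
```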
- indexing/ivf_index.py: FAISS IVF-PQ index wrapper with selective list search
- indexing/metadata_signatures.py: Per-list metadata signature builder (see the sketch after this list)
  - Categorical signatures (bitsets)
  - Numeric signatures (min/max ranges + bucket masks)
- indexing/signature_storage.py: Signature persistence using Parquet
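A minimal sketch of a per-list signature in the spirit of metadata_signatures.py (categorical bitset plus numeric min/max range); the field names and helper are illustrative, not the module's actual API:

```python
def build_list_signature(categories, years, all_categories):
    """Per-IVF-list signature sketch: a categorical bitset plus a numeric min/max range.

    `categories` and `years` are the metadata values of the vectors assigned to one
    IVF list; `all_categories` is the global category vocabulary. Names are illustrative.
    """
    index_of = {c: i for i, c in enumerate(all_categories)}
    bitset = 0
    for c in set(categories):
        bitset |= 1 << index_of[c]      # set bit i iff category i occurs in this list
    return {
        "category_bits": bitset,
        "year_min": min(years),         # numeric signature: covered value range
        "year_max": max(years),         # (bucket masks omitted for brevity)
    }

sig = build_list_signature(["science", "arts"], [1995, 2003, 2019],
                           ["science", "history", "sports", "arts"])
```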
- search/pruning.py: Filter-aware IVF list pruning using metadata signatures (see the sketch after this list)
- search/adaptive_deepening.py: Adaptive nprobe and k' parameter adjustment
- search/search_engine.py: Main hybrid search coordinator
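Filter-aware pruning keeps only the IVF lists whose signatures could satisfy the query filter, so the remaining lists never need to be scanned. A sketch under the same assumptions as the signature example above (the real search/pruning.py may differ):

```python
def prune_lists(signatures, wanted_category_bits, year_lo, year_hi):
    """Keep only the IVF lists whose signatures could satisfy the query filter."""
    survivors = []
    for list_id, sig in signatures.items():
        if sig["category_bits"] & wanted_category_bits == 0:
            continue  # no vector in this list has a requested category
        if sig["year_max"] < year_lo or sig["year_min"] > year_hi:
            continue  # the list's year range is disjoint from the query range
        survivors.append(list_id)
    return survivors  # only these lists are probed during the IVF search
```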
Implemented standard baseline methods from IR literature, adapted to our project's filter format and data structures:
- baselines/prefilter_bruteforce.py: Pre-filter baseline (filter → exact search)
- baselines/postfilter_ann.py: Post-filter baseline with iterative retry logic (ANN → filter → retry); see the sketch after this list
- References: Standard pre-filtering and post-filtering approaches in filtered vector search
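A sketch of the post-filter retry idea (over-fetch with ANN, apply the filter, retry with a larger candidate set); the function name, parameters, and retry schedule are assumptions, not the actual postfilter_ann.py interface:

```python
def postfilter_search(index, query, passes_filter, k, multiplier=4, max_rounds=5):
    """PostFilter baseline sketch: ANN over-fetch, metadata filter, deepen and retry.

    `index` is a trained FAISS index, `query` a float32 vector, and
    `passes_filter(vector_id) -> bool` the metadata predicate.
    """
    k_prime = k * multiplier
    hits = []
    for _ in range(max_rounds):
        _, ids = index.search(query.reshape(1, -1), k_prime)
        hits = [i for i in ids[0] if i != -1 and passes_filter(i)]
        if len(hits) >= k:
            break
        k_prime *= 2   # too few survivors: retrieve a deeper candidate set and retry
    return hits[:k]
```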
Built evaluation framework implementing standard IR metrics, integrated with our hybrid search system:
- evaluation/oracle.py: Exact filtered search for ground truth computation
- evaluation/metrics.py: Recall@k (set-based; see the sketch after this list), latency statistics (percentiles), memory profiling
- evaluation/evaluator.py: Experiment framework with selectivity-based binning
- experiments/run_experiments.py: Comparative experiment runner across k=[10,20,50]
- experiments/analyze_results.py: Visualization pipeline (7 individual plots for LaTeX)
- References: Standard Recall@k definition, NumPy statistical functions, FAISS documentation
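Recall@k here is set-based: the fraction of the oracle's exact filtered top-k that the evaluated method also returned. A minimal sketch of that computation (not the actual metrics.py code):

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Set-based Recall@k: overlap between returned top-k and the exact (oracle) top-k."""
    retrieved = set(retrieved_ids[:k])
    truth = set(ground_truth_ids[:k])
    return len(retrieved & truth) / k

# Example: 8 of the oracle's top-10 were recovered -> recall 0.8
print(recall_at_k(list(range(10)), list(range(2, 12)), k=10))
```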
- results/comparison.csv: Method comparison across k=[10,20,50]
- results/selectivity_analysis.csv: Performance by selectivity bins
- results/index_stats.csv: Index build time and memory overhead
- results/figures/exp1_*.png: Method comparison plots (4 files)
- results/figures/exp2_*.png: Selectivity analysis plots (3 files)
# Clone and install
git clone https://github.com/ZhangdingLiu/CS6400_Project_Group11.git
cd CS6400_Project_Group11
pip install -r requirements.txt
# Run experiments (see "Running Experiments" section below)
python scripts/build_data.py
python scripts/build_query.py
python experiments/run_experiments.py
python experiments/analyze_results.py
CS6400_Project_Group11/
├── scripts/ # Data generation scripts
├── data/ # Data loading & preprocessing
├── indexing/ # IVF-PQ index & metadata signatures
├── search/ # Filter-aware pruning & adaptive deepening
├── baselines/ # PreFilter, PostFilter baselines
├── evaluation/ # Evaluation metrics & framework
├── experiments/ # Experiment runners & result analysis
├── results/ # Experiment results & figures
└── requirements.txt # Python dependencies
- Filter-Aware Pruning: Metadata signatures eliminate IVF lists that can't satisfy filters
- Adaptive Deepening: Dynamically adjusts nprobe and k' based on intermediate results (see the sketch below)
- Hybrid Search: Combines vector similarity with structured metadata filtering
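The adaptive-deepening idea can be sketched as a loop that widens nprobe until enough filtered hits survive; the schedule and the fixed 4·k over-fetch below are assumptions for illustration, not the actual policy in search/adaptive_deepening.py:

```python
def adaptive_search(index, query, passes_filter, k, nprobe_schedule=(8, 16, 32, 64)):
    """Adaptive deepening sketch: probe more IVF lists until k filtered hits survive."""
    hits = []
    for nprobe in nprobe_schedule:
        index.nprobe = nprobe                       # widen the IVF search each round
        _, ids = index.search(query.reshape(1, -1), 4 * k)
        hits = [i for i in ids[0] if i != -1 and passes_filter(i)]
        if len(hits) >= k:                          # enough results pass the filter: stop early
            break
    return hits[:k]
```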
# Step 1: Generate dataset (default: 100k samples, ~586MB)
python scripts/build_data.py
# Step 2: Generate queries (500 queries)
python scripts/build_query.py
# Step 3: Run experiments (~8-15 min)
python experiments/run_experiments.py
# Step 4: Generate visualizations
python experiments/analyze_results.py
# 10k samples (quick test)
python scripts/build_data.py --n-samples 10000
# 50k samples
python scripts/build_data.py --n-samples 50000
# Then run steps 2-4 as above
- Data: data_files/embeddings.npy, metadata.parquet, queries.json
- Results: results/comparison.csv, selectivity_analysis.csv, index_stats.csv
- Figures: results/figures/exp1_*.png, exp2_*.png (7 individual plots for LaTeX)
- Recall@k (k=10, 20, 50): Search quality
- Latency (P50/P95/Mean): Query performance (see the sketch below)
- Index Build Time: Indexing efficiency
- Memory Usage: Index + signature overhead
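The latency statistics above map directly onto NumPy percentiles; a minimal sketch with illustrative numbers:

```python
import numpy as np

latencies_ms = np.array([3.1, 4.7, 5.0, 6.2, 9.8, 12.4])  # per-query latencies (illustrative)
p50, p95 = np.percentile(latencies_ms, [50, 95])
print(f"mean={latencies_ms.mean():.2f} ms  P50={p50:.2f} ms  P95={p95:.2f} ms")
```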
- FAISS: IVF-PQ indexing for efficient vector search (see the sketch below)
- DuckDB: Metadata filtering
- NumPy/Pandas: Data processing
- Matplotlib: Result visualization
See requirements.txt for full dependencies.
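For reference, building and querying a FAISS IVF-PQ index at the ada-002 dimensionality looks roughly like this; nlist, m, and nprobe here are illustrative, not the project's tuned values:

```python
import faiss
import numpy as np

d, nlist, m = 1536, 64, 24                           # ada-002 dimension; nlist/m are illustrative
xb = np.random.rand(10_000, d).astype("float32")     # stand-in for the real embeddings

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per PQ code
index.train(xb)                                       # learn coarse centroids + PQ codebooks
index.add(xb)

index.nprobe = 16                                     # IVF lists probed at query time
distances, ids = index.search(xb[:1], 10)             # top-10 neighbors of one query
```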