sonixse/SmartCV

Meet SmartCV: A Multi-Agent System for Semantic Talent Matching 🤝

"The right person for the right role: understood, not just matched."


What is this?

SmartCV is a custom multi-agent system that takes a candidate's CV and finds the job offers that genuinely fit them, ranked by real semantic compatibility, not keyword overlap.

We built this from scratch. Seven specialized agents, each with a single responsibility, coordinated by an orchestrator that adapts the flow based on what it finds at each step. No pre-packaged agent was used. Every role was designed, justified, and implemented by the team.

The result behaves more like a team of expert reviewers working in parallel than a search engine.


The Problem

Traditional CV screening tools match by keywords. If a CV says scikit-learn and the job posting says supervised ML, it's a miss, even though hands-on scikit-learn experience is direct evidence of supervised ML skills.

We fix this with semantic embeddings: both CVs and job requirements are converted into vectors that represent meaning, not letters. Skills that are conceptually close end up mathematically close. Then we filter must-have constraints, rank semantically, and explain the gaps.

This is not a pipeline. It's a reasoning system.


Meet the Agents

We named each agent after a figure from history whose defining contribution mirrors exactly what that agent does.


🟣 JOHN VON NEUMANN β€” The Orchestrator

John von Neumann designed the architecture of the modern computer: a central unit that coordinates memory, processing, and I/O. Our orchestrator does the same.

Built on LangGraph, John von Neumann manages the full pipeline state. It decides which agents activate, handles conditional branching (e.g., only waking up Hedy Lamarr if grey-zone skills exist), and ensures every agent receives exactly the context it needs.

Input:  CV upload
Output: Coordinates the full agent graph
Tool:   LangGraph stateful graph
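
The grey-zone branch described above can be sketched as a small routing function. This is a minimal, dependency-free illustration, not the code in src/orchestrator/graph.py: the PipelineState fields and the node names "detective" and "podium" are assumptions made for the example. In LangGraph, a function of this shape is what gets registered through StateGraph.add_conditional_edges.

```python
from typing import TypedDict


class PipelineState(TypedDict):
    """Hypothetical shared state; the real state schema is not shown in this README."""
    # required skill -> "MATCH" | "GREY ZONE" | "NO MATCH", as produced by the Linguist
    skill_labels: dict[str, str]


def route_after_linguist(state: PipelineState) -> str:
    """Return the next node name: the Detective runs only if grey zones exist."""
    has_grey = any(label == "GREY ZONE" for label in state["skill_labels"].values())
    return "detective" if has_grey else "podium"
```

Keeping the decision in one pure function makes the branch easy to unit-test without starting the whole graph.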

🟠 ADA LOVELACE β€” The Interpreter Agent

Ada Lovelace wrote the first algorithm β€” the first time anyone converted a human idea into structured instructions a machine could follow. This agent does the same: it takes a CV, a deeply human document, and converts it into structured data a system can reason over.

The Interpreter Agent reads the raw PDF, extracts a structured candidate profile β€” skills, years of experience per domain, education level and field, spoken languages β€” and validates it into a Pydantic schema. CVs are messy, multilingual, and inconsistent. Ada Lovelace handles all of it.

Input:  Raw CV text (PDF β†’ string)
Output: Structured CandidateProfile (Pydantic)
Tool:   LLM + Pydantic validation
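
To make the shape of the structured profile concrete, here is a rough sketch. It uses a standard-library dataclass as a stand-in for the Pydantic model so that it runs without dependencies. The field names, the CEFR encoding of languages, and the validation rules are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field

# Assumed vocabularies: the education ladder mirrors the one the Qualifier checks;
# the CEFR scale is the standard A1..C2 ladder.
EDUCATION_LEVELS = ("No degree", "Bachelor's", "Master's", "PhD")
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass
class CandidateProfile:
    """Hypothetical field layout; the real schema is a Pydantic model."""
    skills: list[str]
    years_experience: dict[str, float]  # domain -> years
    education_level: str
    education_field: str
    languages: dict[str, str] = field(default_factory=dict)  # language -> CEFR level

    def __post_init__(self) -> None:
        if self.education_level not in EDUCATION_LEVELS:
            raise ValueError(f"unknown education level: {self.education_level!r}")
        for language, level in self.languages.items():
            if level not in CEFR_LEVELS:
                raise ValueError(f"invalid CEFR level for {language}: {level!r}")
        if any(years < 0 for years in self.years_experience.values()):
            raise ValueError("years of experience cannot be negative")
        # Normalise skills: trim whitespace, drop empties and exact duplicates (order kept).
        self.skills = list(dict.fromkeys(s.strip() for s in self.skills if s.strip()))
```

Rejecting malformed LLM output at this boundary means every downstream agent can trust the profile it receives.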

🔵 MARIE CURIE: The Qualifier Agent

Marie Curie's work was built on absolute scientific rigor. Either the element was radioactive or it wasn't: no approximations, no negotiation. The first person to win two Nobel Prizes in two different sciences didn't deal in grey areas.

The Qualifier enforces must-have constraints deterministically across four rules:

  1. Years of experience: candidate must meet or exceed the vacancy minimum
  2. Education level: checked against a hierarchy (No degree → Bachelor's → Master's → PhD)
  3. Required languages: each language checked by name and minimum CEFR level
  4. Exact skill overlap: a soft bonus for direct name matches, before Alan Turing runs semantic comparison

It runs as pure code: fast, auditable, and immune to LLM hallucination. If a vacancy requires B2 English and the candidate has A1, the answer is no. Not "probably not." No.

Why languages go here, not in embeddings: A semantic model might place "Catalan" near "Spanish" and grant a partial match. But language requirements are operational constraints, not fuzzy preferences. The Qualifier enforces this, which is also the ethically correct call.

Input:  CandidateProfile + Vacancy requirements
Output: pass/fail flag + score (0..4) + failed_checks list + reasons list
        failed_checks tells Steve Jobs exactly which rules the candidate failed,
        enabling targeted coaching instead of generic gap analysis
Tool:   Deterministic rule engine (Python)
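
The four rules can be sketched as a small deterministic function. This is an illustration under stated assumptions, not the project's rule engine: the dictionary keys and check names are hypothetical, the score is assumed to be the number of rules satisfied, and rule 4 is treated as a soft bonus that lowers the score but never fails the candidate.

```python
from dataclasses import dataclass, field

EDUCATION_RANK = {"No degree": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
CEFR_RANK = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


@dataclass
class QualifierResult:
    passed: bool
    score: int  # assumed: number of rules satisfied, 0..4
    failed_checks: list[str] = field(default_factory=list)
    reasons: list[str] = field(default_factory=list)


def qualify(candidate: dict, vacancy: dict) -> QualifierResult:
    failed, reasons = [], []

    # Rule 1: years of experience must meet or exceed the minimum.
    if candidate["years_experience"] < vacancy["min_years"]:
        failed.append("experience")
        reasons.append(f"{candidate['years_experience']} years < required {vacancy['min_years']}")

    # Rule 2: education level compared on an ordered hierarchy.
    if EDUCATION_RANK[candidate["education_level"]] < EDUCATION_RANK[vacancy["min_education"]]:
        failed.append("education")
        reasons.append(f"{candidate['education_level']} is below {vacancy['min_education']}")

    # Rule 3: each language matched by exact name and minimum CEFR level (no fuzzy matching).
    unmet = []
    for language, min_level in vacancy["languages"].items():
        have = candidate["languages"].get(language)
        if have is None or CEFR_RANK[have] < CEFR_RANK[min_level]:
            unmet.append(f"{language} {min_level} (candidate: {have or 'none'})")
    if unmet:
        failed.append("languages")
        reasons.append("language requirement not met: " + ", ".join(unmet))

    # Rule 4 (soft): at least one exact, case-insensitive skill-name overlap.
    overlap = {s.lower() for s in candidate["skills"]} & {s.lower() for s in vacancy["skills"]}
    if not overlap:
        failed.append("skill_overlap")
        reasons.append("no exact skill-name match")

    hard_failures = [check for check in failed if check != "skill_overlap"]
    return QualifierResult(not hard_failures, 4 - len(failed), failed, reasons)
```

Because "Spanish" and "Catalan" are different dictionary keys, a C2 in one never satisfies a requirement for the other, which is the behaviour argued for above.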

🟒 ALAN TURING β€” The Linguist Agent

Alan Turing asked whether machines could understand meaning. This agent is the answer.

The Linguist Agent performs semantic skills comparison: it converts every required skill from the vacancy into an embedding vector and compares them against the candidate's skill vectors stored in ChromaDB. To generate these vectors, we use intfloat/multilingual-e5-base β€” a pre-trained open-source multilingual embedding model from HuggingFace. Think of it as a component that reads a piece of text and converts it into a list of numbers representing its meaning. Skills that mean similar things end up as similar numbers. Cosine similarity produces three categories:

| Category | Threshold | Meaning |
| --- | --- | --- |
| ✅ MATCH | > 0.87 | Semantically equivalent ("PySpark" ≈ "distributed data processing") |
| ⚠️ GREY ZONE | 0.84 – 0.87 | Possibly related; needs reasoning |
| ❌ NO MATCH | < 0.84 | Not covered |

On thresholds: The values 0.87 and 0.84 are heuristic starting points, informed by common practice in semantic similarity tasks. They are defined in src/config.py (LINGUIST_MATCH_THRESHOLD and LINGUIST_GREY_THRESHOLD) and can be tuned without touching any other file. We adjust them based on three signals: (1) false positives: if clearly unrelated skills are landing in MATCH, we raise the match threshold; (2) false negatives: if obviously equivalent skills like "Python" vs "Python 3" are falling into GREY ZONE, we lower the match threshold; (3) grey zone volume: if too many skills end up in GREY ZONE, the Detective becomes a bottleneck, so we narrow the band until the grey zone catches only genuinely ambiguous cases.
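
As a rough sketch of how the two thresholds turn a similarity into a category: the constant names and values come from the text above, while the function names and the best-match-per-required-skill strategy are assumptions. The vectors here are plain lists standing in for real embeddings.

```python
import math

LINGUIST_MATCH_THRESHOLD = 0.87  # above this: MATCH
LINGUIST_GREY_THRESHOLD = 0.84   # from here up to the match threshold: GREY ZONE


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(similarity: float) -> str:
    if similarity > LINGUIST_MATCH_THRESHOLD:
        return "MATCH"
    if similarity >= LINGUIST_GREY_THRESHOLD:
        return "GREY ZONE"
    return "NO MATCH"


def classify_required_skill(required_vec: list[float], candidate_vecs: list[list[float]]) -> str:
    """Label one required skill by its best-matching candidate skill (assumed strategy)."""
    best = max((cosine_similarity(required_vec, v) for v in candidate_vecs), default=0.0)
    return classify(best)
```

Both boundary values (exactly 0.84 and exactly 0.87) fall into GREY ZONE here, which matches the table: MATCH is strictly above 0.87 and NO MATCH strictly below 0.84.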

Input:  Candidate skills + Vacancy skill requirements
Output: Per-skill classification (MATCH / GREY ZONE / NO MATCH)
Tool:   multilingual-e5-base embeddings + ChromaDB nearest-neighbor search

🟑 HEDY LAMARR β€” The Detective Agent

Hedy Lamarr invented frequency-hopping spread spectrum β€” the ability to detect a clear signal by intelligently navigating through noise and ambiguity. The basis of WiFi, Bluetooth, and GPS. This agent does the same: it finds the real signal in skills that are too noisy for a simple match.

The Detective Agent handles ambiguity reasoning β€” it only activates when Alan Turing flags GREY ZONE skills. It reads the actual CV context β€” project descriptions, job history, tool mentions β€” and judges whether the candidate likely has the skill implicitly. It always cites the specific evidence it used. No silent decisions.

Input:  Grey-zone skills + full CV context
Output: MATCH / NO MATCH verdict per skill + quoted evidence
Tool:   LLM with chain-of-thought
Activation: Conditional, only when grey zones exist

🔴 SERENA WILLIAMS: The Podium Agent

Serena Williams dominated the WTA ranking for over 20 years. Her legacy is not just the trophies; it is the points, accumulated consistently, relentlessly, across every surface and every era. This agent does the same: it aggregates every signal into a final score and ranks without hesitation.

The Podium Agent handles scoring and ranking: it aggregates the outputs from Marie Curie, Alan Turing, and Lamarr into a single weighted compatibility score per vacancy. Weights are calibrated per skill category (must-have vs. nice-to-have) and role seniority. The result is a ranked list of vacancies, each with a transparent, decomposed score.

Handling the no-match case. The Podium Agent never returns an empty result. Even when scores are universally low (meaning the candidate does not match any vacancy well), the ranking is still produced and surfaced. A low score is not a dead end; it is the most honest and useful input the Visionary Agent could ever receive. The worse the match, the richer the coaching output. A candidate with zero strong matches does not see a blank screen: they see a precise, personalised roadmap of exactly what to build to become competitive. The system turns its own worst-case scenario into its most valuable output.

Input:  Qualification results + skill match results (all agents)
Output: Compatibility score (0–100) per vacancy, ranked; always produced, regardless of score
Tool:   Weighted scoring formula (Python)
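
A weighted, decomposed score of this kind could look like the sketch below. The weights, the component names, and the rule that an empty skill list counts as fully covered are all assumptions for illustration; the real weights live in src/config.py and are calibrated per skill category and seniority.

```python
# Hypothetical weights, expressed as points that sum to 100.
WEIGHTS = {"qualifier": 30.0, "must_have": 50.0, "nice_to_have": 20.0}


def coverage(labels: list[str]) -> float:
    """Fraction of skills whose final label is MATCH; an empty list counts as fully covered."""
    return sum(label == "MATCH" for label in labels) / len(labels) if labels else 1.0


def compatibility_score(qualifier_score: int, must_have: list[str],
                        nice_to_have: list[str]) -> dict[str, float]:
    """Return each component plus the 0-100 total, so the score stays decomposable."""
    parts = {
        "qualifier": WEIGHTS["qualifier"] * qualifier_score / 4,  # Qualifier score is 0..4
        "must_have": WEIGHTS["must_have"] * coverage(must_have),
        "nice_to_have": WEIGHTS["nice_to_have"] * coverage(nice_to_have),
    }
    parts["total"] = round(sum(parts.values()), 1)
    return parts


def rank(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Always return every vacancy, best first, even when all scores are low."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Returning the components alongside the total is what lets the interface show why a vacancy scored the way it did.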

🟀 STEVE JOBS β€” The Visionary Agent

Steve Jobs never accepted "good enough." He identified gaps, cut the noise, and told people exactly what they needed to build β€” and why it mattered. This agent does the same for your career.

The Visionary Agent acts as career coach: it receives the gap analysis (skills missing or weak across top-ranked vacancies) and generates personalized, prioritized recommendations. It accounts for what the candidate already knows and suggests the highest-leverage next steps β€” not a generic list of skills, but a reasoned development path.

When scores are high, Steve Jobs fine-tunes β€” "one more skill and you jump from rank 3 to rank 1". When scores are universally low, Steve Jobs takes over completely: it reframes the entire output from a ranking into a development plan, telling the candidate not just what they are missing but in which order to tackle it and why β€” prioritised by impact on employability across all vacancies simultaneously.

Strong match example: "You have the ML foundations for Data Scientist roles. Adding MLflow (you already use Docker β€” it's a 2-day ramp) would make you competitive for 3 more vacancies in this list."

No match example: "None of the current vacancies are a strong fit yet β€” but you are closer than you think. Your Python base covers 60% of what Data Analyst Junior requires. Focus on SQL and Power BI first: those two skills unlock 5 of the 8 vacancies in the dataset. You could be competitive within 3 months."

Input:  CandidateProfile + top vacancies + gap analysis
Output: Ranked skill recommendations with impact justification
Tool:   LLM with structured output
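
The "which skills unlock the most vacancies" reasoning in the examples above can be grounded in a simple count before the LLM writes any prose. The sketch below is an assumption about how such a gap analysis might be computed; the function names are hypothetical and the skill gaps in the usage note are toy data.

```python
from collections import Counter


def prioritise_gaps(missing_by_vacancy: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank missing skills by how many vacancies list them as a gap (ties sorted by name)."""
    counts = Counter(skill for gaps in missing_by_vacancy.values() for skill in gaps)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def unlocked_by(skills_to_learn: set[str], missing_by_vacancy: dict[str, set[str]]) -> list[str]:
    """Vacancies whose remaining gaps would all be closed by learning the given skills."""
    return sorted(
        vacancy
        for vacancy, gaps in missing_by_vacancy.items()
        if gaps and gaps <= skills_to_learn
    )
```

For instance, with gaps of {"SQL", "Power BI"} for Data Analyst Junior, {"PyTorch"} for AI Researcher, and {"SQL", "MLflow"} for MLOps Engineer, SQL ranks first because it appears in two gap lists, and learning SQL plus Power BI closes every gap for Data Analyst Junior.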

🏆 JOHANNES GUTENBERG: The Publisher Agent

Johannes Gutenberg invented the printing press: the original act of making information displayable and accessible to the masses. Johannes Gutenberg turns the pipeline's output into something a human can actually read and act on.

The Publisher Agent handles results and display: it persists all results to SQLite (including the analysis results, the vacancy descriptions, and the submitted CV profile), structures the output for the interface, and drives what the candidate actually sees: the ranked list, the per-skill breakdown, and the Steve Jobs coaching output, all rendered in a custom Streamlit interface.

Input:  Final ranked results + coaching output
Output: Rendered Streamlit interface for the candidate
Tool:   SQLite + Streamlit
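
The persistence step might look roughly like this with Python's built-in sqlite3 module. The table name, column layout, and function names are hypothetical; the README does not show the project's actual schema.

```python
import json
import sqlite3


def save_results(conn: sqlite3.Connection, candidate: str, ranked: list[dict]) -> None:
    """Store one row per (candidate, vacancy) with the per-skill breakdown as JSON."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS analysis_results (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               candidate TEXT NOT NULL,
               vacancy TEXT NOT NULL,
               score REAL NOT NULL,
               breakdown_json TEXT NOT NULL
           )"""
    )
    conn.executemany(
        "INSERT INTO analysis_results (candidate, vacancy, score, breakdown_json) "
        "VALUES (?, ?, ?, ?)",
        [(candidate, r["vacancy"], r["score"], json.dumps(r["breakdown"])) for r in ranked],
    )
    conn.commit()


def load_ranking(conn: sqlite3.Connection, candidate: str) -> list[tuple[str, float]]:
    """Read back the stored ranking for display, best score first."""
    return conn.execute(
        "SELECT vacancy, score FROM analysis_results WHERE candidate = ? ORDER BY score DESC",
        (candidate,),
    ).fetchall()
```

Parameterised queries (the ? placeholders) keep text extracted from a CV from ever being interpreted as SQL.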

Full System Architecture

                        ┌─────────────────────┐
                        │ Streamlit Interface │
                        │  (candidate uploads │
                        │     CV as PDF)      │
                        └──────────┬──────────┘
                                   │
                        ┌──────────▼──────────┐
                        │ 🟣 JOHN VON NEUMANN │
                        │     Orchestrator    │
                        │    (LangGraph)      │
                        └──────────┬──────────┘
               ┌───────────────────┼───────────────────┐
               │                   │                   │
    ┌──────────▼──────────┐        │        ┌──────────▼──────────┐
    │  🟠 ADA LOVELACE    │        │        │   🔵 MARIE CURIE    │
    │ The Interpreter Agt │        │        │  The Qualifier Agt  │
    │  LLM + Pydantic     │        │        │   Rule Engine       │
    └──────────┬──────────┘        │        └──────────┬──────────┘
               │                   │                   │
               └───────────────────▼───────────────────┘
                                   │
                        ┌──────────▼──────────┐
                        │   🟢 ALAN TURING    │
                        │  The Linguist Agent │
                        │ E5 vectors+ChromaDB │
                        └──────────┬──────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │  Grey zones found?          │
                    │  YES ──► 🟡 HEDY LAMARR     │
                    │        The Detective Agent  │
                    │          LLM + Evidence     │
                    │  NO  ──► skip               │
                    └──────────────┬──────────────┘
                                   │
                        ┌──────────▼──────────┐
                        │ 🔴 SERENA WILLIAMS  │
                        │   The Podium Agent  │
                        │  Weighted Ranking   │
                        └──────────┬──────────┘
                                   │
                        ┌──────────▼──────────┐
                        │   🟤 STEVE JOBS     │
                        │ The Visionary Agent │
                        │  LLM + Gap Analysis │
                        └──────────┬──────────┘
                                   │
                        ┌──────────▼──────────┐
                        │🏆 JOHANNES GUTENBERG│
                        │ The Publisher Agent │
                        │ SQLite + Streamlit  │
                        └─────────────────────┘

Agent legend

| Symbol | Agent | Type |
| --- | --- | --- |
| 🟣 | John von Neumann | Orchestrator (LangGraph) |
| 🟠 | Ada Lovelace | LLM Agent (Interpreter) |
| 🔵 | Marie Curie | Deterministic (Rule engine) |
| 🟢 | Alan Turing | Data Agent (multilingual-e5-base + ChromaDB) |
| 🟡 | Hedy Lamarr | LLM Agent (Chain-of-thought) |
| 🔴 | Serena Williams | Deterministic (Scoring formula) |
| 🟤 | Steve Jobs | LLM Agent (Gap analysis) |
| 🏆 | Johannes Gutenberg | Deterministic (SQLite + Streamlit) |

Tech Stack

| Component | Tool | Reason |
| --- | --- | --- |
| Orchestration | LangGraph | Stateful, conditional agent graph, not a fixed pipeline |
| LLM | Llama 3 (Ollama) | Open source, local, zero API cost |
| Embedding model | intfloat/multilingual-e5-base | Open-source multilingual embedding model. Converts text into meaning-vectors: skills that mean similar things end up as similar numbers. Handles English, Spanish, Catalan, and more without preprocessing. Used off the shelf, no training required. |
| Vector DB | ChromaDB | Native nearest-neighbor search; file-local, no server required |
| Relational DB | SQLite | Stores analysis results, scores, vacancy descriptions, and submitted CV profiles |
| Data validation | Pydantic | Structured, typed output from LLM agents |
| Frontend | Streamlit | Custom dark UI, 7-language i18n, semantic ranking view |

Data

  • Job offers: 8 positions across the AI/Data stack, from Data Analyst Junior to AI Researcher and MLOps Engineer
  • CVs: 10 synthetic candidate profiles, covering a range of seniority levels, skill combinations, and backgrounds

Getting Started: Reproduce This Project

Everything you need to run SmartCV locally, from zero to a working app.

1. Requirements

| Requirement | Version / Notes |
| --- | --- |
| Python | 3.10 or higher |
| RAM | 8 GB minimum (16 GB recommended for smooth LLM inference) |
| Disk | ~5 GB free (model weights + ChromaDB index) |
| OS | Linux, macOS, or Windows (WSL2 recommended on Windows) |

2. Clone and install dependencies

git clone https://github.com/sonixse/Challenge3-SmartCV
cd Challenge3-SmartCV
pip install -r requirements.txt

3. Install Ollama and pull the LLM

The agents that use a language model (Ada Lovelace, Hedy Lamarr, Steve Jobs) run locally via Ollama. No API key needed.

# a) Install Ollama: download from https://ollama.com for your OS

# b) Pull the model (one-time, ~4 GB download)
ollama pull llama3

# c) Start the Ollama server (keep this terminal open)
ollama serve

If you see a connection error when launching the app, Ollama is not running. Start it with ollama serve first.


4. Index the job vacancies into ChromaDB

Alan Turing (Linguist) uses ChromaDB + multilingual-e5-base embeddings to do semantic skill matching. You need to index the vacancy data once before the first run.

python scripts/index_vacancies.py

This creates the vector index in data/chroma/. You only need to run this once (or again if the vacancy data changes).


5. Launch the app

streamlit run app.py

Open your browser at http://localhost:8501, upload a CV in PDF format, and click Show Ranking.


Checklist before launching

  • pip install -r requirements.txt completed
  • ollama pull llama3 completed (one-time, ~4 GB)
  • ollama serve is running in a separate terminal
  • python scripts/index_vacancies.py ran without errors

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Cannot connect to Ollama | ollama serve not running | Run it in a separate terminal |
| index_vacancies.py fails | ChromaDB path missing | The script creates data/chroma/ automatically; check write permissions |
| First run is slow (~30 s) | intfloat/multilingual-e5-base downloading | One-time download (~500 MB); subsequent runs use the cached model |
| streamlit: command not found | Streamlit not installed | Re-run pip install -r requirements.txt |
| 0 matches for a CV | All offers eliminated by Qualifier | Check that the CV's language names are in English (e.g. "Spanish", not "Español") |

The LangGraph pipeline lives in src/orchestrator/graph.py. Agent thresholds and weights are in src/config.py.


How to Use the App

Once running, upload any CV in PDF format and click Show Ranking. In seconds you get:

  • Top-ranked job matches with semantic compatibility scores
  • Per-skill breakdown: MATCH / GREY ZONE (with Lamarr's reasoning) / NO MATCH
  • Personalized gap analysis and coaching from Steve Jobs (the Visionary Agent)

Why This Is a Genuine Multi-Agent System

The constraint was clear: no pre-built agents. Here is how we comply, and go further:

  • 7 agents, 7 roles: every agent has a single, defined responsibility with typed inputs and outputs
  • Conditional activation: Lamarr only runs when Turing finds ambiguity. Johannes Gutenberg only renders once Serena Williams has a final score. The system is not a fixed pipeline; it adapts.
  • Deliberate LLM vs. code separation: Marie Curie and Serena Williams run as pure code because their tasks are deterministic. Ada Lovelace, Lamarr, and Steve Jobs use LLMs because their tasks require language understanding. This is an architectural decision, not a default.
  • The orchestrator has state: Von Neumann tracks what has run, what is pending, and what the current candidate profile looks like at each step.

The 5 Evaluation Dimensions

1. Innovation & Originality

Semantic embeddings + a conditional reasoning layer (Lamarr) + a personalized career coaching agent (Steve Jobs). Most CV tools do keyword matching. We do semantic understanding with explainability and a development roadmap. The agent naming is not decoration; it's a communication strategy that makes the architecture instantly memorable.

2. Feasibility & Scalability

Every component is production-realistic. ChromaDB scales to millions of vectors. The embedding model (multilingual-e5-base) is fast enough for real-time queries. SQLite swaps to PostgreSQL with one config change. The Streamlit app can be containerised and served behind a reverse proxy. The LangGraph orchestrator pattern works at any scale.

3. Clarity & Conciseness

One agent, one job. The architecture is legible: you can point at any node and explain what it does, why it's there, and why it uses the tool it uses. The conditional branching is a single decision point (grey zones exist?).

4. Collaboration & Engagement

Steve Jobs makes the system valuable to candidates, not just recruiters. This turns a B2B screening tool into something with direct user value: a career advisor that gives you a ranked to-do list for your next role.

5. Ethical Considerations

  • No protected attributes (age, gender, nationality) enter the scoring
  • Marie Curie's rules are transparent and auditable: no silent LLM disqualifications
  • Lamarr always cites its evidence: no black-box decisions in grey zones
  • Language requirements are operational constraints, not cultural signals (handled by Marie Curie, not Alan Turing)
  • All models run locally: no candidate data leaves the system

What We'd Build With More Time

  • Two-stage retrieval: use a lighter embedding model for fast top-50 candidate retrieval, then the full model for final reranking. This is a common pattern in production semantic search systems; so far we have only designed it on paper and would implement it in a production version.
  • Feedback loop: Collect recruiter accept/reject decisions and adjust Serena Williams's weights over time. Light online learning with zero retraining.
  • Explainability dashboard: Johannes Gutenberg extended with a visual breakdown of each score component, useful for HR audits and regulatory compliance.
  • Multi-language CV support: the embedding model (multilingual-e5-base) already handles multilingual text; Ada Lovelace would be extended to parse CVs natively in Spanish, Catalan, and English without preprocessing.
  • REST API layer: Expose the full agent graph as an API so it can plug into existing ATS systems. John von Neumann becomes a service, not a script.
  • Synthetic CV generation at scale: Programmatic generation of edge-case CVs to stress-test Lamarr and calibrate Alan Turing's thresholds.

Download external resume data

The Team

Five people, two tracks, one system.

Backend · Agents · Orchestration · Pipeline

An industrial engineer, a computer engineer, and an AI engineer: the people who built the agents and made them talk to each other.

Frontend · Presentation · Documentation · Impact

A biomedicine specialist and a business & technology expert: the people who made the system legible, defensible, and worth presenting.

The agent names are a small tribute to that structure: each one carries the spirit of a discipline that someone on this team lives in.


"We did not use a prebuilt agent. We designed a custom multi-agent architecture where each of the seven agents has a specific, justified role (from CV parsing to semantic matching, hard filtering, ambiguity reasoning, scoring, coaching, and display), coordinated by an orchestrator that adapts the flow based on what it finds at every step."

About

SmartCV is a custom multi-agent system that matches CVs to job vacancies using semantic embeddings and LLM reasoning, ranking opportunities and coaching candidates on skill gaps. Built with LangGraph, multilingual-e5-base, ChromaDB and Llama 3.
