HireMe is an end-to-end hiring workflow that moves a candidate from application submission to interview scheduling, offer signature, and Slack onboarding.
- See `docs/overview.md` for a concise component-by-component summary.
This project includes:
- Candidate-facing career flow with job listings and application form
- Resume parsing and first-layer screening
- LLM-based candidate scoring and shortlisting
- Candidate research enrichment (LinkedIn/X/GitHub/portfolio)
- Interview scheduling orchestration with Google Calendar holds
- Offer-letter e-signature flow with DocuSign
- Post-signature Slack onboarding automation with AI-personalized welcome message
- This prototype uses test/dummy candidate records and test integration accounts.
- No production or real candidate PII should be used for demo/submission runs.
- Any mocked outputs (for example fallback transcript summaries) are explicitly documented.
- Backend: FastAPI (Python)
- Frontend: Next.js
- Database: PostgreSQL
- Object storage: AWS S3 (resume + offer artifacts)
- Queues/async workers: AWS SQS
- AI inference: AWS Bedrock
- Email: SMTP-based notification service
- Runtime/orchestration: Docker + docker-compose
- Container registry: AWS ECR stores versioned Docker images for the backend, frontend, and worker services.
- Secrets handling: AWS Secrets Manager stores API keys, tokens, DB credentials, and integration secrets (DocuSign/Slack/Fireflies/etc.) instead of hardcoding in code or images.
- Compute runtime: AWS EC2 VM runs pulled Docker images (from ECR) and starts application services.
- Frontend delivery: CloudFront is placed in front of the frontend origin for caching, HTTPS delivery, and lower-latency global access.
- Build Docker images for services.
- Push images to AWS ECR.
- Provision/update secrets in AWS Secrets Manager.
- On EC2, pull latest images from ECR.
- Inject secrets into runtime environment.
- Run backend/worker services on EC2; run frontend service on EC2 as frontend origin.
- Route frontend traffic through CloudFront (EC2 frontend origin) for edge delivery.
- PostgreSQL schema evolution is migration-driven via Alembic (`alembic upgrade head`).
- Application startup does not run PostgreSQL schema patch DDL/DML.
- Primary LLM tasks (screening/research synthesis): AWS Bedrock primary model
- Secondary LLM tasks (cost-optimized summarization/welcome): AWS Bedrock fallback model
- Prompt-based scoring for candidate-job fit (0-100)
- Structured JSON-constrained prompts for deterministic parsing of model output
- Coding support used during project implementation: OpenAI Codex coding agent
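The JSON-constrained scoring described above can be illustrated with a minimal validation sketch. It is not the project's actual parser: the `score` key, the `ScoreResult` type, and the default threshold of 70 are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreResult:
    score: int
    shortlisted: bool


def parse_score_response(raw: str, threshold: int = 70) -> ScoreResult:
    """Validate a JSON-constrained model reply and apply a shortlist threshold.

    The "score" key and the default threshold are hypothetical.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc

    if not isinstance(payload, dict):
        raise ValueError("model output must be a JSON object")

    score = payload.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")

    return ScoreResult(score=score, shortlisted=score >= threshold)
```

Rejecting malformed replies up front, rather than coercing them, keeps parse failures visible so they can be counted or retried.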
- Google Calendar API:
- Reads interviewer availability
- Creates tentative hold slots
- Confirms one selected slot and generates Google Meet link
- Fireflies API (GraphQL):
- Attempts live interview capture and transcript retrieval
- Stores transcript URL + summary on candidate profile
- DocuSign API:
- Sends offer letter envelope for digital signature
- Webhook/callback updates signed/declined state
- Slack API:
- Sends workspace invite after offer signature
- Handles first join event and sends AI-personalized welcome DM
- Sends HR/internal onboarding confirmation message
- SerpAPI + profile extractors:
- LinkedIn discovery and cross-reference
- Portfolio discovery
- GitHub API:
- Repository/activity/language extraction for profile enrichment
- X/Twitter API (v2):
- Standalone extractor exists for handle-based extraction
- The strict enrichment pipeline for shortlisted candidates intentionally uses deterministic Twitter mock output
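As a rough illustration of what a deterministic Twitter mock block could look like, the sketch below derives a stable reference from the candidate identifier. Apart from the `mode=mock` marker, every field name here is hypothetical and not taken from the codebase.

```python
import hashlib
from typing import Any, Dict


def twitter_mock_block(candidate_id: str) -> Dict[str, Any]:
    """Return the same placeholder block for the same candidate on every run."""
    digest = hashlib.sha256(candidate_id.encode("utf-8")).hexdigest()
    return {
        "source": "twitter",
        "mode": "mock",            # marks the block as non-live data
        "mock_ref": digest[:12],   # stable reference, hypothetical field
        "handle": None,            # no identity is asserted in mock mode
        "posts": [],
    }
```

Because the output depends only on the input identifier, downstream synthesis sees identical evidence across retries.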
- Application + screening:
- Candidate submits resume and metadata.
- Resume is parsed asynchronously.
- Lightweight pre-filter runs before LLM to reduce cost/latency.
- If pre-filter passes, LLM score is computed; threshold controls shortlist.
- Research enrichment:
- For shortlisted candidates, enrichment workers gather LinkedIn/GitHub/portfolio evidence plus a deterministic Twitter mock block.
- System cross-checks against resume and generates discrepancies + 3-5 sentence brief.
- LLM synthesis receives curated/sanitized evidence only and returns confidence + provenance refs.
- API returns both:
- `online_research_summary` (legacy raw JSON string)
- `research_summary` (validated typed nested object for safe consumers)
- Confidence gate behavior:
- if `manual_review_required=true` or confidence is `low`, candidate is routed to explicit reviewer path
- scheduling queue endpoint blocks until reviewer action
- Research worker emits quality telemetry counters (manual review, low confidence, high severity flags, parse failures, fallback-model usage).
- Scheduling:
- System finds 3-5 manager slots (45 min) in next business window.
- All offered slots are held immediately to prevent conflicts.
- On candidate confirmation, one slot is finalized and others are released.
- Candidate can accept directly in Google Calendar (`Yes`) without replying to email.
- Current implementation does not depend on email replies to finalize booking.
- Automatic calendar attendee-decline sync (`No` -> auto-cancel/reschedule) is a planned hardening step; current flow supports explicit cancel/reschedule actions.
- Offer + onboarding:
- Offer letter is generated and sent via DocuSign.
- On signature completion, Slack invite flow starts.
- On first Slack join, AI-generated personalized welcome is sent and HR is notified.
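The hold, confirm, and release behavior in the scheduling step above can be sketched with an in-memory model. This is only an illustration under stated assumptions: `HoldBook` and `slot_end` are hypothetical names, and the real system performs these steps against Google Calendar rather than a dictionary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List

SLOT_MINUTES = 45  # slot length stated in the workflow


def slot_end(start: datetime) -> datetime:
    """End time of a slot that begins at `start`."""
    return start + timedelta(minutes=SLOT_MINUTES)


@dataclass
class HoldBook:
    """In-memory stand-in for tentative calendar holds, keyed by candidate."""

    holds: Dict[str, List[datetime]] = field(default_factory=dict)
    confirmed: Dict[str, datetime] = field(default_factory=dict)

    def offer(self, candidate_id: str, starts: List[datetime]) -> None:
        if not 3 <= len(starts) <= 5:
            raise ValueError("offer between 3 and 5 slots")
        # Hold every offered slot immediately to prevent conflicts.
        self.holds[candidate_id] = sorted(starts)

    def confirm(self, candidate_id: str, start: datetime) -> List[datetime]:
        """Finalize one held slot and return the slots that were released."""
        held = self.holds.get(candidate_id, [])
        if start not in held:
            raise ValueError("slot is not held for this candidate")
        self.confirmed[candidate_id] = start
        released = [s for s in held if s != start]
        del self.holds[candidate_id]
        return released
```

Returning the released slots from `confirm` makes it easy for a caller to delete the matching tentative events in one pass.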
This project intentionally made trade-offs to deliver a working end-to-end system within limited time.
- Pre-filter before LLM scoring
- What we changed:
- Added deterministic prefilter gates before LLM scoring (`prefilter_min_*`, keyword/skill matches, bounded prefilter text).
- Screening now uses token/phrase matching (with small canonical aliases like `js -> javascript`) instead of raw substring checks.
- Requirements are split into must-have vs nice-to-have; first-layer gating uses must-have set.
- Experience gate enforces minimum years and treats maximum years as configurable (`prefilter_enforce_max_years`, default `false`).
- Why:
- To reduce LLM cost, lower queue pressure, and speed up screening.
- Trade-off:
- Some strong but non-standard profiles can still be filtered out early (false negatives), though max-years false rejects are reduced by default.
- Smaller/secondary model for selected AI tasks
- What we changed:
- Used `fallback_model_id` for selected tasks (for example work-summary/welcome-style generation) instead of always using the primary model.
- Why:
- To control inference cost and reduce latency.
- Trade-off:
- Lower accuracy/consistency ceiling compared to always using the primary model.
- Fireflies transcript fallback when extraction is incomplete
- What we changed:
- Kept the interview pipeline moving when Fireflies transcript payload was incomplete by using fallback/mock summary paths instead of hard-failing.
- Why:
- To avoid blocking downstream stages (admin review, status progression) while integration was still unstable.
- Trade-off:
- Transcript fidelity is lower in some edge cases until extraction reliability is hardened.
- Research payload compaction before persistence/LLM synthesis
- What we changed:
- Capped and compacted enrichment payloads (hit limits + clipped fields + `max_research_json_chars`) before storage/synthesis.
- Why:
- To keep token usage bounded and prevent oversized noisy prompts.
- Trade-off:
- Some long-tail evidence is dropped, which can reduce context depth.
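To show why the pre-filter trade-off above moved from substring checks to token matching, here is a minimal sketch. It assumes single-token skills only (phrase matching is omitted), and the function names and the `min_matches` parameter are hypothetical. Only the `js -> javascript` alias comes from the description above; the other alias is illustrative.

```python
import re
from typing import Iterable, Set

# "js" -> "javascript" is mentioned in the text; "ts" is an illustrative extra.
ALIASES = {"js": "javascript", "ts": "typescript"}


def tokenize(text: str) -> Set[str]:
    """Lowercase, split into tokens, and map aliases to canonical names."""
    raw_tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return {ALIASES.get(token, token) for token in raw_tokens}


def passes_prefilter(
    resume_text: str, must_have: Iterable[str], min_matches: int
) -> bool:
    """Pass when enough must-have skills appear as whole tokens."""
    tokens = tokenize(resume_text)
    required = {ALIASES.get(s.lower(), s.lower()) for s in must_have}
    return len(required & tokens) >= min_matches
```

With whole-token matching, a resume that mentions only "JavaScript" no longer satisfies a "java" requirement, which a raw substring check would have accepted.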
| Integration | Current State | Why | Production Plan |
|---|---|---|---|
| Fireflies transcript retrieval | Real integration with fallback/mock summary when transcript payload is incomplete | Avoid blocking interview pipeline on unreliable transcript fields | Webhook-first transcript ID matching + stronger retries + strict real-transcript completion |
| X/Twitter enrichment in shortlist pipeline | Deterministic mock block only (`mode=mock`) in strict orchestrated flow | Avoid false profile attribution and unsupported claims from weak identity resolution | Re-enable only with verified handle/URL confidence checks and evidence grounding |
| Slack admin invite in restricted workspaces | API invite may fail on token limitations; fallback invite-link email is used | Keep onboarding unblocked without enterprise admin token guarantees | Workspace-admin token hardening + deterministic invite API success path |
- LLM scores are not perfectly consistent from run to run.
- LLM dependencies can introduce hallucination risk when external evidence is noisy.
- Research payloads from LinkedIn/X/GitHub can become large/noisy and lose signal after trimming.
- Some legacy flows still rely on broad scans where incremental/indexed patterns are needed for scale.
- Queue-depth-aware backpressure is implemented for webhook queue; broader queue autoscaling/alerting is still infrastructure-dependent.
- A robust LLM circuit-breaker strategy is not fully implemented yet.
- Full load testing/performance characterization has not been completed yet.
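The queue-depth-aware backpressure mentioned in the list above can be pictured as a small decision function with a warning tier and a reject tier. The threshold values and names below are assumptions for illustration; in a deployment the depth would come from the queue's own metrics.

```python
from enum import Enum


class EnqueueDecision(Enum):
    ACCEPT = "accept"
    WARN = "warn"      # accept, but log a warning
    REJECT = "reject"  # refuse so the sender retries later


def decide_enqueue(
    queue_depth: int, warn_at: int = 500, reject_at: int = 2000
) -> EnqueueDecision:
    """Map the current queue depth onto accept / warn / reject."""
    if warn_at > reject_at:
        raise ValueError("warn_at must not exceed reject_at")
    if queue_depth >= reject_at:
        return EnqueueDecision.REJECT
    if queue_depth >= warn_at:
        return EnqueueDecision.WARN
    return EnqueueDecision.ACCEPT
```

Keeping the decision pure (depth in, decision out) makes the thresholds easy to unit test without a live queue.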
These hardening items are now in place for review:
- Durable background processing for webhook and deferred email side effects
- Slack `team_join`, Fireflies transcript completion, and application confirmation email are now queued as durable SQS jobs.
- New worker: `app.scripts.sqs_webhook_event_worker`.
- Idempotency for external webhook retries
- Added `processed_webhook_events` store with claim/complete/fail state transitions.
- Worker enforces idempotency before running side effects.
- Fireflies deterministic candidate mapping (no API-side full scan fallback)
- Fireflies processing now resolves candidates via direct confirmed-meeting-link lookup only.
- Added normalized/indexed meeting-link lookup path in repository/database.
- Queue observability + backpressure controls
- Added queue-depth checks on webhook enqueue with warning/reject thresholds.
- Added periodic worker telemetry logs (success/failure/duplicates + queue depth).
- Deferred async behavior tests
- Added tests for Slack deferred enqueue, Fireflies fast-ACK enqueue, and webhook worker idempotent processing paths.
- Repository hygiene
- Duplicate `* 2.*` artifact files were removed.
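The claim/complete/fail transitions used for webhook idempotency can be sketched as follows. This is an in-memory illustration, not the project's store: `ProcessedEventStore` and `process_once` are hypothetical names, and a database-backed version would need an atomic claim (for example, an insert guarded by a unique constraint) to be safe across workers.

```python
from typing import Callable, Dict


class ProcessedEventStore:
    """In-memory stand-in for a processed-webhook-events table."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}

    def claim(self, event_id: str) -> bool:
        """Return True if the caller may process this event."""
        if self._state.get(event_id) in ("claimed", "completed"):
            return False
        # New events and previously failed events can be claimed.
        self._state[event_id] = "claimed"
        return True

    def complete(self, event_id: str) -> None:
        self._state[event_id] = "completed"

    def fail(self, event_id: str) -> None:
        self._state[event_id] = "failed"


def process_once(
    store: ProcessedEventStore, event_id: str, handler: Callable[[], None]
) -> bool:
    """Run the side effect at most once per successfully processed event."""
    if not store.claim(event_id):
        return False  # duplicate delivery, skip side effects
    try:
        handler()
    except Exception:
        store.fail(event_id)  # allow a later retry to claim again
        raise
    store.complete(event_id)
    return True
```

Marking failures explicitly lets a webhook retry reprocess the event, while completed events stay protected from duplicate side effects.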
Remaining recommendation:
- Add DLQ replay automation and a runbook (DLQ wiring is deployment-specific and should be enabled in infrastructure).
| Requirement Area | Current Behavior | Rationale / Note |
|---|---|---|
| Resume file types | `.pdf`, `.docx`, and `.doc` currently accepted | `.doc` retained for backward compatibility; can be disabled for strict PDF/DOCX-only policy |
| Portfolio/GitHub optionality | `portfolio_url` and `github_url` are optional in application form/payload | Aligned with assignment brief |
| Calendar attendee response sync | Booking finalizes at confirmation API time; attendee `responseStatus` sync is not fully automated | Core scheduling is stable; attendee accept/decline sync is planned hardening |
| Reminder/expiry timing | 24-hour reminder with 48-hour hold expiry | Chosen to nudge early while still enforcing 48-hour slot release window |
- AI scoring and evaluation reliability:
- Build a formal evaluation harness for scoring consistency and fairness.
- Add calibration datasets and repeated-run variance checks.
- Add confidence scoring and stronger human-in-the-loop override guidance.
- Hallucination detection and guardrails:
- Add evidence-grounding checks before persisting AI claims.
- Reject/flag responses that do not map to extracted evidence.
- Add automated contradiction checks between resume and external profiles.
- Database and scalability:
- Replace broad/full scans with indexed query patterns and incremental processing windows.
- Improve worker partitioning and back-pressure handling for high throughput.
- Add deeper observability on queue lag, retry reasons, and processing latency.
- Security and code quality hardening:
- Expand security review (input validation, secret handling, webhook hardening, auth boundaries).
- Run deeper static analysis and quality gates with tools like SonarQube.
- Add production-style load tests + failure-injection tests for deployment confidence.
- LLM/Queue resilience:
- Implement proper backpressure strategy (bounded concurrency, queue-depth-aware throttling).
- Add LLM circuit breaker (timeout/error thresholds, open/half-open/closed states, controlled fallback routing).
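One possible shape for the planned LLM circuit breaker is sketched below, covering the closed, open, and half-open states with fallback routing. It is a simplified, single-threaded illustration: the class name, default thresholds, and the injectable `clock` parameter are assumptions, and timeouts are left to the `primary` callable.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call
        return "open"

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # skip the primary entirely while open
        try:
            result = primary()
        except Exception:
            self._failures += 1
            # Trip at the threshold, or re-open at once if a half-open trial fails.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Here `primary` would wrap the primary-model call and `fallback` the fallback-model or mock path; injecting the clock keeps the state transitions testable without sleeping.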
Because this system uses LLMs in screening, research synthesis, offer drafting, and onboarding communication, the next major improvement would be a task-specific LLM evaluation harness. The goal would not be to measure generic model quality, but to measure whether the model is helping the hiring pipeline make better, safer, and more consistent decisions.
- Screening accuracy
- Compare AI shortlist/reject outcomes against a small human-labeled evaluation set for each role.
- Track precision, recall, and false reject rate, especially for candidates near the shortlist threshold.
- Score consistency
- Run the same candidate through the scoring flow multiple times and measure score variance.
- Check whether repeated runs cause unstable threshold flips around the shortlist cutoff.
- Evidence grounding
- Evaluate whether candidate brief claims are supported by resume data or extracted external evidence.
- Flag unsupported or weakly supported claims to reduce hallucination risk in manager-facing summaries.
- Fairness smoke tests
- Test matched candidate pairs with similar qualifications but different names, schools, or resume phrasing.
- Check whether score differences are driven by job-relevant evidence rather than irrelevant background signals.
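The score-consistency check above could be approximated with a small helper that summarizes repeated runs for one candidate. This is a sketch under assumptions: the function name and the returned keys are hypothetical, and population variance is used simply because the runs are treated as the full set of observations.

```python
from statistics import pvariance
from typing import Dict, Sequence, Union


def score_stability(
    scores: Sequence[int], threshold: int
) -> Dict[str, Union[float, bool]]:
    """Summarize repeated scoring runs for a single candidate."""
    if not scores:
        raise ValueError("at least one score is required")
    above = sum(1 for s in scores if s >= threshold)
    return {
        "variance": float(pvariance(scores)),
        # True when runs disagree about which side of the cutoff the candidate is on.
        "flips": 0 < above < len(scores),
        "shortlist_rate": above / len(scores),
    }
```

A candidate with low variance can still flip if the scores straddle the cutoff, which is why the flip flag is reported separately.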
With more time, I would add a small structured dataset containing:
- target role,
- candidate identifier,
- resume summary or parsed candidate payload,
- expected decision (`reject`, `borderline`, `shortlist`),
- expected score band,
- evaluator notes.
This would allow repeated offline evaluation without depending on live submissions.
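A record for such a dataset might look like the sketch below. The field names, the JSON Lines layout, and the `load_cases` helper are assumptions chosen for illustration; only the list of fields and the three decision labels come from the description above.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

DECISIONS = {"reject", "borderline", "shortlist"}


@dataclass(frozen=True)
class EvalCase:
    target_role: str
    candidate_id: str
    candidate_payload: dict
    expected_decision: str
    expected_score_band: Tuple[int, int]  # inclusive (low, high)
    evaluator_notes: str = ""

    def __post_init__(self) -> None:
        if self.expected_decision not in DECISIONS:
            raise ValueError("unknown decision: " + self.expected_decision)
        low, high = self.expected_score_band
        if not 0 <= low <= high <= 100:
            raise ValueError("score band must satisfy 0 <= low <= high <= 100")


def load_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse one JSON object per line into validated evaluation cases."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        row["expected_score_band"] = tuple(row["expected_score_band"])
        cases.append(EvalCase(**row))
    return cases
```

Validating each record on load means a malformed label fails fast instead of silently skewing the metrics.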
The most useful metrics for this system would be:
- shortlist precision,
- shortlist recall,
- false reject rate,
- average score variance across repeated runs,
- grounding pass rate for generated briefs,
- fairness observations from matched-profile tests.
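The first three metrics can be computed from paired predictions and human labels, as in this sketch. Two assumptions are worth flagging: borderline cases are left out so each pair is a simple shortlist/reject boolean, and false reject rate is defined here as the share of AI rejections that the human label says should have been shortlisted. Other definitions are possible.

```python
from typing import Dict, Iterable, Tuple


def shortlist_metrics(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Each pair is (ai_shortlisted, human_says_shortlist)."""
    tp = fp = fn = tn = 0
    for predicted, expected in pairs:
        if predicted and expected:
            tp += 1
        elif predicted and not expected:
            fp += 1
        elif not predicted and expected:
            fn += 1
        else:
            tn += 1

    def ratio(numerator: int, denominator: int) -> float:
        # Report 0.0 instead of dividing by zero on empty groups.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_reject_rate": ratio(fn, fn + tn),
    }
```

Running this per role, and separately for candidates near the threshold, would show where the scoring prompt is weakest.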
This would improve performance in three ways:
- better screening quality by catching weak prompt/model behavior early,
- better operational trust by making score and brief outputs more consistent,
- better risk control by identifying hallucination and bias issues before they affect hiring decisions.
If extended further, I would add:
- a versioned evaluation dataset in the repository,
- a script to run batch scoring and summarization evaluations,
- regression reporting for prompt/model changes,
- confidence-based routing or human-review escalation when evidence quality is weak.
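The confidence-based routing in the last item mirrors the confidence gate described earlier in this README: a candidate flagged for manual review, or with low confidence, waits for a reviewer before scheduling. A minimal sketch follows; the function name, the stage labels, and the `reviewer_approved` flag are hypothetical.

```python
def next_stage(
    manual_review_required: bool,
    confidence: str,
    reviewer_approved: bool = False,
) -> str:
    """Decide whether a candidate may enter the scheduling queue."""
    needs_review = manual_review_required or confidence == "low"
    if needs_review and not reviewer_approved:
        return "manual_review"  # scheduling stays blocked until a reviewer acts
    return "scheduling_queue"
```

Keeping the rule in one function makes it straightforward to extend later, for example by escalating on weak evidence quality as well as low confidence.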