Skip to content

Fireblossom/EviMap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EviMap: Evidence-Grounded Topic Maps for Unlabeled Corpora

Try the Live Demo

Open the hosted dashboard to see the finished experience before running the pipeline locally.

EviMap turns an unfamiliar text corpus into a browsable topic map whose labels can be audited against exact source spans. It is designed for early corpus exploration, when analysts do not yet have a stable taxonomy, coding scheme, or search vocabulary.

The problem EviMap addresses is the scale and verifiability gap: manual coding preserves evidence but does not scale, while clustering, topic models, and one-shot LLM summaries scale but leave users asking why a label should be trusted. EviMap makes generated labels inspectable by grounding them in the phrases that produced them.

The hosted demo lets you browse a three-level map, open linked documents, inspect highlighted evidence phrases, and combine topics as exploratory queries. This repository contains the reproducible pipeline that regenerates the same kind of evidence-grounded artifact from JSONL documents.

documents
  -> LLM evidence phrase extraction
  -> span alignment and phrase index
  -> local sentence-transformer embeddings
  -> KMeans-scaffolded, LLM co-association grouping
  -> leaf topics, mid-level groups, top-level aspects
  -> auditable topic-map artifacts

The hosted dashboard is a published artifact. Reproduction is centered on running the pipeline from input documents, not on downloading a prebuilt static page.

Demo Story

The paper demo starts with job postings because the evidence is easy to judge. In the hosted dashboard, a good path is:

  1. Open the Job Postings corpus.
  2. Drill from a top-level skill or service aspect into a fine-grained language topic such as Bilingual or Multilingual Skill.
  3. Open a supporting document and inspect the highlighted phrase spans.
  4. Click another highlighted phrase in the same document to jump to its topic.
  5. Use Combine Two Topics to find documents that discuss both language requirements and customer-facing work.
  6. Inspect the Other or long-tail material instead of treating the induced map as a final gold taxonomy.

That path is the core claim: labels are not presented as answers to accept. They are claims that can be followed back to the phrases and documents that support them.

What EviMap Shows

  • A compact top-level map of corpus aspects.
  • Fine-grained topics induced from evidence phrases, not whole-document labels.
  • Character-offset evidence spans for each matched phrase.
  • Source documents with highlighted evidence.
  • A packaged static frontend that can be deployed to Cloudflare Pages or any static host.

Paper Alignment

This repository mirrors the system realization described in the paper:

  • Input records use the paper's normalized {doc_id, text, metadata} format.
  • An external OpenAI-compatible LLM builds a lightweight domain profile and extracts exact evidence phrases.
  • Extracted phrases are deduplicated into a phrase index, aligned back to character offsets, and embedded with paraphrase-multilingual-MiniLM-L12-v2.
  • KMeans is used as a coarse scaffold so each LLM grouping call sees a small, related context.
  • Multi-round co-association voting stabilizes grouping decisions.
  • The hierarchy is built bottom-up from evidence phrases into leaf topics, mid-level groups, and a fixed 14-aspect top layer.
  • Intermediate files, prompts, model settings, and span-level provenance are written to the run directory for audit.

The lightweight dashboard generated by this repository is an artifact viewer for local runs. The hosted demo is the full interactive dashboard used to show hierarchy browsing, phrase-click navigation, topic combination, and multi-corpus switching.

1. Install

Use Python 3.9+.

cd evimap_poc
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

2. Configure the External LLM

EviMap uses an OpenAI-compatible chat-completions API for evidence extraction, semantic grouping, and naming. The defaults match the paper setup: a local OpenAI-compatible DeepSeek endpoint.

cp .env.example .env
export EVIMAP_LLM_BASE_URL=http://127.0.0.1:18021/v1
export EVIMAP_LLM_MODEL=deepseek-ai/DeepSeek-V4-Flash
export EVIMAP_LLM_API_KEY=local

For OpenAI-hosted models, leave EVIMAP_LLM_BASE_URL empty or unset and set OPENAI_API_KEY. Do not set EVIMAP_LLM_ENABLE_THINKING unless your OpenAI-compatible backend accepts that extension field.

3. Run the Full Pipeline

python -m evimap.run_pipeline \
  --input data/sample_job_posts.jsonl \
  --output runs/sample_job_posts \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --embedding-backend local \
  --embedding-model paraphrase-multilingual-MiniLM-L12-v2 \
  --top-k 14 \
  --phrase-domains 12 \
  --topic-domains 4 \
  --rounds 5 \
  --chunk-size 60 \
  --min-coassoc 0.7 \
  --workers 64

Equivalent convenience command:

bash scripts/run_sample.sh

Both commands call the configured external LLM. The repository does not replace the paper pipeline with a mock model.

The parameters above mirror the paper-style configuration: DeepSeek-V4-Flash for LLM calls, paraphrase-multilingual-MiniLM-L12-v2 for phrase embeddings, five voting rounds, a 0.7 co-association threshold, and a fixed 14-aspect top layer. The sample corpus is intentionally small so the full external-LLM pipeline can be tested quickly.

For a real corpus, provide JSONL records with this shape:

{"doc_id": "doc-001", "text": "full document text", "metadata": {"source": "optional"}}

4. Inspect the Outputs

The run directory is designed for audit and debugging:

runs/sample_job_posts/
  config.json
  documents.jsonl
  01_profile/domain_profile.json
  02_extraction/extractions.jsonl
  03_index/phrase_entries.jsonl
  03_index/phrase_occurrences.jsonl
  03_index/unmatched_phrases.jsonl
  03_index/phrase_embeddings.npy
  04_leaf_topics/topics.jsonl
  04_leaf_topics/topic_occurrences.jsonl
  05_hierarchy/mid_groups.jsonl
  05_hierarchy/aspects.jsonl
  06_artifact/topic_map.json
  06_artifact/run_report.md
  06_artifact/dashboard.html

Important files:

  • phrase_occurrences.jsonl: every matched evidence phrase with doc_id, start, and end character offsets.
  • topics.jsonl: leaf topics induced from evidence phrases. Each topic keeps member phrase ids and supporting document ids.
  • aspects.jsonl: top-level aspects with nested mid groups and leaf topic ids.
  • topic_map.json: compact artifact for downstream inspection or a dashboard.

5. Preview the Generated Dashboard

The generated dashboard is a lightweight artifact viewer for a local run.

python -m http.server 8000 --directory runs/sample_job_posts/06_artifact

Open http://localhost:8000/dashboard.html.

6. Package and Deploy the Frontend

After a pipeline run finishes, package its generated artifact for a static host:

python scripts/build_frontend.py \
  --run runs/sample_job_posts \
  --out dist

This writes:

dist/
  index.html              landing page
  dashboard.html          generated run viewer
  topic_map.json
  run_report.md
  deploy_manifest.json
  _headers
  _redirects

Preview the packaged frontend:

python -m http.server 8000 --directory dist

For local debugging without copying the artifact, build a small debug site with the same landing page and a symlink to the run output:

python scripts/build_debug_site.py --run runs/sample_job_posts
cd site/debug && python -m http.server 8000

Deploy the packaged frontend to Cloudflare Pages with Wrangler direct upload:

export CLOUDFLARE_API_TOKEN=...
bash scripts/deploy_cloudflare_pages.sh evimap-demo

Useful environment variables:

export RUN_DIR=runs/sample_job_posts
export DIST_DIR=dist
export CF_PAGES_BRANCH=main
export EVIMAP_SITE_TITLE="EviMap"

Equivalent npm scripts are included for convenience:

npm run build:frontend
npm run deploy:pages -- evimap-demo

7. Tests

The included tests do not call the LLM.

PYTHONPATH=. python -m unittest discover -s tests

8. Notes for Larger Corpora

  • Increase --workers only as far as your LLM endpoint can handle.
  • Increase --phrase-domains and keep --chunk-size modest so each LLM grouping call sees a small, related context.
  • Use --max-docs for a paper/demo sample before running a full corpus.
  • The embedding default is paraphrase-multilingual-MiniLM-L12-v2, matching the paper. You can switch to an OpenAI-compatible embedding endpoint with --embedding-backend openai.

Add a New Domain

Want to see EviMap on another corpus or domain? Contact the authors with a short domain description and a few representative JSONL records. We can help prepare a new dashboard run when the data can be shared or sampled safely.

About

Evidence-grounded hierarchical topic maps for exploring unlabeled text corpora, with LLM extraction, co-association grouping, and source-span provenance.

Topics

Resources

Stars

Watchers

Forks

Contributors