Benchmark-routed GraphRAG for supply-chain intelligence, packaged as a production-style FastAPI dashboard with Neo4j, visual analytics, and an AI copilot.
This repository is not just a web app. It contains the full project lifecycle:
- exploratory data analysis on the SupplyGraph dataset,
- export of cleaned graph-ready CSVs,
- graph loading into Neo4j,
- benchmarking of three retrieval strategies,
- a deployed dashboard that turns those benchmark findings into a routed AI copilot.
- Project at a glance
- What this repository contains
- End-to-end workflow
- Knowledge graph and RAG architecture
- Repository layout
- Quick start: run the full stack
- Run each part individually
- Configuration reference
- API surface
- What happens on startup
- Troubleshooting
| Item | Value |
|---|---|
| Main application | supply_chain_app/ |
| Backend framework | FastAPI |
| Graph database | Neo4j |
| Frontend style | Single-page dashboard served by FastAPI |
| Core AI pattern | Hybrid retrieval with intent routing |
| Retrieval strategies | Semantic RAG, GraphRAG, Text2Cypher |
| Generation model in project materials | gemma-3-27b-it |
| Embedding model in project materials | gemini-embedding-001 |
| Processed products | 40 |
| Unique plants | 25 |
| Unique storage locations | 13 |
| Product-to-plant links | 276 |
| Product-to-storage links | 276 |
| Observation rows | 70,720 |
| Resulting graph size after bootstrap | 70,798 nodes, 71,272 edges |
| Benchmark notebook | graphrag_benchmark.ipynb |
| EDA notebook | DataExploration/EDA.ipynb |
| Final report | Reports/FinalReport/main.pdf |
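The graph size in the table can be cross-checked against the other rows. The sketch below is only a sanity check, assuming every observation row becomes one Observation node attached by exactly one HAS_OBSERVATION edge:

```python
# Counts copied from the table above.
products = 40
plants = 25
storages = 13
observations = 70_720
product_plant_links = 276
product_storage_links = 276

# Every product, plant, storage location, and observation is a node.
total_nodes = products + plants + storages + observations

# Assumption: each observation row contributes exactly one HAS_OBSERVATION edge.
total_edges = product_plant_links + product_storage_links + observations

print(total_nodes, total_edges)  # 70798 71272
```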
The project turns supply-chain data into a graph-backed decision-support system. It supports:
- KPI dashboards for products, plants, storages, and time-series flow
- operational risk monitoring based on fulfillment ratios
- product inspection with related-product context
- natural-language copilot queries routed to the best retrieval strategy
- benchmark-backed transparency on why a route was selected
The benchmark work in graphrag_benchmark.ipynb shows that no single retrieval method dominates every question type:
| Approach | Average score | Best at | Main weakness |
|---|---|---|---|
| GraphRAG | 4.20 / 5 | relational reasoning, similarity, graph-aware context | depends on good seed retrieval |
| Gemini RAG | 4.17 / 5 | broad narrative synthesis | weaker on exact graph lookups |
| Text2Cypher | 3.62 / 5 | exact structured and analytical queries | weaker on fuzzy reasoning |
That is why the app uses deterministic intent routing instead of forcing every question through one pipeline.
| Area | Purpose | Status in repo |
|---|---|---|
| DataExploration/EDA.ipynb | explores raw SupplyGraph data and produces cleaned outputs | notebook committed |
| DataExploration/Processed/ | cleaned CSV exports used by the app and graph loaders | committed and ready to use |
| DataLoader.py | standalone legacy graph loader from processed CSVs into Neo4j | committed |
| graphrag_benchmark.ipynb | compares Semantic RAG, Text2Cypher, and GraphRAG | committed |
| embeddings/ | notebook-level embedding cache | committed |
| supply_chain_app/ | deployable FastAPI dashboard, API, frontend, loaders, runtime cache | committed |
| Reports/FinalReport/ | final written report, figures, and compiled PDF | committed |
- The processed data needed to run the dashboard is already included.
- The raw SupplyGraph dataset used by the EDA notebook is not included in this repository.
- The app can run without a GEMINI_API_KEY; it falls back to deterministic offline behavior.
- The benchmark notebook does not have the same offline behavior for judging; it expects live API keys.
flowchart LR
A[Raw SupplyGraph CSVs<br/>not committed] --> B[DataExploration/EDA.ipynb]
B --> C[Processed CSV exports]
C --> D[Neo4j knowledge graph]
C --> E[supply_chain_app startup bootstrap]
D --> F[Product profiles]
F --> G[EmbeddingService]
G --> H[Embedding cache<br/>NPZ + metadata]
D --> I[Analytics API]
D --> J[Text2Cypher]
D --> K[GraphRAG]
H --> K
H --> L[Semantic RAG]
M[graphrag_benchmark.ipynb] --> J
M --> K
M --> L
J --> N[Intent router]
K --> N
L --> N
I --> O[FactoryPulse dashboard]
N --> P[AI copilot]
P --> O
- DataExploration/EDA.ipynb investigates the raw supply-chain dataset and exports cleaned tables.
- Those tables land in DataExploration/Processed/.
- Neo4j is populated either automatically at app startup, through supply_chain_app/scripts/load_graph.py, or through the top-level DataLoader.py.
- The benchmark notebook evaluates three retrieval approaches on a 20-question mixed-intent benchmark.
- The deployed app encodes those findings in a simple router:
- structured or analytical questions -> Text2Cypher
- relational or reasoning questions -> GraphRAG
- broad open-ended synthesis -> Semantic RAG
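The routing policy above can be sketched as a small keyword-based classifier. This is a minimal illustration, not the code in intent_router.py: the cue phrases and the `route` function are hypothetical, and only the three strategy names and their ordering come from this README.

```python
from enum import Enum


class Strategy(str, Enum):
    TEXT2CYPHER = "Text2Cypher"
    GRAPHRAG = "GraphRAG"
    SEMANTIC_RAG = "Semantic RAG"


# Hypothetical cue phrases; the real router may use different rules.
STRUCTURED_CUES = ("how many", "list", "which products", "count", "total", "average", "top ")
RELATIONAL_CUES = ("similar", "related", "why", "compare", "depend", "impact")


def route(question: str) -> Strategy:
    """Pick a retrieval strategy with simple, deterministic keyword rules."""
    text = question.lower()
    # Structured or analytical questions go to Text2Cypher first.
    if any(cue in text for cue in STRUCTURED_CUES):
        return Strategy.TEXT2CYPHER
    # Relational or reasoning questions go to GraphRAG.
    if any(cue in text for cue in RELATIONAL_CUES):
        return Strategy.GRAPHRAG
    # Everything else is treated as broad open-ended synthesis.
    return Strategy.SEMANTIC_RAG
```

Because the rules are plain string checks, the same question always takes the same route, which is what makes the choice explainable in the UI.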
The graph model implemented in the code is:
| Node | Key fields |
|---|---|
| Product | code, group, subgroup |
| Plant | id |
| Storage | id |
| Observation | obs_key, date, metric, unit_type, value |

| Relationship | Meaning |
|---|---|
| (:Product)-[:ASSIGNED_TO_PLANT]->(:Plant) | structural production location link |
| (:Product)-[:STORED_IN]->(:Storage) | structural storage location link |
| (:Product)-[:HAS_OBSERVATION]->(:Observation) | daily metric history |
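To make the model concrete, here is a hedged sketch of two read queries written against it. The labels, relationship types, and property names come from the tables above; the query text, the helper names, and the parameter names are illustrative assumptions and are not copied from neo4j_repository.py.

```python
from typing import Any, Dict, Tuple

# Illustrative query: products linked to one plant.
PRODUCTS_FOR_PLANT = (
    "MATCH (p:Product)-[:ASSIGNED_TO_PLANT]->(pl:Plant {id: $plant_id}) "
    "RETURN p.code AS code "
    "ORDER BY code"
)

# Illustrative query: latest observations of one metric for one product.
RECENT_OBSERVATIONS = (
    "MATCH (p:Product {code: $code})-[:HAS_OBSERVATION]->(o:Observation) "
    "WHERE o.metric = $metric "
    "RETURN o.date AS obs_date, o.value AS value "
    "ORDER BY obs_date DESC LIMIT $limit"
)


def products_for_plant(plant_id: Any) -> Tuple[str, Dict[str, Any]]:
    """Return the query text and its parameters; the id is passed through unchanged."""
    return PRODUCTS_FOR_PLANT, {"plant_id": plant_id}


def recent_observations(code: str, metric: str, limit: int = 30) -> Tuple[str, Dict[str, Any]]:
    """Return the query text and parameters for a product's recent metric history."""
    return RECENT_OBSERVATIONS, {"code": code, "metric": metric, "limit": limit}
```

Keeping values in parameters rather than formatting them into the query string avoids injection problems and lets Neo4j reuse the query plan. With the official Python driver, each pair could be passed to `session.run(query, params)`.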
flowchart TB
Browser[Browser UI] --> FastAPI[FastAPI app]
FastAPI --> Analytics[AnalyticsService]
FastAPI --> Copilot[CopilotService]
FastAPI --> Products[Product routes]
FastAPI --> Health[Health route]
Analytics --> Neo4j[(Neo4j)]
Products --> Neo4j
Health --> Neo4j
Copilot --> Router[IntentRouterService]
Router --> T2C[Text2CypherService]
Router --> GR[GraphRAGService]
Router --> SR[SemanticRAGService]
T2C --> Neo4j
T2C --> LLM[LLMService]
GR --> Neo4j
GR --> Embed[EmbeddingService]
GR --> LLM
SR --> Embed
SR --> LLM
Embed --> Cache[EmbeddingStore<br/>runtime/embeddings]
| Strategy | How it works here | Used when |
|---|---|---|
| Semantic RAG | top-k embedding search over product profiles, then prompt-based synthesis | open-ended narrative questions |
| GraphRAG | embedding-based seed retrieval, graph neighborhood expansion, metric-aware context serialization | relational and reasoning questions |
| Text2Cypher | heuristic or LLM-generated read-only Cypher, executed against Neo4j, then summarized | exact structural or analytical questions |
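The top-k embedding search shared by Semantic RAG and the GraphRAG seed step can be sketched in pure Python. This is a minimal illustration under the assumption that product profiles are ranked by cosine similarity; the function names are hypothetical, and the real services may use a vectorized implementation instead.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    profile_vecs: Dict[str, Sequence[float]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Return the k product codes whose profile vectors are closest to the query."""
    scored = [(code, cosine(query_vec, vec)) for code, vec in profile_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors for illustration only; real embeddings have far more dimensions.
toy_profiles = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy_profiles, k=2))
```

The default of 4 mirrors the SEMANTIC_TOP_K default listed in the configuration reference; the GraphRAG seed step would call the same search with the smaller GRAPHRAG_SEED_K value before expanding the graph neighborhood.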
- Text2Cypher only allows read-only Cypher and blocks write-like clauses such as CREATE, MERGE, DELETE, SET, CALL, APOC, and LOAD CSV.
- If Text2Cypher fails to generate or safely execute a query, the router falls back to GraphRAG.
- If no Gemini key is available:
- generation falls back to a deterministic offline text response,
- embeddings fall back to a local hash-based embedding model.
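Two of the safeguards above lend themselves to a short sketch: the read-only check on generated Cypher and the hash-based embedding fallback. Both functions below are assumptions written for illustration. The blocked keyword list comes from this README, but the regular expression, the hashing scheme, and the 64-dimension default are hypothetical and may differ from the app's code.

```python
import hashlib
import math
import re
from typing import List

# Keyword list taken from the README; matching is deliberately conservative.
BLOCKED = re.compile(
    r"\b(CREATE|MERGE|DELETE|SET|CALL|APOC|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def is_read_only(cypher: str) -> bool:
    """Return True only if no blocked keyword appears anywhere in the query."""
    return BLOCKED.search(cypher) is None


def hash_embedding(text: str, dims: int = 64) -> List[float]:
    """Deterministic offline embedding: hash each token into a signed bucket."""
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dims
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    # Normalize to unit length so cosine similarity behaves sensibly.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

A keyword check like this can reject harmless queries, for example one that mentions a property named `set`, which is an acceptable trade-off when the fallback is simply another retrieval route. The hash embedding carries no semantic meaning beyond token overlap, so it keeps the app usable offline rather than matching the quality of a remote embedding model.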
The frontend served from supply_chain_app/app/frontend/static/ exposes three main panels:
- Overview: KPI cards, monthly production vs delivery trend, group distribution, top delivery products, and risk profile
- Operations: plant load, storage pressure, risk watchlist, and product explorer
- AI Copilot: natural-language questions, route explanation, and optional generated Cypher
The UI auto-refreshes data every 90 seconds.
EventualSmartFactory/
|- docker-compose.yml
|- DataLoader.py
|- graphrag_benchmark.ipynb
|- README.md
|- embeddings/
|  `- gemini_products.npz
|- DataExploration/
|  |- EDA.ipynb
|  `- Processed/
|     |- products.csv
|     |- product_plant.csv
|     |- product_storage.csv
|     `- observations.csv
|- supply_chain_app/
|  |- Dockerfile
|  |- README.md
|  |- requirements.txt
|  |- scripts/
|  |  `- load_graph.py
|  |- runtime/
|  |  `- embeddings/
|  |     |- products.npz
|  |     `- products.meta.json
|  `- app/
|     |- main.py
|     |- api/
|     |- core/
|     |- domain/
|     |- frontend/static/
|     |- repositories/
|     |- services/
|     `- storage/
`- Reports/
   `- FinalReport/
      |- main.tex
      |- main.pdf
      `- figures/
| If you want to... | Start here |
|---|---|
| run the full dashboard quickly | docker-compose.yml |
| inspect the deployed backend | supply_chain_app/app/main.py |
| understand the routing logic | supply_chain_app/app/services/rag/intent_router.py |
| inspect graph queries and analytics | supply_chain_app/app/repositories/neo4j_repository.py |
| rebuild the graph manually | supply_chain_app/scripts/load_graph.py or DataLoader.py |
| reproduce EDA | DataExploration/EDA.ipynb |
| reproduce benchmark results | graphrag_benchmark.ipynb |
| read the write-up | Reports/FinalReport/main.pdf |
Running the full stack with Docker Compose is the easiest and most faithful way to run the project.
Prerequisites:
- Docker Desktop with Compose support
- an optional GEMINI_API_KEY if you want live LLM answers instead of offline fallback behavior
PowerShell:
# Optional: only needed for live Gemini-backed generation and embeddings
$env:GEMINI_API_KEY="your_key_here"
docker compose up --build
Then open:
- Dashboard: http://localhost:8000
- FastAPI docs: http://localhost:8000/docs
- Health endpoint: http://localhost:8000/api/v1/health
- Neo4j Browser: http://localhost:7474
Neo4j default credentials from docker-compose.yml:
- username: neo4j
- password: password
docker-compose.yml starts two services:
- neo4j
- supply_chain_app
It also mounts:
- ./DataExploration/Processed into the app as read-only processed input data
- ./supply_chain_app/runtime into the app for persistent embedding cache reuse
On first startup, if Neo4j is empty and AUTO_BOOTSTRAP_DATA=true, the app will:
- create constraints and indexes,
- import the processed CSVs into Neo4j,
- fetch product profiles,
- build or reuse product embeddings,
- start serving the frontend and API.
The very first run can take longer than later runs because of bootstrap and embedding initialization.
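Since the first start can be slow, a script that depends on the app may want to wait for the health endpoint before sending requests. The sketch below is one way to do that with the standard library. It assumes the endpoint answers with HTTP 200 once the app is ready, which this README does not state explicitly; the function names and retry settings are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8000/api/v1/health"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return 0


def wait_until_healthy(
    url: str = HEALTH_URL,
    attempts: int = 30,
    delay_seconds: float = 2.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Example (requires the stack to be running):
#   ready = wait_until_healthy()
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.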
Running only Neo4j is useful when you want the database available for the notebooks or a local app run.
docker compose up -d neo4j
Then open http://localhost:7474 and log in with:
- username: neo4j
- password: password
Running the FastAPI app directly on the host is helpful if you want code reloads or to work directly inside supply_chain_app/.
Prerequisites:
- Python 3.11 recommended, matching the Dockerfile
- Neo4j already running locally or via docker compose up -d neo4j
cd supply_chain_app
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
The app defaults are container-style paths such as /app/data/processed, so for local host runs you should override them.
$env:NEO4J_URI="bolt://localhost:7687"
$env:NEO4J_USER="neo4j"
$env:NEO4J_PASSWORD="password"
$env:NEO4J_DATABASE="neo4j"
# Optional
$env:GEMINI_API_KEY="your_key_here"
# Important for host execution
$env:PROCESSED_DATA_DIR="../DataExploration/Processed"
$env:EMBEDDING_CACHE_NPZ="./runtime/embeddings/products.npz"
$env:EMBEDDING_CACHE_META="./runtime/embeddings/products.meta.json"
$env:AUTO_BOOTSTRAP_DATA="true"
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
Open:
- http://localhost:8000
- http://localhost:8000/docs
The recommended explicit loader is supply_chain_app/scripts/load_graph.py.
Why use it:
- it uses the same config model as the deployed app,
- it only bootstraps if the graph is empty,
- it is the safest manual loading path in this repository.
From inside supply_chain_app/:
$env:NEO4J_URI="bolt://localhost:7687"
$env:NEO4J_USER="neo4j"
$env:NEO4J_PASSWORD="password"
$env:NEO4J_DATABASE="neo4j"
$env:PROCESSED_DATA_DIR="../DataExploration/Processed"
python scripts/load_graph.py
DataLoader.py is a second, older loader that reads directly from DataExploration/Processed.
From the repository root:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install pandas neo4j
$env:NEO4J_URI="bolt://localhost:7687"
$env:NEO4J_USER="neo4j"
$env:NEO4J_PASSWORD="password"
$env:BATCH_SIZE="5000"
$env:CLEAR_EXISTING="false"
python DataLoader.py
If you set:
$env:CLEAR_EXISTING="true"
the script will delete the current Neo4j graph before reloading it.
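The two loader knobs, BATCH_SIZE and CLEAR_EXISTING, can be illustrated with a short sketch of how such settings are typically read and applied. This is not the code in DataLoader.py: the helper names and the set of accepted truthy strings are assumptions, and only the variable names and defaults come from this README.

```python
from typing import Dict, Iterable, Iterator, List, Mapping, TypeVar

T = TypeVar("T")


def parse_bool(value: str) -> bool:
    """Interpret common truthy strings; everything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def loader_settings(environ: Mapping[str, str]) -> Dict[str, object]:
    """Read the loader settings, using the defaults listed in this README."""
    return {
        "batch_size": int(environ.get("BATCH_SIZE", "5000")),
        "clear_existing": parse_bool(environ.get("CLEAR_EXISTING", "false")),
    }


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield rows in lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter batch.
        yield batch
```

With the default batch size of 5000, the 70,720 observation rows would be written in 15 batches, the last one holding 720 rows.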
Notebook file:
DataExploration/EDA.ipynb
The notebook auto-detects raw data in either:
- RawData/
- DataExploration/RawData/
with a structure similar to:
RawData/
|- Nodes/
|- Edges/
`- Temporal Data/
   |- Unit/
   `- Weight/
Those raw source files are not committed here. The notebook is present, but it will only run end-to-end if you supply the original raw dataset in one of the expected locations.
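Before opening the notebook, the expected layout can be checked with a few lines of Python. The sketch below is an assumption about how such detection might work, not the notebook's actual logic: the function name is hypothetical, and it resolves the two candidate folders relative to a base directory that you supply.

```python
from pathlib import Path
from typing import Optional

# Candidate locations and top-level folders named in this README.
CANDIDATES = ("RawData", "DataExploration/RawData")
EXPECTED_SUBDIRS = ("Nodes", "Edges", "Temporal Data")


def find_raw_data(base_dir: Path) -> Optional[Path]:
    """Return the first candidate folder that contains every expected subfolder."""
    for relative in CANDIDATES:
        candidate = base_dir / relative
        if all((candidate / name).is_dir() for name in EXPECTED_SUBDIRS):
            return candidate
    return None


# Example: check from the current working directory.
#   print(find_raw_data(Path.cwd()))
```

A result of `None` means neither location has the full structure, in which case the notebook cannot run end-to-end.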
python -m pip install -r supply_chain_app/requirements.txt
python -m pip install jupyterlab matplotlib seaborn networkx scikit-learn tqdm
jupyter lab DataExploration/EDA.ipynb
The notebook exports the cleaned CSVs used by the app into:
- DataExploration/Processed/products.csv
- DataExploration/Processed/product_plant.csv
- DataExploration/Processed/product_storage.csv
- DataExploration/Processed/observations.csv
Because those files are already committed, you do not need to rerun the notebook just to run the dashboard.
Notebook file:
graphrag_benchmark.ipynb
It requires:
- Neo4j running and loaded with the graph
- GEMINI_API_KEY
- DEEPSEEK_API_KEY
python -m pip install -r supply_chain_app/requirements.txt
python -m pip install jupyterlab matplotlib seaborn networkx scikit-learn tqdm
$env:NEO4J_URI="bolt://localhost:7687"
$env:NEO4J_USER="neo4j"
$env:NEO4J_PASSWORD="password"
$env:GEMINI_API_KEY="your_gemini_key"
$env:DEEPSEEK_API_KEY="your_deepseek_key"
# Optional
$env:DEEPSEEK_JUDGE_MODEL="deepseek-chat"
jupyter lab graphrag_benchmark.ipynb
The notebook covers:
- graph extraction from Neo4j
- graph visualization
- semantic embedding retrieval
- Text2Cypher generation
- GraphRAG subgraph-as-context retrieval
- a 20-question benchmark across structural, analytical, semantic, and reasoning prompts
- LLM-as-judge scoring
- final comparison plots and summary tables
The benchmark notebook uses a separate embedding cache under:
embeddings/gemini_products.npz
This is distinct from the app runtime cache under:
supply_chain_app/runtime/embeddings/
The report already exists as:
Reports/FinalReport/main.pdf
If you want to rebuild it, use a LaTeX distribution such as MiKTeX or TeX Live.
cd Reports/FinalReport
pdflatex main.tex
pdflatex main.tex
Running pdflatex twice is usually enough for stable references and layout in this project because the references are written directly in the document rather than through a separate .bib pipeline.
These are the most important environment variables for the dashboard app.
| Variable | Default in code | Required | Purpose |
|---|---|---|---|
| NEO4J_URI | bolt://localhost:7687 | yes | Neo4j connection URI |
| NEO4J_USER | neo4j | yes | Neo4j username |
| NEO4J_PASSWORD | password | yes | Neo4j password |
| NEO4J_DATABASE | neo4j | no | Neo4j database name |
| GEMINI_API_KEY | empty | no | enables live Gemini generation and remote embeddings |
| PROCESSED_DATA_DIR | /app/data/processed | effectively yes | source directory for bootstrap CSVs |
| EMBEDDING_CACHE_NPZ | /app/runtime/embeddings/products.npz | no | vector cache path |
| EMBEDDING_CACHE_META | /app/runtime/embeddings/products.meta.json | no | vector metadata path |
| AUTO_BOOTSTRAP_DATA | true | no | import processed data into an empty graph at startup |
| BOOTSTRAP_BATCH_SIZE | 5000 | no | load batch size |
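The dashboard variables above map naturally onto a small settings object. The sketch below shows one way to read them with the documented defaults. It is an illustration only: the class and field names are hypothetical, and the actual config.py may be built on a settings library rather than a plain dataclass.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; everything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class DashboardSettings:
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    neo4j_database: str
    gemini_api_key: str
    processed_data_dir: str
    embedding_cache_npz: str
    embedding_cache_meta: str
    auto_bootstrap_data: bool
    bootstrap_batch_size: int

    @classmethod
    def from_env(cls, environ: Mapping[str, str] = os.environ) -> "DashboardSettings":
        """Build settings from environment variables, using the README defaults."""
        get = environ.get
        return cls(
            neo4j_uri=get("NEO4J_URI", "bolt://localhost:7687"),
            neo4j_user=get("NEO4J_USER", "neo4j"),
            neo4j_password=get("NEO4J_PASSWORD", "password"),
            neo4j_database=get("NEO4J_DATABASE", "neo4j"),
            gemini_api_key=get("GEMINI_API_KEY", ""),
            processed_data_dir=get("PROCESSED_DATA_DIR", "/app/data/processed"),
            embedding_cache_npz=get("EMBEDDING_CACHE_NPZ", "/app/runtime/embeddings/products.npz"),
            embedding_cache_meta=get("EMBEDDING_CACHE_META", "/app/runtime/embeddings/products.meta.json"),
            auto_bootstrap_data=_as_bool(get("AUTO_BOOTSTRAP_DATA", "true")),
            bootstrap_batch_size=int(get("BOOTSTRAP_BATCH_SIZE", "5000")),
        )

    @property
    def live_llm_enabled(self) -> bool:
        """True when a Gemini key is present; otherwise the offline fallback applies."""
        return bool(self.gemini_api_key)
```

Passing a plain dictionary to `from_env` makes it easy to see which values a host run must override, most notably PROCESSED_DATA_DIR and the two cache paths.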

For the benchmark notebook:

| Variable | Default | Required | Purpose |
|---|---|---|---|
| GEMINI_API_KEY | none | yes for benchmark notebook | retrieval and generation in notebook |
| DEEPSEEK_API_KEY | none | yes for benchmark notebook | judge model access |
| DEEPSEEK_JUDGE_MODEL | deepseek-chat | no | judge model name |
| DEEPSEEK_BASE_URL | https://api.deepseek.com/v1/chat/completions | no | custom judge endpoint override |

For the legacy DataLoader.py script:

| Variable | Default | Required | Purpose |
|---|---|---|---|
| NEO4J_URI | bolt://localhost:7687 | yes | Neo4j connection |
| NEO4J_USER | neo4j | yes | Neo4j username |
| NEO4J_PASSWORD | password | yes | Neo4j password |
| BATCH_SIZE | 5000 | no | row batch size |
| CLEAR_EXISTING | false | no | delete existing graph before reload |

Advanced app defaults from supply_chain_app/app/core/config.py
| Setting | Default |
|---|---|
| APP_NAME | Supply Chain Intelligence Hub |
| APP_VERSION | 2.0.0 |
| API_PREFIX | /api/v1 |
| LLM_MODEL | gemma-3-27b-it |
| EMBED_MODEL | gemini-embedding-001 |
| EMBEDDING_DIMS | 3072 |
| SEMANTIC_TOP_K | 4 |
| GRAPHRAG_SEED_K | 2 |
| GRAPHRAG_PEER_LIMIT | 12 |
| LLM_TIMEOUT_SECONDS | 45 |
Base prefix:
/api/v1
| Method | Route | What it returns |
|---|---|---|
| GET | /api/v1/health | Neo4j and embedding readiness |
| GET | /api/v1/benchmark/strategy | benchmark summary and routing policy |
| GET | /api/v1/analytics/dashboard | KPIs, group mix, top delivery, monthly flow |
| GET | /api/v1/analytics/risk?limit=25 | risk watchlist ranked by fulfillment ratio |
| GET | /api/v1/analytics/factory-floor?plant_limit=12&storage_limit=12 | plant load, storage pressure, network density |
| GET | /api/v1/products?group=A&limit=100 | product list with optional group filter |
| GET | /api/v1/products/{code} | product detail, recent observations, related products |
| POST | /api/v1/copilot/query | routed AI answer, strategy, sources, optional Cypher |
curl -X POST http://localhost:8000/api/v1/copilot/query \
-H "Content-Type: application/json" \
-d "{\"question\":\"Which products are assigned to plant 1903?\"}"
The response includes:
- strategy
- route_reason
- benchmark_reference
- answer
- sources
- cypher when the route is Text2Cypher
- debug_context
- generated_at
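The same copilot call can be made from Python with only the standard library. This is a minimal sketch assuming the app is running on localhost:8000 and accepts the JSON body shown in the curl example; the helper names are hypothetical.

```python
import json
import urllib.request

COPILOT_URL = "http://localhost:8000/api/v1/copilot/query"


def build_copilot_request(question: str, url: str = COPILOT_URL) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_copilot(question: str, timeout: float = 60.0) -> dict:
    """Send the question and return the decoded JSON response."""
    request = build_copilot_request(question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the app to be running):
#   reply = ask_copilot("Which products are assigned to plant 1903?")
#   print(reply["strategy"], reply["answer"])
```

Using `json.dumps` avoids the manual quote escaping that the shell version needs.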
The startup path in supply_chain_app/app/main.py and supply_chain_app/app/services/container.py is:
- build the AppContainer,
- connect to Neo4j,
- optionally bootstrap the graph from processed CSVs if it is empty,
- fetch product profiles from Neo4j,
- build or load embedding cache,
- expose API routes and the static dashboard,
- serve / as the SPA entry point and /static/* for assets.
This means a healthy app startup is more than just "the server is listening"; it implies database connectivity and embedding readiness.
- If docker compose up fails around the Neo4j data volume, create a root-level Data/ directory because the compose file binds Neo4j persistence there.
- If the app starts but the copilot keeps replying with "offline fallback", set GEMINI_API_KEY.
- If a local host run cannot find processed CSVs, make sure PROCESSED_DATA_DIR points to ../DataExploration/Processed from inside supply_chain_app/.
- If the benchmark notebook says DEEPSEEK_API_KEY is missing, add it to the environment before opening the notebook.
- If DataExploration/EDA.ipynb cannot find raw data, place the original dataset under RawData/ or DataExploration/RawData/ with the expected folder structure.
- If Neo4j is running but the app reports degraded health, check http://localhost:8000/api/v1/health and verify that the graph was actually loaded and embeddings were built.
- If you want a safe manual graph import path, prefer supply_chain_app/scripts/load_graph.py over DataLoader.py because it only bootstraps an empty graph and does not clear an existing one.
If you want to understand the whole project with the least friction:
- read this root README,
- open Reports/FinalReport/main.pdf,
- inspect supply_chain_app/app/services/container.py,
- inspect supply_chain_app/app/services/rag/,
- inspect graphrag_benchmark.ipynb,
- inspect DataExploration/EDA.ipynb if you want the preprocessing story.
EventualSmartFactory is best understood as a complete applied GenAI systems project rather than a single app folder. The repository already contains the processed data, the benchmark notebook, the dashboard, and the final report, so the fastest path is to run Docker Compose and open the dashboard. The main things that are optional are the live LLM API keys and the raw source dataset for reproducing the EDA from scratch.
Authors: Mohammed-Rida EL HANI, Othmane AZOUBI, Yahya MANSOUB


