DataPilotFlow

Intelligent Enterprise Knowledge Platform — RAG Sources and Retrieval Management

DataPilotFlow is an open, event-driven platform for building and querying enterprise knowledge bases powered by Retrieval-Augmented Generation (RAG). It automates the full pipeline from document ingestion to intelligent, source-cited conversational answers — deployable in minutes with a single Docker Compose command.

Overview

DataPilotFlow turns your documents, websites, and third-party knowledge sources into a queryable intelligence layer. Users upload files or configure crawling sources, the platform processes and indexes them asynchronously into a vector database, and a multi-agent AI system answers natural language questions with full source attribution — streamed in real time over WebSocket.

The system is built as a set of loosely coupled Python microservices backed by a React dashboard, connected through an event-driven message bus. Every component is containerized and production-ready.

In practice: configure a source, run an ingestion job, open a conversation, and ask questions. The platform handles extraction, chunking, embedding, retrieval, and generation.

Key Features

Multi-Agent Orchestration — A supervisor agent coordinates a RAG agent, tool calls, and external MCP servers using LangGraph
Event-Driven Processing — Document ingestion runs asynchronously via RabbitMQ; jobs are non-blocking and resumable
Multiple Source Types — Ingest from local files (PDF, Markdown, text), web crawling (single page, multi-page, website), and Confluence spaces
Vector Search — Milvus stores and queries high-dimensional embeddings for semantic similarity retrieval
MCP Server Exposure — The RAG agent exposes its retrieval capabilities as a Model Context Protocol server for external agent integration
Real-Time Streaming — Answers stream to the browser via WebSocket; job progress updates in real time
Tool Registry — Connect and manage remote MCP tool servers; agents discover and invoke tools dynamically
Role-Based Access Control — Admin, User, and Viewer roles with JWT authentication
Production Docker Deployment — Nine containerized services with health checks, dependency ordering, and persistent data volumes

Screenshots

Dashboard

The home screen provides a live overview of knowledge sources, ingestion jobs, active conversations, registered tools, and MCP servers. Quick actions surface the most common workflows.

Processing Jobs

Knowledge ingestion jobs process data sources into searchable vector embeddings. Each job tracks its configuration, execution status, and completion timestamp.

Job Status and Monitoring

The job status view shows status distribution across all jobs, a processing activity timeline, and per-job progress metrics including document and chunk counts.

Crawling Sources

Configure data sources for extraction and indexing. Supports web scraping (single page, multiple pages, full website crawler), local file paths, and Confluence space imports.

AI Agents

Create and manage reusable AI agents. RAG agents handle document retrieval; Assistant agents act as supervisors that orchestrate tools, RAG lookups, and multi-turn reasoning.

Tools and MCP Servers

Register remote MCP servers and manage their exposed tools. Agents discover and invoke tools from any connected server at runtime.

Vector Status

Inspect vector collections stored in Milvus — record counts, dimension sizes, job contributions, and individual chunk records with their source URLs.

Architecture

High-Level System Diagram

graph LR
    User["Browser"] --> Dashboard["Dashboard — :3000"]
    Dashboard --> API["API Server — :8800"]
    API --> Agent["Assistant Agent"]
    Agent --> RAG["RAG Agent — :65510"]
    RAG --> Milvus["Milvus"]
    API --> RabbitMQ["RabbitMQ"]
    RabbitMQ --> Processors["Processors"]
    Processors --> Milvus
    Processors --> MongoDB["MongoDB"]
    API --> MongoDB

Knowledge Ingestion Pipeline

graph LR
    A["User submits a source"] --> B["API validates and stores metadata"]
    B --> C["Event published to RabbitMQ"]
    C --> D["Event Listener triggers Processor"]
    D --> E["Text extraction and parsing"]
    E --> F["Chunking and embedding generation"]
    F --> G["Vectors stored in Milvus"]
    F --> H["Raw documents stored in MinIO"]
    G --> I["Job status updated and user notified"]
    H --> I
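
The chunking step in the pipeline above can be illustrated with a minimal sketch. This README does not say which chunking strategy the processors use, so the fixed-size overlapping character windows below, along with the names `chunk_text`, `chunk_size`, and `overlap`, are assumptions made for illustration only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the real processors may chunk by tokens,
    sentences, or document structure instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "DataPilotFlow indexes documents into a vector database. " * 10
    for i, chunk in enumerate(chunk_text(sample, chunk_size=120, overlap=20)):
        print(i, len(chunk), repr(chunk[:40]))
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of storing some text twice.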

Query and Response Pipeline

graph LR
    A["User submits a question"] --> B["API routes to Assistant Agent"]
    B --> C["Agent calls RAG Agent via MCP"]
    C --> D["RAG Agent embeds the query"]
    D --> E["Semantic search in Milvus"]
    E --> F["LLM generates answer from retrieved chunks"]
    F --> G["Answer streamed via WebSocket"]
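
The semantic search step above amounts to ranking stored chunk embeddings by their similarity to the query embedding. The sketch below is an in-memory stand-in for that idea and does not use the Milvus client. The cosine metric, the toy two-dimensional vectors, and the names `cosine_similarity` and `top_k` are assumptions, since the README does not state which metric or index the platform configures.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings; real ones come from an embedding model.
    index = [("chunk-a", [1.0, 0.0]), ("chunk-b", [0.0, 1.0]), ("chunk-c", [1.0, 1.0])]
    for chunk_id, score in top_k([1.0, 0.0], index, k=2):
        print(chunk_id, round(score, 3))
```

A vector database replaces the linear scan with an approximate nearest-neighbor index, but the ranking it returns plays the same role in the pipeline.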

Project Structure

datapilotflow/
├── datapilotflow-domain/          # Core domain models (Pydantic), configuration, validation
├── datapilotflow-infrastructure/  # DAOs, MongoDB/Milvus/RabbitMQ/MinIO clients
├── datapilotflow-services/        # Business logic: Knowledge, Auth, Notifications, Tools
├── datapilotflow-api/             # FastAPI REST server + WebSocket — Port 8800
├── datapilotflow-rag-agent/       # RAG agent with MCP server exposure — Port 65510
├── datapilotflow-assistant-agent/ # Supervisor agent, multi-turn conversations
├── datapilotflow-events/          # RabbitMQ event publishers and listeners
├── datapilotflow-processors/      # Text extraction, chunking, embedding, web crawling
├── datapilotflow-dashboard/       # React + Vite + Mantine UI frontend — Port 3000
└── docker/                        # docker-compose.yml and Dockerfiles

Module Responsibilities

Module	Role	Key Dependencies
`datapilotflow-domain`	Foundation — models and config	Pydantic only
`datapilotflow-infrastructure`	Data access layer	domain
`datapilotflow-services`	Business logic and orchestration	domain, infrastructure
`datapilotflow-api`	HTTP/WebSocket interface	all modules
`datapilotflow-rag-agent`	Semantic retrieval + MCP server	domain, infrastructure, services
`datapilotflow-assistant-agent`	Supervisor agent + tool routing	domain, infrastructure, services
`datapilotflow-events`	Async messaging via RabbitMQ	domain, infrastructure
`datapilotflow-processors`	Document ingestion pipeline	domain, infrastructure, services
`datapilotflow-dashboard`	Web UI	API over HTTP/WebSocket
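
One way to read the table above is as a layered dependency graph. The sketch below transcribes the Key Dependencies column and uses the standard-library `graphlib` module to confirm the graph has no cycles and to derive an order in which the modules could be built. It assumes that "all modules" for `datapilotflow-api` means every other Python module in the table, and it leaves out the dashboard because that module talks to the API over HTTP/WebSocket rather than importing Python code. It is an illustration, not a tool shipped in the repository.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

DOMAIN = "datapilotflow-domain"
INFRA = "datapilotflow-infrastructure"
SERVICES = "datapilotflow-services"

# Module -> modules it depends on, transcribed from the table.
DEPENDENCIES: Dict[str, Set[str]] = {
    DOMAIN: set(),
    INFRA: {DOMAIN},
    SERVICES: {DOMAIN, INFRA},
    "datapilotflow-rag-agent": {DOMAIN, INFRA, SERVICES},
    "datapilotflow-assistant-agent": {DOMAIN, INFRA, SERVICES},
    "datapilotflow-events": {DOMAIN, INFRA},
    "datapilotflow-processors": {DOMAIN, INFRA, SERVICES},
}
# Assumption: "all modules" means every Python module listed above.
DEPENDENCIES["datapilotflow-api"] = set(DEPENDENCIES)


def build_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return modules so that each appears after its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for position, module in enumerate(build_order(DEPENDENCIES), start=1):
        print(position, module)
```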

Quick Start

Requirements: Docker, Docker Compose, 10 GB free disk space, a modern browser.

# Clone the repository
git clone https://github.com/your-org/datapilotflow.git
cd datapilotflow

# Start all services
cd docker
docker-compose up -d

# Verify services are healthy
docker-compose ps

Once healthy, open http://localhost:3000.

Service Access Points

Service	URL	Description
Dashboard	http://localhost:3000	Web UI
REST API	http://localhost:8800	API server
API Docs	http://localhost:8800/docs	Swagger / OpenAPI
RabbitMQ Console	http://localhost:15675	Message broker management
Milvus	http://localhost:19530	Vector database
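
As a complement to `docker-compose ps`, a short script can probe the HTTP endpoints from the table above. This is a minimal sketch using only the Python standard library. The helper names `http_ok` and `check_services` are hypothetical, and Milvus is left out because this README does not say what its port returns to a plain HTTP GET.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Dict

# URLs taken from the Service Access Points table.
ENDPOINTS: Dict[str, str] = {
    "Dashboard": "http://localhost:3000",
    "API Docs": "http://localhost:8800/docs",
    "RabbitMQ Console": "http://localhost:15675",
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False


def check_services(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
) -> Dict[str, bool]:
    """Map each service name to whether its URL is reachable."""
    return {name: probe(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_services(ENDPOINTS).items():
        print(f"{name:<18} {'reachable' if ok else 'unreachable'}")
```

The `probe` parameter is injectable so the reporting logic can be exercised without a running stack.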

First Steps After Deployment

A default admin account is created automatically on first startup:

Field	Value
Username	`admin`
Password	`admin123`
Email	`admin@datapilotflow.com`
Role	Platform Admin

Change the password immediately after first login via User Management > My Profile.

1. Open http://localhost:3000 and log in with the credentials above.
2. Navigate to Knowledge > Configuration > Crawling Sources and add a source.
3. Go to Knowledge > Configuration > Jobs and create an ingestion job for the source.
4. Monitor progress in Knowledge > Monitoring > Job Status.
5. Once the job completes, open Conversations and ask a question.

Configuration

All environment variables are defined directly in docker/docker-compose.yml. The defaults work out of the box for a local deployment. Key variables per service:

# API Server
API_SERVER_HOST=0.0.0.0
API_SERVER_PORT=8800

# Security — change this before any public deployment
JWT_SECRET_KEY=supersecret

# MongoDB
MONGO_HOST=mongodb
MONGO_PORT=27017
MONGO_DB_NAME=datapilotflow
MONGO_USER=datapilotflow
MONGO_PASS=datapilotflow123

# Milvus Vector DB
VECTOR_DB_HOST=milvus
VECTOR_DB_HTTP_PORT=19530

# RabbitMQ
RABBITMQ_HOST=rabbitmq
RABBITMQ_PORT=5672
RABBITMQ_USER=datapilotflow
RABBITMQ_PASS=datapilotflow123

# MCP RAG Agent
MCP_SERVER_HOST=0.0.0.0
MCP_SERVER_PORT=65510

To change any value, edit the environment block of the relevant service in docker/docker-compose.yml before starting the stack.

Security note: JWT_SECRET_KEY defaults to supersecret. Set it to a long random string before any public or production deployment.
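
One way to produce such a string is Python's standard `secrets` module, as in the minimal sketch below. The function name `generate_jwt_secret` and the 64-byte default are illustrative choices, not values the project prescribes.

```python
import secrets


def generate_jwt_secret(num_bytes: int = 64) -> str:
    """Return a URL-safe random string built from num_bytes of OS randomness."""
    if num_bytes < 32:
        raise ValueError("use at least 32 bytes of randomness for a signing key")
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed line into the relevant environment block
    # in docker/docker-compose.yml.
    print(f"JWT_SECRET_KEY={generate_jwt_secret()}")
```

Keep the generated value out of version control, since anyone who holds it can sign tokens the API will accept.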

Troubleshooting

Dashboard does not load

Run docker-compose ps and verify all services show healthy
Wait up to 60 seconds on first start for Milvus to initialize
Check logs: docker-compose logs dashboard

Ingestion job stays in Pending

Check that the event listener service is running: docker-compose logs datapilotflow-event-listeners
Verify RabbitMQ is healthy: docker-compose ps | grep rabbitmq

Questions return no results

Confirm the ingestion job completed with a non-zero chunk count in the Job Status view
Check the Vector Status page to verify records exist in the collection
Check RAG agent logs: docker-compose logs datapilotflow-mcp-rag

API errors

Open the Swagger UI at http://localhost:8800/docs for request/response documentation
Check API logs: docker-compose logs datapilotflow-api-server

Module Documentation

Each module contains its own detailed README:

datapilotflow-api — REST endpoints, authentication, WebSocket protocol
datapilotflow-rag-agent — RAG implementation and MCP server
datapilotflow-assistant-agent — Agent orchestration and supervisor pattern
datapilotflow-services — Business logic services
datapilotflow-domain — Domain models and core entities
datapilotflow-infrastructure — Database clients and data access
datapilotflow-events — Event publishing and consumption
datapilotflow-processors — Document processing pipeline
docker — Deployment and container configuration

License

DataPilotFlow is licensed under the Apache License 2.0 with a Commons Clause restriction.

Free for individuals — personal use, education, and non-commercial research are fully permitted under Apache 2.0 terms.

Enterprise and commercial use requires a separate license — this includes using the platform within a company, offering it as a managed service, or integrating it into a commercial product.

To inquire about a commercial license: flowdatapilot@gmail.com

See the full LICENSE file for details.
