A comprehensive platform for discovering, tracking, and analyzing neuroscience datasets across multiple repositories. This project helps researchers find datasets, track citations, and understand data sharing patterns in neuroscience research.
Develop tools to:
- Bright Data: Help the neuroscience community find datasets across repositories and identify the most cited and used datasets
- Dark Data: Determine which neuroscience papers share and reuse data, and estimate data sharing rates across modalities, journals, and funders
- Incentivize Sharing: Incentivize researchers to share data by showcasing citations and use of shared data
- Data Pipeline: Develop a pipeline that regularly scrapes datasets and papers, and identifies which datasets are associated with each paper, both for sharing and reuse
- Bright Data Dashboard: Create a bright data dashboard that aggregates datasets across repositories and links to papers that cite each dataset
- Queryable Database: Create a queryable database of papers, their shared datasets, and used datasets. We may also build a dark data dashboard with summary plots
- ✅ Built an initial frontend and backend for a bright data dashboard
- ✅ Created an Airflow pipeline that automates ingestion of dataset metadata from DANDI
- ✅ Evaluated Data Citation Corpus and OpenAlex for collecting citations: both index many citations, but neither is complete, and both lack the full text needed to distinguish primary from secondary usage
- Connect frontend filters to backend and add pagination
- Set up CI/CD pipelines and find a hosting platform
- Create ingestion pipelines for OpenNeuro and other data sources
- Investigate full-text sources for papers, such as the PubMed Central Open Access Subset, and ODDPub's NLP approach to finding data citations
- Set up paper ingestion pipeline with LLM-based citation identification
- Docker Desktop (or Docker Engine + Docker Compose)
- At least 4GB of RAM available for Docker
- At least 10GB of free disk space
- Set the Airflow user ID (Linux/Mac only):

  `echo -e "AIRFLOW_UID=$(id -u)" > .env`

  On Windows, the `.env` file is already configured with `AIRFLOW_UID=50000`.
- Initialize Airflow (first time only):

  `docker compose up airflow-init`
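
For contributors who prefer not to depend on a POSIX shell, the `.env` step above can be approximated in Python. This is a minimal sketch, not part of the project: the function names `airflow_uid_line` and `write_env` are hypothetical, and the fallback of `50000` is taken from the Windows note above.

```python
import os


def airflow_uid_line(default_uid: int = 50000) -> str:
    """Build the AIRFLOW_UID line that belongs in .env."""
    # os.getuid exists only on Unix-like systems; elsewhere use the default.
    getuid = getattr(os, "getuid", None)
    uid = getuid() if getuid is not None else default_uid
    return f"AIRFLOW_UID={uid}"


def write_env(path: str = ".env") -> None:
    """Overwrite the given file with the AIRFLOW_UID line (like `> .env`)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(airflow_uid_line() + "\n")


if __name__ == "__main__":
    print(airflow_uid_line())
```

Calling `write_env()` from the repository root overwrites `.env`, just as the shell redirect does, so any other variables already in that file would be lost.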
To start all services:

`docker compose up -d --build --force-recreate`

This command will:
- Rebuild the Docker images before starting the containers (`--build`)
- Recreate all containers even if they already exist (`--force-recreate`)
- Run all services in detached mode (`-d`)

To stop all services:

`docker compose down`

To stop all services and remove all volumes (including database data):

`docker compose down -v`

The `-v` flag deletes all database data. Use it only when you want a completely fresh start.

If you started a PR preview stack with a project name (example: `pr-45`) using `docker-compose.pr.yml`, you must pass the same `-p` and `-f` flags when tearing it down:

`docker compose -p pr-45 -f docker-compose.pr.yml down -v`

After bringing the stack up, you can confirm the running Postgres version with:

`docker compose exec -T postgres psql -U airflow -d airflow -c "SELECT version();"`

For a PR preview stack (example: `pr-45`):

`docker compose -p pr-45 -f docker-compose.pr.yml exec -T postgres psql -U airflow -d airflow -c "SELECT version();"`

Once the project is running, access the following services:

| Service | URL | Credentials |
|---|---|---|
| Frontend Dashboard | http://localhost:3000 | N/A |
| API Backend | http://localhost:8000 | N/A |
| API Documentation (Swagger) | http://localhost:8000/docs | N/A |
| API Documentation (ReDoc) | http://localhost:8000/redoc | N/A |
| Airflow Web UI | http://localhost:8080 | Username: `airflow`, Password: `airflow` |
| pgAdmin (Database Management) | http://localhost:5050 | Email: `admin@admin.com`, Password: `admin` |
| PostgreSQL Database | localhost:5432 | Username: `airflow`, Password: `airflow` |
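
A quick way to confirm that the HTTP services in the table are reachable is a small probe script. The sketch below is an illustration only: the `SERVICES` mapping and the `is_up` helper are hypothetical names, the URLs are copied from the table, and PostgreSQL is left out because it does not speak HTTP.

```python
import urllib.error
import urllib.request

# URLs copied from the services table; the keys are illustrative labels.
SERVICES = {
    "frontend": "http://localhost:3000",
    "api": "http://localhost:8000",
    "airflow": "http://localhost:8080",
    "pgadmin": "http://localhost:5050",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if something answers HTTP at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name:10s} {url:28s} {'up' if is_up(url) else 'down'}")
```

A service that reports `down` right after `docker compose up` may simply still be starting, so it is worth retrying after a short wait before checking the logs.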
When you first access pgAdmin, you should see two pre-configured database servers:
- Local PostgreSQL - Airflow (airflow database) - Stores Airflow metadata
- Local PostgreSQL - DAG Data (dag_data database) - Stores neuroscience datasets
First time connecting to a server:
- Click on a server name
- Enter the password: `airflow`
- Check "Save password" to avoid entering it again
If servers don't appear, try refreshing the browser (Ctrl+Shift+R or Cmd+Shift+R).
.
├── api/  # FastAPI backend service
│   ├── Dockerfile
│   ├── main.py  # Main API application
│   └── requirements.txt  # Python dependencies
│
├── frontend/  # React + TypeScript frontend
│   ├── src/
│   │   ├── App.tsx  # Main React component
│   │   ├── components/  # React components
│   │   ├── services/  # API service layer
│   │   └── types/  # TypeScript type definitions
│   ├── public/  # Static files
│   ├── Dockerfile
│   └── package.json  # Node.js dependencies
│
├── airflow/  # Apache Airflow image + DAGs/config/plugins
│   ├── Dockerfile  # Custom Airflow image with dependencies
│   ├── requirements.txt  # Python package dependencies (Airflow)
│   ├── dags/  # Apache Airflow DAGs
│   │   ├── dandi_ingestion.py  # DAG for ingesting DANDI datasets
│   │   ├── populate_datasets_dag.py  # DAG for populating datasets from multiple sources
│   │   ├── database_example_dag.py  # Example DAG demonstrating database operations
│   │   ├── example_dag.py  # Basic Airflow example
│   │   └── utils/  # Shared utilities for DAGs
│   │       ├── database.py  # Database connection and query utilities
│   │       └── environment.py  # Environment detection utilities
│   ├── config/  # Airflow configuration files
│   │   └── airflow.cfg
│   ├── plugins/  # Custom Airflow plugins
│   └── logs/  # Airflow logs (auto-created)
│
├── database/  # Database initialization scripts
│   ├── init-db.sql  # Initial database schema setup
│   └── pgadmin-servers.json  # pgAdmin server configuration
│
├── docs/  # Documentation
│   ├── API_USAGE.md  # API usage guide
│   ├── DATABASE_SETUP.md  # Database setup details
│   └── data_citation_notes.md
│
├── docker-compose.yml  # Docker Compose configuration
└── README.md  # This file

The project uses PostgreSQL with two databases:
- `airflow` - Stores Airflow metadata (DAG runs, task instances, etc.)
- `dag_data` - Stores neuroscience datasets and related data

The `dandi_dataset` table stores datasets fetched from the DANDI Archive API.
Created by: dandi_ingestion DAG
Columns:
- `dataset_id` (VARCHAR) - Unique identifier from DANDI
- `title` (TEXT) - Dataset title
- `modality` (VARCHAR) - Data modality (e.g., "fMRI", "EEG", "Electrophysiology")
- `citations` (INTEGER) - Number of citations
- `url` (TEXT) - URL to the dataset
- `description` (TEXT) - Dataset description
- `created_at` (TIMESTAMP) - Record creation timestamp
- `updated_at` (TIMESTAMP) - Last update timestamp
- `version` (VARCHAR) - Dataset version

The `neuroscience_datasets` table stores datasets from multiple sources (Kaggle, OpenNeuro, PhysioNet).
Created by: `populate_neuroscience_datasets` DAG (defined in `populate_datasets_dag.py`)
Columns:
- `id` (SERIAL) - Primary key
- `source` (VARCHAR) - Source platform (e.g., "Kaggle", "OpenNeuro", "PhysioNet")
- `dataset_id` (VARCHAR) - Unique identifier from source
- `title` (TEXT) - Dataset title
- `modality` (VARCHAR) - Data modality
- `citations` (INTEGER) - Number of citations
- `url` (TEXT) - URL to the dataset
- `description` (TEXT) - Dataset description
- `created_at` (TIMESTAMP) - Record creation timestamp
- `updated_at` (TIMESTAMP) - Last update timestamp

Indexes:
- `idx_datasets_source` - Index on the source column
- `idx_datasets_modality` - Index on the modality column
- `idx_datasets_citations` - Index on citations (DESC) for sorting

`unified_datasets` is a SQL view that combines data from the `dandi_dataset` and `neuroscience_datasets` tables using a UNION ALL operation.
Purpose: Provides a unified interface to query all datasets regardless of their source, making it easy for the API and frontend to access all datasets with a single query.
How it works:
- Combines data from `dandi_dataset` (marked as source "DANDI") and `neuroscience_datasets` (with their respective sources)
- Standardizes column names and types across both tables
- The API uses this view by default (falls back to the `neuroscience_datasets` table if the view doesn't exist)
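
The UNION ALL combination described above can be tried out without the Docker stack. The following is a minimal sketch under stated assumptions, not the project's actual view definition: it uses an in-memory SQLite database instead of PostgreSQL, keeps only a subset of the documented columns, fills the missing `version` column with `NULL` for `neuroscience_datasets`, and inserts placeholder rows that are not real datasets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dandi_dataset (
    dataset_id TEXT PRIMARY KEY,
    title      TEXT,
    modality   TEXT,
    citations  INTEGER,
    version    TEXT
);
CREATE TABLE neuroscience_datasets (
    id         INTEGER PRIMARY KEY,
    source     TEXT,
    dataset_id TEXT,
    title      TEXT,
    modality   TEXT,
    citations  INTEGER
);

-- Placeholder rows for illustration only.
INSERT INTO dandi_dataset
VALUES ('example-dandi-1', 'Example DANDI set', 'Electrophysiology', 5, 'draft');
INSERT INTO neuroscience_datasets (source, dataset_id, title, modality, citations)
VALUES ('OpenNeuro', 'example-on-1', 'Example OpenNeuro set', 'fMRI', 12);

-- DANDI rows get a constant source label; the other table has no version.
CREATE VIEW unified_datasets AS
SELECT 'DANDI' AS source, dataset_id, title, modality, citations, version
FROM dandi_dataset
UNION ALL
SELECT source, dataset_id, title, modality, citations, NULL AS version
FROM neuroscience_datasets;
""")

# One query now covers both tables.
rows = conn.execute(
    "SELECT source, dataset_id, citations "
    "FROM unified_datasets ORDER BY citations DESC"
).fetchall()
print(rows)
```

UNION ALL keeps every row from both tables without deduplication, which is the cheaper choice when the two tables are not expected to overlap.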
Auto-creation: The view is automatically created/updated when:
- The `populate_neuroscience_datasets` DAG runs (after table creation)
- The `dandi_ingestion` DAG runs (after DANDI data insertion)

Manual refresh: You can manually create or refresh the view using:
- API endpoint: `POST http://localhost:8000/api/refresh-view`
- Or by calling `create_unified_datasets_view()` from the database utilities

View Structure:
SELECT
source,
dataset_id,
title,
modality,
citations,
url,
description,
created_at,
updated_at,
version
FROM unified_datasets

The project runs the following Docker services:

| Service | Description | Port |
|---|---|---|
| postgres | PostgreSQL database server | 5432 |
| airflow-webserver | Airflow web UI for managing DAGs | 8080 |
| airflow-scheduler | Airflow scheduler (runs DAGs) | N/A |
| airflow-init | One-time initialization service | N/A |
| pgadmin | Database management web UI | 5050 |
| api | FastAPI backend service | 8000 |
| frontend | React frontend application | 3000 |
The FastAPI backend provides REST endpoints for accessing neuroscience datasets.
- `GET /` - Health check
- `GET /api/health` - Database health check (includes view status)
- `GET /api/datasets` - Fetch datasets with optional filters (source, modality, search)
- `GET /api/datasets/stats` - Get dataset statistics
- `POST /api/refresh-view` - Manually create or refresh the `unified_datasets` view
- `GET /api/debug/view-info` - Debug endpoint to check view status and data sources

GET /api/datasets supports:
- `source` - Filter by source (DANDI, Kaggle, OpenNeuro, PhysioNet)
- `modality` - Filter by modality (fMRI, EEG, Electrophysiology, etc.)
- `search` - Search in title and description (case-insensitive)

Example:
GET http://localhost:8000/api/datasets?source=DANDI&modality=fMRI&search=visual
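
The same request can be built from Python with only the standard library. This is a minimal client sketch: `datasets_url` and `fetch_datasets` are hypothetical helper names, the base URL assumes the local stack is running, and the shape of the JSON response is not assumed here, so the parsed payload is returned as-is.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "http://localhost:8000"  # local development stack


def datasets_url(source=None, modality=None, search=None, base=API_BASE):
    """Build the /api/datasets URL, skipping filters that are not set."""
    params = {"source": source, "modality": modality, "search": search}
    query = urllib.parse.urlencode(
        {key: value for key, value in params.items() if value is not None}
    )
    return f"{base}/api/datasets" + (f"?{query}" if query else "")


def fetch_datasets(**filters):
    """Send the GET request and return the decoded JSON payload."""
    with urllib.request.urlopen(datasets_url(**filters), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(datasets_url(source="DANDI", modality="fMRI", search="visual"))
```

`urlencode` also escapes spaces and other special characters, so a multi-word `search` value does not need manual quoting.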
For detailed API usage, see docs/API_USAGE.md.
- `dandi_ingestion` - Fetches and ingests datasets from the DANDI Archive API
  - Schedule: Daily (`@daily`)
  - Creates/updates the `dandi_dataset` table
  - Automatically creates/refreshes the `unified_datasets` view
- `populate_neuroscience_datasets` - Populates datasets from multiple sources (Kaggle, OpenNeuro, PhysioNet)
  - Creates/updates the `neuroscience_datasets` table
  - Automatically creates/refreshes the `unified_datasets` view
- `database_example_dag` - Demonstrates database operations with environment detection
- `example_dag` - Basic Airflow example for learning

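
The "creates/updates" behavior of the ingestion DAGs can be pictured as an upsert keyed on `dataset_id`. Whether the DAGs actually use this statement is an assumption; the sketch below only illustrates the idea with an in-memory SQLite database (version 3.24 or newer), a reduced column set, and placeholder records. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form.

```python
import sqlite3

# Insert a new dataset, or update the existing row with the same dataset_id.
UPSERT = """
INSERT INTO dandi_dataset (dataset_id, title, version)
VALUES (:dataset_id, :title, :version)
ON CONFLICT(dataset_id) DO UPDATE SET
    title = excluded.title,
    version = excluded.version,
    updated_at = CURRENT_TIMESTAMP
"""

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dandi_dataset (
    dataset_id TEXT PRIMARY KEY,
    title      TEXT,
    version    TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")


def ingest(records):
    """Apply the upsert to a batch of records in one transaction."""
    with conn:
        conn.executemany(UPSERT, records)


# Two runs over the same placeholder dataset: the second updates in place.
ingest([{"dataset_id": "example-1", "title": "First title", "version": "draft"}])
ingest([{"dataset_id": "example-1", "title": "Revised title", "version": "0.1"}])

rows = conn.execute(
    "SELECT dataset_id, title, version FROM dandi_dataset"
).fetchall()
print(rows)
```

Because the second run updates the existing row instead of inserting a duplicate, a daily schedule can re-fetch the full catalog without growing the table.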
- Access Airflow UI at http://localhost:8080
- Log in with username `airflow` and password `airflow`
- Find your DAG in the list
- Toggle the DAG to enable it (if paused)
- Click the play button to trigger a manual run, or wait for the scheduled run
- View all logs: `docker compose logs -f`
- View specific service logs:
  - `docker compose logs -f airflow-scheduler`
  - `docker compose logs -f airflow-webserver`
  - `docker compose logs -f frontend`
  - `docker compose logs -f api`
- Restart services: `docker compose restart`
- Restart specific service: `docker compose restart frontend`
- Rebuild and restart: `docker compose up -d --build`
- Stop all services: `docker compose down`
- Stop and remove volumes: `docker compose down -v`

- Connect to PostgreSQL (from host): `psql -h localhost -p 5432 -U airflow -d dag_data` (password: `airflow`)
- Check view status via API: `curl http://localhost:8000/api/debug/view-info`
- Refresh unified view via API: `curl -X POST http://localhost:8000/api/refresh-view`

Permission errors on Linux/Mac:
- Make sure `AIRFLOW_UID` in `.env` matches your user ID

Port already in use:
- Change the port in `docker-compose.yml` under the respective service's `ports` section

DAGs not appearing:
- Check the scheduler logs: `docker compose logs -f airflow-scheduler`
- Ensure your DAG files are in the `airflow/dags/` directory
- Verify DAG files don't have syntax errors

Frontend not loading:
- Check frontend logs: `docker compose logs -f frontend`
- Verify the API is running: `docker compose logs -f api`
- Check browser console for errors

Only seeing data from one source in frontend:
- Check view status: Visit `http://localhost:8000/api/debug/view-info`
- Refresh the view: `POST http://localhost:8000/api/refresh-view`
- Verify both tables have data in pgAdmin
- Check API logs to see which table/view is being queried
- Ensure both DAGs have run successfully (`dandi_ingestion` and `populate_neuroscience_datasets`)

API connection errors:
- Check if backend is running: `docker compose logs -f api`
- Verify database connection: `GET http://localhost:8000/api/health`
- Check if tables exist in pgAdmin

Empty datasets in frontend:
- The frontend shows a "No Datasets Found" message if the database is empty
- Run the DAGs to populate data: `populate_neuroscience_datasets` and `dandi_ingestion`
- Check if DAGs completed successfully in Airflow UI

Database reset:
- To completely reset the database: `docker compose down -v`
- This deletes all data. Re-initialize with: `docker compose up airflow-init`

- This setup uses the LocalExecutor, which is suitable for local development
- For production, consider using CeleryExecutor or KubernetesExecutor
- The database is stored in a Docker volume and persists between restarts
- Frontend has hot reload enabled for development
- All services automatically restart on failure
- API Usage Guide - Detailed API documentation and examples
- Database Setup - Database configuration details
- Data Citation Notes - Notes on data citation research
[TBD]