A comprehensive data analysis pipeline for US diabetes prevalence data with interactive visualizations, statistical validation, and machine learning insights.
This project analyzes CDC diabetes data (2019-2022) to:
- Explore geographic and demographic patterns in diabetes prevalence
- Extract and validate research claims from scientific literature
- Identify clusters of similar states using K-Means clustering
- Detect anomalous data points using Isolation Forest
- Generate an interactive HTML dashboard with ECharts visualizations
Key features:
- Interactive Dashboard - Multi-page HTML dashboard with ECharts 5.4.3
- Geographic Analysis - US choropleth maps showing state-level prevalence
- Demographic Breakdowns - Analysis by age, sex, race/ethnicity, income, education
- Trend Analysis - Time series and year-over-year comparisons
- ML Clustering - K-Means clustering to identify state groupings
- Anomaly Detection - Isolation Forest to identify outlier states
- Research Validation - Extract claims from literature and test against data
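The clustering and anomaly-detection features above can be illustrated with a minimal scikit-learn sketch. It is not the project's `clustering.py` or `anomaly.py`: the state names, the two feature columns, the helper names `cluster_states` and `flag_outliers`, and every numeric value are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Made-up state-level features: [prevalence %, year-over-year change in points].
# State names and values are hypothetical, not CDC figures.
states = ["A", "B", "C", "D", "E", "F", "G"]
features = np.array([
    [8.0, 0.1], [8.3, 0.2], [8.1, 0.0],     # lower-prevalence group
    [12.9, 0.4], [13.2, 0.5], [13.0, 0.3],  # higher-prevalence group
    [20.0, 3.0],                            # deliberately extreme row
])


def cluster_states(X, n_clusters=2, seed=0):
    """Standardize the features, then assign each row to a K-Means cluster."""
    scaled = StandardScaler().fit_transform(X)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(scaled)


def flag_outliers(X, contamination=0.15, seed=0):
    """Return a boolean mask that is True for rows Isolation Forest marks as anomalous."""
    model = IsolationForest(contamination=contamination, random_state=seed)
    return model.fit_predict(X) == -1  # fit_predict returns -1 for outliers


# Cluster the six ordinary rows, then screen all seven rows for outliers.
labels = cluster_states(features[:6])
outlier_mask = flag_outliers(features)
outliers = [name for name, flagged in zip(states, outlier_mask) if flagged]

print(dict(zip(states[:6], labels.tolist())))
print("Flagged as anomalous:", outliers)
```

Scaling before K-Means keeps a wide-range column from dominating the distance calculation. Isolation Forest is tree-based and does not need scaling, but its `contamination` value fixes the share of rows that get flagged, so it should be chosen with care.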
# Clone or download the repository
cd datasets-gov
# Install in development mode
pip install -e .
# Or install with all optional dependencies
pip install -e ".[all]"
python -m diabetes_analysis verify-pipeline
# Run complete pipeline (EDA -> Research -> Validation -> Dashboard)
python -m diabetes_analysis run-all --use-sample-claims
# Or run individual steps
python -m diabetes_analysis run-eda
python -m diabetes_analysis run-research --use-samples
python -m diabetes_analysis run-validation
python -m diabetes_analysis generate-dashboard
After running the pipeline, open the generated dashboard:
diabetes_analysis/output/index.html
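If you prefer to open the dashboard from Python, something like the following should work. It uses only the standard library and assumes the default output path shown above; the helper names `dashboard_uri` and `open_dashboard` are hypothetical and not part of the package.

```python
import webbrowser
from pathlib import Path

# Default output location named in this README.
DASHBOARD = Path("diabetes_analysis/output/index.html")


def dashboard_uri(path: Path = DASHBOARD) -> str:
    """Return a file:// URI for the dashboard, resolved against the current directory."""
    return path.resolve().as_uri()


def open_dashboard(path: Path = DASHBOARD) -> bool:
    """Open the dashboard in the default browser; return False if it has not been generated."""
    if not path.is_file():
        print(f"Dashboard not found at {path}; run the pipeline first.")
        return False
    return webbrowser.open(dashboard_uri(path))


if __name__ == "__main__":
    open_dashboard()
```

Run it from the repository root so that the relative path resolves correctly.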
┌─────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ diabetes_datasets/extracted/diabetes_indicators.csv │
│ + Additional CDC datasets (mortality, type-2, etc.) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 1: DATA LOADING │
│ diabetes_analysis/data/ │
│ - loader.py: Load CSV files into pandas DataFrames │
│ - cleaner.py: Clean, normalize, validate data │
│ - schemas.py: Pydantic schemas for validation │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 2: EDA │
│ diabetes_analysis/eda/ │
│ - pandas_analysis.py: Custom aggregations and statistics │
│ - profiler.py: ydata-profiling reports (optional) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 3: RESEARCH │
│ diabetes_analysis/research/ │
│ - searcher.py: Search for scientific articles │
│ - fetcher.py: Fetch article content │
│ - extractor.py: Extract statistical claims │
│ - claims_registry.py: Store and manage claims │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 4: VALIDATION │
│ diabetes_analysis/validation/ │
│ - hypothesis_tester.py: Test claims against data │
│ - clustering.py: K-Means state clustering │
│ - anomaly.py: Isolation Forest anomaly detection │
│ - statistics.py: Statistical tests (t-test, chi-square) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 5: VISUALIZATION │
│ diabetes_analysis/viz/ │
│ - data_embedder.py: Embed data as JSON for frontend │
│ - generator.py: Generate HTML pages from templates │
│ - charts/: Chart configuration modules │
│ - geographic.py, demographic.py, correlation.py │
│ - timeseries.py, novel.py (sankey, radar, treemap, etc.) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ OUTPUT │
│ diabetes_analysis/output/ │
│ - index.html: Main dashboard │
│ - pages/: Individual analysis pages │
│ - assets/: CSS, JS, and embedded data │
└─────────────────────────────────────────────────────────────────────┘
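To make the hand-off between Steps 3 and 4 concrete, here is a rough, standard-library sketch of pulling a percentage claim out of text and checking it against observed values. It is an illustration only, not the logic in `extractor.py` or `hypothesis_tester.py`: the `Claim` class, the regular expression, the tolerance rule, and the sample numbers are all assumptions.

```python
import re
from dataclasses import dataclass
from statistics import mean

# Matches figures such as "11.3%" or "9 percent" (a deliberately simple pattern).
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|percent)")


@dataclass
class Claim:
    sentence: str     # sentence the figure was found in
    value_pct: float  # claimed percentage


def extract_claims(text: str) -> list[Claim]:
    """Split text into sentences and record every percentage figure found."""
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in PERCENT_RE.finditer(sentence):
            claims.append(Claim(sentence, float(match.group(1))))
    return claims


def is_supported(claim: Claim, observed: list[float], tolerance_pct: float = 1.0) -> bool:
    """Treat a claim as supported when the observed mean is within the tolerance."""
    return abs(mean(observed) - claim.value_pct) <= tolerance_pct


# Hypothetical article text and hypothetical observed prevalence values.
article = "Adult prevalence was about 11.3% nationally. Rates rose over the period."
observed_prevalence = [10.9, 11.6, 11.2]

for claim in extract_claims(article):
    print(claim.value_pct, is_supported(claim, observed_prevalence))
```

A fixed tolerance is the simplest possible check. The t-test and chi-square tests listed under Step 4 would replace it wherever sample sizes and variances are available.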
# Show all available commands
python -m diabetes_analysis --help
# Show current configuration
python -m diabetes_analysis show-config
# Run EDA with custom directories
python -m diabetes_analysis run-eda --data-dir /path/to/data --output-dir /path/to/output
# Run research with sample claims (no web fetching)
python -m diabetes_analysis run-research --use-samples
# Run validation
python -m diabetes_analysis run-validation
# Generate dashboard
python -m diabetes_analysis generate-dashboard
# Run full pipeline
python -m diabetes_analysis run-all --use-sample-claims
from diabetes_analysis.data import DataLoader
from diabetes_analysis.eda import PandasAnalyzer
from diabetes_analysis.validation import ClusterAnalyzer
# Load data
loader = DataLoader()
datasets = loader.load_all_datasets()
# Run analysis
analyzer = PandasAnalyzer()
results = analyzer.run_all_analyses(datasets)
# Clustering
clusterer = ClusterAnalyzer()
features = clusterer.prepare_state_features(datasets["diabetes_indicators"])
clusters = clusterer.kmeans_clustering(features)
datasets-gov/
├── README.md # This file
├── CONTRIBUTING.md # Development guide
├── setup.py # Package configuration
├── requirements.txt # Dependencies
├── .gitignore # Git ignore patterns
├── test_pipeline.py # Pipeline verification script
│
├── diabetes_analysis/ # Main package
│ ├── __init__.py
│ ├── __main__.py # python -m diabetes_analysis
│ ├── cli.py # Click CLI commands
│ ├── config.py # Configuration settings
│ │
│ ├── data/ # Data loading & validation
│ │ ├── __init__.py
│ │ ├── loader.py # CSV loading
│ │ ├── cleaner.py # Data cleaning
│ │ └── schemas.py # Pydantic schemas
│ │
│ ├── eda/ # Exploratory analysis
│ │ ├── __init__.py
│ │ ├── pandas_analysis.py # Custom analysis
│ │ └── profiler.py # ydata-profiling
│ │
│ ├── research/ # Research claim extraction
│ │ ├── __init__.py
│ │ ├── searcher.py # Article search
│ │ ├── fetcher.py # Content fetching
│ │ ├── extractor.py # Claim extraction
│ │ └── claims_registry.py # Claim storage
│ │
│ ├── validation/ # Hypothesis validation
│ │ ├── __init__.py
│ │ ├── statistics.py # Statistical tests
│ │ ├── clustering.py # K-Means clustering
│ │ ├── anomaly.py # Anomaly detection
│ │ └── hypothesis_tester.py # Claim validation
│ │
│ └── viz/ # Visualization
│ ├── __init__.py
│ ├── generator.py # HTML generation
│ ├── data_embedder.py # Data embedding
│ ├── charts/ # Chart modules
│ │ ├── geographic.py
│ │ ├── demographic.py
│ │ ├── correlation.py
│ │ ├── timeseries.py
│ │ └── novel.py
│ └── templates/ # Jinja2 templates
│
├── diabetes_datasets/ # Raw data files
│ ├── extracted/ # Processed CSV
│ └── specific/ # Additional datasets
│
└── scripts/ # Utility scripts
├── fetch_diabetes_data.py
├── fetch_specific_diabetes.py
└── extract_diabetes_data.py
The dashboard includes:
| Chart Type | Description |
|---|---|
| US Choropleth Map | State-level diabetes prevalence |
| Bar Charts | Top/bottom states by prevalence |
| Line Charts | Year-over-year trends |
| Demographic Bars | Breakdown by age, sex, race |
| Correlation Heatmap | Feature correlations |
| K-Means Scatter | State clustering visualization |
| Anomaly Plot | Outlier state detection |
| Sankey Diagram | Research claims flow |
| Radar Chart | Cluster profiles |
| Treemap | Regional hierarchy |
| Bump Chart | State ranking changes |
| Stacked Area | Regional trend contributions |
| Waterfall | Regional contribution breakdown |
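Each chart in the table is ultimately an ECharts `option` object. As a hedged sketch of how a chart module might build one in Python and serialize it for the page, the snippet below produces a bar-chart option for the top states. The `bar_option` helper, the placeholder state codes, and the values are hypothetical. The option keys used (`title`, `tooltip`, `xAxis`, `yAxis`, `series`) are standard ECharts fields.

```python
import json

# Hypothetical prevalence values (percent) keyed by placeholder state codes.
prevalence = {"AA": 9.1, "BB": 13.4, "CC": 11.0, "DD": 7.8}


def bar_option(values: dict[str, float], top_n: int = 3) -> dict:
    """Build an ECharts bar-chart option for the top_n highest values."""
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)[:top_n]
    return {
        "title": {"text": f"Top {len(ranked)} states by diabetes prevalence"},
        "tooltip": {"trigger": "axis"},
        "xAxis": {"type": "category", "data": [name for name, _ in ranked]},
        "yAxis": {"type": "value", "name": "Prevalence (%)"},
        "series": [{"type": "bar", "data": [value for _, value in ranked]}],
    }


# JSON text that a template could place in a script block and pass to setOption().
option_json = json.dumps(bar_option(prevalence), indent=2)
print(option_json)
```

Building the option as a plain dictionary keeps chart configuration testable in Python, with the browser only responsible for rendering.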
- Python >= 3.10
- pandas >= 2.0
- numpy >= 1.24
- scipy >= 1.11
- scikit-learn >= 1.3
- jinja2 >= 3.1
- pydantic >= 2.5
- click >= 8.1
- ydata-profiling >= 4.6 (for detailed EDA profiles)
- httpx >= 0.25 (for article fetching)
- beautifulsoup4 >= 4.12 (for content extraction)
- pytest >= 7.4 (for testing)
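A quick way to check an environment against the core minimums listed above is sketched below using only the standard library. The helper names are hypothetical, and the version comparison is deliberately simplified: it reads only the leading numeric components and ignores pre-release ordering, which `packaging.version` handles more rigorously.

```python
import re
from importlib import metadata

# Minimum versions copied from the core requirements listed above.
MINIMUMS = {
    "pandas": "2.0", "numpy": "1.24", "scipy": "1.11", "scikit-learn": "1.3",
    "jinja2": "3.1", "pydantic": "2.5", "click": "8.1",
}


def version_tuple(version: str) -> tuple[int, ...]:
    """Convert '1.24.3' or '1.3.0rc1' to a tuple of leading integers, e.g. (1, 24, 3)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(minimums: dict[str, str]) -> dict[str, str]:
    """Report, per package, whether it is missing, too old, or acceptable."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = f"{installed} (ok)" if ok else f"{installed} (needs >= {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements(MINIMUMS).items():
        print(f"{name}: {status}")
```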
This project uses public diabetes data from the CDC.
This project is for educational and research purposes.