Diabetes Analysis Dashboard

A comprehensive data analysis pipeline for US diabetes prevalence data with interactive visualizations, statistical validation, and machine learning insights.

Overview

This project analyzes CDC diabetes data (2019-2022) to:

  • Explore geographic and demographic patterns in diabetes prevalence
  • Extract and validate research claims from scientific literature
  • Identify clusters of similar states using K-Means clustering
  • Detect anomalous data points using Isolation Forest
  • Generate an interactive HTML dashboard with ECharts visualizations
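
As a rough illustration of the clustering goal above, the sketch below groups synthetic per-state feature rows with scikit-learn's K-Means. The feature columns, the synthetic values, and the choice of two clusters are assumptions made for this example; they are not taken from the project's `clustering.py`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-state features (NOT real CDC values):
# columns = [prevalence %, year-over-year change in percentage points]
rng = np.random.default_rng(0)
low = rng.normal(loc=[8.0, 0.1], scale=0.3, size=(10, 2))
high = rng.normal(loc=[13.0, 0.4], scale=0.3, size=(10, 2))
features = np.vstack([low, high])

# Standardize so that no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(features)

# n_clusters=2 is chosen only because the toy data has two obvious groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(scaled)

print(labels)
```

On real data the number of clusters would normally be chosen with a diagnostic such as the elbow method or silhouette score rather than fixed in advance.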

Features

  • Interactive Dashboard - Multi-page HTML dashboard with ECharts 5.4.3
  • Geographic Analysis - US choropleth maps showing state-level prevalence
  • Demographic Breakdowns - Analysis by age, sex, race/ethnicity, income, education
  • Trend Analysis - Time series and year-over-year comparisons
  • ML Clustering - K-Means clustering to identify state groupings
  • Anomaly Detection - Isolation Forest to identify outlier states
  • Research Validation - Extract claims from literature and test against data
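
The anomaly detection feature can be pictured with a minimal scikit-learn sketch. The data here is synthetic, and the `contamination` value is an assumption for the example; the project's `anomaly.py` may configure the model differently.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic prevalence values (NOT real CDC data): 49 typical rows
# plus one deliberately extreme row appended at index 49.
rng = np.random.default_rng(42)
typical = rng.normal(loc=10.0, scale=0.5, size=(49, 1))
values = np.vstack([typical, [[25.0]]])

# contamination is the expected share of outliers; 0.05 is an assumed value.
detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(values)  # -1 = anomaly, 1 = normal

outlier_idx = np.flatnonzero(flags == -1)
print(outlier_idx)
```

Isolation Forest scores points by how few random splits are needed to isolate them, so a value far from the rest is flagged without any distributional assumption.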

Quick Start

Installation

# Clone or download the repository, then enter its root directory
cd datasets-gov

# Install in development mode
pip install -e .

# Or install with all optional dependencies
pip install -e ".[all]"

Verify Setup

python -m diabetes_analysis verify-pipeline

Run the Pipeline

# Run complete pipeline (EDA -> Research -> Validation -> Dashboard)
python -m diabetes_analysis run-all --use-sample-claims

# Or run individual steps
python -m diabetes_analysis run-eda
python -m diabetes_analysis run-research --use-samples
python -m diabetes_analysis run-validation
python -m diabetes_analysis generate-dashboard

View Dashboard

After running the pipeline, open the generated dashboard:

diabetes_analysis/output/index.html

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        DATA SOURCES                                  │
│  diabetes_datasets/extracted/diabetes_indicators.csv                │
│  + Additional CDC datasets (mortality, type-2, etc.)                │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    STEP 1: DATA LOADING                              │
│  diabetes_analysis/data/                                             │
│  - loader.py: Load CSV files into pandas DataFrames                 │
│  - cleaner.py: Clean, normalize, validate data                      │
│  - schemas.py: Pydantic schemas for validation                      │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    STEP 2: EDA                                       │
│  diabetes_analysis/eda/                                              │
│  - pandas_analysis.py: Custom aggregations and statistics           │
│  - profiler.py: ydata-profiling reports (optional)                  │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    STEP 3: RESEARCH                                  │
│  diabetes_analysis/research/                                         │
│  - searcher.py: Search for scientific articles                      │
│  - fetcher.py: Fetch article content                                │
│  - extractor.py: Extract statistical claims                         │
│  - claims_registry.py: Store and manage claims                      │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    STEP 4: VALIDATION                                │
│  diabetes_analysis/validation/                                       │
│  - hypothesis_tester.py: Test claims against data                   │
│  - clustering.py: K-Means state clustering                          │
│  - anomaly.py: Isolation Forest anomaly detection                   │
│  - statistics.py: Statistical tests (t-test, chi-square)            │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    STEP 5: VISUALIZATION                             │
│  diabetes_analysis/viz/                                              │
│  - data_embedder.py: Embed data as JSON for frontend                │
│  - generator.py: Generate HTML pages from templates                 │
│  - charts/: Chart configuration modules                             │
│    - geographic.py, demographic.py, correlation.py                  │
│    - timeseries.py, novel.py (sankey, radar, treemap, etc.)         │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         OUTPUT                                       │
│  diabetes_analysis/output/                                           │
│  - index.html: Main dashboard                                        │
│  - pages/: Individual analysis pages                                │
│  - assets/: CSS, JS, and embedded data                              │
└─────────────────────────────────────────────────────────────────────┘
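
To make Step 4 more concrete, here is a minimal sketch of testing a directional claim against data with a one-sided Welch t-test from SciPy. The `Claim` dataclass, the `evaluate_claim` helper, the region names, and the sample values are all hypothetical; they do not reflect the actual interfaces of `hypothesis_tester.py` or `claims_registry.py`.

```python
from dataclasses import dataclass

import numpy as np
from scipy import stats


@dataclass
class Claim:
    """Hypothetical container for a directional claim (not the project's schema)."""
    text: str
    group_a: str
    group_b: str


def evaluate_claim(claim, samples, alpha=0.05):
    """Test 'mean of group_a > mean of group_b' with a one-sided Welch t-test."""
    a = samples[claim.group_a]
    b = samples[claim.group_b]
    result = stats.ttest_ind(a, b, equal_var=False, alternative="greater")
    # A large p-value means the data do not support the claim at this alpha;
    # it does not by itself show that the claim is false.
    return {
        "claim": claim.text,
        "p_value": float(result.pvalue),
        "supported": bool(result.pvalue < alpha),
    }


# Synthetic prevalence samples (NOT real CDC data).
rng = np.random.default_rng(1)
samples = {
    "region_x": rng.normal(12.0, 0.5, size=30),
    "region_y": rng.normal(9.0, 0.5, size=30),
}
claim = Claim("Region X has higher prevalence than Region Y", "region_x", "region_y")
outcome = evaluate_claim(claim, samples)
print(outcome)
```

Welch's variant is used here because it does not assume equal variances between the two groups, which is a safer default when group sizes or spreads differ.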

Usage

CLI Commands

# Show all available commands
python -m diabetes_analysis --help

# Show current configuration
python -m diabetes_analysis show-config

# Run EDA with custom directories
python -m diabetes_analysis run-eda --data-dir /path/to/data --output-dir /path/to/output

# Run research with sample claims (no web fetching)
python -m diabetes_analysis run-research --use-samples

# Run validation
python -m diabetes_analysis run-validation

# Generate dashboard
python -m diabetes_analysis generate-dashboard

# Run full pipeline
python -m diabetes_analysis run-all --use-sample-claims
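
For readers unfamiliar with Click, the sketch below shows how a command group with a `run-eda` subcommand and its two directory options could be declared. It is a hypothetical outline, not the contents of `cli.py`: the default paths and the echoed message are assumptions.

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    """Hypothetical command group mirroring the commands listed above."""


@cli.command("run-eda")
@click.option("--data-dir", default="diabetes_datasets", show_default=True,
              help="Directory containing input CSV files (assumed default).")
@click.option("--output-dir", default="diabetes_analysis/output", show_default=True,
              help="Directory for generated results (assumed default).")
def run_eda(data_dir, output_dir):
    """Placeholder body: a real command would load data and run the analysis."""
    click.echo(f"EDA: reading {data_dir}, writing {output_dir}")


# Invoke the command in-process, as a test would.
result = CliRunner().invoke(cli, ["run-eda", "--data-dir", "/tmp/in"])
print(result.output)
```

Using `CliRunner` keeps the example self-contained; in normal use the group would be called from `__main__.py` so that `python -m diabetes_analysis` dispatches to it.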

As a Python Package

from diabetes_analysis.data import DataLoader
from diabetes_analysis.eda import PandasAnalyzer
from diabetes_analysis.validation import ClusterAnalyzer

# Load data
loader = DataLoader()
datasets = loader.load_all_datasets()

# Run analysis
analyzer = PandasAnalyzer()
results = analyzer.run_all_analyses(datasets)

# Clustering
clusterer = ClusterAnalyzer()
features = clusterer.prepare_state_features(datasets["diabetes_indicators"])
clusters = clusterer.kmeans_clustering(features)

Project Structure

datasets-gov/
├── README.md                      # This file
├── CONTRIBUTING.md                # Development guide
├── setup.py                       # Package configuration
├── requirements.txt               # Dependencies
├── .gitignore                     # Git ignore patterns
├── test_pipeline.py               # Pipeline verification script
│
├── diabetes_analysis/             # Main package
│   ├── __init__.py
│   ├── __main__.py                # python -m diabetes_analysis
│   ├── cli.py                     # Click CLI commands
│   ├── config.py                  # Configuration settings
│   │
│   ├── data/                      # Data loading & validation
│   │   ├── __init__.py
│   │   ├── loader.py              # CSV loading
│   │   ├── cleaner.py             # Data cleaning
│   │   └── schemas.py             # Pydantic schemas
│   │
│   ├── eda/                       # Exploratory analysis
│   │   ├── __init__.py
│   │   ├── pandas_analysis.py     # Custom analysis
│   │   └── profiler.py            # ydata-profiling
│   │
│   ├── research/                  # Research claim extraction
│   │   ├── __init__.py
│   │   ├── searcher.py            # Article search
│   │   ├── fetcher.py             # Content fetching
│   │   ├── extractor.py           # Claim extraction
│   │   └── claims_registry.py     # Claim storage
│   │
│   ├── validation/                # Hypothesis validation
│   │   ├── __init__.py
│   │   ├── statistics.py          # Statistical tests
│   │   ├── clustering.py          # K-Means clustering
│   │   ├── anomaly.py             # Anomaly detection
│   │   └── hypothesis_tester.py   # Claim validation
│   │
│   └── viz/                       # Visualization
│       ├── __init__.py
│       ├── generator.py           # HTML generation
│       ├── data_embedder.py       # Data embedding
│       ├── charts/                # Chart modules
│       │   ├── geographic.py
│       │   ├── demographic.py
│       │   ├── correlation.py
│       │   ├── timeseries.py
│       │   └── novel.py
│       └── templates/             # Jinja2 templates
│
├── diabetes_datasets/             # Raw data files
│   ├── extracted/                 # Processed CSV
│   └── specific/                  # Additional datasets
│
└── scripts/                       # Utility scripts
    ├── fetch_diabetes_data.py
    ├── fetch_specific_diabetes.py
    └── extract_diabetes_data.py

Visualizations

The dashboard includes:

Chart Type           Description
-------------------  -----------------------------------
US Choropleth Map    State-level diabetes prevalence
Bar Charts           Top/bottom states by prevalence
Line Charts          Year-over-year trends
Demographic Bars     Breakdown by age, sex, race
Correlation Heatmap  Feature correlations
K-Means Scatter      State clustering visualization
Anomaly Plot         Outlier state detection
Sankey Diagram       Research claims flow
Radar Chart          Cluster profiles
Treemap              Regional hierarchy
Bump Chart           State ranking changes
Stacked Area         Regional trend contributions
Waterfall            Regional contribution breakdown
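
Each chart in the dashboard is ultimately driven by an ECharts option object. As a sketch of how a chart module might build one in Python and hand it to the page as JSON, consider the bar chart case below. The `bar_chart_option` helper and the row values are hypothetical and are not taken from the project's `charts/` modules or `data_embedder.py`.

```python
import json


def bar_chart_option(title, rows):
    """Build an ECharts bar-chart option from (label, value) pairs, highest first."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return {
        "title": {"text": title},
        "tooltip": {"trigger": "axis"},
        "xAxis": {"type": "category", "data": [label for label, _ in ordered]},
        "yAxis": {"type": "value", "name": "Prevalence (%)"},
        "series": [{"type": "bar", "data": [value for _, value in ordered]}],
    }


# Placeholder values for illustration only (not CDC figures).
rows = [("State A", 9.5), ("State B", 13.2), ("State C", 11.0)]
option = bar_chart_option("Diabetes prevalence by state", rows)

# Serialize for embedding in a <script> tag; escaping "</" keeps the JSON
# from terminating the script element early.
embedded = json.dumps(option).replace("</", "<\\/")
print(embedded)
```

On the page, the resulting JSON would be passed to `chart.setOption(...)` on an ECharts instance. Keeping the option as a plain dictionary makes it easy to unit-test without a browser.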

Dependencies

Required

  • Python >= 3.10
  • pandas >= 2.0
  • numpy >= 1.24
  • scipy >= 1.11
  • scikit-learn >= 1.3
  • jinja2 >= 3.1
  • pydantic >= 2.5
  • click >= 8.1

Optional

  • ydata-profiling >= 4.6 (for detailed EDA profiles)
  • httpx >= 0.25 (for article fetching)
  • beautifulsoup4 >= 4.12 (for content extraction)
  • pytest >= 7.4 (for testing)

Data Sources

This project uses public data from:

License

This project is for educational and research purposes.
