ActiveInferenceInstitute/Journal-Utilities

Active Inference Institute

Journal-Utilities

A modular, config-driven pipeline for processing the Active Inference Institute video library.

Python 3.12+ Tests Coverage License

Download · Transcribe · Extract · Export · Browse · Chat


How It Works

graph LR
    subgraph Ingest
        A[🎬 YouTube Channel] -->|yt-dlp| B[📥 Download]
        B --> C[📝 Transcripts]
        B --> D[🎵 Audio]
    end

    subgraph Process
        D -->|mlx-whisper / WhisperX| C
        C -->|Cohere AI| E[🧠 Entities & Graph]
        C -->|5 formats| F[📄 Export]
    end

    subgraph Serve
        C --> G[🌐 Web Interface]
        E --> G
        G -->|Ollama RAG| H[💬 Chat]
    end

    style A fill:#e63946,color:#fff
    style G fill:#457b9d,color:#fff
    style H fill:#2a9d8f,color:#fff

One command runs the default pipeline: python run.py

One file controls all options: config.ini (see the Configuration guide)
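As a rough picture of what config-driven step control can look like, the sketch below reads an INI file with the standard library's `configparser` and derives which steps and export formats are enabled. The section names (`pipeline`, `export`), the keys, and the helper names are assumptions for illustration only; the real layout is described in the Configuration guide.

```python
from __future__ import annotations

import configparser

# Hypothetical config.ini contents; the real file's sections and keys may differ.
EXAMPLE_CONFIG = """
[pipeline]
download = false
export = true
serve = true

[export]
formats = markdown, json, html
"""


def read_config(text: str) -> configparser.ConfigParser:
    """Parse INI text into a ConfigParser instance."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def enabled_steps(parser: configparser.ConfigParser) -> list[str]:
    """Return the step names whose value parses as a true boolean."""
    section = parser["pipeline"]
    return [name for name in section if section.getboolean(name)]


def export_formats(parser: configparser.ConfigParser) -> list[str]:
    """Split the comma-separated format list, dropping empty entries."""
    raw = parser.get("export", "formats", fallback="")
    return [item.strip() for item in raw.split(",") if item.strip()]


config = read_config(EXAMPLE_CONFIG)
steps = enabled_steps(config)
formats = export_formats(config)
```

Keeping step toggles in one file means a run can be reproduced by sharing that file alone, without recording command-line flags.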


✨ Features

📥 Download & Transcribe

Enumerate 695+ videos from the Active Inference channel. Download transcripts, audio, and video with cookie auth, rate limiting, and resume. Transcribe locally on Apple Silicon or GPU.

Download Guide · Transcription Engines · YouTube Module
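To make the cookie auth, rate limiting, and resume behavior more concrete, the sketch below builds an options dictionary in the shape that yt-dlp's `YoutubeDL` class accepts. The helper name `build_download_options`, the paths, and the chosen values are assumptions for illustration; the project's own downloader may be configured differently.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def build_download_options(
    output_dir: str,
    cookie_file: str | None = None,
    rate_limit_bytes_per_sec: int | None = None,
) -> dict[str, Any]:
    """Build a yt-dlp options dict (hypothetical helper, not the project's API)."""
    out = Path(output_dir)
    opts: dict[str, Any] = {
        # Name each file after its video ID inside the output directory.
        "outtmpl": str(out / "%(id)s.%(ext)s"),
        # Prefer an audio-only stream, which is enough for transcription.
        "format": "bestaudio/best",
        # Fetch uploaded and auto-generated English subtitles when available.
        "writesubtitles": True,
        "writeautomaticsub": True,
        "subtitleslangs": ["en"],
        # Resume partially downloaded files and skip already-archived videos.
        "continuedl": True,
        "download_archive": str(out / "downloaded.txt"),
        # Sleep a random 2 to 5 seconds between downloads.
        "sleep_interval": 2,
        "max_sleep_interval": 5,
    }
    if cookie_file is not None:
        opts["cookiefile"] = cookie_file  # Netscape-format cookies file
    if rate_limit_bytes_per_sec is not None:
        opts["ratelimit"] = rate_limit_bytes_per_sec
    return opts


# Example values only; the directory and cookie path are assumptions.
options = build_download_options(
    "data/output",
    cookie_file="cookies.txt",
    rate_limit_bytes_per_sec=1_000_000,
)
```

With yt-dlp installed, a dictionary like this could be passed as `yt_dlp.YoutubeDL(options).download([url])`.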

🌐 Web Interface & Chat

FastAPI SPA with searchable video library, embedded YouTube player, transcript viewer, and category browser. Ollama-powered RAG chat with automatic context retrieval.

Web Interface · Chat Engine
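The context-retrieval step of a RAG chat can be pictured roughly as follows: score transcript chunks against the question, keep the best matches, and place them in the prompt. This is a minimal sketch that uses plain word overlap as the score; the function names, the scoring method, and the prompt wording are all assumptions, and the project's chat engine may retrieve context quite differently.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k chunks that share the most words with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(chunk)), chunk) for chunk in chunks]
    # sorted() is stable, so chunks with equal scores keep their original order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked if score > 0][:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt string."""
    joined = "\n\n".join(context) if context else "(no matching transcript excerpts)"
    return (
        f"Context:\n{joined}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )


# Made-up transcript chunks, for illustration only.
chunks = [
    "Active inference minimizes expected free energy when selecting policies.",
    "The livestream opened with announcements about upcoming events.",
    "Free energy bounds surprise in variational inference.",
]
selected = retrieve("What is free energy?", chunks)
prompt = build_prompt("What is free energy?", selected)
```

The resulting prompt string is what would be sent to the configured Ollama model. A production retriever would more likely rank chunks by embedding similarity than by word overlap.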

📄 Multi-Format Export

Batch-export to Markdown, JSON, HTML, PDF, and plaintext — each enriched with metadata headers (title, category, series, speakers, duration, URL, views).

Export Guide
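A metadata-enriched export can be sketched as a header built from the fields listed above, followed by the transcript body. The example below renders the Markdown variant; the `VideoMetadata` class, the header layout, and the sample values are assumptions for illustration rather than the exporter's actual output.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VideoMetadata:
    """Fields named in the export description; the class itself is hypothetical."""
    title: str
    category: str
    series: str
    speakers: list[str]
    duration_seconds: int
    url: str
    views: int


def format_duration(seconds: int) -> str:
    """Format a duration in seconds as H:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"


def render_markdown(meta: VideoMetadata, transcript: str) -> str:
    """Return a Markdown document: metadata header, blank line, transcript."""
    header = [
        f"# {meta.title}",
        "",
        f"- Category: {meta.category}",
        f"- Series: {meta.series}",
        f"- Speakers: {', '.join(meta.speakers)}",
        f"- Duration: {format_duration(meta.duration_seconds)}",
        f"- URL: {meta.url}",
        f"- Views: {meta.views}",
    ]
    return "\n".join(header) + "\n\n" + transcript.strip() + "\n"


# Placeholder values, not taken from the real video library.
meta = VideoMetadata(
    title="Example Session",
    category="Livestream",
    series="Example Series",
    speakers=["Speaker A", "Speaker B"],
    duration_seconds=3725,
    url="https://www.youtube.com/watch?v=EXAMPLE",
    views=100,
)
document = render_markdown(meta, "Hello and welcome.\n")
```

Building the header from one metadata object keeps the five formats consistent, since each renderer would read the same fields.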

🧠 Knowledge Extraction

Cohere AI entity extraction (people, concepts, theories, organizations) and relationship mapping into a SurrealDB knowledge graph.

RAG Pipeline · Data & Database
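The shape of the extracted data can be pictured as typed entities plus directed relationships between them. The sketch below keeps both in memory; the class names, the type labels, and the relation name are assumptions for illustration, and it does not reflect the project's actual SurrealDB schema or the Cohere response format.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Entity kinds named in the feature description, in a hypothetical singular form.
ENTITY_TYPES = {"person", "concept", "theory", "organization"}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


@dataclass(frozen=True)
class Relationship:
    source: str
    relation: str
    target: str


@dataclass
class KnowledgeGraph:
    """Minimal in-memory stand-in for a graph database."""
    entities: dict[str, Entity] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def add_entity(self, name: str, type: str) -> None:
        """Register an entity once, rejecting unknown entity types."""
        if type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {type}")
        self.entities.setdefault(name, Entity(name, type))

    def relate(self, source: str, relation: str, target: str) -> None:
        """Add a directed edge between two entities that already exist."""
        for name in (source, target):
            if name not in self.entities:
                raise KeyError(name)
        self.relationships.append(Relationship(source, relation, target))

    def neighbors(self, name: str) -> list[str]:
        """Return the targets of all edges leaving the named entity."""
        return [r.target for r in self.relationships if r.source == name]


# Illustrative entries; a real run would take these from the extraction output.
graph = KnowledgeGraph()
graph.add_entity("Karl Friston", "person")
graph.add_entity("free energy principle", "theory")
graph.relate("Karl Friston", "proposed", "free energy principle")
```

Validating entity types and edge endpoints at insert time keeps malformed extraction output from silently entering the graph.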


🚀 Quick Start

# 1. Clone & install
git clone https://github.com/ActiveInferenceInstitute/Journal-Utilities.git
cd Journal-Utilities
uv venv && source .venv/bin/activate
uv pip install -e ".[dev,interface,export]"

# 2. Run the default pipeline (Config → Validate → Export → Test → Serve)
python run.py

Pipeline Commands

python run.py config       # Show current configuration
python run.py download     # Download from YouTube
python run.py export       # Export transcripts to all enabled formats
python run.py test         # Run 389-test suite
python run.py serve        # Launch web UI at http://localhost:8000
python run.py full         # Full pipeline: download → export
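
The command list above suggests a simple dispatch: no argument runs the default sequence, `full` runs download then export, and any other command runs a single step. The sketch below resolves a command line into a list of step names with `argparse`. It is an illustration of that pattern, not the contents of run.py; the function and constant names are assumptions, and the step names are taken from the sequences listed in this README.

```python
from __future__ import annotations

import argparse

# Step sequences as described in the Quick Start and command list.
DEFAULT_STEPS = ["config", "validate", "export", "test", "serve"]
FULL_STEPS = ["download", "export"]
SINGLE_COMMANDS = ["config", "download", "export", "test", "serve"]


def resolve_steps(argv: list[str]) -> list[str]:
    """Map command-line arguments to the ordered list of steps to run."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument(
        "command",
        nargs="?",          # the command is optional
        default=None,       # None means "run the default pipeline"
        choices=SINGLE_COMMANDS + ["full"],
    )
    args = parser.parse_args(argv)
    if args.command is None:
        return list(DEFAULT_STEPS)
    if args.command == "full":
        return list(FULL_STEPS)
    return [args.command]


default_run = resolve_steps([])
full_run = resolve_steps(["full"])
serve_only = resolve_steps(["serve"])
```

Resolving commands to a plain list of step names keeps the dispatch logic testable without running any step.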

🏗️ Architecture

graph TB
    subgraph "CLI Layer"
        RUN["run.py — Pipeline Runner"]
        S1["scripts/download_channel.py"]
        S2["scripts/transcribe_missing.py"]
        S3["scripts/scaffold_youtube_courses.py"]
    end

    subgraph "src/journal_utilities/"
        direction TB
        YT["youtube/<br/>channel · playlist · categorizer"]
        DL["download/<br/>downloader"]
        TR["transcribe/<br/>mlx-whisper · WhisperX"]
        EX["export/<br/>exporter (5 formats)"]
        DATA["data/<br/>database · importer"]
        RAG["rag/<br/>extractors · graph · models"]
        IF["interface/<br/>app · chat_engine · data_loader"]
        RN["render/<br/>renderer"]
    end

    subgraph "External Services"
        OL["Ollama (LLM)"]
        DB["SurrealDB"]
        CO["Cohere AI"]
        YT_API["YouTube (yt-dlp)"]
    end

    RUN --> EX & IF & DL
    S1 --> YT & DL
    S2 --> TR
    S3 --> RN

    DL --> YT_API
    RAG --> CO & DB
    DATA --> DB
    IF --> OL

    style RUN fill:#e63946,color:#fff
    style IF fill:#457b9d,color:#fff
    style EX fill:#2a9d8f,color:#fff
Full directory tree
Journal-Utilities/
├── src/journal_utilities/        # Main package
│   ├── youtube/                  #   Channel enumeration, categorizer
│   ├── download/                 #   yt-dlp download engine
│   ├── transcribe/               #   MLX-Whisper + WhisperX
│   ├── data/                     #   SurrealDB client + Coda importer
│   ├── interface/                #   FastAPI SPA + Ollama chat
│   ├── rag/                      #   Entity extraction pipeline
│   ├── render/                   #   Course scaffolding
│   ├── export/                   #   Multi-format transcript export
│   └── utils/                    #   Shared utilities
├── scripts/                      # CLI tools
├── tests/                        # 389 tests (pytest)
├── data/                         # Input, output, database storage
├── docs/                         # 10 module guides
├── run.py                        # Pipeline runner
├── config.ini                    # All configuration
└── pyproject.toml                # Python 3.12+

📚 Documentation

All technical detail lives in docs/. The README you're reading is the overview and entry point. See docs/JOURNAL_SCHEMA.md for the ActiveInferenceJournal v2 schema and docs/REFACTOR_READINESS.md for the refactor pipeline (scripts/refactor_journal.py).

| Guide | What You'll Find |
| --- | --- |
| Configuration | config.ini sections, environment variables, pipeline step control |
| YouTube | Channel enumeration, playlist parsing, title categorization |
| Download | Cookie auth, 403 troubleshooting, download strategies |
| Transcription | MLX-Whisper (Mac), WhisperX (GPU), model selection |
| Export | Format details, metadata enrichment, library API |
| Web Interface | API endpoints, SPA frontend, development server |
| Chat Engine | Ollama RAG, prompt engineering, model auto-discovery |
| RAG & Graph | Cohere extraction, entity schema, knowledge graph |
| Data & Database | SurrealDB schema, Coda import, audit trails |
| Render | Playlist → course scaffolding, module.md format |
| Agent Guide | Architecture, code patterns, agent development rules |

🧪 Testing

python run.py test                         # Via pipeline runner
uv run pytest tests/ -v --cov=src          # With coverage report

🔧 Environment

Required in .env:

| Variable | Purpose |
| --- | --- |
| HUGGINGFACE_TOKEN | WhisperX speaker diarization |
| COHERE_API_KEY | Entity extraction (RAG) |
| CODA_API_TOKEN | Coda session data |
| OLLAMA_MODEL | Chat model (default: gemma3:4b) |
| OLLAMA_BASE_URL | Ollama API URL (default: http://localhost:11434) |
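
The two Ollama variables have documented defaults, so a loader can fall back to them when the variables are unset. The sketch below shows one way to do that with `os.environ`; the `ChatSettings` class and `load_chat_settings` function are hypothetical names, and the sketch assumes the values from .env have already been exported into the process environment.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatSettings:
    model: str
    base_url: str


def load_chat_settings(env: Mapping[str, str] | None = None) -> ChatSettings:
    """Read Ollama settings, using the documented defaults when unset."""
    source = os.environ if env is None else env
    return ChatSettings(
        model=source.get("OLLAMA_MODEL", "gemma3:4b"),
        # Strip a trailing slash so URL paths can be appended consistently.
        base_url=source.get("OLLAMA_BASE_URL", "http://localhost:11434").rstrip("/"),
    )


defaults = load_chat_settings({})
custom = load_chat_settings({"OLLAMA_MODEL": "custom-model",
                             "OLLAMA_BASE_URL": "http://localhost:9999/"})
```

Accepting an optional mapping makes the loader easy to test without modifying the real environment.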

🙏 Acknowledgements

WhisperX pipeline & SurrealDB: Holly Grimm @hollygrimm (2024)
YouTube download & local Whisper: 2025–2026
AssemblyAI scripts: Dave Douglass (2022)

About

Utilities and documentation for creating content for the Active Inference Journal
