DogezRule/AI-Kent

Innovation Academy Beacon RAG System

A complete RAG (Retrieval-Augmented Generation) chatbot for Innovation Academy, built with Ollama, ChromaDB, and Flask. It crawls the Beacon website, ingests calendar events, staff data, and FAQs, then serves an AI assistant that can answer questions about IA with sourced responses.

Features

Core

  • WordPress REST API Integration — Ingests all posts, pages, and custom post types with tag/category metadata
  • HTML Crawling — Deep crawls the entire beacon2.fcsia.com website with link following
  • Google Calendar Sync — Fetches school calendar events (filtered to recent + future only)
  • Staff Directory — Direct name lookup with fuzzy matching
  • FAQ Knowledge Base — Ingests curated FAQ.txt for common questions
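
The staff lookup with fuzzy matching described above can be sketched with the standard library's difflib. The staff records and the `lookup_staff` helper here are illustrative stand-ins, not the repository's actual data or API:

```python
# Sketch of fuzzy staff-name lookup, as the Staff Directory feature might
# implement it. Staff data and function names are illustrative.
import difflib

STAFF = {
    "mr kent": {"name": "Mr. Kent", "role": "Teacher"},
    "ms smith": {"name": "Ms. Smith", "role": "Counselor"},
}

def lookup_staff(query: str, cutoff: float = 0.6):
    """Return the closest staff record for a (possibly misspelled) name."""
    key = query.lower().replace(".", "").strip()
    matches = difflib.get_close_matches(key, STAFF.keys(), n=1, cutoff=cutoff)
    return STAFF[matches[0]] if matches else None

print(lookup_staff("mr knet"))  # tolerates the typo
```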

AI & Search

  • Vector Search — Fast semantic search using ChromaDB + nomic-embed-text embeddings
  • AI-Powered Image Analysis — OCR and content understanding via Ollama vision models (beacon posts only)
  • Conversational Context — Handles follow-up questions ("what about healthcare?", "the one after that?")
  • Smart Source Citation — Links to Google Calendar, Staff Directory, and specific Beacon articles (filters out generic pages)
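
The retrieval step behind "Vector Search" can be illustrated offline. The real system embeds text with nomic-embed-text via Ollama and queries ChromaDB; here a toy bag-of-words vector stands in for the embedding so the example runs without either service:

```python
# Toy illustration of semantic retrieval: embed query and documents,
# rank by cosine similarity. The documents and scoring are illustrative.
import math
from collections import Counter

DOCS = {
    "calendar": "flex friday event schedule for healthcare pathway",
    "staff": "mr kent teaches in the it pathway",
    "post": "engineering students presented showcase projects",
}

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, top_k: int = 1) -> list:
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)
    return ranked[:top_k]

print(search("who teaches it"))
```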

Web Interface

  • Flask Web App — Clean dark-themed chat UI at http://localhost:5000
  • Chat History — Persisted in local storage across sessions
  • Current Date Display — Shows today's date in the welcome message
  • Friendly Source Links — Up to 5 sources shown with readable labels (📅 Calendar, 👤 Staff, 🔗 Articles)

Efficiency

  • No-Images Cache — Pages with no images are remembered and skipped on future ingests
  • Calendar Date Filter — Only ingests events from the past week onward (no old events from previous years)
  • Image Analysis Scoping — Only analyzes images on Beacon blog posts, not external or generic pages
  • WordPress Tags — Tags and categories are fetched and included in ingested content for better search
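
The no-images cache described above amounts to a persisted set of URLs consulted before image analysis. The file name matches the repo layout (data/no_images_cache.json); the helper names below are illustrative:

```python
# Sketch of the no-images cache: pages found image-free are remembered
# and skipped on future ingests. Helper names are illustrative.
import json
from pathlib import Path

CACHE_PATH = Path("no_images_cache.json")

def load_cache() -> set:
    if CACHE_PATH.exists():
        return set(json.loads(CACHE_PATH.read_text()))
    return set()

def save_cache(cache: set) -> None:
    CACHE_PATH.write_text(json.dumps(sorted(cache)))

def maybe_analyze(url: str, image_urls: list, cache: set) -> bool:
    """Return True if the page's images should be analyzed."""
    if url in cache:          # remembered as image-free: skip entirely
        return False
    if not image_urls:        # no images this time: remember and skip
        cache.add(url)
        return False
    return True

cache = load_cache()
print(maybe_analyze("https://beacon2.fcsia.com/about", [], cache))  # False
save_cache(cache)
```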

Prerequisites

1. Install Ollama

Download and install from https://ollama.ai

2. Pull Required Models

# Text embeddings (required)
ollama pull nomic-embed-text

# Chat model (required)
ollama pull llama3.2

# Vision model for image analysis (optional)
ollama pull moondream

3. Install Python Dependencies

pip install -r requirements.txt

Requires: Python 3.10+, flask (for web UI)

Quick Start

1. Run Ingestion

Crawl the website, sync calendar, and build the vector database:

python ingest.py

First run: ~20–40 minutes (depends on internet and Ollama speed)
Subsequent runs: faster thanks to the no-images cache

2. Start the Web UI

python app.py

Open http://localhost:5000 in your browser.

3. CLI Mode (Alternative)

# Interactive
python chat.py

# Single question
python chat.py "when is the next flex friday"
python chat.py "who is mr kent"

Configuration

Edit config.py to customize:

Setting                  Default           Description
CHAT_MODEL               llama3.2          Ollama model for generating answers
EMBED_MODEL              nomic-embed-text  Model for text embeddings
VISION_MODEL             moondream         Model for image OCR/analysis
IMAGE_ANALYSIS_ENABLED   True              Toggle image analysis on/off
DEFAULT_TOP_K            25                Number of results to retrieve per query
HTML_MAX_DEPTH           12                Maximum crawl depth for link following
HTML_WORKERS             32                Concurrent crawl workers
IMAGES_PER_PAGE          20                Max images to analyze per page
TEMPERATURE              0.3               LLM response creativity (lower = more factual)
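
The settings above are plausibly plain module-level constants; a config.py along these lines (values taken from the defaults table, comments and exact layout assumed):

```python
# Sketch of config.py, mirroring the defaults table above.
CHAT_MODEL = "llama3.2"             # Ollama model for generating answers
EMBED_MODEL = "nomic-embed-text"    # model for text embeddings
VISION_MODEL = "moondream"          # model for image OCR/analysis
IMAGE_ANALYSIS_ENABLED = True       # toggle image analysis on/off
DEFAULT_TOP_K = 25                  # results retrieved per query
HTML_MAX_DEPTH = 12                 # maximum crawl depth
HTML_WORKERS = 32                   # concurrent crawl workers
IMAGES_PER_PAGE = 20                # max images analyzed per page
TEMPERATURE = 0.3                   # lower = more factual answers
```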

Project Structure

AI-Kent/
├── app.py                 # Flask web server
├── chat.py                # Query engine & prompt construction
├── config.py              # All configuration settings
├── ingest.py              # Main ingestion pipeline
├── extractors.py          # HTML & WordPress content extraction
├── image_analyzer.py      # AI-powered image analysis (OCR)
├── utils.py               # URL normalization, date parsing, helpers
├── FAQ.txt                # Curated school FAQ knowledge base
├── staff.json             # Staff directory data
├── requirements.txt       # Python dependencies
├── templates/
│   └── index.html         # Chat web interface
├── data/                  # Organized raw data (auto-generated)
│   ├── posts/             # WordPress blog posts (.json + .md)
│   ├── pages/             # WordPress pages (.json + .md)
│   ├── markdown/          # Crawled HTML pages as markdown (.json + .md)
│   ├── calendar/          # Google Calendar event data
│   ├── images/            # Image analysis results
│   ├── showcase/          # Showcase project data
│   ├── html/              # Raw crawled page data
│   ├── sitemap/           # Crawled URL sitemaps
│   └── no_images_cache.json  # Cache of pages with no images
└── chroma_beacon2/        # ChromaDB vector database

How It Works

Ingestion Pipeline (ingest.py)

  1. Google Calendar — Fetches ICS feed, parses events, filters to past week + future only
  2. WordPress API — Pulls posts/pages with tag & category metadata via REST API
  3. HTML Crawling — Deep crawls with link following (up to 12 levels deep)
  4. Image Analysis — Uses vision model for OCR on Beacon blog post images only (skips cached no-image pages)
  5. FAQ & Staff — Ingests FAQ.txt and staff.json for direct knowledge
  6. Embedding — Generates vector embeddings using nomic-embed-text
  7. ChromaDB — Stores all documents + embeddings for semantic search
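
The tail of the pipeline (steps 6–7) can be sketched offline. The real code embeds chunks with nomic-embed-text via Ollama and stores them in a ChromaDB collection; here a toy embedding and an in-memory list stand in, and the chunk size is an assumption:

```python
# Minimal sketch of the chunk → embed → store tail of the pipeline.
def chunk(text: str, size: int = 200) -> list:
    """Split text into fixed-size pieces before embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

store = []  # stand-in for the ChromaDB collection

def ingest_document(doc_id: str, text: str, embed_fn) -> None:
    for i, piece in enumerate(chunk(text)):
        store.append({"id": f"{doc_id}:{i}",
                      "text": piece,
                      "vector": embed_fn(piece)})

# toy embedding: a 1-dimensional "vector" of the chunk length
ingest_document("post-1", "Flex Friday " * 60, lambda t: [len(t)])
print(len(store))  # 720 characters → 4 chunks
```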

Query Pipeline (chat.py)

  1. Intent Detection — Detects time queries, staff queries, follow-up questions
  2. Calendar Injection — For date questions, fetches and filters upcoming events with keyword matching
  3. Staff Lookup — For "who is..." queries, does fuzzy name matching against staff.json
  4. Semantic Search — Embeds the question and finds top-K relevant documents in ChromaDB
  5. Context Assembly — Builds prompt with calendar events, staff info, FAQ matches, and web content
  6. LLM Generation — Generates answer with conversational context and date awareness
  7. Source Filtering — Returns clean, specific source URLs (filters out generic/category pages)
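
The context-assembly step (step 5) can be sketched as a prompt builder that stitches together calendar events, staff info, and retrieved documents. The section labels and prompt shape below are illustrative, not the repository's exact template:

```python
# Sketch of context assembly: calendar events, staff info, and retrieved
# documents are combined into one date-aware prompt.
from datetime import date

def build_prompt(question, docs, calendar_events=(), staff_info=None):
    parts = [f"Today's date: {date.today().isoformat()}"]
    if calendar_events:
        parts.append("Upcoming events:\n" + "\n".join(calendar_events))
    if staff_info:
        parts.append("Staff info:\n" + staff_info)
    parts.append("Context:\n" + "\n---\n".join(docs))
    parts.append(f"Question: {question}\nAnswer using only the context above.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "When is the next Flex Friday?",
    docs=["Flex Friday is a recurring pathway event."],
    calendar_events=["2025-03-07 Flex Friday: Save a Life"],
)
print(prompt.splitlines()[0])
```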

Source Citation Rules

  • Calendar events → Link to Google Calendar embed
  • Staff info → Link to fcsinnovationacademy.fultonschools.org/staff
  • Beacon articles → Link to specific post URL
  • Generic pages filtered → /updates/..., /engineering/, /health-science/, and /it/ category pages are excluded
  • Internal files hidden → FAQ.txt is used for answers but not listed as a source
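
The filtering rules above can be sketched as a URL predicate. The exclusion list mirrors the category paths named in this section; the function name is illustrative:

```python
# Sketch of source filtering: keep specific article URLs, drop
# homepages, category pages, and internal files.
from urllib.parse import urlparse

GENERIC_PATHS = {"/", "/updates", "/engineering", "/health-science", "/it"}

def is_citable(url: str) -> bool:
    path = urlparse(url).path.rstrip("/") or "/"
    if path in GENERIC_PATHS:
        return False
    if path.endswith("FAQ.txt"):   # internal files are never cited
        return False
    return True

print(is_citable("https://beacon2.fcsia.com/engineering/"))  # False
```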

School-Specific Features

Pathway Synonyms

  • Healthcare = Health Science — These terms are interchangeable throughout the system
  • "Save a Life Flex Friday" — Recognized as a Healthcare/Health Science pathway event

Acronym Expansion

  • FF → Flex Friday
  • IA → Innovation Academy
  • Monday Missions → Weekly video updates
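
The synonym and acronym handling above can be sketched as query normalization before search. The mappings come from this section; the function name and regex approach are illustrative:

```python
# Sketch of query normalization: expand acronyms and map pathway
# synonyms so "FF" and "healthcare" hit the right documents.
import re

EXPANSIONS = {
    r"\bff\b": "flex friday",
    r"\bia\b": "innovation academy",
    r"\bhealthcare\b": "health science",
}

def normalize(query: str) -> str:
    q = query.lower()
    for pattern, replacement in EXPANSIONS.items():
        q = re.sub(pattern, replacement, q)
    return q

print(normalize("when is the next FF?"))  # → "when is the next flex friday?"
```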

Calendar Awareness

  • Knows today's date and uses it for "next event" queries
  • Handles follow-ups: "what about healthcare?" after asking about IT events
  • Treats Flex Friday as single-day events (uses start date only)
  • Supports "after that" / "next one" for sequential event lookup

Example Queries

"When is the next Flex Friday?"
"What about healthcare?"              (follow-up → finds next healthcare event)
"The one after that?"                 (follow-up → finds the next one after)
"Who is Mr. Kent?"
"What are the IT showcase projects?"
"Tell me about the engineering pathway"
"What events are coming up this week?"
"How do I sign up for a Flex Friday?"

Troubleshooting

"Collection not found"

Run python ingest.py first to build the database.

Images not being analyzed

  1. Check Ollama is running: ollama list
  2. Verify vision model: ollama pull moondream
  3. Check config: IMAGE_ANALYSIS_ENABLED = True

Slow ingestion

  • Keep the default vision model: moondream is already one of the lightest options
  • Disable image analysis entirely: IMAGE_ANALYSIS_ENABLED = False
  • Subsequent runs are faster due to the no-images cache
  • Reduce crawl depth: HTML_MAX_DEPTH = 8

Out of memory / Context length errors

The system automatically retries with smaller chunks. If issues persist:

  • Reduce CHUNK_SIZE in config.py
  • Reduce DEFAULT_TOP_K for queries
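
The automatic retry with smaller chunks can be sketched as halving the chunk size until embedding succeeds. The exception type and helper names below are illustrative; the real code would catch whatever error Ollama raises on context overflow:

```python
# Sketch of retry-with-smaller-chunks. embed_fn stands in for the real
# Ollama embedding call; MemoryError stands in for a context-length error.
def embed_with_retry(text, embed_fn, chunk_size=2048, min_size=256):
    while chunk_size >= min_size:
        try:
            return [embed_fn(text[i:i + chunk_size])
                    for i in range(0, len(text), chunk_size)]
        except MemoryError:
            chunk_size //= 2          # halve the chunk size and retry
    raise RuntimeError("text could not be embedded even at minimum chunk size")
```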

Re-ingesting

To update the database with new content:

python ingest.py

This rebuilds everything from scratch. The vector database is deleted and recreated to ensure freshness. The no-images cache persists across runs for efficiency.

License

MIT
