DogezRule/AI-Kent

Innovation Academy Beacon RAG System

A complete RAG (Retrieval-Augmented Generation) chatbot for Innovation Academy, built with Ollama, ChromaDB, and Flask. It crawls the Beacon website, ingests calendar events, staff data, and FAQs, then serves an AI assistant that can answer questions about IA with sourced responses.

Features

Core

  • WordPress REST API Integration — Ingests all posts, pages, and custom post types with tag/category metadata
  • HTML Crawling — Deep crawls the entire beacon2.fcsia.com website with link following
  • Google Calendar Sync — Fetches school calendar events (filtered to recent + future only)
  • Staff Directory — Direct name lookup with fuzzy matching
  • FAQ Knowledge Base — Ingests curated FAQ.txt for common questions
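
The staff lookup with fuzzy matching described above can be sketched with the standard library's difflib. The staff records and the `lookup_staff` helper here are illustrative stand-ins, not the repository's actual data or API:

```python
# Sketch of fuzzy staff-name lookup, as the Staff Directory feature might
# implement it. Staff data and function names are illustrative.
import difflib

STAFF = {
    "mr kent": {"name": "Mr. Kent", "role": "Teacher"},
    "ms smith": {"name": "Ms. Smith", "role": "Counselor"},
}

def lookup_staff(query: str, cutoff: float = 0.6):
    """Return the closest staff record for a (possibly misspelled) name."""
    key = query.lower().replace(".", "").strip()
    matches = difflib.get_close_matches(key, STAFF.keys(), n=1, cutoff=cutoff)
    return STAFF[matches[0]] if matches else None

print(lookup_staff("mr knet"))  # tolerates the typo
```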

AI & Search

  • Vector Search — Fast semantic search using ChromaDB + nomic-embed-text embeddings
  • AI-Powered Image Analysis — OCR and content understanding via Ollama vision models (beacon posts only)
  • Conversational Context — Handles follow-up questions ("what about healthcare?", "the one after that?")
  • Smart Source Citation — Links to Google Calendar, Staff Directory, and specific Beacon articles (filters out generic pages)
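
The retrieval step behind "Vector Search" can be illustrated offline. The real system embeds text with nomic-embed-text via Ollama and queries ChromaDB; here a toy bag-of-words vector stands in for the embedding so the example runs without either service:

```python
# Toy illustration of semantic retrieval: embed query and documents,
# rank by cosine similarity. The documents and scoring are illustrative.
import math
from collections import Counter

DOCS = {
    "calendar": "flex friday event schedule for healthcare pathway",
    "staff": "mr kent teaches in the it pathway",
    "post": "engineering students presented showcase projects",
}

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, top_k: int = 1) -> list:
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)
    return ranked[:top_k]

print(search("who teaches it"))
```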

Web Interface

  • Flask Web App — Clean dark-themed chat UI at http://localhost:5000
  • Chat History — Persisted in local storage across sessions
  • Current Date Display — Shows today's date in the welcome message
  • Friendly Source Links — Up to 5 sources shown with readable labels (📅 Calendar, 👤 Staff, 🔗 Articles)

Efficiency

  • No-Images Cache — Pages with no images are remembered and skipped on future ingests
  • Calendar Date Filter — Only ingests events from the past week onward (no old events from previous years)
  • Image Analysis Scoping — Only analyzes images on Beacon blog posts, not external or generic pages
  • WordPress Tags — Tags and categories are fetched and included in ingested content for better search
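
The no-images cache described above amounts to a persisted set of URLs consulted before image analysis. The file name matches the repo layout (data/no_images_cache.json); the helper names below are illustrative:

```python
# Sketch of the no-images cache: pages found image-free are remembered
# and skipped on future ingests. Helper names are illustrative.
import json
from pathlib import Path

CACHE_PATH = Path("no_images_cache.json")

def load_cache() -> set:
    if CACHE_PATH.exists():
        return set(json.loads(CACHE_PATH.read_text()))
    return set()

def save_cache(cache: set) -> None:
    CACHE_PATH.write_text(json.dumps(sorted(cache)))

def maybe_analyze(url: str, image_urls: list, cache: set) -> bool:
    """Return True if the page's images should be analyzed."""
    if url in cache:          # remembered as image-free: skip entirely
        return False
    if not image_urls:        # no images this time: remember and skip
        cache.add(url)
        return False
    return True

cache = load_cache()
print(maybe_analyze("https://beacon2.fcsia.com/about", [], cache))  # False
save_cache(cache)
```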

Prerequisites

1. Install Ollama

Download and install from https://ollama.ai

2. Pull Required Models

# Text embeddings (required)
ollama pull nomic-embed-text

# Chat model (required)
ollama pull llama3.2

# Vision model for image analysis (optional)
ollama pull moondream

3. Install Python Dependencies

pip install -r requirements.txt

Requires: Python 3.10+, flask (for web UI)

Quick Start

1. Run Ingestion

Crawl the website, sync calendar, and build the vector database:

python ingest.py

First run: ~20–40 minutes (depends on internet and Ollama speed)
Subsequent runs: faster thanks to the no-images cache

2. Start the Web UI

python app.py

Open http://localhost:5000 in your browser.

3. CLI Mode (Alternative)

# Interactive
python chat.py

# Single question
python chat.py "when is the next flex friday"
python chat.py "who is mr kent"

Configuration

Edit config.py to customize:

Setting                  Default           Description
CHAT_MODEL               llama3.2          Ollama model for generating answers
EMBED_MODEL              nomic-embed-text  Model for text embeddings
VISION_MODEL             moondream         Model for image OCR/analysis
IMAGE_ANALYSIS_ENABLED   True              Toggle image analysis on/off
DEFAULT_TOP_K            25                Number of results to retrieve per query
HTML_MAX_DEPTH           12                Maximum crawl depth for link following
HTML_WORKERS             32                Concurrent crawl workers
IMAGES_PER_PAGE          20                Max images to analyze per page
TEMPERATURE              0.3               LLM response creativity (lower = more factual)
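
The settings above are plausibly plain module-level constants; a config.py along these lines (values taken from the defaults table, comments and exact layout assumed):

```python
# Sketch of config.py, mirroring the defaults table above.
CHAT_MODEL = "llama3.2"             # Ollama model for generating answers
EMBED_MODEL = "nomic-embed-text"    # model for text embeddings
VISION_MODEL = "moondream"          # model for image OCR/analysis
IMAGE_ANALYSIS_ENABLED = True       # toggle image analysis on/off
DEFAULT_TOP_K = 25                  # results retrieved per query
HTML_MAX_DEPTH = 12                 # maximum crawl depth
HTML_WORKERS = 32                   # concurrent crawl workers
IMAGES_PER_PAGE = 20                # max images analyzed per page
TEMPERATURE = 0.3                   # lower = more factual answers
```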

Project Structure

AI-Kent/
├── app.py                 # Flask web server
├── chat.py                # Query engine & prompt construction
├── config.py              # All configuration settings
├── ingest.py              # Main ingestion pipeline
├── extractors.py          # HTML & WordPress content extraction
├── image_analyzer.py      # AI-powered image analysis (OCR)
├── utils.py               # URL normalization, date parsing, helpers
├── FAQ.txt                # Curated school FAQ knowledge base
├── staff.json             # Staff directory data
├── requirements.txt       # Python dependencies
├── templates/
│   └── index.html         # Chat web interface
├── data/                  # Organized raw data (auto-generated)
│   ├── posts/             # WordPress blog posts (.json + .md)
│   ├── pages/             # WordPress pages (.json + .md)
│   ├── markdown/          # Crawled HTML pages as markdown (.json + .md)
│   ├── calendar/          # Google Calendar event data
│   ├── images/            # Image analysis results
│   ├── showcase/          # Showcase project data
│   ├── html/              # Raw crawled page data
│   ├── sitemap/           # Crawled URL sitemaps
│   └── no_images_cache.json  # Cache of pages with no images
└── chroma_beacon2/        # ChromaDB vector database

How It Works

Ingestion Pipeline (ingest.py)

  1. Google Calendar — Fetches ICS feed, parses events, filters to past week + future only
  2. WordPress API — Pulls posts/pages with tag & category metadata via REST API
  3. HTML Crawling — Deep crawls with link following (up to 12 levels deep)
  4. Image Analysis — Uses vision model for OCR on Beacon blog post images only (skips cached no-image pages)
  5. FAQ & Staff — Ingests FAQ.txt and staff.json for direct knowledge
  6. Embedding — Generates vector embeddings using nomic-embed-text
  7. ChromaDB — Stores all documents + embeddings for semantic search
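
The tail of the pipeline (steps 6–7) can be sketched offline. The real code embeds chunks with nomic-embed-text via Ollama and stores them in a ChromaDB collection; here a toy embedding and an in-memory list stand in, and the chunk size is an assumption:

```python
# Minimal sketch of the chunk → embed → store tail of the pipeline.
def chunk(text: str, size: int = 200) -> list:
    """Split text into fixed-size pieces before embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

store = []  # stand-in for the ChromaDB collection

def ingest_document(doc_id: str, text: str, embed_fn) -> None:
    for i, piece in enumerate(chunk(text)):
        store.append({"id": f"{doc_id}:{i}",
                      "text": piece,
                      "vector": embed_fn(piece)})

# toy embedding: a 1-dimensional "vector" of the chunk length
ingest_document("post-1", "Flex Friday " * 60, lambda t: [len(t)])
print(len(store))  # 720 characters → 4 chunks
```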

Query Pipeline (chat.py)

  1. Intent Detection — Detects time queries, staff queries, follow-up questions
  2. Calendar Injection — For date questions, fetches and filters upcoming events with keyword matching
  3. Staff Lookup — For "who is..." queries, does fuzzy name matching against staff.json
  4. Semantic Search — Embeds the question and finds top-K relevant documents in ChromaDB
  5. Context Assembly — Builds prompt with calendar events, staff info, FAQ matches, and web content
  6. LLM Generation — Generates answer with conversational context and date awareness
  7. Source Filtering — Returns clean, specific source URLs (filters out generic/category pages)
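
The context-assembly step (step 5) can be sketched as a prompt builder that stitches together calendar events, staff info, and retrieved documents. The section labels and prompt shape below are illustrative, not the repository's exact template:

```python
# Sketch of context assembly: calendar events, staff info, and retrieved
# documents are combined into one date-aware prompt.
from datetime import date

def build_prompt(question, docs, calendar_events=(), staff_info=None):
    parts = [f"Today's date: {date.today().isoformat()}"]
    if calendar_events:
        parts.append("Upcoming events:\n" + "\n".join(calendar_events))
    if staff_info:
        parts.append("Staff info:\n" + staff_info)
    parts.append("Context:\n" + "\n---\n".join(docs))
    parts.append(f"Question: {question}\nAnswer using only the context above.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "When is the next Flex Friday?",
    docs=["Flex Friday is a recurring pathway event."],
    calendar_events=["2025-03-07 Flex Friday: Save a Life"],
)
print(prompt.splitlines()[0])
```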

Source Citation Rules

  • Calendar events → Link to Google Calendar embed
  • Staff info → Link to fcsinnovationacademy.fultonschools.org/staff
  • Beacon articles → Link to specific post URL
  • Generic pages filtered → /updates/..., /engineering/, /health-science/, and /it/ category pages are excluded
  • Internal files hidden → FAQ.txt is used for answers but not listed as a source
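
The filtering rules above can be sketched as a URL predicate. The exclusion list mirrors the category paths named in this section; the function name is illustrative:

```python
# Sketch of source filtering: keep specific article URLs, drop
# homepages, category pages, and internal files.
from urllib.parse import urlparse

GENERIC_PATHS = {"/", "/updates", "/engineering", "/health-science", "/it"}

def is_citable(url: str) -> bool:
    path = urlparse(url).path.rstrip("/") or "/"
    if path in GENERIC_PATHS:
        return False
    if path.endswith("FAQ.txt"):   # internal files are never cited
        return False
    return True

print(is_citable("https://beacon2.fcsia.com/engineering/"))  # False
```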

School-Specific Features

Pathway Synonyms

  • Healthcare = Health Science — These terms are interchangeable throughout the system
  • "Save a Life Flex Friday" — Recognized as a Healthcare/Health Science pathway event

Acronym Expansion

  • FF → Flex Friday
  • IA → Innovation Academy
  • Monday Missions → Weekly video updates
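
The synonym and acronym handling above can be sketched as query normalization before search. The mappings come from this section; the function name and regex approach are illustrative:

```python
# Sketch of query normalization: expand acronyms and map pathway
# synonyms so "FF" and "healthcare" hit the right documents.
import re

EXPANSIONS = {
    r"\bff\b": "flex friday",
    r"\bia\b": "innovation academy",
    r"\bhealthcare\b": "health science",
}

def normalize(query: str) -> str:
    q = query.lower()
    for pattern, replacement in EXPANSIONS.items():
        q = re.sub(pattern, replacement, q)
    return q

print(normalize("when is the next FF?"))  # → "when is the next flex friday?"
```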

Calendar Awareness

  • Knows today's date and uses it for "next event" queries
  • Handles follow-ups: "what about healthcare?" after asking about IT events
  • Treats Flex Friday as single-day events (uses start date only)
  • Supports "after that" / "next one" for sequential event lookup

Example Queries

"When is the next Flex Friday?"
"What about healthcare?"              (follow-up → finds next healthcare event)
"The one after that?"                 (follow-up → finds the next one after)
"Who is Mr. Kent?"
"What are the IT showcase projects?"
"Tell me about the engineering pathway"
"What events are coming up this week?"
"How do I sign up for a Flex Friday?"

Troubleshooting

"Collection not found"

Run python ingest.py first to build the database.

Images not being analyzed

  1. Check Ollama is running: ollama list
  2. Verify vision model: ollama pull moondream
  3. Check config: IMAGE_ANALYSIS_ENABLED = True

Slow ingestion

  • Keep the default vision model: moondream is already one of the lightest options
  • Disable image analysis entirely: IMAGE_ANALYSIS_ENABLED = False
  • Subsequent runs are faster due to the no-images cache
  • Reduce crawl depth: HTML_MAX_DEPTH = 8

Out of memory / Context length errors

The system automatically retries with smaller chunks. If issues persist:

  • Reduce CHUNK_SIZE in config.py
  • Reduce DEFAULT_TOP_K for queries
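
The automatic retry with smaller chunks can be sketched as halving the chunk size until embedding succeeds. The exception type and helper names below are illustrative; the real code would catch whatever error Ollama raises on context overflow:

```python
# Sketch of retry-with-smaller-chunks. embed_fn stands in for the real
# Ollama embedding call; MemoryError stands in for a context-length error.
def embed_with_retry(text, embed_fn, chunk_size=2048, min_size=256):
    while chunk_size >= min_size:
        try:
            return [embed_fn(text[i:i + chunk_size])
                    for i in range(0, len(text), chunk_size)]
        except MemoryError:
            chunk_size //= 2          # halve the chunk size and retry
    raise RuntimeError("text could not be embedded even at minimum chunk size")
```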

Re-ingesting

To update the database with new content:

python ingest.py

This rebuilds everything from scratch. The vector database is deleted and recreated to ensure freshness. The no-images cache persists across runs for efficiency.

License

MIT
