A complete RAG (Retrieval-Augmented Generation) chatbot for Innovation Academy, built with Ollama, ChromaDB, and Flask. It crawls the Beacon website, ingests calendar events, staff data, and FAQs, then serves an AI assistant that can answer questions about IA with sourced responses.
- WordPress REST API Integration — Ingests all posts, pages, and custom post types with tag/category metadata
- HTML Crawling — Deep crawls the entire beacon2.fcsia.com website with link following
- Google Calendar Sync — Fetches school calendar events (filtered to recent + future only)
- Staff Directory — Direct name lookup with fuzzy matching
- FAQ Knowledge Base — Ingests curated FAQ.txt for common questions
- Vector Search — Fast semantic search using ChromaDB + `nomic-embed-text` embeddings
- AI-Powered Image Analysis — OCR and content understanding via Ollama vision models (Beacon posts only)
- Conversational Context — Handles follow-up questions ("what about healthcare?", "the one after that?")
- Smart Source Citation — Links to Google Calendar, Staff Directory, and specific Beacon articles (filters out generic pages)
- Flask Web App — Clean dark-themed chat UI at `http://localhost:5000`
- Chat History — Persisted in local storage across sessions
- Current Date Display — Shows today's date in the welcome message
- Friendly Source Links — Up to 5 sources shown with readable labels (📅 Calendar, 👤 Staff, 🔗 Articles)
- No-Images Cache — Pages with no images are remembered and skipped on future ingests
- Calendar Date Filter — Only ingests events from the past week onward (no old events from previous years)
- Image Analysis Scoping — Only analyzes images on Beacon blog posts, not external or generic pages
- WordPress Tags — Tags and categories are fetched and included in ingested content for better search
Download and install from https://ollama.ai
```bash
# Text embeddings (required)
ollama pull nomic-embed-text

# Chat model (required)
ollama pull llama3.2

# Vision model for image analysis (optional)
ollama pull moondream
```

```bash
pip install -r requirements.txt
```

Requires: Python 3.10+, `flask` (for the web UI)
Crawl the website, sync calendar, and build the vector database:
```bash
python ingest.py
```

First run: ~20–40 minutes (depends on internet + Ollama speed). Subsequent runs: faster thanks to the no-images cache.
```bash
python app.py
```

Open http://localhost:5000 in your browser.
```bash
# Interactive
python chat.py

# Single question
python chat.py "when is the next flex friday"
python chat.py "who is mr kent"
```

Edit `config.py` to customize:
| Setting | Default | Description |
|---|---|---|
| `CHAT_MODEL` | `llama3.2` | Ollama model for generating answers |
| `EMBED_MODEL` | `nomic-embed-text` | Model for text embeddings |
| `VISION_MODEL` | `moondream` | Model for image OCR/analysis |
| `IMAGE_ANALYSIS_ENABLED` | `True` | Toggle image analysis on/off |
| `DEFAULT_TOP_K` | `25` | Number of results to retrieve per query |
| `HTML_MAX_DEPTH` | `12` | Maximum crawl depth for link following |
| `HTML_WORKERS` | `32` | Concurrent crawl workers |
| `IMAGES_PER_PAGE` | `20` | Max images to analyze per page |
| `TEMPERATURE` | `0.3` | LLM response creativity (lower = more factual) |
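These settings correspond to plain module-level constants; a minimal sketch of what `config.py` might contain, using the defaults from the table (illustrative — the real file may define additional settings such as `CHUNK_SIZE`):

```python
# config.py — central settings for the ingestion pipeline and chat engine.
# Names mirror the table above; values are the documented defaults.

CHAT_MODEL = "llama3.2"            # Ollama model for generating answers
EMBED_MODEL = "nomic-embed-text"   # model for text embeddings
VISION_MODEL = "moondream"         # model for image OCR/analysis
IMAGE_ANALYSIS_ENABLED = True      # toggle image analysis on/off
DEFAULT_TOP_K = 25                 # results to retrieve per query
HTML_MAX_DEPTH = 12                # maximum crawl depth for link following
HTML_WORKERS = 32                  # concurrent crawl workers
IMAGES_PER_PAGE = 20               # max images analyzed per page
TEMPERATURE = 0.3                  # lower = more factual
```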
```
AI-Kent/
├── app.py              # Flask web server
├── chat.py             # Query engine & prompt construction
├── config.py           # All configuration settings
├── ingest.py           # Main ingestion pipeline
├── extractors.py       # HTML & WordPress content extraction
├── image_analyzer.py   # AI-powered image analysis (OCR)
├── utils.py            # URL normalization, date parsing, helpers
├── FAQ.txt             # Curated school FAQ knowledge base
├── staff.json          # Staff directory data
├── requirements.txt    # Python dependencies
├── templates/
│   └── index.html      # Chat web interface
├── data/               # Organized raw data (auto-generated)
│   ├── posts/          # WordPress blog posts (.json + .md)
│   ├── pages/          # WordPress pages (.json + .md)
│   ├── markdown/       # Crawled HTML pages as markdown (.json + .md)
│   ├── calendar/       # Google Calendar event data
│   ├── images/         # Image analysis results
│   ├── showcase/       # Showcase project data
│   ├── html/           # Raw crawled page data
│   ├── sitemap/        # Crawled URL sitemaps
│   └── no_images_cache.json  # Cache of pages with no images
└── chroma_beacon2/     # ChromaDB vector database
```
- Google Calendar — Fetches ICS feed, parses events, filters to past week + future only
- WordPress API — Pulls posts/pages with tag & category metadata via REST API
- HTML Crawling — Deep crawls with link following (up to 12 levels deep)
- Image Analysis — Uses vision model for OCR on Beacon blog post images only (skips cached no-image pages)
- FAQ & Staff — Ingests FAQ.txt and staff.json for direct knowledge
- Embedding — Generates vector embeddings using `nomic-embed-text`
- ChromaDB — Stores all documents + embeddings for semantic search
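The embedding and storage steps above can be sketched roughly as follows. This assumes the `ollama` and `chromadb` Python packages and a running local Ollama server; `chunk_text` and `ingest_documents` are illustrative names, not the actual functions in `ingest.py`:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks for embedding."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

def ingest_documents(docs: dict[str, str], collection_name: str = "beacon2") -> None:
    """Embed each chunk via Ollama and store it in a persistent ChromaDB collection."""
    import chromadb, ollama  # deferred: only needed when actually ingesting

    client = chromadb.PersistentClient(path="chroma_beacon2")
    coll = client.get_or_create_collection(collection_name)
    for url, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            emb = ollama.embeddings(model="nomic-embed-text", prompt=chunk)
            coll.add(ids=[f"{url}#{i}"],
                     embeddings=[emb["embedding"]],
                     documents=[chunk],
                     metadatas=[{"source": url}])
```

Chunking with overlap keeps sentences that straddle a boundary retrievable from at least one chunk; the source URL stored in metadata is what later powers source citation.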
- Intent Detection — Detects time queries, staff queries, follow-up questions
- Calendar Injection — For date questions, fetches and filters upcoming events with keyword matching
- Staff Lookup — For "who is..." queries, does fuzzy name matching against staff.json
- Semantic Search — Embeds the question and finds top-K relevant documents in ChromaDB
- Context Assembly — Builds prompt with calendar events, staff info, FAQ matches, and web content
- LLM Generation — Generates answer with conversational context and date awareness
- Source Filtering — Returns clean, specific source URLs (filters out generic/category pages)
- Calendar events → Link to Google Calendar embed
- Staff info → Link to `fcsinnovationacademy.fultonschools.org/staff`
- Beacon articles → Link to the specific post URL
- Generic pages filtered → `/updates/...`, `/engineering/`, `/health-science/`, and `/it/` category pages are excluded
- Internal files hidden → FAQ.txt is used for answers but not listed as a source
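A minimal sketch of the filtering rule described above (the exact path checks and helper name are assumptions, not the actual code in `chat.py`):

```python
from urllib.parse import urlparse

GENERIC_CATEGORY_PATHS = {"/engineering/", "/health-science/", "/it/"}
HIDDEN_SOURCES = {"FAQ.txt"}  # internal files never shown as sources

def filter_sources(urls: list[str], limit: int = 5) -> list[str]:
    """Keep specific, citable URLs; drop internal files and generic category pages."""
    kept: list[str] = []
    for url in urls:
        if url in HIDDEN_SOURCES:
            continue  # used for answers, never listed
        path = urlparse(url).path
        if path.startswith("/updates/") or path in GENERIC_CATEGORY_PATHS:
            continue  # generic/category pages excluded
        if url not in kept:
            kept.append(url)  # dedupe while preserving relevance order
    return kept[:limit]  # at most 5 sources shown in the UI
```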
- Healthcare = Health Science — These terms are interchangeable throughout the system
- "Save a Life Flex Friday" — Recognized as a Healthcare/Health Science pathway event
- FF → Flex Friday
- IA → Innovation Academy
- Monday Missions → Weekly video updates
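One simple way to apply these mappings is to expand shorthand and synonyms in the query text before search; a sketch (the mapping table and function name are illustrative):

```python
import re

# Shorthand and synonym expansions applied before embedding the query.
SYNONYMS = {
    r"\bFF\b": "Flex Friday",
    r"\bIA\b": "Innovation Academy",
    r"\bhealthcare\b": "health science",  # treated as interchangeable
}

def normalize_query(q: str) -> str:
    """Expand known shorthand so queries match the indexed vocabulary."""
    for pattern, replacement in SYNONYMS.items():
        q = re.sub(pattern, replacement, q, flags=re.IGNORECASE)
    return q
```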
- Knows today's date and uses it for "next event" queries
- Handles follow-ups: "what about healthcare?" after asking about IT events
- Treats Flex Friday as single-day events (uses start date only)
- Supports "after that" / "next one" for sequential event lookup
"When is the next Flex Friday?"
"What about healthcare?" (follow-up → finds next healthcare event)
"The one after that?" (follow-up → finds the next one after)
"Who is Mr. Kent?"
"What are the IT showcase projects?"
"Tell me about the engineering pathway"
"What events are coming up this week?"
"How do I sign up for a Flex Friday?"
Run `python ingest.py` first to build the database.
- Check Ollama is running: `ollama list`
- Verify the vision model is installed: `ollama pull moondream`
- Check config: `IMAGE_ANALYSIS_ENABLED = True`
- Use a lighter vision model: `moondream` (the default) is already fast
- Disable image analysis entirely: `IMAGE_ANALYSIS_ENABLED = False`
- Subsequent runs are faster due to the no-images cache
- Reduce crawl depth: `HTML_MAX_DEPTH = 8`
The system automatically retries with smaller chunks. If issues persist:
- Reduce `CHUNK_SIZE` in config.py
- Reduce `DEFAULT_TOP_K` for queries
To update the database with new content:
```bash
python ingest.py
```

This rebuilds everything from scratch: the vector database is deleted and recreated to ensure freshness, while the no-images cache persists across runs for efficiency.
MIT