A robust, minimal-complexity Python web scraper built to download chapters from novelbin.com and prepare them for a Retrieval-Augmented Generation (RAG) system.
- Cloudflare Bypass: Native bypassing using `curl_cffi`, which mimics real browser fingerprints.
- RAG-Ready Output: Saves cleanly formatted Markdown (`.md`) files ready to be indexed by LangChain or similar frameworks.
- Auto-Resume: Performs a preflight check against the Table of Contents to efficiently resume interrupted scraping sessions without re-downloading files.
- Polite Scraping: Randomly rotates User-Agents and adds jitter (random sleeps) to stay under the radar.
- Resilient: Automatically retries on network failures or temporary Cloudflare blocks, using exponential backoff.
- Rich Logging: Generates session metrics and writes timestamped logs to the `logs/` directory for auditing.
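The "Polite Scraping" and "Resilient" behaviours can be sketched together in a few lines. This is a simplified stand-in rather than the project's actual implementation; the `fetch` callable and the parameter names are assumptions:

```python
import random
import time

def fetch_with_retry(fetch, url, max_retries=4, base_delay=1.0):
    """Retry a fetch with exponential backoff plus random jitter.

    `fetch` is any callable that returns the page body or raises on
    failure -- in the real scraper this would wrap curl_cffi, e.g.
    curl_cffi.requests.get(url, impersonate="chrome").
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Backoff doubles each attempt (1s, 2s, 4s, ...) and the
            # jitter keeps the request timing from looking mechanical.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Separating the retry policy from the HTTP call itself makes the backoff logic easy to unit-test with a fake fetcher.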
- Python 3.8+
- The dependencies listed in `requirements.txt`
- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure the Scraper

  Edit the `config.json` file to set your preferences:
  - `novel_url`: The main page for the novel (used for the auto-resume preflight check).
  - `start_chapter_url`: The URL of the first chapter.
  - `output_dir`: The local directory where Markdown files will be saved.
  - `max_chapters`: Set to `null` to scrape everything, or to a number (e.g. `3`) for testing.
  - `resume_scraping`: Keep `true` to skip existing files.

- Run the Scraper

  ```bash
  python scraper.py
  ```
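For reference, a `config.json` using the options described above might look like this (the URLs and directory name are placeholders, not real values):

```json
{
  "novel_url": "https://novelbin.com/b/example-novel",
  "start_chapter_url": "https://novelbin.com/b/example-novel/chapter-1",
  "output_dir": "output",
  "max_chapters": null,
  "resume_scraping": true
}
```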
This codebase is split logically into a few distinct phases to make it readable:
- Config & Logging: Loads
config.jsonand sets up the dual console+file logging. - Preflight Module: Grabs the Table of Contents up front to map what's needed vs what's already saved locally.
- Fetching Engine (
fetch_with_retry): Built to be resilient against site blocks, utilizing randomized delays. - Parsing Engine (
parse_chapter): Built withBeautifulSoup4, this isolates the website-specific HTML parsing logic. If you ever scrape a different website, this is the only main logic block you'd need to rewrite. - Main Loop: The conductor that moves through chapters recursively using the "Next Chapter" link.
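To illustrate the kind of logic `parse_chapter` isolates, here is a minimal BeautifulSoup sketch. The CSS selectors are invented placeholders, not novelbin.com's real markup, and the function is a simplified stand-in for the project's parser:

```python
from bs4 import BeautifulSoup

def parse_chapter(html):
    """Extract a chapter title and its paragraphs, emitting Markdown.

    The selectors below are hypothetical; swapping them out is all that
    would be needed to target a different site's markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h2.chapter-title")        # assumed selector
    paragraphs = soup.select("div.chapter-content p")  # assumed selector
    lines = [f"# {title.get_text(strip=True)}"]
    lines += [p.get_text(strip=True) for p in paragraphs]
    return "\n\n".join(lines)
```

Keeping the selectors in one function is what makes the rest of the pipeline site-agnostic.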
Every time you run the script, a `metrics_*.json` file is saved inside the `logs/` directory, showing the total run time, the number of chapters scraped, and the average time spent per chapter.
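A metrics dump like that can be produced with a short helper. This is a sketch only; the field names, filename timestamp format, and function name are assumptions, not the script's exact output:

```python
import json
import time
from datetime import datetime
from pathlib import Path

def write_metrics(log_dir, started_at, chapters_scraped):
    """Write a metrics_*.json summary of the run into `log_dir`."""
    elapsed = time.time() - started_at
    metrics = {
        "total_runtime_seconds": round(elapsed, 2),
        "chapters_scraped": chapters_scraped,
        # Guard against a run that scraped nothing.
        "avg_seconds_per_chapter": (
            round(elapsed / chapters_scraped, 2) if chapters_scraped else None
        ),
    }
    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"metrics_{datetime.now():%Y%m%d_%H%M%S}.json"
    path.write_text(json.dumps(metrics, indent=2))
    return path
```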