A robust, minimal-complexity Python web scraper built to download chapters from novelbin.com and prepare them for a Retrieval-Augmented Generation (RAG) system.
- Cloudflare Bypass: Native bypassing using `curl_cffi`, which mimics real browser fingerprints.
- RAG-Ready Output: Saves cleanly formatted Markdown (`.md`) files ready to be indexed by LangChain or similar frameworks.
- Auto-Resume: Performs a preflight check against the Table of Contents to efficiently resume interrupted scraping sessions without re-downloading files.
- Polite Scraping: Randomly rotates User-Agents and adds jitter (random sleeps) to stay under the radar.
- Resilient: Automatically retries on network failures or temporary Cloudflare blocks, using exponential backoff.
- Rich Logging: Generates session metrics and writes timestamped logs to the `logs/` directory for auditing.
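The "Polite Scraping" and "Resilient" behaviours can be sketched together in a few lines. This is a simplified stand-in rather than the project's actual implementation; the `fetch` callable and the parameter names are assumptions:

```python
import random
import time

def fetch_with_retry(fetch, url, max_retries=4, base_delay=1.0):
    """Retry a fetch with exponential backoff plus random jitter.

    `fetch` is any callable that returns the page body or raises on
    failure -- in the real scraper this would wrap curl_cffi, e.g.
    curl_cffi.requests.get(url, impersonate="chrome").
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Backoff doubles each attempt (1s, 2s, 4s, ...) and the
            # jitter keeps the request timing from looking mechanical.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Separating the retry policy from the HTTP call itself makes the backoff logic easy to unit-test with a fake fetcher.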
- Python 3.8+
- The dependencies listed in `requirements.txt`
- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure the Scraper

  Edit the `config.json` file to set your preferences:
  - `novel_url`: The main page for the novel (used for the auto-resume preflight check).
  - `start_chapter_url`: The URL of the first chapter.
  - `output_dir`: The local directory where Markdown files will be saved.
  - `max_chapters`: Set to `null` to scrape everything, or to a number (e.g. `3`) for testing.
  - `resume_scraping`: Keep `true` to skip existing files.

- Run the Scraper

  ```bash
  python scraper.py
  ```
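For reference, a `config.json` using the options described above might look like this (the URLs and directory name are placeholders, not real values):

```json
{
  "novel_url": "https://novelbin.com/b/example-novel",
  "start_chapter_url": "https://novelbin.com/b/example-novel/chapter-1",
  "output_dir": "output",
  "max_chapters": null,
  "resume_scraping": true
}
```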
This codebase is split logically into a few distinct phases to make it readable:
- Config & Logging: Loads
config.jsonand sets up the dual console+file logging. - Preflight Module: Grabs the Table of Contents up front to map what's needed vs what's already saved locally.
- Fetching Engine (
fetch_with_retry): Built to be resilient against site blocks, utilizing randomized delays. - Parsing Engine (
parse_chapter): Built withBeautifulSoup4, this isolates the website-specific HTML parsing logic. If you ever scrape a different website, this is the only main logic block you'd need to rewrite. - Main Loop: The conductor that moves through chapters recursively using the "Next Chapter" link.
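To illustrate the kind of logic `parse_chapter` isolates, here is a minimal BeautifulSoup sketch. The CSS selectors are invented placeholders, not novelbin.com's real markup, and the function is a simplified stand-in for the project's parser:

```python
from bs4 import BeautifulSoup

def parse_chapter(html):
    """Extract a chapter title and its paragraphs, emitting Markdown.

    The selectors below are hypothetical; swapping them out is all that
    would be needed to target a different site's markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h2.chapter-title")        # assumed selector
    paragraphs = soup.select("div.chapter-content p")  # assumed selector
    lines = [f"# {title.get_text(strip=True)}"]
    lines += [p.get_text(strip=True) for p in paragraphs]
    return "\n\n".join(lines)
```

Keeping the selectors in one function is what makes the rest of the pipeline site-agnostic.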
Every time you run the script, a `metrics_*.json` file is saved inside the `logs/` directory, showing the total run time, the number of chapters scraped, and the average time spent per chapter.
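A metrics dump like that can be produced with a short helper. This is a sketch only; the field names, filename timestamp format, and function name are assumptions, not the script's exact output:

```python
import json
import time
from datetime import datetime
from pathlib import Path

def write_metrics(log_dir, started_at, chapters_scraped):
    """Write a metrics_*.json summary of the run into `log_dir`."""
    elapsed = time.time() - started_at
    metrics = {
        "total_runtime_seconds": round(elapsed, 2),
        "chapters_scraped": chapters_scraped,
        # Guard against a run that scraped nothing.
        "avg_seconds_per_chapter": (
            round(elapsed / chapters_scraped, 2) if chapters_scraped else None
        ),
    }
    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"metrics_{datetime.now():%Y%m%d_%H%M%S}.json"
    path.write_text(json.dumps(metrics, indent=2))
    return path
```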