
jadia/novelbin_scrapper


NovelBin Chapter Scraper

A robust, minimal-complexity Python web scraper built to download chapters from novelbin.com and prepare them for a Retrieval-Augmented Generation (RAG) system.

Features

  • Cloudflare Bypass: Bypasses Cloudflare natively using curl_cffi, which impersonates real browser TLS fingerprints.
  • RAG-Ready Output: Saves cleanly formatted Markdown (.md) files ready to be indexed by LangChain or similar frameworks.
  • Auto-Resume: Performs a preflight check against the Table of Contents to efficiently resume interrupted scraping sessions without re-downloading files.
  • Polite Scraping: Randomly rotates User-Agents and implements jitter (random sleep) to stay under the radar.
  • Resilient: Automatically retries on network failures or temporary Cloudflare blocks using exponential backoff.
  • Rich Logging: Generates session metrics and outputs timestamped logs to the logs/ directory for auditing.
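The "Polite Scraping" behavior above can be sketched as follows. This is a minimal illustration, not the actual code from scraper.py: the User-Agent strings and function names here are placeholders.

```python
import random
import time

# Hypothetical pool of User-Agent strings; the real list in scraper.py may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_headers() -> dict:
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def jitter_sleep(base: float = 2.0, spread: float = 3.0) -> float:
    """Sleep for base + U(0, spread) seconds between requests; return the delay."""
    delay = base + random.uniform(0, spread)
    time.sleep(delay)
    return delay
```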

Prerequisites

  • Python 3.8+
  • The dependencies listed in requirements.txt

Quick Start

  1. Install Dependencies

    pip install -r requirements.txt
  2. Configure the Scraper

     Edit the config.json file to set your preferences.

    • novel_url: The main page for the novel (used for the auto-resume preflight check).
    • start_chapter_url: The URL to the first chapter.
    • output_dir: The local directory where markdown files will be saved.
    • max_chapters: Set to null to scrape everything, or a number (e.g. 3) for testing.
    • resume_scraping: Keep true to skip existing files.
  3. Run the Scraper

    python scraper.py
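A config.json matching the options in step 2 might look like the following. The key names come from the list above; the URL values and defaults shown here are purely illustrative.

```json
{
  "novel_url": "https://novelbin.com/b/some-novel",
  "start_chapter_url": "https://novelbin.com/b/some-novel/chapter-1",
  "output_dir": "output",
  "max_chapters": null,
  "resume_scraping": true
}
```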

Architecture for Learning

This codebase is split logically into a few distinct phases to make it readable:

  1. Config & Logging: Loads config.json and sets up the dual console+file logging.
  2. Preflight Module: Grabs the Table of Contents up front to map what's needed vs what's already saved locally.
  3. Fetching Engine (fetch_with_retry): Built to be resilient against site blocks, utilizing randomized delays.
  4. Parsing Engine (parse_chapter): Built with BeautifulSoup4, this isolates the website-specific HTML parsing logic. If you ever scrape a different website, this is the only main logic block you'd need to rewrite.
  5. Main Loop: The conductor that iterates through chapters by following each page's "Next Chapter" link.
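The retry-with-backoff idea behind the fetching engine can be sketched like this. It is a simplified stand-in, not the real fetch_with_retry: the fetch callable is injected here (in the scraper it would be something like a curl_cffi session's get), and the delay parameters are assumptions.

```python
import random
import time

def fetch_with_retry(fetch, url, max_retries=4, base_delay=1.0):
    """Call fetch(url), retrying on any exception with exponential
    backoff plus random jitter. Raises the last error if all attempts fail."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Backoff: base, 2*base, 4*base, ... plus up to base of jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Injecting the fetch callable keeps transient-failure handling separate from the site-specific request logic, mirroring the phase split described above.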

Session Metrics

Every time you run the script, a metrics_*.json file is saved inside the logs/ directory recording the total run time, the number of chapters scraped, and the average time spent per chapter.
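A metrics file could look something like the fragment below. The field names and values are hypothetical, inferred only from the statistics listed above; the actual keys written by the script may differ.

```json
{
  "total_runtime_seconds": 184.2,
  "chapters_scraped": 42,
  "avg_seconds_per_chapter": 4.4
}
```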
