A fully functional multi-threaded web crawler with:
- Persistent crawling state with automatic resume, even after a crash
- SQLite-based frontier & visited tables
- Robots.txt compliance (see the sketch after this list)
- Per-domain politeness delay
- Thread-safe URL claiming
- Configurable thread count and crawl depth
- HTML parsing with BeautifulSoup
- URL normalization
- Retries + failure handling
- Clean modular architecture
- CLI usage
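To make the robots.txt and politeness items concrete, here is a minimal sketch built on the standard library's `urllib.robotparser`. The class name, the default delay, and the https-only robots URL are assumptions for illustration, not the project's actual code.

```python
import threading
import time
import urllib.robotparser
from urllib.parse import urlparse


class PolitenessGate:
    """Per-domain robots.txt check plus a minimum delay between requests."""

    def __init__(self, delay_seconds: float = 1.0):  # assumed default delay
        self.delay = delay_seconds
        self.lock = threading.Lock()
        self.robots: dict[str, urllib.robotparser.RobotFileParser] = {}
        self.last_hit: dict[str, float] = {}

    def allowed(self, url: str) -> bool:
        domain = urlparse(url).netloc
        with self.lock:
            rp = self.robots.get(domain)
            if rp is None:
                rp = urllib.robotparser.RobotFileParser()
                rp.set_url(f"https://{domain}/robots.txt")  # assumes https
                try:
                    rp.read()
                except OSError:
                    pass  # unreachable robots.txt: can_fetch() stays conservative
                self.robots[domain] = rp
        return rp.can_fetch("*", url)

    def wait_turn(self, url: str) -> None:
        """Block until this domain's politeness delay has elapsed."""
        domain = urlparse(url).netloc
        with self.lock:
            elapsed = time.monotonic() - self.last_hit.get(domain, 0.0)
            sleep_for = max(0.0, self.delay - elapsed)
            # Reserve the next slot before sleeping so other threads queue up.
            self.last_hit[domain] = time.monotonic() + sleep_for
        if sleep_for > 0:
            time.sleep(sleep_for)
```

Workers would call `allowed(url)` before claiming a fetch and `wait_turn(url)` just before issuing the request.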
Install dependencies and start a crawl:

```bash
pip install -r requirements.txt
python -m crawler.main https://example.com 6 2
```

The positional arguments are, in order:
- Seed URL
- Number of worker threads
- Max depth
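A minimal sketch of how `crawler/main.py` could read these arguments; the `Controller` class and its signature are assumptions, not the actual implementation:

```python
# crawler/main.py -- hypothetical entry-point sketch
import sys

from crawler.controller import Controller  # assumed class name


def main() -> None:
    if len(sys.argv) != 4:
        print("usage: python -m crawler.main <seed-url> <threads> <max-depth>")
        sys.exit(1)

    seed_url = sys.argv[1]
    num_threads = int(sys.argv[2])  # e.g. 6
    max_depth = int(sys.argv[3])    # e.g. 2

    Controller(seed_url, num_threads, max_depth).run()


if __name__ == "__main__":
    main()
```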
Project layout:

```
crawler/
├── main.py
├── controller.py
├── worker.py
├── db.py
└── utils.py
```
The crawler persists its state in SQLite:
- Frontier entries with a status of pending, in-progress, done, or failed
- Visited pages
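A rough sketch of what the frontier schema and the thread-safe claim step could look like. Table and column names here are assumptions derived from the description above, not the actual schema; `RETURNING` needs SQLite 3.35+, and each worker thread should use its own connection.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS frontier (
    url    TEXT PRIMARY KEY,
    depth  INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'  -- pending | in-progress | done | failed
);
CREATE TABLE IF NOT EXISTS visited (
    url        TEXT PRIMARY KEY,
    fetched_at TEXT NOT NULL
);
"""


def claim_url(conn: sqlite3.Connection):
    """Atomically move one pending URL to in-progress; return (url, depth) or None.

    The UPDATE is a single statement, so SQLite's write lock guarantees
    that two worker threads can never claim the same row.
    """
    with conn:  # commits on success, rolls back on error
        return conn.execute(
            """
            UPDATE frontier
               SET status = 'in-progress'
             WHERE url = (SELECT url FROM frontier
                           WHERE status = 'pending' LIMIT 1)
            RETURNING url, depth
            """
        ).fetchone()
```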
Simply restart the program; it continues crawling from where it left off.
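One common way to implement that resume (a sketch, assuming the frontier table above): on startup, reset any rows a crashed run left in-progress back to pending before the workers start.

```python
import sqlite3


def recover_frontier(conn: sqlite3.Connection) -> None:
    # URLs stuck 'in-progress' belong to a run that died mid-fetch;
    # make them claimable again so the crawl picks up where it stopped.
    with conn:
        conn.execute(
            "UPDATE frontier SET status = 'pending' WHERE status = 'in-progress'"
        )
```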
Fetched HTML pages are saved under data/pages/ with filenames equal to their SHA-256 content hash (e.g. data/pages/<hash>.html).
The crawler stores page metadata in the pages table (url, content_path, content_hash, title, meta_description, fetched_at, status_code).
If two URLs produce identical content bytes, the crawler detects the duplicate via the content hash and does not store the payload twice; the new URL is simply linked to the existing content file to save space.
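A sketch of that store-and-deduplicate step. The paths and pages-table columns follow the description above; the function itself is hypothetical:

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

PAGES_DIR = Path("data/pages")


def store_page(conn: sqlite3.Connection, url: str, body: bytes,
               title: str, meta_description: str, status_code: int) -> None:
    content_hash = hashlib.sha256(body).hexdigest()
    content_path = PAGES_DIR / f"{content_hash}.html"

    # Deduplicate: write the payload only if no earlier URL produced the
    # same bytes; otherwise this URL just points at the existing file.
    if not content_path.exists():
        PAGES_DIR.mkdir(parents=True, exist_ok=True)
        content_path.write_bytes(body)

    with conn:
        conn.execute(
            """
            INSERT OR REPLACE INTO pages
                (url, content_path, content_hash, title,
                 meta_description, fetched_at, status_code)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            """,
            (url, str(content_path), content_hash, title, meta_description,
             datetime.now(timezone.utc).isoformat(), status_code),
        )
```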