
Modernize crawley to Python 3 (asyncio + httpx) #20

Merged
jmg merged 6 commits into master from claude/web-crawler-python3-modernize-N21pe on Jun 5, 2026

@jmg jmg commented Jun 5, 2026

Owner

Full port of the legacy Python 2 framework to a modern Python 3 (3.9+) crawling / scraping library, keeping the declarative crawler/scraper API.

Highlights

Core

  • Concurrency moved from eventlet green pools to asyncio (AsyncPool); crawlers are coroutines with a synchronous run() wrapper.
  • HTTP layer rebuilt on httpx (AsyncClient): cookies, proxies, timeouts.
  • Extractors: XPath, PyQuery and a new CSSExtractor (cssselect).

Modern scraping API (crawley.scraping)

  • Ergonomic fetch / afetch / afetch_all / scrape / parse and Document / Element with CSS selectors (::text, ::attr(name)), XPath, links() and declarative extract().
  • Same shortcuts (css, css_first, extract, doc) available on the crawler response.
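
The `::text` and `::attr(name)` suffixes are not standard CSS, so a library has to separate them from the selector before handing it to a CSS engine such as cssselect. The following is a minimal sketch of that split, not crawley's actual code: `split_selector` and its return shape are hypothetical names chosen for illustration.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "::text" or "::attr(name)" pseudo-element.
_SUFFIX = re.compile(r"::(?:text|attr\(([^)]+)\))\s*$")


def split_selector(selector: str) -> Tuple[str, str, Optional[str]]:
    """Return (base_selector, mode, attr_name).

    mode is "element", "text" or "attr". Hypothetical helper for
    illustration only; it is not part of crawley's documented API.
    """
    match = _SUFFIX.search(selector)
    if match is None:
        # No suffix: the caller wants the matched elements themselves.
        return selector.strip(), "element", None
    base = selector[: match.start()].strip()
    attr_name = match.group(1)
    if attr_name is None:
        # "::text" matched, so there is no attribute name to capture.
        return base, "text", None
    return base, "attr", attr_name.strip()
```

With this split, `a.next::attr(href)` would be evaluated as the plain selector `a.next`, and the `href` attribute would be read from each match afterwards.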

Politeness & robustness

  • Opt-in robots.txt (respect_robots, on_robots_blocked, Crawl-delay).
  • Per-host rate limiting (crawl_delay, max_concurrency_per_host).
  • Retries with exponential backoff + jitter and Retry-After (max_retries, retry_backoff, retry_statuses).
  • Visited-url de-duplication (unique_urls) preventing redundant fetches and crawl loops.
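
To illustrate what honouring robots.txt and `Crawl-delay` involves, here is a small sketch using the standard library's `urllib.robotparser`. Whether crawley uses this module internally is not stated in this PR, so treat it as an assumption; the robots.txt content, the `is_allowed` helper and the `"crawley"` user-agent string are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body (invented for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = "crawley") -> bool:
    """Return True when robots.txt permits fetching `url`."""
    return parser.can_fetch(user_agent, url)


# Crawl-delay value (in seconds) that applies to this user agent, or None.
delay = parser.crawl_delay("crawley")
```

A crawler that opts in would skip any url for which `is_allowed` returns False and wait at least `delay` seconds between requests to that host.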

Persistence

  • Relational layer migrated from elixir + SQLAlchemy 0.7.8 to SQLAlchemy 2.x; MongoDB on current pymongo; CouchDB via a small httpx client; JSON / XML / CSV document exports.

Framework

  • CLI (startproject / run / syncdb / migratedb / shell / browser), project types and the DSL parser/compiler ported to Python 3.
  • GUI ported from PyQt4/QtWebKit to PySide6/QtWebEngine (optional extra).
  • SMTP sender rewritten with email.message.EmailMessage.

Packaging & tooling

  • pyproject.toml (PEP 621) with optional extras (sql, mongo, gui, shell, http2, dev, docs) and a crawley console script.
  • Type hints on the public modules + PEP 561 py.typed; mypy in CI.
  • Removed setup.py / setup.cfg / requirements.txt / .travis.yml, the dead eventlet / elixir deps and the old Sphinx doc/ tree.
  • GitHub Actions CI (Python 3.9–3.12: ruff + mypy + pytest) and a docs deploy workflow.
  • Added a LICENSE file (GPL-3.0).

Docs, examples & tests

  • MkDocs Material documentation site (docs/), published via GitHub Pages.
  • Runnable, test-covered examples in examples/.
  • Hermetic pytest suite (~170 tests, ~92% core coverage) running against a local server — no network access required.

See CHANGELOG.md for the full list.

https://claude.ai/code/session_0159tWs4APGGyjhKYQGYiFWY


Generated by Claude Code

claude added 6 commits June 4, 2026 02:15
Full port of the legacy Python 2 framework to a modern Python 3 (3.9+)
crawling/scraping library, keeping the declarative crawler/scraper API.

Core
- Replace eventlet green pools with an asyncio-based concurrency model
  (AsyncPool bounded by a semaphore); crawlers are now coroutines with a
  synchronous `run()` convenience wrapper.
- Rewrite the HTTP layer on top of httpx.AsyncClient: cookies, proxies,
  timeouts, retries and randomized request delays.
- Extractors: XPath and PyQuery ported, new CSSExtractor (cssselect).
- Logging instead of print; type-friendly, lazy public API in crawley/__init__.
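
The semaphore-bounded pool and the synchronous `run()` wrapper described above can be sketched with plain asyncio. This is an illustrative approximation under stated assumptions, not the AsyncPool implementation: `bounded_map`, `run` and the `limit` parameter are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    worker: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    limit: int = 5,
) -> List[R]:
    """Run worker(item) for every item with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


def run(worker, items, limit: int = 5):
    """Synchronous convenience wrapper around the coroutine above."""
    return asyncio.run(bounded_map(worker, items, limit))
```

The wrapper lets callers that are not themselves async start a crawl with one blocking call, while async callers can await the coroutine directly.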

Persistence
- Relational layer rewritten on SQLAlchemy 2.x (declarative Entity/UrlEntity,
  Field/Unicode shims, connectors for sqlite/postgres/mysql/oracle), replacing
  the dead elixir + SQLAlchemy 0.7.8 stack.
- Document storages (JSON/CSV/XML) modernized; Mongo on current pymongo and
  CouchDB via a small httpx client (the couchdb lib is unmaintained on py3).

Framework
- CLI manager (startproject/run/syncdb/migratedb/shell/browser) and project
  types ported; DSL parser/compiler ported to py3 + SQLAlchemy.
- GUI scraping browser ported from PyQt4/QtWebKit to PySide6/QtWebEngine
  (optional extra, lazily imported).
- smtp sender rewritten with email.message.EmailMessage.
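
For readers unfamiliar with `email.message.EmailMessage`, this is roughly what building a message with it looks like. The `build_report` function, its parameters and the addresses in the comment are hypothetical; the PR does not show the sender's actual interface.

```python
from email.message import EmailMessage


def build_report(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text message (hypothetical helper for illustration)."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    # set_content() with a str creates a text/plain body.
    message.set_content(body)
    return message


# Sending would then use smtplib from the standard library, for example:
#   with smtplib.SMTP("localhost") as smtp:   # host is an assumption
#       smtp.send_message(build_report(...))
```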

Packaging & tooling
- pyproject.toml (PEP 621) with optional extras [sql], [mongo], [gui],
  [shell], [http2], [dev]; console_scripts entry point.
- Remove setup.py/setup.cfg/requirements.txt/.travis.yml and other legacy
  build scripts.
- GitHub Actions CI (py 3.9-3.12, ruff + pytest) replacing Travis.

Tests
- New hermetic pytest suite (31 tests) running against a local HTTP server:
  extractors, matchers, url finder, http layer, async crawl flow, scrapers,
  persistence and CLI integration. No network access required.

Modern scraping API (crawley.scraping)
- New high-level, ergonomic scraping interface (parsel / requests-html
  flavoured) on top of httpx + lxml: Document / Element with css(),
  css_first(), xpath(), links(), title and declarative extract().
- CSS selectors accept ::text and ::attr(name) suffixes; links are
  resolved to absolute urls.
- Network helpers: fetch() / afetch() / afetch_all() (concurrent) and a
  one-call scrape(url, rules); plus parse() for raw html.
- Response gains .doc plus .css / .css_first / .extract shortcuts so the
  same API is available inside a crawler's scrape().
- Exposed from the top-level crawley package.

Tests (31 -> 133, ~92% core coverage)
- Hermetic pytest suite against a local HTTP server, covering: the scraping
  API, extractors, url matchers/finder, async HTTP layer, cookies, the
  AsyncPool, crawler flow + events/login/error handling, Fast/OffLine/Smart
  crawlers and scrapers, relational + document + NoSQL persistence (with fake
  drivers), the DSL parser/compiler, the INI config parser, the CLI commands
  (startproject/syncdb/migratedb/run), utils and the SMTP sender.

Tooling
- Coverage config (omitting the PySide6 GUI, which can't be exercised
  headlessly), pytest-cov in the dev extra; CI runs with coverage reporting.
- README documents the modern scraping API; new examples/modern_scraping.py.

Politeness & robustness
- RetryPolicy (crawley.http.retry): retries on network errors and retryable
  HTTP statuses (429/5xx) with exponential backoff + jitter, honouring the
  Retry-After header; configurable via crawler attrs max_retries /
  retry_backoff / retry_statuses.
- HostRateLimiter (crawley.http.throttle): per-host minimum delay between
  requests and optional per-host concurrency cap (crawl_delay /
  max_concurrency_per_host).
- RobotsPolicy (crawley.http.robots): fetches and caches robots.txt per host,
  skips disallowed urls (new on_robots_blocked event) and applies the
  Crawl-delay directive. Opt-in via respect_robots.
- Wired into RequestManager (retry + throttle) and BaseCrawler; FastCrawler
  reuses the same policies.
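
The retry rules above (exponential backoff, jitter, `Retry-After`) can be sketched as a pure delay calculation. This is an assumption-laden illustration rather than the RetryPolicy source: `retry_delay`, `should_retry`, the "full jitter" strategy, the 60-second cap and the exact status set are all choices made for the example.

```python
import random
from typing import Optional

# Assumed set of retryable statuses; the commit only says "429/5xx".
RETRYABLE = frozenset({429, 500, 502, 503, 504})


def retry_delay(
    attempt: int,
    backoff: float = 0.5,
    retry_after: Optional[str] = None,
    max_delay: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Only the delta-seconds form is handled in this sketch; an
            # HTTP-date value falls through to the backoff calculation.
            return min(max(float(retry_after), 0.0), max_delay)
        except ValueError:
            pass
    # Exponential growth: backoff, 2*backoff, 4*backoff, ... capped at max_delay.
    ceiling = min(backoff * (2 ** attempt), max_delay)
    # "Full jitter": pick uniformly between 0 and the ceiling so that many
    # clients retrying at once do not all hit the server at the same moment.
    return random.uniform(0.0, ceiling)


def should_retry(status: int, attempt: int, max_retries: int = 3) -> bool:
    """Retry only retryable statuses, and only while attempts remain."""
    return status in RETRYABLE and attempt < max_retries
```

A server-supplied `Retry-After` takes priority over the computed backoff because it reflects what the server actually asked for.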

Documentation (MkDocs Material + mkdocstrings)
- mkdocs.yml and docs/: home, installation, scraping API, crawlers & scrapers,
  politeness, persistence, CLI & framework, and an auto-generated API
  reference. Builds clean with `mkdocs build --strict`.
- New docs extra; docs deploy workflow; README updated (politeness section,
  docs pointer); .gitignore refreshed (site/, caches, coverage).

Tests (133 -> 154)
- test_retry, test_throttle, test_robots covering the new policies and their
  integration; conftest server now serves robots.txt, a flaky endpoint and a
  503-with-Retry-After endpoint. Core coverage ~92%.

Examples
- New examples/ scripts that double as documentation, each parametrized by a
  base_url (real site by default) so they run standalone and under tests:
  01 scraping API, 02 crawler with pagination, 03 polite crawler
  (robots/rate-limit/retries), 04 persistence to JSON, 05 concurrent fetch.
- examples/README.md indexes them; README and docs link to them.

Tests (154 -> 166)
- tests/test_examples.py loads every example by path and runs it against a new
  hermetic `quotes_server` fixture (a paginated quotes site), asserting the
  scraped results; plus an import smoke test for each.
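
Loading a script "by path" is typically done with `importlib.util`, which works even when a file is not on `sys.path` or has a name that is not a valid module identifier. The sketch below shows the general technique; `load_module_from_path` is a hypothetical name, and the real tests/test_examples.py may differ.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: Path) -> ModuleType:
    """Import a Python file from an arbitrary location on disk."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the file's top-level code inside the new module object.
    spec.loader.exec_module(module)
    return module
```

A test can then call whatever entry point the example exposes, passing the local server's address as the base url instead of the real site.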

Housekeeping
- Remove the legacy Sphinx `doc/` tree (superseded by the MkDocs `docs/`).
- ruff: exclude the legacy sample projects under examples/. `ruff check .`,
  `pytest` and `mkdocs build --strict` all pass.

Crawler
- Visited-url de-duplication (unique_urls, on by default): skips already-seen
  urls, avoiding redundant fetches and preventing crawl loops even with
  unbounded max_depth. The seen-set resets on each start().
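
The effect of a seen-set can be shown on an in-memory link graph, with no HTTP involved. This is a simplified sketch of the idea and not the crawler's code: `crawl_order` and the `links` mapping are invented for the example.

```python
from collections import deque
from typing import Dict, Iterable, List


def crawl_order(start: str, links: Dict[str, Iterable[str]]) -> List[str]:
    """Visit every url reachable from `start` exactly once (breadth-first)."""
    seen = {start}          # urls already queued or visited
    queue = deque([start])
    visited: List[str] = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, ()):
            # Without this check, two pages linking to each other would
            # keep re-queueing one another and the loop would never end.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

For two pages that link to each other, the function returns each url once and terminates, which is the behaviour the cycle test relies on.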

Library hygiene
- Attach a NullHandler to the "crawley" logger so importing the package never
  emits "No handlers could be found" warnings; applications stay in control of
  logging configuration.
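
The NullHandler pattern is the standard recommendation for libraries in the Python logging documentation. A minimal sketch of both sides follows; the `enable_debug_logging` helper and the format string are illustrative and not part of crawley.

```python
import logging

# Library side: attach a do-nothing handler to the package's top-level logger
# so the library itself never decides where log records end up.
logging.getLogger("crawley").addHandler(logging.NullHandler())


# Application side (hypothetical helper): opt in to seeing the library's logs.
def enable_debug_logging() -> None:
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger("crawley")
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
```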

Docs & meta
- Document unique_urls in the crawler reference.
- Add CHANGELOG.md summarizing the 0.3.0 modernization.

Tests (166 -> 169)
- test_dedupe: de-dup on/off and a cycle (loop-a <-> loop-b) that only
  terminates thanks to de-duplication; new loop endpoints in the test server.

Typing
- Type hints on the public-facing modules (scraping, extractors, http
  response/retry/throttle/robots, toolbox) using PEP 563 deferred annotations.
- PEP 561 `py.typed` marker shipped in the wheel so downstream code gets the
  types.
- mypy config ([tool.mypy], follow_imports=silent over the typed modules) and a
  `mypy` step in CI; passes clean.

Project
- Add the GPL-3.0 LICENSE text.
- README points to the published docs site; CHANGELOG notes the typing/LICENSE
  additions.
@jmg jmg merged commit 706eecc into master Jun 5, 2026
7 of 9 checks passed

2 participants