Modernize crawley to Python 3 (asyncio + httpx) by jmg · Pull Request #20 · jmg/crawley

jmg · 2026-06-05T17:33:54Z

Full port of the legacy Python 2 framework to a modern Python 3 (3.9+) crawling / scraping library, keeping the declarative crawler/scraper API.

Highlights

Core

Concurrency moved from eventlet green pools to asyncio (AsyncPool); crawlers are coroutines with a synchronous run() wrapper.
HTTP layer rebuilt on httpx (AsyncClient): cookies, proxies, timeouts.
Extractors: XPath, PyQuery and a new CSSExtractor (cssselect).
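
As a rough illustration of the concurrency model described above, the following is a minimal sketch of a semaphore-bounded asyncio pool with a synchronous wrapper. `BoundedPool`, `fake_fetch` and this `run()` helper are hypothetical names for illustration only; they are not crawley's `AsyncPool` or its actual API.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Tuple


class BoundedPool:
    """Hypothetical stand-in for a semaphore-bounded pool (not crawley's AsyncPool)."""

    def __init__(self, max_concurrency: int) -> None:
        self.max_concurrency = max_concurrency
        self.running = 0  # jobs currently inside the semaphore
        self.peak = 0     # highest value `running` ever reached

    async def map(
        self, func: Callable[[int], Awaitable[int]], items: Iterable[int]
    ) -> List[int]:
        # Create the semaphore inside the running loop so it binds to that loop.
        semaphore = asyncio.Semaphore(self.max_concurrency)

        async def guarded(item: int) -> int:
            async with semaphore:
                self.running += 1
                self.peak = max(self.peak, self.running)
                try:
                    return await func(item)
                finally:
                    self.running -= 1

        # gather() preserves input order in its result list.
        return await asyncio.gather(*(guarded(item) for item in items))


async def fake_fetch(n: int) -> int:
    """Pretend network call: sleeps briefly, then returns a value."""
    await asyncio.sleep(0.01)
    return n * 2


def run(items: Iterable[int], max_concurrency: int = 2) -> Tuple[List[int], int]:
    """Synchronous convenience wrapper around the coroutine entry point."""
    pool = BoundedPool(max_concurrency)
    results = asyncio.run(pool.map(fake_fetch, items))
    return results, pool.peak
```

Calling `run(range(5), max_concurrency=2)` schedules all five jobs at once but lets only two hold the semaphore at any moment, which is the property a bounded pool is meant to guarantee.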

Modern scraping API (crawley.scraping)

Ergonomic fetch / afetch / afetch_all / scrape / parse and Document / Element with CSS selectors (::text, ::attr(name)), XPath, links() and declarative extract().
Same shortcuts (css, css_first, extract, doc) available on the crawler response.
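
To make the `::text` and `::attr(name)` suffixes concrete, here is a minimal sketch of how such a suffix could be split off a selector before the plain CSS part is handed to a selector engine. The `split_selector` function and `SelectorSpec` type are assumptions made for illustration; the PR does not show how crawley parses these suffixes.

```python
import re
from typing import NamedTuple, Optional

# Matches a trailing "::text" or "::attr(name)" pseudo-element.
_SUFFIX = re.compile(r"::(?:(text)|attr\(([^)]+)\))\s*$")


class SelectorSpec(NamedTuple):
    css: str             # plain CSS selector with the suffix removed
    want_text: bool      # True when the selector ended in ::text
    attr: Optional[str]  # attribute name when it ended in ::attr(name)


def split_selector(selector: str) -> SelectorSpec:
    """Split a selector into its CSS part and an optional extraction suffix."""
    match = _SUFFIX.search(selector)
    if match is None:
        return SelectorSpec(selector.strip(), False, None)
    css = selector[: match.start()].strip()
    return SelectorSpec(css, match.group(1) is not None, match.group(2))
```

Under this sketch, `split_selector("a.next::attr(href)")` yields the CSS part `a.next` plus the attribute name `href`, and a selector without a suffix passes through unchanged.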

Politeness & robustness

Opt-in robots.txt (respect_robots, on_robots_blocked, Crawl-delay).
Per-host rate limiting (crawl_delay, max_concurrency_per_host).
Retries with exponential backoff + jitter and Retry-After (max_retries, retry_backoff, retry_statuses).
Visited-url de-duplication (unique_urls) preventing redundant fetches and crawl loops.
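
The retry behaviour listed above (exponential backoff, jitter, Retry-After) can be sketched as a single delay calculation. This is an illustrative sketch, not the PR's `RetryPolicy`: the `retry_delay` function, its `max_delay` cap and the choice of "full jitter" are assumptions, and only the delta-seconds form of Retry-After is handled.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    backoff: float = 0.5,
    retry_after: Optional[str] = None,
    max_delay: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Honour the server's hint, clamped to [0, max_delay].
            return min(max_delay, max(0.0, float(retry_after)))
        except ValueError:
            # HTTP-date values are not parsed in this sketch; use backoff instead.
            pass
    if rng is None:
        rng = random.Random()
    # Exponential ceiling: backoff, 2*backoff, 4*backoff, ... capped at max_delay.
    ceiling = min(max_delay, backoff * (2 ** (attempt - 1)))
    # "Full jitter": pick uniformly between 0 and the ceiling.
    return rng.uniform(0.0, ceiling)
```

Randomizing the whole interval, rather than adding a small offset, spreads retries from many concurrent requests so they do not hit a recovering host at the same instant.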

Persistence

Relational layer migrated from elixir + SQLAlchemy 0.7.8 to SQLAlchemy 2.x; MongoDB on current pymongo; CouchDB via a small httpx client; JSON / XML / CSV document exports.

Framework

CLI (startproject / run / syncdb / migratedb / shell / browser), project types and the DSL parser/compiler ported to Python 3.
GUI ported from PyQt4/QtWebKit to PySide6/QtWebEngine (optional extra).
SMTP sender rewritten with email.message.EmailMessage.
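
For readers unfamiliar with `email.message.EmailMessage`, the following is a minimal sketch of building a message with it. The `build_message` helper and the addresses are hypothetical; the PR's actual sender module is not shown here.

```python
from email.message import EmailMessage
from typing import List


def build_message(sender: str, recipients: List[str], subject: str, body: str) -> EmailMessage:
    """Build a plain-text message using the modern email API."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)  # sets a text/plain body
    return msg


# A message like this can be handed to smtplib.SMTP(...).send_message(msg).
report = build_message(
    "crawler@example.com",
    ["a@example.com", "b@example.com"],
    "Crawl finished",
    "42 pages scraped",
)
```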

Packaging & tooling

pyproject.toml (PEP 621) with optional extras (sql, mongo, gui, shell, http2, dev, docs) and a crawley console script.
Type hints on the public modules + PEP 561 py.typed; mypy in CI.
Removed setup.py / setup.cfg / requirements.txt / .travis.yml, the dead eventlet / elixir deps and the old Sphinx doc/ tree.
GitHub Actions CI (Python 3.9–3.12: ruff + mypy + pytest) and a docs deploy workflow.
Added a LICENSE file (GPL-3.0).

Docs, examples & tests

MkDocs Material documentation site (docs/), published via GitHub Pages.
Runnable, test-covered examples in examples/.
Hermetic pytest suite (~170 tests, ~92% core coverage) running against a local server; no network access required.

See CHANGELOG.md for the full list.

https://claude.ai/code/session_0159tWs4APGGyjhKYQGYiFWY

Generated by Claude Code

Full port of the legacy Python 2 framework to a modern Python 3 (3.9+) crawling/scraping library, keeping the declarative crawler/scraper API.
Core
- Replace eventlet green pools with an asyncio-based concurrency model (AsyncPool bounded by a semaphore); crawlers are now coroutines with a synchronous `run()` convenience wrapper.
- Rewrite the HTTP layer on top of httpx.AsyncClient: cookies, proxies, timeouts, retries and randomized request delays.
- Extractors: XPath and PyQuery ported, new CSSExtractor (cssselect).
- Logging instead of print; type-friendly, lazy public API in crawley/__init__.
Persistence
- Relational layer rewritten on SQLAlchemy 2.x (declarative Entity/UrlEntity, Field/Unicode shims, connectors for sqlite/postgres/mysql/oracle), replacing the dead elixir + SQLAlchemy 0.7.8 stack.
- Document storages (JSON/CSV/XML) modernized; Mongo on current pymongo and CouchDB via a small httpx client (the couchdb lib is unmaintained on py3).
Framework
- CLI manager (startproject/run/syncdb/migratedb/shell/browser) and project types ported; DSL parser/compiler ported to py3 + SQLAlchemy.
- GUI scraping browser ported from PyQt4/QtWebKit to PySide6/QtWebEngine (optional extra, lazily imported).
- smtp sender rewritten with email.message.EmailMessage.
Packaging & tooling
- pyproject.toml (PEP 621) with optional extras [sql], [mongo], [gui], [shell], [http2], [dev]; console_scripts entry point.
- Remove setup.py/setup.cfg/requirements.txt/.travis.yml and other legacy build scripts.
- GitHub Actions CI (py 3.9-3.12, ruff + pytest) replacing Travis.
Tests
- New hermetic pytest suite (31 tests) running against a local HTTP server: extractors, matchers, url finder, http layer, async crawl flow, scrapers, persistence and CLI integration. No network access required.

Modern scraping API (crawley.scraping)
- New high-level, ergonomic scraping interface (parsel / requests-html flavoured) on top of httpx + lxml: Document / Element with css(), css_first(), xpath(), links(), title and declarative extract().
- CSS selectors accept ::text and ::attr(name) suffixes; links are resolved to absolute urls.
- Network helpers: fetch() / afetch() / afetch_all() (concurrent) and a one-call scrape(url, rules); plus parse() for raw html.
- Response gains .doc plus .css / .css_first / .extract shortcuts so the same API is available inside a crawler's scrape().
- Exposed from the top-level crawley package.
Tests (31 -> 133, ~92% core coverage)
- Hermetic pytest suite against a local HTTP server, covering: the scraping API, extractors, url matchers/finder, async HTTP layer, cookies, the AsyncPool, crawler flow + events/login/error handling, Fast/OffLine/Smart crawlers and scrapers, relational + document + NoSQL persistence (with fake drivers), the DSL parser/compiler, the INI config parser, the CLI commands (startproject/syncdb/migratedb/run), utils and the SMTP sender.
Tooling
- coverage config (omitting the headless-only PySide6 GUI), pytest-cov in the dev extra, CI runs with coverage reporting.
- README documents the modern scraping API; new examples/modern_scraping.py.

Politeness & robustness
- RetryPolicy (crawley.http.retry): retries on network errors and retryable HTTP statuses (429/5xx) with exponential backoff + jitter, honouring the Retry-After header; configurable via crawler attrs max_retries / retry_backoff / retry_statuses.
- HostRateLimiter (crawley.http.throttle): per-host minimum delay between requests and optional per-host concurrency cap (crawl_delay / max_concurrency_per_host).
- RobotsPolicy (crawley.http.robots): fetches and caches robots.txt per host, skips disallowed urls (new on_robots_blocked event) and applies the Crawl-delay directive. Opt-in via respect_robots.
- Wired into RequestManager (retry + throttle) and BaseCrawler; FastCrawler reuses the same policies.
Documentation (MkDocs Material + mkdocstrings)
- mkdocs.yml and docs/: home, installation, scraping API, crawlers & scrapers, politeness, persistence, CLI & framework, and an auto-generated API reference. Builds clean with `mkdocs build --strict`.
- New docs extra; docs deploy workflow; README updated (politeness section, docs pointer); .gitignore refreshed (site/, caches, coverage).
Tests (133 -> 154)
- test_retry, test_throttle, test_robots covering the new policies and their integration; conftest server now serves robots.txt, a flaky endpoint and a 503-with-Retry-After endpoint. Core coverage ~92%.
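
The robots.txt handling described in this commit (skipping disallowed urls and applying Crawl-delay) can be approximated with the standard library alone. The sketch below uses `urllib.robotparser` on an in-memory robots.txt; whether crawley's RobotsPolicy is built on this module is not stated in the PR, and the `crawley-sketch` user agent, `load_rules` helper and example urls are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable (cacheable) rules object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A url outside the Disallow prefix may be fetched.
allowed = rules.can_fetch("crawley-sketch", "http://example.com/quotes/1")
# A url under /private/ is rejected; a crawler could fire a "blocked" event here.
blocked = not rules.can_fetch("crawley-sketch", "http://example.com/private/x")
# Crawl-delay (in seconds) for this agent, or None when the directive is absent.
delay = rules.crawl_delay("crawley-sketch")
```

Keeping one parsed rules object per host, as the commit describes, avoids re-fetching robots.txt for every request to that host.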

Examples
- New examples/ scripts that double as documentation, each parametrized by a base_url (real site by default) so they run standalone and under tests: 01 scraping API, 02 crawler with pagination, 03 polite crawler (robots/rate-limit/retries), 04 persistence to JSON, 05 concurrent fetch.
- examples/README.md indexes them; README and docs link to them.
Tests (154 -> 166)
- tests/test_examples.py loads every example by path and runs it against a new hermetic `quotes_server` fixture (a paginated quotes site), asserting the scraped results; plus an import smoke test for each.
Housekeeping
- Remove the legacy Sphinx `doc/` tree (superseded by the MkDocs `docs/`).
- ruff: exclude the legacy sample projects under examples/. `ruff check .`, `pytest` and `mkdocs build --strict` all pass.

Crawler
- Visited-url de-duplication (unique_urls, on by default): skips already-seen urls, avoiding redundant fetches and preventing crawl loops even with unbounded max_depth. The seen-set resets on each start().
Library hygiene
- Attach a NullHandler to the "crawley" logger so importing the package never emits "No handlers could be found" warnings; applications stay in control of logging configuration.
Docs & meta
- Document unique_urls in the crawler reference.
- Add CHANGELOG.md summarizing the 0.3.0 modernization.
Tests (166 -> 169)
- test_dedupe: de-dup on/off and a cycle (loop-a <-> loop-b) that only terminates thanks to de-duplication; new loop endpoints in the test server.
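
The effect of de-duplication on a cyclic site can be shown with a small in-memory sketch. The link graph mirrors the loop-a <-> loop-b cycle mentioned in the tests; the `crawl` function, the `/loop-a` style paths and the `max_fetches` safety cap are hypothetical and do not reflect crawley's crawler internals.

```python
from collections import deque
from typing import Dict, List, Set

# Two pages that link to each other, forming a cycle.
LINKS: Dict[str, List[str]] = {
    "/loop-a": ["/loop-b"],
    "/loop-b": ["/loop-a"],
}


def crawl(
    start: str,
    links: Dict[str, List[str]],
    unique_urls: bool = True,
    max_fetches: int = 10,
) -> List[str]:
    """Breadth-first walk over an in-memory link graph; returns urls in fetch order."""
    seen: Set[str] = set()  # fresh for every call, like a per-start() seen-set
    fetched: List[str] = []
    queue = deque([start])
    # max_fetches is only a safety cap so the no-dedupe case still terminates here.
    while queue and len(fetched) < max_fetches:
        url = queue.popleft()
        if unique_urls:
            if url in seen:
                continue  # already visited: skip the redundant fetch
            seen.add(url)
        fetched.append(url)
        queue.extend(links.get(url, []))
    return fetched
```

With `unique_urls=True` the walk visits each page once and stops on its own; with it disabled, the walk keeps bouncing between the two pages until the artificial cap is reached.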

Typing
- Type hints on the public-facing modules (scraping, extractors, http response/retry/throttle/robots, toolbox) using PEP 563 deferred annotations.
- PEP 561 `py.typed` marker shipped in the wheel so downstream code gets the types.
- mypy config ([tool.mypy], follow_imports=silent over the typed modules) and a `mypy` step in CI; passes clean.
Project
- Add the GPL-3.0 LICENSE text.
- README points to the published docs site; CHANGELOG notes the typing/LICENSE additions.

claude added 6 commits June 4, 2026 02:15

jmg merged commit 706eecc into master Jun 5, 2026
7 of 9 checks passed

jmg merged 6 commits into master from claude/web-crawler-python3-modernize-N21pe
