Merged
19 changes: 18 additions & 1 deletion CHANGELOG.md
@@ -21,12 +21,29 @@ A full port of the legacy Python 2 framework to a modern Python 3 (3.9+) stack.
`retry_statuses`).
- **Visited-url de-duplication** (`unique_urls`) preventing redundant fetches
and crawl loops.
- **Callback-driven `Spider`** (`crawley.spider`): `Request` with
callback / `meta` / `cb_kwargs` / `errback`, `response.follow()`, `Item`, and
fingerprint-based de-duplication — for list→detail crawls.
- **Item pipelines** (`crawley.pipelines`): `ItemPipeline` + `DropItem`.
- **`CrawlSpider`** with `Rule` / `LinkExtractor` (allow/deny/restrict) and
**`SitemapSpider`** (`crawley.spiders`).
- **JavaScript rendering** via Playwright (`render_js = True`, extra
`crawley[js]`).
- **Stats collector** (`crawley.stats.StatsCollector`): per-crawl counters
(requests/responses/status/errors/items/elapsed), logged on finish.
- **On-disk HTTP cache** (`http_cache = True`) for development.
- **`FormRequest.from_response`** to pre-fill and submit forms.
- **Downloader middlewares** (`crawley.middlewares.DownloaderMiddleware`):
`process_request` / `process_response` / `process_exception` chains on the
`Spider` (sync or async).
- **AutoThrottle** (`autothrottle = True`): adapt the per-host delay to the
observed response latency.
- Documentation site (MkDocs Material + mkdocstrings) and a set of runnable,
test-covered `examples/`.
- **Type hints** on the public modules and a PEP 561 `py.typed` marker so
downstream code gets type information; `mypy` runs in CI.
- A `LICENSE` file (GPL-3.0).
-- A hermetic `pytest` suite (~170 tests, ~92% core coverage).
+- A hermetic `pytest` suite (~185 tests, ~90% core coverage).

### Changed
- Concurrency moved from `eventlet` green pools to **asyncio** (`AsyncPool`);
33 changes: 33 additions & 0 deletions README.md
@@ -139,6 +139,39 @@ The same shortcuts (`response.css`, `response.css_first`, `response.extract`,

---

## Spiders (callbacks, items, rules, JS)

For full crawls there's a Scrapy-style `Spider`: yield `Request`s (or
`response.follow(...)`) to navigate and dicts/`Item`s to emit data, with item
pipelines, rule-based crawling and optional JavaScript rendering. See
[`docs/spiders.md`](docs/spiders.md).

```python
from crawley.spider import Spider

class BlogSpider(Spider):
    start_urls = ["https://example.com/blog/"]

    def parse(self, response):  # default callback
        for href in response.css("a.post::attr(href)"):
            yield response.follow(href, callback=self.parse_post)
        nxt = response.css_first("a.next::attr(href)")
        if nxt:
            yield response.follow(nxt)  # follows pagination

    def parse_post(self, response):
        yield {"title": response.css_first("h1").text, "url": response.url}

BlogSpider().run()
```

- **Item pipelines**: `crawley.pipelines.ItemPipeline` + `DropItem`.
- **Rule-based**: `CrawlSpider` + `Rule(LinkExtractor(allow=..., deny=...))`.
- **Sitemaps**: `SitemapSpider(sitemap_urls=[...])`.
- **JavaScript**: `render_js = True` (install `crawley[js]` + `playwright install`).
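
To make the `allow=...` / `deny=...` idea concrete, here is a small standalone sketch. It does not import crawley: `filter_links` is a hypothetical helper, and it assumes Scrapy-like semantics (regular-expression patterns, with `deny` taking precedence over `allow`), which this README does not spell out.

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Keep urls that match any allow pattern (or all urls if none is given)
    and that match no deny pattern. Hypothetical helper, not crawley's API."""
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        # An empty allow list means "allow everything".
        if allow_res and not any(r.search(url) for r in allow_res):
            continue
        # Assumption: deny wins over allow.
        if any(r.search(url) for r in deny_res):
            continue
        kept.append(url)
    return kept


links = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
]
selected = filter_links(links, allow=[r"/blog/"], deny=[r"/tag/"])
```

Under these assumptions only the post URL survives: the tag page is dropped by `deny`, and the about page never matches `allow`.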

---

## Quick start (as a framework / CLI)

### 1. Start a new project
37 changes: 37 additions & 0 deletions crawley/__init__.py
@@ -34,6 +34,19 @@
    "parse",
    "Document",
    "Element",
    "Spider",
    "Request",
    "FormRequest",
    "Item",
    "DropItem",
    "ItemPipeline",
    "LinkExtractor",
    "Rule",
    "CrawlSpider",
    "SitemapSpider",
    "StatsCollector",
    "DownloaderMiddleware",
    "AutoThrottle",
    "__version__",
]

@@ -66,6 +79,30 @@ def __getattr__(name):
        from crawley.http.response import Response

        return Response
    if name in ("Spider", "Request", "FormRequest", "Item"):
        from crawley import spider

        return getattr(spider, name)
    if name in ("DropItem", "ItemPipeline"):
        from crawley import pipelines

        return getattr(pipelines, name)
    if name in ("LinkExtractor", "Rule", "CrawlSpider", "SitemapSpider"):
        from crawley import spiders

        return getattr(spiders, name)
    if name == "StatsCollector":
        from crawley.stats import StatsCollector

        return StatsCollector
    if name == "DownloaderMiddleware":
        from crawley.middlewares import DownloaderMiddleware

        return DownloaderMiddleware
    if name == "AutoThrottle":
        from crawley.http.autothrottle import AutoThrottle

        return AutoThrottle
    if name in ("fetch", "afetch", "afetch_all", "scrape", "parse", "Document", "Element"):
        from crawley import scraping
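
The hunk above relies on a module-level `__getattr__` (PEP 562) to defer imports until a name is first requested. The following standalone sketch shows the same mechanism on a throwaway module object. `lazy_demo`, `_LAZY_NAMES` and `_lazy_getattr` are illustrative names, not part of crawley, and a real package would define `__getattr__` directly in its `__init__.py` rather than attaching it afterwards.

```python
import collections
import importlib
import types

# Hypothetical stand-in for a package namespace.
lazy = types.ModuleType("lazy_demo")

# Public name -> module that actually defines it (illustrative mapping).
_LAZY_NAMES = {"OrderedDict": "collections", "JSONDecoder": "json"}


def _lazy_getattr(name):
    """Resolve *name* on first access; unknown names raise AttributeError."""
    try:
        module_name = _LAZY_NAMES[name]
    except KeyError:
        raise AttributeError(
            f"module 'lazy_demo' has no attribute {name!r}"
        ) from None
    # The import happens only when the attribute is requested.
    return getattr(importlib.import_module(module_name), name)


# Python consults ``__getattr__`` in the module's namespace when normal
# attribute lookup fails.
lazy.__getattr__ = _lazy_getattr

resolved = lazy.OrderedDict
```

Raising `AttributeError` for unknown names matters: it keeps `hasattr()` and `from package import name` behaving as usual.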

82 changes: 82 additions & 0 deletions crawley/crawlers/base.py
@@ -104,12 +104,40 @@ class BaseCrawler(metaclass=CrawlerMeta):
    unique_urls = True
    """Skip urls that have already been visited during the crawl."""

    render_js = False
    """Render pages with a headless browser (Playwright). Needs ``crawley[js]``."""

    playwright_options = {}
    """Extra options for the Playwright manager (browser_type, headless, ...)."""

    http_cache = False
    """Cache responses on disk (development helper)."""

    http_cache_dir = ".crawley_cache"
    """Directory used by the on-disk HTTP cache."""

    autothrottle = False
    """Adapt the per-host delay to the observed response latency."""

    autothrottle_target_concurrency = 1.0
    """Target number of concurrent requests per host for AutoThrottle."""

    autothrottle_start_delay = 1.0
    """Initial per-host delay used by AutoThrottle (seconds)."""

    autothrottle_max_delay = 60.0
    """Maximum per-host delay AutoThrottle may set (seconds)."""

    def __init__(self, sessions=None, settings=None):
        self.sessions = sessions if sessions is not None else []
        self.debug = getattr(settings, "SHOW_DEBUG_INFO", True)
        self.settings = settings
        self._seen = set()

        from crawley.stats import StatsCollector

        self.stats = StatsCollector()

        extractor_class = self.extractor or XPathExtractor
        self.extractor = extractor_class()

@@ -132,18 +160,61 @@ def __init__(self, sessions=None, settings=None):
            enabled=self.respect_robots,
        )

        self._autothrottle = None
        if self.autothrottle:
            from crawley.http.autothrottle import AutoThrottle

            self._autothrottle = AutoThrottle(
                target_concurrency=self.autothrottle_target_concurrency,
                start_delay=self.autothrottle_start_delay,
                max_delay=self.autothrottle_max_delay,
            )
            self.rate_limiter.delay = self.autothrottle_start_delay

        self.request_manager = self._make_request_manager()

        self._initialize_scrapers()

    def _record_latency(self, url, response):
        """Feed the response latency to AutoThrottle (if enabled)."""
        if self._autothrottle is None:
            return
        latency = getattr(response, "latency", None)
        if latency is None:
            return
        host = urllib.parse.urlparse(url).netloc
        self.rate_limiter.set_delay(host, self._autothrottle.adjust(host, latency))
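
`_record_latency` keys the per-host delay on `urlparse(url).netloc`. The short sketch below, using only the standard library, shows what that key looks like for a few URL shapes. `host_key` is a hypothetical helper name used for illustration.

```python
from urllib.parse import urlparse


def host_key(url):
    """Return the netloc, i.e. the grouping key used in the method above."""
    return urlparse(url).netloc


keys = [
    host_key("https://example.com/a"),
    host_key("https://example.com:8443/b"),
    host_key("https://user@example.com/c"),
]
# netloc keeps the port and any userinfo, so these three URLs produce
# three distinct keys even though the hostname is the same.
hostnames = {urlparse(u).hostname for u in (
    "https://example.com/a",
    "https://example.com:8443/b",
    "https://user@example.com/c",
)}
```

Whether a port-specific key is desirable depends on the crawl. `urlparse(url).hostname` would be the alternative if all ports of a host should share one delay.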

    def _make_cache(self):
        if not self.http_cache:
            return None
        from crawley.http.cache import HttpCache

        return HttpCache(self.http_cache_dir, enabled=True)

    def _make_request_manager(self):
        cache = self._make_cache()
        if self.render_js:
            from crawley.http.playwright import PlaywrightRequestManager

            return PlaywrightRequestManager(
                settings=self.settings,
                headers=self.headers,
                delay=self.requests_delay,
                deviation=self.requests_deviation,
                retry_policy=self.retry_policy,
                rate_limiter=self.rate_limiter,
                cache=cache,
                **self.playwright_options,
            )
        return RequestManager(
            settings=self.settings,
            headers=self.headers,
            delay=self.requests_delay,
            deviation=self.requests_deviation,
            retry_policy=self.retry_policy,
            rate_limiter=self.rate_limiter,
            cache=cache,
        )

    def _initialize_scrapers(self):
@@ -210,18 +281,25 @@ async def _fetch(self, url, depth_level=0):
self._seen.add(url)

        if self.respect_robots and not await self._robots_allowed(url):
            self.stats.inc("robots_blocked")
            self.on_robots_blocked(url)
            return

        if self.debug:
            log.info("crawling -> %s", url)

        self.stats.inc("requests")
        try:
            response = await self._get_response(url)
        except Exception as ex:  # noqa: BLE001 - delegated to error handler
            self.stats.inc("request_errors")
            self.on_request_error(url, ex)
            return

        self.stats.inc("responses")
        self.stats.inc("status/%s" % response.status_code)
        self._record_latency(url, response)

        urls = self._manage_scrapers(response)

        if not urls:
@@ -258,6 +336,7 @@ async def start(self):
        """Run the crawler (coroutine)."""
        self.pool = AsyncPool(self.max_concurrency_level)
        self._seen = set()
        self.stats.open()

        self.on_start()
        await self._login()
@@ -270,6 +349,9 @@ async def start(self):
        finally:
            await self.request_manager.aclose()

        self.stats.close()
        if self.debug:
            log.info("stats: %s", self.stats.get_stats())
        self.on_finish()
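
The counters touched in `_fetch` and `start` can be pictured with a small `Counter`-based stand-in. `MiniStats` is hypothetical: `crawley/stats.py` is not part of the lines shown here, so only the method names (`open`, `inc`, `close`, `get_stats`) are taken from the calls above. The internals and the `elapsed` key are assumptions.

```python
import time
from collections import Counter


class MiniStats:
    """Illustrative stand-in, not the real StatsCollector."""

    def __init__(self):
        self._counts = Counter()
        self._started = None
        self._elapsed = None

    def open(self):
        # Reset counters and start the clock at the beginning of a crawl.
        self._counts.clear()
        self._started = time.monotonic()

    def inc(self, key, count=1):
        self._counts[key] += count

    def close(self):
        if self._started is not None:
            self._elapsed = time.monotonic() - self._started

    def get_stats(self):
        stats = dict(self._counts)
        if self._elapsed is not None:
            stats["elapsed"] = self._elapsed  # assumed key name
        return stats


stats = MiniStats()
stats.open()
for status in (200, 200, 404):
    # Mirrors the three increments made for each successful fetch.
    stats.inc("requests")
    stats.inc("responses")
    stats.inc("status/%s" % status)
stats.close()
snapshot = stats.get_stats()
```

Using string keys such as `status/404` keeps the collector schema-free: new counters need no declaration, only a call to `inc`.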

    def run(self):
1 change: 1 addition & 0 deletions crawley/crawlers/fast.py
@@ -13,4 +13,5 @@ def _make_request_manager(self):
            headers=self.headers,
            retry_policy=self.retry_policy,
            rate_limiter=self.rate_limiter,
            cache=self._make_cache(),
        )
10 changes: 7 additions & 3 deletions crawley/extractors.py
@@ -7,7 +7,7 @@

from __future__ import annotations

-from io import StringIO
+from io import BytesIO
from typing import Any

from lxml import etree
@@ -33,7 +33,9 @@ class XPathExtractor(BaseExtractor):

     def get_object(self, data: str) -> Any:
         parser = etree.HTMLParser()
-        return etree.parse(StringIO(data), parser)
+        # Parse from bytes so documents carrying an XML encoding declaration
+        # (e.g. sitemaps) don't raise "Unicode strings with encoding declaration".
+        return etree.parse(BytesIO(data.encode("utf-8")), parser)


class CSSExtractor(BaseExtractor):
@@ -45,7 +47,9 @@ class CSSExtractor(BaseExtractor):

     def get_object(self, data: str) -> Any:
         parser = etree.HTMLParser()
-        return etree.parse(StringIO(data), parser)
+        # Parse from bytes so documents carrying an XML encoding declaration
+        # (e.g. sitemaps) don't raise "Unicode strings with encoding declaration".
+        return etree.parse(BytesIO(data.encode("utf-8")), parser)


class RawExtractor(BaseExtractor):
40 changes: 40 additions & 0 deletions crawley/http/autothrottle.py
@@ -0,0 +1,40 @@
"""Adaptive per-host throttling.

Adjusts the delay applied to each host based on the observed response latency,
aiming to keep roughly ``target_concurrency`` requests in flight per host (the
same idea as Scrapy's AutoThrottle).
"""

from __future__ import annotations

from typing import Optional


class AutoThrottle:
    """Compute a per-host delay from observed latencies."""

    def __init__(
        self,
        target_concurrency: float = 1.0,
        start_delay: float = 1.0,
        max_delay: float = 60.0,
        enabled: bool = True,
    ) -> None:
        self.target_concurrency = max(target_concurrency, 0.01)
        self.start_delay = start_delay
        self.max_delay = max_delay
        self.enabled = enabled
        self._delays: dict[str, float] = {}

    def adjust(self, host: str, latency: Optional[float]) -> float:
        """Update and return the new delay for *host* given a *latency*."""
        previous = self._delays.get(host, self.start_delay)
        if latency is None:
            return previous

        target = latency / self.target_concurrency
        # Smooth towards the target and clamp to [0, max_delay].
        new_delay = (previous + target) / 2
        new_delay = min(max(new_delay, 0.0), self.max_delay)
        self._delays[host] = new_delay
        return new_delay
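
As a worked example of the smoothing rule in `adjust`, the sketch below replays the same arithmetic as a pure function, so it runs without crawley installed. `next_delay` is an illustrative name. The formula (average of the previous delay and `latency / target_concurrency`, clamped to `[0, max_delay]`) is copied from the method above.

```python
def next_delay(previous, latency, target_concurrency=1.0, max_delay=60.0):
    """One AutoThrottle step: move halfway towards latency / concurrency."""
    target = latency / max(target_concurrency, 0.01)
    return min(max((previous + target) / 2, 0.0), max_delay)


delay = 1.0  # start_delay default
history = []
for latency in (0.2, 0.2, 0.2):  # three responses, each taking 200 ms
    delay = next_delay(delay, latency)
    history.append(delay)

# A very slow host is capped at max_delay.
capped = next_delay(100.0, 100.0)
```

With a steady 200 ms latency the delay goes from 1.0 s to roughly 0.6, 0.4 and 0.3 s, halving its distance to the 0.2 s target on each response. The last call shows the clamp, which returns 60.0.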