Make delay() non-blocking so it composes with concurrency by freekmurze · Pull Request #507 · spatie/crawler

freekmurze · 2026-06-02T20:51:25Z

Fixes the concurrency bug demonstrated in #506.

The short version

When you combine concurrency() and delay(), the delay quietly cancels out the concurrency. This PR makes delay() non-blocking so the two compose the way you'd expect.

What is happening now

You configure the crawler like this:

Crawler::create($url)
    ->concurrency(20) // up to 20 requests at the same time
    ->delay(100)      // pause 100ms between requests
    ->start();

The whole crawl runs on one PHP thread inside Pool::promise()->wait(). The delay() is a usleep() called from the fulfilled/failed callbacks. usleep() is a hard freeze of that one thread.

So when one request finishes and hits its 100ms sleep, the freeze stops everything, including the network progress of the other 19 requests that were downloading in parallel. The delays can't overlap with the parallel work, they stack up one after another.

gantt
    title BEFORE - every delay freezes the one thread, so nothing overlaps (1 block = 100ms)
    dateFormat X
    axisFormat %S
    section One thread
    req 1 download :active, 0, 3
    delay (usleep) :crit, 3, 4
    req 2 download :active, 4, 7
    delay (usleep) :crit, 7, 8
    req 3 download :active, 8, 11

With 21 pages and a 100ms delay, that is roughly 21 * 100ms = 2.1s of pure frozen time added on top of the crawl. The more delay you add, the less concurrency() does.

Why it's wrong

delay() is meant to pace when requests start (be polite, avoid rate limits). It is not meant to block requests that are already in flight. Today it does the second thing, which means concurrency() and delay() are effectively mutually exclusive. You can't say "crawl in parallel, but gently".

How we do better

Stop sleeping inside the callbacks. Hand the delay to Guzzle as a per-request delay option instead. The async cURL handler honors that option inside its own event loop: it defers when a request starts, but it keeps pumping the other in-flight requests while it waits. No frozen thread.

Now delay only holds back its own request. Below, concurrency(2) keeps two requests in flight. When req 1 finishes and frees a slot, req 3 is pulled and waits its own 100ms delay, but req 2 keeps downloading the whole time (the delay never freezes it):

gantt
    title AFTER - each request waits its own delay; others keep downloading (1 block = 100ms)
    dateFormat X
    axisFormat %S
    section req 1
    delay        :crit, 0, 1
    download     :active, 1, 3
    section req 2
    delay        :crit, 0, 1
    download     :active, 1, 4
    section req 3 (next)
    waits delay  :crit, 3, 4
    download     :active, 4, 6

req 3's red delay block (300-400ms) sits directly over req 2's still-running download. The delay paces when each request starts; it no longer stalls work that is already happening, so a delay shorter than the network time is absorbed instead of added.

Throttles (FixedDelayThrottle / AdaptiveThrottle) keep the existing blocking path on purpose: an adaptive throttle needs the response time to compute its next delay, so it can only run after a request completes.

The change

flowchart LR
    A["delay() = usleep()<br/>in the response callback"] -->|freezes the<br/>single thread| B["pool stalls,<br/>concurrency lost"]
    C["delay() = Guzzle per-request<br/>'delay' request option"] -->|handled in the<br/>event loop| D["other requests<br/>keep flowing"]

Crawler::startCrawlingQueue() passes the delay as a per-request delay option on the pool (only when no throttle is set).
Crawler::applyDelay() now only sleeps for throttles.
FakeHandler honors the delay option too (mirrors Guzzle's real handlers), so tests that rely on delay timing keep working.

Tests

Added a regression test (does not let delay block concurrent in-flight requests, based on the one in Add failing test for delay blocking concurrency #506) that fails on the old code (~2.8s) and passes here (~1.2s).
The two timeLimit tests asserted an exact crawl count (4) that depended on the old "count, then sleep" ordering and on a 1-second-resolution clock. With non-blocking delay the request start is deferred before the URL is counted, so the exact number completed inside a fixed budget shifts by about one. They now assert the meaningful behavior (the limit is enforced, the crawl is bounded, and a per-execution limit eventually finishes) rather than a brittle exact number.
Full suite green (228 passed).

A configured delay() was applied as a blocking usleep() inside the fulfilled/failed pool callbacks. Because the whole crawl is driven by a single PHP thread inside Pool::promise()->wait(), that sleep froze the event loop and stalled every other in-flight request, effectively serializing the crawl and cancelling out concurrency(). Apply the delay through Guzzle's per-request `delay` option instead. The async cURL handler honors it inside its event loop: the request start is deferred while other in-flight requests keep progressing. Throttles keep using the blocking applyDelay() path, since an adaptive throttle needs the response time to compute its next delay. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make delay() non-blocking so it composes with concurrency#507

Make delay() non-blocking so it composes with concurrency#507
freekmurze wants to merge 1 commit into
mainfrom
fix/non-blocking-delay

freekmurze commented Jun 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

freekmurze commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The short version

What is happening now

Why it's wrong

How we do better

The change

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

freekmurze commented Jun 2, 2026 •

edited

Loading