Skip to content

Make delay() non-blocking so it composes with concurrency#507

Open
freekmurze wants to merge 1 commit into
mainfrom
fix/non-blocking-delay
Open

Make delay() non-blocking so it composes with concurrency#507
freekmurze wants to merge 1 commit into
mainfrom
fix/non-blocking-delay

Conversation

@freekmurze

@freekmurze freekmurze commented Jun 2, 2026

Copy link
Copy Markdown
Member

Fixes the concurrency bug demonstrated in #506.

The short version

When you combine concurrency() and delay(), the delay quietly cancels out the concurrency. This PR makes delay() non-blocking so the two compose the way you'd expect.

What is happening now

You configure the crawler like this:

Crawler::create($url)
    ->concurrency(20) // up to 20 requests at the same time
    ->delay(100)      // pause 100ms between requests
    ->start();

The whole crawl runs on one PHP thread inside Pool::promise()->wait(). The delay() is a usleep() called from the fulfilled/failed callbacks. usleep() is a hard freeze of that one thread.

So when one request finishes and hits its 100ms sleep, the freeze stops everything, including the network progress of the other 19 requests that were downloading in parallel. The delays can't overlap with the parallel work, they stack up one after another.

gantt
    title BEFORE - every delay freezes the one thread, so nothing overlaps (1 block = 100ms)
    dateFormat X
    axisFormat %S
    section One thread
    req 1 download :active, 0, 3
    delay (usleep) :crit, 3, 4
    req 2 download :active, 4, 7
    delay (usleep) :crit, 7, 8
    req 3 download :active, 8, 11
Loading

With 21 pages and a 100ms delay, that is roughly 21 * 100ms = 2.1s of pure frozen time added on top of the crawl. The more delay you add, the less concurrency() does.

Why it's wrong

delay() is meant to pace when requests start (be polite, avoid rate limits). It is not meant to block requests that are already in flight. Today it does the second thing, which means concurrency() and delay() are effectively mutually exclusive. You can't say "crawl in parallel, but gently".

How we do better

Stop sleeping inside the callbacks. Hand the delay to Guzzle as a per-request delay option instead. The async cURL handler honors that option inside its own event loop: it defers when a request starts, but it keeps pumping the other in-flight requests while it waits. No frozen thread.

Now delay only holds back its own request. Below, concurrency(2) keeps two requests in flight. When req 1 finishes and frees a slot, req 3 is pulled and waits its own 100ms delay, but req 2 keeps downloading the whole time (the delay never freezes it):

gantt
    title AFTER - each request waits its own delay; others keep downloading (1 block = 100ms)
    dateFormat X
    axisFormat %S
    section req 1
    delay        :crit, 0, 1
    download     :active, 1, 3
    section req 2
    delay        :crit, 0, 1
    download     :active, 1, 4
    section req 3 (next)
    waits delay  :crit, 3, 4
    download     :active, 4, 6
Loading

req 3's red delay block (300-400ms) sits directly over req 2's still-running download. The delay paces when each request starts; it no longer stalls work that is already happening, so a delay shorter than the network time is absorbed instead of added.

Throttles (FixedDelayThrottle / AdaptiveThrottle) keep the existing blocking path on purpose: an adaptive throttle needs the response time to compute its next delay, so it can only run after a request completes.

The change

flowchart LR
    A["delay() = usleep()<br/>in the response callback"] -->|freezes the<br/>single thread| B["pool stalls,<br/>concurrency lost"]
    C["delay() = Guzzle per-request<br/>'delay' request option"] -->|handled in the<br/>event loop| D["other requests<br/>keep flowing"]
Loading
  • Crawler::startCrawlingQueue() passes the delay as a per-request delay option on the pool (only when no throttle is set).
  • Crawler::applyDelay() now only sleeps for throttles.
  • FakeHandler honors the delay option too (mirrors Guzzle's real handlers), so tests that rely on delay timing keep working.

Tests

  • Added a regression test (does not let delay block concurrent in-flight requests, based on the one in Add failing test for delay blocking concurrency #506) that fails on the old code (~2.8s) and passes here (~1.2s).
  • The two timeLimit tests asserted an exact crawl count (4) that depended on the old "count, then sleep" ordering and on a 1-second-resolution clock. With non-blocking delay the request start is deferred before the URL is counted, so the exact number completed inside a fixed budget shifts by about one. They now assert the meaningful behavior (the limit is enforced, the crawl is bounded, and a per-execution limit eventually finishes) rather than a brittle exact number.
  • Full suite green (228 passed).

A configured delay() was applied as a blocking usleep() inside the
fulfilled/failed pool callbacks. Because the whole crawl is driven by a
single PHP thread inside Pool::promise()->wait(), that sleep froze the
event loop and stalled every other in-flight request, effectively
serializing the crawl and cancelling out concurrency().

Apply the delay through Guzzle's per-request `delay` option instead. The
async cURL handler honors it inside its event loop: the request start is
deferred while other in-flight requests keep progressing. Throttles keep
using the blocking applyDelay() path, since an adaptive throttle needs
the response time to compute its next delay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant