Make delay() non-blocking so it composes with concurrency#507
Open
freekmurze wants to merge 1 commit into
Open
Make delay() non-blocking so it composes with concurrency#507freekmurze wants to merge 1 commit into
freekmurze wants to merge 1 commit into
Conversation
A configured delay() was applied as a blocking usleep() inside the fulfilled/failed pool callbacks. Because the whole crawl is driven by a single PHP thread inside Pool::promise()->wait(), that sleep froze the event loop and stalled every other in-flight request, effectively serializing the crawl and cancelling out concurrency(). Apply the delay through Guzzle's per-request `delay` option instead. The async cURL handler honors it inside its event loop: the request start is deferred while other in-flight requests keep progressing. Throttles keep using the blocking applyDelay() path, since an adaptive throttle needs the response time to compute its next delay. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes the concurrency bug demonstrated in #506.
The short version
When you combine
concurrency()anddelay(), the delay quietly cancels out the concurrency. This PR makesdelay()non-blocking so the two compose the way you'd expect.What is happening now
You configure the crawler like this:
The whole crawl runs on one PHP thread inside
Pool::promise()->wait(). Thedelay()is ausleep()called from the fulfilled/failed callbacks.usleep()is a hard freeze of that one thread.So when one request finishes and hits its 100ms sleep, the freeze stops everything, including the network progress of the other 19 requests that were downloading in parallel. The delays can't overlap with the parallel work, they stack up one after another.
gantt title BEFORE - every delay freezes the one thread, so nothing overlaps (1 block = 100ms) dateFormat X axisFormat %S section One thread req 1 download :active, 0, 3 delay (usleep) :crit, 3, 4 req 2 download :active, 4, 7 delay (usleep) :crit, 7, 8 req 3 download :active, 8, 11With 21 pages and a 100ms delay, that is roughly
21 * 100ms = 2.1sof pure frozen time added on top of the crawl. The more delay you add, the lessconcurrency()does.Why it's wrong
delay()is meant to pace when requests start (be polite, avoid rate limits). It is not meant to block requests that are already in flight. Today it does the second thing, which meansconcurrency()anddelay()are effectively mutually exclusive. You can't say "crawl in parallel, but gently".How we do better
Stop sleeping inside the callbacks. Hand the delay to Guzzle as a per-request
delayoption instead. The async cURL handler honors that option inside its own event loop: it defers when a request starts, but it keeps pumping the other in-flight requests while it waits. No frozen thread.Now
delayonly holds back its own request. Below,concurrency(2)keeps two requests in flight. When req 1 finishes and frees a slot, req 3 is pulled and waits its own 100ms delay, but req 2 keeps downloading the whole time (the delay never freezes it):gantt title AFTER - each request waits its own delay; others keep downloading (1 block = 100ms) dateFormat X axisFormat %S section req 1 delay :crit, 0, 1 download :active, 1, 3 section req 2 delay :crit, 0, 1 download :active, 1, 4 section req 3 (next) waits delay :crit, 3, 4 download :active, 4, 6req 3's red
delayblock (300-400ms) sits directly over req 2's still-running download. The delay paces when each request starts; it no longer stalls work that is already happening, so a delay shorter than the network time is absorbed instead of added.Throttles (
FixedDelayThrottle/AdaptiveThrottle) keep the existing blocking path on purpose: an adaptive throttle needs the response time to compute its next delay, so it can only run after a request completes.The change
flowchart LR A["delay() = usleep()<br/>in the response callback"] -->|freezes the<br/>single thread| B["pool stalls,<br/>concurrency lost"] C["delay() = Guzzle per-request<br/>'delay' request option"] -->|handled in the<br/>event loop| D["other requests<br/>keep flowing"]Crawler::startCrawlingQueue()passes the delay as a per-requestdelayoption on the pool (only when no throttle is set).Crawler::applyDelay()now only sleeps for throttles.FakeHandlerhonors thedelayoption too (mirrors Guzzle's real handlers), so tests that rely on delay timing keep working.Tests
does not let delay block concurrent in-flight requests, based on the one in Add failing test for delay blocking concurrency #506) that fails on the old code (~2.8s) and passes here (~1.2s).timeLimittests asserted an exact crawl count (4) that depended on the old "count, then sleep" ordering and on a 1-second-resolution clock. With non-blocking delay the request start is deferred before the URL is counted, so the exact number completed inside a fixed budget shifts by about one. They now assert the meaningful behavior (the limit is enforced, the crawl is bounded, and a per-execution limit eventually finishes) rather than a brittle exact number.