fix(ci): patch run-tests.php getTimer() to return float instead of comma-formatted string by Leiyks · Pull Request #3856 · DataDog/dd-trace-php

Leiyks · 2026-05-05T12:07:42Z

What

Add a second sed patch to $(BUILD_DIR)/run-tests.php at build time, replacing number_format( with round( inside getTimer()

Why

PHP's run-tests.php getTimer() returns number_format($elapsed, 4), which produces a comma-formatted string (e.g. "1,500.0000") when a test takes ≥1000 seconds. PHP 8.0+ raises E_WARNING: A non-numeric value encountered when that string is used in += arithmetic inside record(). The main test runner process treats this as a fatal worker error and exits with code 2.

This started surfacing ~1 week ago because commit fdbad29 moved DD_EXPERIMENTAL_PROPAGATE_PROCESS_TAGS_ENABLED=0 from ALL_TEST_ENV_OVERRIDE (covers all test targets, including valgrind) into ENV_OVERRIDE (covers only the non-valgrind run). The valgrind run now executes with process tags enabled, making live-debugger and sidecar-related tests 5–20× slower under valgrind's instrumentation, pushing some past the 1000s threshold.

How

We already patch run-tests.php at build time for an unrelated dl() issue. This adds a second patch in the same $(BUILD_DIR)/run-tests.php Makefile target:

sed -i 's/return number_format($$this->rootSuite/return round($$this->rootSuite/' $(BUILD_DIR)/run-tests.php

round($elapsed, 4) returns a float (e.g. 1999.5678) instead of a comma-formatted string, which is safe for += arithmetic. The JUnit XML time='...' attribute also benefits — 1999.5678 is more spec-compliant than 1,999.5678.

Verified the sed pattern matches both PHP 8.4.19 (line 3441) and PHP 8.5.0 (line 3433).

Testing

CI: test_extension_ci: [8.4] and test_extension_ci: [8.5] valgrind runs no longer abort with E_WARNING: A non-numeric value encountered in run-tests.php

datadog-datadog-prod-us1-2 · 2026-05-05T12:19:29Z

Tests

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 36 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | benchmarks-tracer

See error
Your requirements could not be resolved due to security advisories affecting certain package versions.

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.4]

See error
Connection timed out when trying to access the request-replayer service, leading to multiple test failures.

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [8.0]

See error
Fatal error: Uncaught Exception: wait for replay timeout in /go/src/github.com/DataDog/apm-reliability/dd-trace-php/tmp/build_extension/tests/ext/includes/request_replayer.inc:112

View all 36 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🔄 Datadog auto-retried 1 job - 0 passed on retry

🎯 Code Coverage (details)
• Patch Coverage: 100.00%
• Overall Coverage: 60.74% (+0.00%)

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 4617c36 | Docs | Datadog PR Page | Give us feedback!}

pr-commenter · 2026-05-07T15:17:56Z

Benchmarks [ tracer ]

Benchmark execution time: 2026-06-03 12:16:25

Comparing candidate commit 88460ed in PR branch leiyks/fix-ci-run-tests-number-format with baseline commit 095538e in branch master.

Found 1 performance improvements and 1 performance regressions! Performance is the same for 192 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

🟩 = significantly better candidate vs. baseline
🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:MessagePackSerializationBench/benchMessagePackSerialization-opcache

🟥 execution_time [+4.821µs; +6.379µs] or [+4.670%; +6.180%]

scenario:TraceFlushBench/benchFlushTrace-opcache

🟩 execution_time [-59.200µs; -34.700µs] or [-4.835%; -2.834%]

…= 7.2) Lists tests slower than 10s with their durations at the end of the run. The --show-slow flag only exists in PHP 7.2+, so it's gated by a runtime PHP_VERSION_ID check to avoid breaking PHP 7.0/7.1 jobs. Used to verify the hypothesis that recent CI failures are caused by tests exceeding 1000s under valgrind (triggering the number_format E_WARNING in run-tests.php's record() method).

…lgrind tests PHP's run-tests.php getTimer() returns number_format($elapsed, 4) which produces a comma-formatted string (e.g. "1,500.0000") when a test takes >=1000 seconds. PHP 8.0+ raises E_WARNING when that string is used in += arithmetic inside record(), causing the test runner to abort with exit code 2. Patch getTimer() at build time so it returns a float (via round()) instead.

The CI .base_test before_script runs wait-for-service-ready.sh, but WAIT_FOR only listed test-agent — request-replayer was never gated on, even though it is in the agent_httpbin_service() service block used by test_extension_ci, ASAN test_c, and ASAN test_c with multiple observers. Add a request-replayer case to detect_service_type (probing /replay, a read-only endpoint that returns valid JSON when php -S is up), and add request-replayer:80 to WAIT_FOR for the three affected job blocks. PHP Language Tests is unchanged: it only declares test-agent in services.

curl -sf fails on 4xx, which would falsely report not-ready if /replay returns 4xx when no dump exists yet (the natural startup state). Drop -f so any HTTP response proves php -S is up and executing index.php; connection failure remains the only signal of an unhealthy service.

Filters the SLOW TEST SUMMARY to tests that meaningfully approach the per-test 300s timeout, trimming noise from the long tail.

On PHP 8.4 the test sandbox integration registers early enough that the worker bundles it into the app-started payload's `integrations` field (see libdd-telemetry build_app_started), so no separate app-integrations-change event is emitted and the existing assertion loop times out. Accept both delivery paths: the dedicated app-integrations-change event, and the integrations array embedded in app-started. The expected output shape is identical because Rust Integration struct field order matches.

…ix flaky PHP 8.0 valgrind The test guarded a sleep(20) with `if (ini_get(...) != 0.5)`, but the wrapped sleep() blocks SIGVTALRM during the call. Under valgrind on PHP 8.0, the heavier signal traffic can cut the sleep short before the sidecar fetches the config from the request-replayer and propagates it to INI globals, leaving every dynamic value at its default. Replace the single sleep() with a 100ms polling loop bounded at 20s total. Each usleep() iteration goes through the same SIGVTALRM-blocking wrapper, which calls ddtrace_check_for_new_config_now() on unblock under valgrind, giving the test repeated, well-defined opportunities to apply the config while still finishing early on the happy path.

The previous warmup was a no-op: $start = microtime(true); ... usleep(floor(microtime(true) - $start) * 100000); floor() of a sub-second float yields 0, so usleep(0) did nothing. Also, the sleep happened *after* the span had already been sampled with an empty env, so even a real delay wouldn't have helped. Replace it with dd_trace_internal_fn('await_agent_info') called before opening the span — same pattern that client_side_stats*.phpt already uses to gate on the sidecar having read /info from the agent.

…c on timeout When a waitFor* helper exhausts maxIteration, dump to stdout: - The session token in use - The HTTP status code returned by the replayer for that session - The HTTP status + body length when the same endpoint is queried without the session token This closes the visibility gap on flaky CI failures where the test output currently shows only a generic 'HTTP request failed!' warning followed by 'process timed out'. With the diagnostic, the next failing run will tell us whether the replayer returned 4xx/5xx, whether the expected data was present under a different session bucket, or whether the connection failed at the transport layer. Also suppresses the per-retry file_get_contents warning (the @-prefix) so flaky polling no longer pollutes test stdout with N warning lines during retries — only the final structured diagnostic appears on genuine timeouts. No-op on passing tests (the diagnostic path runs only just before throw).

…before run-tests external kill Previous instrumentation only emitted the diagnostic when the iteration count reached 500. Under valgrind each iteration is 10-20x slower than nominal, so the iteration-cap path may take 250s+ wall-clock — past run-tests.php's per-test external kill timer. The diagnostic never fires because the process is killed first. Add an explicit wall-clock budget (60s default) that aborts the wait and dumps the diagnostic well before the external kill. Pass elapsed seconds through to the diagnostic so we can see how much wall-time was actually spent waiting.

The first diagnostic that fired on this PR revealed two open questions it couldn't answer: 1. Of the bytes the replayer returns for the test session, which URIs are present? (If the response is all telemetry, the test filter strips it and replayRequest() returns empty even though /replay returned 200 with a non-empty body.) 2. Are individual /replay calls slow, or is the slowness elsewhere in the loop? Extend the diagnostic to: - JSON-decode each /replay response and emit a URI=count summary, so we can tell whether the test's expected payload kind (e.g. /v0.4/traces) actually arrived. - Time each of the diagnostic's own /replay calls (fetch_ms) so we can compare against the wall-time the wait loop spent — if the diagnostic's own calls are fast (say <100ms) but the loop took 30s per attempt, the bottleneck is not the HTTP call.

…rs fail Two probes used to diagnose the remaining valgrind flake classes: 1. await_probe_installation (live_debugger.inc): add a 30s wall-clock cap alongside the 15000-iteration cap so this fires before run-tests external timeout under valgrind-stretched iterations. On incomplete install (loop exited without seeing enough hook IDs), dump get_loaded_remote_configs to show which RC configs ARE loaded on the extension side, plus elapsed wall time and remaining probe count. No-op for tests where the probe installs in time. 2. client_side_stats_trace_filters.phpt: before the 'in traces:' / 'in stats:' dumps, snapshot dd_trace_internal_fn('get_agent_info') to show whether the filter_tags / filter_tags_regex / ignore_resources rules from the PUT /set-agent-info in SKIPIF actually propagated to the extension. If they didn't, the trace filter regression has its root cause upstream of the filter-application code. Switches EXPECT -> EXPECTF so the agent_info JSON is matched with a %s wildcard regardless of contents (passing runs still pass).

…uting EXPECT output The PROBE-DIAG/RR-DIAG/FILTER-DIAG dumps were added to investigate flaky valgrind failures. They served their purpose (identified replayer session-token isolation leaks and probe-install regressions) but a false-positive in the await_probe iteration heuristic flipped 2 otherwise-passing tests to failing (debugger_metric_probe on 7.4, debugger_log_probe_process_tags on 8.1). PHPT merges stderr into stdout, so routing diagnostics to stderr would not help. Keep the wall-clock cap added in 2f7181f — that prevents run-tests.php's external per-test kill from firing before our throw on valgrind-stretched runs, which was the real fix.

Picks up libdatadog leiyks/instrument-info-fetcher (170c989b7): "test(sidecar): log every /info poll when an endpoint test_token is set" When fetch_endpoint.test_token.is_some() (i.e. we are in a PHPT/test context), the AgentInfoFetcher now emits one warn! line per /info poll into sidecar.log capturing endpoint URL, token, elapsed_ms, and the fetch result (SameState | NewState hash_prefix=... | Err(...)). Production behaviour (test_token == None) is unchanged. No polling interval or HTTP timeout was touched. This is diagnostic-only and exists so the next CI run gives per-test, per-poll evidence to pin down which failure mode is hitting agent_info-dependent flakes (dd_trace_agent_env, dynamic_config_*, client_side_stats_*, ffe/*, debugger_*) instead of fixing on hypothesis.

Picks up libdatadog leiyks/instrument-info-fetcher tip db205a39c: fix(sidecar): poll /info every 1s instead of 60s when test_token is set Per the per-poll instrumentation added in 170c989b7, 6 of the 13 failing tests on the previous pipeline (sha 88460ed) hit the same first-poll race: the AgentInfoFetcher's first poll lands on `{}` (the replayer's pre-PUT response, sha256 prefix 44136fa355b3678a), caches that state, and the next poll is 60s away — well past any test's await_agent_info budget. The fast-retry path drops that gap to 1s while `test_token` is set; production is unaffected. Tests this directly fixes (empty `{}` on first poll, observed in sidecar.log of pipeline 88460ed): - ffe/remote_config_lifecycle (PHP 7.4, 8.0) - live-debugger/log_probe (PHP 7.4) - live-debugger/span_decoration_probe (PHP 7.4) - remote-config/dynamic_config_update (PHP 8.0) - live-debugger/enable_dynamic_config (PHP 8.3) The remaining 7 failures in that pipeline are distinct issues (replayer /replay data missing, /v0.6/stats wait timeout, sidecar trace flush timeout, dd_trace_agent_env env-still-empty-despite-non-empty-info) and are NOT addressed by this commit.

Leiyks force-pushed the leiyks/fix-ci-run-tests-number-format branch 3 times, most recently from a52010c to fe39d11 Compare May 7, 2026 13:55

Leiyks marked this pull request as ready for review May 7, 2026 15:01

Leiyks requested a review from a team as a code owner May 7, 2026 15:01

Leiyks force-pushed the leiyks/fix-ci-run-tests-number-format branch 7 times, most recently from a6282d7 to 3f2de5f Compare May 18, 2026 10:58

Leiyks added 8 commits May 19, 2026 15:04

ci: raise --show-slow threshold from 10s to 30s

ef06ce7

Filters the SLOW TEST SUMMARY to tests that meaningfully approach the per-test 300s timeout, trimming noise from the long tail.

Leiyks force-pushed the leiyks/fix-ci-run-tests-number-format branch from 5bbd9e6 to d26b162 Compare May 19, 2026 13:05

Leiyks and others added 7 commits June 1, 2026 13:15

Merge branch 'master' into leiyks/fix-ci-run-tests-number-format

575fe6b

Leiyks requested review from a team as code owners June 3, 2026 10:58

Leiyks requested review from dd-oleksii and greghuels and removed request for a team, dd-oleksii and greghuels June 3, 2026 10:58

Leiyks marked this pull request as draft June 3, 2026 13:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci): patch run-tests.php getTimer() to return float instead of comma-formatted string#3856

fix(ci): patch run-tests.php getTimer() to return float instead of comma-formatted string#3856
Leiyks wants to merge 16 commits into
masterfrom
leiyks/fix-ci-run-tests-number-format

Leiyks commented May 5, 2026

Uh oh!

datadog-datadog-prod-us1-2 Bot commented May 5, 2026 •

edited

Loading

Uh oh!

pr-commenter Bot commented May 7, 2026 •

edited

Loading

Explanation

More details about the CI and significant changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Leiyks commented May 5, 2026

What

Why

How

Testing

Uh oh!

datadog-datadog-prod-us1-2 Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

ℹ️ Info

Uh oh!

pr-commenter Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks [ tracer ]

Explanation

More details about the CI and significant changes

scenario:MessagePackSerializationBench/benchMessagePackSerialization-opcache

scenario:TraceFlushBench/benchFlushTrace-opcache

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

datadog-datadog-prod-us1-2 Bot commented May 5, 2026 •

edited

Loading

pr-commenter Bot commented May 7, 2026 •

edited

Loading