Skip to content

fix(ci): patch run-tests.php getTimer() to return float instead of comma-formatted string#3856

Draft
Leiyks wants to merge 16 commits into
masterfrom
leiyks/fix-ci-run-tests-number-format
Draft

fix(ci): patch run-tests.php getTimer() to return float instead of comma-formatted string#3856
Leiyks wants to merge 16 commits into
masterfrom
leiyks/fix-ci-run-tests-number-format

Conversation

@Leiyks
Copy link
Copy Markdown
Contributor

@Leiyks Leiyks commented May 5, 2026

What

  • Add a second sed patch to $(BUILD_DIR)/run-tests.php at build time, replacing number_format( with round( inside getTimer()

Why

PHP's run-tests.php getTimer() returns number_format($elapsed, 4), which produces a comma-formatted string (e.g. "1,500.0000") when a test takes ≥1000 seconds. PHP 8.0+ raises E_WARNING: A non-numeric value encountered when that string is used in += arithmetic inside record(). The main test runner process treats this as a fatal worker error and exits with code 2.

This started surfacing ~1 week ago because commit fdbad29 moved DD_EXPERIMENTAL_PROPAGATE_PROCESS_TAGS_ENABLED=0 from ALL_TEST_ENV_OVERRIDE (covers all test targets, including valgrind) into ENV_OVERRIDE (covers only the non-valgrind run). The valgrind run now executes with process tags enabled, making live-debugger and sidecar-related tests 5–20× slower under valgrind's instrumentation, pushing some past the 1000s threshold.

How

We already patch run-tests.php at build time for an unrelated dl() issue. This adds a second patch in the same $(BUILD_DIR)/run-tests.php Makefile target:

sed -i 's/return number_format($$this->rootSuite/return round($$this->rootSuite/' $(BUILD_DIR)/run-tests.php

round($elapsed, 4) returns a float (e.g. 1999.5678) instead of a comma-formatted string, which is safe for += arithmetic. The JUnit XML time='...' attribute also benefits — 1999.5678 is more spec-compliant than 1,999.5678.

Verified the sed pattern matches both PHP 8.4.19 (line 3441) and PHP 8.5.0 (line 3433).

Testing

  • CI: test_extension_ci: [8.4] and test_extension_ci: [8.5] valgrind runs no longer abort with E_WARNING: A non-numeric value encountered in run-tests.php

@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

datadog-datadog-prod-us1-2 Bot commented May 5, 2026

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 36 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | benchmarks-tracer   View in Datadog   GitLab

See error Your requirements could not be resolved due to security advisories affecting certain package versions.

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.4]   View in Datadog   GitLab

See error Connection timed out when trying to access the request-replayer service, leading to multiple test failures.

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [8.0]   View in Datadog   GitLab

See error Fatal error: Uncaught Exception: wait for replay timeout in /go/src/github.com/DataDog/apm-reliability/dd-trace-php/tmp/build_extension/tests/ext/includes/request_replayer.inc:112

View all 36 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🔄 Datadog auto-retried 1 job - 0 passed on retry View in Datadog

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.74% (+0.00%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 4617c36 | Docs | Datadog PR Page | Give us feedback!

@Leiyks Leiyks force-pushed the leiyks/fix-ci-run-tests-number-format branch 3 times, most recently from a52010c to fe39d11 Compare May 7, 2026 13:55
@Leiyks Leiyks marked this pull request as ready for review May 7, 2026 15:01
@Leiyks Leiyks requested a review from a team as a code owner May 7, 2026 15:01
@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 7, 2026

Benchmarks [ tracer ]

Benchmark execution time: 2026-06-03 12:16:25

Comparing candidate commit 88460ed in PR branch leiyks/fix-ci-run-tests-number-format with baseline commit 095538e in branch master.

Found 1 performance improvements and 1 performance regressions! Performance is the same for 192 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:MessagePackSerializationBench/benchMessagePackSerialization-opcache

  • 🟥 execution_time [+4.821µs; +6.379µs] or [+4.670%; +6.180%]

scenario:TraceFlushBench/benchFlushTrace-opcache

  • 🟩 execution_time [-59.200µs; -34.700µs] or [-4.835%; -2.834%]

@Leiyks Leiyks force-pushed the leiyks/fix-ci-run-tests-number-format branch 7 times, most recently from a6282d7 to 3f2de5f Compare May 18, 2026 10:58
Leiyks added 8 commits May 19, 2026 15:04
…= 7.2)

Lists tests slower than 10s with their durations at the end of the run.
The --show-slow flag only exists in PHP 7.2+, so it's gated by a runtime
PHP_VERSION_ID check to avoid breaking PHP 7.0/7.1 jobs.

Used to verify the hypothesis that recent CI failures are caused by tests
exceeding 1000s under valgrind (triggering the number_format E_WARNING
in run-tests.php's record() method).
…lgrind tests

PHP's run-tests.php getTimer() returns number_format($elapsed, 4) which
produces a comma-formatted string (e.g. "1,500.0000") when a test takes
>=1000 seconds. PHP 8.0+ raises E_WARNING when that string is used in +=
arithmetic inside record(), causing the test runner to abort with exit code 2.

Patch getTimer() at build time so it returns a float (via round()) instead.
The CI .base_test before_script runs wait-for-service-ready.sh, but
WAIT_FOR only listed test-agent — request-replayer was never gated on,
even though it is in the agent_httpbin_service() service block used by
test_extension_ci, ASAN test_c, and ASAN test_c with multiple observers.

Add a request-replayer case to detect_service_type (probing /replay, a
read-only endpoint that returns valid JSON when php -S is up), and add
request-replayer:80 to WAIT_FOR for the three affected job blocks.

PHP Language Tests is unchanged: it only declares test-agent in services.
curl -sf fails on 4xx, which would falsely report not-ready if /replay
returns 4xx when no dump exists yet (the natural startup state). Drop
-f so any HTTP response proves php -S is up and executing index.php;
connection failure remains the only signal of an unhealthy service.
Filters the SLOW TEST SUMMARY to tests that meaningfully approach
the per-test 300s timeout, trimming noise from the long tail.
On PHP 8.4 the test sandbox integration registers early enough that the
worker bundles it into the app-started payload's `integrations` field
(see libdd-telemetry build_app_started), so no separate
app-integrations-change event is emitted and the existing assertion
loop times out.

Accept both delivery paths: the dedicated app-integrations-change event,
and the integrations array embedded in app-started. The expected output
shape is identical because Rust Integration struct field order matches.
…ix flaky PHP 8.0 valgrind

The test guarded a sleep(20) with `if (ini_get(...) != 0.5)`, but the
wrapped sleep() blocks SIGVTALRM during the call. Under valgrind on
PHP 8.0, the heavier signal traffic can cut the sleep short before the
sidecar fetches the config from the request-replayer and propagates it
to INI globals, leaving every dynamic value at its default.

Replace the single sleep() with a 100ms polling loop bounded at 20s
total. Each usleep() iteration goes through the same SIGVTALRM-blocking
wrapper, which calls ddtrace_check_for_new_config_now() on unblock under
valgrind, giving the test repeated, well-defined opportunities to apply
the config while still finishing early on the happy path.
The previous warmup was a no-op:

  $start = microtime(true);
  ...
  usleep(floor(microtime(true) - $start) * 100000);

floor() of a sub-second float yields 0, so usleep(0) did nothing. Also,
the sleep happened *after* the span had already been sampled with an
empty env, so even a real delay wouldn't have helped.

Replace it with dd_trace_internal_fn('await_agent_info') called before
opening the span — same pattern that client_side_stats*.phpt already
uses to gate on the sidecar having read /info from the agent.
@Leiyks Leiyks force-pushed the leiyks/fix-ci-run-tests-number-format branch from 5bbd9e6 to d26b162 Compare May 19, 2026 13:05
Leiyks and others added 7 commits June 1, 2026 13:15
…c on timeout

When a waitFor* helper exhausts maxIteration, dump to stdout:
  - The session token in use
  - The HTTP status code returned by the replayer for that session
  - The HTTP status + body length when the same endpoint is queried
    without the session token

This closes the visibility gap on flaky CI failures where the test
output currently shows only a generic 'HTTP request failed!' warning
followed by 'process timed out'. With the diagnostic, the next failing
run will tell us whether the replayer returned 4xx/5xx, whether the
expected data was present under a different session bucket, or whether
the connection failed at the transport layer.

Also suppresses the per-retry file_get_contents warning (the @-prefix)
so flaky polling no longer pollutes test stdout with N warning lines
during retries — only the final structured diagnostic appears on
genuine timeouts.

No-op on passing tests (the diagnostic path runs only just before throw).
…before run-tests external kill

Previous instrumentation only emitted the diagnostic when the
iteration count reached 500. Under valgrind each iteration is 10-20x
slower than nominal, so the iteration-cap path may take 250s+ wall-clock
— past run-tests.php's per-test external kill timer. The diagnostic
never fires because the process is killed first.

Add an explicit wall-clock budget (60s default) that aborts the wait
and dumps the diagnostic well before the external kill. Pass elapsed
seconds through to the diagnostic so we can see how much wall-time was
actually spent waiting.
The first diagnostic that fired on this PR revealed two open questions
it couldn't answer:
  1. Of the bytes the replayer returns for the test session, which URIs
     are present? (If the response is all telemetry, the test filter
     strips it and replayRequest() returns empty even though /replay
     returned 200 with a non-empty body.)
  2. Are individual /replay calls slow, or is the slowness elsewhere in
     the loop?

Extend the diagnostic to:
  - JSON-decode each /replay response and emit a URI=count summary,
    so we can tell whether the test's expected payload kind (e.g.
    /v0.4/traces) actually arrived.
  - Time each of the diagnostic's own /replay calls (fetch_ms) so we
    can compare against the wall-time the wait loop spent — if the
    diagnostic's own calls are fast (say <100ms) but the loop took 30s
    per attempt, the bottleneck is not the HTTP call.
…rs fail

Two probes used to diagnose the remaining valgrind flake classes:

1. await_probe_installation (live_debugger.inc): add a 30s wall-clock
   cap alongside the 15000-iteration cap so this fires before run-tests
   external timeout under valgrind-stretched iterations. On incomplete
   install (loop exited without seeing enough hook IDs), dump
   get_loaded_remote_configs to show which RC configs ARE loaded on
   the extension side, plus elapsed wall time and remaining probe count.
   No-op for tests where the probe installs in time.

2. client_side_stats_trace_filters.phpt: before the 'in traces:' /
   'in stats:' dumps, snapshot dd_trace_internal_fn('get_agent_info')
   to show whether the filter_tags / filter_tags_regex / ignore_resources
   rules from the PUT /set-agent-info in SKIPIF actually propagated to
   the extension. If they didn't, the trace filter regression has its
   root cause upstream of the filter-application code. Switches EXPECT
   -> EXPECTF so the agent_info JSON is matched with a %s wildcard
   regardless of contents (passing runs still pass).
…uting EXPECT output

The PROBE-DIAG/RR-DIAG/FILTER-DIAG dumps were added to investigate flaky valgrind
failures. They served their purpose (identified replayer session-token isolation
leaks and probe-install regressions) but a false-positive in the await_probe
iteration heuristic flipped 2 otherwise-passing tests to failing
(debugger_metric_probe on 7.4, debugger_log_probe_process_tags on 8.1).
PHPT merges stderr into stdout, so routing diagnostics to stderr would not help.

Keep the wall-clock cap added in 2f7181f — that prevents run-tests.php's
external per-test kill from firing before our throw on valgrind-stretched runs,
which was the real fix.
Picks up libdatadog leiyks/instrument-info-fetcher (170c989b7):
"test(sidecar): log every /info poll when an endpoint test_token is set"

When fetch_endpoint.test_token.is_some() (i.e. we are in a PHPT/test
context), the AgentInfoFetcher now emits one warn! line per /info poll
into sidecar.log capturing endpoint URL, token, elapsed_ms, and the
fetch result (SameState | NewState hash_prefix=... | Err(...)).

Production behaviour (test_token == None) is unchanged. No polling
interval or HTTP timeout was touched. This is diagnostic-only and
exists so the next CI run gives per-test, per-poll evidence to pin
down which failure mode is hitting agent_info-dependent flakes
(dd_trace_agent_env, dynamic_config_*, client_side_stats_*, ffe/*,
debugger_*) instead of fixing on hypothesis.
@Leiyks Leiyks requested review from a team as code owners June 3, 2026 10:58
@Leiyks Leiyks requested review from dd-oleksii and greghuels and removed request for a team, dd-oleksii and greghuels June 3, 2026 10:58
Picks up libdatadog leiyks/instrument-info-fetcher tip db205a39c:
  fix(sidecar): poll /info every 1s instead of 60s when test_token is set

Per the per-poll instrumentation added in 170c989b7, 6 of the 13 failing
tests on the previous pipeline (sha 88460ed) hit the same first-poll
race: the AgentInfoFetcher's first poll lands on `{}` (the replayer's
pre-PUT response, sha256 prefix 44136fa355b3678a), caches that state,
and the next poll is 60s away — well past any test's await_agent_info
budget. The fast-retry path drops that gap to 1s while `test_token` is
set; production is unaffected.

Tests this directly fixes (empty `{}` on first poll, observed in
sidecar.log of pipeline 88460ed):
  - ffe/remote_config_lifecycle (PHP 7.4, 8.0)
  - live-debugger/log_probe (PHP 7.4)
  - live-debugger/span_decoration_probe (PHP 7.4)
  - remote-config/dynamic_config_update (PHP 8.0)
  - live-debugger/enable_dynamic_config (PHP 8.3)

The remaining 7 failures in that pipeline are distinct issues (replayer
/replay data missing, /v0.6/stats wait timeout, sidecar trace flush
timeout, dd_trace_agent_env env-still-empty-despite-non-empty-info)
and are NOT addressed by this commit.
@Leiyks Leiyks marked this pull request as draft June 3, 2026 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant