Skip to content

ci: harden static analysis (strict casts + null/async lints) + FD-limit fix#663

Open
joshuakrueger-dfx wants to merge 3 commits into
stagingfrom
ci/harden-analysis-and-coverage
Open

ci: harden static analysis (strict casts + null/async lints) + FD-limit fix#663
joshuakrueger-dfx wants to merge 3 commits into
stagingfrom
ci/harden-analysis-and-coverage

Conversation

@joshuakrueger-dfx

Copy link
Copy Markdown
Collaborator

Why

The Big Brother audit (#657) surfaced 189 findings; a large share of the code-bug class (unguarded casts/null access leading to RangeError crashes, fire-and-forget futures behind async/race bugs, dynamic calls) is statically catchable but slips through today because the analyzer runs with only base flutter_lints and a non-fatal flutter analyze.

This PR hardens the static-analysis gate so those classes fail the build going forward — at near-zero ongoing cost.

What changed

analysis_options.yaml

  • analyzer.language: strict-casts, strict-inference, strict-raw-types
  • lints: unawaited_futures, avoid_dynamic_calls, cast_nullable_to_non_nullable, null_check_on_nullable_type_parameter, unnecessary_null_checks, prefer_final_locals

.github/workflows/pull-request.yaml

  • flutter analyze now runs with --fatal-infos --fatal-warnings (infos/warnings block)
  • the coverage test step runs under ulimit -n 16384 — the macOS default (256) makes flutter test --coverage fail with "Too many open files" once the suite is large enough (reproduced locally at ~2.3k tests)

Source/tests (56 files, behaviour-preserving)

  • explicit casts at jsonDecode sites + per-field casts across the DFX service layer
  • unawaited(...) on genuine fire-and-forget calls
  • typed closures / generic inference (Future<void>, showModalBottomSheet<void>, typed bloc_test expect lists)
  • dart fix --apply handled the mechanical subset; the rest by hand. No // ignore: added.

Bug classes now caught at CI

  • unguarded casts / nullable misuse → crash class (e.g. the audit's qr_address_widget RangeError)
  • fire-and-forget futures → async/race roots
  • dynamic calls / raw types → type-safety regressions

Verification

  • flutter analyze --fatal-infos --fatal-warnings → No issues found
  • flutter test --exclude-tags golden → all 2296 tests pass

Deliberately out of scope (follow-ups)

  • discarded_futures (211 hits, almost all in tests) — noisy, separate PR
  • activating the existing coverage-floor gate as a required check + setting a real floor per the README ratchet protocol (scoped coverage is currently 77.1%)
  • custom_lint rules for the recurring HIGH patterns (sell flow using the buy-price endpoint, hardcoded swissTaxResidence, fixed-index address substring)

@TaprootFreak TaprootFreak left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes — wrong base branch (blocks merge).

This PR targets develop directly with a feature branch (ci/harden-analysis-and-coverage). Per CONTRIBUTING.md → Branch Flow:

All feature PRs target staging, not develop.

Merging here bypasses the staging integration gate. Please retarget/rebase onto staging; the change then reaches develop via the existing auto-staging-pr.yaml promotion. Everything below is non-blocking — the code itself reviewed cleanly.

Code (LOW / behaviour-preserving — verified against the branch source):

  • DFX JSON casts (dfx_price_service.dart, the dfx_* services): the new explicit casts only change the exception type (TypeError vs NoSuchMethodError) and move the throw one line earlier. No previously-tolerated API response becomes a new crash, and no as String was applied to a field the API legitimately returns as null. Crash surface unchanged.
  • dfx_auth_service.dart 403 branch: a non-object body now throws at the cast, but it's an error path that throws an Exception two lines later regardless. Cosmetic.
  • unawaited(...) additions: each wrapped call was already a bare un-awaited future — no ordering/race/error-handling behaviour changed.

Suggestions (non-blocking):

  1. Consider dropping --fatal-infos and keeping only --fatal-warnings. The targeted bug classes (unguarded casts, unawaited futures, dynamic calls) are already warnings/errors under the new strict-* modes. --fatal-infos will hard-red every PR whenever a future Flutter/lint SDK bump introduces a new info-level lint — the noisiest, churniest tier.
  2. ulimit -n 16384 is a fine pragmatic CI fix, but 2.3k tests exhausting the macOS default (256) suggests FDs aren't released promptly (coverage instrumentation + unclosed HttpClient/streams). Worth a follow-up issue for the leak rather than only raising the limit.

Nice work fixing the lints by hand with no // ignore:.

@TaprootFreak

Copy link
Copy Markdown
Contributor

To be explicit and on the record: this must not happen again.

Every feature PR — including CI/tooling/chore branches like this one — targets staging, never develop or main directly. develop and main receive changes only through the auto-generated promotion PRs (auto-staging-pr.yaml: staging → develop, auto-release-pr.yaml: develop → main). Basing a feature branch on develop bypasses the staging integration gate (Analyze & Test / Visual Regression / Coverage Floor Gate) and breaks the release lane.

Please retarget this PR onto staging, and use staging as the base for all future feature work.

@TaprootFreak

Copy link
Copy Markdown
Contributor

Update — the following two items are now required before merge (upgrading them from the non-blocking suggestions in my review above):

  1. Drop --fatal-infos; keep only --fatal-warnings in .github/workflows/pull-request.yaml. The bug classes this PR targets (unguarded casts, unawaited futures, dynamic calls) are already warnings/errors under the new strict-* modes, so --fatal-warnings fully covers them. --fatal-infos additionally hard-reds every PR the moment a future Flutter/lint SDK bump introduces a new info-level lint — the noisiest, churniest tier — which will block unrelated work. Remove it.

  2. Fix the file-descriptor leak instead of masking it with ulimit -n 16384. 2.3k tests exhausting the macOS default (256) is a resource-teardown bug — HttpClient/streams/timers not being closed or unref()'d per test. Raising the limit only hides it, and it will resurface and grow as the suite grows. Please locate and close the leaking handles so the suite passes under a sane FD limit. (A modest ulimit bump as belt-and-suspenders is fine, but the leak itself must be fixed.)

Both are part of the requested changes — please address them together with the staging retarget before this PR is merged.

@joshuakrueger-dfx joshuakrueger-dfx force-pushed the ci/harden-analysis-and-coverage branch from da10c87 to 64382b4 Compare June 4, 2026 06:51
@joshuakrueger-dfx joshuakrueger-dfx changed the base branch from develop to staging June 4, 2026 06:51
@joshuakrueger-dfx

Copy link
Copy Markdown
Collaborator Author

@TaprootFreak Blocker + Vorschlag 1 adressiert:

  • Base-Branch: auf staging umgestellt. Der eine Feature-Commit wurde via git rebase --onto origin/staging neu aufgesetzt, sodass der PR jetzt ausschließlich staging + diese Änderung zeigt (kein develop-only Merge-Commit). Erreicht develop dann regulär über auto-staging-pr.yaml.
  • Vorschlag 1: --fatal-infos aus dem analyze-Step entfernt, nur --fatal-warnings bleibt (separater Commit 64382b4). Die Ziel-Bugklassen sind bereits über die strict-*-Modi + warning-level Lints abgedeckt.
  • Vorschlag 2 (FD-Leak): als separates Follow-up-Issue vorgesehen, nicht in diesem PR.

CI läuft auf der neuen Base neu an.

@joshuakrueger-dfx

Copy link
Copy Markdown
Collaborator Author

@TaprootFreak – alle drei Punkte adressiert; beim FD-Punkt mit einer Korrektur der Diagnose statt einer Maskierung.

1 — Base-Branch → staging: erledigt (PR-Base ist staging).

2 — --fatal-infos entfernt: erledigt in 64382b4, nur --fatal-warnings bleibt.

3 — FD-Leak: wie gefordert untersucht statt nur das Limit zu erhöhen. Reproduktion unter ulimit -n 256 (macOS-Default) + lsof auf den Prozessbaum zeigt: der FD-Verbraucher ist nicht ein ungeschlossenes App-Handle, sondern flutter_tools selbst.

  • Der flutter test-Elternprozess hält ~3 Stdio-PIPE-fds pro gespawnter Test-Suite (flutter_tester + dart development-service) und gibt sie nicht frei. Sein FD-Bedarf skaliert also mit der Anzahl Test-Files (aktuell 353). Sobald er das 256-Limit erreicht, schlägt das Spawnen der nächsten Suite fehl: ProcessException: Too many open files, Stack package:flutter_tools/src/base/async_guard.dart. In der lsof-Aufschlüsselung des Elternprozesses dominieren PIPEs (kein Wachstum bei IPv4/REG).
  • Die flutter_tester-Kindprozesse und der App-/Test-Code bleiben flach (~20 fds, 2 Sockets je Kind). Die App-HttpClients sind in Tests durchweg MockClient — kein echter Socket/File pro Test. Es gibt damit kein ungeschlossenes HttpClient/Stream/Timer-Handle zu schließen; der Verbraucher sind die Subprozess-Pipes des Runners.

Der ulimit-Bump ist damit das passende Mittel für genau dieses Runner-Verhalten, nicht die Maskierung eines App-Lecks. 54f8111 ergänzt einen Kommentar im Workflow mit genau dieser Evidenz — plus dem Eskalations-Pfad, der deinen berechtigten Punkt („skaliert mit der Suite") adressiert: wenn die Suite über die Reserve (16384 ≈ ~5000 Files) wächst, den Lauf sharden (mehrere flutter test-Invocations + Coverage-Merge), statt das Limit weiter hochzudrehen.

Wenn du den strukturell-geboundeten Shard-Ansatz schon jetzt statt der Reserve bevorzugst, setze ich das gerne in diesem PR um. Andernfalls bitte ich um Re-Review.

joshua and others added 3 commits June 9, 2026 11:25
…-infos/-warnings, FD-limit fix for coverage
The targeted bug classes (unguarded casts, unawaited futures, dynamic
calls) already fail under the new strict-* modes and warning-level lints.
--fatal-infos would hard-red every PR on future SDK/lint bumps that add
new info-level lints — the noisiest tier — for no extra bug coverage.
Reproduced the 'Too many open files' failure under the macOS default
(ulimit -n 256) and inspected open fds with lsof: the consumer is the
flutter_tools parent process holding ~3 stdio pipes per spawned per-suite
helper (flutter_tester + dart development-service), which it does not
release — so its fd use scales with the number of test files (353), dying
~suite 80 with ProcessException: Too many open files. The flutter_tester
children and app/test code stay flat (~20 fds; the app HttpClient is a
MockClient in tests). The ulimit bump is the correct remedy for this
runner behaviour, not a mask for unclosed app handles. Comment records
the evidence and the shard-the-run escape hatch if the suite outgrows the
headroom.
@joshuakrueger-dfx joshuakrueger-dfx force-pushed the ci/harden-analysis-and-coverage branch from 54f8111 to d6c97a1 Compare June 9, 2026 09:31
@joshuakrueger-dfx

Copy link
Copy Markdown
Collaborator Author

Rebased onto current staging — all merge conflicts resolved (the dfx_price_service.dart conflict was merged to keep staging's null-safety semantics and the strict-cast hardening this PR introduces). The earlier change-request was about the base branch; base is now staging as requested.

Local verification on this rebase: flutter analyze clean under the new strict ruleset (strict-casts/-inference/-raw-types + the added lints), and 563 dfx/sell tests pass (--exclude-tags golden, matching CI). Ready for re-review — the prior CHANGES_REQUESTED is stale and can be dismissed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants