Skip to content

Pipeline hardening: silent-failure findings from PR #2 review (cleaning/imputation phase) #3

@DougManuel

Description

@DougManuel

Findings from the pre-merge review of #2 (silent-failure audit + code review). These are the hardening tasks for the cleaning/imputation phase. Key framing: in a {targets} pipeline, a warning during a successful build is shown once and never again — the target caches as "up to date" — so warn-and-continue is effectively silent.

Critical

  1. load_study_data() warn-and-skips missing cycle files with no count reconciliation — 10 of 11 cycles loading builds a "successful" study_data on a truncated period range. Add a loaded-vs-configured check that stops on shortfall.
  2. No verification that rec_with_table() delivered the expected columns per cycle; bind_rows() fills silently-skipped derived variables with NA. This is the exact upstream failure class found during v3 integration (feeder gaps → all-NA). Add a per-cycle expected-columns check and an all-NA-by-cycle audit for critical-role variables.
  3. prepare_for_mice()'s claim that tagged NA(a)/NA(c) are protected is false for numeric variables: tagged_na() is NA to mice(), so structural missingness gets imputed (e.g. quit times invented for never-smokers). Build an explicit where matrix (TRUE only for NA(b) cells), restore tags on write-back, and drop "NA(a)"/"NA(c)" factor levels from the modelling frame. Also check mice_result$loggedEvents (silently dropped/collinear variables) — currently suppressed by printFlag = FALSE.
  4. APC builders and fit_binomial_apc() accept zero-event numerators — an all-NA age_first_cigarette yields a converged model encoding "nobody ever started smoking". Stop on empty numerator and sum(d) == 0.

Important
5. validate_cycle_coverage() defaults to warn-only for critical gaps (strict_validation: false in all profiles), downgrades to a warning when cfg$cchs_cycles is NULL, and exempts DerivedVar/empty-variableStart rows (exactly the variables rec_with_table silently drops). Make strict the default; surface skipped variables; consume the gap report somewhere.
6. interior_knots() silently drops out-of-range knots, and fit_apc_model() stores the configured (not used) knots in attr(fit, "knots") — Stage 9 basis reconstruction will mismatch the coefficients. Store actually-used knots; message on drops.
7. No fit$converged / fit$boundary check after glm() despite the docstring anticipating boundary failures.
8. Age restriction is skipped without any message when the age column is absent (split the condition: config-NULL = skip, missing column = stop).
9. Sex subsetting via data[[sex_col]] == 1 injects all-NA phantom rows when sex is NA and gives empty strata on coding mismatch; use which() and assert non-empty strata.
10. Assign imputation-predictor roles in cshm-variables.csv — no variable currently carries the role, so Stage 5 selects zero variables and silently no-ops.
11. Smaller: expand_denominator() drops NA-cohort persons via bare next (count them); check_skewness() makes all-NA variables vanish from the summary; select_vars_by_role(...)[1] yields NA when a role is unassigned (assert); type-conversion loop in load_study_data() doesn't guard as.numeric coercion NAs.

Also from review, design notes: Table 1 code selects by predictor/model-stratifier roles while CLAUDE.md documents table1/table1-stratifier as the wiring — reconcile code or docs. Initiation numerator periods are not clamped to [period_min, period_max] while the denominator is — confirm against the intended Holford construction.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions