Findings from the pre-merge review of #2 (silent-failure audit + code review). These are the hardening tasks for the cleaning/imputation phase. Key framing: in a {targets} pipeline, a warning during a successful build is shown once and never again — the target caches as "up to date" — so warn-and-continue is effectively silent.
Critical
load_study_data() warn-and-skips missing cycle files with no count reconciliation — 10 of 11 cycles loading builds a "successful" study_data on a truncated period range. Add a loaded-vs-configured check that stops on shortfall.
- No verification that
rec_with_table() delivered the expected columns per cycle; bind_rows() fills silently-skipped derived variables with NA. This is the exact upstream failure class found during v3 integration (feeder gaps → all-NA). Add a per-cycle expected-columns check and an all-NA-by-cycle audit for critical-role variables.
prepare_for_mice()'s claim that tagged NA(a)/NA(c) are protected is false for numeric variables: tagged_na() is NA to mice(), so structural missingness gets imputed (e.g. quit times invented for never-smokers). Build an explicit where matrix (TRUE only for NA(b) cells), restore tags on write-back, and drop "NA(a)"/"NA(c)" factor levels from the modelling frame. Also check mice_result$loggedEvents (silently dropped/collinear variables) — currently suppressed by printFlag = FALSE.
- APC builders and
fit_binomial_apc() accept zero-event numerators — an all-NA age_first_cigarette yields a converged model encoding "nobody ever started smoking". Stop on empty numerator and sum(d) == 0.
Important
5. validate_cycle_coverage() defaults to warn-only for critical gaps (strict_validation: false in all profiles), downgrades to a warning when cfg$cchs_cycles is NULL, and exempts DerivedVar/empty-variableStart rows (exactly the variables rec_with_table silently drops). Make strict the default; surface skipped variables; consume the gap report somewhere.
6. interior_knots() silently drops out-of-range knots, and fit_apc_model() stores the configured (not used) knots in attr(fit, "knots") — Stage 9 basis reconstruction will mismatch the coefficients. Store actually-used knots; message on drops.
7. No fit$converged / fit$boundary check after glm() despite the docstring anticipating boundary failures.
8. Age restriction is skipped without any message when the age column is absent (split the condition: config-NULL = skip, missing column = stop).
9. Sex subsetting via data[[sex_col]] == 1 injects all-NA phantom rows when sex is NA and gives empty strata on coding mismatch; use which() and assert non-empty strata.
10. Assign imputation-predictor roles in cshm-variables.csv — no variable currently carries the role, so Stage 5 selects zero variables and silently no-ops.
11. Smaller: expand_denominator() drops NA-cohort persons via bare next (count them); check_skewness() makes all-NA variables vanish from the summary; select_vars_by_role(...)[1] yields NA when a role is unassigned (assert); type-conversion loop in load_study_data() doesn't guard as.numeric coercion NAs.
Also from review, design notes: Table 1 code selects by predictor/model-stratifier roles while CLAUDE.md documents table1/table1-stratifier as the wiring — reconcile code or docs. Initiation numerator periods are not clamped to [period_min, period_max] while the denominator is — confirm against the intended Holford construction.
Findings from the pre-merge review of #2 (silent-failure audit + code review). These are the hardening tasks for the cleaning/imputation phase. Key framing: in a {targets} pipeline, a warning during a successful build is shown once and never again — the target caches as "up to date" — so warn-and-continue is effectively silent.
Critical
load_study_data()warn-and-skips missing cycle files with no count reconciliation — 10 of 11 cycles loading builds a "successful"study_dataon a truncated period range. Add a loaded-vs-configured check that stops on shortfall.rec_with_table()delivered the expected columns per cycle;bind_rows()fills silently-skipped derived variables with NA. This is the exact upstream failure class found during v3 integration (feeder gaps → all-NA). Add a per-cycle expected-columns check and an all-NA-by-cycle audit for critical-role variables.prepare_for_mice()'s claim that tagged NA(a)/NA(c) are protected is false for numeric variables:tagged_na()isNAtomice(), so structural missingness gets imputed (e.g. quit times invented for never-smokers). Build an explicitwherematrix (TRUE only for NA(b) cells), restore tags on write-back, and drop"NA(a)"/"NA(c)"factor levels from the modelling frame. Also checkmice_result$loggedEvents(silently dropped/collinear variables) — currently suppressed byprintFlag = FALSE.fit_binomial_apc()accept zero-event numerators — an all-NAage_first_cigaretteyields a converged model encoding "nobody ever started smoking". Stop on empty numerator andsum(d) == 0.Important
5.
validate_cycle_coverage()defaults to warn-only for critical gaps (strict_validation: falsein all profiles), downgrades to a warning whencfg$cchs_cyclesis NULL, and exempts DerivedVar/empty-variableStartrows (exactly the variablesrec_with_tablesilently drops). Make strict the default; surface skipped variables; consume the gap report somewhere.6.
interior_knots()silently drops out-of-range knots, andfit_apc_model()stores the configured (not used) knots inattr(fit, "knots")— Stage 9 basis reconstruction will mismatch the coefficients. Store actually-used knots; message on drops.7. No
fit$converged/fit$boundarycheck afterglm()despite the docstring anticipating boundary failures.8. Age restriction is skipped without any message when the age column is absent (split the condition: config-NULL = skip, missing column = stop).
9. Sex subsetting via
data[[sex_col]] == 1injects all-NA phantom rows when sex is NA and gives empty strata on coding mismatch; usewhich()and assert non-empty strata.10. Assign
imputation-predictorroles incshm-variables.csv— no variable currently carries the role, so Stage 5 selects zero variables and silently no-ops.11. Smaller:
expand_denominator()drops NA-cohort persons via barenext(count them);check_skewness()makes all-NA variables vanish from the summary;select_vars_by_role(...)[1]yields NA when a role is unassigned (assert); type-conversion loop inload_study_data()doesn't guardas.numericcoercion NAs.Also from review, design notes: Table 1 code selects by
predictor/model-stratifierroles while CLAUDE.md documentstable1/table1-stratifieras the wiring — reconcile code or docs. Initiation numerator periods are not clamped to[period_min, period_max]while the denominator is — confirm against the intended Holford construction.