
Where-matrix imputation, weighted Tables 1a/1b, and protocol v0.3.1#6

Merged
DougManuel merged 6 commits into main from protocol on Jun 10, 2026

@DougManuel

Collaborator

Summary

Implements the missing-data and descriptive-statistics design approved for protocol v0.3.x: where-matrix MICE imputation restricted to item non-response, survey-weighted Tables 1a/1b averaged across imputations, and an expanded auxiliary covariate set. Protocol and implementation are in one PR so the prespecification and the code can be reviewed together.

Protocol (v0.3.0 → v0.3.1)

  • §3.4.1 rewritten: where-matrix MICE (m = 5) on NA(b) item non-response only; design variables in the imputation model for congeniality with the APC analysis; derived variables recomputed from imputed feeders; Tables 1a/1b as a single table with unweighted n and survey-weighted statistics.
  • Appendix D (docs/protocol/appendix-imputation.qmd, also in the site sidebar): the tagged-NA taxonomy, CSHM-specific cycle-level gaps (2019-20/2022 PUMF; the 2015-16 SMKG035 recode), the model specification with three documented improvements over the DemPoRTv2 implementation, the Table 1a/1b presentation, and a four-phase diagnostic and sensitivity plan (including delta-adjustment for cycle-absent variables, with the RDC Master files as an external check).
  • v0.3.1: auxiliary covariates added — marital status, alcohol, BMI, self-rated general and mental health, life stress, community belonging, six chronic conditions, energy expenditure — all verified harmonized in cchsflow v3 through the 2019-20 PUMF (no upstream work required). Income to 2019-20 documented as a pending cchsflow extension.

Implementation

  • R/imputation.R: impute_data() builds an explicit where matrix marking only NA(b) cells (haven tag "b" / "NA(b)" factor level). Structural NA(a)/NA(c) survive untouched: the write-back copies only where-matrix cells, and the stage stops if MICE declines to impute a flagged cell (constant/collinear) instead of silently degrading a tag to plain NA. Factor NA(x) levels are excluded from the modelling frame so polyreg cannot assign them. All m completed datasets are retained with an imputed-cells audit and surfaced loggedEvents.
  • Descriptive engine: weight_var adds survey-weighted percent and weighted median/IQR (midpoint-ECDF interpolating quantile; equal-weight median equals the type-7 median). get_cshm_desc_data_mi() averages Table 1b across imputations. Cells render as unweighted n with weighted percent / weighted median (IQR). Row selection now uses the table1 role as documented.
  • Worksheet: imputation-predictor roles assigned; 15 auxiliary variables + 16 feeders added (72 rows). Coverage validation passes (known 2022 smoking gap only).
  • _targets.R — weighted Table 1a; MI-averaged Table 1b; APC fitted on imputation 1 with the Rubin-pooling upgrade path documented in Appendix D.
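
The where-matrix contract described above can be illustrated with a minimal base-R sketch. It is not the code in R/imputation.R: `build_where()` and `write_back()` are hypothetical helper names, the "NA(b)" factor level stands in for the haven tag, and `completed` is a hand-made stand-in for one completed dataset from the imputation engine.

```r
# Hypothetical helper: flag only NA(b) cells; NA(a)/NA(c) cells stay FALSE.
build_where <- function(data, tag_level = "NA(b)") {
  vapply(
    data,
    function(col) !is.na(col) & as.character(col) == tag_level,
    logical(nrow(data))
  )
}

# Hypothetical helper: copy imputed values into flagged cells only, and
# fail loudly if the engine left any flagged cell unfilled.
write_back <- function(original, completed, where) {
  for (v in colnames(where)) {
    idx <- where[, v]
    if (!any(idx)) next
    new_vals <- completed[[v]][idx]
    if (anyNA(new_vals)) {
      stop(sprintf("'%s': %d flagged cell(s) not imputed; check logged events",
                   v, sum(is.na(new_vals))), call. = FALSE)
    }
    if (is.factor(new_vals)) new_vals <- as.character(new_vals)
    original[[v]][idx] <- new_vals
  }
  original
}

lv <- c("daily", "never", "NA(a)", "NA(b)")
original <- data.frame(
  smk = factor(c("daily", "NA(b)", "NA(a)", "never"), levels = lv)
)
# Stand-in for one completed dataset (assumption: structural cells come back NA).
completed <- data.frame(
  smk = factor(c("daily", "never", NA, "never"), levels = c("daily", "never"))
)

where  <- build_where(original)
result <- write_back(original, completed, where)
print(result)
```

In this toy case only row 2 is overwritten; the NA(a) cell and the factor levels are left as they were, and a completed dataset with an unfilled flagged cell raises an error instead of writing a plain NA.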

Review already applied

An agent code review of the implementation found and fixed one critical issue before this PR: when MICE flags a target variable constant/collinear it leaves its where-TRUE cells unimputed, and the original write-back would have replaced the NA(b) tag with untagged NA — now a hard failure pointing at the logged events. Also fixed: the weighted quantile now genuinely interpolates (docstring/implementation mismatch), and an NA-guard in the continuous cell formatter.

Testing

  • 85 tests pass, including 41 new: the where-matrix contract (only NA(b) cells written; tags round-trip via haven::na_tag; factor levels preserved; imputed categorical values never structural-missing labels; complete columns byte-identical), the MICE declined-cell guard, weighted quantile edge cases, and multi-role worksheet parsing.
  • End-to-end validation (1% samples of CCHS 2001, 2015-16, 2019-20: clean → weighted Table 1a → impute m=2 → MI-averaged Table 1b) is running; sanity tables will be posted as a PR comment. The auxiliary covariates' first full-pipeline exercise will be the next tar_make(); their cycle coverage is validated, and they are long-established cchsflow variables rather than the freshly repaired smoking chain.

Known limitations

  • Income absent from the auxiliary set pending the cchsflow 2019-20 extension.
  • APC stage uses imputation 1; per-imputation fits with Rubin pooling are specified in Appendix D and gated on the FMI diagnostics (Phase 2.3).
  • Imputation diagnostics report (trace plots, density plots, FMI table) is the next implementation item on the Appendix D checklist.
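
For orientation on the Rubin-pooling upgrade path, the textbook form of Rubin's rules for a scalar estimate is sketched below. Appendix D is not reproduced in this thread, so this is the standard formulation rather than the protocol's wording; `pool_rubin()` is a hypothetical name and the inputs are made-up numbers. The returned `lambda` is the proportion of total variance attributable to missingness, which is only the large-sample approximation to the FMI (the degrees-of-freedom adjustment is omitted).

```r
# Rubin's rules for one scalar estimate across m completed datasets.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m >= 2, length(variances) == m)
  q_bar <- mean(estimates)          # pooled point estimate
  w_bar <- mean(variances)          # within-imputation variance
  b     <- stats::var(estimates)    # between-imputation variance
  t_var <- w_bar + (1 + 1 / m) * b  # total variance
  list(
    estimate = q_bar,
    se       = sqrt(t_var),
    lambda   = (1 + 1 / m) * b / t_var
  )
}

# Illustrative inputs only: five per-imputation estimates with equal variance.
pooled <- pool_rubin(c(1.0, 1.2, 0.9, 1.1, 1.3), rep(0.04, 5))
print(pooled)
```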

Specify the imputation approach in section 3.4.1: where-matrix MICE
restricted to item non-response (NA(b)), m = 5; survey design
variables (cycle, sampling weight) and sociodemographic predictors in
the imputation model for congeniality with the APC analysis; derived
variables recomputed from imputed feeders. Tables 1a/1b specified as
a single table with unweighted n and survey-weighted statistics
(1a disclosing missing data by NA type; 1b averaged across the m
completed datasets), with a fully unweighted appendix variant.

Add Appendix D (docs/protocol/appendix-imputation.qmd), adapting the
DemPoRTv2 imputation assessment framework: CSHM-specific missing-data
context (tagged-NA taxonomy, PUMF cycle-level gaps for 2019-20 and
2022, the 2015-16 SMKG035 recode), the model specification with the
three improvements over the DemPoRTv2 implementation, the Table 1a/1b
presentation, the four-phase diagnostic and sensitivity plan
(including delta-adjustment for cycle-absent variables with the RDC
Master files as an external check), and an implementation checklist
tied to issues #3 and #4.

Stage 5 (R/imputation.R): MICE with an explicit where matrix marking
only NA(b) cells; tagged NA(a)/NA(c) survive untouched because the
write-back copies only where-matrix cells into the original data, and
the stage stops if MICE declines to impute a flagged cell (constant/
collinear) rather than silently degrading an NA(b) tag to plain NA.
Factor NA(x) levels are removed from the modelling frame and restored
implicitly. All m completed datasets are retained, with an
imputed-cells audit and surfaced loggedEvents.

Descriptive engine: weight_var argument adds survey-weighted percent
(categories and NA-type rows) and weighted median/IQR via a
midpoint-ECDF interpolating quantile; unchanged output when NULL.
get_cshm_desc_data_mi() averages Table 1b statistics across the m
imputations. Table cells render unweighted n with weighted percent /
weighted median (IQR). Table rows are now selected by the table1 role
as documented (was predictor); SDCFIMM/SDCGCGT gain the table1 role.
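
One common construction that matches the "midpoint-ECDF interpolating quantile" description is sketched here; it is an assumption about the approach, not a copy of the engine's function, and `weighted_quantile_sketch()` is a hypothetical name. Each sorted observation is placed at the midpoint of its share of the cumulative weight, and quantiles are read off by linear interpolation. With equal weights the positions are (i - 0.5)/n, so the median coincides with the type-7 median, although the quartiles need not.

```r
# Weighted quantile via linear interpolation on midpoint-ECDF positions.
weighted_quantile_sketch <- function(x, w, probs) {
  stopifnot(length(x) == length(w), length(x) >= 1,
            !anyNA(x), all(is.finite(w)), all(w > 0))
  if (length(x) == 1) return(rep(x, length(probs)))
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  # Midpoint of each observation's slice of the cumulative weight.
  mid <- (cumsum(w) - w / 2) / sum(w)
  # rule = 2 clamps probabilities outside the midpoint range to the extremes.
  stats::approx(mid, x, xout = probs, rule = 2)$y
}

x <- c(3, 1, 4, 1, 5, 9, 2)
equal_weight_median <- weighted_quantile_sketch(x, rep(1, length(x)), 0.5)
type7_median <- unname(stats::quantile(x, 0.5, type = 7))
print(c(equal_weight_median, type7_median))
```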

Worksheet: imputation-predictor roles per protocol Appendix D, plus
15 auxiliary covariates (marital status, alcohol, BMI, general and
mental health, stress, belonging, six chronic conditions, energy
expenditure — all cchsflow-v3-covered to 2019-20) with 16 feeder
rows; 72 rows total. Coverage validation passes (known 2022 gap only).

_targets.R: weighted Table 1a; MI-averaged Table 1b; APC fits
imputation 1 (Appendix D documents the Rubin-pooling upgrade path).

Tests: 41 new (where-matrix contract, tagged-NA round trip, MICE
declined-cell guard, weighted quantiles, multi-role fixtures);
85 pass total.

Amend section 3.4.1 and Appendix D with the expanded auxiliary
covariate set (marital status, alcohol, BMI, self-rated general and
mental health, life stress, community belonging, chronic conditions,
energy expenditure — all harmonized in cchsflow v3 through 2019-20),
serving imputation quality and the study-base description for planned
related studies. Income to 2019-20 documented as a pending upstream
cchsflow extension. Add Appendix D to the site sidebar.

Rewrite the Stage 5 workflow page for the where-matrix design (NA(b)
only; structural missingness preserved; analysis_data is now a list of
m completed datasets) and align the manuscript methods text and
analysis_data usage with the implemented approach.

Copilot AI review requested due to automatic review settings June 10, 2026 19:30

Copilot AI left a comment

Pull request overview

This PR implements the protocol v0.3.x missing-data/descriptive-statistics design by adding where-matrix MICE imputation restricted to NA(b) item non-response, producing survey-weighted descriptive Tables 1a/1b (with Table 1b averaged across imputations), and expanding the worksheet’s auxiliary covariate set and documentation to match the protocol.

Changes:

  • Add where-matrix MICE imputation that preserves structural missingness and returns all m completed datasets + an audit/logged events.
  • Extend descriptive-statistics engine and table rendering to support survey-weighted percent and weighted median/IQR, and average Table 1b across imputations.
  • Update targets pipeline, tests, worksheets, manuscript, and protocol/workflow documentation to reflect the new imputation artifact and Table 1 design.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| worksheets/cshm-variables.csv | Adds/updates roles (incl. imputation-predictor, table1) and auxiliary covariates for imputation + Table 1. |
| tests/testthat/test-imputation.R | New tests for where-matrix contract, tag preservation, and declined-cell guard. |
| tests/testthat/test-descriptive-tables.R | Updates role parsing expectations and adds coverage for table1 role selection. |
| R/imputation.R | Implements where-matrix MICE, write-back restricted to NA(b), multi-dataset output, and logged-events surfacing. |
| R/get-descriptive-data.R | Adds weighted quantile + weight-aware descriptive stats generation (engine layer). |
| R/descriptive-data.R | Adds weight support and MI-averaging wrapper for Table 1b. |
| R/create-descriptive-tables.R | Updates cell formatting/footnotes to display unweighted n with weighted percent/median-IQR when available. |
| manuscript/manuscript.qmd | Updates analysis-data loading to account for new analysis_data structure (list of datasets). |
| docs/workflow/5-imputation.qmd | Updates workflow documentation to describe where-matrix MICE and the new analysis_data artifact structure. |
| docs/protocol/full-protocol.qmd | Updates protocol text/version history to v0.3.1 and describes where-matrix MI + weighted Table 1s. |
| docs/protocol/appendix-imputation.qmd | Adds Appendix D detailing missingness taxonomy, imputation spec, and diagnostics/sensitivity plan. |
| _targets.R | Switches Table 1a/1b targets to weighted + MI-averaged variants and uses imputation 1 for APC data prep. |
| _quarto.yml | Adds Appendix D to the protocol sidebar. |
Comments suppressed due to low confidence (1)

R/get-descriptive-data.R:44

  • When weight_var is supplied, new_row gains wtd_* columns, but descriptive_data is initialized without them. Because the code uses base rbind(), this will error with “numbers of columns of arguments do not match” on the first append. Initialize the weighted columns on descriptive_data when use_weights is TRUE (or always) so rbind() has consistent columns.
  descriptive_data <- data.frame(
    variable   = c(),
    cat        = c(),
    median     = c(),
    percentile25 = c(),

Comment thread R/imputation.R Outdated
Comment on lines 94 to 100
# Convergence check: mice() does not error on pathological chains, so a
# quick guard on non-finite chain means catches degenerate fits.
if (any(!is.finite(mice_result$chainMean), na.rm = FALSE) &&
    any(is.nan(mice_result$chainMean))) {
  warning("Non-finite MICE chain means detected — inspect convergence ",
          "(mice::plot) before using these imputations.", call. = FALSE)
}
Comment thread docs/protocol/appendix-imputation.qmd Outdated
|-----|---------|-----------|
| NA(a) | Not applicable (e.g., initiation age for never-smokers) | Structural — never imputed |
| NA(b) | Don't know / refused / not stated | Item non-response — imputed under MAR |
| NA(c) | Not asked in this survey cycle | Cycle-level absence — imputed under a stronger MAR assumption, with sensitivity analysis |

Imputation (R/imputation.R):
- Predictor matrix excludes structural-NA variables as predictors (they
  may be imputed but cannot predict): mice drops rows with missing
  predictors from each conditional model, so a structural-NA predictor
  such as time-since-quit made most NA(b) cells in other variables
  unpredictable. Synthetic validation: all 700 flagged cells fill, zero
  untagged NA, all NA(a)/(c) tags preserved.
- Chain-mean convergence check scoped to actually-imputed variables and
  promoted to stop() (the unscoped version warned on every healthy run
  and could never detect the NaN-free degenerate case).
- Dropping a prespecified design predictor (weight, cycle, sex, age)
  from the imputation model is now an error, not a one-shot warning.
- pack_years_der recomputed from imputed feeders via
  cchsflow::calculate_pack_years after each completed dataset,
  delivering the protocol's derived-variable claim.
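
A scoped convergence guard along the lines described could look like the sketch below. It assumes the chain means arrive as a variable × iteration × imputation array, as in the `chainMean` element of a mice result; `check_chain_means()` and the toy array are hypothetical and are not taken from R/imputation.R.

```r
# Stop if any actually-imputed variable has a non-finite chain mean.
# Variables with no imputable cells are deliberately left out of the check.
check_chain_means <- function(chain_mean, imputed_vars) {
  scoped <- chain_mean[imputed_vars, , , drop = FALSE]
  bad <- apply(scoped, 1, function(v) any(!is.finite(v)))
  if (any(bad)) {
    stop("Non-finite chain means for: ",
         paste(names(bad)[bad], collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Toy array: 2 variables, 3 iterations, 2 imputations.
cm <- array(1, dim = c(2, 3, 2),
            dimnames = list(c("age_first", "sex"), NULL, NULL))
cm["sex", , ] <- NaN  # never imputed, so its chain mean is NaN by construction

ok <- check_chain_means(cm, imputed_vars = "age_first")
print(ok)
```

Scoping to `imputed_vars` is what removes the false positive: the NaN rows for never-imputed variables are not inspected, while an Inf or NaN in an imputed variable's chain still stops the run.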

Descriptives:
- table-1.qmd renders rows by the table1 role (the 15 auxiliaries were
  computed but silently dropped at render), adds the Health status
  section, weights the cycle appendix, and adds the fully unweighted
  appendix variant promised by the protocol (new unweighted targets).
- Survey weights validated at the engine boundary (stop on missing or
  non-positive); weighted_quantile rejects bad weights; zero-row table
  lookups stop as wiring bugs instead of rendering 'No data';
  MI-averaged n rounded for display; MI row-alignment invariant.
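
The weight check at the engine boundary, and why it matters for the displayed percentages, can be sketched as follows. `validate_weights()` and `weighted_percent()` are hypothetical names and the data are invented; the point is only that a missing or non-positive weight is rejected before any percentage is computed.

```r
# Reject weights that would silently change the population being described.
validate_weights <- function(w) {
  if (anyNA(w) || any(!is.finite(w)) || any(w <= 0)) {
    stop("Survey weights must be finite, non-missing and positive.",
         call. = FALSE)
  }
  invisible(w)
}

# Weighted percent per category: share of total weight, not share of rows.
weighted_percent <- function(category, w) {
  stopifnot(length(category) == length(w))
  validate_weights(w)
  totals <- tapply(w, category, sum)
  100 * totals / sum(w)
}

status <- c("never", "never", "daily", "daily")
w      <- c(300, 100, 50, 50)
pct    <- weighted_percent(status, w)
print(pct)  # unweighted shares are 50/50; the weighted shares differ
```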

Docs: Appendix D aligned with the implementation (NA(c) not imputed in
the base pipeline with the Phase 4 evaluation path; predictor-matrix
design documented; factor-level mechanism wording; checklist updated),
stage 5/6 docs and CLAUDE.md role table corrected.

Tests: declined-cell guard exercised (either layered stop accepted),
structural-predictor exclusion, pack_years recomputation, weighted
percent/median integration, MI averaging with exact expectations,
vacuous assertion replaced; cchsflow attached in setup.R mirroring
tar_option_set. 102 tests pass.

@DougManuel

Collaborator Author

PR-skill review completed — findings and resolutions

Four review agents (code, silent-failures, tests, comment/protocol accuracy) ran over the full diff. Everything below is fixed in 539171d unless marked as a follow-up issue.

Critical findings, all resolved

  1. Structural-NA predictors made most NA(b) cells unpredictable (found via the E2E sanity tables: only 2 of 292 male age_first_cigarette cells filled). mice excludes rows with missing predictors from each conditional model, so time-since-quit (NA(a) for current and never-smokers) blocked prediction for most ever-smokers. Fixed with a predictor matrix: structural-NA variables may be imputed but never serve as predictors. Synthetic validation at scale: all 700 flagged cells fill, zero untagged NA, every NA(a)/(c) tag preserved. (The corrupted E2E run predated the declined-cell guard — current code stops rather than corrupting.)
  2. The 15 auxiliary covariates were computed but silently dropped at render: table-1.qmd still filtered rows by the predictor role. Now renders by table1 role with the Health status section; cycle appendix weighted; fully unweighted appendix variant added (closing the protocol promise).
  3. pack_years_der recomputation was promised by the protocol but implemented nowhere (a false past-tense methods claim in the manuscript). Implemented: recomputed from imputed feeders via cchsflow::calculate_pack_years per completed dataset, with a test (never-smokers → 0, stale values overwritten).
  4. The chain-mean convergence guard was a guaranteed false positive on every healthy run (variables with zero imputable cells always have NaN chain means) and could never catch the NaN-free degenerate case. Now scoped to actually-imputed variables and promoted to stop().
  5. Protocol/code contradiction on NA(c): the appendix taxonomy said cycle-absent values are imputed; the implementation (deliberately) does not. Appendix D now states the base-pipeline position — NA(c) is not imputed; extending the where matrix is the documented mechanism if the Phase 4 delta-adjustment evaluation supports it.
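
The predictor-matrix fix in finding 1, together with the hard error on losing a prespecified design predictor, might be sketched like this. It follows the mice convention that rows are imputation targets and columns are predictors (1 = used); `build_predictor_matrix()`, the use of "NA(a)"/"NA(c)" labels to detect structural missingness, and the toy data are all assumptions made for illustration.

```r
# Structural-NA variables may still be imputed (their rows keep predictors)
# but never act as predictors (their columns are zeroed).
build_predictor_matrix <- function(data,
                                   structural_levels = c("NA(a)", "NA(c)"),
                                   required = character()) {
  vars <- names(data)
  has_structural <- vapply(
    data,
    function(col) any(as.character(col) %in% structural_levels),
    logical(1)
  )
  # A prespecified design predictor that would be lost is an error.
  lost <- union(setdiff(required, vars),
                intersect(required, vars[has_structural]))
  if (length(lost) > 0) {
    stop("Prespecified predictor(s) would be dropped: ",
         paste(lost, collapse = ", "), call. = FALSE)
  }
  pred <- matrix(1, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))
  diag(pred) <- 0
  pred[, has_structural] <- 0
  pred
}

toy <- data.frame(
  sex             = c("m", "f", "m"),
  time_since_quit = c("NA(a)", "2", "NA(b)"),
  age_first       = c("16", "NA(b)", "15")
)
pm <- build_predictor_matrix(toy, required = "sex")
print(pm)
```

In the toy output the time_since_quit column is all zeros, so it predicts nothing, while its row still lists sex and age_first as predictors, so its own NA(b) cells remain imputable.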

Important findings, resolved

  • Dropping a prespecified design predictor (weight/cycle/sex/age) from the imputation model is now an error, not a once-shown warning ({targets} caches successful builds, so warnings are effectively silent).
  • Survey weights validated at the engine boundary (stop on missing/non-positive — weighted percents would otherwise silently renormalize over a different population than the displayed n); weighted_quantile rejects bad weights.
  • Zero-row table lookups now stop as wiring bugs instead of rendering "No data"; MI-averaged n rounds for display; an MI row-alignment invariant guards the averaging.
  • Documentation accuracy sweep: factor-level mechanism wording, list-not-long-format storage, dev/draft m=maxit=1, checklist updated to reflect delivered items, CLAUDE.md role table, stage 5/6 docs.
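
The row-alignment invariant that guards MI averaging can be shown with a small sketch. It is not get_cshm_desc_data_mi(): `average_across_imputations()` is a hypothetical name, and the column names `variable`, `cat`, `n` and `wtd_pct` are assumed for the example.

```r
# Average numeric columns across per-imputation tables, refusing to proceed
# unless every table has the same key rows in the same order.
average_across_imputations <- function(tables, keys = c("variable", "cat")) {
  stopifnot(length(tables) >= 1)
  ref_keys <- tables[[1]][keys]
  for (tab in tables) {
    if (!identical(tab[keys], ref_keys)) {
      stop("Per-imputation tables are not row-aligned.", call. = FALSE)
    }
  }
  out <- tables[[1]]
  for (col in setdiff(names(out), keys)) {
    out[[col]] <- Reduce(`+`, lapply(tables, `[[`, col)) / length(tables)
  }
  out
}

t1 <- data.frame(variable = "smk", cat = c("daily", "never"),
                 n = c(10, 30), wtd_pct = c(20, 80))
t2 <- data.frame(variable = "smk", cat = c("daily", "never"),
                 n = c(12, 28), wtd_pct = c(24, 76))
avg <- average_across_imputations(list(t1, t2))
print(avg)
```

Averaged counts can be fractional, which is why rounding n for display is a separate formatting step.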

Tests

102 pass (was 85): declined-cell guard exercised, structural-predictor exclusion, pack-years recomputation, weighted percent/median integration with hand-computed expectations, MI averaging exactness, the vacuous assertion replaced, and setup.R attaches cchsflow to mirror production.

Follow-up issues (out of PR scope)

Filed separately: tagging cycle-absent plain NA as NA(c) at load time (fix_na_c pattern — currently cycle-absent mass is invisible to the NA-row accounting), the statscan profile config dead-drop, a failing imputation_health target consuming logged_events, and the universe-nested predictor refinement.

A fresh end-to-end run (1% × 3 cycles, now including the auxiliary covariates) is in progress; sanity tables will be posted here when it completes.

…changes

The PR #6 review fixes changed prespecified methods in Appendix D after
the v0.3.1 stamp; record them in the version history rather than edit
silently: NA(c) is not imputed in the base pipeline (extending the
where matrix is the documented mechanism, gated on Phase 4), and
structural-missingness variables do not serve as imputation predictors.
Correct the 0.3.0/0.3.1 entry dates (2026-06-10, not 06-11).
Any PR touching docs/protocol/ must bump version-summary.version in
full-protocol.qmd and add a dated version-history entry. The GitHub
Actions check compares the version against the PR base and fails on a
silent edit; CLAUDE.md documents the convention for agent sessions.
Same gate-as-pipeline-citizen pattern as the coverage validator.
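
The logic of the version-bump gate can be sketched in R for illustration; the real check runs in GitHub Actions and its script is not shown in this thread. `check_protocol_version()` is a hypothetical name, and how the base and head versions are read from full-protocol.qmd is left out.

```r
# Fail when a PR touches docs/protocol/ without raising the protocol version.
check_protocol_version <- function(changed_files, base_version, head_version) {
  touches_protocol <- any(startsWith(changed_files, "docs/protocol/"))
  bumped <- numeric_version(head_version) > numeric_version(base_version)
  if (touches_protocol && !bumped) {
    stop("docs/protocol/ changed without a version bump (base ",
         base_version, ", head ", head_version, ").", call. = FALSE)
  }
  invisible(TRUE)
}

passes <- check_protocol_version(
  changed_files = c("docs/protocol/appendix-imputation.qmd", "R/imputation.R"),
  base_version  = "0.3.0",
  head_version  = "0.3.1"
)
print(passes)
```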
@DougManuel

Collaborator Author

End-to-end validation (post-fix): the accounting closes

Fresh run on 1% samples of CCHS 2001, 2015-16, 2019-20 PUMF — clean → weighted Table 1a → impute (m = 2) → MI-averaged weighted Table 1b — now including all 15 auxiliary covariates in the imputation model and tables (770 table rows vs 315 before).

The decisive check (male stratum, age_first_cigarette), before-fix vs after-fix:

| | Pre-imputation (1a) | Post-imputation (1b), broken run | Post-imputation (1b), this run |
|-----|---------|-----------|-----------|
| Observed n | 997 | 999 (+2 of 292) | 1289 (= 997 + 292, every NA(b) filled) |
| NA(b) row | 292 (20.3%) | 0 (cells silently untagged-NA) | 0 (cells genuinely imputed) |
| NA(a) row (never-smokers) | 151 | 151 | 151 (structural missingness untouched) |
| Weighted median (IQR) | 16 (13–17) | | 16 (13–17.75) (distribution preserved) |

All 20 imputation targets filled (4,742 cells total), including the auxiliaries: drinks last week 1,119, energy expenditure 944, COPD 378, community belonging 154. The predictor matrix excluded 13 structural-missingness variables from predicting (they are still imputed); the complete design/demographic/auxiliary core — cycle, weight, sex, age, province, marital status, drinks, general health, stress, belonging, hypertension, diabetes, energy expenditure — does the predicting. Several auxiliaries land in the exclusion only because cycle-absence currently arrives as plain NA rather than NA(c) (#7); the exclusion is correct either way.

SMKDSTY_original distributions are identical in 1a/1b apart from row ordering (1 imputed cell), and weighted vs unweighted percents diverge sensibly throughout (e.g., never-smokers 34.8% unweighted vs 40.4% weighted among men).

Ready for review and merge from my side.

@DougManuel DougManuel merged commit 5f01329 into main Jun 10, 2026
1 check passed
@DougManuel DougManuel deleted the protocol branch June 10, 2026 23:05
@DougManuel DougManuel restored the protocol branch June 10, 2026 23:06