Where-matrix imputation, weighted Tables 1a/1b, and protocol v0.3.1 by DougManuel · Pull Request #6 · Big-Life-Lab/cshm-dev

DougManuel · 2026-06-10T19:30:40Z

Summary

Implements the missing-data and descriptive-statistics design approved for protocol v0.3.x: where-matrix MICE imputation restricted to item non-response, survey-weighted Tables 1a/1b averaged across imputations, and an expanded auxiliary covariate set. Protocol and implementation are in one PR so the prespecification and the code can be reviewed together.

Protocol (v0.3.0 → v0.3.1)

§3.4.1 rewritten: where-matrix MICE (m = 5) on NA(b) item non-response only; design variables in the imputation model for congeniality with the APC analysis; derived variables recomputed from imputed feeders; Tables 1a/1b as a single table with unweighted n and survey-weighted statistics.
Appendix D (docs/protocol/appendix-imputation.qmd, also in the site sidebar): the tagged-NA taxonomy, CSHM-specific cycle-level gaps (2019-20/2022 PUMF; the 2015-16 SMKG035 recode), the model specification with three documented improvements over the DemPoRTv2 implementation, the Table 1a/1b presentation, and a four-phase diagnostic and sensitivity plan (including delta-adjustment for cycle-absent variables, with the RDC Master files as an external check).
v0.3.1: auxiliary covariates added — marital status, alcohol, BMI, self-rated general and mental health, life stress, community belonging, six chronic conditions, energy expenditure — all verified harmonized in cchsflow v3 through the 2019-20 PUMF (no upstream work required). Income to 2019-20 documented as a pending cchsflow extension.

Implementation

R/imputation.R — impute_data() builds an explicit where matrix marking only NA(b) cells (haven tag "b" / "NA(b)" factor level). Structural NA(a)/NA(c) survive untouched: the write-back copies only where-matrix cells, and the stage stops if MICE declines to impute a flagged cell (constant/collinear) instead of silently degrading a tag to plain NA. Factor NA(x) levels are excluded from the modelling frame so polyreg cannot assign them. All m completed datasets are retained with an imputed-cells audit and surfaced loggedEvents.
Descriptive engine — weight_var adds survey-weighted percent and weighted median/IQR (midpoint-ECDF interpolating quantile; equal-weight median equals the type-7 median). get_cshm_desc_data_mi() averages Table 1b across imputations. Cells render as unweighted n with weighted percent / weighted median (IQR). Row selection now uses the table1 role as documented.
Worksheet — imputation-predictor roles assigned; 15 auxiliary variables + 16 feeders added (72 rows). Coverage validation passes (known 2022 smoking gap only).
_targets.R — weighted Table 1a; MI-averaged Table 1b; APC fitted on imputation 1 with the Rubin-pooling upgrade path documented in Appendix D.
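
The write-back contract described above can be sketched in base R, without calling mice. This is an illustration only: the function name `write_back_sketch` is hypothetical, and the parallel `tag` vector is an assumed stand-in for the haven tagged-NA storage the pipeline actually uses.

```r
# Hypothetical stand-in: `tag` marks each missing cell as "a", "b", or "c".
# The real pipeline stores the tag inside the NA itself via haven.
write_back_sketch <- function(original, tag, completed) {
  where <- !is.na(tag) & tag == "b"   # where matrix: item non-response only

  # Declined-cell guard: a flagged cell that is still NA stops the stage
  # instead of being written back as an untagged NA.
  declined <- where & is.na(completed)
  if (any(declined)) {
    stop(sum(declined), " flagged NA(b) cell(s) were not imputed; ",
         "inspect the logged events.", call. = FALSE)
  }

  out <- original
  out[where] <- completed[where]      # copy where-matrix cells only
  out
}

original  <- c(16, NA, NA, 14, NA)
tag       <- c(NA, "a", "b", NA, "c")
completed <- c(16, 99, 15, 14, 99)    # the 99s must never be copied back

result <- write_back_sketch(original, tag, completed)
```

Only the third cell changes; the NA(a) and NA(c) positions keep their original missing value because they are never part of the where mask.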

Review already applied

An agent code review of the implementation found and fixed one critical issue before this PR: when MICE flags a target variable constant/collinear it leaves its where-TRUE cells unimputed, and the original write-back would have replaced the NA(b) tag with untagged NA — now a hard failure pointing at the logged events. Also fixed: the weighted quantile now genuinely interpolates (docstring/implementation mismatch), and an NA-guard in the continuous cell formatter.
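
For reviewers who want to see what "genuinely interpolates" means, here is a minimal sketch of a midpoint-ECDF weighted quantile in base R. It is not the code in R/get-descriptive-data.R; the name `weighted_quantile_sketch` and its argument handling are assumptions made for illustration.

```r
weighted_quantile_sketch <- function(x, w, probs = c(0.25, 0.5, 0.75)) {
  stopifnot(length(x) == length(w), length(x) >= 1,
            !anyNA(x), !anyNA(w), all(w > 0))
  if (length(x) == 1) return(rep(x, length(probs)))

  ord <- order(x)
  xs  <- x[ord]
  ws  <- w[ord]

  # Place each observation at the midpoint of its step in the weighted ECDF.
  p <- (cumsum(ws) - ws / 2) / sum(ws)

  # Linear interpolation between midpoints; rule = 2 clamps to the extremes.
  stats::approx(x = p, y = xs, xout = probs, rule = 2)$y
}

equal_weight_median <- weighted_quantile_sketch(c(1, 2, 3, 4), rep(1, 4), 0.5)
weighted_median     <- weighted_quantile_sketch(c(1, 2), c(1, 3), 0.5)
```

With equal weights the midpoints fall symmetrically around 0.5, so the interpolated median coincides with `stats::median()`; with unequal weights it is pulled toward the heavier observation.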

Testing

85 tests pass, including 41 new: the where-matrix contract (only NA(b) cells written; tags round-trip via haven::na_tag; factor levels preserved; imputed categorical values never structural-missing labels; complete columns byte-identical), the MICE declined-cell guard, weighted quantile edge cases, and multi-role worksheet parsing.
End-to-end validation (1% samples of CCHS 2001, 2015-16, 2019-20: clean → weighted Table 1a → impute m=2 → MI-averaged Table 1b) is running; sanity tables will be posted as a PR comment. The auxiliary covariates' first full-pipeline exercise will be the next tar_make(); their cycle coverage is validated, and they are long-established cchsflow variables rather than the freshly repaired smoking chain.

Known limitations

Income absent from the auxiliary set pending the cchsflow 2019-20 extension.
APC stage uses imputation 1; per-imputation fits with Rubin pooling are specified in Appendix D and gated on the FMI diagnostics (Phase 2.3).
Imputation diagnostics report (trace plots, density plots, FMI table) is the next implementation item on the Appendix D checklist.
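
For context on the pooling upgrade path, Rubin's rules for a scalar estimate can be sketched as follows. This is the textbook formula, not code from this PR; `pool_rubin_sketch` is a hypothetical name, and `lambda` is the proportion of total variance attributable to missingness, which approximates the FMI only when the degrees of freedom are large.

```r
pool_rubin_sketch <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m >= 2, length(variances) == m)

  q_bar   <- mean(estimates)         # pooled point estimate
  within  <- mean(variances)         # average within-imputation variance
  between <- stats::var(estimates)   # between-imputation variance
  total   <- within + (1 + 1 / m) * between

  list(estimate = q_bar,
       se       = sqrt(total),
       lambda   = (1 + 1 / m) * between / total)
}

# Illustrative numbers only, not results from the CSHM pipeline.
pooled <- pool_rubin_sketch(estimates = c(0.10, 0.14, 0.12),
                            variances = c(0.004, 0.004, 0.004))
```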

Specify the imputation approach in section 3.4.1: where-matrix MICE restricted to item non-response (NA(b)), m = 5; survey design variables (cycle, sampling weight) and sociodemographic predictors in the imputation model for congeniality with the APC analysis; derived variables recomputed from imputed feeders. Tables 1a/1b specified as a single table with unweighted n and survey-weighted statistics (1a disclosing missing data by NA type; 1b averaged across the m completed datasets), with a fully unweighted appendix variant. Add Appendix D (docs/protocol/appendix-imputation.qmd), adapting the DemPoRTv2 imputation assessment framework: CSHM-specific missing-data context (tagged-NA taxonomy, PUMF cycle-level gaps for 2019-20 and 2022, the 2015-16 SMKG035 recode), the model specification with the three improvements over the DemPoRTv2 implementation, the Table 1a/1b presentation, the four-phase diagnostic and sensitivity plan (including delta-adjustment for cycle-absent variables with the RDC Master files as an external check), and an implementation checklist tied to issues #3 and #4.

Stage 5 (R/imputation.R): MICE with an explicit where matrix marking only NA(b) cells; tagged NA(a)/NA(c) survive untouched because the write-back copies only where-matrix cells into the original data, and the stage stops if MICE declines to impute a flagged cell (constant/collinear) rather than silently degrading an NA(b) tag to plain NA. Factor NA(x) levels are removed from the modelling frame and restored implicitly. All m completed datasets are retained, with an imputed-cells audit and surfaced loggedEvents.

Descriptive engine: weight_var argument adds survey-weighted percent (categories and NA-type rows) and weighted median/IQR via a midpoint-ECDF interpolating quantile; unchanged output when NULL. get_cshm_desc_data_mi() averages Table 1b statistics across the m imputations. Table cells render unweighted n with weighted percent / weighted median (IQR). Table rows are now selected by the table1 role as documented (was predictor); SDCFIMM/SDCGCGT gain the table1 role.

Worksheet: imputation-predictor roles per protocol Appendix D, plus 15 auxiliary covariates (marital status, alcohol, BMI, general and mental health, stress, belonging, six chronic conditions, energy expenditure; all cchsflow-v3-covered to 2019-20) with 16 feeder rows; 72 rows total. Coverage validation passes (known 2022 gap only).

_targets.R: weighted Table 1a; MI-averaged Table 1b; APC fits imputation 1 (Appendix D documents the Rubin-pooling upgrade path).

Tests: 41 new (where-matrix contract, tagged-NA round trip, MICE declined-cell guard, weighted quantiles, multi-role fixtures); 85 pass total.

Amend section 3.4.1 and Appendix D with the expanded auxiliary covariate set (marital status, alcohol, BMI, self-rated general and mental health, life stress, community belonging, chronic conditions, energy expenditure — all harmonized in cchsflow v3 through 2019-20), serving imputation quality and the study-base description for planned related studies. Income to 2019-20 documented as a pending upstream cchsflow extension. Add Appendix D to the site sidebar. Rewrite the Stage 5 workflow page for the where-matrix design (NA(b) only; structural missingness preserved; analysis_data is now a list of m completed datasets) and align the manuscript methods text and analysis_data usage with the implemented approach.

Copilot

Pull request overview

This PR implements the protocol v0.3.x missing-data/descriptive-statistics design by adding where-matrix MICE imputation restricted to NA(b) item non-response, producing survey-weighted descriptive Tables 1a/1b (with Table 1b averaged across imputations), and expanding the worksheet’s auxiliary covariate set and documentation to match the protocol.

Changes:

Add where-matrix MICE imputation that preserves structural missingness and returns all m completed datasets + an audit/logged events.
Extend descriptive-statistics engine and table rendering to support survey-weighted percent and weighted median/IQR, and average Table 1b across imputations.
Update targets pipeline, tests, worksheets, manuscript, and protocol/workflow documentation to reflect the new imputation artifact and Table 1 design.
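
As a rough picture of what averaging Table 1b across imputations involves, the sketch below averages numeric columns over a list of per-imputation tables after checking that their rows line up. The function name `average_mi_sketch` and the column names `variable`, `cat`, `n`, and `wtd_pct` are assumptions for illustration; the actual wrapper is get_cshm_desc_data_mi().

```r
average_mi_sketch <- function(tables, key = c("variable", "cat"),
                              stats = c("n", "wtd_pct")) {
  ref <- tables[[1]][key]
  for (tab in tables) {
    # Row-alignment check: every imputation must yield the same rows in the
    # same order, otherwise element-wise averaging mixes up categories.
    if (!identical(tab[key], ref)) {
      stop("Table rows differ across imputations.", call. = FALSE)
    }
  }

  out <- tables[[1]]
  for (s in stats) {
    out[[s]] <- Reduce(`+`, lapply(tables, function(tab) tab[[s]])) /
      length(tables)
  }
  out
}

# Illustrative values only.
t1 <- data.frame(variable = "smoking", cat = c("never", "current"),
                 n = c(40, 60), wtd_pct = c(42, 58))
t2 <- data.frame(variable = "smoking", cat = c("never", "current"),
                 n = c(41, 59), wtd_pct = c(44, 56))

averaged <- average_mi_sketch(list(t1, t2))
```

Averaged counts can be fractional (40.5 here), so a display layer would need to round them.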

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| worksheets/cshm-variables.csv | Adds/updates roles (incl. `imputation-predictor`, `table1`) and auxiliary covariates for imputation + Table 1. |
| tests/testthat/test-imputation.R | New tests for where-matrix contract, tag preservation, and declined-cell guard. |
| tests/testthat/test-descriptive-tables.R | Updates role parsing expectations and adds coverage for `table1` role selection. |
| R/imputation.R | Implements where-matrix MICE, write-back restricted to NA(b), multi-dataset output, and logged-events surfacing. |
| R/get-descriptive-data.R | Adds weighted quantile + weight-aware descriptive stats generation (engine layer). |
| R/descriptive-data.R | Adds weight support and MI-averaging wrapper for Table 1b. |
| R/create-descriptive-tables.R | Updates cell formatting/footnotes to display unweighted n with weighted percent/median-IQR when available. |
| manuscript/manuscript.qmd | Updates analysis-data loading to account for new `analysis_data` structure (list of datasets). |
| docs/workflow/5-imputation.qmd | Updates workflow documentation to describe where-matrix MICE and the new `analysis_data` artifact structure. |
| docs/protocol/full-protocol.qmd | Updates protocol text/version history to v0.3.1 and describes where-matrix MI + weighted Table 1s. |
| docs/protocol/appendix-imputation.qmd | Adds Appendix D detailing missingness taxonomy, imputation spec, and diagnostics/sensitivity plan. |
| _targets.R | Switches Table 1a/1b targets to weighted + MI-averaged variants and uses imputation 1 for APC data prep. |
| _quarto.yml | Adds Appendix D to the protocol sidebar. |

Comments suppressed due to low confidence (1)

R/get-descriptive-data.R:44

When weight_var is supplied, new_row gains wtd_* columns, but descriptive_data is initialized without them. Because the code uses base rbind(), this will error with “numbers of columns of arguments do not match” on the first append. Initialize the weighted columns on descriptive_data when use_weights is TRUE (or always) so rbind() has consistent columns.

  descriptive_data <- data.frame(
    variable   = c(),
    cat        = c(),
    median     = c(),
    percentile25 = c(),


+  # Convergence check: mice() does not error on pathological chains, so a
+  # quick guard on non-finite chain means catches degenerate fits.
+  if (any(!is.finite(mice_result$chainMean), na.rm = FALSE) &&
+      any(is.nan(mice_result$chainMean))) {
+    warning("Non-finite MICE chain means detected — inspect convergence ",
+            "(mice::plot) before using these imputations.", call. = FALSE)
  }


+|-----|---------|-----------|
+| NA(a) | Not applicable (e.g., initiation age for never-smokers) | Structural — never imputed |
+| NA(b) | Don't know / refused / not stated | Item non-response — imputed under MAR |
+| NA(c) | Not asked in this survey cycle | Cycle-level absence — imputed under a stronger MAR assumption, with sensitivity analysis |


Imputation (R/imputation.R):
- Predictor matrix excludes structural-NA variables as predictors (they may be imputed but cannot predict): mice drops rows with missing predictors from each conditional model, so a structural-NA predictor such as time-since-quit made most NA(b) cells in other variables unpredictable. Synthetic validation: all 700 flagged cells fill, zero untagged NA, all NA(a)/(c) tags preserved.
- Chain-mean convergence check scoped to actually-imputed variables and promoted to stop() (the unscoped version warned on every healthy run and could never detect the NaN-free degenerate case).
- Dropping a prespecified design predictor (weight, cycle, sex, age) from the imputation model is now an error, not a one-shot warning.
- pack_years_der recomputed from imputed feeders via cchsflow::calculate_pack_years after each completed dataset, delivering the protocol's derived-variable claim.

Descriptives:
- table-1.qmd renders rows by the table1 role (the 15 auxiliaries were computed but silently dropped at render), adds the Health status section, weights the cycle appendix, and adds the fully unweighted appendix variant promised by the protocol (new unweighted targets).
- Survey weights validated at the engine boundary (stop on missing or non-positive); weighted_quantile rejects bad weights; zero-row table lookups stop as wiring bugs instead of rendering 'No data'; MI-averaged n rounded for display; MI row-alignment invariant.

Docs: Appendix D aligned with the implementation (NA(c) not imputed in the base pipeline with the Phase 4 evaluation path; predictor-matrix design documented; factor-level mechanism wording; checklist updated), stage 5/6 docs and CLAUDE.md role table corrected.

Tests: declined-cell guard exercised (either layered stop accepted), structural-predictor exclusion, pack_years recomputation, weighted percent/median integration, MI averaging with exact expectations, vacuous assertion replaced; cchsflow attached in setup.R mirroring tar_option_set. 102 tests pass.

DougManuel · 2026-06-10T21:07:23Z

PR-skill review completed — findings and resolutions

Four review agents (code, silent-failures, tests, comment/protocol accuracy) ran over the full diff. Everything below is fixed in 539171d unless marked as a follow-up issue.

Critical findings, all resolved

Structural-NA predictors made most NA(b) cells unpredictable (found via the E2E sanity tables: only 2 of 292 male age_first_cigarette cells filled). mice excludes rows with missing predictors from each conditional model, so time-since-quit (NA(a) for current and never-smokers) blocked prediction for most ever-smokers. Fixed with a predictor matrix: structural-NA variables may be imputed but never serve as predictors. Synthetic validation at scale: all 700 flagged cells fill, zero untagged NA, every NA(a)/(c) tag preserved. (The corrupted E2E run predated the declined-cell guard — current code stops rather than corrupting.)
The 15 auxiliary covariates were computed but silently dropped at render — table-1.qmd still filtered rows by the predictor role. Now renders by table1 role with the Health status section; cycle appendix weighted; fully unweighted appendix variant added (closing the protocol promise).
pack_years_der recomputation was promised by the protocol but implemented nowhere (a false past-tense methods claim in the manuscript). Implemented: recomputed from imputed feeders via cchsflow::calculate_pack_years per completed dataset, with a test (never-smokers → 0, stale values overwritten).
The chain-mean convergence guard was a guaranteed false positive on every healthy run (variables with zero imputable cells always have NaN chain means) and could never catch the NaN-free degenerate case. Now scoped to actually-imputed variables and promoted to stop().
Protocol/code contradiction on NA(c): the appendix taxonomy said cycle-absent values are imputed; the implementation (deliberately) does not. Appendix D now states the base-pipeline position — NA(c) is not imputed; extending the where matrix is the documented mechanism if the Phase 4 delta-adjustment evaluation supports it.
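
The scoping fix for the chain-mean guard can be illustrated with a small base R sketch. It assumes the chain means arrive as a variables × iterations × imputations array with variable names on the first dimension (the layout of `chainMean` in a mice result); `check_chain_means_sketch` and `n_imputed` are hypothetical names.

```r
check_chain_means_sketch <- function(chain_mean, n_imputed) {
  # Scope the check to variables that actually had cells imputed; variables
  # with nothing to impute carry NaN chain means on every healthy run.
  imputed_vars <- names(n_imputed)[n_imputed > 0]
  if (length(imputed_vars) == 0) return(invisible(TRUE))

  scoped <- chain_mean[imputed_vars, , , drop = FALSE]
  bad <- apply(scoped, 1, function(v) any(!is.finite(v)))
  if (any(bad)) {
    stop("Non-finite chain means for: ",
         paste(names(bad)[bad], collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Synthetic array: 2 variables, 3 iterations, 2 imputations.
cm <- array(NaN, dim = c(2, 3, 2),
            dimnames = list(c("age_first_cigarette", "sex"), NULL, NULL))
cm["age_first_cigarette", , ] <- 16
n_imputed <- c(age_first_cigarette = 292, sex = 0)

healthy <- check_chain_means_sketch(cm, n_imputed)
```

The all-NaN `sex` row no longer trips the guard, while an `Inf` in an imputed variable's chain (a non-finite value that is not NaN) now stops the run.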

Important findings, resolved

Dropping a prespecified design predictor (weight/cycle/sex/age) from the imputation model is now an error, not a once-shown warning ({targets} caches successful builds, so warnings are effectively silent).
Survey weights validated at the engine boundary (stop on missing/non-positive — weighted percents would otherwise silently renormalize over a different population than the displayed n); weighted_quantile rejects bad weights.
Zero-row table lookups now stop as wiring bugs instead of rendering "No data"; MI-averaged n rounds for display; an MI row-alignment invariant guards the averaging.
Documentation accuracy sweep: factor-level mechanism wording, list-not-long-format storage, dev/draft m=maxit=1, checklist updated to reflect delivered items, CLAUDE.md role table, stage 5/6 docs.
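
To make the weight-validation point concrete, here is a minimal sketch of a weighted percent that refuses bad weights up front. The name `weighted_percent_sketch` is hypothetical and the data are invented; the point is only that validating before aggregating keeps the weighted denominator tied to the same rows as the displayed n.

```r
weighted_percent_sketch <- function(category, weight) {
  # Validate at the boundary: dropping missing or non-positive weights later
  # would renormalize the percents over fewer rows than the unweighted n.
  if (anyNA(weight) || any(weight <= 0)) {
    stop("Survey weights must be present and positive.", call. = FALSE)
  }
  totals <- tapply(weight, category, sum)
  100 * totals / sum(weight)
}

# Invented example: two never-smokers carry most of the weight.
category <- c("never", "never", "current", "former")
weight   <- c(3, 3, 1, 1)

pct <- weighted_percent_sketch(category, weight)
```

Here never-smokers are 50% of rows but 75% of the weighted population, which is the kind of divergence the weighted and unweighted columns are meant to show.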

Tests

102 pass (was 85): declined-cell guard exercised, structural-predictor exclusion, pack-years recomputation, weighted percent/median integration with hand-computed expectations, MI averaging exactness, the vacuous assertion replaced, and setup.R attaches cchsflow to mirror production.

Follow-up issues (out of PR scope)

Filed separately: tagging cycle-absent plain NA as NA(c) at load time (fix_na_c pattern — currently cycle-absent mass is invisible to the NA-row accounting), the statscan profile config dead-drop, a failing imputation_health target consuming logged_events, and the universe-nested predictor refinement.

A fresh end-to-end run (1% × 3 cycles, now including the auxiliary covariates) is in progress; sanity tables will be posted here when it completes.

…changes The PR #6 review fixes changed prespecified methods in Appendix D after the v0.3.1 stamp; record them in the version history rather than edit silently: NA(c) is not imputed in the base pipeline (extending the where matrix is the documented mechanism, gated on Phase 4), and structural-missingness variables do not serve as imputation predictors. Correct the 0.3.0/0.3.1 entry dates (2026-06-10, not 06-11).

Any PR touching docs/protocol/ must bump version-summary.version in full-protocol.qmd and add a dated version-history entry. The GitHub Actions check compares the version against the PR base and fails on a silent edit; CLAUDE.md documents the convention for agent sessions. Same gate-as-pipeline-citizen pattern as the coverage validator.

DougManuel · 2026-06-10T23:02:34Z

End-to-end validation (post-fix): the accounting closes

Fresh run on 1% samples of CCHS 2001, 2015-16, 2019-20 PUMF — clean → weighted Table 1a → impute (m = 2) → MI-averaged weighted Table 1b — now including all 15 auxiliary covariates in the imputation model and tables (770 table rows vs 315 before).

The decisive check (male stratum, age_first_cigarette), before-fix vs after-fix:

| | Pre-imputation (1a) | Post-imputation (1b), broken run | Post-imputation (1b), this run |
|---|---|---|---|
| Observed n | 997 | 999 (+2 of 292) | 1289 (= 997 + 292, every NA(b) filled) |
| NA(b) row | 292 (20.3%) | 0 (cells silently untagged-NA) | 0 (cells genuinely imputed) |
| NA(a) row (never-smokers) | 151 | 151 | 151 (structural missingness untouched) |
| Weighted median (IQR) | 16 (13–17) | n/a | 16 (13–17.75) (distribution preserved) |

All 20 imputation targets filled (4,742 cells total), including the auxiliaries: drinks last week 1,119, energy expenditure 944, COPD 378, community belonging 154. The predictor matrix excluded 13 structural-missingness variables from predicting (they are still imputed); the complete design/demographic/auxiliary core — cycle, weight, sex, age, province, marital status, drinks, general health, stress, belonging, hypertension, diabetes, energy expenditure — does the predicting. Several auxiliaries land in the exclusion only because cycle-absence currently arrives as plain NA rather than NA(c) (#7); the exclusion is correct either way.
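
The predictor-matrix rule (structural-missingness variables are imputed but never predict) can be sketched in base R using the mice convention that rows are imputation targets and columns are predictors. The function name and the four-variable example are hypothetical; the real pipeline derives the structural set from the data.

```r
build_predictor_matrix_sketch <- function(vars, structural) {
  pred <- matrix(1L, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))
  diag(pred) <- 0L                               # no variable predicts itself
  pred[, intersect(vars, structural)] <- 0L      # structural-NA columns off
  pred
}

vars       <- c("age", "sex", "age_first_cigarette", "time_since_quit")
structural <- c("age_first_cigarette", "time_since_quit")

pred <- build_predictor_matrix_sketch(vars, structural)
```

Zeroing a column removes that variable as a predictor everywhere, while its row stays populated, so it can still be imputed from the complete core.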

SMKDSTY_original distributions are identical in 1a/1b apart from row ordering (1 imputed cell), and weighted vs unweighted percents diverge sensibly throughout (e.g., never-smokers 34.8% unweighted vs 40.4% weighted among men).

Ready for review and merge from my side.

DougManuel added 3 commits June 10, 2026 11:58

Copilot AI review requested due to automatic review settings June 10, 2026 19:30

Copilot started reviewing on behalf of DougManuel June 10, 2026 19:30

Copilot AI reviewed Jun 10, 2026

This was referenced Jun 10, 2026

Tag cycle-absent values as NA(c) at load time (fix_na_c pattern); enforce the tagged-NA convention #7

Open

statscan profile is a config dead-drop; imputation_health target; universe-nested predictors #8

Open

DougManuel added 2 commits June 10, 2026 17:16

DougManuel merged commit 5f01329 into main Jun 10, 2026
1 check passed

DougManuel deleted the protocol branch June 10, 2026 23:05

DougManuel restored the protocol branch June 10, 2026 23:06

DougManuel merged 6 commits into main from protocol
