Where-matrix imputation, weighted Tables 1a/1b, and protocol v0.3.1 #6
Specify the imputation approach in section 3.4.1: where-matrix MICE restricted to item non-response (NA(b)), m = 5; survey design variables (cycle, sampling weight) and sociodemographic predictors in the imputation model for congeniality with the APC analysis; derived variables recomputed from imputed feeders. Tables 1a/1b specified as a single table with unweighted n and survey-weighted statistics (1a disclosing missing data by NA type; 1b averaged across the m completed datasets), with a fully unweighted appendix variant. Add Appendix D (docs/protocol/appendix-imputation.qmd), adapting the DemPoRTv2 imputation assessment framework: CSHM-specific missing-data context (tagged-NA taxonomy, PUMF cycle-level gaps for 2019-20 and 2022, the 2015-16 SMKG035 recode), the model specification with the three improvements over the DemPoRTv2 implementation, the Table 1a/1b presentation, the four-phase diagnostic and sensitivity plan (including delta-adjustment for cycle-absent variables with the RDC Master files as an external check), and an implementation checklist tied to issues #3 and #4.
Stage 5 (R/imputation.R): MICE with an explicit where matrix marking only NA(b) cells; tagged NA(a)/NA(c) survive untouched because the write-back copies only where-matrix cells into the original data, and the stage stops if MICE declines to impute a flagged cell (constant/collinear) rather than silently degrading an NA(b) tag to plain NA. Factor NA(x) levels are removed from the modelling frame and restored implicitly. All m completed datasets are retained, with an imputed-cells audit and surfaced loggedEvents.
Descriptive engine: weight_var argument adds survey-weighted percent (categories and NA-type rows) and weighted median/IQR via a midpoint-ECDF interpolating quantile; unchanged output when NULL. get_cshm_desc_data_mi() averages Table 1b statistics across the m imputations. Table cells render unweighted n with weighted percent / weighted median (IQR). Table rows are now selected by the table1 role as documented (was predictor); SDCFIMM/SDCGCGT gain the table1 role.
Worksheet: imputation-predictor roles per protocol Appendix D, plus 15 auxiliary covariates (marital status, alcohol, BMI, general and mental health, stress, belonging, six chronic conditions, energy expenditure; all cchsflow-v3-covered to 2019-20) with 16 feeder rows; 72 rows total. Coverage validation passes (known 2022 gap only).
_targets.R: weighted Table 1a; MI-averaged Table 1b; APC fits imputation 1 (Appendix D documents the Rubin-pooling upgrade path).
Tests: 41 new (where-matrix contract, tagged-NA round trip, MICE declined-cell guard, weighted quantiles, multi-role fixtures); 85 pass total.
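A minimal sketch of the where-matrix idea described in this commit, on toy data rather than the CSHM worksheet. It assumes the mice and haven packages; the variable names and the `write_back()` helper are hypothetical and are not the code in R/imputation.R.

```r
library(mice)
library(haven)

set.seed(2024)
n <- 200

# Toy data: one complete predictor and two variables with tagged missingness.
age  <- rnorm(n, mean = 50, sd = 10)
bmi  <- 20 + 0.1 * age + rnorm(n)
cigs <- 5 + 0.2 * age + rnorm(n)

bmi[1:20]   <- tagged_na("b")  # item non-response: to be imputed
cigs[21:60] <- tagged_na("a")  # not applicable: must stay missing
cigs[61:70] <- tagged_na("b")  # item non-response: to be imputed
dat <- data.frame(age, bmi, cigs)

# where matrix: TRUE only for NA(b) cells.
tags  <- vapply(dat, na_tag, character(n))
where <- !is.na(tags) & tags == "b"

# cigs still contains NA(a) cells, so it is imputed but does not predict.
pred <- make.predictorMatrix(dat)
pred[, "cigs"] <- 0

imp <- mice(dat, m = 2, maxit = 5, where = where,
            predictorMatrix = pred, printFlag = FALSE, seed = 1)

# Hypothetical helper: copy only where-matrix cells into the original data,
# so NA(a) tags survive, and stop if a flagged cell was left unimputed.
write_back <- function(original, completed, where) {
  for (v in names(original)) {
    idx <- where[, v]
    if (!any(idx)) next
    if (anyNA(completed[[v]][idx])) {
      stop("MICE declined to impute flagged cells in '", v,
           "'; inspect imp$loggedEvents.", call. = FALSE)
    }
    original[[v]][idx] <- completed[[v]][idx]
  }
  original
}

# Keep all m completed datasets as a list.
completed <- lapply(seq_len(imp$m), function(i) {
  write_back(dat, complete(imp, i), where)
})
```

Writing into a copy of the original data, rather than returning `complete()` directly, is what keeps the NA(a) tags intact in this sketch: only the flagged positions are ever assigned.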
Amend section 3.4.1 and Appendix D with the expanded auxiliary covariate set (marital status, alcohol, BMI, self-rated general and mental health, life stress, community belonging, chronic conditions, energy expenditure — all harmonized in cchsflow v3 through 2019-20), serving imputation quality and the study-base description for planned related studies. Income to 2019-20 documented as a pending upstream cchsflow extension. Add Appendix D to the site sidebar. Rewrite the Stage 5 workflow page for the where-matrix design (NA(b) only; structural missingness preserved; analysis_data is now a list of m completed datasets) and align the manuscript methods text and analysis_data usage with the implemented approach.
Pull request overview
This PR implements the protocol v0.3.x missing-data/descriptive-statistics design by adding where-matrix MICE imputation restricted to NA(b) item non-response, producing survey-weighted descriptive Tables 1a/1b (with Table 1b averaged across imputations), and expanding the worksheet’s auxiliary covariate set and documentation to match the protocol.
Changes:
- Add where-matrix MICE imputation that preserves structural missingness and returns all m completed datasets + an audit/logged events.
- Extend descriptive-statistics engine and table rendering to support survey-weighted percent and weighted median/IQR, and average Table 1b across imputations.
- Update targets pipeline, tests, worksheets, manuscript, and protocol/workflow documentation to reflect the new imputation artifact and Table 1 design.
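To illustrate what averaging Table 1b across imputations can look like, here is a small base-R sketch. The function name `average_across_imputations()` and the column names (`variable`, `cat`, `n`, `wtd_pct`) are assumptions for illustration, not the signature of `get_cshm_desc_data_mi()`.

```r
# Average numeric statistics across m per-imputation summary tables.
# `key_cols` identify a table row; every table must list the same rows
# in the same order (a row-alignment invariant).
average_across_imputations <- function(tables, key_cols, stat_cols) {
  stopifnot(length(tables) >= 1)
  ref_keys <- tables[[1]][key_cols]
  for (tab in tables[-1]) {
    if (!identical(tab[key_cols], ref_keys)) {
      stop("Row keys differ between imputations; refusing to average.",
           call. = FALSE)
    }
  }
  out <- tables[[1]]
  for (s in stat_cols) {
    per_imp <- vapply(tables, function(tab) as.numeric(tab[[s]]),
                      numeric(nrow(out)))
    # One row per table row, one column per imputation.
    out[[s]] <- rowMeans(matrix(per_imp, nrow = nrow(out)))
  }
  out
}

# Toy summaries from two completed datasets.
imp1 <- data.frame(variable = c("sex", "sex"), cat = c("F", "M"),
                   n = c(52, 48), wtd_pct = c(51, 49))
imp2 <- data.frame(variable = c("sex", "sex"), cat = c("F", "M"),
                   n = c(54, 46), wtd_pct = c(53, 47))

table_1b <- average_across_imputations(
  list(imp1, imp2),
  key_cols  = c("variable", "cat"),
  stat_cols = c("n", "wtd_pct")
)
```

Checking the keys before averaging turns a misaligned table into a hard stop instead of a silently wrong mean.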
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| worksheets/cshm-variables.csv | Adds/updates roles (incl. imputation-predictor, table1) and auxiliary covariates for imputation + Table 1. |
| tests/testthat/test-imputation.R | New tests for where-matrix contract, tag preservation, and declined-cell guard. |
| tests/testthat/test-descriptive-tables.R | Updates role parsing expectations and adds coverage for table1 role selection. |
| R/imputation.R | Implements where-matrix MICE, write-back restricted to NA(b), multi-dataset output, and logged-events surfacing. |
| R/get-descriptive-data.R | Adds weighted quantile + weight-aware descriptive stats generation (engine layer). |
| R/descriptive-data.R | Adds weight support and MI-averaging wrapper for Table 1b. |
| R/create-descriptive-tables.R | Updates cell formatting/footnotes to display unweighted n with weighted percent/median-IQR when available. |
| manuscript/manuscript.qmd | Updates analysis-data loading to account for new analysis_data structure (list of datasets). |
| docs/workflow/5-imputation.qmd | Updates workflow documentation to describe where-matrix MICE and the new analysis_data artifact structure. |
| docs/protocol/full-protocol.qmd | Updates protocol text/version history to v0.3.1 and describes where-matrix MI + weighted Table 1s. |
| docs/protocol/appendix-imputation.qmd | Adds Appendix D detailing missingness taxonomy, imputation spec, and diagnostics/sensitivity plan. |
| _targets.R | Switches Table 1a/1b targets to weighted + MI-averaged variants and uses imputation 1 for APC data prep. |
| _quarto.yml | Adds Appendix D to the protocol sidebar. |
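The "unweighted n with weighted percent" cell design mentioned for the descriptive engine can be sketched in a few lines of base R. This is an illustration under assumptions: `weighted_category_summary()` and its output columns are hypothetical, and missing-data types are assumed to be explicit categories already.

```r
# Unweighted counts alongside survey-weighted percentages for one
# categorical variable. Weights are validated before any arithmetic.
weighted_category_summary <- function(x, w) {
  if (length(x) != length(w)) {
    stop("x and w must have the same length.", call. = FALSE)
  }
  if (anyNA(w) || any(w <= 0)) {
    stop("Survey weights must be non-missing and positive.", call. = FALSE)
  }
  if (anyNA(x)) {
    stop("Recode missing values to explicit categories first.", call. = FALSE)
  }
  x <- factor(x)
  w_sum <- as.vector(tapply(w, x, sum))
  w_sum[is.na(w_sum)] <- 0  # empty factor levels contribute no weight
  data.frame(
    cat     = levels(x),
    n       = as.vector(table(x)),    # unweighted count
    wtd_pct = 100 * w_sum / sum(w)    # share of total weight
  )
}

smoker <- c("never", "never", "former", "current")
weight <- c(100, 300, 400, 200)
smoking_summary <- weighted_category_summary(smoker, weight)
```

In this toy input the two never-smokers are half of the sample by count but carry 40% of the total weight, which is why both numbers are worth showing in the same cell.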
Comments suppressed due to low confidence (1)
R/get-descriptive-data.R:44
- When weight_var is supplied, new_row gains wtd_* columns, but descriptive_data is initialized without them. Because the code uses base rbind(), this will error with “numbers of columns of arguments do not match” on the first append. Initialize the weighted columns on descriptive_data when use_weights is TRUE (or always) so rbind() has consistent columns.
descriptive_data <- data.frame(
variable = c(),
cat = c(),
median = c(),
percentile25 = c(),
# Convergence check: mice() does not error on pathological chains, so a
# quick guard on non-finite chain means catches degenerate fits.
if (any(!is.finite(mice_result$chainMean), na.rm = FALSE) &&
    any(is.nan(mice_result$chainMean))) {
  warning("Non-finite MICE chain means detected — inspect convergence ",
          "(mice::plot) before using these imputations.", call. = FALSE)
}
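The guard quoted above joins its two conditions with `&&`, so it can only fire when a NaN is present, and it looks at every variable's chain means at once. A hedged sketch of a check scoped to the variables that were actually imputed follows; `check_chain_means()` is a hypothetical helper, and the toy array assumes chain means are laid out as variables × iterations × chains with NaN for variables that were never imputed.

```r
# Stop when any chain mean of an imputed variable is non-finite
# (NaN, NA, or +/-Inf). Variables that were never imputed are ignored.
check_chain_means <- function(chain_mean, imputed_vars) {
  scoped <- chain_mean[imputed_vars, , , drop = FALSE]
  if (any(!is.finite(scoped))) {
    bad <- imputed_vars[apply(scoped, 1, function(v) any(!is.finite(v)))]
    stop("Non-finite MICE chain means for: ", paste(bad, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy array: 3 variables x 4 iterations x 2 chains.
set.seed(1)
cm <- array(rnorm(24), dim = c(3, 4, 2),
            dimnames = list(c("age", "bmi", "cigs"), NULL, NULL))
cm["age", , ] <- NaN  # age was never imputed, so its chain means are NaN

# Healthy run: the NaN rows for age are out of scope, so nothing fires.
healthy <- check_chain_means(cm, imputed_vars = c("bmi", "cigs"))

# Degenerate run with no NaN among the imputed variables: Inf is still caught.
cm["cigs", 4, 1] <- Inf
degenerate <- tryCatch(
  check_chain_means(cm, imputed_vars = c("bmi", "cigs")),
  error = function(e) conditionMessage(e)
)
```

Scoping by variable is what separates the expected NaN rows from a genuinely degenerate fit, and a single `is.finite()` test covers the Inf case that the `&&` form lets through.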
| Tag | Meaning | Treatment |
|-----|---------|-----------|
| NA(a) | Not applicable (e.g., initiation age for never-smokers) | Structural; never imputed |
| NA(b) | Don't know / refused / not stated | Item non-response; imputed under MAR |
| NA(c) | Not asked in this survey cycle | Cycle-level absence; imputed under a stronger MAR assumption, with sensitivity analysis |
Imputation (R/imputation.R):
- Predictor matrix excludes structural-NA variables as predictors (they may be imputed but cannot predict): mice drops rows with missing predictors from each conditional model, so a structural-NA predictor such as time-since-quit made most NA(b) cells in other variables unpredictable. Synthetic validation: all 700 flagged cells fill, zero untagged NA, all NA(a)/(c) tags preserved.
- Chain-mean convergence check scoped to actually-imputed variables and promoted to stop() (the unscoped version warned on every healthy run and could never detect the NaN-free degenerate case).
- Dropping a prespecified design predictor (weight, cycle, sex, age) from the imputation model is now an error, not a one-shot warning.
- pack_years_der recomputed from imputed feeders via cchsflow::calculate_pack_years after each completed dataset, delivering the protocol's derived-variable claim.
Descriptives:
- table-1.qmd renders rows by the table1 role (the 15 auxiliaries were computed but silently dropped at render), adds the Health status section, weights the cycle appendix, and adds the fully unweighted appendix variant promised by the protocol (new unweighted targets).
- Survey weights validated at the engine boundary (stop on missing or non-positive); weighted_quantile rejects bad weights; zero-row table lookups stop as wiring bugs instead of rendering 'No data'; MI-averaged n rounded for display; MI row-alignment invariant.
Docs: Appendix D aligned with the implementation (NA(c) not imputed in the base pipeline with the Phase 4 evaluation path; predictor-matrix design documented; factor-level mechanism wording; checklist updated), stage 5/6 docs and CLAUDE.md role table corrected.
Tests: declined-cell guard exercised (either layered stop accepted), structural-predictor exclusion, pack_years recomputation, weighted percent/median integration, MI averaging with exact expectations, vacuous assertion replaced; cchsflow attached in setup.R mirroring tar_option_set. 102 tests pass.
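The predictor-matrix rule in this commit (structural-NA variables are imputed but never predict, and losing a design predictor is an error) can be sketched in base R. Everything here is an assumption for illustration: `build_predictor_matrix()`, the toy tag matrix, and the choice to treat tags "a" and "c" as structural.

```r
# `tags` is a character matrix (rows = respondents, columns = variables)
# holding "a", "b", "c" for tagged-missing cells and NA for observed cells.
build_predictor_matrix <- function(tags, design_vars) {
  vars <- colnames(tags)
  structural <- vars[apply(tags, 2, function(t) any(t %in% c("a", "c")))]

  # A design predictor must never be excluded silently.
  dropped_design <- intersect(design_vars, structural)
  if (length(dropped_design) > 0) {
    stop("Prespecified design predictors would be dropped: ",
         paste(dropped_design, collapse = ", "), call. = FALSE)
  }

  # mice convention: rows are imputation targets, columns are predictors.
  pred <- matrix(1, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))
  diag(pred) <- 0
  pred[, structural] <- 0  # still a target (row), never a predictor (column)
  pred
}

tags <- cbind(
  age        = c(NA, NA, NA, NA),
  bmi        = c(NA, "b", NA, NA),
  quit_years = c("a", "a", NA, "b")
)
pred <- build_predictor_matrix(tags, design_vars = "age")
```

Zeroing a column rather than a row keeps the structural variable's own NA(b) cells imputable from the complete predictors, which matches the "may be imputed but cannot predict" wording above.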
PR-skill review completed: findings and resolutions
Four review agents (code, silent-failures, tests, comment/protocol accuracy) ran over the full diff. Everything below is fixed in 539171d unless marked as a follow-up issue.
Critical findings, all resolved
Important findings, resolved
Tests
102 pass (was 85): declined-cell guard exercised, structural-predictor exclusion, pack-years recomputation, weighted percent/median integration with hand-computed expectations, MI averaging exactness, the vacuous assertion replaced, and
Follow-up issues (out of PR scope)
Filed separately: tagging cycle-absent plain NA as NA(c) at load time (
A fresh end-to-end run (1% × 3 cycles, now including the auxiliary covariates) is in progress; sanity tables will be posted here when it completes.
…changes
The PR #6 review fixes changed prespecified methods in Appendix D after the v0.3.1 stamp; record them in the version history rather than edit silently: NA(c) is not imputed in the base pipeline (extending the where matrix is the documented mechanism, gated on Phase 4), and structural-missingness variables do not serve as imputation predictors. Correct the 0.3.0/0.3.1 entry dates (2026-06-10, not 06-11).
Any PR touching docs/protocol/ must bump version-summary.version in full-protocol.qmd and add a dated version-history entry. The GitHub Actions check compares the version against the PR base and fails on a silent edit; CLAUDE.md documents the convention for agent sessions. Same gate-as-pipeline-citizen pattern as the coverage validator.
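The logic of such a gate is simple enough to sketch in R. This is not the repository's GitHub Actions check: `check_protocol_version()` and its arguments are hypothetical, and the sketch covers only the version comparison, not the dated version-history entry.

```r
# Fail when files under docs/protocol/ changed but the protocol version
# did not increase relative to the PR base.
check_protocol_version <- function(changed_files, base_version, head_version) {
  touches_protocol <- any(startsWith(changed_files, "docs/protocol/"))
  if (!touches_protocol) {
    return(invisible(TRUE))
  }
  if (numeric_version(head_version) <= numeric_version(base_version)) {
    stop("docs/protocol/ changed but the version was not bumped (base ",
         base_version, ", head ", head_version, ").", call. = FALSE)
  }
  invisible(TRUE)
}

# Code-only change: no bump required.
code_only <- check_protocol_version("R/imputation.R", "0.3.1", "0.3.1")

# Protocol edit without a bump: the gate fails.
silent_edit <- tryCatch(
  check_protocol_version("docs/protocol/appendix-imputation.qmd",
                         "0.3.1", "0.3.1"),
  error = function(e) conditionMessage(e)
)
```

Comparing with `numeric_version()` rather than as strings avoids ordering mistakes such as "0.3.10" sorting before "0.3.2".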
End-to-end validation (post-fix): the accounting closes
Fresh run on 1% samples of CCHS 2001, 2015-16, 2019-20 PUMF (clean → weighted Table 1a → impute (m = 2) → MI-averaged weighted Table 1b), now including all 15 auxiliary covariates in the imputation model and tables (770 table rows vs 315 before). The decisive check (male stratum,
All 20 imputation targets filled (4,742 cells total), including the auxiliaries: drinks last week 1,119, energy expenditure 944, COPD 378, community belonging 154. The predictor matrix excluded 13 structural-missingness variables from predicting (they are still imputed); the complete design/demographic/auxiliary core — cycle, weight, sex, age, province, marital status, drinks, general health, stress, belonging, hypertension, diabetes, energy expenditure — does the predicting. Several auxiliaries land in the exclusion only because cycle-absence currently arrives as plain NA rather than NA(c) (#7); the exclusion is correct either way.
Ready for review and merge from my side.
Summary
Implements the missing-data and descriptive-statistics design approved for protocol v0.3.x: where-matrix MICE imputation restricted to item non-response, survey-weighted Tables 1a/1b averaged across imputations, and an expanded auxiliary covariate set. Protocol and implementation are in one PR so the prespecification and the code can be reviewed together.
Protocol (v0.3.0 → v0.3.1)
- docs/protocol/appendix-imputation.qmd, also in the site sidebar): the tagged-NA taxonomy, CSHM-specific cycle-level gaps (2019-20/2022 PUMF; the 2015-16 SMKG035 recode), the model specification with three documented improvements over the DemPoRTv2 implementation, the Table 1a/1b presentation, and a four-phase diagnostic and sensitivity plan (including delta-adjustment for cycle-absent variables, with the RDC Master files as an external check).
Implementation
- R/imputation.R: impute_data() builds an explicit where matrix marking only NA(b) cells (haven tag "b" / "NA(b)" factor level). Structural NA(a)/NA(c) survive untouched: the write-back copies only where-matrix cells, and the stage stops if MICE declines to impute a flagged cell (constant/collinear) instead of silently degrading a tag to plain NA. Factor NA(x) levels are excluded from the modelling frame so polyreg cannot assign them. All m completed datasets are retained with an imputed-cells audit and surfaced loggedEvents.
- weight_var adds survey-weighted percent and weighted median/IQR (midpoint-ECDF interpolating quantile; equal-weight median equals the type-7 median). get_cshm_desc_data_mi() averages Table 1b across imputations. Cells render as unweighted n with weighted percent / weighted median (IQR). Row selection now uses the table1 role as documented.
- imputation-predictor roles assigned; 15 auxiliary variables + 16 feeders added (72 rows). Coverage validation passes (known 2022 smoking gap only).
- _targets.R: weighted Table 1a; MI-averaged Table 1b; APC fitted on imputation 1 with the Rubin-pooling upgrade path documented in Appendix D.
Review already applied
An agent code review of the implementation found and fixed one critical issue before this PR: when MICE flags a target variable constant/collinear it leaves its where-TRUE cells unimputed, and the original write-back would have replaced the NA(b) tag with untagged NA — now a hard failure pointing at the logged events. Also fixed: the weighted quantile now genuinely interpolates (docstring/implementation mismatch), and an NA-guard in the continuous cell formatter.
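For readers unfamiliar with a midpoint-ECDF interpolating quantile, a minimal base-R sketch follows. The argument list is an assumption and this is not necessarily how the repository's weighted_quantile is written; it only shows one estimator that fits the description.

```r
# Weighted quantile by linear interpolation between midpoint-ECDF
# positions: each sorted value x_i sits at (cumsum(w)_i - w_i / 2) / sum(w).
weighted_quantile <- function(x, w, probs) {
  if (length(x) != length(w)) {
    stop("x and w must have the same length.", call. = FALSE)
  }
  if (anyNA(w) || any(w <= 0)) {
    stop("Weights must be non-missing and positive.", call. = FALSE)
  }
  keep <- !is.na(x)
  x <- x[keep]
  w <- w[keep]
  if (length(x) == 0) return(rep(NA_real_, length(probs)))
  if (length(x) == 1) return(rep(x, length(probs)))

  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  mid <- (cumsum(w) - w / 2) / sum(w)

  # rule = 2 clamps probabilities outside the midpoint range to the extremes.
  approx(mid, x, xout = probs, rule = 2)$y
}

# Equal weights: the median matches stats::median().
even_median <- weighted_quantile(c(1, 2, 3, 4), rep(1, 4), 0.5)

# Unequal weights pull the median toward the heavier observation.
wtd_median <- weighted_quantile(c(1, 2, 3), c(1, 1, 2), 0.5)
```

Under equal weights the positions reduce to (i − 0.5)/n, so the median agrees with the usual sample median, while unequal weights shift the interpolation points toward heavier observations.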
Testing
- haven::na_tag; factor levels preserved; imputed categorical values never structural-missing labels; complete columns byte-identical), the MICE declined-cell guard, weighted quantile edge cases, and multi-role worksheet parsing.
- tar_make(); their cycle coverage is validated, and they are long-established cchsflow variables rather than the freshly repaired smoking chain.
Known limitations