CSHM pipeline, protocol, and documentation: cchsflow v3 variable setup #2
Merged
…el and reproducing-Ontario-study docs
…v3 smoking variables

Survey config now uses Option A: each variable has pumf/master sub-entries with var, min, and max fields. New survey_var() and survey_bound() accessors in R/config-utils.R resolve the active data source automatically.

Smoking variables updated to unified cchsflow v3 names:
- SMKDSTY -> SMKDSTY_original (CEP-002 year-based naming)
- SMK_09A_cont/SMK_09C -> time_quit_smoking_daily
- Added age_start_daily, cigs_per_day, pack_years config keys
- Demoted 11 intermediate variables in cshm-variables.csv

APC cessation logic corrected for SMKDSTY_original categories: scope changed from c(1,2,3,4) to c(1,2,4), which excludes always-occasional smokers (cat 3) who never smoked daily.

Stage 1 workflow QMD now generates two tables: variable definitions and cycle coverage matrix (PUMF/Master per year).
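The accessor pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in R/config-utils.R: the config shape, the `survey_entry()` helper, the function signatures, and the variable names and bounds are all assumptions made for the example.

```r
# Hypothetical config: each survey key has pumf/master sub-entries with
# var, min, and max fields. Names and values are illustrative placeholders.
cfg <- list(
  data_source = "pumf",
  survey = list(
    age = list(
      pumf   = list(var = "DHHGAGE_cont", min = 12, max = 85),
      master = list(var = "DHH_AGE",      min = 12, max = 102)
    )
  )
)

# Assumed helper: look up the sub-entry for the active data source.
survey_entry <- function(cfg, key) {
  entry <- cfg$survey[[key]]
  if (is.null(entry)) stop("Unknown survey key: ", key, call. = FALSE)
  src <- entry[[cfg$data_source]]
  if (is.null(src)) {
    stop("No '", cfg$data_source, "' entry for key: ", key, call. = FALSE)
  }
  src
}

# Sketch of the two accessors; the real signatures may differ.
survey_var <- function(cfg, key) survey_entry(cfg, key)$var

survey_bound <- function(cfg, key, bound = c("min", "max")) {
  bound <- match.arg(bound)
  survey_entry(cfg, key)[[bound]]
}

survey_var(cfg, "age")           # variable name for the active source
survey_bound(cfg, "age", "max")  # upper bound for the active source
```

Switching `cfg$data_source` to `"master"` changes what both accessors return without touching any calling code, which is the point of resolving the source in one place.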
Sync worksheets/cshm-variables.csv to the merged cchsflow v3 smoking variables: verbatim v3 databaseStart/variableStart, full transitive DerivedVar feeder closure (41 rows), corrected source columns, and SMKDVSTP notes (no longer a v3 feeder). Drop coverage claims for variables absent from the 2019-20 PUMF (SMKG040 family).

Wire the pipeline to an in-repo snapshot of cchsflow v3 variable_details.csv (worksheets/cchsflow-variable-details.csv) so the repo runs without a sibling cchsflow checkout. The snapshot carries local fixes for seven worksheet defects found during validation (cchsflow #184, #185; cchsflow-data #3). Trim CSHM extension rows to GEOGPRV and WTS_M now that v3 covers age and sex for 2001-2023.

Attach cchsflow via tar_option_set in _targets.R (v3 derivation functions need its Depends on the search path), record the local cchsflow v3 install and the here package in renv.lock, and update CLAUDE.md and the variable-setup workflow doc accordingly.

Validated by harmonizing 1% samples of CCHS 2001, 2015-16, and 2019-20: all unified smoking variables derive with plausible non-missing rates; known gaps are SMKDSTY_original for 2022 PUMF and age_start_smoking for 2019-20+ PUMF (warnings, not errors).

CLAUDE.md, docs/workflow/1-variable-setup.qmd, and renv.lock also carry earlier uncommitted edits from the in-progress repo restructure.
StatCan shipped the 2015-16 PUMF with valid skip and not stated pooled
into SMKG035 code 11 ("age 50 or older"; 46.3% weighted per their own
data dictionary, absent from the errata). The uniform midpoint rule
turned this into age 55 at first cigarette for ~44k never-smokers.
The snapshot now maps cchs2015_2016_p code 11 to NA(a):
age_first_cigarette for 2015-16 returns to 58.8% non-missing
(median 16) from a corrupt 100%.
Forensics: Big-Life-Lab/cchsflow-data#3. Upstream fix in
Big-Life-Lab/cchsflow#186 alongside the cat5/SMKG040 repairs already
captured in the previous snapshot commit.
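
To make the failure mode concrete, here is a minimal sketch of a uniform midpoint rule with a cycle-specific override. It is not the cchsflow worksheet logic: the lookup table, the function name, and the cycle label cchs2001_p are assumptions, and only the code 11 -> 55 mapping and the cchs2015_2016_p label come from the message above. Plain NA stands in for the tagged NA(a).

```r
# Hypothetical midpoint lookup for grouped age at first cigarette.
# Only code 11 -> 55 ("age 50 or older") is taken from the commit message;
# the other entries are illustrative placeholders.
midpoints <- c(`1` = 8, `2` = 13, `11` = 55)

recode_age_first <- function(code, cycle) {
  # Uniform rule: every known code maps to its midpoint, unknown codes to NA.
  out <- unname(midpoints[as.character(code)])
  # Cycle-specific override: in the 2015-16 PUMF, code 11 also pools valid
  # skip and not stated, so it cannot be read as "50 or older".
  idx <- which(cycle == "cchs2015_2016_p" & code == 11)
  out[idx] <- NA_real_
  out
}

recode_age_first(
  code  = c(2, 11, 11),
  cycle = c("cchs2001_p", "cchs2001_p", "cchs2015_2016_p")
)
```

Without the override, the third respondent would be assigned age 55, which is how the never-smokers ended up with an age at first cigarette.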
Retire the interim smoking implementation (R/smoking.R, R/process_smoking_initiation.R) to R/legacy/ for reference; the pipeline now derives smoking variables through cchsflow v3. Drop the stale test for the retired function (its replacement is covered by test-apc-data.R). Remove the old config/ YAML and variable CSVs (replaced by config.yml profiles and the worksheets/ structure) and the superseded project-specification and protocol stubs (replaced by docs/protocol/).
…idation

Port the descriptive-statistics engine and worksheet helpers from DemPoRT (get-descriptive-data.R, create-descriptive-tables.R, variables-sheet-utils.R, variable-details-sheet-utils.R) and add the CSHM stage functions: clean_study_data(), impute_data() (MICE), prepare_apc_data()/fit_apc_model(), rate-table and prevalence validation stubs, and the pre-flight cycle-coverage validator.

load_study_data() gains data_source filtering, raw_data_file_map support for cchsflow-data release files, and the as.data.frame() guard for the rec_with_table() tibble bug.

Add testthat coverage for APC data preparation and descriptive tables, the LinkML role-vocabulary schema (cshm-variables.yaml), the PUMF object renaming script, and the RDC config template.
…fold

Reorganize the docs site around three purposes: the prespecified study protocol (docs/protocol/), one workflow page per pipeline stage (docs/workflow/ stages 1-8), and Divio-style how-to / explanation / reference sections. Add the manuscript scaffold rendered to Word via the docstyle extension, with all numbers drawn inline from pipeline targets.

Update README, CONTRIBUTING, LICENSE, site config, and styles for the restructure; vendor the docstyle and fontawesome Quarto extensions; ignore manuscript render output, resources/, and machine-local Claude settings; commit shared project settings (.claude/settings.json).
Drop the stale docs/_extensions/docstyle.bak copy and the manuscript/.quarto project cache; ignore the cache going forward.
clean_study_data() now excludes respondents below cfg$age_exclusion_min using the continuous age variable, replacing the age-group-code mechanism whose survey key (age_grouped) was removed in the config restructure. Drop the age_grouped column from the APC test helper and remove the test for the retired map_variable_data() (now in R/legacy). Full suite: 43 pass, 0 fail.
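
A continuous age floor of this kind might look like the sketch below. The function name, its arguments, and the decision to keep rows with missing age (leaving them for later imputation) are assumptions for illustration; the actual clean_study_data() may treat missing ages differently.

```r
# Sketch: drop respondents below a minimum age using a continuous age column.
# Assumption: rows with missing age are kept rather than excluded.
apply_age_floor <- function(data, age_var, age_min) {
  age <- data[[age_var]]
  if (is.null(age)) stop("Column not found: ", age_var, call. = FALSE)
  keep <- is.na(age) | age >= age_min
  data[keep, , drop = FALSE]
}

respondents <- data.frame(id = 1:4, age = c(10, 18, NA, 40))
apply_age_floor(respondents, age_var = "age", age_min = 18)
```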
The manuscript is rendered separately to Word via docstyle and reads pipeline targets that fresh clones will not have; the site render now skips it. Full site renders cleanly.
…ents

From the pre-merge review (four review agents over PR #2):
- clean_study_data() role filter compared whole comma-separated role strings against single roles and matched nothing, so the skewness check and truncation silently processed zero variables. Now uses select_vars_by_role().
- Cycle-1 survey year corrected to 2001 (CCHS 1.1 collected Sept 2000-Nov 2001); config value and test both said 2002 while the inline comment and cycle label said 2001. Shifts cycle-1 cohort assignment by one year; flagged for confirmation.
- survey_cycle_code() unknown-name guard was dead code (subscript error fired first); now checks names() membership.
- Comment/config accuracy: never-smoker NA(a) vs 50+ midpoint 55 in apc-model.R; PUMF initiation floor comments now note the 5-11 (midpoint 8) category excluded by the floor of 13 (open decision); age max 85 not 80; ethnicity mapping SDCGCGT (SDC_RACEM/SDCFRAC do not exist); CLAUDE.md default-profile data path, legacy file path, draft profile; _targets.R store comments; schema roles.csv pointer; smoking-histories.R and validation.R headers; %||% comment.

Tests: 43 pass, 0 fail.
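
The first and third findings are easy to reproduce in isolation. The sketch below is illustrative only: the worksheet columns, the role names, the cycle names, and the `_sketch` functions are assumptions, not the signatures of select_vars_by_role() or survey_cycle_code().

```r
# Hypothetical worksheet with comma-separated roles per variable.
variables <- data.frame(
  variable = c("age", "cigs_per_day", "sex"),
  role     = c("predictor, continuous", "outcome, continuous", "predictor"),
  stringsAsFactors = FALSE
)

# Bug pattern: comparing the whole role string to a single role matches nothing.
buggy <- variables$variable[variables$role == "continuous"]

# Fix pattern: split each role string and test membership per variable.
select_vars_by_role_sketch <- function(variables, role) {
  roles <- strsplit(variables$role, ",", fixed = TRUE)
  has_role <- vapply(roles, function(r) role %in% trimws(r), logical(1))
  variables$variable[has_role]
}

# Guard pattern: check names() membership before subscripting, otherwise
# `[[` raises its own subscript error and the custom message never fires.
cycle_codes <- c(cchs2001 = 1L, cchs2003 = 2L)  # illustrative values

survey_cycle_code_sketch <- function(name) {
  if (!name %in% names(cycle_codes)) {
    stop("Unknown survey cycle: ", name, call. = FALSE)
  }
  cycle_codes[[name]]
}

buggy                                                 # empty: zero variables
select_vars_by_role_sketch(variables, "continuous")   # both continuous variables
```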
This was referenced Jun 10, 2026
Summary
Brings the CSHM project from its initial scaffold to a working, shareable state: a {targets} pipeline that loads and harmonizes all 11 CCHS PUMF cycles (2001–2022) through cchsflow v3, a prespecified study protocol, per-stage workflow documentation, and a manuscript scaffold whose numbers draw from pipeline targets.
What's here
Pipeline (stages 1–8 active, 9–10 stubbed)
- cshm-variables.csv (41 variables with roles and PUMF/Master source tags), an in-repo snapshot of cchsflow v3 variable_details.csv, and CSHM extension rows (GEOGPRV, WTS_M for 2019-20/2022)
- load_study_data() with config-profile data sources, pre-flight cycle-coverage validation, cleaning (age floor, skewness-based truncation), MICE imputation, descriptive tables (ported from DemPoRT), and APC data preparation / model fitting
- Config profiles: default/draft/dev/prod/statscan (RDC paths gitignored)

Smoking variables (cchsflow v3 final)
- Unified smoking variables (age_first_cigarette, age_start_smoking, time_quit_smoking[_daily], cigs_per_day, pack_years_der, SMKDSTY_original) derive end-to-end; validated on 1% samples of 2001, 2015-16, and 2019-20

Documentation (renders cleanly with quarto render)
- docs/protocol/: prespecified protocol and one-page summary
- docs/workflow/: one page per pipeline stage, generated from the worksheets

Known limitations (documented in the protocol and worksheets)
- SMKDSTY_original (and downstream cigs_per_day, pack_years_der) unavailable for the 2022 PUMF (SMK_05D moved to Master-only); age_start_smoking unavailable for 2019-20+ PUMF (SMKG040 dropped). Both surface as validation warnings, not errors; Master access at the RDC fills the gaps.
- Depends on a local cchsflow v3 checkout until "fix(v3): regenerate NAMESPACE and repair smoking worksheet derivations" (cchsflow#186) merges (the v3 branch currently fails R CMD INSTALL from GitHub).