fix(v3): regenerate NAMESPACE and repair smoking worksheet derivations #186
Open
DougManuel wants to merge 4 commits into
Regenerate NAMESPACE and man/ with roxygen2: removes ~40 stale exports left from the v3 refactor (alcohol, BMI, ADL, worksheet tooling) that made the package fail R CMD INSTALL, and adds the missing importFrom(dplyr, case_when).

Repair nine defects in the smoking rows of variable_details.csv (and variables.csv where feeder lists are mirrored):

- Rename phantom Func:: references (calculate_SMKG040, calculate_SMKG203/207_continuous and _from_combined) to the actual calculate_*_cont functions.
- Match DerivedVar feeder lists to function signatures: SMKG040_cont from [SMKG203_cont, SMKG207_cont]; SMKG203_cont/SMKG207_cont 2015+ from [SMK_005|SMK_030, SMKG040_cont]; cigs_per_day feeder order [SMKDSTY_original, SMK_204, SMK_208] (feeders pass positionally); pack_years_der occasional slots take SMK_05B/SMK_05C as in v2.
- Split time_quit_smoking_daily into PUMF and Master blocks so the Master-only SMK_09C feeder no longer blocks PUMF derivation.
- Correct the SMKDSTY_cat5 2015+ map to the post-2015 SMKDVSTY codes (1 daily, 2 occasional, 3 former daily, 4-5 former occ/experimental, 6 never); the pre-2015 collapse rules had been applied, classifying former-daily smokers as occasional and never-smokers as NA(b).
- Drop the false cchs2019_2020_p claim from SMKG040/SMKG040_cont (raw SMKG040 exists only in 2015-16 and 2017-18 PUMF).
- Add a cchs2015_2016_p exception mapping SMKG035 code 11 to NA(a): the official 2015-16 PUMF pools valid skip and not stated into code 11 (StatCan data dictionary p. 123; not in the errata).

Rebuild data/variable_details.RData and data/variables.RData from the corrected CSVs.

Validated by harmonizing 1% samples of CCHS 2001, 2015-16, and 2019-20 PUMF end-to-end through rec_with_table(): all unified smoking variables derive with plausible non-missing rates and category distributions matching the raw data.

Fixes #184. Fixes #185.
DougManuel added a commit to Big-Life-Lab/cshm-dev that referenced this pull request Jun 10, 2026
…ixes

Pull in three further cchsflow v3 worksheet repairs (now in upstream PR Big-Life-Lab/cchsflow#186):

- SMKDSTY_cat5 2015+ recoded to the post-2015 SMKDVSTY codes; former daily smokers are no longer classified as occasional, never-smokers no longer NA(b), and time_quit_smoking_daily now derives for 2015+ cycles (25% non-missing in 2015-16, matching the former-daily share).
- SMKG040/SMKG040_cont no longer claim cchs2019_2020_p (raw SMKG040 absent from the 2019-20 PUMF); matching claim dropped from cshm-variables.csv.
- SMKG01C_cont maps cchs2015_2016_p code 11 to NA(a): StatCan shipped the 2015-16 PUMF with valid skip and not stated pooled into SMKG035 code 11, which the uniform midpoint rule turned into age 55 at first cigarette for ~44k never-smokers. age_first_cigarette for 2015-16 returns to 58.8% non-missing (median 16) from a corrupt 100%.

Forensics: Big-Life-Lab/cchsflow-data#3.
DougManuel added a commit to Big-Life-Lab/cshm-dev that referenced this pull request Jun 10, 2026
StatCan shipped the 2015-16 PUMF with valid skip and not stated pooled
into SMKG035 code 11 ("age 50 or older"; 46.3% weighted per their own
data dictionary, absent from the errata). The uniform midpoint rule
turned this into age 55 at first cigarette for ~44k never-smokers.
The snapshot now maps cchs2015_2016_p code 11 to NA(a):
age_first_cigarette for 2015-16 returns to 58.8% non-missing
(median 16) from a corrupt 100%.
Forensics: Big-Life-Lab/cchsflow-data#3. Upstream fix in
Big-Life-Lab/cchsflow#186 alongside the cat5/SMKG040 repairs already
captured in the previous snapshot commit.
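The cycle-specific exception described in this commit can be sketched in a few lines of base R. This is only an illustration, not the cchsflow implementation: `recode_smkg035` and the `"some_other_cycle"` label are hypothetical, only the `11 -> 55` midpoint comes from the text above (the other midpoints are placeholders), and a plain `NA_real_` stands in for the tagged `NA(a)`.

```r
# Illustrative midpoints for grouped age codes. Only 11 -> 55 is taken from
# the commit message; the other entries are made-up placeholders.
midpoints <- c("1" = 8, "2" = 13, "11" = 55)

# Hypothetical helper: apply the uniform midpoint rule, then override it
# for the one cycle where code 11 is known to be polluted.
recode_smkg035 <- function(code, database, midpoints) {
  age <- unname(midpoints[as.character(code)])
  # In cchs2015_2016_p, code 11 pools valid skip and not stated, so it
  # cannot be read as "age 50 or older".
  pooled <- database == "cchs2015_2016_p" & !is.na(code) & code == 11
  age[pooled] <- NA_real_  # stands in for the tagged NA(a)
  age
}

in_2015_16 <- recode_smkg035(c(2, 11), "cchs2015_2016_p", midpoints)
elsewhere  <- recode_smkg035(c(2, 11), "some_other_cycle", midpoints)
```

Under these assumptions, code 11 becomes missing only for `cchs2015_2016_p`, while any other database label keeps the uniform midpoint.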
Pull request overview
This PR fixes cchsflow v3 installability by regenerating roxygen outputs (notably NAMESPACE) and repairs worksheet-driven smoking derivations so rec_with_table() can successfully resolve Func:: names and apply correct feeder ordering/signatures when harmonizing via the CSV worksheets.
Changes:
- Regenerated roxygen outputs (`NAMESPACE`, `man/*`) to remove stale exports and restore missing imports (e.g., `dplyr::case_when`), addressing the v3 `R CMD INSTALL` failure.
- Updated smoking-related worksheet metadata (e.g., `variable_details.csv` / `variables.csv`) to align `Func::...` references and `DerivedVar::[...]` feeders (including positional feeder order) with the current R implementations used by `rec_with_table()`.
- Added/removed Rd documentation to match the refactored/renamed public APIs (e.g., ADL scoring docs added; older respiratory-condition docs removed).
Reviewed changes
Copilot reviewed 74 out of 86 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| NAMESPACE | Regenerated exports/imports to remove stale exports and fix install/load failures. |
| inst/extdata/variable_details.csv | Smoking worksheet derivation metadata updated to match actual function names/signatures and correct feeder lists/order for rec_with_table(). |
| inst/extdata/variables.csv | Mirrors worksheet-facing variable metadata updates (notably for smoking-derived variables like cigs_per_day). |
| man/set_data_labels.Rd | Updated documentation content for labeling utility (but currently contains a parameter doc mismatch). |
| man/score_adl.Rd | New/updated ADL scoring documentation generated by roxygen. |
| man/score_adl_6.Rd | New/updated 6-item ADL scoring documentation generated by roxygen. |
| man/resp_condition_fun2.Rd | Removed obsolete documentation for deprecated respiratory-condition variant. |
| man/resp_condition_fun3.Rd | Removed obsolete documentation for deprecated respiratory-condition variant. |
Files not reviewed (9)
- man/CCC_091_fun1.Rd: Language not supported
- man/CCC_091_fun2.Rd: Language not supported
- man/COPD_Emph_der_fun1.Rd: Language not supported
- man/COPD_Emph_der_fun2.Rd: Language not supported
- man/EDUDR04_fun.Rd: Language not supported
- man/PACK_YEARS_CONSTANTS.Rd: Language not supported
- man/active_transport3_fun.Rd: Language not supported
- man/adjust_bmi.Rd: Language not supported
- man/assess_adl.Rd: Language not supported
Comment on lines +10 to +16

\item{variable_details}{A dataframe containing the details of each variable,
with category labels in the 'catLabel' column.}

\item{variable_details}{variable_details.csv}

\item{variables_sheet}{(Optional) A dataframe containing variable labels in
the 'label' column.}

\item{variables_sheet}{variables.csv}

\item{data_to_lab.el}{A dataframe of CCHS data that lacks labels.}
rafdoodle approved these changes Jun 10, 2026
Fixes runtime failures in assess_adl, score_adl, score_adl_6, calculate_binge_drinking, calculate_drinking_risk_short and calculate_drinking_risk_long caused by the never-merged missing-data-helpers.R dependency (silent tryCatch/source masked it). Removes the abandoned splice idiom and library()/source() headers.

Validation bounds now come from variable_details.csv via clean_variables(): adds categorical valid sets from typeEnd cat recEnd codes, vectorized else logic, CCHS label-string coercion to tagged NAs, scalar recycling with explicit length validation, and a once-per-session database-config warning.

Tests: test-adl 98 passing, test-alcohol 51 passing (previously 24 and 12 errors). Full suite at pre-existing baseline.

Fixes #132. Relates to #184, #185.
…atabases recode_call() received the full database_name vector instead of the loop variable, so grepl() matched only the first database's worksheet rows and every dataframe in the list was silently recoded with the first database's rules. Passes the loop variable and adds a per-database regression test.
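The failure mode in this commit is easy to reproduce in isolation. The sketch below is not the `recode_call()` code; the worksheet, its column names, and `rows_for()` are hypothetical stand-ins. It relies on the base R behaviour that `grepl()` uses only the first element of a pattern vector (normally with a warning, suppressed here to mimic the silent failure).

```r
# Hypothetical worksheet: one rule row per database (column names assumed).
worksheet <- data.frame(
  variable = c("rule_for_2001", "rule_for_2015_2016"),
  databaseStart = c("cchs2001_p", "cchs2015_2016_p"),
  stringsAsFactors = FALSE
)
database_name <- c("cchs2001_p", "cchs2015_2016_p")

# Return the worksheet rules whose database column matches the pattern.
rows_for <- function(db) {
  worksheet$variable[grepl(db, worksheet$databaseStart, fixed = TRUE)]
}

# Buggy loop: the whole vector is passed on every iteration, so grepl()
# falls back to its first element each time.
buggy <- lapply(database_name, function(db) {
  suppressWarnings(rows_for(database_name))
})

# Fixed loop: each iteration passes its own loop variable.
fixed <- lapply(database_name, rows_for)
```

In this toy setup `buggy` selects the 2001 rule for both databases, while `fixed` selects each database's own rule, which is the kind of difference a per-database regression test can assert on.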
Summary

Fixes the v3 installability failure (#184) and nine smoking worksheet defects (#185) found while integrating cchsflow v3 into the CSHM pipeline (cshm-dev), which harmonizes through `rec_with_table()` with the CSV worksheets. The test suite does not exercise this path: tests call the `calculate_*()` functions directly.

Changes

Packaging (#184)

- Regenerate `NAMESPACE` and `man/` with roxygen2: removes ~40 stale exports (alcohol, BMI, ADL, worksheet-tooling functions removed in the v3 refactor) that made the package fail `R CMD INSTALL`; adds missing `importFrom(dplyr, case_when)`.

Worksheets (#185), in `inst/extdata/variable_details.csv` (+ `variables.csv` where feeders are mirrored), with `data/*.RData` rebuilt per convention

1. `Func::` names → actual functions: `calculate_SMKG040` → `calculate_SMKG040_cont`; `calculate_SMKG203_continuous` / `_from_combined` → `calculate_SMKG203_cont`; `calculate_SMKG207_continuous` / `_from_combined` → `calculate_SMKG207_cont`.
2. `SMKG040_cont` feeder lists matched to signatures: PUMF 2001–2014 `[SMKG203_cont, SMKG207_cont]` (was the categorical `_pre2005` / `_2005plus` variants).
3. `SMKG203_cont` / `SMKG207_cont` 2015+ feeders: `[SMK_005, SMKG040_cont]` / `[SMK_030, SMKG040_cont]` (drop the third feeder that landed in `output_format`; pass `_cont`, not grouped codes). Master `from_combined` rows likewise.
4. `cigs_per_day` feeder order: `[SMKDSTY_original, SMK_204, SMK_208]`. Feeders pass positionally; the reversed order gated "status == 1" on cigarettes/day, yielding 0.4% non-missing instead of ~47%.
5. `pack_years_der`: occasional-smoker slots get `SMK_05B` / `SMK_05C` (cigs per day smoked, days per month) instead of `SMK_204` / `SMK_208`, as in v2's `pack_years_fun()`.
6. `time_quit_smoking_daily`: split the single all-database block into PUMF (`[SMKDSTY_cat5, SMK_09A_cont]`) and Master (`+ SMK_09C`) blocks. `SMK_09C` is Master-only, so the shared block made the variable underivable on every PUMF cycle despite the function's `SMK_09C = NULL` default.
7. `SMKDSTY_cat5` 2015+ block had the pre-2015 collapse rules: `[2,3]→2`, `[4,5]→4`, no `3→3`, no `6→5`. Post-2015 SMKDVSTY codes are 1 daily / 2 occasional / 3 former daily / 4 former occasional / 5 experimental / 6 never; the old map classified 28k former-daily smokers (2015-16) as occasional, sent never-smokers to NA(b), and made `time_quit_smoking_daily` derive for nobody in 2015+.
8. `SMKG040` / `SMKG040_cont`: drop the false `cchs2019_2020_p` claim. Raw SMKG040 exists only in 2015-16/2017-18 PUMF (DDI-confirmed).
9. `SMKG01C_cont`: cycle-specific exception `cchs2015_2016_p` code `11 → NA(a)`. The official 2015-16 PUMF pools valid skip/not stated into SMKG035 code 11 ("age 50+", 45,737 = 46.3% weighted; StatCan data dictionary p. 123, absent from the errata). The uniform `11 → 55` rule assigned age 55 at first cigarette to ~44k never-smokers. Forensics: Big-Life-Lab/cchsflow-data#3.

Not fixed here: `SMKDSTY_cat3` has no PUMF value rows for codes 1–5 (#185 item 8). It needs new rows and is flagged for a follow-up.

Validation

Harmonized 1% samples of CCHS 2001, 2015-16, and 2019-20 PUMF end-to-end through `rec_with_table()` (via the CSHM pipeline). All unified smoking variables (`age_first_cigarette`, `age_start_smoking`, `time_quit_smoking`, `time_quit_smoking_daily`, `cigs_per_day`, `pack_years_der`, `smoked_100_lifetime`, `SMKDSTY_original`) derive with plausible non-missing rates and category distributions matching the raw data. Remaining gaps are data realities: `SMKDSTY_original` for 2022 PUMF (SMK_05D Master-only) and `age_start_smoking` for 2019-20+ PUMF.

Suggested follow-up

A build-time check comparing each `DerivedVar::[...]` list (length and order) against `formals()` of the named `Func::`, and each feeder's database coverage against the derived variable's claims, would have caught defects 1–8 mechanically.

Fixes #184. Fixes #185.
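A minimal sketch of the signature half of the suggested build-time check, assuming worksheet cells hold strings of the form `Func::name` and `DerivedVar::[a, b, c]`. The helper `check_derivation()` and the two stand-in functions (names, arguments, and bodies) are hypothetical; the order check only fires where a feeder name coincides with an argument name, and the database-coverage comparison is not covered.

```r
# Stand-ins for package functions; names and signatures are assumptions.
calculate_cigs_per_day <- function(SMKDSTY_original, SMK_204, SMK_208) NULL
calculate_time_quit_daily <- function(SMKDSTY_cat5, SMK_09A_cont,
                                      SMK_09C = NULL) NULL

# Hypothetical check: returns a character vector of problems (empty if none).
check_derivation <- function(func_ref, derived_var) {
  fn_name <- sub("^Func::", "", func_ref)
  if (!exists(fn_name, mode = "function")) {
    return(sprintf("%s: no such function", fn_name))
  }
  fmls <- formals(get(fn_name, mode = "function"))
  args <- names(fmls)
  # An argument without a default is stored as the empty symbol.
  required <- vapply(
    fmls,
    function(a) is.symbol(a) && !nzchar(as.character(a)),
    logical(1)
  )
  inner <- sub("^DerivedVar::\\[(.*)\\]$", "\\1", derived_var)
  feeders <- trimws(strsplit(inner, ",", fixed = TRUE)[[1]])

  problems <- character(0)
  # Length: at least the required arguments, at most all of them.
  if (length(feeders) < sum(required) || length(feeders) > length(args)) {
    problems <- c(problems, sprintf(
      "%s: %d feeders, expected %d to %d",
      fn_name, length(feeders), sum(required), length(args)
    ))
  }
  # Order: a feeder named like an argument must sit in that argument's slot.
  common <- intersect(feeders, args)
  moved <- common[match(common, feeders) != match(common, args)]
  if (length(moved) > 0) {
    problems <- c(problems, sprintf(
      "%s: out of position: %s", fn_name, paste(moved, collapse = ", ")
    ))
  }
  problems
}

ok       <- check_derivation("Func::calculate_cigs_per_day",
                             "DerivedVar::[SMKDSTY_original, SMK_204, SMK_208]")
reversed <- check_derivation("Func::calculate_cigs_per_day",
                             "DerivedVar::[SMK_208, SMK_204, SMKDSTY_original]")
phantom  <- check_derivation("Func::calculate_SMKG040",
                             "DerivedVar::[SMKG203_cont, SMKG207_cont]")
```

Counting only arguments without defaults as required lets a two-feeder PUMF row pass against a function with an optional third argument such as `SMK_09C = NULL`, while a reversed feeder list or a `Func::` name with no matching function is reported.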