
fix(v3): regenerate NAMESPACE and repair smoking worksheet derivations#186

Open
DougManuel wants to merge 4 commits into v3 from fix/v3-smoking-worksheet-sync


@DougManuel

Contributor

Summary

Fixes the v3 installability failure (#184) and nine smoking worksheet defects (#185) found while integrating cchsflow v3 into the CSHM pipeline (cshm-dev). That pipeline harmonizes through rec_with_table() with the CSV worksheets, a path the test suite does not exercise because the tests call the calculate_*() functions directly.

Changes

Packaging (#184)

  • Regenerate NAMESPACE and man/ with roxygen2: removes ~40 stale exports (alcohol, BMI, ADL, worksheet-tooling functions removed in the v3 refactor) that made the package fail R CMD INSTALL; adds missing importFrom(dplyr, case_when).

Worksheets (#185), in inst/extdata/variable_details.csv (+ variables.csv where feeders are mirrored), with data/*.RData rebuilt per convention

  1. Phantom Func:: names → actual functions: calculate_SMKG040 → calculate_SMKG040_cont; calculate_SMKG203_continuous/_from_combined → calculate_SMKG203_cont; calculate_SMKG207_continuous/_from_combined → calculate_SMKG207_cont.
  2. SMKG040_cont feeder lists matched to signatures: PUMF 2001–2014 [SMKG203_cont, SMKG207_cont] (was the categorical _pre2005/_2005plus variants).
  3. SMKG203_cont/SMKG207_cont 2015+ feeders: [SMK_005, SMKG040_cont] / [SMK_030, SMKG040_cont] (drop the third feeder that landed in output_format; pass _cont not grouped codes). Master from_combined rows likewise.
  4. cigs_per_day feeder order: [SMKDSTY_original, SMK_204, SMK_208] — feeders pass positionally; the reversed order gated "status == 1" on cigarettes/day, yielding 0.4% non-missing instead of ~47%.
  5. pack_years_der: occasional-smoker slots get SMK_05B/SMK_05C (cigs per day smoked, days per month) instead of SMK_204/SMK_208, as in v2's pack_years_fun().
  6. time_quit_smoking_daily: split the single all-database block into PUMF ([SMKDSTY_cat5, SMK_09A_cont]) and Master (+ SMK_09C) blocks — SMK_09C is Master-only, so the shared block made the variable underivable on every PUMF cycle despite the function's SMK_09C = NULL default.
  7. SMKDSTY_cat5 2015+ block had the pre-2015 collapse rules: [2,3]→2, [4,5]→4, no 3→3, no 6→5. Post-2015 SMKDVSTY codes are 1 daily / 2 occasional / 3 former daily / 4 former occasional / 5 experimental / 6 never; the old map classified 28k former-daily smokers (2015-16) as occasional, sent never-smokers to NA(b), and made time_quit_smoking_daily derive for nobody in 2015+.
  8. SMKG040/SMKG040_cont: drop the false cchs2019_2020_p claim — raw SMKG040 exists only in 2015-16/2017-18 PUMF (DDI-confirmed).
  9. SMKG01C_cont: cycle-specific exception cchs2015_2016_p code 11 → NA(a) — the official 2015-16 PUMF pools valid skip/not stated into SMKG035 code 11 ("age 50+", 45,737 = 46.3% weighted; StatCan data dictionary p. 123, absent from the errata). The uniform 11 → 55 rule assigned age 55 at first cigarette to ~44k never-smokers. Forensics: Big-Life-Lab/cchsflow-data#3.

Not fixed here: SMKDSTY_cat3 has no PUMF value rows for codes 1–5 (#185 item 8) — needs new rows; flagging for a follow-up.
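
To make item 4 concrete, here is a minimal Python sketch of why feeder order matters when feeders are passed positionally. It is illustrative only: the function body, the status codes, the sample values, and the "wrong" order shown are all assumptions, not the package's R implementation or the exact order that was in the worksheet.

```python
import math


def cigs_per_day(status, daily_cigs, former_daily_cigs):
    # Placeholder logic and status codes (assumed), not the package's R function.
    if status == 1:
        return daily_cigs
    if status == 3:
        return former_daily_cigs
    return math.nan


def derive(func, row, feeders):
    # Feeders are looked up by name, then passed by position in list order.
    return func(*(row[name] for name in feeders))


# Hypothetical respondent: smoking status 1, 15 cigarettes per day.
row = {"SMKDSTY_original": 1, "SMK_204": 15, "SMK_208": math.nan}

right_order = derive(cigs_per_day, row, ["SMKDSTY_original", "SMK_204", "SMK_208"])
# With a misordered list the status test is applied to cigarettes/day instead.
wrong_order = derive(cigs_per_day, row, ["SMK_204", "SMK_208", "SMKDSTY_original"])
```

With the misordered list, the `status == 1` test only passes for respondents whose cigarettes/day value happens to equal 1, which is consistent with the very low non-missing rate reported in item 4.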

Validation

Harmonized 1% samples of CCHS 2001, 2015-16, and 2019-20 PUMF end-to-end through rec_with_table() (via the CSHM pipeline). All unified smoking variables (age_first_cigarette, age_start_smoking, time_quit_smoking, time_quit_smoking_daily, cigs_per_day, pack_years_der, smoked_100_lifetime, SMKDSTY_original) derive with plausible non-missing rates and category distributions matching the raw data. Remaining gaps are data realities: SMKDSTY_original for 2022 PUMF (SMK_05D Master-only) and age_start_smoking for 2019-20+ PUMF.

Suggested follow-up

A build-time check comparing each DerivedVar::[...] list (length and order) against formals() of the named Func::, and each feeder's database coverage against the derived variable's claims, would have caught defects 1–8 mechanically.
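
A rough sketch of what such a check could look like, written in Python for illustration, with `inspect.signature()` standing in for R's formals(). The helper names (`parse_feeders`, `check_row`), the registry, and the two stub functions are hypothetical. The order test also assumes that parameter names mirror feeder names, which may not hold for every function in the package.

```python
import inspect
import re


def parse_feeders(variable_start):
    """Extract feeder names from a 'DerivedVar::[a, b, c]' string."""
    match = re.fullmatch(r"DerivedVar::\[(.*)\]", variable_start.strip())
    if match is None:
        raise ValueError(f"not a DerivedVar spec: {variable_start!r}")
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def check_row(variable_start, rec_end, registry):
    """Return a list of problems found in one derived-variable row."""
    func_name = rec_end.strip().partition("Func::")[2]
    func = registry.get(func_name)
    if func is None:
        return [f"phantom function: {func_name}"]
    feeders = parse_feeders(variable_start)
    params = list(inspect.signature(func).parameters.values())
    required = [p for p in params if p.default is inspect.Parameter.empty]
    problems = []
    # Feeder count must cover the required parameters without exceeding the total.
    if not len(required) <= len(feeders) <= len(params):
        problems.append(
            f"{func_name}: {len(feeders)} feeders for "
            f"{len(required)} to {len(params)} parameters"
        )
    # Order is only flagged when the names match as a set but not in sequence.
    names = [p.name for p in params[: len(feeders)]]
    if sorted(names) == sorted(feeders) and names != feeders:
        problems.append(f"{func_name}: feeder order differs from parameter order")
    return problems


# Hypothetical stand-ins for the package's functions.
def calculate_SMKG040_cont(SMKG203_cont, SMKG207_cont):
    return None


def quit_daily(SMKDSTY_cat5, SMK_09A_cont, SMK_09C=None):
    return None


registry = {f.__name__: f for f in (calculate_SMKG040_cont, quit_daily)}
```

For example, `check_row("DerivedVar::[SMKG203_cont, SMKG207_cont]", "Func::calculate_SMKG040", registry)` reports a phantom function, while the same feeders against `Func::calculate_SMKG040_cont` return no problems. The database-coverage half of the proposal is not covered by this sketch.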

Fixes #184. Fixes #185.

Regenerate NAMESPACE and man/ with roxygen2: removes ~40 stale exports
left from the v3 refactor (alcohol, BMI, ADL, worksheet tooling) that
made the package fail R CMD INSTALL, and adds the missing
importFrom(dplyr, case_when).

Repair nine defects in the smoking rows of variable_details.csv (and
variables.csv where feeder lists are mirrored):

- Rename phantom Func:: references (calculate_SMKG040,
  calculate_SMKG203/207_continuous and _from_combined) to the actual
  calculate_*_cont functions.
- Match DerivedVar feeder lists to function signatures: SMKG040_cont
  from [SMKG203_cont, SMKG207_cont]; SMKG203_cont/SMKG207_cont 2015+
  from [SMK_005|SMK_030, SMKG040_cont]; cigs_per_day feeder order
  [SMKDSTY_original, SMK_204, SMK_208] (feeders pass positionally);
  pack_years_der occasional slots take SMK_05B/SMK_05C as in v2.
- Split time_quit_smoking_daily into PUMF and Master blocks so the
  Master-only SMK_09C feeder no longer blocks PUMF derivation.
- Correct the SMKDSTY_cat5 2015+ map to the post-2015 SMKDVSTY codes
  (1 daily, 2 occasional, 3 former daily, 4-5 former occ/experimental,
  6 never); the pre-2015 collapse rules had been applied, classifying
  former-daily smokers as occasional and never-smokers as NA(b).
- Drop the false cchs2019_2020_p claim from SMKG040/SMKG040_cont (raw
  SMKG040 exists only in 2015-16 and 2017-18 PUMF).
- Add a cchs2015_2016_p exception mapping SMKG035 code 11 to NA(a):
  the official 2015-16 PUMF pools valid skip and not stated into
  code 11 (StatCan data dictionary p. 123; not in the errata).
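
For reference, the SMKDSTY_cat5 correction above can be written out as two lookup tables. This is a Python sketch of my reading of the description, not the worksheet rows themselves: the 1 → 1 entry in the old map is assumed, and unmapped codes are taken to fall through to NA(b).

```python
NA_B = "NA(b)"

# Pre-2015 collapse rules as they had been applied to 2015+ data.
# The 1 -> 1 entry is an assumption; code 6 had no row at all.
OLD_MAP = {1: 1, 2: 2, 3: 2, 4: 4, 5: 4}

# Corrected map for post-2015 SMKDVSTY codes:
# 1 daily, 2 occasional, 3 former daily, 4-5 former occ/experimental, 6 never.
NEW_MAP = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 5}


def recode(code, mapping):
    # Codes without a rule fall through to NA(b) in this sketch.
    return mapping.get(code, NA_B)
```

Under the old map, a former daily smoker (code 3) lands in category 2 and a never-smoker (code 6) becomes NA(b). Under the corrected map they land in categories 3 and 5.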

Rebuild data/variable_details.RData and data/variables.RData from the
corrected CSVs.

Validated by harmonizing 1% samples of CCHS 2001, 2015-16, and
2019-20 PUMF end-to-end through rec_with_table(): all unified smoking
variables derive with plausible non-missing rates and category
distributions matching the raw data.

Fixes #184. Fixes #185.
DougManuel requested review from Copilot and rafdoodle on June 10, 2026, 14:44
DougManuel added a commit to Big-Life-Lab/cshm-dev that referenced this pull request Jun 10, 2026
…ixes

Pull in three further cchsflow v3 worksheet repairs (now in upstream
PR Big-Life-Lab/cchsflow#186):

- SMKDSTY_cat5 2015+ recoded to the post-2015 SMKDVSTY codes; former
  daily smokers are no longer classified as occasional, never-smokers
  no longer NA(b), and time_quit_smoking_daily now derives for 2015+
  cycles (25% non-missing in 2015-16, matching the former-daily share).
- SMKG040/SMKG040_cont no longer claim cchs2019_2020_p (raw SMKG040
  absent from the 2019-20 PUMF); matching claim dropped from
  cshm-variables.csv.
- SMKG01C_cont maps cchs2015_2016_p code 11 to NA(a): StatCan shipped
  the 2015-16 PUMF with valid skip and not stated pooled into SMKG035
  code 11, which the uniform midpoint rule turned into age 55 at first
  cigarette for ~44k never-smokers. age_first_cigarette for 2015-16
  returns to 58.8% non-missing (median 16) from a corrupt 100%.
  Forensics: Big-Life-Lab/cchsflow-data#3.
DougManuel added a commit to Big-Life-Lab/cshm-dev that referenced this pull request Jun 10, 2026
StatCan shipped the 2015-16 PUMF with valid skip and not stated pooled
into SMKG035 code 11 ("age 50 or older"; 46.3% weighted per their own
data dictionary, absent from the errata). The uniform midpoint rule
turned this into age 55 at first cigarette for ~44k never-smokers.
The snapshot now maps cchs2015_2016_p code 11 to NA(a):
age_first_cigarette for 2015-16 returns to 58.8% non-missing
(median 16) from a corrupt 100%.

Forensics: Big-Life-Lab/cchsflow-data#3. Upstream fix in
Big-Life-Lab/cchsflow#186 alongside the cat5/SMKG040 repairs already
captured in the previous snapshot commit.

Copilot AI left a comment


Pull request overview

This PR fixes cchsflow v3 installability by regenerating roxygen outputs (notably NAMESPACE) and repairs worksheet-driven smoking derivations so rec_with_table() can successfully resolve Func:: names and apply correct feeder ordering/signatures when harmonizing via the CSV worksheets.

Changes:

  • Regenerated roxygen outputs (NAMESPACE, man/*) to remove stale exports and restore missing imports (e.g., dplyr::case_when), addressing the v3 R CMD INSTALL failure.
  • Updated smoking-related worksheet metadata (e.g., variable_details.csv / variables.csv) to align Func::... references and DerivedVar::[...] feeders (including positional feeder order) with the current R implementations used by rec_with_table().
  • Added/removed Rd documentation to match the refactored/renamed public APIs (e.g., ADL scoring docs added; older respiratory-condition docs removed).

Reviewed changes

Copilot reviewed 74 out of 86 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| NAMESPACE | Regenerated exports/imports to remove stale exports and fix install/load failures. |
| inst/extdata/variable_details.csv | Smoking worksheet derivation metadata updated to match actual function names/signatures and correct feeder lists/order for rec_with_table(). |
| inst/extdata/variables.csv | Mirrors worksheet-facing variable metadata updates (notably for smoking-derived variables like cigs_per_day). |
| man/set_data_labels.Rd | Updated documentation content for labeling utility (but currently contains a parameter doc mismatch). |
| man/score_adl.Rd | New/updated ADL scoring documentation generated by roxygen. |
| man/score_adl_6.Rd | New/updated 6-item ADL scoring documentation generated by roxygen. |
| man/resp_condition_fun2.Rd | Removed obsolete documentation for deprecated respiratory-condition variant. |
| man/resp_condition_fun3.Rd | Removed obsolete documentation for deprecated respiratory-condition variant. |
Files not reviewed (9)
  • man/CCC_091_fun1.Rd: Language not supported
  • man/CCC_091_fun2.Rd: Language not supported
  • man/COPD_Emph_der_fun1.Rd: Language not supported
  • man/COPD_Emph_der_fun2.Rd: Language not supported
  • man/EDUDR04_fun.Rd: Language not supported
  • man/PACK_YEARS_CONSTANTS.Rd: Language not supported
  • man/active_transport3_fun.Rd: Language not supported
  • man/adjust_bmi.Rd: Language not supported
  • man/assess_adl.Rd: Language not supported


Comment thread man/set_data_labels.Rd
Comment on lines +10 to +16
\item{variable_details}{A dataframe containing the details of each variable,
with category labels in the 'catLabel' column.}

\item{variable_details}{variable_details.csv}
\item{variables_sheet}{(Optional) A dataframe containing variable labels in
the 'label' column.}

\item{variables_sheet}{variables.csv}
\item{data_to_lab.el}{A dataframe of CCHS data that lacks labels.}
Fixes runtime failures in assess_adl, score_adl, score_adl_6,
calculate_binge_drinking, calculate_drinking_risk_short and
calculate_drinking_risk_long caused by the never-merged
missing-data-helpers.R dependency (silent tryCatch/source masked it).
Removes the abandoned splice idiom and library()/source() headers.

Validation bounds now come from variable_details.csv via
clean_variables(): adds categorical valid sets from typeEnd cat recEnd
codes, vectorized else logic, CCHS label-string coercion to tagged NAs,
scalar recycling with explicit length validation, and a once-per-session
database-config warning.

Tests: test-adl 98 passing, test-alcohol 51 passing (previously 24 and
12 errors). Full suite at pre-existing baseline.

Fixes #132. Relates to #184, #185.
…atabases

recode_call() received the full database_name vector instead of the loop
variable, so grepl() matched only the first database's worksheet rows and
every dataframe in the list was silently recoded with the first
database's rules. Passes the loop variable and adds a per-database
regression test.
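
The failure mode in this commit can be reproduced in a few lines. The Python sketch below is an illustration under stated assumptions: `rows_for` is a hypothetical helper that mimics R's grepl() using only the first element when handed several patterns, and the worksheet rows and rule names are invented.

```python
def rows_for(pattern, worksheet):
    """Select worksheet rows for a database, mimicking grepl()'s first-pattern rule."""
    if isinstance(pattern, (list, tuple)):
        # Only the first pattern is used when several are supplied.
        pattern = pattern[0]
    return [row for row in worksheet if pattern in row["databaseStart"]]


# Invented worksheet rows, one recoding rule per database.
worksheet = [
    {"databaseStart": "cchs2001_p", "rule": "rule_2001"},
    {"databaseStart": "cchs2015_2016_p", "rule": "rule_2015"},
]
database_names = ["cchs2001_p", "cchs2015_2016_p"]

# Bug: the whole vector is passed on every iteration.
buggy = {name: rows_for(database_names, worksheet) for name in database_names}
# Fix: pass the loop variable.
fixed = {name: rows_for(name, worksheet) for name in database_names}
```

In the buggy version every database, including cchs2015_2016_p, is matched to the cchs2001_p rows. A per-database assertion like the ones in the regression test would catch this.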