forked from MethodsForReproducibleHealthResearch/Assignment2
Add analysis #1
Open
sylvieddl wants to merge 7 commits into main from analysis
8911f7d testing (sylvieddl)
c201368 Analysis and figure (sylvieddl)
7d0236a Update README.md (sylvieddl)
63441e2 Update README.md (sylvieddl)
13a37bb Update README.md (sylvieddl)
009e42d Update README.md (sylvieddl)
261331c Add files via upload (sylvieddl)
Binary file not shown.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| .Rproj.user | ||
| .Rhistory | ||
| .RData | ||
| .Ruserdata | ||
| .positai |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| Version: 1.0 | ||
|
|
||
| RestoreWorkspace: Default | ||
| SaveWorkspace: Default | ||
| AlwaysSaveHistory: Default | ||
|
|
||
| EnableCodeIndexing: Yes | ||
| UseSpacesForTab: Yes | ||
| NumSpacesForTab: 2 | ||
| Encoding: UTF-8 | ||
|
|
||
| RnwWeave: Sweave | ||
| LaTeX: pdfLaTeX |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,14 @@ | ||
| # Assignment #2 Repository | ||
|
|
||
| This repository includes the simulated data for Assignment #2. Fork this repository and add your analysis as described in the canvas assignment. | ||
|
|
||
| The csv file for `cohort` in the `raw-data` folder includes 5,000 observations with variables `smoke`, `female`, `age`, `cardiac`, and `cost`. | ||
|
|
||
| ## Summary of findings: | ||
| I modeled the association between the cost and cardiac variables using a logistic | ||
| regression and adjusted for age, sex, and smoking status. Each 1-unit increase in cost was | ||
| associated with an odds ratio of about 1.01 for cardiac. While cost appears reasonably | ||
| normally distributed, those with cardiac=1 have much higher cost on average. | ||
|
|
||
| ## AI statement: | ||
| I did not use any generative AI technology to complete this assignment. | ||
|
|
||
|  |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| --- | ||
| title: "EPI 203 Assignment 4: Reproducible Report" | ||
| author: "Sylvie Dobrota Lai" | ||
| date: today | ||
| format: pdf | ||
| editor: visual | ||
| tbl-cap-location: top | ||
| --- | ||
|
|
||
| ```{r} | ||
| #| label: load-packages | ||
| #| include: false | ||
| library(broom.helpers) | ||
| library(tidyverse) | ||
| library(tableone) | ||
| library(gtsummary) | ||
| ``` | ||
|
|
||
| ## Introduction | ||
|
|
||
| Cardiac events, such as stroke and heart attack, are common in the United States. Common risk factors for cardiac events include age, smoking history, hypertension, and diabetes. Healthcare costs related to cardiovascular risk factors and cardiac events are expected to reach \$1344 billion in 2050. | ||
|
|
||
| Using the synthetic dataset provided in EPI 203, I tested my hypothesis that higher cost is associated with increased odds of a cardiac event. | ||
|
|
||
| ## Methods | ||
|
|
||
| I used a dataset of 5,000 observations of adults aged 18 and older. The dataset had information on cost, cardiac event, age, smoking status, and biological sex. | ||
|
|
||
| The outcome of interest was cardiac event, modeled as a binary variable (yes/no). The exposure of interest was cost, which was a continuous variable. For describing the distribution of the variables, I used median (IQR) for continuous variables and numbers (percentages) for categorical variables. | ||
|
|
||
| As continuous variables were not normally distributed, I tested for differences between groups using the Wilcoxon rank sum test. Pearson's chi-square tests were used for categorical variables. I modeled the association between the cost and cardiac event variables using logistic regression, adjusting for three confounders: age, sex, and smoking status. | ||
|
|
||
| $$ \text{logit}\left(P(\text{Cardiac}_i = 1)\right) = \beta_0 + \beta_1 \text{Cost}_i + \beta_2 \text{Age}_i + \beta_3 \text{Female}_i + \beta_4 \text{Smoke}_i $$ | ||
|
|
||
| I considered a p-value less than or equal to 0.05 statistically significant. All analyses were conducted in R version 4.5.2 using the tidyverse and gtsummary packages. | ||
|
|
||
| ## Results | ||
|
|
||
| There were 5,000 participants in the dataset, with 4,725 having no cardiac event and 275 having a cardiac event. The median cost was 9,376 USD. The median cost for those without a cardiac event was 9,350 (IQR 9,072 to 9,622) and the median cost for those with a cardiac event was 10,230 (IQR 9,910 to 10,506). This difference was statistically significant (p\<0.001). | ||
|
|
||
| Participants who experienced a cardiac event were significantly more likely to be female and to smoke. Among those who had a cardiac event, 48% were smokers, compared with only 11% of those who did not. Age did not appear to differ between the two groups (Table 1). | ||
|
|
||
| ```{r} | ||
| #| echo: false | ||
| #| warning: false | ||
| #Load data | ||
| hw2_df <- read_csv("raw-data/cohort.csv") | ||
|
|
||
| #Prepare dataset | ||
| clean_df <- hw2_df %>% | ||
| mutate( | ||
| cardiac=factor(cardiac, levels= c(0,1), labels =c("No", "Yes")), | ||
| smoke=factor(smoke,levels= c(0,1), labels =c("No", "Yes")), | ||
| female=factor(female, levels= c(0,1), labels =c("No", "Yes")), | ||
| ) | ||
|
|
||
| #table 1 code | ||
| clean_df %>% | ||
| tbl_summary(by=cardiac, include = c(age, female, smoke)) %>% | ||
| add_p() %>% | ||
| modify_caption("**Table 1. Participant characteristics by cardiac event**") | ||
| ``` | ||
|
|
||
| While cost appears reasonably normally distributed, those with a cardiac event have much higher cost on average (Figure 1). | ||
|
|
||
| ```{r} | ||
| #| echo: false | ||
| #| warning: false | ||
| #|tbl-cap: Table 2 | ||
|
|
||
| # Distribution of cost by cardiac status | ||
| p <- clean_df %>% | ||
| ggplot( aes(x=cost, fill=cardiac)) + | ||
| geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') + | ||
| scale_fill_manual(values=c("#69b3a2", "#404080")) + | ||
| theme_classic() + | ||
| labs(fill="cardiac", | ||
| title="Figure 1. Distribution of cost by cardiac") | ||
|
|
||
| p | ||
|
|
||
| clean_df %>% | ||
| tbl_summary(by=cardiac, include = c(cost)) %>% | ||
| add_p() %>% | ||
| modify_caption("**Table 2. Cost by cardiac event**") | ||
| ``` | ||
|
|
||
| Older participants had higher cost (Figure 2). | ||
|
|
||
| ```{r} | ||
| #| echo: false | ||
| #| warning: false | ||
| #scatterplot of cost vs age | ||
|
|
||
| fig2<- clean_df %>% | ||
| ggplot(aes(x=cost, y=age)) + | ||
| geom_point()+ | ||
| labs(title="Figure 2. Cost and age") | ||
|
|
||
| fig2 | ||
|
|
||
| ``` | ||
|
|
||
| In the fully adjusted model, each 1-unit increase in cost was associated with approximately 1% higher odds of a cardiac event (odds ratio 95% CI: 1.009, 1.011; p\<0.001). All estimates are presented in the regression table below. | ||
|
|
||
| ```{r} | ||
| #| echo: false | ||
| #| warning: false | ||
|
|
||
| model <- glm(cardiac~ cost + age + female + smoke, | ||
| data=clean_df, | ||
| family="binomial") | ||
|
|
||
| tbl_regression(model, exponentiate=TRUE) | ||
| ``` | ||
|
|
||
| ## Discussion | ||
|
|
||
| I found that higher cost was associated with higher odds of a cardiac event. This is similar to findings in prior studies. | ||
|
|
||
| My analysis had several limitations. I was unable to adjust for several well-established risk factors, such as hypertension, as they were not available in the dataset. Additionally, as this was a cross-sectional analysis, I am unable to establish causality or the direction of the relationship; in other words, whether high costs lead to cardiac events or cardiac events lead to high costs. Further studies using larger, longitudinal datasets are needed to answer these important questions. | ||
|
|
||
| ## AI Statement | ||
|
|
||
| I did not use any generative AI technology to complete any portion of this work. |
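
The Methods section of the report names a Wilcoxon rank sum test for continuous variables and Pearson's chi-square test for categorical variables. A minimal base R sketch of those two tests follows, using simulated stand-in data rather than `raw-data/cohort.csv`. The data frame `sim_df`, the seed, and all simulation parameters are assumptions made for illustration, and `correct = FALSE` is a choice made here to get the uncorrected Pearson statistic, not a setting confirmed by the report.

```r
set.seed(203)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in for the cohort; values are illustrative, not the assignment data
n <- 500
sim_df <- data.frame(
  cardiac = factor(rbinom(n, 1, 0.2), levels = c(0, 1), labels = c("No", "Yes")),
  smoke   = factor(rbinom(n, 1, 0.3), levels = c(0, 1), labels = c("No", "Yes"))
)

# Right-skewed cost, shifted upward for those with a cardiac event
sim_df$cost <- rlnorm(n, meanlog = 9, sdlog = 0.3) + 500 * (sim_df$cardiac == "Yes")

# Continuous variable: Wilcoxon rank sum test across the two cardiac groups
cost_test <- wilcox.test(cost ~ cardiac, data = sim_df)

# Categorical variable: Pearson's chi-square test on the 2x2 table,
# without the continuity correction
smoke_table <- table(sim_df$smoke, sim_df$cardiac)
smoke_test  <- chisq.test(smoke_table, correct = FALSE)

cost_test$p.value
smoke_test$p.value
```

Running the tests directly like this can serve as a cross-check on the p-values that `add_p()` prints in the summary tables.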
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| # Code for Assignment 2 - EPI 203 | ||
| # Author: Sylvie Dobrota Lai | ||
| # Date: Apr 27, 2026 | ||
|
|
||
| library(tidyverse) | ||
| library(tableone) | ||
|
|
||
| #Load data | ||
| hw2_df <- read_csv("raw-data/cohort.csv") | ||
| head(hw2_df) | ||
|
|
||
| #Prepare variables | ||
| clean_df <- hw2_df %>% | ||
| mutate( | ||
| smoke=as.factor(smoke), | ||
| female=as.factor(female), | ||
| cardiac=as.factor(cardiac) | ||
| ) | ||
|
|
||
| #Table 1 | ||
| myVars<-c("smoke", "female", "age", "cardiac", "cost") | ||
| catVars<-c("smoke", "female", "cardiac") | ||
|
|
||
| table1<- CreateTableOne(vars=myVars, data=hw2_df, factorVars = catVars) | ||
|
|
||
| print(table1, showAllLevels = TRUE) | ||
|
|
||
| # Logistic model, cardiac as outcome | ||
| model <- glm(cardiac~ cost + age + female + smoke, | ||
| data=clean_df, | ||
| family="binomial") | ||
|
|
||
| summary(model) | ||
| exp(coef(model)) | ||
|
|
||
| # Distribution of cost by cardiac status | ||
| p <- clean_df %>% | ||
| ggplot( aes(x=cost, fill=cardiac)) + | ||
| geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') + | ||
| scale_fill_manual(values=c("#69b3a2", "#404080")) + | ||
| theme_classic() + | ||
| labs(fill="cardiac", | ||
| title="Distribution of cost by cardiac") | ||
|
|
||
| p | ||
|
|
||
| ggsave("figures/Figure1.png", plot=p) | ||
|
|
||
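
The script prints `exp(coef(model))`, which gives odds ratios without any measure of uncertainty. A minimal sketch of adding Wald 95% confidence intervals with base R follows, fitted on simulated stand-in data. The data frame `sim_df`, the seed, and the coefficients used to simulate the outcome are all assumptions for illustration, not values from the cohort.

```r
set.seed(2026)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in data with the same variable names as the cohort (hypothetical values)
n <- 1000
sim_df <- data.frame(
  age    = round(runif(n, 18, 90)),
  female = rbinom(n, 1, 0.5),
  smoke  = rbinom(n, 1, 0.2)
)
sim_df$cost <- 9000 + 5 * sim_df$age + rnorm(n, sd = 300)

# Log-odds chosen arbitrarily for illustration
lp <- -3 + 0.002 * (sim_df$cost - 9000) + 1.2 * sim_df$smoke
sim_df$cardiac <- rbinom(n, 1, plogis(lp))

# Same model formula as in the script, fitted to the simulated data
sim_model <- glm(cardiac ~ cost + age + female + smoke,
                 data = sim_df, family = binomial)

# Odds ratios with Wald 95% confidence intervals on the odds-ratio scale
or_table <- exp(cbind(OR = coef(sim_model), confint.default(sim_model)))
round(or_table, 3)
```

`confint.default()` gives Wald intervals based on the standard errors. Other tools may report profile-likelihood intervals instead, so small differences from a formatted regression table would not be surprising.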
Great comments!