This repository contains R scripts for preparing, quality-checking, harmonizing, and modeling DAACS reading assessment data across institutions and waves. The workflow covers four main stages: preprocessing and QA, LLM-assisted item-harmonization audit, descriptive and item-selection summaries, and IRT modeling. The reading pipeline uses a shared utility script with reusable helper functions for IDs, recoding, QA, missingness diagnostics, and item-level sample-size summaries.
Core data-preparation and QA pipeline.
Scripts in this folder:
- `utils_read_pipeline.R`: Shared helper functions used across preprocessing scripts. Includes ID normalization, `global_id` creation, item-column helpers, recodes, missingness diagnostics, QA functions, and item-level sample-size summaries.
- `read_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R`: Cleans and standardizes the already-wide 2022 UMGC1 + UA2 reading dataset, renames legacy item IDs to final QIDs, applies the reading speed filter, creates clean institution-specific and combined wide datasets, and runs QA summaries. Output level: one row per student (`global_id`).
- `read_v2_ua23umgc-2022-23_qa_pipeline.R`: Builds clean long and wide reading datasets for UA23, UMGC, and their combined 2022–2023 file from raw institution-level, item-level, and assessment-template files. Main tasks include QID harmonization, duplicate same-completion row removal, retained-response filtering, student timing derivation, and final wide-file QA. Output levels: long files are one row per student-item response; wide files are one row per student.
- `read_v2_umgc_ua_22_23_01_combined_qa_stack_and_missingness.R`: Stacks the cleaned 2022 and 2022–2023 wide files, harmonizes columns and types, and runs missingness diagnostics on the raw combined file.
- `read_v2_umgc_ua_22_23_02_combined_qa_duplicate_diagnostics.R`: Identifies duplicate students and duplicate score patterns in the combined stacked reading dataset, summarizes likely duplicate pairs, and prepares auditable artifacts for patch-then-remove deduplication. In matched UMGC/UMGC1 duplicate pairs, the UMGC row is retained as the surviving row, and selected missing values may be patched from the matched UMGC1 row.
- `read_v2_umgc_ua_22_23_03_combined_qa_finalize_and_describe.R`: Applies duplicate removals, replaces selected UMGC rows with patched versions, enforces one row per `global_id`, runs final missingness diagnostics, produces descriptive summaries, generates item-level sample-size tables, and saves the final cleaned combined dataset.
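The patch-then-remove rule for matched UMGC/UMGC1 pairs can be illustrated with a minimal base R sketch. The toy data, the `source` column, the `patch_row()` helper, and the choice of patched columns are all assumptions made for illustration; the actual scripts may organize this step differently.

```r
# Toy matched pair; column names and values are illustrative assumptions.
pair <- data.frame(
  global_id = c("umgc_0001", "umgc1_0001"),
  source    = c("UMGC", "UMGC1"),
  pell      = c(NA, "Yes"),
  readTime  = c(1250, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NA cells of the surviving row from the donor row.
# Values already present in the surviving row are never overwritten.
patch_row <- function(survivor, donor, cols) {
  stopifnot(nrow(survivor) == 1, nrow(donor) == 1)
  for (col in cols) {
    if (is.na(survivor[[col]])) survivor[[col]] <- donor[[col]]
  }
  survivor
}

patched <- patch_row(
  survivor = pair[pair$source == "UMGC", ],
  donor    = pair[pair$source == "UMGC1", ],
  cols     = c("pell", "readTime")  # stand-in for the "selected values"
)

# The UMGC1 row is dropped; only the patched UMGC row survives,
# so the one-row-per-global_id rule holds.
stopifnot(!anyDuplicated(patched$global_id))
```

Here the surviving UMGC row keeps its own `readTime` and takes `pell` from the UMGC1 row, because only the latter was missing.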
LLM-assisted, human-reviewed item harmonization audit.
read_v2_umgc_ua_22_23_02_combined_llm_harmonization.R
Builds an audit workflow for uncertain item-identity matches after exact matching, fuzzy matching, and manual repair. The script creates candidate pools for review, generates prompt-ready audit tables, validates completed human decisions, and exports approved patch tables for later application in the preprocessing pipeline. This module is an auditable support layer for item-identity harmonization, not a replacement for human review.
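As a rough illustration of the "validate completed human decisions, then export approved patches" step, the sketch below checks a toy review table in base R. The column names (`candidate_id`, `proposed_qid`, `human_decision`, `reviewer`), the allowed decision values, and the `validate_decisions()` helper are hypothetical and are not taken from the script.

```r
# Hypothetical completed review table; all names and values are assumptions.
decisions <- data.frame(
  candidate_id   = c("c01", "c02", "c03"),
  proposed_qid   = c("Q001s1", "Q014id3", "Q017p3"),
  human_decision = c("approve", "reject", "approve"),
  reviewer       = c("rev_a", "rev_a", "rev_b"),
  stringsAsFactors = FALSE
)

# Hypothetical validator: returns a character vector of problems (empty if clean).
validate_decisions <- function(tbl, allowed = c("approve", "reject")) {
  problems <- character(0)
  if (anyDuplicated(tbl$candidate_id) > 0) {
    problems <- c(problems, "duplicate candidate_id")
  }
  if (any(is.na(tbl$human_decision) | !(tbl$human_decision %in% allowed))) {
    problems <- c(problems, "missing or unrecognized decision")
  }
  if (any(is.na(tbl$reviewer) | tbl$reviewer == "")) {
    problems <- c(problems, "decision without a named reviewer")
  }
  problems
}

# Only export patches once every decision passes validation.
stopifnot(length(validate_decisions(decisions)) == 0)
approved_patches <- decisions[decisions$human_decision == "approve",
                              c("candidate_id", "proposed_qid")]
```

Keeping validation separate from export means a patch table is produced only from decisions that a named reviewer has explicitly approved.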
Downstream descriptive and item-selection summaries.
- `read_v2_ua_22_23_17-18_describe.R`: Creates a traditional-college-age UAlbany subset (ages 17–18), runs missingness diagnostics, produces categorical and numeric descriptives, creates item-level response-count summaries, and exports summary tables for planned item selection.
- `read_v2_ua23umgc-2022-23_summary_table_for_item_selection.R`: Builds six domain-level summary tables for item selection based on response-count thresholds for IRT, multivariable DIF, and factor-level analyses. This is a downstream reporting script, not part of the core cleaning pipeline.
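A minimal sketch of an item-level response-count summary with threshold flags, in base R. The toy data and the threshold values are placeholders chosen for illustration; the thresholds used by the scripts are not stated here.

```r
# Toy wide file: one row per student, one column per item (NA = not administered).
wide <- data.frame(
  global_id = c("s1", "s2", "s3", "s4"),
  Q001s1    = c(1, 0, NA, 1),
  Q014id3   = c(NA, NA, 1, 0),
  Q017p3    = c(NA, NA, NA, 1)
)

# Item columns are identified by the QID prefix (Q + three digits).
item_cols <- grep("^Q[0-9]{3}", names(wide), value = TRUE)

# Count non-missing responses per item.
n_resp <- colSums(!is.na(wide[item_cols]))

summary_tbl <- data.frame(
  qid         = item_cols,
  n_responses = as.integer(n_resp),
  stringsAsFactors = FALSE
)

# Placeholder thresholds (assumptions, not the project's actual cutoffs).
thresholds <- c(irt = 2, dif = 3)
for (nm in names(thresholds)) {
  summary_tbl[[paste0("ok_", nm)]] <- summary_tbl$n_responses >= thresholds[[nm]]
}
```

Each `ok_*` column flags whether an item has enough responses for the corresponding analysis under the assumed cutoff.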
Final analytic file creation and IRT modeling.
- `read_v2_umgc_ua_22_23_final_dataset.R`: Creates the final frozen analytic dataset used for modeling. This step removes the 89 UMGC1 faculty/advisor trial administrations and recodes college to `ua` versus `umgc`, so the final analytic sample contains student records only. It also exports item-count summaries by item, domain, subgroup, and testlet.
- `read_v2_umgc_ua_22_23_model_comparison.R`: Runs the main full-bank IRT model comparison on the final frozen dataset: Rasch/1PL, unidimensional 2PL, and 2PL bifactor with one general factor plus passage/testlet-specific factors. The script exports model-fit summaries, theta correlations, difference summaries, and diagnostic plots.
- `read_v2_umgc_ua_22_23_model_180trimming.R`: Uses the full-bank 2PL model to create item diagnostics, flag potential trimming candidates, build a reviewed trimmed item pool, refit the unidimensional 2PL on the trimmed bank, and compare full-versus-trimmed theta estimates.
- `read_v2_umgc_ua_22_23_trimmed_model_comparison.R`: Repeats the model comparison on the trimmed item bank using trimmed Rasch, trimmed unidimensional 2PL, and trimmed 2PL bifactor models. The script exports fit summaries, trimmed-bank theta correlations, difference summaries, and comparison plots.
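For orientation, the three model families compared above are commonly written as follows. The notation is an assumption introduced here (student $i$, item $j$, testlet $t(j)$ containing item $j$), and the scripts may use a different but equivalent parameterization, such as slope-intercept form throughout.

```latex
\begin{align}
\text{Rasch/1PL:}\quad
  & P(X_{ij}=1 \mid \theta_i)
    = \frac{1}{1+\exp\{-(\theta_i - b_j)\}} \\
\text{2PL:}\quad
  & P(X_{ij}=1 \mid \theta_i)
    = \frac{1}{1+\exp\{-a_j(\theta_i - b_j)\}} \\
\text{2PL bifactor:}\quad
  & P(X_{ij}=1 \mid \theta_i^{G}, \theta_{i,t(j)})
    = \frac{1}{1+\exp\{-(a_j^{G}\theta_i^{G} + a_j^{S}\theta_{i,t(j)} + d_j)\}}
\end{align}
```

The Rasch model is the 2PL with all discriminations $a_j$ held equal, and the unidimensional 2PL is the bifactor model with every testlet-specific slope $a_j^{S}$ fixed at zero (taking $d_j = -a_j b_j$), which is what makes the three models a natural comparison sequence.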
Future domain-sensitivity and DIF scripts can also be added to this folder.
The repository is organized as a staged workflow:
- Clean and QA the 2022 already-wide reading data (`01_preprocessing`)
- Clean and QA the raw 2022–2023 reading data (`01_preprocessing`)
- Stack cross-wave wide files and diagnose duplicates (`01_preprocessing`)
- Finalize the combined cleaned dataset (`01_preprocessing`)
- Run optional LLM-assisted harmonization audit for uncertain item matches (`02_harmonization_audit`)
- Produce descriptive and item-selection summaries (`03_descriptives_and_item_selection`)
- Create the final frozen analytic dataset and run IRT modeling workflows (`04_modeling`)
Run these scripts in order:
- `01_preprocessing/read_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R`
- `01_preprocessing/read_v2_ua23umgc-2022-23_qa_pipeline.R`
- `01_preprocessing/read_v2_umgc_ua_22_23_01_combined_qa_stack_and_missingness.R`
- `01_preprocessing/read_v2_umgc_ua_22_23_02_combined_qa_duplicate_diagnostics.R`
- `01_preprocessing/read_v2_umgc_ua_22_23_03_combined_qa_finalize_and_describe.R`
- `02_harmonization_audit/read_v2_umgc_ua_22_23_02_combined_llm_harmonization.R`
- `03_descriptives_and_item_selection/read_v2_ua_22_23_17-18_describe.R`
- `03_descriptives_and_item_selection/read_v2_ua23umgc-2022-23_summary_table_for_item_selection.R`
- `04_modeling/read_v2_umgc_ua_22_23_final_dataset.R`
- `04_modeling/read_v2_umgc_ua_22_23_model_comparison.R`
- `04_modeling/read_v2_umgc_ua_22_23_model_180trimming.R`
- `04_modeling/read_v2_umgc_ua_22_23_trimmed_model_comparison.R`
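If it helps, the run order above can be captured in a small driver. This is a sketch only: the `run_pipeline()` helper is hypothetical, and it assumes the scripts can be sourced non-interactively from the repository root, which the scripts themselves may or may not support.

```r
# Script names grouped by stage folder, in the documented run order.
stages <- list(
  "01_preprocessing" = c(
    "read_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R",
    "read_v2_ua23umgc-2022-23_qa_pipeline.R",
    "read_v2_umgc_ua_22_23_01_combined_qa_stack_and_missingness.R",
    "read_v2_umgc_ua_22_23_02_combined_qa_duplicate_diagnostics.R",
    "read_v2_umgc_ua_22_23_03_combined_qa_finalize_and_describe.R"
  ),
  "02_harmonization_audit" = c(
    "read_v2_umgc_ua_22_23_02_combined_llm_harmonization.R"
  ),
  "03_descriptives_and_item_selection" = c(
    "read_v2_ua_22_23_17-18_describe.R",
    "read_v2_ua23umgc-2022-23_summary_table_for_item_selection.R"
  ),
  "04_modeling" = c(
    "read_v2_umgc_ua_22_23_final_dataset.R",
    "read_v2_umgc_ua_22_23_model_comparison.R",
    "read_v2_umgc_ua_22_23_model_180trimming.R",
    "read_v2_umgc_ua_22_23_trimmed_model_comparison.R"
  )
)

# Flatten to an ordered vector of relative paths.
script_paths <- unlist(
  lapply(names(stages), function(d) file.path(d, stages[[d]])),
  use.names = FALSE
)

# Hypothetical driver: stop at the first missing script, run each in its own environment.
run_pipeline <- function(paths) {
  for (p in paths) {
    if (!file.exists(p)) stop("Missing script: ", p)
    message("Running ", p)
    source(p, local = new.env())
  }
  invisible(paths)
}

# run_pipeline(script_paths)  # uncomment to run from the repository root
```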
Across scripts, the main conventions are:
- `global_id`: unique student identifier used in final outputs
- `age_d24`: `TCAUS` if age < 24; `AUS` if age >= 24
- `ethnicity`: recoded to White / Asian / Black / Hispanic / Other
- `pell`: recoded to No / Yes
- `military`: recoded to No / Yes
- `transfer`: continuous transferred credits
- `readTime`: seconds
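A minimal base R sketch of two of these recodes. The input column names (`age`, `pell_raw`) and the raw 0/1 coding for Pell status are assumptions for illustration; the shared utility script may take different inputs.

```r
# Toy input; `age` and `pell_raw` are hypothetical raw column names.
students <- data.frame(
  global_id = c("ua_001", "umgc_002", "umgc_003"),
  age       = c(19, 24, 37),
  pell_raw  = c(1, 0, NA),
  stringsAsFactors = FALSE
)

# age_d24: TCAUS if age < 24, AUS if age >= 24 (NA ages stay NA).
students$age_d24 <- ifelse(students$age < 24, "TCAUS", "AUS")

# pell: recoded to No / Yes, assuming the raw values are 0 and 1.
students$pell <- factor(students$pell_raw, levels = c(0, 1),
                        labels = c("No", "Yes"))
```

Note that a student aged exactly 24 falls in the `AUS` group under this rule.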
For reading items, QID encodes both domain and passage/testlet position.
Structure:
- `Q` + three-digit global item number
- lowercase domain code
- final number indicating testlet/passage position
Domain codes:
- `s` = structure
- `in` = inference
- `id` = ideas
- `p` = purpose
- `l` = language
Examples:
- `Q001s1` = item 1, structure domain, testlet 1
- `Q014id3` = item 14, ideas domain, testlet 3
- `Q017p3` = item 17, purpose domain, testlet 3
The final number is not a difficulty code.
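The QID structure can be unpacked with a short regular expression. The `parse_qid()` helper below is a hypothetical illustration in base R, not a function from `utils_read_pipeline.R`; strings that do not follow the documented pattern come back as `NA`.

```r
# Hypothetical helper: split reading QIDs into item number, domain, and testlet.
parse_qid <- function(qid) {
  # Q + three digits, one of the five domain codes, then the testlet number.
  pattern <- "^Q([0-9]{3})(in|id|s|p|l)([0-9]+)$"
  domain_labels <- c(s = "structure", "in" = "inference", id = "ideas",
                     p = "purpose", l = "language")

  matches <- regmatches(qid, regexec(pattern, qid))
  parts <- t(vapply(
    matches,
    function(x) if (length(x) == 4) x[2:4] else rep(NA_character_, 3),
    character(3)
  ))

  data.frame(
    qid         = qid,
    item_number = as.integer(parts[, 1]),
    domain_code = parts[, 2],
    domain      = unname(domain_labels[parts[, 2]]),
    testlet     = as.integer(parts[, 3]),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_qid(c("Q001s1", "Q014id3", "Q017p3", "not_a_qid"))
```

For the three documented examples this returns item numbers 1, 14, and 17, domains structure, ideas, and purpose, and testlets 1, 3, and 3; the malformed string yields a row of `NA` values.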
Typical output folders created by the scripts include:
- `read_v2_umgc1ua2-anSamp2-2022_qa_outputs`
- `read_v2_ua23umgc-2022-23_qa_outputs`
- `read_v2_umgc_ua_22_23_combined_qa_outputs`
- `read_llm_audit_outputs`
- `read_v2_umgc_ua_22_23_final_outputs`
- `read_v2_umgc_ua_22_23_model_comparison_outputs`
- `read_v2_umgc_ua_22_23_model_180trimming_outputs`
- `read_v2_umgc_ua_22_23_trimmed_model_comparison_outputs`
- The preprocessing pipeline prioritizes defensible comparability across institutions and waves over maximal retention of raw response history.
- The LLM module is an auditable support layer for item-identity review. Final approved repairs remain human-reviewed.
- The final analytic sample excludes 89 UMGC1 faculty/advisor trial administrations, so the modeling dataset represents student records only.
- The full-bank and trimmed-bank modeling scripts are intended for overall reading-score comparisons; future domain-sensitivity and DIF analyses will extend the modeling stage.