A Julia package for design-based inference in survey sampling. This package provides functionality comparable to R's survey package, with native Julia performance and integration with DataFrames.jl. It is similar in spirit to Survey.jl, but allows for true multi-stage sampling with different designs at each stage.
NestedSurveys.jl implements methods for analyzing data from complex survey designs, including:
- Simple random sampling (SRS)
- Stratified sampling
- One-stage cluster sampling
- Two-stage cluster sampling
- Taylor series variance estimation
- Ratio estimation
- Regression-assisted estimation
The main exported type is SampleSum, which stores both an estimate and its variance. Sampling designs are specified using SI (simple random sampling without replacement), WithReplacement, and WithoutReplacement structs.
For simple random sampling without replacement, use sum with an SI object that specifies the population size.
using NestedSurveys, DataFramesMeta
result = @combine(apisrs, :total = sum(:enroll, SI(N)))

The sum function computes the Horvitz-Thompson estimator for the total and its variance, accounting for the finite population correction.
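To make the arithmetic concrete, here is a minimal sketch of the textbook estimator under simple random sampling without replacement, written in plain Julia with only the Statistics standard library. The helper name srs_total and the toy data are hypothetical and not part of NestedSurveys.jl; the package's internals may differ.

```julia
using Statistics

# Hypothetical helper (not part of NestedSurveys.jl): textbook estimator of a
# population total under simple random sampling without replacement.
function srs_total(y::AbstractVector{<:Real}, N::Integer)
    n = length(y)
    total = N * mean(y)                  # Horvitz-Thompson total: (N / n) * sum(y)
    fpc = 1 - n / N                      # finite population correction
    variance = N^2 * fpc * var(y) / n    # var(y) uses the n - 1 denominator
    return (total = total, variance = variance)
end

y = [2.0, 4.0, 6.0, 8.0]    # toy sample of n = 4 units
est = srs_total(y, 10)      # assumed population size N = 10
```

For this toy sample the estimated total is 50 with an estimated variance of 100, so the standard error is 10.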
For stratified sampling, compute subtotals within each stratum, then combine them.
strat_result = @chain apistrat begin
    @groupby(:stype)
    @combine(:subtotal = sum(:enroll, SI(Int(:fpc[1]))))
    @combine(:total = sum(:subtotal))
end

The SampleSum type supports addition, so stratified estimates can be combined by summing the subtotal column.
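The idea behind adding estimates can be sketched with a small stand-in type. TotalEstimate below is hypothetical and is not the package's SampleSum definition; it only illustrates that, when strata are sampled independently, both the totals and their variances add.

```julia
# Hypothetical stand-in for an estimate/variance pair; this is NOT the
# package's SampleSum definition, only an illustration of the idea.
struct TotalEstimate
    estimate::Float64
    variance::Float64
end

# Strata are sampled independently, so both the totals and their variances add.
Base.:+(a::TotalEstimate, b::TotalEstimate) =
    TotalEstimate(a.estimate + b.estimate, a.variance + b.variance)

# Made-up stratum subtotals, for illustration only.
strata = [TotalEstimate(120.0, 9.0), TotalEstimate(300.0, 16.0), TotalEstimate(80.0, 4.0)]
combined = sum(strata)                      # sum reduces with + for custom types
standard_error = sqrt(combined.variance)
```

Because Base.sum reduces a non-empty collection with +, defining addition on the type is enough for a plain sum over a column of subtotals to work.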
For one-stage cluster sampling, first aggregate within clusters, then use sum with SI on the cluster totals.
gdf = groupby(cal_crime, :county)
@chain gdf begin
    @combine(:subtotal = sum(:Burglary))
    @combine(:total = sum(:subtotal, SI(N_counties)))
end

For two-stage sampling, apply sum with SI twice: once within primary sampling units (PSUs), then across PSUs.
@chain df begin
    @groupby(:county)
    @combine(:subtotal = sum(:Burglary, SI(county_sizes[first(:county)])))
    @combine(:total = sum(:subtotal, SI(N_counties)))
end

The variance calculation properly accounts for both stages of sampling through the nested application of sum.
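As a rough illustration of what accounting for both stages means, the sketch below computes the standard textbook two-stage estimator by hand, with simple random sampling without replacement at each stage. All names and data are hypothetical, and whether NestedSurveys.jl uses exactly this form internally is an assumption.

```julia
using Statistics

# Hypothetical helper: SRS-without-replacement total and variance within one unit.
function srs_total(y, N)
    n = length(y)
    return (total = N * mean(y), variance = N^2 * (1 - n / N) * var(y) / n)
end

# Toy data: n = 2 sampled PSUs out of an assumed N = 5; each sampled PSU holds
# M = 4 secondary units, of which m = 2 are observed.
N_psu = 5
samples = [[1.0, 3.0], [2.0, 6.0]]
M = [4, 4]

stage2 = [srs_total(y, Mi) for (y, Mi) in zip(samples, M)]
t_hat = [s.total for s in stage2]        # estimated PSU totals
v_hat = [s.variance for s in stage2]     # within-PSU variance estimates

n_psu = length(samples)
total = N_psu / n_psu * sum(t_hat)
between = N_psu^2 * (1 - n_psu / N_psu) * var(t_hat) / n_psu   # first-stage term
within = N_psu / n_psu * sum(v_hat)                            # second-stage term
variance = between + within
```

With these numbers the estimated total is 60, the between-PSU term is 240, and the within-PSU term is 100. In one-stage cluster sampling every unit in a sampled PSU is observed, so the within-PSU term is zero and only the between-PSU term remains.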
For ratio estimation or other nonlinear functions of totals, use taylor with a Function argument for Taylor series linearization.
# Estimate the ratio of api_stu to enroll
ratio_result = @combine(apisrs, :total =
    taylor(a -> a[1] / a[2], g -> sum(g([:api_stu :enroll]), SI(Int(:fpc[1])))))

The taylor function uses automatic differentiation to compute the Taylor series approximation to the variance of nonlinear estimators.
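The linearization itself can be sketched without the package. The example below applies the first-order (delta-method) approximation to a ratio of two estimated totals under simple random sampling without replacement. The gradient is written by hand here rather than obtained by automatic differentiation, and the data and variable names are hypothetical.

```julia
using Statistics, LinearAlgebra

# Toy SRS sample: columns are y (numerator) and x (denominator).
data = [2.0 1.0;
        4.0 2.0;
        9.0 3.0]
N = 30                       # assumed population size
n = size(data, 1)

totals = N .* vec(mean(data, dims = 1))            # estimated totals [t_y, t_x]
Sigma = N^2 * (1 - n / N) .* cov(data) ./ n        # covariance matrix of the totals

ratio = totals[1] / totals[2]
grad = [1 / totals[2], -totals[1] / totals[2]^2]   # gradient of a[1] / a[2]
ratio_var = dot(grad, Sigma * grad)                # first-order variance approximation
```

For this toy sample the ratio is 2.5 and the approximate variance is 0.13125, which matches the familiar residual-based formula (1 - n/N) * s_e^2 / (n * x̄^2) with residuals e = y - ratio * x.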
For regression-assisted (calibration) estimation, use sum with a formula, sample data, population data, and design.
assisted_result = sum(@formula(api_stu ~ 1 + enroll), apisrs, (; enroll=[4e6]), SI(Int(apisrs[1, :fpc])))
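
For intuition, a regression-assisted total with one auxiliary variable and an intercept can be written out by hand. The sketch below assumes simple random sampling without replacement and a known population total of the auxiliary variable. It uses one common textbook variance estimator (residual sum of squares divided by n - 2), which may not match what the package computes. All values and names are hypothetical.

```julia
using Statistics

# Toy SRS sample (hypothetical values) and assumed population quantities.
x = [1.0, 2.0, 3.0, 4.0]        # auxiliary variable, total known for the population
y = [2.1, 3.9, 6.2, 7.8]        # study variable, observed only in the sample
N = 100                         # assumed population size
T_x = 260.0                     # assumed known population total of x

n = length(y)
b = cov(x, y) / var(x)                      # OLS slope for y ~ 1 + x
t_y = N * mean(y)                           # Horvitz-Thompson totals
t_x = N * mean(x)
t_reg = t_y + b * (T_x - t_x)               # regression-assisted total

# Residuals from the fitted line; their spread drives the variance estimate.
resid = (y .- mean(y)) .- b .* (x .- mean(x))
reg_var = N^2 * (1 - n / N) * sum(abs2, resid) / (n - 2) / n
```

The adjustment term b * (T_x - t_x) shifts the Horvitz-Thompson total according to how far the sample's estimated auxiliary total falls from the known population value. Because the residuals vary less than y itself, the variance estimate is typically smaller than for the unadjusted total.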