samanklesaria/NestedSurveys.jl


NestedSurveys.jl

A Julia package for design-based inference in survey sampling. It provides functionality comparable to R's survey package, with native Julia performance and integration with DataFrames.jl. It is similar in spirit to Survey.jl, but supports true multi-stage sampling with a different design at each stage.

Overview

NestedSurveys.jl implements methods for analyzing data from complex survey designs, including:

  • Simple random sampling (SRS)
  • Stratified sampling
  • One-stage cluster sampling
  • Two-stage cluster sampling
  • Taylor series variance estimation
  • Ratio estimation
  • Regression-assisted estimation

The main exported type is SampleSum, which stores both an estimate and its variance. Sampling designs are specified using SI (simple random sampling without replacement), WithReplacement, and WithoutReplacement structs.

Simple Random Sampling

For simple random sampling without replacement, use sum with an SI object that specifies the population size.

using NestedSurveys, DataFramesMeta
# apisrs is the sample DataFrame; N is the population size, defined beforehand
result = @combine(apisrs, :total = sum(:enroll, SI(N)))

The sum function computes the Horvitz-Thompson estimator for the total and its variance, accounting for the finite population correction.
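To make the formula concrete, here is a minimal sketch of the textbook Horvitz-Thompson total and its variance under simple random sampling without replacement. It is written in plain Julia with a hypothetical helper named srs_total. It is not the package's implementation, only the calculation that the paragraph above describes.

```julia
using Statistics

# Hypothetical helper (not part of NestedSurveys.jl): Horvitz-Thompson total
# under simple random sampling without replacement.
function srs_total(y::AbstractVector{<:Real}, N::Integer)
    n = length(y)
    total = N * mean(y)                  # every unit has inclusion probability n / N
    fpc = 1 - n / N                      # finite population correction
    variance = N^2 * fpc * var(y) / n    # var(y) uses the n - 1 denominator
    return (total = total, variance = variance)
end

# Illustrative values only: 4 sampled units from a population of 100.
est = srs_total([2.0, 4.0, 6.0, 8.0], 100)
```

The finite population correction shrinks the variance toward zero as the sample approaches a full census (n = N).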

Stratified Sampling

For stratified sampling, compute subtotals within each stratum, then combine them.

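strat_result = @chain apistrat begin
    @groupby(:stype)                                      # one group per stratum
    # :fpc holds the stratum population size, constant within a stratum
    @combine(:subtotal = sum(:enroll, SI(Int(:fpc[1]))))
    @combine(:total = sum(:subtotal))                     # add the stratum estimates
end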

The SampleSum type supports addition, so stratified estimates can be combined by summing the subtotal column.
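The reason addition is enough: strata are sampled independently, so both the estimated totals and their variances add. The sketch below illustrates this with a hypothetical Estimate struct. It is a stand-in for illustration, and the package's SampleSum may be implemented differently.

```julia
# Hypothetical stand-in for an estimate-with-variance type; the package's
# SampleSum may be implemented differently.
struct Estimate
    value::Float64
    variance::Float64
end

# Strata are sampled independently, so totals and variances both add.
Base.:+(a::Estimate, b::Estimate) = Estimate(a.value + b.value, a.variance + b.variance)

# Illustrative per-stratum estimates (made-up numbers).
strata = [Estimate(1200.0, 90.0), Estimate(800.0, 40.0), Estimate(500.0, 25.0)]

# Base.sum works on any non-empty collection whose element type defines +.
total = sum(strata)
```

Defining only + is what lets a generic reduction such as sum combine the per-stratum results without any survey-specific code.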

One-Stage Cluster Sampling

For one-stage cluster sampling, first aggregate within clusters, then use sum with SI on the cluster totals.

gdf = groupby(cal_crime, :county)
@chain gdf begin
    @combine(:subtotal = sum(:Burglary))               # exact total within each sampled county
    @combine(:total = sum(:subtotal, SI(N_counties)))  # N_counties: number of counties in the population
end
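
The two steps above can be sketched in plain Julia, assuming every unit in a sampled cluster is observed and the clusters themselves are a simple random sample. The records and the cluster count N_clusters are made-up illustrative values, and none of this is the package's API.

```julia
using Statistics

# Hypothetical records: (cluster id, value). Every unit in a sampled cluster is observed.
records = [("A", 3.0), ("A", 5.0), ("B", 10.0), ("C", 1.0), ("C", 1.0), ("C", 4.0)]
N_clusters = 30  # assumed number of clusters in the population

# Step 1: plain totals within each sampled cluster (no sampling error at this stage).
cluster_totals = Dict{String,Float64}()
for (id, y) in records
    cluster_totals[id] = get(cluster_totals, id, 0.0) + y
end

# Step 2: treat the cluster totals as a simple random sample of clusters.
t = collect(values(cluster_totals))
n = length(t)
total = N_clusters * mean(t)
variance = N_clusters^2 * (1 - n / N_clusters) * var(t) / n
```

All of the uncertainty comes from which clusters were drawn, which is why the variance depends only on the spread of the cluster totals.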

Two-Stage Cluster Sampling

For two-stage sampling, apply sum with SI twice: once within primary sampling units (PSUs), then across PSUs.

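@chain df begin
    @groupby(:county)
    # county_sizes maps each county to its number of second-stage units
    @combine(:subtotal = sum(:Burglary, SI(county_sizes[first(:county)])))
    @combine(:total = sum(:subtotal, SI(N_counties)))  # N_counties: number of counties in the population
end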

The variance calculation properly accounts for both stages of sampling through the nested application of sum.
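For reference, the textbook two-stage variance splits into a between-PSU term and a within-PSU term. The sketch below computes both in plain Julia, assuming simple random sampling without replacement at each stage. The helper two_stage_total and its inputs are hypothetical, and the package's internal calculation may differ in detail.

```julia
using Statistics

# Hypothetical helper, not the package's implementation: two-stage estimate
# with simple random sampling without replacement at both stages.
# samples[i] holds the observed values in sampled PSU i, M[i] is that PSU's
# population size, and N is the number of PSUs in the population.
function two_stage_total(samples::Vector{Vector{Float64}}, M::Vector{Int}, N::Int)
    n = length(samples)
    psu_totals = [M[i] * mean(samples[i]) for i in 1:n]
    # Variance of each estimated PSU total from second-stage sampling.
    psu_vars = [M[i]^2 * (1 - length(samples[i]) / M[i]) * var(samples[i]) / length(samples[i])
                for i in 1:n]
    total = N / n * sum(psu_totals)
    between = N^2 * (1 - n / N) * var(psu_totals) / n   # first-stage contribution
    within = N / n * sum(psu_vars)                      # second-stage contribution
    return (total = total, variance = between + within)
end

# Illustrative values only: 2 of 5 PSUs sampled, with 2 of 10 and 3 of 20 units observed.
est = two_stage_total([[1.0, 3.0], [2.0, 4.0, 6.0]], [10, 20], 5)
```

If every unit in each sampled PSU were observed, the within term would vanish and the result would reduce to the one-stage cluster estimate.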

Ratio Estimation

For ratio estimation or other nonlinear functions of totals, use taylor with a Function argument for Taylor series linearization.

# Estimate the ratio of api_stu to enroll
ratio_result = @combine(apisrs, :total =
    taylor(a -> a[1] / a[2], g -> sum(g([:api_stu :enroll]), SI(Int(:fpc[1])))))

The taylor function uses automatic differentiation to compute the Taylor series approximation to the variance of nonlinear estimators.
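As an illustration of the linearization itself, the sketch below applies the first-order Taylor (delta method) approximation to a ratio of two estimated totals. The gradient is written by hand rather than obtained by automatic differentiation, and the totals and covariance matrix are made-up values, so this shows the underlying formula rather than the package's code.

```julia
using LinearAlgebra

# Hypothetical inputs: estimated totals of (y, x) and their covariance matrix.
totals = [200.0, 400.0]
covmat = [16.0 4.0;
          4.0  9.0]

# Ratio R = y / x and its gradient with respect to the two totals,
# written by hand here instead of using automatic differentiation.
R = totals[1] / totals[2]
grad = [1 / totals[2], -totals[1] / totals[2]^2]

# First-order Taylor approximation: Var(R) ≈ grad' * covmat * grad.
var_R = dot(grad, covmat * grad)
```

Automatic differentiation replaces the hand-written grad line, which is what makes the same approach work for arbitrary smooth functions of totals.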

Regression-Assisted Estimation

For regression-assisted (calibration) estimation, use sum with a formula, sample data, population data, and design.

# Arguments: formula, sample data, population data for the auxiliary variable, design
assisted_result = sum(@formula(api_stu ~ 1 + enroll), apisrs, (; enroll=[4e6]), SI(Int(apisrs[1, :fpc])))
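
The idea behind the regression-assisted estimator can be sketched in plain Julia: fit a linear model on the sample, then adjust the Horvitz-Thompson total of the response by the gap between the known and estimated totals of the auxiliary variable. The data, N, and Tx below are made-up illustrative values under simple random sampling, and the code is an assumption about the general technique, not the package's implementation.

```julia
using Statistics

# Hypothetical sample and known population quantities (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
N = 100          # population size
Tx = 260.0       # known population total of x

# Least-squares fit of y ~ 1 + x on the sample.
X = hcat(ones(length(x)), x)
β = X \ y

# Horvitz-Thompson totals under simple random sampling, then the regression adjustment.
ty_ht = N * mean(y)
tx_ht = N * mean(x)
ty_reg = ty_ht + β[2] * (Tx - tx_ht)
```

With an intercept in the model, the adjusted total equals N * β[1] + β[2] * Tx, which is the fitted model evaluated at the population totals.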
