Introduction

Complex sampling designs require special treatment for correct variance estimation. metasurvey manages the sampling design automatically through the Survey object, so that users can focus on the analysis rather than technical details.

This vignette covers the following topics:

  1. Creating surveys with different design types
  2. Configuring weights and data engines
  3. Validating the processing pipeline
  4. Cross-checking results with the survey package

Initial Setup

We use the Academic Performance Index (API) dataset from the survey package. It includes stratified, cluster, and simple random sampling versions.

library(metasurvey)
library(survey)
library(data.table)

data(api, package = "survey")
dt_strat <- data.table(apistrat)

Sampling Design Types

Simple Weighted Design

The simplest design uses probability weights without clusters or stratification:

svy_simple <- Survey$new(
  data = dt_strat,
  edition = "2000",
  type = "api",
  psu = NULL,
  engine = "data.table",
  weight = add_weight(annual = "pw")
)

cat_design(svy_simple)
#> [1] "\n  Design: Not initialized (lazy initialization - will be created when needed)\n"
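For reference, the lazily initialized design above corresponds to a weights-only design in the survey package (no clusters, no strata), which is also the specification used for cross-validation later in this vignette:

```r
library(survey)
data(api, package = "survey")

# Equivalent weights-only design: id = ~1 means no clustering
design_simple <- svydesign(id = ~1, weights = ~pw, data = apistrat)
svymean(~api00, design_simple)
```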

Stratified Cluster Design

Many national surveys use stratified multi-stage sampling: pass strata (and psu, when cluster identifiers are available) to Survey$new(). The apistrat sample is stratified by school type (stype) but has no clusters, so psu stays NULL here:

svy_strat_clus <- Survey$new(
  data    = dt_strat,
  edition = "2000",
  type    = "api",
  psu     = NULL,
  strata  = "stype",
  engine  = "data.table",
  weight  = add_weight(annual = "pw")
)

cat_design(svy_strat_clus)
#> [1] "\n  Design: Not initialized (lazy initialization - will be created when needed)\n"

Cross-validate with the survey package:

design_strat <- svydesign(
  id = ~1, strata = ~stype, weights = ~pw,
  data = dt_strat
)
direct_strat <- svymean(~api00, design_strat)

wf_strat <- workflow(
  list(svy_strat_clus),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "annual"
)

cat("Direct estimate:", round(coef(direct_strat), 2), "\n")
#> Direct estimate: 662.29
cat("Workflow estimate:", round(wf_strat$value, 2), "\n")
#> Workflow estimate: 662.29
cat("Match:", all.equal(
  as.numeric(coef(direct_strat)),
  wf_strat$value,
  tolerance = 1e-6
), "\n")
#> Match: TRUE

For real-world examples of stratified cluster designs with CASEN, PNADc, ENIGH, and DHS data, see vignette("international-surveys").

Design Inspection

# Check design type
cat_design_type(svy_simple, "annual")
#> [1] "None"

# View metadata
get_metadata(svy_simple)
#> Type: API
#> Edition: 2000
#> Periodicity: Annual
#> Engine: data.table
#> Design: 
#>   Design: Not initialized (lazy initialization - will be created when needed)
#> 
#> Steps: None
#> Recipes: None

Multiple Weight Types

Many surveys provide different weights depending on the analysis period (for example, annual vs. monthly). metasurvey associates periodicity labels with weight columns:

set.seed(42)
dt_multi <- copy(dt_strat)
dt_multi[, pw_monthly := pw * runif(.N, 0.9, 1.1)]

svy_multi <- Survey$new(
  data    = dt_multi,
  edition = "2000",
  type    = "api",
  psu     = NULL,
  engine  = "data.table",
  weight  = add_weight(annual = "pw", monthly = "pw_monthly")
)

# Use different weight types in workflow()
annual_est <- workflow(
  list(svy_multi),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "annual"
)

monthly_est <- workflow(
  list(svy_multi),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "monthly"
)

cat("Annual estimate:", round(annual_est$value, 1), "\n")
#> Annual estimate: 662.3
cat("Monthly estimate:", round(monthly_est$value, 1), "\n")
#> Monthly estimate: 662.5
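Before running workflows with several weight types, it can help to confirm that every configured weight column exists and is strictly positive. This is a light sanity check in base R, not part of metasurvey's API:

```r
# Column names as configured in add_weight() above
weight_cols <- c(annual = "pw", monthly = "pw_monthly")

stopifnot(
  all(weight_cols %in% names(dt_multi)),
  all(dt_multi[["pw"]] > 0),
  all(dt_multi[["pw_monthly"]] > 0)
)
```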

Bootstrap Replicate Weights

For surveys that provide bootstrap replicates (such as Uruguay’s ECH), use add_replicate() inside add_weight():

# Requires external bootstrap replicate CSV files
svy_boot <- load_survey(
  path = "data/main_survey.csv",
  svy_type = "ech",
  svy_edition = "2023",
  svy_weight = add_weight(
    annual = add_replicate(
      weight_var = "pesoano",
      replicate_path = "data/bootstrap_replicates.csv",
      replicate_id = c("numero" = "id"),
      replicate_pattern = "bsrep[0-9]+",
      replicate_type = "bootstrap"
    )
  )
)

When replicate weights are configured, workflow() automatically uses them for variance estimation via survey::svrepdesign().
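For reference, a minimal self-contained sketch of the survey::svrepdesign() call that this relies on, here with simulated bootstrap replicates built from the API data (the replicate matrix and the number of replicates are illustrative, not taken from a real ECH file):

```r
library(survey)
data(api, package = "survey")

set.seed(1)
n_rep <- 50
# Simulated bootstrap replicate weights: base weight times a random factor
rep_mat <- apistrat$pw * matrix(rexp(nrow(apistrat) * n_rep), ncol = n_rep)

# combined.weights = TRUE: the replicate columns already include
# the sampling weights
rep_design <- svrepdesign(
  data             = apistrat,
  weights          = ~pw,
  repweights       = rep_mat,
  type             = "bootstrap",
  combined.weights = TRUE
)
svymean(~api00, rep_design)
```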

Engine and Processing Configuration

Data Engine

metasurvey uses data.table by default for fast data manipulation:

# Current engine
get_engine()
#> [1] "data.table"

# Available engines
show_engines()
#> [1] "data.table" "tidyverse"  "dplyr"

Lazy Processing

By default, steps are recorded but not executed until bake_steps() is called, so the pipeline can be inspected and validated before anything runs:

# Check current setting
lazy_default()
#> [1] TRUE

# Change for the session (not recommended for most workflows)
# set_lazy_processing(FALSE)
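Under the default lazy mode, a sketch of the record-then-execute cycle looks like this (using the svy_simple object created earlier):

```r
# Steps are only recorded here; no computation happens yet
svy_lazy <- step_compute(svy_simple, api_diff = api00 - api99)

# Force execution of all pending steps
svy_baked <- bake_steps(svy_lazy)
```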

Copy Behavior

You can control whether step operations modify the data in-place or work on copies:

# Current setting
use_copy_default()
#> [1] TRUE

# In-place is faster but modifies the original
# set_use_copy(FALSE)
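The distinction mirrors data.table's own reference semantics; a minimal illustration with copy() (independent of metasurvey):

```r
library(data.table)
dt <- data.table(x = 1:3)

dt_ref  <- dt        # a reference: changes propagate back to dt
dt_copy <- copy(dt)  # an independent copy

dt_ref[, x := x * 10L]
dt$x       # also modified: 10 20 30
dt_copy$x  # unchanged: 1 2 3
```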

Variance Estimation

Design-Based Variance

Standard variance estimation using the sampling design:

results <- workflow(
  list(svy_simple),
  survey::svymean(~api00, na.rm = TRUE),
  survey::svytotal(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)

results
#>                        stat        value           se         cv confint_lower
#>                      <char>        <num>        <num>      <num>         <num>
#> 1:   survey::svymean: api00     662.2874 9.585429e+00 0.01447322      643.5003
#> 2: survey::svytotal: enroll 3687177.5324 1.645323e+05 0.04462283  3364700.1537
#>    confint_upper
#>            <num>
#> 1:      681.0745
#> 2:  4009654.9112

Domain Estimation

Estimates for subpopulations can be computed using survey::svyby():

domain_results <- workflow(
  list(svy_simple),
  survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

domain_results
#>                              stat  value       se         cv confint_lower
#>                            <char>  <num>    <num>      <num>         <num>
#> 1: survey::svyby: api00 [stype=E] 674.43 12.49343 0.01852443      649.9433
#> 2: survey::svyby: api00 [stype=H] 625.82 15.34078 0.02451309      595.7526
#> 3: survey::svyby: api00 [stype=M] 636.60 16.50239 0.02592270      604.2559
#>    confint_upper  stype
#>            <num> <fctr>
#> 1:      698.9167      E
#> 2:      655.8874      H
#> 3:      668.9441      M

Ratios

ratio_result <- workflow(
  list(svy_simple),
  survey::svyratio(~api00, ~api99),
  estimation_type = "annual"
)

ratio_result
#>                             stat    value         se          cv confint_lower
#>                           <char>    <num>      <num>       <num>         <num>
#> 1: survey::svyratio: api00/api99 1.052261 0.00379243 0.003604079      1.044828
#>    confint_upper
#>            <num>
#> 1:      1.059694

Pipeline Validation

Step-by-Step Verification

When building complex pipelines, it is useful to verify each step independently:

# Step 1: Compute new variable
svy_v <- step_compute(svy_simple,
  api_diff = api00 - api99,
  comment = "API score difference"
)

# Check that the step was recorded
steps <- get_steps(svy_v)
cat("Pending steps:", length(steps), "\n")
#> Pending steps: 1

Cross-Validation with the survey Package

You can compare metasurvey workflow() results with direct calls to the survey package:

# Method 1: Direct survey package
design <- svydesign(id = ~1, weights = ~pw, data = dt_strat)
direct_mean <- svymean(~api00, design)

# Method 2: metasurvey workflow
wf_result <- workflow(
  list(svy_simple),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "annual"
)

cat("Direct estimate:", round(coef(direct_mean), 2), "\n")
#> Direct estimate: 662.29
cat("Workflow estimate:", round(wf_result$value, 2), "\n")
#> Workflow estimate: 662.29
cat("Match:", all.equal(
  as.numeric(coef(direct_mean)),
  wf_result$value,
  tolerance = 1e-6
), "\n")
#> Match: TRUE

Pipeline Visualization

You can use view_graph() to visualize the dependency graph between steps:

svy_viz <- step_compute(svy_simple,
  api_diff = api00 - api99,
  high_growth = ifelse(api00 - api99 > 50, 1L, 0L)
)
view_graph(svy_viz, init_step = "Load API data")

The interactive DAG is not rendered in this vignette to keep the package size small. Run the code above in your R session to explore it.

Quality Assessment

You can assess the quality of estimates using the coefficient of variation:

results_quality <- workflow(
  list(svy_simple),
  survey::svymean(~api00, na.rm = TRUE),
  survey::svymean(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)

for (i in seq_len(nrow(results_quality))) {
  cv_pct <- results_quality$cv[i] * 100
  cat(
    results_quality$stat[i], ":",
    round(cv_pct, 1), "% CV -",
    evaluate_cv(cv_pct), "\n"
  )
}
#> survey::svymean: api00 : 1.4 % CV - Excellent 
#> survey::svymean: enroll : 4.5 % CV - Excellent

Recipe Validation

You can verify that recipes and their steps are consistent:

# Create steps and recipe
svy_rt <- step_compute(svy_simple, api_diff = api00 - api99)

my_recipe <- steps_to_recipe(
  name        = "API Test",
  user        = "QA Team",
  svy         = svy_rt,
  description = "Recipe for validation",
  steps       = get_steps(svy_rt)
)

# Check documentation is correct
doc <- my_recipe$doc()
cat("Input variables:", paste(doc$input_variables, collapse = ", "), "\n")
#> Input variables: api00, api99
cat("Output variables:", paste(doc$output_variables, collapse = ", "), "\n")
#> Output variables: api_diff

# Validate against the survey
my_recipe$validate(svy_rt)
#> [1] TRUE

Validation Checklist

Before putting a survey processing pipeline into production, the following should be verified:

  1. Data integrity – row count, column names, and data types after each step
  2. Weight validation – weight columns exist and are positive
  3. Design verification – the sampling design matches the expected specification (PSU, strata, weights)
  4. Recipe reproducibility – save and reload recipes, verify the JSON round-trip
  5. Cross-validation – compare key estimates with published values or direct calls to the survey package
  6. CV thresholds – flag estimates with high coefficients of variation

The first of these checks can be bundled into a small helper function (a sketch; adapt the weight check to your survey's structure):
validate_pipeline <- function(svy) {
  data <- get_data(svy)
  # Keep only the weight entries that are plain column names
  weight_cols <- unlist(Filter(is.character, svy$weight))
  checks <- list(
    has_data    = !is.null(data),
    has_rows    = nrow(data) > 0,
    has_weights = all(weight_cols %in% names(data))
  )

  passed <- all(unlist(checks))
  if (passed) {
    message("All validation checks passed")
  } else {
    failed <- names(checks)[!unlist(checks)]
    warning("Failed checks: ", paste(failed, collapse = ", "))
  }
  invisible(checks)
}

validate_pipeline(svy_simple)

Best Practices

  1. Always use appropriate weights – never compute unweighted statistics from survey data
  2. Use replicate weights when available – they provide more robust variance estimates
  3. Check sample sizes by domain – combine small domains when CVs are too high
  4. Document the design – include the design specification, weight construction, and variance method
  5. Cross-validate key estimates – compare with published values or alternative methods
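
As a sketch of points 3 and 6 applied to the domain estimates computed earlier, estimates can be flagged against a reporting threshold (the 30% cut-off here is a common rule of thumb, not a metasurvey default; cv is stored as a proportion):

```r
# Flag domain estimates whose CV exceeds 30%
domain_results[, publishable := cv <= 0.30]
domain_results[, .(stat, cv, publishable)]
```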

Next Steps