Introduction
Complex sampling designs require special treatment for correct
variance estimation. metasurvey manages the sampling design
automatically through the Survey object, so that users can
focus on the analysis rather than technical details.
This vignette covers the following topics:
- Creating surveys with different design types
- Configuring weights and data engines
- Validating the processing pipeline
- Cross-checking results with the
surveypackage
Initial Setup
We use the Academic Performance Index (API) dataset from the
survey package. It includes stratified, cluster, and simple
random sampling versions.
library(metasurvey)
library(survey)
library(data.table)
data(api, package = "survey")
dt_strat <- data.table(apistrat)Sampling Design Types
Simple Weighted Design
The simplest design uses probability weights without clusters or stratification:
svy_simple <- Survey$new(
data = dt_strat,
edition = "2000",
type = "api",
psu = NULL,
engine = "data.table",
weight = add_weight(annual = "pw")
)
cat_design(svy_simple)
#> [1] "\n Design: Not initialized (lazy initialization - will be created when needed)\n"Stratified Cluster Design
Many national surveys use stratified multi-stage sampling. Pass
strata and psu to
Survey$new():
dt_clus <- data.table(apiclus1)
svy_strat_clus <- Survey$new(
data = dt_strat,
edition = "2000",
type = "api",
psu = NULL,
strata = "stype",
engine = "data.table",
weight = add_weight(annual = "pw")
)
cat_design(svy_strat_clus)
#> [1] "\n Design: Not initialized (lazy initialization - will be created when needed)\n"Cross-validate with the survey package:
design_strat <- svydesign(
id = ~1, strata = ~stype, weights = ~pw,
data = dt_strat
)
direct_strat <- svymean(~api00, design_strat)
wf_strat <- workflow(
list(svy_strat_clus),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "annual"
)
cat("Direct estimate:", round(coef(direct_strat), 2), "\n")
#> Direct estimate: 662.29
cat("Workflow estimate:", round(wf_strat$value, 2), "\n")
#> Workflow estimate: 662.29
cat("Match:", all.equal(
as.numeric(coef(direct_strat)),
wf_strat$value,
tolerance = 1e-6
), "\n")
#> Match: TRUEFor real-world examples of stratified cluster designs with CASEN,
PNADc, ENIGH, and DHS data, see
vignette("international-surveys").
Design Inspection
# Check design type
cat_design_type(svy_simple, "annual")
#> [1] "None"
# View metadata
get_metadata(svy_simple)
#> Type: API
#> Edition: 2000
#> Periodicity: Annual
#> Engine: data.table
#> Design:
#> Design: Not initialized (lazy initialization - will be created when needed)
#>
#> Steps: None
#> Recipes: NoneMultiple Weight Types
Many surveys provide different weights depending on the analysis period (for example, annual vs. monthly). metasurvey associates periodicity labels with weight columns:
set.seed(42)
dt_multi <- copy(dt_strat)
dt_multi[, pw_monthly := pw * runif(.N, 0.9, 1.1)]
svy_multi <- Survey$new(
data = dt_multi,
edition = "2000",
type = "api",
psu = NULL,
engine = "data.table",
weight = add_weight(annual = "pw", monthly = "pw_monthly")
)
# Use different weight types in workflow()
annual_est <- workflow(
list(svy_multi),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "annual"
)
monthly_est <- workflow(
list(svy_multi),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "monthly"
)
cat("Annual estimate:", round(annual_est$value, 1), "\n")
#> Annual estimate: 662.3
cat("Monthly estimate:", round(monthly_est$value, 1), "\n")
#> Monthly estimate: 662.5Bootstrap Replicate Weights
For surveys that provide bootstrap replicates (such as Uruguay’s
ECH), use add_replicate() inside
add_weight():
# Requires external bootstrap replicate CSV files
svy_boot <- load_survey(
path = "data/main_survey.csv",
svy_type = "ech",
svy_edition = "2023",
svy_weight = add_weight(
annual = add_replicate(
weight_var = "pesoano",
replicate_path = "data/bootstrap_replicates.csv",
replicate_id = c("numero" = "id"),
replicate_pattern = "bsrep[0-9]+",
replicate_type = "bootstrap"
)
)
)When replicate weights are configured, workflow()
automatically uses them for variance estimation via
survey::svrepdesign().
Engine and Processing Configuration
Data Engine
metasurvey uses data.table by default for fast data
manipulation:
# Current engine
get_engine()
#> [1] "data.table"
# Available engines
show_engines()
#> [1] "data.table" "tidyverse" "dplyr"Lazy Processing
By default, steps are recorded but not executed until
bake_steps() is called. This allows validations to be
performed before execution:
# Check current setting
lazy_default()
#> [1] TRUE
# Change for the session (not recommended for most workflows)
# set_lazy_processing(FALSE)Copy Behavior
You can control whether step operations modify the data in-place or work on copies:
# Current setting
use_copy_default()
#> [1] TRUE
# In-place is faster but modifies the original
# set_use_copy(FALSE)Variance Estimation
Design-Based Variance
Standard variance estimation using the sampling design:
results <- workflow(
list(svy_simple),
survey::svymean(~api00, na.rm = TRUE),
survey::svytotal(~enroll, na.rm = TRUE),
estimation_type = "annual"
)
results
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svymean: api00 662.2874 9.585429e+00 0.01447322 643.5003
#> 2: survey::svytotal: enroll 3687177.5324 1.645323e+05 0.04462283 3364700.1537
#> confint_upper
#> <num>
#> 1: 681.0745
#> 2: 4009654.9112Domain Estimation
Estimates for subpopulations can be computed using
survey::svyby():
domain_results <- workflow(
list(svy_simple),
survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
estimation_type = "annual"
)
domain_results
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svyby: api00 [stype=E] 674.43 12.49343 0.01852443 649.9433
#> 2: survey::svyby: api00 [stype=H] 625.82 15.34078 0.02451309 595.7526
#> 3: survey::svyby: api00 [stype=M] 636.60 16.50239 0.02592270 604.2559
#> confint_upper stype
#> <num> <fctr>
#> 1: 698.9167 E
#> 2: 655.8874 H
#> 3: 668.9441 MRatios
ratio_result <- workflow(
list(svy_simple),
survey::svyratio(~api00, ~api99),
estimation_type = "annual"
)
ratio_result
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svyratio: api00/api99 1.052261 0.00379243 0.003604079 1.044828
#> confint_upper
#> <num>
#> 1: 1.059694Pipeline Validation
Step-by-Step Verification
When building complex pipelines, it is useful to verify each step independently:
# Step 1: Compute new variable
svy_v <- step_compute(svy_simple,
api_diff = api00 - api99,
comment = "API score difference"
)
# Check that the step was recorded
steps <- get_steps(svy_v)
cat("Pending steps:", length(steps), "\n")
#> Pending steps: 1Cross-Validation with the survey Package
You can compare metasurvey workflow() results with
direct calls to the survey package:
# Method 1: Direct survey package
design <- svydesign(id = ~1, weights = ~pw, data = dt_strat)
direct_mean <- svymean(~api00, design)
# Method 2: metasurvey workflow
wf_result <- workflow(
list(svy_simple),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "annual"
)
cat("Direct estimate:", round(coef(direct_mean), 2), "\n")
#> Direct estimate: 662.29
cat("Workflow estimate:", round(wf_result$value, 2), "\n")
#> Workflow estimate: 662.29
cat("Match:", all.equal(
as.numeric(coef(direct_mean)),
wf_result$value,
tolerance = 1e-6
), "\n")
#> Match: TRUEPipeline Visualization
You can use view_graph() to visualize the dependency
graph between steps:
svy_viz <- step_compute(svy_simple,
api_diff = api00 - api99,
high_growth = ifelse(api00 - api99 > 50, 1L, 0L)
)
view_graph(svy_viz, init_step = "Load API data")The interactive DAG is not rendered in this vignette to keep the package size small. Run the code above in your R session to explore it.
Quality Assessment
You can assess the quality of estimates using the coefficient of variation:
results_quality <- workflow(
list(svy_simple),
survey::svymean(~api00, na.rm = TRUE),
survey::svymean(~enroll, na.rm = TRUE),
estimation_type = "annual"
)
for (i in seq_len(nrow(results_quality))) {
cv_pct <- results_quality$cv[i] * 100
cat(
results_quality$stat[i], ":",
round(cv_pct, 1), "% CV -",
evaluate_cv(cv_pct), "\n"
)
}
#> survey::svymean: api00 : 1.4 % CV - Excellent
#> survey::svymean: enroll : 4.5 % CV - ExcellentRecipe Validation
You can verify that recipes and their steps are consistent:
# Create steps and recipe
svy_rt <- step_compute(svy_simple, api_diff = api00 - api99)
my_recipe <- steps_to_recipe(
name = "API Test",
user = "QA Team",
svy = svy_rt,
description = "Recipe for validation",
steps = get_steps(svy_rt)
)
# Check documentation is correct
doc <- my_recipe$doc()
cat("Input variables:", paste(doc$input_variables, collapse = ", "), "\n")
#> Input variables: api00, api99
cat("Output variables:", paste(doc$output_variables, collapse = ", "), "\n")
#> Output variables: api_diff
# Validate against the survey
my_recipe$validate(svy_rt)
#> [1] TRUEValidation Checklist
Before putting a survey processing pipeline into production, the following should be verified:
- Data integrity – row count, column names, and data types after each step
- Weight validation – weight columns exist and are positive
- Design verification – the sampling design matches the expected specification (PSU, strata, weights)
- Recipe reproducibility – save and reload recipes, verify the JSON round-trip
-
Cross-validation – compare key estimates with
published values or direct calls to the
surveypackage - CV thresholds – flag estimates with high coefficients of variation
validate_pipeline <- function(svy) {
data <- get_data(svy)
checks <- list(
has_data = !is.null(data),
has_rows = nrow(data) > 0,
has_weights = all(
unlist(svy$weight)[is.character(unlist(svy$weight))] %in% names(data)
)
)
passed <- all(unlist(checks))
if (passed) {
message("All validation checks passed")
} else {
failed <- names(checks)[!unlist(checks)]
warning("Failed checks: ", paste(failed, collapse = ", "))
}
invisible(checks)
}
validate_pipeline(svy_simple)Best Practices
- Always use appropriate weights – never compute unweighted statistics from survey data
- Use replicate weights when available – they provide more robust variance estimates
- Check sample sizes by domain – combine small domains when CVs are too high
- Document the design – include the design specification, weight construction, and variance method
- Cross-validate key estimates – compare with published values or alternative methods
Next Steps
-
Estimation
Workflows –
workflow(),RecipeWorkflow, and publishable estimates -
Rotating Panels and
PoolSurvey – Longitudinal analysis with
RotativePanelSurveyandPoolSurvey - Getting Started – Review the basics of steps and Survey objects