What is a Workflow?
After transforming survey data with steps and recipes, the next task is estimation: computing means, totals, ratios, and their standard errors while accounting for the complex survey design.
The workflow() function wraps the estimators from the
survey package (svymean,
svytotal, svyratio, svyby) and
returns tidy results as a data.table that include:
- Point estimates and standard errors
- Coefficients of variation (CV)
- Confidence intervals
- Metadata for reproducibility
Initial Setup
We use the Academic Performance Index (API) dataset from the
survey package, which contains real data from stratified
schools in California.
library(metasurvey)
library(survey)
library(data.table)
data(api, package = "survey")
dt <- data.table(apistrat)
svy <- Survey$new(
data = dt,
edition = "2000",
type = "api",
psu = NULL,
engine = "data.table",
weight = add_weight(annual = "pw")
)Basic Estimation
Multiple Estimates at Once
You can pass multiple estimation calls to workflow() to
compute them in a single step:
results <- workflow(
list(svy),
survey::svymean(~api00, na.rm = TRUE),
survey::svytotal(~enroll, na.rm = TRUE),
estimation_type = "annual"
)
results
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svymean: api00 662.2874 9.585429e+00 0.01447322 643.5003
#> 2: survey::svytotal: enroll 3687177.5324 1.645323e+05 0.04462283 3364700.1537
#> confint_upper
#> <num>
#> 1: 681.0745
#> 2: 4009654.9112Domain Estimation
We use survey::svyby() to compute estimates by
subpopulations (domains):
# Mean API score by school type
api_by_type <- workflow(
list(svy),
survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
estimation_type = "annual"
)
api_by_type
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svyby: api00 [stype=E] 674.43 12.49343 0.01852443 649.9433
#> 2: survey::svyby: api00 [stype=H] 625.82 15.34078 0.02451309 595.7526
#> 3: survey::svyby: api00 [stype=M] 636.60 16.50239 0.02592270 604.2559
#> confint_upper stype
#> <num> <fctr>
#> 1: 698.9167 E
#> 2: 655.8874 H
#> 3: 668.9441 M
# Mean enrollment by awards status
enroll_by_award <- workflow(
list(svy),
survey::svyby(~enroll, ~awards, survey::svymean, na.rm = TRUE),
estimation_type = "annual"
)
enroll_by_award
#> stat value se cv
#> <char> <num> <num> <num>
#> 1: survey::svyby: enroll [awards=No] 727.5958 57.54094 0.07908366
#> 2: survey::svyby: enroll [awards=Yes] 520.5114 25.51854 0.04902590
#> confint_lower confint_upper awards
#> <num> <num> <fctr>
#> 1: 614.8177 840.3740 No
#> 2: 470.4960 570.5269 YesQuality Assessment
The coefficient of variation (CV) measures the
precision of an estimate. You can use evaluate_cv() to
classify quality following standard guidelines:
| CV Range | Quality | Recommendation |
|---|---|---|
| < 5% | Excellent | Use without restrictions |
| 5-10% | Very good | Use with confidence |
| 10-15% | Good | Use for most purposes |
| 15-25% | Acceptable | Use with caution |
| 25-35% | Poor | Only for general trends |
| >= 35% | Unreliable | Do not publish |
# Evaluate quality of the API score estimate
cv_pct <- results$cv[1] * 100
quality <- evaluate_cv(cv_pct)
cat("CV:", round(cv_pct, 2), "%\n")
#> CV: 1.45 %
cat("Quality:", quality, "\n")
#> Quality: ExcellentRecipeWorkflow: Publishable Estimates
A RecipeWorkflow bundles estimation calls with metadata,
making the analysis reproducible and shareable. It records:
- Which recipes were used for data preparation
- Which estimation calls were performed
- Authorship and versioning information
Creating a RecipeWorkflow
wf <- RecipeWorkflow$new(
name = "API Score Analysis 2000",
description = "Mean API score estimation by school type",
user = "Research Team",
survey_type = "api",
edition = "2000",
estimation_type = "annual",
recipe_ids = character(0),
calls = list(
"survey::svymean(~api00, na.rm = TRUE)",
"survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE)"
)
)
wf
#>
#> ── Workflow: API Score Analysis 2000 ──
#> Author: Research Team
#> Survey: api / 2000
#> Version: 1.0.0
#> Description: Mean API score estimation by school type
#> Certification: community
#> Estimation types: annual
#>
#> ── Calls (2) ──
#> 1. survey::svymean(~api00, na.rm = TRUE)
#> 2. survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE)Publishing to the Registry
We publish the workflow so that others can discover and reuse it:
# Configure a local backend
wf_path <- tempfile(fileext = ".json")
set_workflow_backend("local", path = wf_path)
# Publish
publish_workflow(wf)
# Discover workflows
all_wf <- list_workflows()
length(all_wf)
#> [1] 1
# Search by text
found <- search_workflows("income")
length(found)
#> [1] 0
# Filter by survey type
ech_wf <- filter_workflows(survey_type = "ech")
length(ech_wf)
#> [1] 0Finding Workflows Associated with a Recipe
If you have a recipe and want to know which estimates have been
published for it, you can use
find_workflows_for_recipe():
# Create a workflow that references a recipe
wf2 <- RecipeWorkflow$new(
name = "Labor Market Estimates",
user = "Team",
survey_type = "ech",
edition = "2023",
estimation_type = "annual",
recipe_ids = c("labor_force_recipe_001"),
calls = list("survey::svymean(~employed, na.rm = TRUE)")
)
publish_workflow(wf2)
# Find all workflows that use this recipe
related <- find_workflows_for_recipe("labor_force_recipe_001")
length(related)
#> [1] 1
if (length(related) > 0) cat("Found:", related[[1]]$name, "\n")
#> Found: Labor Market EstimatesSharing via the Remote API
For broader dissemination, you can publish workflows to the metasurvey API:
# Requires authentication
api_login("you@example.com", "password")
# Publish
api_publish_workflow(wf)
# Browse
all <- api_list_workflows(survey_type = "ech")
specific <- api_get_workflow("workflow_id_here")Full Pipeline
Below is a complete pipeline from raw data to publishable estimation, using the API dataset:
# 1. Create survey from real data
dt_full <- data.table(apistrat)
svy_full <- Survey$new(
data = dt_full,
edition = "2000",
type = "api",
psu = NULL,
engine = "data.table",
weight = add_weight(annual = "pw")
)
# 2. Apply steps: compute derived variables
svy_full <- step_compute(svy_full,
api_growth = api00 - api99,
high_growth = ifelse(api00 - api99 > 50, 1L, 0L),
comment = "API score growth indicators"
)
svy_full <- step_recode(svy_full, school_level,
stype == "E" ~ "Elementary",
stype == "M" ~ "Middle",
stype == "H" ~ "High",
.default = "Other",
comment = "School level classification"
)
# 3. Estimate means
estimates <- workflow(
list(svy_full),
survey::svymean(~api_growth, na.rm = TRUE),
survey::svymean(~high_growth, na.rm = TRUE),
estimation_type = "annual"
)
estimates
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svymean: api_growth 32.8925184 2.1583789 0.06561914 28.6621734
#> 2: survey::svymean: high_growth 0.2938489 0.0363651 0.12375443 0.2225746
#> confint_upper
#> <num>
#> 1: 37.1228633
#> 2: 0.3651232
# 4. Domain estimation (by school type)
by_school <- workflow(
list(svy_full),
survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
estimation_type = "annual"
)
by_school
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svyby: api00 [stype=E] 674.43 12.49343 0.01852443 649.9433
#> 2: survey::svyby: api00 [stype=H] 625.82 15.34078 0.02451309 595.7526
#> 3: survey::svyby: api00 [stype=M] 636.60 16.50239 0.02592270 604.2559
#> confint_upper stype
#> <num> <fctr>
#> 1: 698.9167 E
#> 2: 655.8874 H
#> 3: 668.9441 M
# 5. Assess quality
for (i in seq_len(nrow(estimates))) {
cv_val <- estimates$cv[i] * 100
cat(
estimates$stat[i], ":",
round(cv_val, 1), "% CV -",
evaluate_cv(cv_val), "\n"
)
}
#> survey::svymean: api_growth : 6.6 % CV - Very good
#> survey::svymean: high_growth : 12.4 % CV - GoodProvenance: Data Lineage
Every Survey object records provenance
metadata: where the data came from, which steps were applied, how many
rows survived each step, and which versions of R and metasurvey were
used. This makes it possible to trace any estimate back to the raw
data.
# Provenance is populated automatically after bake_steps()
prov <- provenance(svy_full)
prov
#> ── Data Provenance ─────────────────────────────────────────────────────────────
#> Loaded: 2026-02-25T12:03:17
#> Initial rows: 200
#>
#> Environment:
#> metasurvey: 0.0.21
#> R: 4.5.2
#> survey: 4.5Provenance is also attached to workflow() results, so
you can always inspect the full lineage of an estimate:
prov_wf <- provenance(estimates)
cat("metasurvey version:", prov_wf$environment$metasurvey_version, "\n")
#> metasurvey version: 0.0.21
cat("Steps applied:", length(prov_wf$steps), "\n")
#> Steps applied: 0For audit trails, export provenance to JSON:
provenance_to_json(prov, "audit_trail.json")To compare two runs (e.g., different editions), use
provenance_diff():
diff <- provenance_diff(prov_2022, prov_2023)
diff$steps_changed
diff$n_final_changedPublication-Quality Tables
workflow_table() formats estimation results as
publication-ready tables using the gt package. It adds
confidence intervals, CV quality classification with color coding, and
provenance-based source notes.
workflow_table(estimates)| Survey Estimation Results | ||||||
| Statistic | Estimate | SE | CI Lower | CI Upper | CV (%) | Quality |
|---|---|---|---|---|---|---|
| :svymean: api_growth | 32.89 | 2.158 | 28.66 | 37.12 | 6.6 | Very good |
| :svymean: high_growth | 0.29 | 0.036 | 0.22 | 0.37 | 12.4 | Good |
| metasurvey 0.0.21 | CI: 95% | 2026-02-25 | ||||||
You can customize the output:
# Spanish locale, hide SE, custom title
workflow_table(
estimates,
locale = "es",
show_se = FALSE,
title = "API Growth Indicators",
subtitle = "California Schools, 2000"
)| API Growth Indicators | |||||
| California Schools, 2000 | |||||
| Statistic | Estimate | CI Lower | CI Upper | CV (%) | Quality |
|---|---|---|---|---|---|
| :svymean: api_growth | 32,89 | 28,66 | 37,12 | 6,6 | Very good |
| :svymean: high_growth | 0,29 | 0,22 | 0,37 | 12,4 | Good |
| metasurvey 0.0.21 | CI: 95% | 2026-02-25 | |||||
For domain estimates, the table detects group columns automatically:
workflow_table(by_school)| Survey Estimation Results | |||||||
| Statistic | stype | Estimate | SE | CI Lower | CI Upper | CV (%) | Quality |
|---|---|---|---|---|---|---|---|
| :svyby: api00 | E | 674.43 | 12.493 | 649.94 | 698.92 | 1.9 | Excellent |
| :svyby: api00 | H | 625.82 | 15.341 | 595.75 | 655.89 | 2.5 | Excellent |
| :svyby: api00 | M | 636.60 | 16.502 | 604.26 | 668.94 | 2.6 | Excellent |
| metasurvey 0.0.21 | CI: 95% | 2026-02-25 | |||||||
Export to any format supported by gt::gtsave():
tbl <- workflow_table(estimates)
gt::gtsave(tbl, "estimates.html")
gt::gtsave(tbl, "estimates.docx")
gt::gtsave(tbl, "estimates.png")Next Steps
- Creating and Publishing Recipes – Build reproducible transformation pipelines
- Survey Designs and Validation – Stratification, clustering, replicate weights
- Case Study: ECH – Complete labor market analysis with estimation