Building an ECH Demographics Recipe
Source: vignettes/ech-demographics-recipe.Rmd

Two paths: transpile fast or build right
metasurvey offers two ways to create recipes from existing STATA code:
- Transpile automatically with `transpile_stata()` – converts `.do` files to recipes in seconds. Great for migrating legacy code quickly (see `vignette("stata-transpiler")`).
- Build from scratch in R – more work upfront, but the result is cleaner, uses proper R idioms, and you control every detail.
The transpiler is a pragmatic shortcut: it reads hundreds of lines of
STATA and produces a working recipe, but the output inherits the
original code’s structure – long `gen`/`replace`
chains become long `step_recode` calls, temporary variables
survive, and STATA-specific patterns (like `mvencode`) get
translated literally rather than rethought.
A hand-crafted recipe, on the other hand, lets you redesign the logic in R from the ground up. You pick meaningful variable names, combine related transformations into a single step, and skip intermediate variables that only existed because STATA needed them. The result is shorter, easier to read, and easier to maintain.
This vignette builds a demographics recipe from scratch in about 20
lines of R. A transpiled version of the same pipeline would take 80+
steps and carry over variable names like `bc_pe2` and
`bc_pe3` that mean nothing outside the original
`.do` file.
Setting up the survey
We start with an empty Survey object. This declares the survey type and edition without loading any data yet – the recipe will work on whatever data we feed it later.
library(metasurvey)
svy <- survey_empty(type = "ech", edition = "2023")
svy

Now let’s attach some sample data. In production this would come from `anda_download_microdata("2023")` or a local file; here we simulate it.
set.seed(42)
n <- 200
dt <- data.table::data.table(
id = rep(1:50, each = 4),
nper = rep(1:4, 50),
pesoano = runif(n, 50, 300),
e26 = sample(1:2, n, replace = TRUE),
e27 = sample(0:90, n, replace = TRUE),
e30 = sample(1:7, n, replace = TRUE),
e51_2 = sample(c(0:6, -9), n, replace = TRUE),
region_4 = sample(1:4, n, replace = TRUE)
)
svy <- svy |> set_data(dt)

Building the pipeline
Every transformation is a step. By default, steps are lazy: they record what to do without executing it. This lets you inspect and modify the pipeline before materializing the results.
Compare this with the transpiler approach:
transpile_stata() would produce one step per STATA command,
faithfully preserving every gen and replace.
Here we think in terms of the output variables we want, not the
commands we need to type.
Rename identifiers
svy <- svy |>
step_rename(
hh_id = "id", person_id = "nper",
comment = "Standardize identifiers"
)

Nothing happened to the data yet – the original column names are still there because the step is pending. Let’s keep adding steps.
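To verify, inspect the still-untransformed data. A quick sketch using `get_data()` and `get_steps()`, the accessors this vignette uses later:

```r
# The rename is recorded, not applied: raw names survive for now
names(get_data(svy))   # still includes "id" and "nper", no "hh_id" yet
length(get_steps(svy)) # one pending step
```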
Recode sex
In STATA this would be a `gen` + `replace` +
`replace` sequence (3 commands). With
`step_recode` it’s a single, declarative mapping that
produces human-readable labels:
svy <- svy |>
step_recode(sex,
e26 == 1 ~ "Male",
e26 == 2 ~ "Female",
.default = NA_character_,
comment = "Sex from e26"
)

Age groups
The STATA equivalent uses five `replace` lines with
`inrange()`. Here we write the same logic as a single recode
with readable conditions:
svy <- svy |>
step_recode(age_group,
e27 >= 0 & e27 <= 13 ~ "Child",
e27 >= 14 & e27 <= 17 ~ "Adolescent",
e27 >= 18 & e27 <= 29 ~ "Young adult",
e27 >= 30 & e27 <= 64 ~ "Adult",
e27 >= 65 ~ "Elderly",
.default = NA_character_,
comment = "Age groups from e27"
)

Relationship to head of household
svy <- svy |>
step_recode(relationship,
e30 == 1 ~ "Head",
e30 == 2 ~ "Spouse",
e30 >= 3 & e30 <= 5 ~ "Child",
e30 == 6 ~ "Other relative",
e30 == 7 ~ "Non-relative",
.default = "Unknown",
comment = "Relationship from e30"
)

Education level
svy <- svy |>
step_recode(edu_level,
e51_2 == 0 ~ "None",
e51_2 >= 1 & e51_2 <= 2 ~ "Primary",
e51_2 >= 3 & e51_2 <= 4 ~ "Secondary",
e51_2 >= 5 & e51_2 <= 6 ~ "Tertiary",
.default = NA_character_,
comment = "Education level from e51_2"
)

Geographic area
svy <- svy |>
step_recode(area,
region_4 == 1 ~ "Montevideo",
region_4 == 2 ~ "Urban >5k",
region_4 == 3 ~ "Urban <5k",
region_4 == 4 ~ "Rural",
.default = NA_character_,
comment = "Geographic area from region_4"
)

Notice that all our output variables have meaningful labels instead of numeric codes. A transpiled recipe would keep the original integer codes (1, 2, 3…) because that’s what the STATA code used. Building from scratch lets you choose the representation that makes analysis easier.
Remove raw variables
svy <- svy |>
step_remove(e26, e27, e30, e51_2, region_4,
comment = "Drop raw ECH variables"
)

Inspecting the pipeline before execution
At this point we have seven pending steps. Let’s see what was recorded:
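The rendered output is omitted here; a sketch of listing the recorded steps with `get_steps()`, the same accessor used below when packaging the recipe:

```r
# The pipeline is still lazy: these are recorded steps, not results
steps <- get_steps(svy)
length(steps) # 7: one rename, five recodes, one removal
```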
This is one of the key advantages of building from scratch: 7 steps that each do one clear thing. A transpiled version of the full IECON demographics module has 80+ steps because it preserves every intermediate STATA command.
The pipeline is a DAG (directed acyclic graph) of transformations.
view_graph() renders it as an interactive network – each
node is a step, and edges show variable dependencies:
view_graph(svy)

The interactive DAG is not rendered in this vignette to keep the
package size small. Run view_graph() in your R session to
explore it. With only 7 nodes the graph is clean and navigable. Compare
that with a transpiled recipe where the DAG can have 100+ nodes – still
useful for auditing, but much harder to read at a glance.
For static output we can inspect the step list:
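For example, a compact text summary could loop over the steps and print the comment attached to each one – assuming each step object exposes the `comment` supplied when it was created:

```r
# One line per step: the comments double as documentation
for (s in get_steps(svy)) cat("-", s$comment, "\n")
```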
Packaging as a recipe (before baking)
A recipe bundles the steps with metadata so anyone can reproduce the same pipeline on different data. We create the recipe before baking – the lazy steps are the pipeline:
rec <- steps_to_recipe(
name = "ECH Demographics (minimal)",
user = "research_team",
svy = svy,
steps = get_steps(svy),
description = paste(
"Harmonized demographics: sex, age group, relationship,",
"education level, and geographic area."
),
topic = "demographics"
)
rec

The recipe auto-generates documentation from the steps:
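The printed output is not shown here, but the recipe's metadata can be inspected directly – assuming the list-style fields used later in this vignette (`rec2$name`, `rec2$steps`) and a `description` field matching the argument above:

```r
rec$name          # "ECH Demographics (minimal)"
length(rec$steps) # 7
rec$description
```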
Baking: materializing the pipeline
Now let’s execute the steps. bake_steps() runs all
pending steps in order and returns the transformed survey:
svy <- bake_steps(svy)

The data has the new columns with readable labels:
The raw variables are gone:
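Both claims are easy to check with a couple of lines (a sketch using `get_data()`):

```r
d <- get_data(svy)
head(d[, .(hh_id, person_id, sex, age_group, relationship, edu_level, area)])

# step_remove() dropped the raw ECH variables
c("e26", "e27", "e30", "e51_2", "region_4") %in% names(d) # all FALSE
```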
Saving and loading
Recipes serialize to JSON for version control and sharing:
f <- tempfile(fileext = ".json")
save_recipe(rec, f)
rec2 <- read_recipe(f)
rec2$name
length(rec2$steps)

The JSON is human-readable and diffable in git:
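To confirm, print the start of the file with base R:

```r
# First lines of the serialized recipe – plain JSON, friendly to git diff
cat(readLines(f, n = 15), sep = "\n")
```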
Applying to a new edition
The same recipe works on any edition. Load the recipe from JSON, attach it to new data, and bake:
rec_loaded <- read_recipe(f)
svy_2024 <- survey_empty(type = "ech", edition = "2024") |>
set_data(data.table::data.table(
id = rep(1:30, each = 3),
nper = rep(1:3, 30),
pesoano = runif(90, 50, 300),
e26 = sample(1:2, 90, replace = TRUE),
e27 = sample(0:90, 90, replace = TRUE),
e30 = sample(1:7, 90, replace = TRUE),
e51_2 = sample(c(0:6, -9), 90, replace = TRUE),
region_4 = sample(1:4, 90, replace = TRUE)
)) |>
add_recipe(rec_loaded) |>
bake_recipes()
head(get_data(svy_2024)[, .(hh_id, person_id, sex, age_group, area)])

No code changes needed. The recipe encodes the logic, not the data.
Transpiler vs hand-crafted: when to use each
| | Transpiler (`transpile_stata()`) | Hand-crafted recipe |
|---|---|---|
| Speed | Seconds – instant migration | Hours – requires understanding the logic |
| Steps | 80-200 per module (one per STATA line) | 5-20 (one per concept) |
| Variable names | Inherits STATA names (`bc_pe2`, `bc_pe3`) | Your own names (`sex`, `age_group`) |
| Labels | Numeric codes (1, 2, 3) | Readable labels ("Male", "Female") |
| Readability | Faithful to original, verbose | Clean, self-documenting |
| Maintenance | Hard to modify individual steps | Easy to change any mapping |
| DAG visualization | Large, hard to read | Compact, meaningful nodes |
| Best for | Migrating legacy code fast | New projects, critical pipelines |
Recommended workflow: use
transpile_stata() to migrate your existing .do
files immediately so you have a working baseline. Then gradually replace
transpiled recipes with hand-crafted ones as you review each module. The
transpiled version keeps you running; the hand-crafted version is where
you want to end up.
What metasurvey gives you
| Manual STATA scripts | metasurvey recipe |
|---|---|
| Copy-paste `.do` files per year | One recipe, any edition |
| Undocumented variable names | Auto-generated input/output docs |
| No dependency tracking | DAG visualization with `view_graph()` |
| Flat scripts, no validation | `validate()` checks required variables |
| Email `.do` files to colleagues | `publish_recipe()` to shared registry |
| Re-run entire script to test | Lazy steps: inspect before baking |
Next steps
- Add more variables: income, labor force status, housing conditions – each can be a separate recipe with `depends_on_recipes`
- Certify your recipe: use `certify_recipe()` to mark it as reviewed or official
- Publish: `publish_recipe(rec)` uploads to the shared registry where others can find it with `search_recipes(topic = "demographics")`
- Start from STATA: if you already have `.do` files, use `transpile_stata()` to generate a working baseline immediately – see `vignette("stata-transpiler")` – then refine the output into a hand-crafted recipe like the one in this vignette