Two paths: transpile fast or build right

metasurvey offers two ways to create recipes from existing STATA code:

  1. Transpile automatically with transpile_stata() – converts .do files to recipes in seconds. Great for migrating legacy code quickly (see vignette("stata-transpiler")).
  2. Build from scratch in R – more work upfront, but the result is cleaner, uses proper R idioms, and you control every detail.

The transpiler is a pragmatic shortcut: it reads hundreds of lines of STATA and produces a working recipe, but the output inherits the original code’s structure – long gen/replace chains become long step_recode calls, temporary variables survive, and STATA-specific patterns (like mvencode) get translated literally rather than rethought.

A hand-crafted recipe, on the other hand, lets you redesign the logic in R from the ground up. You pick meaningful variable names, combine related transformations into a single step, and skip intermediate variables that only existed because STATA needed them. The result is shorter, easier to read, and easier to maintain.

This vignette builds a demographics recipe from scratch in seven steps. A transpiled version of the same pipeline would take 80+ steps and carry over variable names like bc_pe2 and bc_pe3 that mean nothing outside the original .do file.

Setting up the survey

We start with an empty Survey object. This declares the survey type and edition without loading any data yet – the recipe will work on whatever data we feed it later.

library(metasurvey)

svy <- survey_empty(type = "ech", edition = "2023")
svy

Now let’s attach some sample data. In production this would come from anda_download_microdata("2023") or a local file; here we simulate it.

set.seed(42)
n <- 200
dt <- data.table::data.table(
  id       = rep(1:50, each = 4),
  nper     = rep(1:4, 50),
  pesoano  = runif(n, 50, 300),
  e26      = sample(1:2, n, replace = TRUE),
  e27      = sample(0:90, n, replace = TRUE),
  e30      = sample(1:7, n, replace = TRUE),
  e51_2    = sample(c(0:6, -9), n, replace = TRUE),
  region_4 = sample(1:4, n, replace = TRUE)
)

svy <- svy |> set_data(dt)

Building the pipeline

Every transformation is a step. By default, steps are lazy: they record what to do without executing it. This lets you inspect and modify the pipeline before materializing the results.

Compare this with the transpiler approach: transpile_stata() would produce one step per STATA command, faithfully preserving every gen and replace. Here we think in terms of the output variables we want, not the commands we need to type.
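To picture what "lazy" means here, the following toy model in base R stores each step as a function and only runs it when asked. It is a sketch of the idea, not metasurvey's internals: new_pipeline(), add_step() and bake() are hypothetical helpers invented for this illustration.

```r
# Toy model of a lazy pipeline (illustrative only; not how metasurvey works inside).
# A pipeline is a data set plus a list of recorded, not-yet-executed steps.
new_pipeline <- function(data) list(data = data, steps = list())

# Recording a step only appends it to the list; the data is left untouched.
add_step <- function(p, fn, comment) {
  p$steps[[length(p$steps) + 1]] <- list(fn = fn, comment = comment)
  p
}

# Baking applies every recorded step in order and clears the pending list.
bake <- function(p) {
  for (s in p$steps) p$data <- s$fn(p$data)
  p$steps <- list()
  p
}

p <- new_pipeline(data.frame(id = 1:3, e26 = c(1, 2, 1)))
p <- add_step(
  p,
  function(d) { names(d)[names(d) == "id"] <- "hh_id"; d },
  "Standardize identifiers"
)

names(p$data)      # the step is recorded, but the column names are unchanged
baked <- bake(p)
names(baked$data)  # after baking, the rename has been applied
```

Because the pending steps are ordinary list elements, they can be inspected, reordered, or dropped before anything touches the data.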

Rename identifiers

svy <- svy |>
  step_rename(
    hh_id = "id", person_id = "nper",
    comment = "Standardize identifiers"
  )

Nothing happened to the data yet:

names(get_data(svy))[1:4]

The original column names are still there because the step is pending. Let’s keep adding steps.

Recode sex

In STATA this would be a gen + replace + replace sequence (3 commands). With step_recode it’s a single, declarative mapping that produces human-readable labels:

svy <- svy |>
  step_recode(sex,
    e26 == 1 ~ "Male",
    e26 == 2 ~ "Female",
    .default = NA_character_,
    comment = "Sex from e26"
  )
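The difference between the two styles can be shown in plain R, without metasurvey. This sketch mirrors the gen + replace + replace sequence first and then expresses the same mapping as a single lookup table. The input values are made up, and 9 stands in for any unexpected code.

```r
e26 <- c(1, 2, 2, NA, 9)  # made-up codes; 9 represents an unexpected value

# Imperative style, mirroring gen + replace + replace:
# start from missing, then overwrite one category at a time.
sex_imperative <- rep(NA_character_, length(e26))
sex_imperative[e26 %in% 1] <- "Male"
sex_imperative[e26 %in% 2] <- "Female"

# Declarative style: the whole mapping lives in one named vector.
sex_map <- c("1" = "Male", "2" = "Female")
sex_lookup <- unname(sex_map[as.character(e26)])

# Both approaches agree, including on the missing and unexpected codes.
identical(sex_imperative, sex_lookup)
```

The mapping in the declarative version can be read, reviewed, and changed in one place, which is the property step_recode is aiming for.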

Age groups

The STATA equivalent uses five replace lines with inrange(). Here we write the same logic as a single recode with readable conditions:

svy <- svy |>
  step_recode(age_group,
    e27 >= 0 & e27 <= 13 ~ "Child",
    e27 >= 14 & e27 <= 17 ~ "Adolescent",
    e27 >= 18 & e27 <= 29 ~ "Young adult",
    e27 >= 30 & e27 <= 64 ~ "Adult",
    e27 >= 65 ~ "Elderly",
    .default = NA_character_,
    comment = "Age groups from e27"
  )
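A quick way to double-check the age boundaries is to reproduce them with base R's cut() on the edge values. This is a cross-check under one assumption, that ages are whole numbers: cut() works on half-open intervals, so a fractional age such as 13.5 would land in "Child" here while the recode above would send it to the default. Note also that cut() returns a factor rather than a character vector.

```r
# Ages chosen to sit exactly on each boundary of the recode above
age <- c(0, 13, 14, 17, 18, 29, 30, 64, 65, 90)

age_group <- cut(
  age,
  breaks = c(0, 14, 18, 30, 65, Inf),
  labels = c("Child", "Adolescent", "Young adult", "Adult", "Elderly"),
  right = FALSE  # intervals are [lower, upper), so 14 opens "Adolescent"
)

# Side-by-side view of each boundary age and its group
data.frame(age, age_group)
```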

Relationship to head of household

svy <- svy |>
  step_recode(relationship,
    e30 == 1 ~ "Head",
    e30 == 2 ~ "Spouse",
    e30 >= 3 & e30 <= 5 ~ "Child",
    e30 == 6 ~ "Other relative",
    e30 == 7 ~ "Non-relative",
    .default = "Unknown",
    comment = "Relationship from e30"
  )

Education level

svy <- svy |>
  step_recode(edu_level,
    e51_2 == 0 ~ "None",
    e51_2 >= 1 & e51_2 <= 2 ~ "Primary",
    e51_2 >= 3 & e51_2 <= 4 ~ "Secondary",
    e51_2 >= 5 & e51_2 <= 6 ~ "Tertiary",
    .default = NA_character_,
    comment = "Education level from e51_2"
  )

Geographic area

svy <- svy |>
  step_recode(area,
    region_4 == 1 ~ "Montevideo",
    region_4 == 2 ~ "Urban >5k",
    region_4 == 3 ~ "Urban <5k",
    region_4 == 4 ~ "Rural",
    .default = NA_character_,
    comment = "Geographic area from region_4"
  )

Notice that all our output variables have meaningful labels instead of numeric codes. A transpiled recipe would keep the original integer codes (1, 2, 3…) because that’s what the STATA code used. Building from scratch lets you choose the representation that makes analysis easier.

Remove raw variables

svy <- svy |>
  step_remove(e26, e27, e30, e51_2, region_4,
    comment = "Drop raw ECH variables"
  )

Inspecting the pipeline before execution

At this point we have seven pending steps: one rename, five recodes, and one removal.

This is one of the key advantages of building from scratch: 7 steps that each do one clear thing. A transpiled version of the full IECON demographics module has 80+ steps because it preserves every intermediate STATA command.

The pipeline is a DAG (directed acyclic graph) of transformations. view_graph() renders it as an interactive network in which each node is a step and edges show variable dependencies.

The interactive DAG is not rendered in this vignette to keep the package size small; run view_graph() in your R session to explore it. With only 7 nodes the graph is clean and navigable. Compare that with a transpiled recipe, where the DAG can have 100+ nodes: still useful for auditing, but much harder to read at a glance.
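To make "edges show variable dependencies" concrete, here is a minimal base R sketch that derives edges from a list of steps. It assumes each step can be described by the variables it reads and writes. The step names, the reads/writes fields, and the sex_by_age step are hypothetical and only serve the illustration; this is not how view_graph() is implemented.

```r
# Hypothetical step descriptions: which variables each step reads and writes.
steps <- list(
  list(name = "recode_sex", reads = "e26",                 writes = "sex"),
  list(name = "recode_age", reads = "e27",                 writes = "age_group"),
  list(name = "sex_by_age", reads = c("sex", "age_group"), writes = "sex_age")
)

# An edge i -> j exists when a later step j reads a variable
# that an earlier step i writes.
edges <- data.frame(from = character(0), to = character(0),
                    stringsAsFactors = FALSE)
for (j in seq_along(steps)) {
  for (i in seq_len(j - 1)) {
    shared <- intersect(steps[[i]]$writes, steps[[j]]$reads)
    if (length(shared) > 0) {
      edges <- rbind(edges, data.frame(from = steps[[i]]$name,
                                       to = steps[[j]]$name,
                                       stringsAsFactors = FALSE))
    }
  }
}

edges
```

Edges only ever point from an earlier step to a later one, so the resulting graph cannot contain a cycle.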

For static output we can inspect the step list:

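# Print one line per pending step: its type and its comment (if any).
# %||% is part of base R from version 4.4.0; on older versions, rlang exports an equivalent.
for (s in get_steps(svy)) {
  cat(sprintf("[%s] %s\n", s$type, s$comment %||% ""))
}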

Packaging as a recipe (before baking)

A recipe bundles the steps with metadata so anyone can reproduce the same pipeline on different data. We create the recipe before baking – the lazy steps are the pipeline:

rec <- steps_to_recipe(
  name = "ECH Demographics (minimal)",
  user = "research_team",
  svy = svy,
  steps = get_steps(svy),
  description = paste(
    "Harmonized demographics: sex, age group, relationship,",
    "education level, and geographic area."
  ),
  topic = "demographics"
)

rec

The recipe auto-generates documentation from the steps:

doc <- rec$doc()
cat("Input variables: ", paste(doc$input_variables, collapse = ", "), "\n")
cat("Output variables:", paste(doc$output_variables, collapse = ", "), "\n")
cat("Pipeline steps:  ", length(doc$pipeline), "\n")
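One way to see where input and output variables come from is to work them out by hand for the seven steps of this vignette. The sketch below does that in base R with a hand-written list of what each step reads and writes. It illustrates the reasoning only and is not a description of how rec$doc() computes its result; treating the removal step as "reading" the raw variables is a modelling assumption.

```r
# Reads and writes for the seven steps of this vignette, written out by hand.
steps <- list(
  list(reads = c("id", "nper"), writes = c("hh_id", "person_id")),
  list(reads = "e26",           writes = "sex"),
  list(reads = "e27",           writes = "age_group"),
  list(reads = "e30",           writes = "relationship"),
  list(reads = "e51_2",         writes = "edu_level"),
  list(reads = "region_4",      writes = "area"),
  # Assumption: the removal step is modelled as reading the raw variables.
  list(reads = c("e26", "e27", "e30", "e51_2", "region_4"),
       writes = character(0))
)

inputs <- character(0)
outputs <- character(0)
for (s in steps) {
  # A variable is an input if it is read before any step has produced it.
  inputs <- union(inputs, setdiff(s$reads, outputs))
  outputs <- union(outputs, s$writes)
}

inputs
outputs
```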

Baking: materializing the pipeline

Now let’s execute the steps. bake_steps() runs all pending steps in order and returns the transformed survey:

svy <- bake_steps(svy)

The data has the new columns with readable labels:

head(get_data(svy)[, .(
  hh_id, person_id, sex, age_group, relationship,
  edu_level, area
)])

The raw variables are gone:

"e26" %in% names(get_data(svy))

Saving and loading

Recipes serialize to JSON for version control and sharing:

f <- tempfile(fileext = ".json")
save_recipe(rec, f)
rec2 <- read_recipe(f)
rec2$name
length(rec2$steps)

The JSON is human-readable and diffable in git:

cat(readLines(f, n = 15), sep = "\n")

Applying to a new edition

The same recipe works on any edition that provides the same input variables. Load the recipe from JSON, attach it to new data, and bake:

rec_loaded <- read_recipe(f)

svy_2024 <- survey_empty(type = "ech", edition = "2024") |>
  set_data(data.table::data.table(
    id       = rep(1:30, each = 3),
    nper     = rep(1:3, 30),
    pesoano  = runif(90, 50, 300),
    e26      = sample(1:2, 90, replace = TRUE),
    e27      = sample(0:90, 90, replace = TRUE),
    e30      = sample(1:7, 90, replace = TRUE),
    e51_2    = sample(c(0:6, -9), 90, replace = TRUE),
    region_4 = sample(1:4, 90, replace = TRUE)
  )) |>
  add_recipe(rec_loaded) |>
  bake_recipes()

head(get_data(svy_2024)[, .(hh_id, person_id, sex, age_group, area)])

No code changes needed. The recipe encodes the logic, not the data.
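Reusing a recipe across editions depends on the new data carrying the input columns the recipe expects. A small pre-flight check in base R can catch a renamed or missing column before baking. missing_inputs() is a hypothetical helper written for this sketch, separate from metasurvey's own validation, and the sexo column is an invented example of a renamed variable.

```r
# Hypothetical helper: which required input variables are absent from the data?
missing_inputs <- function(data, required) setdiff(required, names(data))

# Raw variables used by the steps in this vignette
required <- c("id", "nper", "e26", "e27", "e30", "e51_2", "region_4")

# One edition with the expected columns...
ech_same <- data.frame(id = 1, nper = 1, e26 = 2, e27 = 40, e30 = 1,
                       e51_2 = 3, region_4 = 1)

# ...and one where e26 was (hypothetically) renamed to sexo
ech_renamed <- data.frame(id = 1, nper = 1, sexo = 2, e27 = 40, e30 = 1,
                          e51_2 = 3, region_4 = 1)

missing_inputs(ech_same, required)     # nothing missing
missing_inputs(ech_renamed, required)  # reports the column the recipe cannot find
```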

Transpiler vs hand-crafted: when to use each

|  | Transpiler (transpile_stata()) | Hand-crafted recipe |
| --- | --- | --- |
| Speed | Seconds – instant migration | Hours – requires understanding the logic |
| Steps | 80-200 per module (one per STATA line) | 5-20 (one per concept) |
| Variable names | Inherits STATA names (bc_pe2, bc_pe3) | Your own names (sex, age_group) |
| Labels | Numeric codes (1, 2, 3) | Readable labels ("Male", "Female") |
| Readability | Faithful to original, verbose | Clean, self-documenting |
| Maintenance | Hard to modify individual steps | Easy to change any mapping |
| DAG visualization | Large, hard to read | Compact, meaningful nodes |
| Best for | Migrating legacy code fast | New projects, critical pipelines |

Recommended workflow: use transpile_stata() to migrate your existing .do files immediately so you have a working baseline. Then gradually replace transpiled recipes with hand-crafted ones as you review each module. The transpiled version keeps you running; the hand-crafted version is where you want to end up.

What metasurvey gives you

| Manual STATA scripts | metasurvey recipe |
| --- | --- |
| Copy-paste .do files per year | One recipe, any edition |
| Undocumented variable names | Auto-generated input/output docs |
| No dependency tracking | DAG visualization with view_graph() |
| Flat scripts, no validation | validate() checks required variables |
| Email .do files to colleagues | publish_recipe() to shared registry |
| Re-run entire script to test | Lazy steps: inspect before baking |

Next steps

  • Add more variables: income, labor force status, housing conditions – each can be a separate recipe with depends_on_recipes
  • Certify your recipe: use certify_recipe() to mark it as reviewed or official
  • Publish: publish_recipe(rec) uploads to the shared registry where others can find it with search_recipes(topic = "demographics")
  • Start from STATA: if you already have .do files, use transpile_stata() to generate a working baseline immediately – see vignette("stata-transpiler") – then refine the output into a hand-crafted recipe like the one in this vignette