Validation against the survey package

weightflow’s calibration is meant to reproduce the established results of the survey package for the methods the two share, namely raking, post-stratification and linear (GREG) calibration, while adding the staged cascade (eligibility, nonresponse, selection) and a recipe-aware bootstrap on top. This vignette checks that agreement directly: given the same starting weights and the same control totals, the two packages return the same weights, up to numerical tolerance.

To make every unit comparable one-to-one, the recipes below use only the calibration step (no dropping or nonresponse), so no rows are removed.

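library(weightflow)

# sample_survey and population are taken to be the example data shipped with weightflow
d <- sample_survey
N <- nrow(population)  # population size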

Post-stratification

Here we post-stratify to the population counts of region: within each region, every weight is multiplied by the same factor, so that the weighted sample count matches the known population total.

library(survey)
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#> 
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart

# weightflow
wf <- weighting_spec(d, base_weights = pw) |>
  step_calibrate(method = "poststratify",
                 margins = list(region = c(table(population$region)))) |>
  prep()
w_wf <- wf$final_weight

# survey
des    <- svydesign(ids = ~1, weights = ~pw, data = d)
pr     <- data.frame(region = names(table(population$region)),
                     Freq = as.numeric(table(population$region)))
des_ps <- postStratify(des, ~region, pr)
w_sv   <- weights(des_ps)

c(max_abs_weight_diff = max(abs(w_wf - w_sv)))
#> max_abs_weight_diff 
#>                   0
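
The rescaling described in this section can also be written out by hand. The following is a minimal base-R sketch on made-up data (the `toy` data frame and `pop_counts` are hypothetical, not part of either package); it only illustrates the arithmetic and is not how weightflow or survey implement it.

```r
# Toy data: base weights and a stratum label for six sampled units (made up).
toy <- data.frame(
  region = c("north", "north", "south", "south", "south", "west"),
  pw     = c(10, 12, 8, 9, 11, 20),
  stringsAsFactors = FALSE
)

# Hypothetical known population counts per region.
pop_counts <- c(north = 30, south = 35, west = 25)

# Weighted sample count in each region, as a named numeric vector.
wtd_counts <- vapply(split(toy$pw, toy$region), sum, numeric(1))

# One adjustment factor per region: known total divided by weighted count.
ps_factor <- pop_counts[names(wtd_counts)] / wtd_counts

# Every unit gets its region's factor applied to its base weight.
toy$w_ps <- toy$pw * unname(ps_factor[toy$region])

# Weighted counts after adjustment; these should equal pop_counts.
ps_check <- vapply(split(toy$w_ps, toy$region), sum, numeric(1))
ps_check
```

Because each stratum has a single closed-form factor, there is nothing to iterate, which is consistent with the two packages agreeing to the last digit here.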

Raking

Next we rake (iterative proportional fitting) to the region and sex margins. We tighten survey’s convergence tolerance (epsilon) so that both packages solve the system to comparable precision.

# weightflow
wf <- weighting_spec(d, base_weights = pw) |>
  step_calibrate(method = "raking",
                 margins = list(region = c(table(population$region)),
                                sex    = c(table(population$sex)))) |>
  prep()
w_wf <- wf$final_weight

# survey (tight epsilon so it fully converges, like weightflow)
des    <- svydesign(ids = ~1, weights = ~pw, data = d)
# pr (region totals) comes from the post-stratification section; ps is its analogue for sex
ps     <- data.frame(sex = names(table(population$sex)),
                     Freq = as.numeric(table(population$sex)))
des_rk <- rake(des, list(~region, ~sex), list(pr, ps),
               control = list(epsilon = 1e-10, maxit = 100))
w_sv   <- weights(des_rk)

c(max_abs_weight_diff = max(abs(w_wf - w_sv)))
#> max_abs_weight_diff 
#>        1.110972e-09
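
To see why the two results differ only at the level of the convergence tolerance, it may help to look at the iteration itself. Below is a minimal base-R sketch of iterative proportional fitting on made-up data. The helper `rake_once()`, the toy totals, the stopping rule and the iteration cap are all assumptions of this sketch, not the internals of weightflow or survey.

```r
# Toy sample: one unit per region-by-sex cell, with made-up base weights.
toy <- data.frame(
  region = c("north", "north", "south", "south", "west", "west"),
  sex    = c("f", "m", "f", "m", "f", "m"),
  pw     = c(10, 12, 8, 9, 11, 20),
  stringsAsFactors = FALSE
)

# Hypothetical control totals; both margins sum to the same population size.
region_totals <- c(north = 30, south = 25, west = 45)
sex_totals    <- c(f = 48, m = 52)

# Rescale weights so the weighted counts of one variable hit its totals.
rake_once <- function(w, group, totals) {
  current <- vapply(split(w, group), sum, numeric(1))
  w * unname(totals[group] / current[group])
}

# Alternate between the two margins until the weights stop changing.
w_rk <- toy$pw
for (iter in seq_len(200)) {
  w_old <- w_rk
  w_rk  <- rake_once(w_rk, toy$region, region_totals)
  w_rk  <- rake_once(w_rk, toy$sex, sex_totals)
  if (max(abs(w_rk - w_old)) < 1e-12) break
}

# Both margins should now be matched to within the stopping tolerance.
region_fit <- vapply(split(w_rk, toy$region), sum, numeric(1))
sex_fit    <- vapply(split(w_rk, toy$sex), sum, numeric(1))
```

Each pass fixes one margin exactly and slightly disturbs the other, so the answer depends on when the loop stops. Two implementations with different stopping rules will therefore agree only up to roughly that tolerance.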

Linear (GREG) calibration

Finally, we calibrate linearly to the column totals of the population design matrix for ~ region + sex. These include the intercept, whose total is the population size N.

totals <- colSums(model.matrix(~ region + sex, population))

# weightflow
wf <- weighting_spec(d, base_weights = pw) |>
  step_calibrate(method = "linear", formula = ~ region + sex, totals = totals) |>
  prep()
w_wf <- wf$final_weight

# survey
des     <- svydesign(ids = ~1, weights = ~pw, data = d)
des_cal <- calibrate(des, ~ region + sex, population = totals, calfun = "linear")
w_sv    <- weights(des_cal)

c(max_abs_weight_diff = max(abs(w_wf - w_sv)))
#> max_abs_weight_diff 
#>        3.552714e-15
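
Linear calibration has a closed-form solution, which is why agreement here is at machine precision. The sketch below uses the textbook form w_i = d_i (1 + x_i' lambda) with lambda = (X' D X)^{-1} (T - X' d), on made-up data. The toy totals are hypothetical, and the code is an illustration under the assumption of unit scale factors, not the code either package runs.

```r
# Toy sample: one unit per region-by-sex cell, with made-up base weights.
toy <- data.frame(
  region = factor(c("north", "north", "south", "south", "west", "west")),
  sex    = factor(c("f", "m", "f", "m", "f", "m")),
  pw     = c(10, 12, 8, 9, 11, 20)
)

# Sample design matrix: intercept, regionsouth, regionwest, sexm.
X <- model.matrix(~ region + sex, toy)

# Hypothetical population totals for those columns (intercept = population size).
toy_totals <- c("(Intercept)" = 100, regionsouth = 25, regionwest = 45, sexm = 52)

base_w <- toy$pw

# Solve (X' D X) lambda = T - X' d, where D = diag(base_w).
lambda <- solve(crossprod(X, base_w * X), toy_totals - colSums(base_w * X))

# Calibrated weights: base weight times a linear adjustment in x.
w_lin <- base_w * (1 + drop(X %*% lambda))

# Weighted column totals after calibration; these should equal toy_totals.
fit_totals <- colSums(w_lin * X)
```

Note that the linear adjustment is not constrained to be positive, so in less well-behaved data some calibrated weights can come out negative.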

Same estimates

The agreement carries over to estimates: the raked total of a survey outcome (here employed) agrees between the two packages to within the raking tolerance.

wf <- weighting_spec(d, base_weights = pw) |>
  step_calibrate(method = "raking",
                 margins = list(region = c(table(population$region)),
                                sex    = c(table(population$sex)))) |>
  prep()
total_wf <- sum(wf$final_weight * d$employed, na.rm = TRUE)

des    <- svydesign(ids = ~1, weights = ~pw, data = d)
des_rk <- rake(des, list(~region, ~sex), list(pr, ps),
               control = list(epsilon = 1e-10, maxit = 100))
total_sv <- as.numeric(svytotal(~employed, des_rk, na.rm = TRUE))

c(weightflow = total_wf, survey = total_sv, difference = total_wf - total_sv)
#>   weightflow       survey   difference 
#> 1.084772e+03 1.084772e+03 3.450396e-09

What weightflow adds

The point of agreement is trust: where the methods overlap, weightflow returns the same weights as survey, exactly for post-stratification and to within about 1e-9 or better for raking and linear calibration in the checks above. On top of that shared core, weightflow contributes the staged cascade (unknown eligibility, ineligible dropping, within-household selection, and person- or household-level nonresponse, each as a pipeable step with diagnostics) and a bootstrap that re-applies the whole recipe on each replicate, so the variance reflects every adjustment (see the Variance estimation article). For design-based inference you can always export the final weights back to survey/srvyr.
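
As a rough sketch of that last step, final weights can be attached to the data as an ordinary column and handed to `svydesign()`. The `toy` data frame and its `final_weight` values are made up for illustration; in practice the column would hold the `final_weight` vector from a prepped recipe. The survey calls are guarded so the block still runs where survey is not installed.

```r
# Hypothetical outcome and final weights for five units.
toy <- data.frame(
  employed     = c(1, 0, 1, 1, 0),
  final_weight = c(12.5, 9.0, 14.2, 10.3, 11.0)
)

# A weighted total is just the sum of weight times outcome.
total_by_hand <- sum(toy$final_weight * toy$employed)

# The same total through a survey design built on the exported weights.
total_sv <- NA_real_
if (requireNamespace("survey", quietly = TRUE)) {
  des_out  <- survey::svydesign(ids = ~1, weights = ~final_weight, data = toy)
  total_sv <- as.numeric(coef(survey::svytotal(~employed, des_out)))
}

c(by_hand = total_by_hand, survey = total_sv)
```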