Model calibration (model-assisted weighting) • weightflow

Ordinary calibration forces the weighted sample to match known population totals of auxiliary variables. Model calibration (Wu and Sitter, 2001) goes further: it uses a working model for the outcome to build a better calibration variable. step_model_calibration() implements it, and (less commonly) lets you keep the ordinary consistency constraints on the auxiliaries in the same step.

Throughout we follow the design-based notation: $U = \{1, \dots, N\}$ is the finite population, $s \subseteq U$ the sample, and $w_i$ the weight of unit $i$ . The study variable is $y$ , observed only in $s$ , with population total $Y = \sum_{i \in U} y_i$ . Each unit carries a vector of auxiliaries $\mathbf{x}_i$ , known for every unit of $U$ , with population totals $\mathbf{X} = \sum_{i \in U} \mathbf{x}_i$ .

The idea

You do not know the total $Y$ and estimating it is the goal. But you can fit a working model $\hat y_i = m(\mathbf{x}_i)$ for the outcome on the sample, predict the fitted values for every unit in the sample and in the population, and calibrate so that the weighted sample total of the predictions equals their population total:

$\sum_{i \in s} w_i^{\mathrm{mc}}\,\hat y_i = \sum_{i \in U} \hat y_i .$

The calibration target is the right-hand side, $\sum_{i \in U} \hat y_i$ , the total of the predictions over the whole population. This is computable, because the predictors $\mathbf{x}_i$ are known for every population unit, so $\hat y_i = m(\mathbf{x}_i)$ can be evaluated and summed across $U$ . The total of $y$ itself, $Y = \sum_{i \in U} y_i$ , is not computable: $y$ is observed only in the sample. Model calibration therefore calibrates against the population total of the predictions as a computable stand-in for the unknown total of $y$ . The resulting estimator

$\hat Y_{\mathrm{mc}} = \sum_{i \in s} w_i^{\mathrm{mc}}\,y_i$

is design-consistent and efficient when the model predicts $y$ well, since its error depends on the residuals $y_i - \hat y_i$ rather than on $y$ itself. Wu (2003) established the optimality of this estimator within a class of calibration estimators, and Montanari and Ranalli (2005) extended it to nonparametric models, the natural home of the tree and forest learners.

In weightflow, with a random forest

step_model_calibration() takes consistency auxiliaries (x_formula, with known population totals), a named list of working models from y_model(), and the population frame used to compute the prediction totals. Here, the working model is a random forest (although other models can be used), which captures nonlinear structure in the predictors without specifying it.

spec <- weighting_spec(sample_survey, base_weights = pw) |>
  step_nonresponse(respondent = responded, method = "weighting_class",
                   by = "region") |>
  step_model_calibration(
    x_formula  = ~ region + sex,                       # consistency constraints
    models     = list(income = y_model(income ~ age + sex + region,
                                       engine = "forest")),
    population = population)
fitted <- prep(spec)

The model is fitted on the responding units only (where income is observed), weighted by the weights entering the step

Validation: the constraints are met exactly

Both kinds of constraint hold afterwards: the weighted totals of the consistency auxiliaries match their known population totals, and the weighted total of the income predictions matches its population total. The step’s diagnostics show target versus achieved:

fitted$steps[[2]]$diagnostics
#>              constraint            type   target achieved
#> (Intercept) (Intercept) X (consistency)     4495     4495
#> regionSouth regionSouth X (consistency)     1250     1250
#> regionEast   regionEast X (consistency)      927      927
#> regionWest   regionWest X (consistency)      748      748
#> sexM               sexM X (consistency)     2184     2184
#> income           income       y (model) 91802625 91802625

The X (consistency) rows are the region and sex totals; the y (model) row is the population total of the forest predictions. Each achieved equals its target.

Does it help the estimate?

Because the weights are calibrated on the income predictions, the weighted estimate of the income total tracks the true population total:

est_total  <- sum(fitted$final_weight * sample_survey$income, na.rm = TRUE)
true_total <- sum(population$income)
c(estimated = est_total, population = true_total,
  rel_error = (est_total - true_total) / true_total)
#>    estimated   population    rel_error 
#> 8.971419e+07 8.674501e+07 3.422891e-02

The estimator stays design-based, $\hat Y_{\mathrm{mc}} = \sum_{i \in s} w_i\,y_i$ ; the model only shapes the weights to make it efficient. A poor model does not bias the result, it just yields a less efficient one, because the design-based form is (approximately) unbiased regardless of how well $m$ fits.

What you need: unit-level auxiliaries

This is the practical asymmetry that sets model calibration apart. Raking, post-stratification and GREG need only the population totals $\mathbf{X}$ of the auxiliaries (e.g. numbers such as “1570 people in the North, 2311 women”) which official statistics routinely publish. Model calibration needs more: the auxiliaries $\mathbf{x}_i$ at the unit level for the whole population, because the fitted values $\hat y_i$ must be evaluated for every population unit before being summed. This is the complete auxiliary information of the Wu-Sitter setting, and it confines the method to situations where a population register, a census microdata file, or a large reference frame is available. When only published totals are at hand, the analyst is restricted to raking, post-stratification or GREG.

That difference is visible in the arguments: step_calibrate() takes margins or totals (vectors of totals), whereas step_model_calibration() takes population, a data frame with one row per population unit carrying the predictors.

The hybrid: model term and consistency constraints

The distinctive part of the step is the hybrid: it imposes the consistency constraints on the raw auxiliary totals and the model-calibration constraint on the predictions at once,

$\sum_{i \in s} w_i\,\mathbf{x}_i = \mathbf{X} \qquad\text{and}\qquad \sum_{i \in s} w_i\,\hat y_i = \sum_{i \in U} \hat y_i .$

Whether this second constraint adds anything depends on the model. If $m$ is linear in $\mathbf{x}$ , the predictions lie in the span of $\mathbf{x}$ , the second constraint is already implied by the first, and the estimator reduces to the generalized regression (GREG) estimator. If $m$ is nonlinear (e.g regression tree or a random forest) the predictions carry information beyond the linear terms, and the two sets of constraints become complementary: the auxiliary totals enforce internal consistency with published figures (benchmarking), while the model term injects the efficiency of the nonlinear fit. The hybrid is therefore meaningful precisely for the flexible learners that motivate it.

Relationship to the literature

Model calibration was introduced by Wu and Sitter (2001), who showed it handles any linear or nonlinear working model and reduces to the Deville-Sarndal calibration / GREG estimator in the linear case. Wu (2003) proved its optimality within a class of calibration estimators, and Montanari and Ranalli (2005) extended it to nonparametric models, close in spirit to the forest used here. Sarndal (2007) reviews the approach within the general calibration framework.

Combining the model term with the ordinary consistency constraints is supported by the fact that calibration accommodates an arbitrary number of constraints; the practical motivation (efficiency from the model plus benchmarking to published totals) is standard. What is uncommon is a ready-to-use implementation: the major R survey packages calibrate on auxiliary totals (GREG, raking, post-stratification) but do not fit an outcome model and calibrate on its predictions, let alone with flexible engines combined with consistency constraints. step_model_calibration() packages exactly that. It is an accessible implementation of an established idea, not a new method.

References.

Wu & Sitter (2001), JASA 96(453), 185–193.
Wu (2003), Biometrika 90(4), 937–951.
Montanari & Ranalli (2005), JASA 100, 1429–1442.
Sarndal (2007), Survey Methodology 33(2), 99–119.
Deville & Sarndal (1992), JASA 87(418), 376–382.