
Model calibration (model-assisted weighting)
Source:vignettes/model-calibration.Rmd
model-calibration.RmdOrdinary calibration forces the weighted sample to match known
population totals of auxiliary variables. Model calibration (Wu
and Sitter, 2001) goes further: it uses a working model for the
outcome to build a better calibration variable.
step_model_calibration() implements it, and (less commonly)
lets you keep the ordinary consistency constraints on the auxiliaries in
the same step.
Throughout we follow the design-based notation: is the finite population, the sample, and the weight of unit . The study variable is , observed only in , with population total . Each unit carries a vector of auxiliaries , known for every unit of , with population totals .
The idea
You do not know the total and estimating it is the goal. But you can fit a working model for the outcome on the sample, predict the fitted values for every unit in the sample and in the population, and calibrate so that the weighted sample total of the predictions equals their population total:
The calibration target is the right-hand side, , the total of the predictions over the whole population. This is computable, because the predictors are known for every population unit, so can be evaluated and summed across . The total of itself, , is not computable: is observed only in the sample. Model calibration therefore calibrates against the population total of the predictions as a computable stand-in for the unknown total of . The resulting estimator
is design-consistent and efficient when the model predicts well, since its error depends on the residuals rather than on itself. Wu (2003) established the optimality of this estimator within a class of calibration estimators, and Montanari and Ranalli (2005) extended it to nonparametric models, the natural home of the tree and forest learners.
In weightflow, with a random forest
step_model_calibration() takes consistency auxiliaries
(x_formula, with known population totals), a named list of
working models from y_model(), and the
population frame used to compute the prediction totals.
Here, the working model is a random forest (although other models can be
used), which captures nonlinear structure in the predictors without
specifying it.
spec <- weighting_spec(sample_survey, base_weights = pw) |>
step_nonresponse(respondent = responded, method = "weighting_class",
by = "region") |>
step_model_calibration(
x_formula = ~ region + sex, # consistency constraints
models = list(income = y_model(income ~ age + sex + region,
engine = "forest")),
population = population)
fitted <- prep(spec)The model is fitted on the responding units only (where
income is observed), weighted by the weights entering the
step
Validation: the constraints are met exactly
Both kinds of constraint hold afterwards: the weighted totals of the consistency auxiliaries match their known population totals, and the weighted total of the income predictions matches its population total. The step’s diagnostics show target versus achieved:
fitted$steps[[2]]$diagnostics
#> constraint type target achieved
#> (Intercept) (Intercept) X (consistency) 4495 4495
#> regionSouth regionSouth X (consistency) 1250 1250
#> regionEast regionEast X (consistency) 927 927
#> regionWest regionWest X (consistency) 748 748
#> sexM sexM X (consistency) 2184 2184
#> income income y (model) 91802625 91802625The X (consistency) rows are the region and sex totals;
the y (model) row is the population total of the forest
predictions. Each achieved equals its
target.
Does it help the estimate?
Because the weights are calibrated on the income predictions, the weighted estimate of the income total tracks the true population total:
est_total <- sum(fitted$final_weight * sample_survey$income, na.rm = TRUE)
true_total <- sum(population$income)
c(estimated = est_total, population = true_total,
rel_error = (est_total - true_total) / true_total)
#> estimated population rel_error
#> 8.971419e+07 8.674501e+07 3.422891e-02The estimator stays design-based, ; the model only shapes the weights to make it efficient. A poor model does not bias the result, it just yields a less efficient one, because the design-based form is (approximately) unbiased regardless of how well fits.
What you need: unit-level auxiliaries
This is the practical asymmetry that sets model calibration apart. Raking, post-stratification and GREG need only the population totals of the auxiliaries (e.g. numbers such as “1570 people in the North, 2311 women”) which official statistics routinely publish. Model calibration needs more: the auxiliaries at the unit level for the whole population, because the fitted values must be evaluated for every population unit before being summed. This is the complete auxiliary information of the Wu-Sitter setting, and it confines the method to situations where a population register, a census microdata file, or a large reference frame is available. When only published totals are at hand, the analyst is restricted to raking, post-stratification or GREG.
That difference is visible in the arguments:
step_calibrate() takes margins or
totals (vectors of totals), whereas
step_model_calibration() takes population, a
data frame with one row per population unit carrying the predictors.
The hybrid: model term and consistency constraints
The distinctive part of the step is the hybrid: it imposes the consistency constraints on the raw auxiliary totals and the model-calibration constraint on the predictions at once,
Whether this second constraint adds anything depends on the model. If is linear in , the predictions lie in the span of , the second constraint is already implied by the first, and the estimator reduces to the generalized regression (GREG) estimator. If is nonlinear (e.g regression tree or a random forest) the predictions carry information beyond the linear terms, and the two sets of constraints become complementary: the auxiliary totals enforce internal consistency with published figures (benchmarking), while the model term injects the efficiency of the nonlinear fit. The hybrid is therefore meaningful precisely for the flexible learners that motivate it.
Relationship to the literature
Model calibration was introduced by Wu and Sitter (2001), who showed it handles any linear or nonlinear working model and reduces to the Deville-Sarndal calibration / GREG estimator in the linear case. Wu (2003) proved its optimality within a class of calibration estimators, and Montanari and Ranalli (2005) extended it to nonparametric models, close in spirit to the forest used here. Sarndal (2007) reviews the approach within the general calibration framework.
Combining the model term with the ordinary consistency constraints is
supported by the fact that calibration accommodates an arbitrary number
of constraints; the practical motivation (efficiency from the model plus
benchmarking to published totals) is standard. What is uncommon is a
ready-to-use implementation: the major R survey packages calibrate on
auxiliary totals (GREG, raking, post-stratification) but do not fit an
outcome model and calibrate on its predictions, let alone with flexible
engines combined with consistency constraints.
step_model_calibration() packages exactly that. It is an
accessible implementation of an established idea, not a new method.
References.
Wu & Sitter (2001), JASA 96(453), 185–193.
Wu (2003), Biometrika 90(4), 937–951.
Montanari & Ranalli (2005), JASA 100, 1429–1442.
Sarndal (2007), Survey Methodology 33(2), 99–119.
Deville & Sarndal (1992), JASA 87(418), 376–382.