--- title: "Machine learning, cross-fitting and robust calibration" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Machine learning, cross-fitting and robust calibration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(weightflow) ``` The *Get started* article builds a weighting recipe with classical tools: weighting classes for nonresponse, raking for calibration, a fixed-fence trim. This article covers the methods that go one step further — flexible learners for the response propensity, cross-fitting to keep them honest, penalized calibration, and data-driven trimming. Each is opt-in: the defaults are the classical methods, and a single argument switches the richer one on. We use the bundled `sample_one` design (a multistage select-one survey) and resolve eligibility, household nonresponse and within-household selection first, so that the examples below start from a clean staged sample. Throughout, $s$ denotes the sample and $U$ the population (universe); $w_i$ is the weight of unit $i$; $X$ are auxiliary totals known for $U$; and $\hat\phi_i$, $\hat m(x_i)$ are estimated response propensities and outcome predictions. ```{r base} base <- weighting_spec(sample_one, base_weights = pw) |> step_unknown_eligibility(unknown = unknown_elig, by = "region") |> step_drop_ineligible(ineligible = ineligible) |> step_nonresponse(respondent = hh_responded, method = "weighting_class", by = "region") |> step_select_within(prob = p_within) ``` ## Flexible propensities Weighting classes assume response is ignorable within cells. A response-propensity model relaxes that: it estimates `P(respond)` from the auxiliaries and adjusts by the inverse propensity (optionally grouped into classes). weightflow can fit that model with logistic regression, a regression tree, a random forest, or gradient boosting, all through the same `engine` argument. ```{r boost, eval = requireNamespace("xgboost", quietly = TRUE)} fit_boost <- base |> step_nonresponse(respondent = responded, method = "propensity", formula = ~ region + sex + age, engine = "boost", num_classes = 5) |> prep() ``` Gradient boosting captures nonlinearities and interactions among the predictors without you specifying them. That flexibility is useful when nonresponse depends on the covariates in complicated ways — but it comes with a risk. ## The overfitting problem A flexible learner that predicts the same units it was trained on fits them *too* well. The estimated propensities are pulled toward the observed outcomes, so some respondents get artificially low fitted propensities. Since the nonresponse adjustment multiplies the weight by the inverse propensity, $$w_i^{\text{nr}} = \frac{w_i^{\text{prev}}}{\hat\phi_i},$$ a propensity $\hat\phi_i$ that is biased toward zero produces an enormous weight. The result is a handful of extreme weights that inflate the variance — exactly what weighting is meant to avoid. The design effect makes this visible. With in-sample boosting, a single propensity class can carry a factor an order of magnitude larger than the rest, and the Kish design effect $\text{deff} = 1 + \text{CV}^2(w)$ climbs accordingly. ## Cross-fitting Cross-fitting breaks the circularity. The sample is split into K folds; the propensity of each unit is predicted by a model trained on the *other* folds, so no unit informs its own prediction. This is the survey-weighting counterpart of the cross-fitting used in double/debiased machine learning (Chernozhukov et al., 2018). In weightflow it is a single argument, `crossfit = K`; when the step has a `cluster`, the folds are formed by cluster, so members of the same household never split across folds and there is no leakage. ```{r crossfit, eval = requireNamespace("xgboost", quietly = TRUE)} fit_cf <- base |> step_nonresponse(respondent = responded, method = "propensity", formula = ~ region + sex + age, engine = "boost", num_classes = 5, crossfit = 5, crossfit_seed = 1) |> prep() ``` On these data the effect is large. Without cross-fitting the boosted propensities overfit and the final Kish design effect rises to roughly 2.4; with five-fold cross-fitting the same model keeps it near 1.5. The out-of-sample propensities are also smoother, so the quantile-based classes come out balanced. The estimates barely change; the *stability* of the weights does. Cross-fitting is available wherever a model is fitted. In `step_model_calibration()` it plays the same role: a working model $\hat m(x)$ is fitted for each outcome $y$, and the weights are calibrated so that the weighted sample total of the prediction matches its population total, $$\sum_{i \in s} w_i\,\hat m(x_i) = \sum_{i \in U} \hat m(x_i).$$ With `crossfit = K`, the sample predictions $\hat m(x_i)$ on the left are made out-of-fold (each unit predicted by a model that did not see it), while the population total on the right uses the full model. The example below fits the outcome model for `income` with gradient boosting and cross-fits it: ```{r modelcal, eval = requireNamespace("xgboost", quietly = TRUE)} fit_mc <- weighting_spec(sample_survey, base_weights = pw) |> step_nonresponse(respondent = responded, method = "weighting_class", by = "region") |> step_model_calibration( x_formula = ~ region + sex, models = list(income = y_model(income ~ age + sex + region, engine = "boost")), population = population, crossfit = 5, crossfit_seed = 1) |> prep() fit_mc$steps[[2]]$diagnostics ``` The calibration still solves its system exactly, so the achieved totals match the targets; cross-fitting only changes how the predictions that enter the system are formed, removing the in-sample optimism of the flexible learner. ## Ridge (penalized) calibration Calibration forces the weighted sample to reproduce known population totals. With many control variables, forcing every constraint *exactly* can again push the weights to extremes. Ridge calibration relaxes the constraints in a controlled way: each target is allowed a small, penalized deviation. In the linear case the calibration system $A\lambda = (X - \hat X)$ gains a penalty on its diagonal, $$\big(A + \operatorname{diag}(s / c_j)\big)\,\lambda = X - \hat X,$$ where $c_j$ is the cost of constraint $j$ and $s$ scales the penalty to the system, making it unit-free. A single, scale-free `penalty` governs the trade-off — large values stay (almost) exact, small values relax more and tighten the weights. ```{r ridge} pop_totals <- c("(Intercept)" = nrow(population), regionSouth = sum(population$region == "South"), regionEast = sum(population$region == "East"), regionWest = sum(population$region == "West"), sexM = sum(population$sex == "M")) fit_ridge <- weighting_spec(sample_survey, base_weights = pw) |> step_nonresponse(respondent = responded, method = "weighting_class", by = "region") |> step_calibrate(method = "linear", formula = ~ region + sex, totals = pop_totals, penalty = 1) |> prep() fit_ridge$steps[[2]]$diagnostics ``` Under ridge the achieved totals no longer match the targets exactly; the `deviation` column reports the (small) gap, which is the price paid for steadier weights. As `penalty` decreases, the deviations grow and the calibration factors concentrate around one. ## Potter (MSE-optimal) trimming Trimming caps extreme weights, but the cap is usually picked by hand. Potter's method chooses it from the data. Writing $\tau$ for the candidate cutoff (weights above $\tau$ are capped), the method evaluates a grid of values of $\tau$ and minimizes an estimate of the mean squared error of the weighted total, $$\text{MSE}(\tau) = \text{bias}(\tau)^2 + \text{var}(\tau),$$ balancing the bias introduced by trimming at $\tau$ (related to the weight removed above the cutoff) against the variance contributed by the weights that remain. The chosen cutoff is the $\tau$ with the smallest estimated MSE. ```{r potter} trimmed_tukey <- base |> step_nonresponse(respondent = responded, method = "weighting_class", by = c("region", "sex")) |> step_trim_weights(method = "tukey") |> prep() trimmed_potter <- base |> step_nonresponse(respondent = responded, method = "weighting_class", by = c("region", "sex")) |> step_trim_weights(method = "potter") |> prep() trimmed_tukey$steps[[6]]$diagnostics[, c("method", "upper", "n_capped")] trimmed_potter$steps[[6]]$diagnostics[, c("method", "upper", "n_capped")] ``` The two rules give different cutoffs: the Tukey far-out fence is a fixed, conservative rule, while Potter searches for the cutoff that minimizes estimated MSE — often more aggressive, capping a few more weights when that lowers the overall error. ## Putting it together These methods compose like any other step, and the recipe-aware bootstrap covers them automatically: because it re-applies the whole recipe on each replicate — re-fitting the propensity model, re-running the cross-fitting, re-solving the calibration — the standard errors reflect the variability these methods introduce, not just the final weights. See the *Variance estimation* article for the bootstrap, and *Get started* for the staged logic these methods plug into. ## References - Chernozhukov, V., et al. (2018). Double/debiased machine learning for treatment and structural parameters. *The Econometrics Journal*, 21(1), C1–C68. - Breidt, F. J., & Opsomer, J. D. (2017). Model-assisted survey estimation with modern prediction techniques. *Statistical Science*, 32(2), 190–205. - Bardsley, P., & Chambers, R. L. (1984). Multipurpose estimation from unbalanced samples. *Applied Statistics*, 33(3), 290–299. - Potter, F. J. (1990). A study of procedures to identify and trim extreme sample weights. *Proc. ASA Survey Research Methods Section*, 225–230.