The need for diagnostic assessment of bootstrap predictive models
Glen Barnett and Ben Zehnwirth
Contents:
- The need for diagnostic assessment of bootstrap predictive models
- A basic bootstrap introduction
- Diagnostic displays for a bootstrapped chain ladder
- Assessing bootstrap predictive distributions
- Some other considerations
- Conclusions
- References
- Appendices
The following set of pages are available as a PDF document here.
A basic bootstrap introduction
The bootstrap was devised by Efron (1979), growing out of earlier work on the jackknife. He further developed it in a book (Efron, 1982), and various other papers. These days there are numerous books relating to the bootstrap, such as Efron and Tibshirani (1994). A good introduction to the basic bootstrap may be found in Moore et al. (2003); it can be obtained online.
In its original form, the bootstrap resamples the data itself to approximate the sampling distribution of some statistic of interest, which is then used to make inference about the corresponding population quantity.
For example, in the context of a simple model E(X_i) = μ, i = 1, 2, ..., n, where the X's are assumed to be independent, the population quantity of interest is the mean, μ, and the sample statistic of interest would typically be the sample mean, x_bar.
Consequently, we estimate the population mean by the sample mean (μ_hat = x_bar ) - but how good is that estimate? If we were to collect many samples, how far would the sample means typically be from the population mean?
While that question could be answered if we could directly take many samples from the population, typically we cannot resample the original population again. If we assume a distribution, we could infer the behaviour of the sample mean from the assumed distribution, and then check that the sample could reasonably have come from the assumed distribution.
(Note that rather than needing to assume an entire distribution, if the population variance were assumed known, we could compute the variance of the sample mean, and given a large enough sample, we might consider applying the central limit theorem (CLT) in order to produce an approximate interval for the population parameter, without further assumptions about the distributional form. Several issues arise. One is whether or not the sample is large enough - the number of observations per parameter in reserving is often quite small. Indeed, many common techniques have some parameters whose estimates are based on only a single observation! Another issue is that to be able to apply the CLT we assumed a variance - if instead we estimate the variance, then the inference about the mean depends on the distribution again. Once the sample size is large enough that we may apply Slutsky's theorem, a t-statistic, for example, is asymptotically normal, even though in small samples the t-statistic has a t-distribution only if the data are normal. Lastly, and perhaps most importantly when we want a predictive distribution, the CLT usually cannot help.)
In bootstrapping, the sample itself is resampled, and inferences about the behaviour of samples from the population are made on the basis of those resamples. The empirical distribution of the original sample is taken as the best estimate of the population distribution.
In the simple example above, we repeatedly draw samples of size n (with replacement) from the original sample and compute the statistic (the sample mean) for each resample; the distribution of these values approximates the sampling distribution of the statistic. Not all of the original observations will be present in a given resample - on average a little under 2/3 of them (about 63% for large n) will appear, and the rest of the resample will be repetitions of values already drawn. A few observations may appear more than twice.
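The "little under 2/3" figure can be checked directly. The sketch below is an illustration only (the function names and the choice n = 50 are assumptions, not from the paper): it compares a simulated average with the standard result that a given observation appears in a resample of size n with probability 1 - (1 - 1/n)^n, which tends to 1 - 1/e as n grows.

```python
import numpy as np

def distinct_fraction(n, n_resamples=2000, seed=0):
    """Simulated average fraction of the n original observations that
    appear at least once in a with-replacement resample of size n."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=(n_resamples, n))  # one resample per row
    fractions = [np.unique(row).size / n for row in idx]
    return float(np.mean(fractions))

def expected_distinct_fraction(n):
    """Probability that a given observation appears in a resample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

n = 50                                   # illustrative sample size
simulated = distinct_fraction(n)
exact = expected_distinct_fraction(n)
limit = 1.0 - np.exp(-1.0)               # large-n limit, roughly 0.632
print(f"simulated={simulated:.4f} exact={exact:.4f} limit={limit:.4f}")
```

For small triangles, where n is the number of residuals, the finite-n value is slightly above the limit but still below 2/3.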
The standard error, the bias and even the distribution of an estimator about the population value can be approximated using these resamples, by replacing the population distribution, F, by the empirical distribution, F_n.
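A minimal sketch of this basic bootstrap for the sample mean follows. The data are simulated purely for illustration, and the function name, the number of resamples and the percentile interval are assumptions of the sketch rather than anything prescribed in the paper.

```python
import numpy as np

def bootstrap_mean(x, n_resamples=5000, seed=0):
    """Resample x with replacement and return the bootstrap sample means,
    their standard deviation (bootstrap standard error) and the
    bootstrap estimate of bias."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    idx = rng.integers(0, n, size=(n_resamples, n))  # resample indices
    boot_means = x[idx].mean(axis=1)                 # statistic per resample
    se = boot_means.std(ddof=1)
    bias = boot_means.mean() - x.mean()
    return boot_means, se, bias

# Hypothetical skewed sample standing in for real data.
data_rng = np.random.default_rng(1)
sample = data_rng.exponential(scale=10.0, size=40)

boot_means, se, bias = bootstrap_mean(sample)
lower, upper = np.percentile(boot_means, [2.5, 97.5])  # simple percentile interval
print(f"mean={sample.mean():.2f} se={se:.2f} bias={bias:.3f} "
      f"interval=({lower:.2f}, {upper:.2f})")
```

For the sample mean the bootstrap standard error should be close to the usual plug-in value (the standard deviation of the sample divided by the square root of n), which gives a quick sanity check on the implementation.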
For more complex models, this direct resampling approach may not be suitable. For example, in a regression model, there is a difficulty with resampling the responses directly, since they will typically have different means.
For regression models, one approach is to keep all the predictors with each observation, and sample them together. That is, if X is a matrix of predictors (sometimes called a design matrix) and y is a data-vector, for the multiple regression model Y = Xβ + ε, then the rows of the augmented design matrix [X|y] are resampled. (This is particularly useful when the X's are thought of as random.)
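As an illustration of resampling the rows of [X|y], the following sketch fits ordinary least squares to each resample. The data, the function name and the use of `numpy.linalg.lstsq` are assumptions made for the example.

```python
import numpy as np

def case_resample_ols(X, y, n_resamples=2000, seed=0):
    """Bootstrap OLS coefficients by resampling whole rows of [X|y]."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    betas = np.empty((n_resamples, p))
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # row indices, with replacement
        # Each response stays attached to its own predictors.
        betas[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return betas

# Hypothetical data: intercept plus one predictor.
data_rng = np.random.default_rng(1)
x = data_rng.uniform(0.0, 10.0, size=30)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x + data_rng.normal(0.0, 1.0, size=30)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # fit to the original data
betas = case_resample_ols(X, y)
boot_se = betas.std(axis=0, ddof=1)              # bootstrap standard errors
print("beta_hat:", beta_hat, "bootstrap SE:", boot_se)
```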
A similar approach can be used when computing multivariate statistics, such as correlations.
Another approach is to resample the residuals from the model. The residuals are estimates of the error term, and in many models the errors (or in some cases, scaled errors) at least share a common mean and variance. The bootstrap in this case assumes more than that - the errors should have a common distribution (in some applications this assumption is violated).
In this case (with the assumption of equal variance), after fitting the model and estimating the parameters, the residuals from the model are computed: e_i = y_i - y_hat_i, and then the residuals are resampled as if they were the data.
Then a new sample is generated from the resampled residuals by adding them to the fitted values, and the model is fitted to the new bootstrap sample. The procedure is repeated many times.
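The residual resampling procedure just described can be sketched as follows for an ordinary regression with an intercept and equal error variance. This is a generic illustration under those assumptions, with made-up data and hypothetical names; it is not the scaled-residual scheme used for reserving triangles.

```python
import numpy as np

def residual_bootstrap_ols(X, y, n_resamples=2000, seed=0):
    """Residual-resampling bootstrap for OLS: resample residuals, add them
    to the fitted values, and refit the model to each pseudo-sample."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    resid = y - fitted                    # e_i = y_i - y_hat_i
    betas = np.empty((n_resamples, p))
    for b in range(n_resamples):
        e_star = rng.choice(resid, size=n, replace=True)  # resampled residuals
        y_star = fitted + e_star                          # new bootstrap sample
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return beta_hat, resid, betas

# Hypothetical data: intercept plus one predictor.
data_rng = np.random.default_rng(1)
x = data_rng.uniform(0.0, 10.0, size=30)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x + data_rng.normal(0.0, 1.0, size=30)

beta_hat, resid, betas = residual_bootstrap_ols(X, y)
print("beta_hat:", beta_hat, "bootstrap SE:", betas.std(axis=0, ddof=1))
```

Unlike resampling rows, the design matrix is held fixed here, so all of the variation across refits comes from the resampled residuals.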
Forms of this residual resampling bootstrap have been used almost exclusively in reserving.
If the model is correct, appropriately implemented residual resampling works. If the model is incorrect, the resampling schemes will all be affected by it, some more than others, though in general the size of the difference in predicted variance is small. More sophisticated versions of this kind of resampling scheme, such as the second bootstrap procedure in Pinheiro et al. (2003), can reduce the impact of model misspecification when the prediction is, as is common for regression models, within the range of the data. However, the underlying problem of amplification of unfitted calendar year effects remains, as we shall see.
For the examples in this paper we use a slightly augmented version of Sampler 2 given in Pinheiro et al. (2003): the prediction errors are added to the predictions to yield bootstrap-simulated predictive values, so that we can directly find the proportion of the bootstrap predictive distribution below the actual values in one-step-ahead predictions.
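To make the idea of a bootstrap predictive distribution concrete, here is a deliberately simplified sketch for an ordinary regression on a time index. It is not Sampler 2 of Pinheiro et al. and not the authors' implementation: the data, the names, and the way the process error is added (one further resampled residual per simulated prediction) are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_predictive(X, y, x_new, n_resamples=2000, seed=0):
    """Simulate predictive values at x_new: refit to each residual-bootstrap
    sample, predict, then add a resampled residual as process error."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    resid = y - fitted
    preds = np.empty(n_resamples)
    for b in range(n_resamples):
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        beta_star = np.linalg.lstsq(X, y_star, rcond=None)[0]
        preds[b] = x_new @ beta_star + rng.choice(resid)  # parameter + process error
    return preds

def proportion_below(preds, actual):
    """Proportion of the simulated predictive distribution below the actual value."""
    return float(np.mean(preds < actual))

# Hypothetical series: fit on periods 0..19, predict period 20 one step ahead.
data_rng = np.random.default_rng(2)
t = np.arange(20.0)
X = np.column_stack([np.ones_like(t), t])
y = 5.0 + 2.0 * t + data_rng.normal(0.0, 1.0, size=20)
x_new = np.array([1.0, 20.0])
actual = 46.0                                   # made-up "observed" value

point_forecast = float(x_new @ np.linalg.lstsq(X, y, rcond=None)[0])
preds = bootstrap_predictive(X, y, x_new)
print(f"forecast={point_forecast:.2f} proportion below actual="
      f"{proportion_below(preds, actual):.3f}")
```

If the model is adequate, such proportions, collected over many one-step-ahead predictions, should look roughly uniform on (0, 1); proportions piling up near 0 or 1 suggest a predictive distribution that is misplaced or too narrow.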
In the case of reserving, the special structure of the problem means that while often we predict inside the range of observed accident years, and usually also within the range of observed development years, we are always projecting outside the range of observed calendar years - precisely the direction in which the models corresponding to most standard techniques are inadequate.
As a number of authors have noted, the chain ladder models the data using a two-way cross-classification scheme (that is, like a two-way main-effects ANOVA model with a log link). As discussed in Barnett et al. (2005), this is an unsuitable approach in the accident and development directions, but the issues in the calendar direction are even more problematic. Even the more sophisticated approaches to residual resampling can fail on the reserving problem if the model is unsuitable.