The need for diagnostic assessment of bootstrap predictive models
Glen Barnett and Ben Zehnwirth
Contents:
- The need for diagnostic assessment of bootstrap predictive models
- A basic bootstrap introduction
- Diagnostic displays for a bootstrapped chain ladder
- Assessing bootstrap predictive distributions
- Some other considerations
- Conclusions
- References
- Appendices
The following set of pages are available as a PDF document here.
Assessing bootstrap predictive distributions
When calculating predictive distributions with the bootstrap, we can in similar fashion make plots of standardized prediction errors against predicted values and against calendar years. Of course, since the prediction errors are the same, the only change would be a difference in the amount by which each prediction error is scaled (since we have bootstrap standard errors in place of asymptotic standard errors from an assumed model); the broad pattern will not change, however, so the plots based on asymptotic results are useful prior to performing the bootstrap.
Since we can produce the entire predictive distribution via the bootstrap, we can evaluate the percentiles of the omitted observations from their bootstrapped predictive distributions - if the model is suitable, the data should be reasonably close to "random" percentiles from the predictive distribution. This further information will be of particular interest for the most recent calendar periods, since the ability of the model to predict recent periods gives our best available indication of whether there is any hope for it in the immediate future: if your model cannot predict last year, you cannot have a great deal of confidence in its ability to predict next year.
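As a minimal sketch of this calculation (in Python, with made-up numbers; the function name is hypothetical and not taken from the paper), the percentile of an omitted observation can be computed as the proportion of bootstrap predictions at or below it:

```python
def predictive_percentile(bootstrap_sample, observed):
    """Proportion of bootstrap predictions at or below the observed value."""
    if not bootstrap_sample:
        raise ValueError("bootstrap_sample must not be empty")
    at_or_below = sum(1 for x in bootstrap_sample if x <= observed)
    return at_or_below / len(bootstrap_sample)


# Illustrative numbers only (not the ABC data): ten bootstrap predictions
# for one cell of the omitted diagonal, and the payment actually observed.
sample = [90, 95, 100, 105, 110, 120, 130, 140, 150, 160]
p = predictive_percentile(sample, observed=155)
print(p)  # 0.9
```

In practice the sample would hold many more bootstrap replicates; an observation lying outside all of them gives a percentile of exactly 0 or 1.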
We could look at a visual diagnostic, such as the set of predictive distributions with the position of each value marked on it, though it may be desirable to look at all of them together on a single plot, if the scale can be rendered so that enough detail can be gleaned from each individual component. It may be necessary to "summarize" the distribution somewhat in order to see where the values lie (for example, indicating 10th, 25th, 50th, 75th and 90th percentiles, rather than showing the entire bootstrap density). In order to more readily compare values it may help to standardize by subtracting the mean and dividing by the standard deviation, though in many cases, if the means don't vary over too many standard deviations, simply looking at the original predicted values (whether on the original scale or on a log scale) may be sufficient - sometimes a little judgement is required as to which plot will be most informative.
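The summarizing and standardizing steps might be sketched as follows. This is only an illustration using Python's standard library: the function names are hypothetical, and the "bootstrap sample" is simulated from an arbitrary lognormal distribution as a stand-in for real bootstrap output.

```python
import random
import statistics


def percentile_summary(bootstrap_sample, levels=(10, 25, 50, 75, 90)):
    """Selected percentiles of a bootstrap sample, keyed by percentile level."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(bootstrap_sample, n=100, method="inclusive")
    return {level: cuts[level - 1] for level in levels}


def standardize(observed, bootstrap_sample):
    """Observed value minus the bootstrap mean, in bootstrap s.d. units."""
    mean = statistics.mean(bootstrap_sample)
    sd = statistics.stdev(bootstrap_sample)
    return (observed - mean) / sd


# Stand-in for a bootstrap predictive sample (simulated, not real data).
rng = random.Random(0)
sample = [rng.lognormvariate(5.0, 0.3) for _ in range(2000)]

print(percentile_summary(sample))
print(standardize(250.0, sample))
```

The five percentiles are what a box and whisker display would draw for each development period; the standardized value puts observations from periods with very different payment levels on a common scale.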
In some circumstances it might also be useful to obtain a single summary of the indicated lack of 'predictive fit'. If the data are "random" percentiles from their predictive distributions, the proportion of the predictive distribution below which each observation falls should be uniform. Of primary interest would be (i) substantial bias in the predictive distribution (both up and down bias are problematic for the insurer), and (ii) substantial error in the variability of the predictive distribution. The first will tend to yield percentiles that are too high or too low, while the second will yield either percentiles that are both too high and too low (if the predictive variance is underestimated), or percentiles that are clustered toward the centre (if it is overestimated).
If p_i is the proportion of the predictive distribution below the observed value, then q_i = 2|p_i - 1/2| should also be uniform, but if there are too many extreme percentiles (whether from upward or downward bias, or from underestimation of the predictive variances), the q_i values will tend to be too large, while if there are too few extreme percentiles the q_i values will tend to be too small.
There are various ways of combining the q_i into a single diagnostic measure. One such would be to note that if the q_i are uniform, then -2 ln(1 - q_i) has a chi-square distribution with 2 d.f. (with large values again indicating an excess of extreme percentiles). Several of these (k, say) may be added if a single statistic is desired and compared with a χ² distribution with 2k d.f. Unusually large values would be of particular interest, though unusually small values would also be important. If required this could be used as a formal hypothesis test, but it is generally of greater value as a diagnostic summary of the overall tendency to extreme percentiles.
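A sketch of such a combined measure is given below. It relies on the standard result that -2 ln U has a chi-square distribution with 2 d.f. when U is uniform on (0, 1), applied here to U = 1 - q, and on the closed form of the chi-square upper tail for even degrees of freedom. The function names, the small clipping constant and the example percentiles are all assumptions made for illustration; the percentiles are not taken from any of the examples in this paper.

```python
import math


def extremeness(p):
    """q = 2|p - 1/2|: 0 at the median, 1 at either extreme."""
    return 2.0 * abs(p - 0.5)


def combined_statistic(percentiles, eps=1e-9):
    """Sum of -2 ln(1 - q_i) over the supplied predictive percentiles.

    eps is an assumed safeguard: it keeps the logarithm finite when an
    observation lies outside every bootstrap value (p = 0 or p = 1).
    """
    total = 0.0
    for p in percentiles:
        q = min(extremeness(p), 1.0 - eps)
        total += -2.0 * math.log(1.0 - q)
    return total


def chi2_upper_tail_even_df(x, k):
    """P(X > x) for X chi-square with 2k d.f. (closed form, even d.f. only)."""
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


# Hypothetical percentiles for seven omitted observations, all sitting high.
ps = [0.93, 0.97, 0.88, 0.99, 0.91, 0.95, 0.81]
stat = combined_statistic(ps)
print(stat, chi2_upper_tail_even_df(stat, k=len(ps)))
```

A small upper-tail probability points to an excess of extreme percentiles; a value very close to 1 would instead suggest too few, for example from overestimated predictive variances.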
Example 2
ABC data. This is Workers' Compensation data for a large company. This data was analyzed in some detail in Barnett and Zehnwirth (2000). See Appendix B.
In this example we actually use the bootstrap predictions discussed in the basic bootstrap introduction above, based on the second algorithm from Pinheiro et al. (2003). Below are the predictive distributions for the first two values (after DY0) for the last diagonal, for an ODP GLM fitted to the data prior to the final calendar year, which was omitted. The brown arrows mark the actual observation that the predictive distribution is attempting to predict.
ABC Predictive distribution for last diagonal - histograms

The arrow marks the actual observation for that delay; for the two distributions shown, the observed value sits fairly high. For a single observation, this might happen by chance, even with an appropriate model, of course.
The runoff decreases sharply for this triangle, so most of the information in the histograms would be lost if we looked at them on a single plot. Consequently, for a more detailed examination, the bootstrap results are reduced to a five-number summary of the percentiles:
ABC Predictive distribution for last diagonal - box and whisker plots

As can be seen, the actual payments for the first seven development periods are all very high, but it's a little hard to see the details in the last few periods. Let's look at them on the log-scale:

Now we can see that in all cases the observations sit above the median of the predictive distribution, and all but the last two are above the upper quartile.
Below is a summary table of the bootstrap distribution for the final calendar year:
ABC: Bootstrap Predictive distributions for last calendar year

So what's going on? Why is this predicting so badly?
Well, we can see via one-step-ahead prediction errors that there's a problem with the assumption of no calendar period trend; alternatively, as we noted earlier, we can look at residuals from a Mack-style model, and get a similar impression:

We can see a strong trend change. Consequently, predictions of the last calendar year will be too low. One major difficulty with the common use of the chain ladder, without careful consideration of the remaining calendar period trend, is that there is no opportunity to apply proper judgement about future trends in this direction: the practitioner lacks information about past behaviour, and so lacks the context in which to seek information that would inform scenarios for future behaviour.
Example 3 - LR high
The data for this example is available in Appendix C. As we have seen, we can look at diagnostics and assess whether we should proceed before we attempt to produce bootstrap prediction intervals.
Here are the standardized residuals vs calendar years from a Mack-style chain ladder fit. As you can see, there's a lot of structure.

There's similar structure in the overdispersed Poisson GLM formulation of the chain ladder - residuals show there are strong trend changes in the calendar year direction:

However, as we described before, this residual plot gives the impression that the GLM is underpredicting. That impression is incorrect, as we see by looking at the validation (one-step-ahead predictions) for the last year:

It's a little hard to see detail over on the right, so let's look at the same plot on the log scale:

The Mack-model residual plot gave a good indication of the predictive performance of the chain ladder (bootstrapped or not) for both the Mack model and the quasi- (overdispersed) Poisson GLM. It's always a good idea to validate the last calendar year (look at one-step-ahead prediction errors), but a quick approximation of the performance is usually given by examining residuals from a Mack chain ladder model.
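Validation of the last calendar year can be sketched as below. This is a minimal illustration rather than the procedure used to produce the plots above: it assumes a cumulative triangle stored as a list of rows, uses plain volume-weighted chain ladder factors with no bootstrap, and runs on a small made-up triangle. The function name is hypothetical.

```python
def one_step_ahead_errors(cumulative):
    """Chain ladder one-step-ahead errors for the latest diagonal.

    cumulative[i] holds the cumulative payments of accident year i, one
    value per observed development period (so later rows are shorter).
    Returns {row index: actual increment - predicted increment}.
    """
    # Drop the latest diagonal: each row loses its final entry.
    reduced = [row[:-1] for row in cumulative]

    errors = {}
    for i, row in enumerate(reduced):
        m = len(row)  # development periods still observed for this row
        if m == 0:
            continue  # youngest accident year: nothing to project from
        # Volume-weighted factor from period m-1 to m, estimated only from
        # rows of the reduced triangle that contain both periods.
        num = sum(r[m] for r in reduced if len(r) > m)
        den = sum(r[m - 1] for r in reduced if len(r) > m)
        if den == 0:
            continue  # factor not estimable (oldest accident year)
        predicted_increment = row[-1] * (num / den) - row[-1]
        actual_increment = cumulative[i][m] - cumulative[i][m - 1]
        errors[i] = actual_increment - predicted_increment
    return errors


# Made-up cumulative triangle whose latest diagonal runs higher than the
# earlier development pattern would suggest.
triangle = [
    [100, 150, 170, 180],
    [110, 165, 190],
    [120, 186],
    [130],
]
errors = one_step_ahead_errors(triangle)
print(errors)
```

With this sign convention a positive error means the payment on the omitted diagonal exceeded its prediction. Dividing each error by a standard error (asymptotic or bootstrap) would give the standardized prediction errors discussed earlier.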
A further problem with the GLM is revealed by the plot of residuals vs development year:

The assumed variance function does not reflect the data.
Example 4
The next example has been widely used in the literature relating to the chain ladder. Indeed, Pinheiro et al. (2003) referred to it as a "benchmark for claims reserving models". The data come from Taylor and Ashe (1983). See Appendix D.
Here are the bootstrap predictive means and s.d.s for the last diagonal (i.e. with that data not used in the estimation) for a quasi-Poisson GLM, and the actual payments for comparison:

Firstly, there is an apparent bias in the bootstrap means. The chain ladder predictions sit below the bootstrap means, indicating a bias. Since the ML estimate for a Poisson is unbiased, these predictions should be unbiased if the model is correct. This doesn't necessarily indicate a bad predictive model, but is there anything going on?
In fact there is, and the problem can be seen in residual plots.
Here is a plot of the residuals versus calendar year from a Mack type fit (this is done first because it's the easiest to obtain - it takes only a couple of clicks in ICRFS-ELRF).

There are strong calendar period effects in the last few years. The existence of a calendar period effect was already noted by Taylor and Ashe in 1983 (who included the late calendar year effect in some of their models), but it has been ignored by almost every author who has considered this data since. If the trend were to continue for next year, the forecasts may be quite wrong. If we didn't examine the residuals, we might not even be aware that this problem is present.
Exactly the same effect appears when fitting a quasi-Poisson two-way cross classification with log-link:

There's little reason not to examine Mack residuals before fitting a quasi-Poisson GLM - the residuals are easier to produce, and the plot of residuals vs fitted carries more information about the predictive ability of the model.
Continue with: Some other considerations.


