Answer to Puzzle #??: "Bootstrap" in statistics

In the bootstrap approach to statistics, you acquire a data set consisting of, say, N items, and you want to compute some function F of the data. Your N-item sample is of course drawn from a much larger universe, and whatever function you compute will only approximate the true all-universe value. One wants to understand how bad that approximation is. The question is: what probability distribution describes your post-data-analysis ignorance about F? One "bootstrapping" approach to answering that is to compute F on a large number of random M-element subsets of your N-item sample, then declare (or hope) that the resulting empirical distribution is a good approximation to your ignorance about the unknown real one.
Question: What is the best choice of M?
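
The subsampling procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a definitive implementation: the function name subsample_distribution, the number of trials, the seed, and the made-up 100-person poll are all assumptions introduced here, and M is left as a free parameter.

```python
import random
import statistics

def subsample_distribution(data, M, F, trials=2000, seed=0):
    """Evaluate F on `trials` random M-element subsets of `data`.

    Each subset is drawn without replacement (hypothetical helper,
    not taken from the original text).
    """
    rng = random.Random(seed)
    return [F(rng.sample(data, M)) for _ in range(trials)]

# Made-up poll for illustration: 1 = Kennedy voter, 0 = anyone else.
sample = [1] * 40 + [0] * 60          # N = 100, K = 40

def kennedy_share(subset):
    """F: fraction of Kennedy voters in the subset."""
    return sum(subset) / len(subset)

values = subsample_distribution(sample, M=60, F=kennedy_share)

# Centre and spread of the empirical distribution of F.
print(statistics.mean(values), statistics.stdev(values))
```

Changing M changes the spread of the resulting distribution, which is why the choice of M matters.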

An Answer

Suppose this is something like a pre-election poll and F is "the percentages of Kennedy, Nixon, and Wallace voters." The number of Kennedy (or whoever) voters in your N-person random sample can be regarded, to good approximation, as a Poisson random variable with mean = variance = vN for some unknown constant v with 0<v<1, which we would like to know. The naive estimate of v is simply v≈K/N, where K is the number of Kennedy voters in our sample. If we take an M-person random subset of our sample, then the mean number of Kennedyists in the subset will be MK/N, and the variance (treating the M draws as binomial) will be (1-K/N)MK/N. Thus the ratio (standard deviation)/(mean) was (vN)^(-1/2) in the whole N-person sample, but in a random subset it will be [(N-K)/(KM)]^(1/2). These two ratios are equal (which, I suppose, means the bootstrapped distribution is least distorted versus the correct one) if, and only if, M=(N-K)vN/K≈N-K, where the last step uses vN≈K. This suggests that for the purpose of analysing a poll in a C-candidate race (assuming this is genuinely a C-way race, i.e. all C of the candidates are approximately equally popular, so that K≈N/C) it is best to use M≈(1-1/C)N.
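
As a quick numerical check of the algebra above, the two (standard deviation)/(mean) ratios can be compared directly. The function names and the example values N=1000, K=400, C=3 are assumptions made up for illustration, and v is replaced by its naive estimate K/N unless supplied.

```python
import math

def relative_spread_full(v, N):
    """(standard deviation)/(mean) for a Poisson count with mean vN."""
    return (v * N) ** -0.5

def relative_spread_subset(N, K, M):
    """(standard deviation)/(mean) in an M-person subset, using the
    binomial-style variance (1 - K/N) * M * K / N from the text."""
    return math.sqrt((N - K) / (K * M))

def matching_subset_size(N, K, v=None):
    """M that equates the two ratios: M = (N - K) v N / K.

    If v is not given, the naive estimate v = K/N is used,
    in which case M reduces to N - K.
    """
    if v is None:
        v = K / N
    return (N - K) * v * N / K

N, K = 1000, 400                      # made-up poll numbers
M = matching_subset_size(N, K)
print(M, relative_spread_full(K / N, N), relative_spread_subset(N, K, M))

# Equally popular candidates: K is about N/C, so M is about (1 - 1/C) N.
C = 3
print((1 - 1 / C) * N)
```

With these numbers the two ratios coincide once M is set to N-K, as the derivation predicts.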
