Bootstrap Resampling
Description
Generate R
bootstrap replicates of a statistic applied to data. Both
parametric and nonparametric resampling are possible. For the nonparametric
bootstrap, possible resampling methods are the ordinary bootstrap, the
balanced bootstrap, antithetic resampling, and permutation.
For nonparametric multisample problems stratified resampling is used:
this is specified by including a vector of strata in the call to boot.
Importance resampling weights may be specified.
Usage
boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"),
strata = rep(1,n), L = NULL, m = 0, weights = NULL,
ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,
parallel = c("no", "multicore", "snow"),
ncpus = getOption("boot.ncpus", 1L), cl = NULL)
Arguments
data 
The data as a vector, matrix or data frame. If it is a matrix or
data frame then each row is considered as one multivariate
observation.

statistic 
A function which when applied to data returns a vector containing
the statistic(s) of interest. When sim = "parametric" , the
first argument to statistic must be the data. For each
replicate a simulated dataset returned by ran.gen will be
passed. In all other cases statistic must take at least two
arguments. The first argument passed will always be the original
data. The second will be a vector of indices, frequencies or weights
which define the bootstrap sample. Further, if predictions are
required, then a third argument is required which would be a vector
of the random indices used to generate the bootstrap predictions.
Any further arguments can be passed to statistic through the
... argument.

R 
The number of bootstrap replicates. Usually this will be a single
positive integer. For importance resampling, some resamples may use
one set of weights and others use a different set of weights. In
this case R would be a vector of integers where each
component gives the number of resamples from each of the rows of
weights.

sim 
A character string indicating the type of simulation required.
Possible values are "ordinary" (the default),
"parametric" , "balanced" , "permutation" , or
"antithetic" . Importance resampling is specified by
including importance weights; the type of importance resampling must
still be specified but may only be "ordinary" or
"balanced" in this case.

stype 
A character string indicating what the second argument of statistic
represents. Possible values of stype are "i" (indices  the
default), "f" (frequencies), or "w" (weights). Not
used for sim = "parametric" .

strata 
An integer vector or factor specifying the strata for multisample
problems. This may be specified for any simulation, but is ignored
when sim = "parametric" . When strata is
supplied for a nonparametric bootstrap, the simulations are done
within the specified strata.

L 
Vector of influence values evaluated at the observations. This is
used only when sim is "antithetic" . If not supplied,
they are calculated through a call to empinf . This will use
the infinitesimal jackknife provided that stype is
"w" , otherwise the usual jackknife is used.

m 
The number of predictions which are to be made at each bootstrap
replicate. This is most useful for (generalized) linear models.
This can only be used when sim is "ordinary" .
m will usually be a single integer but, if there are strata,
it may be a vector with length equal to the number of strata,
specifying how many of the errors for prediction should come from
each strata. The actual predictions should be returned as the final
part of the output of statistic , which should also take an
argument giving the vector of indices of the errors to be used for
the predictions.

weights 
Vector or matrix of importance weights. If a vector then it should
have as many elements as there are observations in data .
When simulation from more than one set of weights is required,
weights should be a matrix where each row of the matrix is
one set of importance weights. If weights is a matrix then
R must be a vector of length nrow(weights) . This
parameter is ignored if sim is not "ordinary" or
"balanced" .

ran.gen 
This function is used only when sim = "parametric"
when it describes how random values are to be generated. It should
be a function of two arguments. The first argument should be the
observed data and the second argument consists of any other
information needed (e.g. parameter estimates). The second argument
may be a list, allowing any number of items to be passed to
ran.gen . The returned value should be a simulated data set
of the same form as the observed data which will be passed to
statistic to get a bootstrap replicate. It is important that the
returned value be of the same shape and type as the original
dataset. If ran.gen is not specified, the default is a
function which returns the original data in which case all
simulation should be included as part of statistic . Use of
sim = "parametric" with a suitable ran.gen allows the
user to implement any types of nonparametric resampling which are
not supported directly.

mle 
The second argument to be passed to ran.gen . Typically these
will be maximum likelihood estimates of the parameters. For
efficiency mle is often a list containing all of the objects
needed by ran.gen which can be calculated using the original
data set only.

simple 
logical, only allowed to be TRUE for
sim = "ordinary", stype = "i", n = 0 (otherwise ignored with a
warning). By default a n by R index array is created:
this can be large and if simple = TRUE this is avoided by
sampling separately for each replication, which is slower but uses
less memory.

... 
Other named arguments for statistic which are passed
unchanged each time it is called. Any such arguments to
statistic should follow the arguments which statistic is
required to have for the simulation. Beware of partial matching to
arguments of boot listed above, and that arguments named
X and FUN cause conflicts in some versions of
boot (but not this one).

parallel 
The type of parallel operation to be used (if any). If missing, the
default is taken from the option "boot.parallel" (and if that
is not set, "no" ).

ncpus 
integer: number of processes to be used in parallel operation:
typically one would chose this to the number of available CPUs.

cl 
An optional parallel or snow cluster for use if
parallel = "snow" . If not supplied, a cluster on the
local machine is created for the duration of the boot call.

Details
The statistic to be bootstrapped can be as simple or complicated as
desired as long as its arguments correspond to the dataset and (for
a nonparametric bootstrap) a vector of indices, frequencies or
weights. statistic
is treated as a black box by the
boot
function and is not checked to ensure that these
conditions are met.
The first order balanced bootstrap is described in Davison, Hinkley
and Schechtman (1986). The antithetic bootstrap is described by
Hall (1989) and is experimental, particularly when used with strata.
The other nonparametric simulation types are the ordinary bootstrap
(possibly with unequal probabilities), and permutation which returns
random permutations of cases. All of these methods work
independently within strata if that argument is supplied.
For the parametric bootstrap it is necessary for the user to specify
how the resampling is to be conducted. The best way of
accomplishing this is to specify the function ran.gen
which
will return a simulated data set from the observed data set and a
set of parameter estimates specified in mle
.
Value
The returned value is an object of class "boot"
, containing the
following components:
t0 
The observed value of statistic applied to data .

t 
A matrix with sum(R) rows each of which is a bootstrap replicate
of the result of calling statistic .

R 
The value of R as passed to boot .

data 
The data as passed to boot .

seed 
The value of .Random.seed when boot started work.

statistic 
The function statistic as passed to boot .

sim 
Simulation type used.

stype 
Statistic type as passed to boot .

call 
The original call to boot .

strata 
The strata used. This is the vector passed to boot , if it
was supplied or a vector of ones if there were no strata. It is not
returned if sim is "parametric" .

weights 
The importance sampling weights as passed to boot or the empirical
distribution function weights if no importance sampling weights were
specified. It is omitted if sim is not one of
"ordinary" or "balanced" .

pred.i 
If predictions are required (m > 0 ) this is the matrix of
indices at which predictions were calculated as they were passed to
statistic. Omitted if m is 0 or sim is not
"ordinary" .

L 
The influence values used when sim is "antithetic" .
If no such values were specified and stype is not "w"
then L is returned as consecutive integers corresponding to
the assumption that data is ordered by influence values. This
component is omitted when sim is not "antithetic" .

ran.gen 
The random generator function used if sim is
"parametric" . This component is omitted for any other value
of sim .

mle 
The parameter estimates passed to boot when sim is
"parametric" . It is omitted for all other values of
sim .

There are c
, plot
and print
methods for this class.
Parallel operation
When parallel = "multicore"
is used (not available on Windows),
each worker process inherits the environment of the current session,
including the workspace and the loaded namespaces and attached
packages (but not the random number seed: see below).
More work is needed when parallel = "snow"
is used: the worker
processes are newly created R processes, and statistic
needs
to arrange to set up the environment it needs: often a good way to do
that is to make use of lexical scoping since when statistic
is
sent to the worker processes its enclosing environment is also sent.
(E.g. see the example for jack.after.boot
where
ancillary functions are nested inside the statistic
function.)
parallel = "snow"
is primarily intended to be used on
multicore Windows machine where parallel = "multicore"
is not
available.
For most of the boot
methods the resampling is done in the
master process, but not if simple = TRUE
nor sim =
"parametric"
. In those cases (or where statistic
itself uses
random numbers), more care is needed if the results need to be
reproducible. Resampling is done in the worker processes by
censboot(sim = "wierd")
and by most of the schemes in
tsboot
(the exceptions being sim == "fixed"
and
sim == "geom"
with the default ran.gen
).
Where randomnumber generation is done in the worker processes, the
default behaviour is that each worker chooses a separate seed,
nonreproducibly. However, with parallel = "multicore"
or
parallel = "snow"
using the default cluster, a second approach
is used if RNGkind("L'EcuyerCMRG")
has been selected.
In that approach each worker gets a different subsequence of the RNG
stream based on the seed at the time the worker is spawned and so the
results will be reproducible if ncpus
is unchanged, and for
parallel = "multicore"
if parallel::mc.reset.stream()
is
called: see the examples for mclapply
.
Note that loading the parallel namespace may change the random
seed, so for maximum reproducibility this should be done before
calling this function.
References
There are many references explaining the bootstrap and its variations.
Among them are :
Booth, J.G., Hall, P. and Wood, A.T.A. (1993) Balanced importance resampling
for the bootstrap. Annals of Statistics, 21, 286–298.
Davison, A.C. and Hinkley, D.V. (1997)
Bootstrap Methods and Their Application. Cambridge University Press.
Davison, A.C., Hinkley, D.V. and Schechtman, E. (1986) Efficient bootstrap
simulation. Biometrika, 73, 555–566.
Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap.
Chapman & Hall.
Gleason, J.R. (1988) Algorithms for balanced bootstrap simulations.
American Statistician, 42, 263–266.
Hall, P. (1989) Antithetic resampling for the bootstrap. Biometrika,
73, 713–724.
Hinkley, D.V. (1988) Bootstrap methods (with Discussion).
Journal of the Royal Statistical Society, B, 50,
312–337, 355–370.
Hinkley, D.V. and Shi, S. (1989) Importance sampling and the nested bootstrap.
Biometrika, 76, 435–446.
Johns M.V. (1988) Importance sampling for bootstrap confidence intervals.
Journal of the American Statistical Association, 83, 709–714.
Noreen, E.W. (1989) Computer Intensive Methods for Testing Hypotheses.
John Wiley & Sons.
See Also
boot.array
, boot.ci
,
censboot
, empinf
,
jack.after.boot
, tilt.boot
,
tsboot
.
Examples
ratio < function(d, w) sum(d$x * w)/sum(d$u * w)
boot(city, ratio, R = 999, stype = "w")
diff.means < function(d, f)
{ n < nrow(d)
gp1 < 1:table(as.numeric(d$series))[1]
m1 < sum(d[gp1,1] * f[gp1])/sum(f[gp1])
m2 < sum(d[gp1,1] * f[gp1])/sum(f[gp1])
ss1 < sum(d[gp1,1]^2 * f[gp1])  (m1 * m1 * sum(f[gp1]))
ss2 < sum(d[gp1,1]^2 * f[gp1])  (m2 * m2 * sum(f[gp1]))
c(m1  m2, (ss1 + ss2)/(sum(f)  2))
}
grav1 < gravity[as.numeric(gravity[,2]) >= 7,]
boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[,2])
nuke < nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
nuke.lm < glm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = nuke)
nuke.diag < glm.diag(nuke.lm)
nuke.res < nuke.diag$res * nuke.diag$sd
nuke.res < nuke.res  mean(nuke.res)
nuke.data < data.frame(nuke, resid = nuke.res, fit = fitted(nuke.lm))
new.data < data.frame(cost = 1, date = 73.00, cap = 886, ne = 0,
ct = 0, cum.n = 11, pt = 1)
new.fit < predict(nuke.lm, new.data)
nuke.fun < function(dat, inds, i.pred, fit.pred, x.pred)
{
lm.b < glm(fit+resid[inds] ~ date+log(cap)+ne+ct+log(cum.n)+pt,
data = dat)
pred.b < predict(lm.b, x.pred)
c(coef(lm.b), pred.b  (fit.pred + dat$resid[i.pred]))
}
nuke.boot < boot(nuke.data, nuke.fun, R = 999, m = 1,
fit.pred = new.fit, x.pred = new.data)
mean(nuke.boot$t[, 8]^2)
new.fit  sort(nuke.boot$t[, 8])[c(975, 25)]
air.fun < function(data) {
ybar < mean(data$hours)
para < c(log(ybar), mean(log(data$hours)))
ll < function(k) {
if (k <= 0) 1e200 else lgamma(k)k*(log(k)1para[1]+para[2])
}
khat < nlm(ll, ybar^2/var(data$hours))$estimate
c(ybar, khat)
}
air.rg < function(data, mle) {
out < data
out$hours < rexp(nrow(out), 1/mle)
out
}
air.boot < boot(aircondit, air.fun, R = 999, sim = "parametric",
ran.gen = air.rg, mle = mean(aircondit$hours))
sum(abs(air.boot$t[,2]1) > abs(air.boot$t0[2]1))/(1+air.boot$R)