Title: | Functions for Kernel Smoothing Supporting Wand & Jones (1995) |
---|---|
Description: | Functions for kernel smoothing (and density estimation) corresponding to the book: Wand, M.P. and Jones, M.C. (1995) "Kernel Smoothing". |
Authors: | Matt Wand [aut], Cleve Moler [ctb] (LINPACK routines in src/d*), Brian Ripley [trl, cre, ctb] (R port and updates) |
Maintainer: | Brian Ripley <[email protected]> |
License: | Unlimited |
Version: | 2.23-24 |
Built: | 2024-06-15 17:27:55 UTC |
Source: | CRAN |
Returns x and y coordinates of the binned kernel density estimate of the probability density of the data.
bkde(x, kernel = "normal", canonical = FALSE, bandwidth, gridsize = 401L, range.x, truncate = TRUE)
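x | numeric vector of observations from the distribution whose density is to be estimated. Missing values are not allowed. |
bandwidth | the kernel bandwidth smoothing parameter. Larger values of bandwidth make smoother estimates, smaller values of bandwidth make less smooth estimates. |
kernel | character string which determines the smoothing kernel. kernel can be "normal" (the Gaussian density function, the default), "box" (a rectangular box), "epanech" (the centred beta(2,2) density), "biweight" (the centred beta(3,3) density) or "triweight" (the centred beta(4,4) density). |
canonical | length-one logical vector: if TRUE, canonically scaled kernels are used. |
gridsize | the number of equally spaced points at which to estimate the density. |
range.x | vector containing the minimum and maximum values of x at which to compute the estimate. |
truncate | logical flag: if TRUE, data with x values outside the range specified by range.x are ignored. |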
This is the binned approximation to the ordinary kernel density estimate. Linear binning is used to obtain the bin counts. For each x value in the sample, the kernel is centered on that x and the heights of the kernel at each datapoint are summed. This sum, after a normalization, is the corresponding y value in the output.
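The description above can be checked numerically. The following is a minimal sketch, not the package's internal algorithm: it compares bkde with a direct, unbinned Gaussian kernel sum evaluated at the same grid points. The simulated sample, the bandwidth of 0.4 and the variable names are illustrative assumptions. Small differences are expected from the binning step and from any finite kernel support used internally.

```r
library(KernSmooth)

set.seed(1)
x <- rnorm(200)   # simulated sample (illustrative, not from the manual)
h <- 0.4          # bandwidth chosen arbitrarily for the comparison

# Binned estimate returned by bkde
est <- bkde(x, bandwidth = h)

# Direct kernel sum: centre a Gaussian kernel on each observation,
# add up the kernel heights at every grid point, then normalise by n * h
direct <- sapply(est$x, function(g) mean(dnorm((g - x) / h)) / h)

# Largest absolute discrepancy between the two estimates
max(abs(est$y - direct))
```

If the discrepancy matters for an application, increasing gridsize is the natural knob to try first.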
a list containing the following components:
x | vector of sorted x values at which the estimate was computed. |
y | vector of density estimates at the corresponding x. |
Density estimation is a smoothing operation. Inevitably there is a trade-off between bias in the estimate and the estimate's variability: large bandwidths will produce smooth estimates that may hide local features of the density; small bandwidths may introduce spurious bumps into the estimate.
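The trade-off can be illustrated by counting the local maxima of estimates computed with a very small and a very large bandwidth. This is a sketch under stated assumptions: the bimodal sample is simulated, the two bandwidths are arbitrary, and count_modes is a hypothetical helper written for this example rather than a KernSmooth function.

```r
library(KernSmooth)

set.seed(42)
# Simulated bimodal sample (illustrative data)
x <- c(rnorm(150, mean = 0), rnorm(150, mean = 4))

# Hypothetical helper: number of local maxima of an estimate on its grid
count_modes <- function(est) {
  d <- diff(est$y)
  sum(d[-length(d)] > 0 & d[-1] <= 0)
}

small <- bkde(x, bandwidth = 0.05)  # undersmoothed: many spurious bumps
large <- bkde(x, bandwidth = 3)     # oversmoothed: the two groups merge

c(small = count_modes(small), large = count_modes(large))
```

Plotting both estimates with plot(small, type = "l") and lines(large) shows the same contrast visually.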
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
data(geyser, package = "MASS")
x <- geyser$duration
est <- bkde(x, bandwidth = 0.25)
plot(est, type = "l")
Returns the set of grid points in each coordinate direction, and the matrix of density estimates over the mesh induced by the grid points. The kernel is the standard bivariate normal density.
bkde2D(x, bandwidth, gridsize = c(51L, 51L), range.x, truncate = TRUE)
x | a two-column numeric matrix containing the observations from the distribution whose density is to be estimated. Missing values are not allowed. |
bandwidth | numeric vector of length 2, containing the bandwidth to be used in each coordinate direction. |
gridsize | vector containing the number of equally spaced points in each direction over which the density is to be estimated. |
range.x | a list containing two vectors, where each vector contains the minimum and maximum values of x at which to compute the estimate for each direction. |
truncate | logical flag: if TRUE, data with x values outside the range specified by range.x are ignored. |
a list containing the following components:
x1 | vector of values of the grid points in the first coordinate direction at which the estimate was computed. |
x2 | vector of values of the grid points in the second coordinate direction at which the estimate was computed. |
fhat | matrix of density estimates over the mesh induced by x1 and x2. |
This is the binned approximation to the 2D kernel density estimate.
Linear binning is used to obtain the bin counts and the
Fast Fourier Transform is used to perform the discrete convolutions.
For each x1, x2 pair the bivariate Gaussian kernel is centered on that location and the heights of the kernel, scaled by the bandwidths, at each datapoint are summed. This sum, after a normalization, is the corresponding fhat value in the output.
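One way to sanity-check the normalization is to confirm that fhat integrates to roughly one over the grid. The sketch below uses a simulated bivariate sample and arbitrary bandwidths, both of which are assumptions made for illustration. A simple Riemann sum is used, so the result is only approximately 1, and some mass near the edges of the grid may be lost.

```r
library(KernSmooth)

set.seed(7)
# Simulated bivariate sample (illustrative data)
x <- cbind(rnorm(500), rnorm(500, sd = 2))

est <- bkde2D(x, bandwidth = c(0.4, 0.8))

# Grid spacing in each coordinate direction
d1 <- est$x1[2] - est$x1[1]
d2 <- est$x2[2] - est$x2[1]

# Riemann-sum approximation to the integral of the estimate over the mesh
total <- sum(est$fhat) * d1 * d2
total
```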
Wand, M. P. (1994). Fast Computation of Multivariate Kernel Estimators. Journal of Computational and Graphical Statistics, 3, 433-445.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
data(geyser, package = "MASS")
x <- cbind(geyser$duration, geyser$waiting)
est <- bkde2D(x, bandwidth = c(0.7, 7))
contour(est$x1, est$x2, est$fhat)
persp(est$fhat)
Returns an estimate of a binned approximation to the kernel estimate of the specified density functional. The kernel is the standard normal density.
bkfe(x, drv, bandwidth, gridsize = 401L, range.x, binned = FALSE, truncate = TRUE)
x | numeric vector of observations from the distribution whose density is to be estimated. Missing values are not allowed. |
drv | order of derivative in the density functional. Must be a non-negative even integer. |
bandwidth | the kernel bandwidth smoothing parameter. Must be supplied. |
gridsize | the number of equally-spaced points over which binning is performed. |
range.x | vector containing the minimum and maximum values of x at which to compute the estimate. |
binned | logical flag: if TRUE, then x is taken to be grid counts rather than raw data. |
truncate | logical flag: if TRUE, data with x values outside the range specified by range.x are ignored. |
The density functional of order drv is the integral of the product of the density and its drv-th derivative. The kernel estimates of such quantities are computed using a binned implementation, and the kernel is the standard normal density.
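For a normal density the functional has a closed form, which gives a convenient reference value. The sketch below compares bkfe on a simulated standard normal sample with that normal-theory value for drv = 2. The helper psi_normal is hypothetical (written for this example), and the sample size and bandwidth are arbitrary assumptions; the estimate is expected to be close to the reference value but not equal to it, since the kernel estimate is biased for any fixed bandwidth.

```r
library(KernSmooth)

# Hypothetical helper: order-r density functional of N(0, sigma^2), r even,
# psi_r = (-1)^(r/2) r! / ((2 sigma)^(r+1) (r/2)! sqrt(pi))
psi_normal <- function(r, sigma = 1) {
  (-1)^(r / 2) * factorial(r) /
    ((2 * sigma)^(r + 1) * factorial(r / 2) * sqrt(pi))
}

set.seed(2)
x <- rnorm(5000)   # simulated standard normal sample (illustrative)

est   <- bkfe(x, drv = 2, bandwidth = 0.3)
truth <- psi_normal(2)

c(estimate = est, normal_theory = truth)
```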
the (scalar) estimated functional.
Estimates of this type were proposed by Sheather and Jones (1991).
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683–690.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
data(geyser, package = "MASS")
x <- geyser$duration
est <- bkfe(x, drv = 4, bandwidth = 0.3)
Uses direct plug-in methodology to select the bin width of a histogram.
dpih(x, scalest = "minim", level = 2L, gridsize = 401L, range.x = range(x), truncate = TRUE)
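x | numeric vector containing the sample on which the histogram is to be constructed. |
scalest | estimate of scale. "stdev": the standard deviation is used. "iqr": the inter-quartile range divided by 1.349 is used. "minim": the minimum of "stdev" and "iqr" is used. |
level | number of levels of functional estimation used in the plug-in rule. |
gridsize | number of grid points used in the binned approximations to functional estimates. |
range.x | range over which functional estimates are obtained. The default is the minimum and maximum data values. |
truncate | if TRUE, observations outside the interval specified by range.x are omitted. Otherwise, they are used to weight the extreme grid points. |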
The direct plug-in approach, where unknown functionals that appear in expressions for the asymptotically optimal bin width and bandwidths are replaced by kernel estimates, is used. The normal distribution is used to provide an initial estimate.
the selected bin width.
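Because the method extends a normal scale rule, it is instructive to compare the two on data that really are normal, where they should roughly agree. The sketch below is an illustration under assumptions: the sample is simulated, and the normal scale bin width is computed by hand as (24 sqrt(pi))^(1/3) * s * n^(-1/3), roughly 3.49 * s * n^(-1/3), rather than taken from any KernSmooth function.

```r
library(KernSmooth)

set.seed(3)
x <- rnorm(1000)   # simulated normal sample (illustrative)
n <- length(x)

# Direct plug-in bin width
h_dpi <- dpih(x)

# Normal scale bin width, computed by hand for comparison
h_ns <- (24 * sqrt(pi))^(1 / 3) * sd(x) * n^(-1 / 3)

c(dpih = h_dpi, normal_scale = h_ns)

# Histogram breaks built from the plug-in bin width, covering the data
bins <- seq(min(x) - h_dpi, max(x) + h_dpi, by = h_dpi)
counts <- hist(x, breaks = bins, plot = FALSE)$counts
```

On clearly non-normal data, such as a multimodal sample, the two bin widths can differ much more.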
This method for selecting the bin width of a histogram is described in Wand (1997). It is an extension of the normal scale rule of Scott (1979) and uses plug-in ideas from bandwidth selection for kernel density estimation (e.g. Sheather and Jones, 1991).
Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610.
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683–690.
Wand, M. P. (1997). Data-based choice of histogram binwidth. The American Statistician, 51, 59–64.
data(geyser, package = "MASS")
x <- geyser$duration
h <- dpih(x)
bins <- seq(min(x) - h, max(x) + h, by = h)
hist(x, breaks = bins)
Use direct plug-in methodology to select the bandwidth of a kernel density estimate.
dpik(x, scalest = "minim", level = 2L, kernel = "normal", canonical = FALSE, gridsize = 401L, range.x = range(x), truncate = TRUE)
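x | numeric vector containing the sample on which the kernel density estimate is to be constructed. |
scalest | estimate of scale. "stdev": the standard deviation is used. "iqr": the inter-quartile range divided by 1.349 is used. "minim": the minimum of "stdev" and "iqr" is used. |
level | number of levels of functional estimation used in the plug-in rule. |
kernel | character string which determines the smoothing kernel. kernel can be "normal" (the Gaussian density function, the default), "box" (a rectangular box), "epanech" (the centred beta(2,2) density), "biweight" (the centred beta(3,3) density) or "triweight" (the centred beta(4,4) density). |
canonical | logical flag: if TRUE, canonically scaled kernels are used. |
gridsize | the number of equally-spaced points over which binning is performed to obtain kernel functional approximation. |
range.x | vector containing the minimum and maximum values of x at which to compute the estimate. |
truncate | logical flag: if TRUE, data with x values outside the range specified by range.x are ignored. |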
The direct plug-in approach, where unknown functionals that appear in expressions for the asymptotically optimal bandwidths are replaced by kernel estimates, is used. The normal distribution is used to provide an initial estimate.
the selected bandwidth.
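Since a normal distribution supplies the initial estimate, the selected bandwidth can be compared with the plain normal scale bandwidth for a Gaussian kernel, (4 / (3 n))^(1/5) * s. The following sketch makes that comparison on a simulated normal sample and then passes the plug-in bandwidth to bkde. The data, seed and variable names are assumptions for illustration, and the normal scale value is computed by hand rather than by a KernSmooth function.

```r
library(KernSmooth)

set.seed(11)
x <- rnorm(500)   # simulated normal sample (illustrative)
n <- length(x)

# Direct plug-in bandwidth with the default settings
h_dpi <- dpik(x)

# Normal scale bandwidth for a Gaussian kernel, computed by hand
h_ns <- (4 / (3 * n))^(1 / 5) * sd(x)

c(dpik = h_dpi, normal_scale = h_ns)

# Use the selected bandwidth in a binned kernel density estimate
est <- bkde(x, bandwidth = h_dpi)
```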
This method for selecting the bandwidth of a kernel density estimate was proposed by Sheather and Jones (1991) and is described in Section 3.6 of Wand and Jones (1995).
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683–690.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
data(geyser, package = "MASS")
x <- geyser$duration
h <- dpik(x)
est <- bkde(x, bandwidth = h)
plot(est, type = "l")
Use direct plug-in methodology to select the bandwidth of a local linear Gaussian kernel regression estimate, as described by Ruppert, Sheather and Wand (1995).
dpill(x, y, blockmax = 5, divisor = 20, trim = 0.01, proptrun = 0.05, gridsize = 401L, range.x, truncate = TRUE)
x | numeric vector of x data. Missing values are not accepted. |
y | numeric vector of y data. This must be the same length as x, and missing values are not accepted. |
blockmax | the maximum number of blocks of the data for construction of an initial parametric estimate. |
divisor | the value that the sample size is divided by to determine a lower limit on the number of blocks of the data for construction of an initial parametric estimate. |
trim | the proportion of the sample trimmed from each end in the x direction before application of the plug-in methodology. |
proptrun | the proportion of the range of x at each end truncated in the functional estimates. |
gridsize | number of equally-spaced grid points over which the function is to be estimated. |
range.x | vector containing the minimum and maximum values of x at which to compute the estimate. |
truncate | logical flag: if TRUE, data with x values outside the range specified by range.x are ignored. |
The direct plug-in approach, where unknown functionals that appear in expressions for the asymptotically optimal bandwidths are replaced by kernel estimates, is used. The kernel is the standard normal density. Least squares quartic fits over blocks of data are used to obtain an initial estimate. Mallows's Cp is used to select the number of blocks.
the selected bandwidth.
If there are severe irregularities (i.e. outliers, sparse regions) in the x values then the local polynomial smooths required for the bandwidth selection algorithm may become degenerate and the function will crash. Outliers in the y direction may lead to deterioration of the quality of the selected bandwidth.
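For contrast with the problem cases just described, the sketch below runs the selector on well-behaved data: evenly spread x values and a smooth regression function with moderate noise. Everything about the data (the sine curve, noise level, sample size and seed) is an assumption chosen for illustration. Because the true curve is known here, the fit can be compared against it on the output grid.

```r
library(KernSmooth)

set.seed(5)
n <- 400
x <- runif(n)                               # no sparse regions or outliers in x
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # smooth signal plus noise

# Direct plug-in bandwidth for local linear regression
h <- dpill(x, y)

# Local linear fit using the selected bandwidth
fit <- locpoly(x, y, bandwidth = h)

# Mean absolute error of the fit against the known true curve
err <- mean(abs(fit$y - sin(2 * pi * fit$x)))
c(bandwidth = h, mean_abs_error = err)
```

With real data, checking is.finite(h) before calling locpoly is a cheap guard against a degenerate selection.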
Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90, 1257–1270.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
data(geyser, package = "MASS")
x <- geyser$duration
y <- geyser$waiting
plot(x, y)
h <- dpill(x, y)
fit <- locpoly(x, y, bandwidth = h)
lines(fit)
Estimates a probability density function, regression function or their derivatives using local polynomials. A fast binned implementation over an equally-spaced grid is used.
locpoly(x, y, drv = 0L, degree, kernel = "normal", bandwidth, gridsize = 401L, bwdisc = 25, range.x, binned = FALSE, truncate = TRUE)
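x | numeric vector of x data. Missing values are not accepted. |
bandwidth | the kernel bandwidth smoothing parameter. It may be a single number or an array having length gridsize, representing a bandwidth that varies according to the location of estimation. |
y | vector of y data. This must be the same length as x, and missing values are not accepted. |
drv | order of derivative to be estimated. |
degree | degree of local polynomial used. Its value must be greater than or equal to the value of drv. The default value is drv + 1. |
kernel | "normal", the Gaussian density function. Currently ignored. |
gridsize | number of equally-spaced grid points over which the function is to be estimated. |
bwdisc | number of logarithmically-equally-spaced bandwidths on which bandwidth is discretised, to speed up computation. |
range.x | vector containing the minimum and maximum values of x at which to compute the estimate. |
binned | logical flag: if TRUE, then x and y are taken to be grid counts rather than raw data. |
truncate | logical flag: if TRUE, data with x values outside the range specified by range.x are ignored. |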
If y is specified, a local polynomial regression estimate of E[Y|X] (or its derivative) is computed. If y is missing, a local polynomial estimate of the density of x (or its derivative) is computed.
a list containing the following components:
x | vector of sorted x values at which the estimate was computed. |
y | vector of smoothed estimates for either the density or the regression at the corresponding x. |
Local polynomial fitting with a kernel weight is used to estimate either a density, regression function or their derivatives. In the case of density estimation, the data are binned and the local fitting procedure is applied to the bin counts. In either case, a binned approximation over an equally-spaced grid is used for fast computation. The bandwidth may be either scalar or a vector of length gridsize.
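The sketch below illustrates two of the options mentioned above: estimating a first derivative, and supplying a bandwidth vector of length gridsize. The data are simulated from a known sine curve so the derivative estimate can be compared with the true derivative. The bandwidth values, the linear bandwidth ramp hvec and all variable names are assumptions made for illustration, not recommendations.

```r
library(KernSmooth)

set.seed(9)
n <- 500
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)   # simulated regression data

# First derivative of the regression function; degree must be >= drv
fit1 <- locpoly(x, y, drv = 1L, degree = 2L, bandwidth = 0.08)

# True derivative on the output grid, compared away from the boundaries
true1    <- 2 * pi * cos(2 * pi * fit1$x)
interior <- fit1$x > 0.1 & fit1$x < 0.9
r1 <- cor(fit1$y[interior], true1[interior])
r1

# Location-varying bandwidth: one value per grid point
M    <- 401L
hvec <- seq(0.05, 0.15, length.out = M)
fitv <- locpoly(x, y, bandwidth = hvec, gridsize = M)
```

Derivative estimates are typically noisier than estimates of the function itself, so a somewhat larger bandwidth than for drv = 0 is often reasonable.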
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
bkde, density, dpill, ksmooth, loess, smooth, supsmu.
data(geyser, package = "MASS")
# local linear density estimate
x <- geyser$duration
est <- locpoly(x, bandwidth = 0.25)
plot(est, type = "l")
# local linear regression estimate
y <- geyser$waiting
plot(x, y)
fit <- locpoly(x, y, bandwidth = 0.25)
lines(fit)