Title: | One-Sided Cross-Validation |
---|---|
Description: | Functions for implementing different versions of the OSCV method in the kernel regression and density estimation frameworks. The package mainly supports the following articles: (1) Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, <doi:10.1007/s00180-017-0713-7> and (2) Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth density functions, <arXiv:1703.05157>. |
Authors: | Olga Savchuk |
Maintainer: | Olga Savchuk <[email protected]> |
License: | GPL-2 |
Version: | 1.0 |
Built: | 2025-02-08 03:52:20 UTC |
Source: | https://github.com/cran/OSCV |
Computing , the value of the ASE function for the local linear estimator in the regression context, for the given vector of
values.
ASE_reg(h, desx, y, rx)
ASE_reg(h, desx, y, rx)
h |
numerical vector of bandwidth values, |
desx |
numerical vecror of design points, |
y |
numerical vecror of data points corresponding to the design points |
rx |
numerical vecror of values of the regression function at |
The average squared error (ASE) is used as a measure of performace of the local linear estimator based on the Gaussian kernel.
The vector of values of for the correponsing vector of
values.
Hart, J.D. and Yi, S. (1998) One-sided cross-validation. Journal of the American Statistical Association, 93(442), 620-631.
loclin
, h_ASE_reg
, CV_reg
, OSCV_reg
.
## Not run: # Example (ASE function for a random sample of size n=100 generated from the function reg3 that # has six cusps. The function originates from the article of Savchuk et al. (2013). # The level of the added Gaussian noise is sigma=1/1000). n=100 dx=(1:n-0.5)/n regx=reg3(dx) ydat=regx+rnorm(n,sd=1/1000) harray=seq(0.003,0.05,len=300) ASEarray=ASE_reg(harray,dx,ydat,regx) hmin=round(h_ASE_reg(dx,ydat,regx),digits=4) dev.new() plot(harray,ASEarray,'l',lwd=3,xlab="h",ylab="ASE",main="ASE function for a random sample from r3",cex.lab=1.7,cex.axis=1.7,cex.main=1.5) legend(0.029,0.0000008,legend=c("n=100","sigma=1/1000"),cex=1.7,bty="n") legend(0.005,0.000002,legend=paste("h_ASE=",hmin),cex=2,bty="n") ## End(Not run)
## Not run: # Example (ASE function for a random sample of size n=100 generated from the function reg3 that # has six cusps. The function originates from the article of Savchuk et al. (2013). # The level of the added Gaussian noise is sigma=1/1000). n=100 dx=(1:n-0.5)/n regx=reg3(dx) ydat=regx+rnorm(n,sd=1/1000) harray=seq(0.003,0.05,len=300) ASEarray=ASE_reg(harray,dx,ydat,regx) hmin=round(h_ASE_reg(dx,ydat,regx),digits=4) dev.new() plot(harray,ASEarray,'l',lwd=3,xlab="h",ylab="ASE",main="ASE function for a random sample from r3",cex.lab=1.7,cex.axis=1.7,cex.main=1.5) legend(0.029,0.0000008,legend=c("n=100","sigma=1/1000"),cex=1.7,bty="n") legend(0.005,0.000002,legend=paste("h_ASE=",hmin),cex=2,bty="n") ## End(Not run)
Computing the OSCV smooth rescaling constant that corresponds to using the two-sided kernel H_I
for the cross-validation purposes and the Gaussian kernel in the estimation stage. The constant is applicable for the OSCV versions in the regression and kernel density estimation contexts.
C_smooth(alpha, sigma)
C_smooth(alpha, sigma)
alpha |
first parameter of the two-sided cross-validation kernel |
sigma |
second parameter of the two-sided cross-validation kernel |
Computation of the OSCV rescaling constant (see (10) in Savchuk and Hart (2017) or (3) in Savchuk (2017)). The constant is a function of the parameters
of the two-sided cross-validation kernel
H_I
defined by expression (15) in Savchuk and Hart (2017). The Gaussian kernel is used for computing the ultimate (regression or density) estimate. The constant is used in the OSCV versions for kernel regression and density estimation. Notice that in the cases ,
and
,
the kernel
H_I
reduces to the Gaussian kernel.
The OSCV smooth rescaling constant for the given values of the parameters
and
.
Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, doi:10.1007/s00180-017-0713-7.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth density functions, arXiv:1703.05157.
L_I
, H_I
, OSCV_reg
, h_OSCV_reg
, OSCV_LI_dens
, OSCV_Gauss_dens
, h_OSCV_dens
, loclin
.
# OSCV rescaling constant for the robust cross-validation kernel with # (alpha,sigma)=(16.8954588,1.01). C_smooth(16.8954588,1.01) # OSCV smooth rescaling constant in the case when the kernel H_I is Gaussian. C_smooth(1,1)
# OSCV rescaling constant for the robust cross-validation kernel with # (alpha,sigma)=(16.8954588,1.01). C_smooth(16.8954588,1.01) # OSCV smooth rescaling constant in the case when the kernel H_I is Gaussian. C_smooth(1,1)
Computing , the value of the CV function in the regression context.
CV_reg(h, desx, y)
CV_reg(h, desx, y)
h |
numerical vector of bandwidth values, |
desx |
numerical vecror of design points, |
y |
numerical vecror of data values corresponding to the design points |
The CV function is a measure of fit of the regression estimate to the data. The local linear estimator based on the Gaussian kernel is used. The cross-validation bandwidth is the minimizer of the CV function.
The vector of values of for the correponsing vector of
values.
Stone, C.J. (1977) Consistent nonparametric regression. Annals of Statistics, 5(4), 595-645.
loclin
, h_ASE_reg
, ASE_reg
, OSCV_reg
.
## Not run: # Example (Old Faithful geyser). Take x=waiting time; y=eruption duration. The sample size n=272. xdat=faithful[[2]] ydat=faithful[[1]] harray=seq(0.5,10,len=100) cv=CV_reg(harray,xdat,ydat) R=range(xdat) h_cv=round(optimize(CV_reg,c(0.01,(R[2]-R[1]/4)),desx=xdat,y=ydat)$minimum,digits=4) dev.new() plot(harray,cv,'l',lwd=3,xlab="h",ylab="CV(h)",main="CV function for the Old Faithful geyser data", cex.lab=1.7,cex.axis=1.7,cex.main=1.5) legend(6,0.155,legend="n=272",cex=1.8,bty="n") legend(1,0.18,legend=paste("h_CV=",h_cv),cex=2,bty="n") ## End(Not run)
## Not run: # Example (Old Faithful geyser). Take x=waiting time; y=eruption duration. The sample size n=272. xdat=faithful[[2]] ydat=faithful[[1]] harray=seq(0.5,10,len=100) cv=CV_reg(harray,xdat,ydat) R=range(xdat) h_cv=round(optimize(CV_reg,c(0.01,(R[2]-R[1]/4)),desx=xdat,y=ydat)$minimum,digits=4) dev.new() plot(harray,cv,'l',lwd=3,xlab="h",ylab="CV(h)",main="CV function for the Old Faithful geyser data", cex.lab=1.7,cex.axis=1.7,cex.main=1.5) legend(6,0.155,legend="n=272",cex=1.8,bty="n") legend(1,0.18,legend=paste("h_CV=",h_cv),cex=2,bty="n") ## End(Not run)
Nonsmooth density with seven cusps introduced in the article of Savchuk (2017).
fstar(u)
fstar(u)
u |
numerical vecror of argument values in the range [-3,3]. |
The function consists of straight lines with different slopes connected together. The support of the density is [-3,3].
The vector of values of corresponding to the values of the vector
.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth density functions, arXiv:1703.05157.
## Not run: dev.new() plot(seq(-3.5,3.5,len=1000),fstar(seq(-3.5,3.5,len=1000)),'l',lwd=3, main="Nonsmooth density fstar with seven cusps", xlab="argument", ylab="density",cex.main=1.5, cex.axis=1.7,cex.lab=1.7) ## End(Not run)
## Not run: dev.new() plot(seq(-3.5,3.5,len=1000),fstar(seq(-3.5,3.5,len=1000)),'l',lwd=3, main="Nonsmooth density fstar with seven cusps", xlab="argument", ylab="density",cex.main=1.5, cex.axis=1.7,cex.lab=1.7) ## End(Not run)
Computing the ASE-optimal bandwidth for the Gaussian local linear regression estimator.
h_ASE_reg(desx, y, rx)
h_ASE_reg(desx, y, rx)
desx |
numerical vecror of design points, |
y |
numerical vecror of data points corresponding to the design points |
rx |
numerical vecror of the regression function values at |
Computing the ASE-optimal bandwidth for the local linear estimator in the regression context. The ASE-optimal bandwidth is the global minimizer of the ASE function ASE_reg
. This bandwidth is optimal for the data set at hand.
The ASE-optimal bandwidth (scalar).
## Not run: # Simulated example. n=300 dx=runif(n) #uniform design regx=5*dx^10*(1-dx)^2+2.5*dx^2*(1-dx)^10 ydat=regx+rnorm(n,sd=1/250) hase=round(h_ASE_reg(dx,ydat,regx),digits=4) u=seq(0,1,len=1000) fun=5*u^10*(1-u)^2+2.5*u^2*(1-u)^10 dev.new() plot(dx,ydat,pch=20,cex=1.5,xlab="argument",ylab="function",cex.lab=1.7,cex.axis=1.7, main="Function, data, and the ASE-optimal bandwidth",cex.main=1.5) lines(u,fun,'l',lwd=3,col="blue") legend(0,0.03,legend=paste("h_ASE=",hase),cex=1.8,bty="n") legend(0.6,-0.002,legend=paste("n=",n),cex=2,bty="n") ## End(Not run)
## Not run: # Simulated example. n=300 dx=runif(n) #uniform design regx=5*dx^10*(1-dx)^2+2.5*dx^2*(1-dx)^10 ydat=regx+rnorm(n,sd=1/250) hase=round(h_ASE_reg(dx,ydat,regx),digits=4) u=seq(0,1,len=1000) fun=5*u^10*(1-u)^2+2.5*u^2*(1-u)^10 dev.new() plot(dx,ydat,pch=20,cex=1.5,xlab="argument",ylab="function",cex.lab=1.7,cex.axis=1.7, main="Function, data, and the ASE-optimal bandwidth",cex.main=1.5) lines(u,fun,'l',lwd=3,col="blue") legend(0,0.03,legend=paste("h_ASE=",hase),cex=1.8,bty="n") legend(0.6,-0.002,legend=paste("n=",n),cex=2,bty="n") ## End(Not run)
.The family of two-sided cross-validation kernels defined by equation (15) of Savchuk and Hart (2017).
H_I(u, alpha, sigma)
H_I(u, alpha, sigma)
u |
numerical vector of argument values, |
alpha |
first parameter of the cross-validation kernel |
sigma |
second parameter of the cross-validation kernel |
The family of the two-sided cross-validation kernels , where
denotes the Gaussian kernel,
and
are the parameters of the kernel. See expression (15) of Savchuk and Hart (2017). The robust kernel plotted in Figure 1 of Savchuk and Hart (2017) is obtained by setting
and
. Note that the kernels
are also used for the bandwidth selection purposes in the indirect cross-validation (ICV) method (see expression (4) of Savchuk, Hart, and Sheather (2010)). The kernel
is a two-sided analog of the one-sided kernel
L_I
. The Gaussian kernel is the special case of
obtained by either setting
or
.
The value of .
Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, doi:10.1007/s00180-017-0713-7.
Savchuk, O.Y., Hart, J.D., Sheather, S.J. (2010). Indirect cross-validation for density estimation. Journal of the American Statistical Association, 105(489), 415-423.
L_I
, C_smooth
, OSCV_reg
, loclin
.
## Not run: # Plotting the robust kernel from Savchuk and Hart (2017) with alpha=16.8954588 and sigma=1.01. u=seq(-5,5,len=1000) ker=H_I(u,16.8954588,1.01) dev.new() plot(u,ker,'l',lwd=3,cex.axis=1.7, cex.lab=1.7) title(main="Robust kernel H_I along with the Gaussian kernel (phi)",cex=1.7) lines(u,dnorm(u),lty="dashed",lwd=3) legend(-4.85,0.3,lty=c("solid","dashed"),lwd=c(3,3),legend=c("H_I","phi"),cex=1.5) legend(1,0.4,legend=c("alpha=16.8955","sigma=1.01"),cex=1.5,bty="n") ## End(Not run)
## Not run: # Plotting the robust kernel from Savchuk and Hart (2017) with alpha=16.8954588 and sigma=1.01. u=seq(-5,5,len=1000) ker=H_I(u,16.8954588,1.01) dev.new() plot(u,ker,'l',lwd=3,cex.axis=1.7, cex.lab=1.7) title(main="Robust kernel H_I along with the Gaussian kernel (phi)",cex=1.7) lines(u,dnorm(u),lty="dashed",lwd=3) legend(-4.85,0.3,lty=c("solid","dashed"),lwd=c(3,3),legend=c("H_I","phi"),cex=1.5) legend(1,0.4,legend=c("alpha=16.8955","sigma=1.01"),cex=1.5,bty="n") ## End(Not run)
Computing the OSCV bandwidth for the Gaussian density estimator. The one-sided Gaussian kernel is used in the bandwidth selection stage. The (anticipated) smoothness of the density function is to be specified by the user.
h_OSCV_dens(dat, stype)
h_OSCV_dens(dat, stype)
dat |
numerical vecror of data values, |
stype |
specifies (anticipated) smoothness of the density function. Thus, |
Computing the OSCV bandwidth for the data vector . The one-sided Gaussian kernel
is used for the cross-validation purposes and the Gaussian kernel is used for computing the ultimate density estimate. The (anticipated) smoothness of the underlying density function is to be specified. Thus,
corresponds to the smooth density;
corresponds to the nonsmooth density.
It is usually assumed that the density is smooth if no preliminary information about its nonsmoothness is available. No additional rescaling of the computed bandwidth is needed. The smoothness of the density function , essentially, determines the value of the bandwidth rescaling constant that is used in the body of the function. Thus, the constant is equal to 0.6168471 in the smooth case, whereas it is equal to 0.5730 in the nonsmooth case. See Savchuk (2017) for details. The OSCV bandwidth is the minimizer of the OSCV function
OSCV_Gauss_dens
.
The OSCV bandwidth (scalar).
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth densty functions, arXiv:1703.05157.
OSCV_Gauss_dens
, C_smooth
, h_OSCV_reg
.
## Not run: data=faithful[,1] # Data on n=272 eruption duration of the Old Faithful geyser. harray=seq(0.025,0.6,len=100) OSCV_array=OSCV_Gauss_dens(harray,data,0) dev.new() plot(harray,OSCV_array,lwd=3,'l',xlab="h",ylab="L_G-based OSCV", main="OSCV_G(h) for the data on eruption duration",cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_oscv=round(h_OSCV_dens(data,0),digits=4) #smoothness of the underlying density is assumed legend(0.04,-0.25,legend=c("n=272",paste("h_OSCV=",h_oscv)),cex=2,bty="n") ## End(Not run)
## Not run: data=faithful[,1] # Data on n=272 eruption duration of the Old Faithful geyser. harray=seq(0.025,0.6,len=100) OSCV_array=OSCV_Gauss_dens(harray,data,0) dev.new() plot(harray,OSCV_array,lwd=3,'l',xlab="h",ylab="L_G-based OSCV", main="OSCV_G(h) for the data on eruption duration",cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_oscv=round(h_OSCV_dens(data,0),digits=4) #smoothness of the underlying density is assumed legend(0.04,-0.25,legend=c("n=272",paste("h_OSCV=",h_oscv)),cex=2,bty="n") ## End(Not run)
Computing the OSCV bandwidth for the Gaussian local linear regression estimator. The Gaussian kernel is used in the bandwidth selection stage. The smoothness of the regression function is to be specified by the user.
h_OSCV_reg(desx, y, stype)
h_OSCV_reg(desx, y, stype)
desx |
numerical vecror of design points, |
y |
numerical vecror of data points corresponding to the design points |
stype |
smoothness of the regression function: ( |
Computing the OSCV bandwidth for the data vector . The Gaussian kernel is used for the cross-validation purposes and in the stage of computing the resulting local linear regression estimate. No additional rescaling of the computed bandwidth is needed. The smoothness of the regression function
, essentially, determines the value of the bandwidth rescaling constant that is chosen in the body of the function. Thus, the constant is equal to 0.6168471 in the smooth case, and 0.5730 in the nonsmooth case. See Savchuk, Hart and Sheather (2016). The OSCV bandwidth is the minimizer of the OSCV function
OSCV_reg
.
The OSCV bandwidth (scalar).
Hart, J.D. and Yi, S. (1998). One-sided cross-validation. Journal of the American Statistical Association, 93(442), 620-631.
Savchuk, O.Y., Hart, J.D., Sheather, S.J. (2013). One-sided cross-validation for nonsmooth regression functions. Journal of Nonparametric Statistics, 25(4), 889-904.
Savchuk, O.Y., Hart, J.D., Sheather, S.J. (2016). Corrigendum to "One-sided cross-validation for nonsmooth regression functions". Journal of Nonparametric Statistics, 28(4), 875-877.
Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, doi:10.1007/s00180-017-0713-7.
OSCV_reg
, loclin
, C_smooth
, h_OSCV_dens
, h_ASE_reg
.
## Not run: # Example (Old Faithful geyser) xdat=faithful[[2]] # waiting time ydat=faithful[[1]] # eruption duration u=seq(40,100,len=1000) h_oscv=round(h_OSCV_reg(xdat,ydat,0),digits=4) l=loclin(u,xdat,ydat,h_oscv) dev.new() plot(xdat,ydat,pch=20,cex=1.5,cex.axis=1.7,cex.lab=1.7,xlab="waiting time", ylab="eruption duration") lines(u,l,'l',lwd=3) title(main="Data and LLE",cex.main=1.7) legend(35,5,legend=paste("h_OSCV=",h_oscv),cex=2,bty="n") legend(80,3,legend="n=272",cex=2,bty="n") ## End(Not run)
## Not run: # Example (Old Faithful geyser) xdat=faithful[[2]] # waiting time ydat=faithful[[1]] # eruption duration u=seq(40,100,len=1000) h_oscv=round(h_OSCV_reg(xdat,ydat,0),digits=4) l=loclin(u,xdat,ydat,h_oscv) dev.new() plot(xdat,ydat,pch=20,cex=1.5,cex.axis=1.7,cex.lab=1.7,xlab="waiting time", ylab="eruption duration") lines(u,l,'l',lwd=3) title(main="Data and LLE",cex.main=1.7) legend(35,5,legend=paste("h_OSCV=",h_oscv),cex=2,bty="n") legend(80,3,legend="n=272",cex=2,bty="n") ## End(Not run)
fstar
.Computing the ISE function for the Gaussian density estimator obtained from a random sample of size generated from
fstar
.
ISE_fstar(h, n)
ISE_fstar(h, n)
h |
numerical vector of bandwidth values, |
n |
sample size (number of data points generated from |
The integrated squared error (ISE) is a measure of closeness of the Gaussian density estimate computed from a data set generated from fstar
to the true density.
The vector of values of the ISE function for the correponsing vector of values.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth density functions, arXiv:1703.05157.
## Not run: dev.new() harray=seq(0.05,1.5,len=1000) ISEarray=ISE_fstar(harray,100) h_ISE=round(harray[which.min(ISEarray)],digits=4) dev.new() plot(harray,ISEarray,lwd=3,'l',xlab="h",ylab="ISE",main="ISE(h)",cex.main=2,cex.lab=1.7, cex.axis=1.7) legend(0.35,ISEarray[5],legend=c("n=100",paste("h_ISE=",h_ISE)),cex=1.8,bty="n") ## End(Not run)
## Not run: dev.new() harray=seq(0.05,1.5,len=1000) ISEarray=ISE_fstar(harray,100) h_ISE=round(harray[which.min(ISEarray)],digits=4) dev.new() plot(harray,ISEarray,lwd=3,'l',xlab="h",ylab="ISE",main="ISE(h)",cex.main=2,cex.lab=1.7, cex.axis=1.7) legend(0.35,ISEarray[5],legend=c("n=100",paste("h_ISE=",h_ISE)),cex=1.8,bty="n") ## End(Not run)
.The one-sided counterpart of the kernel H_I
. See expressions (15) and (8) of Savchuk and Hart (2017).
L_I(u, alpha, sigma)
L_I(u, alpha, sigma)
u |
numerical vector of argument values, |
alpha |
first parameter of the cross-validation kernel |
sigma |
second parameter of the cross-validation kernel |
The family of the one-sided cross-validation kernels indexed by the parameters
and
. This family is used in the OSCV implementations in both regression context (see Savchuk and Hart (2017)) and density estimation context (see Savchuk (2017)). The special members of the family:
The robust kernel used in Savchuk and Hart (2017) and Savchuk (2017) is obtained by setting and
;
The one-sided Gaussian kernel is obtained by either setting
for any
or by setting
for any
.
The bandwidth selected by should be multiplied by a reascaling constant before it is used in computing the ultimate Gaussian (regression or density) estimate. In the case of a smooth (regression or density) function the rescaling constant is
C_smooth
.
The value of .
Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, doi:10.1007/s00180-017-0713-7.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth density functions, arXiv:1703.05157.
## Not run: # Plotting the robust one-sided kernel from Savchuk and Hart (2017) with # alpha=16.8954588 and sigma=1.01. u=seq(-1,5,len=1000) rker=L_I(u,16.8954588,1.01) Gker=L_I(u,0,1) dev.new() plot(u,rker,'l',lwd=3,cex.axis=1.7, cex.lab=1.7) title(main="One-sided kernels: L_I (robust) and L_G",cex=1.7) lines(u,Gker,lty="dashed",lwd=3) legend(0.5,2.5,lty=c("solid","dashed"),lwd=c(3,3),legend=c("L_I","L_G"),cex=1.7) legend(2,1.5,legend=c("alpha=16.8955","sigma=1.01"),cex=1.5) ## End(Not run)
## Not run: # Plotting the robust one-sided kernel from Savchuk and Hart (2017) with # alpha=16.8954588 and sigma=1.01. u=seq(-1,5,len=1000) rker=L_I(u,16.8954588,1.01) Gker=L_I(u,0,1) dev.new() plot(u,rker,'l',lwd=3,cex.axis=1.7, cex.lab=1.7) title(main="One-sided kernels: L_I (robust) and L_G",cex=1.7) lines(u,Gker,lty="dashed",lwd=3) legend(0.5,2.5,lty=c("solid","dashed"),lwd=c(3,3),legend=c("L_I","L_G"),cex=1.7) legend(2,1.5,legend=c("alpha=16.8955","sigma=1.01"),cex=1.5) ## End(Not run)
Computing the LLE based on data over the given vector of the argument values
. The Gausssian kernel is used. See expression (3) in Savchuk and Hart (2017).
loclin(u, desx, y, h)
loclin(u, desx, y, h)
u |
numerical vector of argument values, |
desx |
numerical vecror of design points, |
y |
numerical vecror of data values (corresponding to the specified design points |
h |
numerical bandwidth value (scalar). |
Computing the LLE based on the Gaussian kernel for the specified vector of the argument values and given vectors of design points
and the corresponding data values
.
Numerical vector of the LLE values computed over the specified vector of points.
Clevelend, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368), 829-836.
Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, doi:10.1007/s00180-017-0713-7.
OSCV_reg
, h_OSCV_reg
, ASE_reg
, h_ASE_reg
, CV_reg
.
## Not run: # Example (simulated data). n=200 dx=(1:n-0.5)/n regf=2*dx^10*(1-dx)^2+dx^2*(1-dx)^10 u=seq(0,1,len=1000) ydat=regf+rnorm(n,sd=0.002) dev.new() plot(dx,regf,'l',lty="dashed",lwd=3,xlim=c(0,1),ylim=c(1.1*min(ydat),1.1*max(ydat)), cex.axis=1.7,cex.lab=1.7) title(main="Function, generated data, and LLE",cex.main=1.5) points(dx,ydat,pch=20,cex=1.5) lines(u,loclin(u,dx,ydat,0.05),lwd=3,col="blue") legend(0,1.1*max(ydat),legend=c("LLE based on h=0.05","true regression function"), lwd=c(2,3),lty=c("solid","dashed"),col=c("blue","black"),cex=1.5,bty="n") legend(0.7,0.5*min(ydat),legend="n=200",cex=1.7,bty="n") ## End(Not run)
## Not run: # Example (simulated data). n=200 dx=(1:n-0.5)/n regf=2*dx^10*(1-dx)^2+dx^2*(1-dx)^10 u=seq(0,1,len=1000) ydat=regf+rnorm(n,sd=0.002) dev.new() plot(dx,regf,'l',lty="dashed",lwd=3,xlim=c(0,1),ylim=c(1.1*min(ydat),1.1*max(ydat)), cex.axis=1.7,cex.lab=1.7) title(main="Function, generated data, and LLE",cex.main=1.5) points(dx,ydat,pch=20,cex=1.5) lines(u,loclin(u,dx,ydat,0.05),lwd=3,col="blue") legend(0,1.1*max(ydat),legend=c("LLE based on h=0.05","true regression function"), lwd=c(2,3),lty=c("solid","dashed"),col=c("blue","black"),cex=1.5,bty="n") legend(0.7,0.5*min(ydat),legend="n=200",cex=1.7,bty="n") ## End(Not run)
, the one-sided Epanechnikov kernel, in the kernel density estimation (KDE) context.Computing the values of the -based OSCV function in the density estimation context. See Martinez-Miranda et al. (2009) and Savchuk (2017).
OSCV_Epan_dens(h, dat)
OSCV_Epan_dens(h, dat)
h |
numerical vector of bandwidth values, |
dat |
numerical vecror of data values. |
Computing the values of the OSCV function for the given bandwidth vector and data vector
. The function is based on the one-sided Epanechnikov kernel
. The function's minimizer is to be multiplied by the appropriate rescaling constant before it can be used to compute the ultimate kernel density estimate. The formula for the rescaling constant depends on smothness of the density and on the kernel used in computing the ultimate density estimate.
The vector of values of the OSCV function for the correponsing vector of values.
Martinez-Miranda, M.D., Nielsen, J. P., and Sperlich, S. (2009). One sided cross validation for density estimation. In Operational Risk Towards Basel III: Best Practices and Issues in Modeling, Management and Regulation, 177-196.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth densty functions, arXiv:1703.05157.
OSCV_Gauss_dens
, OSCV_LI_dens
.
## Not run: # Example 1 (Data on n=272 eruption duration of the Old Faithful geyser). data=faithful[,1] har=seq(0.05,1,len=1000) dev.new() plot(har,OSCV_Epan_dens(har,data),lwd=3,'l',xlab="h",ylab="L_E-based OSCV", main="L_E_based OSCV for the data on eruption duration",cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_min=round(optimize(OSCV_Epan_dens,c(0.001,1),tol=0.001,dat=data)$minimum, digits=4) legend(0.1,-0.1,legend=c("n=272",paste("h_min=",h_min)),cex=2) # The above graph appears in Savchuk (2017). # Example 2 (Data set of size n=100 is generated from the standard normal density). dat_norm=rnorm(100) harray=seq(0.25,4.25,len=1000) OSCVarray=OSCV_Epan_dens(harray,dat_norm) dev.new() plot(harray,OSCVarray,lwd=3,'l',xlab="h",ylab="L_E-based OSCV", main="L_E-based OSCV for data generated from N(0,1)", cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_min_norm=round(optimize(OSCV_Epan_dens,c(0.1,4),tol=0.001,dat=dat_norm)$minimum, digits=4) legend(0.5,OSCVarray[1],legend=c("n=100",paste("h_min=",h_min_norm)),cex=2,bty="n") ## End(Not run)
## Not run: # Example 1 (Data on n=272 eruption duration of the Old Faithful geyser). data=faithful[,1] har=seq(0.05,1,len=1000) dev.new() plot(har,OSCV_Epan_dens(har,data),lwd=3,'l',xlab="h",ylab="L_E-based OSCV", main="L_E_based OSCV for the data on eruption duration",cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_min=round(optimize(OSCV_Epan_dens,c(0.001,1),tol=0.001,dat=data)$minimum, digits=4) legend(0.1,-0.1,legend=c("n=272",paste("h_min=",h_min)),cex=2) # The above graph appears in Savchuk (2017). # Example 2 (Data set of size n=100 is generated from the standard normal density). dat_norm=rnorm(100) harray=seq(0.25,4.25,len=1000) OSCVarray=OSCV_Epan_dens(harray,dat_norm) dev.new() plot(harray,OSCVarray,lwd=3,'l',xlab="h",ylab="L_E-based OSCV", main="L_E-based OSCV for data generated from N(0,1)", cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_min_norm=round(optimize(OSCV_Epan_dens,c(0.1,4),tol=0.001,dat=dat_norm)$minimum, digits=4) legend(0.5,OSCVarray[1],legend=c("n=100",paste("h_min=",h_min_norm)),cex=2,bty="n") ## End(Not run)
, the one-sided Gaussian kernel, in the kernel density estimation (KDE) context.Computing the values of the -based OSCV function in the density estimation context. See Savchuk (2017).
OSCV_Gauss_dens(h, dat, stype)
OSCV_Gauss_dens(h, dat, stype)
h |
numerical vector of bandwidth values, |
dat |
numerical vecror of data values, |
stype |
specifies (anticipated) smoothness of the density function. Thus, |
Computing the values of the OSCV function for the given bandwidth vector and data vector
. The function is based on the one-sided Gaussian kernel
. The (anticipated) smoothness of the underlying density function is to be specified. Thus,
corresponds to the smooth density;
corresponds to the nonsmooth density.
It is usually assumed that the density is smooth if no preliminary information about its nonsmoothness is available. The function's minimizer h_OSCV_dens
is to be used without additional rescaling to compute the ultimate Gaussian density estimate.
The vector of values of the OSCV function for the correponsing vector of values.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth densty functions, arXiv:1703.05157.
h_OSCV_dens
, OSCV_Epan_dens
, OSCV_LI_dens
, C_smooth
.
## Not run: dat_norm=rnorm(300) #generating random sample of size n=300 from the standard normal density. h_oscv=round(h_OSCV_dens(dat_norm,0),digits=4) y=density(dat_norm,bw=h_oscv) dev.new() plot(y,lwd=3,cex.lab=1.7,cex.axis=1.7,cex.main=1.7,xlab=paste("n=100, h_OSCV=",h_oscv), main="Standard normal density estimate by OSCV",ylim=c(0,0.45),xlim=c(-4.5,4.5)) u=seq(-5,5,len=1000) lines(u,dnorm(u),lwd=3,lty="dashed",col="blue") legend(0.75,0.4,legend=c("OSCV estimate","N(0,1) density"),lwd=c(3,3),lty=c("solid","dashed"), col=c("black","blue"),bty="n",cex=1.25) ## End(Not run)
## Not run: dat_norm=rnorm(300) #generating random sample of size n=300 from the standard normal density. h_oscv=round(h_OSCV_dens(dat_norm,0),digits=4) y=density(dat_norm,bw=h_oscv) dev.new() plot(y,lwd=3,cex.lab=1.7,cex.axis=1.7,cex.main=1.7,xlab=paste("n=100, h_OSCV=",h_oscv), main="Standard normal density estimate by OSCV",ylim=c(0,0.45),xlim=c(-4.5,4.5)) u=seq(-5,5,len=1000) lines(u,dnorm(u),lwd=3,lty="dashed",col="blue") legend(0.75,0.4,legend=c("OSCV estimate","N(0,1) density"),lwd=c(3,3),lty=c("solid","dashed"), col=c("black","blue"),bty="n",cex=1.25) ## End(Not run)
L_I
in the density estimation (KDE) context.Computing the values of the -based OSCV function in the density estimation context. See Savchuk (2017).
OSCV_LI_dens(h, dat, alpha, sigma)
OSCV_LI_dens(h, dat, alpha, sigma)
h |
numerical vector of bandwidth values, |
dat |
numerical vecror of data values, |
alpha |
first parameter of the kernel |
sigma |
second parameter of the kernel |
Computing the OSCV function for the given vector of bandwidth values and the data vector
. The function is based on the one-sided kernel
L_I
that depends on the parameters and
. The kernel
is robust in the special case of
and
. The other special case is obtained when either of the following holds:
for any
;
for any
.
In the above cases the kernel reduces to the one-sided Gaussian kernel
. The function's minimizer is to be used without additional rescaling to compute the ultimate Gaussian density estimate under the assumption that the underlying density is smooth.
The vector of values of the OSCV function for the correponsing vector of values.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth density functions, arXiv:1703.05157.
OSCV_Gauss_dens
, OSCV_Epan_dens
, C_smooth
, L_I
, H_I
.
## Not run: # Example 1 (Old Faithful geyser data) dev.new() data=faithful[,1] # Data on n=272 eruption duration of the Old Faithful geyser. harray=seq(0.025,0.6,len=50) alp=16.8954588 sig=1.01 plot(harray,OSCV_LI_dens(harray,data,alpha=alp,sigma=sig),lwd=3,'l',xlab="h", ylab="L_I-based OSCV",main="OSCV_LI(h) for eruption duration",cex.main=1.5,cex.lab=1.7, cex.axis=1.7) h_OSCV_LI=round(optimize(OSCV_LI_dens,c(0.001,0.5),tol=0.001,dat=data,alpha=16.8954588, sigma=1.01)$minimum,digits=4) legend(0.01,-0.2,legend=c("n=272",paste("h_OSCV_LI=",h_OSCV_LI)),cex=1.8,bty="n") legend(0.25,-0.33,legend=c("Parameters of L_I:", paste("alpha=",alp), paste("sigma=",sig)),cex=1.7,bty="n") # Example 2 (Simulated example) dat_norm=rnorm(100) #generating a random sample of size n=100 from the N(0,1) density harray=seq(0.05,1.5,len=100) OSCVarray=OSCV_LI_dens(harray,dat=dat_norm,16.8954588,1.01) dev.new() plot(harray,OSCVarray,lwd=3,'l',xlab="h",ylab="L_I-based OSCV", main="OSCV_LI(h) for data generated from N(0,1)",cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_OSCV_LI_norm=round(optimize(OSCV_LI_dens,c(0.001,1),tol=0.001, dat=dat_norm,16.8954588,1.01)$minimum,digits=4) legend(0,OSCVarray[1],legend=c("n=100",paste("h_OSCV_LI=",h_OSCV_LI_norm), "Parameters of the robust kernel L_I:","alpha=16.8954588", "sigma=1.01"),cex=1.5,bty="n") ## End(Not run)
## Not run: # Example 1 (Old Faithful geyser data) dev.new() data=faithful[,1] # Data on n=272 eruption duration of the Old Faithful geyser. harray=seq(0.025,0.6,len=50) alp=16.8954588 sig=1.01 plot(harray,OSCV_LI_dens(harray,data,alpha=alp,sigma=sig),lwd=3,'l',xlab="h", ylab="L_I-based OSCV",main="OSCV_LI(h) for eruption duration",cex.main=1.5,cex.lab=1.7, cex.axis=1.7) h_OSCV_LI=round(optimize(OSCV_LI_dens,c(0.001,0.5),tol=0.001,dat=data,alpha=16.8954588, sigma=1.01)$minimum,digits=4) legend(0.01,-0.2,legend=c("n=272",paste("h_OSCV_LI=",h_OSCV_LI)),cex=1.8,bty="n") legend(0.25,-0.33,legend=c("Parameters of L_I:", paste("alpha=",alp), paste("sigma=",sig)),cex=1.7,bty="n") # Example 2 (Simulated example) dat_norm=rnorm(100) #generating a random sample of size n=100 from the N(0,1) density harray=seq(0.05,1.5,len=100) OSCVarray=OSCV_LI_dens(harray,dat=dat_norm,16.8954588,1.01) dev.new() plot(harray,OSCVarray,lwd=3,'l',xlab="h",ylab="L_I-based OSCV", main="OSCV_LI(h) for data generated from N(0,1)",cex.main=1.5,cex.lab=1.7,cex.axis=1.7) h_OSCV_LI_norm=round(optimize(OSCV_LI_dens,c(0.001,1),tol=0.001, dat=dat_norm,16.8954588,1.01)$minimum,digits=4) legend(0,OSCVarray[1],legend=c("n=100",paste("h_OSCV_LI=",h_OSCV_LI_norm), "Parameters of the robust kernel L_I:","alpha=16.8954588", "sigma=1.01"),cex=1.5,bty="n") ## End(Not run)
Computing , the value of the OSCV function in the regression context, defined by expression (9) of Savchuk and Hart (2017).
OSCV_reg(b, desx, y, ktype)
OSCV_reg(b, desx, y, ktype)
b |
numerical vector of bandwidth values, |
desx |
numerical vecror of design points, |
y |
numerical vecror of data points corresponding to the design points |
ktype |
making choice between two cross-validation kernels: ( |
Computation of for given
(bandwidth vector) and the data values
corresponding to the design points
. No preliminary sorting of the data (according to the
variable) is needed. The value of
is used. Two choices of the two-sided cross-validation kernel are available:
() Gaussian kernel;
() robust kernel
H_I
defined by expression (15) of Savchuk and Hart (2017) with .
The vector of values of for the correponsing vector of
values.
Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, doi:10.1007/s00180-017-0713-7.
Hart, J.D. and Yi, S. (1998) One-sided cross-validation. Journal of the American Statistical Association, 93(442), 620-631.
h_OSCV_reg
, H_I
, loclin
, C_smooth
.
## Not run: # The Old Faithful geyser data set "faithful" is used. The sample size n=272. # The OSCV curves based on the Gaussian kernel and the robust kernel H_I (with # alpha=16.8954588 and sigma=1.01) are plotted. The horizontal scales of the curves # are changed such that their global minimizers are to be used in computing the # Gaussian local linear estimates of the regression function. xdat=faithful[[2]] #waiting time ydat=faithful[[1]] #eruption duration barray=seq(0.5,10,len=250) C_gauss=C_smooth(1,1) OSCV_gauss=OSCV_reg(barray/C_gauss,xdat,ydat,0) h_gauss=round(h_OSCV_reg(xdat,ydat,0),digits=4) dev.new() plot(barray,OSCV_gauss,'l',lwd=3,cex.lab=1.7,cex.axis=1.7,xlab="h",ylab="OSCV criterion") title(main="OSCV based on the Gaussian kernel",cex.main=1.7) legend(2.5,0.25,legend=paste("h_min=",h_gauss),cex=2,bty="n") C_H_I=C_smooth(16.8954588,1.01) OSCV_H_I=OSCV_reg(barray/C_H_I,xdat,ydat,1) h_H_I=round(barray[which.min(OSCV_H_I)],digits=4) dev.new() plot(barray,OSCV_H_I,'l',lwd=3,cex.lab=1.7,cex.axis=1.7,xlab="h",ylab="OSCV criterion", ylim=c(0.15,0.5)) title(main="OSCV based on the robust kernel H_I",cex.main=1.7) legend(2.5,0.4,legend=paste("h_min=",h_H_I),cex=2,bty="n") ## End(Not run)
## Not run: # The Old Faithful geyser data set "faithful" is used. The sample size n=272. # The OSCV curves based on the Gaussian kernel and the robust kernel H_I (with # alpha=16.8954588 and sigma=1.01) are plotted. The horizontal scales of the curves # are changed such that their global minimizers are to be used in computing the # Gaussian local linear estimates of the regression function. xdat=faithful[[2]] #waiting time ydat=faithful[[1]] #eruption duration barray=seq(0.5,10,len=250) C_gauss=C_smooth(1,1) OSCV_gauss=OSCV_reg(barray/C_gauss,xdat,ydat,0) h_gauss=round(h_OSCV_reg(xdat,ydat,0),digits=4) dev.new() plot(barray,OSCV_gauss,'l',lwd=3,cex.lab=1.7,cex.axis=1.7,xlab="h",ylab="OSCV criterion") title(main="OSCV based on the Gaussian kernel",cex.main=1.7) legend(2.5,0.25,legend=paste("h_min=",h_gauss),cex=2,bty="n") C_H_I=C_smooth(16.8954588,1.01) OSCV_H_I=OSCV_reg(barray/C_H_I,xdat,ydat,1) h_H_I=round(barray[which.min(OSCV_H_I)],digits=4) dev.new() plot(barray,OSCV_H_I,'l',lwd=3,cex.lab=1.7,cex.axis=1.7,xlab="h",ylab="OSCV criterion", ylim=c(0.15,0.5)) title(main="OSCV based on the robust kernel H_I",cex.main=1.7) legend(2.5,0.4,legend=paste("h_min=",h_H_I),cex=2,bty="n") ## End(Not run)
Nonsmooth regression function with six cusps used in the simulation studies in Savchuk et al. (2013) and Savchuk et al. (2017).
reg3(u)
reg3(u)
u |
numerical vecror of argument values in the range [0,1]. |
The nonsmooth function can be used in simulation studies.
The vector of values of corresponding to the values of the vector
.
Savchuk, O.Y., Hart, J.D., Sheather, S.J. (2013). One-sided cross-validation for nonsmooth regression functions. Journal of Nonparametric Statistics, 25(4), 889-904.
Savchuk, O.Y., Hart, J.D. (2017). Fully robust one-sided cross-validation for regression functions. Computational Statistics, doi:10.1007/s00180-017-0713-7.
## Not run: # n=250 data points are generated from r3 by adding the Gaussian noise with sigma=1/500. # The fixed evenly spaced design is used. u=seq(0,1,len=1000) n=250 xdat=(1:n-0.5)/n ydat=reg3(xdat)+rnorm(n,sd=1/500) h_oscv=round(h_OSCV_reg(xdat,ydat,1),digits=4) # L_G-based OSCV based on nonsmooth constant l=loclin(u,xdat,ydat,h_oscv) dev.new() plot(xdat,ydat,pch=20,cex=1.5,cex.axis=1.5,cex.lab=1.5,xlab="x",ylab="y", ylim=c(min(ydat),1.2*max(ydat))) lines(u,l,'l',lwd=3,col="blue") lines(u,reg3(u),lwd=3,lty="dashed") title(main="Data, true regression function and LLE",cex.main=1.7) legend(-0.05,0.003,legend=paste("h_OSCV=",h_oscv),cex=2,bty="n") legend(0.65,0.025, legend="n=250",cex=2,bty="n") legend(0,1.28*max(ydat),legend=c("LLE based on h_OSCV","true regression function"),lwd=c(3,3), lty=c("solid","dashed"),col=c("blue","black"),bty="n",cex=1.5) ## End(Not run)
## Not run: # n=250 data points are generated from r3 by adding the Gaussian noise with sigma=1/500. # The fixed evenly spaced design is used. u=seq(0,1,len=1000) n=250 xdat=(1:n-0.5)/n ydat=reg3(xdat)+rnorm(n,sd=1/500) h_oscv=round(h_OSCV_reg(xdat,ydat,1),digits=4) # L_G-based OSCV based on nonsmooth constant l=loclin(u,xdat,ydat,h_oscv) dev.new() plot(xdat,ydat,pch=20,cex=1.5,cex.axis=1.5,cex.lab=1.5,xlab="x",ylab="y", ylim=c(min(ydat),1.2*max(ydat))) lines(u,l,'l',lwd=3,col="blue") lines(u,reg3(u),lwd=3,lty="dashed") title(main="Data, true regression function and LLE",cex.main=1.7) legend(-0.05,0.003,legend=paste("h_OSCV=",h_oscv),cex=2,bty="n") legend(0.65,0.025, legend="n=250",cex=2,bty="n") legend(0,1.28*max(ydat),legend=c("LLE based on h_OSCV","true regression function"),lwd=c(3,3), lty=c("solid","dashed"),col=c("blue","black"),bty="n",cex=1.5) ## End(Not run)
fstar
.Taking a random sample of size from the density
with seven cusps introduced in the article of Savchuk (2017).
sample_fstar(n)
sample_fstar(n)
n |
sample size. |
The density can be used in simulation studies.
The numerical vector of size of the data values.
Savchuk, O.Y. (2017). One-sided cross-validation for nonsmooth density functions, arXiv:1703.05157.
## Not run: dev.new() plot(density(sample_fstar(5000),bw=0.1),lwd=2,ylim=c(0,0.32),xlab="argument",ylab="density", main="KDE and the true density fstar",cex.lab=1.7, cex.axis=1.7,cex.main=1.7) lines(seq(-3.5,3.5,len=1000),fstar(seq(-3.5,3.5,len=1000)),lwd=3,lty="dashed") legend(-3,0.3,legend=c("KDE","True density","h=0.1","n=5000"),lwd=c(2,3), lty=c("solid","dashed"),col=c("black","black","white","white")) ## End(Not run)
## Not run: dev.new() plot(density(sample_fstar(5000),bw=0.1),lwd=2,ylim=c(0,0.32),xlab="argument",ylab="density", main="KDE and the true density fstar",cex.lab=1.7, cex.axis=1.7,cex.main=1.7) lines(seq(-3.5,3.5,len=1000),fstar(seq(-3.5,3.5,len=1000)),lwd=3,lty="dashed") legend(-3,0.3,legend=c("KDE","True density","h=0.1","n=5000"),lwd=c(2,3), lty=c("solid","dashed"),col=c("black","black","white","white")) ## End(Not run)