
Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we use imaging spectroscopy (hyperspectral) reflectance data, covering the reflective 330 - 2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status; this constitutes a high dimensional dataset with an ill-conditioned covariance matrix. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized regression methods and one functional regression method: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis, for wavelength variable selection. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods on a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that Elastic Net variable selection yields the best predictive ability.

Variable selection in multivariate analysis is a critical step in regression, especially for high-dimensional datasets. It remains a challenge to identify a small fraction of essential predictors from thousands of variables, especially for small sample sizes. Variable selection by way of a sparse approximation of the parsimonious model can enhance prediction and estimation accuracy by efficiently identifying the subset of essential predictors, reducing model complexity, and improving model interpretability. This paper presents some unbiased, sparse and continuous methods for the judicious selection of important predictors, which allow easier interpretation, better prediction, and reduction in the complexity of the model.

This paper utilized hyperspectral data, collected from vines at the leaf level and the canopy level, for a Riesling vineyard. The dataset was obtained by measuring the spectral reflectance, defined as the ratio of backscattered radiance from a surface to the incident radiance on that surface (scaled to 0% - 100%) [

This high dimensional dataset, with more variables than samples, suffers from the curse of dimensionality, ill-posedness [

One popular family of feature selection methods for parametric models is based on the penalized (pseudo-) likelihood approach. Methods in this family, whose regularization paths for generalized linear models can be computed efficiently via coordinate descent, include the Lasso [

smooth curve belonging to an infinite dimensional space. Since predictors are non-periodic functional data [

Let us consider the traditional multiple linear regression model

y = X β + ε , (1)

where X = [ x 1 , ⋯ , x n ] T ∈ ℝ n × p is an input matrix whose i^{th} row is x i T , y = ( y 1 , ⋯ , y n ) T ∈ ℝ n is the corresponding response vector, β ∈ ℝ p is the regression coefficients vector, and ε ∈ ℝ n is a vector of residual errors with Var ( ε ) = σ ε 2 I n . In order to remove the constant term from the regression model, let us standardize the

predictor variables, such that ∑ i = 1 n X i j = 0 , 1 n ∑ i = 1 n X i j 2 = 1 for j = 1 , ⋯ , p [
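This standardization can be sketched numerically (the paper's analyses use R packages; the following is an illustrative NumPy sketch on simulated data, not the grapevine measurements):

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale so (1/n) * sum_i X_ij^2 = 1."""
    Xc = X - X.mean(axis=0)                  # enforce sum_i X_ij = 0
    scale = np.sqrt((Xc ** 2).mean(axis=0))  # root mean square of each column
    return Xc / scale

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * [1, 10, 0.1, 3, 7]  # columns on very different scales
Xs = standardize(X)
print(np.allclose(Xs.sum(axis=0), 0))          # True: columns sum to zero
print(np.allclose((Xs ** 2).mean(axis=0), 1))  # True: (1/n) sum X_ij^2 = 1
```

Standardizing this way puts all wavelength bands on a common scale, so a single penalty parameter shrinks them comparably.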

Since the grapevine dataset has more predictor variables than the sample size (which causes multicollinearity), we will discuss the various modern regularization techniques, involving simultaneous shrinkage and variable selection, in subsequent sections.

To explore the various variable selection techniques, this paper is organized as follows: In Section 2, we explain the Elastic Net regularization approach, with an emphasis on its clever combination of the traditional Ridge and Lasso methods; in Section 3, we introduce the Multi-Step Adaptive Elastic Net, which provides (at least in principle) an improvement over the basic Elastic Net via various modifications of the penalty function; Section 4 deals with the Minimax Concave Penalty, yet another technique designed to address the estimation, inference, and prediction challenges inherent in complex datasets like the grapevine data studied in this paper; in Section 5, we discuss various aspects of iterative Sure Independence Screening; Section 6 adopts a conceptually different, functional approach to variable selection, including smoothing by basis representation and validation; Section 7 is dedicated to the comparative analysis of the performance of all of the above techniques, with the goal of wavelength selection for the Riesling grapevine dataset toward nutrient estimation; and finally, in Section 8, we discuss the advantages and limitations of the various models mentioned above before concluding.

Ridge regression, the earliest form of regularization, shrinks all coefficients of the predictors towards zero by a uniform (ℓ_{2} - norm) convex penalty to produce a unique solution. However, Ridge regression typically does not set the coefficients exactly to zero, unless λ = ∞ [

β ^ ( Ridge ) = arg min β ∈ ℝ p { ‖ y − X β ‖ 2 2 + λ ‖ β ‖ 2 2 } (2)

where ‖ y − X β ‖ 2 2 = 1 2 n ∑ i = 1 n ( y i − x i T β ) 2 is a quadratic loss function (scaled residual sum of squares), x i T is the i^{th} row of X, ‖ β ‖ 2 2 = ∑ j = 1 p β j 2 is the ℓ_{2} - norm penalty on β , and λ ≥ 0 is the tuning (penalty) parameter, which regulates the strength of the penalty.
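For a full-rank design, the ridge criterion in Equation (2) has the closed-form minimizer β̂ = (XᵀX + λI)⁻¹Xᵀy; a brief NumPy sketch on simulated data (not the grapevine measurements) illustrates the shrinkage behavior described above:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^(-1) X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=40)

b_small = ridge(X, y, lam=0.01)   # nearly the least-squares fit
b_large = ridge(X, y, lam=1e4)    # heavy shrinkage toward zero
print(np.linalg.norm(b_large) < np.linalg.norm(b_small))  # True
print(np.all(b_large != 0))                               # True
```

As λ grows, every coefficient shrinks toward zero, but none is set exactly to zero, which is why Ridge alone cannot perform variable selection.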

Lasso, on the other hand, shrinks all coefficients by a constant value (ℓ_{1} - norm) and typically sets some of them to zero for an appropriately chosen λ. It simultaneously achieves continuous shrinkage and automatic variable selection. However, when the multicollinearity is very high, Lasso tends to pick one of the predictors from the cluster in an arbitrary way and shrink the others to zero [

β ^ ( Lasso ) = arg min β ∈ ℝ p { ‖ y − X β ‖ 2 2 + λ ‖ β ‖ 1 } (3)

where ‖ y − X β ‖ 2 2 = 1 2 n ∑ i = 1 n ( y i − x i T β ) 2 is the quadratic loss function, x i T is the i^{th} row of the matrix X, ‖ β ‖ 1 = ∑ j = 1 p | β j | is the ℓ_{1} - norm penalty on β ,

which induces sparsity in the solution, and λ ≥ 0 is the tuning parameter.
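The ℓ_{1} penalty acts through the soft-thresholding operator S ( z , λ ) = sign ( z ) ( | z | − λ ) + ; in the special case of an orthonormal design, the Lasso solution is exactly this operator applied to the least-squares coefficients. A minimal NumPy sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso's characteristic operator: shrink every coefficient by a
    constant lam and set coefficients with |z| <= lam exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols = np.array([2.5, -0.3, 0.9, -1.8, 0.1])   # hypothetical least-squares fit
shrunk = soft_threshold(ols, 0.5)
print(shrunk)   # the two smallest coefficients are set exactly to zero
```

This is the mechanism behind both the continuous shrinkage and the automatic variable selection described above.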

Naïve Elastic Net (NEN) overcomes these limitations by combining the Lasso (ℓ_{1}-norm) and Ridge (ℓ_{2}-norm) penalties [

Let us consider two fixed non-negative tuning parameters: λ_{1} and λ_{2}, such that the naïve elastic net criterion is

L ( λ 1 , λ 2 , β ) = ‖ y − X β ‖ 2 2 + λ 1 ‖ β ‖ 1 + λ 2 ‖ β ‖ 2 2 (4)

where ‖ y − X β ‖ 2 2 = 1 2 n ∑ i = 1 n ( y i − x i T β ) 2 , ‖ β ‖ 1 = ∑ j = 1 p | β j | and ‖ β ‖ 2 2 = ∑ j = 1 p β j 2

The NEN estimator β ^ ( N-Enet ) is the minimizer of Equation (4):

β ^ ( N-Enet ) = arg min β ∈ ℝ p { L ( λ 1 , λ 2 , β ) } (5)

Let us consider α = λ 1 / ( λ 1 + λ 2 ) ; then the elastic net estimator β ^ ( N-Enet ) is

β ^ ( N-Enet ) = arg min β ∈ ℝ p { ‖ y − X β ‖ 2 2 + λ P α ( β ) } (6)

where P α ( β ) = ( 1 − α ) ‖ β ‖ 2 2 + α ‖ β ‖ 1 is the naïve elastic net penalty, equivalently expressed through the constraint P α ( β ) ≤ t for some t, and α ∈ ( 0 , 1 ) . For all α ∈ ( 0 , 1 ) , the penalty function is non-differentiable at 0 (like Lasso) but strictly convex (like Ridge). Hence, by varying α, we can control the proportion of the ℓ_{2}-norm (α = 0) and the ℓ_{1}-norm (α = 1) penalty. The amount of shrinkage of the coefficient estimates is controlled by the parameter t ≥ 0.

Initially, the NEN computes the ridge regression coefficients for each fixed tuning parameter λ_{2} and then uses this coefficient value to acquire shrinkage along the Lasso coefficient paths. This technique of shrinkage increases the bias of the coefficients without substantial reduction in the variance, resulting in an overall increase of the prediction error. This leads to a doubled shrinkage and unnecessary extra bias, in comparison to Ridge regression or Lasso [

can correct this double shrinkage by multiplying the NEN estimate by ( 1 + λ 2 / n ) :

β ^ ( Enet ) = ( 1 + λ 2 / n ) arg min β ∈ ℝ p { ‖ y − X β ‖ 2 2 + λ 1 ‖ β ‖ 1 + λ 2 ‖ β ‖ 2 2 } (7)

This rescaling undoes the extra ridge shrinkage while retaining the variable selection property of the NEN. Thus, the elastic net is able to improve the prediction accuracy by achieving automatic variable selection through the ℓ_{1} penalty, while group selection and stabilization of the coefficient paths under random sampling are achieved by the ℓ_{2} penalty. The sparsity of the elastic net solution increases monotonically from zero to the sparsity of the Lasso solution as α increases from 0 to 1, for a given parameter λ. Hence, the Elastic Net is well suited to a dataset with a sample size significantly smaller than the number of highly correlated predictors [

At times, it is advisable to include the entire group of correlated predictors in the model selection, rather than a single variable from the group. In such cases, the elastic net ensures that the highly correlated variables enter or exit the model together. The presence of the ℓ_{2}-norm ensures a unique minimum by making the loss function strictly convex [
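In practice the elastic net is fit by coordinate descent (the paper uses the R package glmnet); the following NumPy sketch implements the naive elastic net for the glmnet-style objective ( 1 / 2 n ) ‖ y − X β ‖ 2 2 + λ ( α ‖ β ‖ 1 + ( 1 − α ) / 2 ‖ β ‖ 2 2 ) on simulated standardized data. It is illustrative only, not the paper's actual pipeline:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def enet_cd(X, y, lam, alpha, n_iter=200):
    """Naive elastic net by coordinate descent for the objective
    (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2),
    assuming columns of X have mean 0 and (1/n)*sum_i x_ij^2 = 1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                           # running residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]          # partial residual excluding j
            z = X[:, j] @ r / n
            beta[j] = soft(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            r = r - X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(2)
n, p = 60, 10
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).mean(axis=0))         # standardize as in Section 1
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.05 * rng.normal(size=n)
beta = enet_cd(X, y, lam=0.1, alpha=0.9)
print(np.flatnonzero(np.abs(beta) > 1e-8))     # indices of the selected features
```

With α close to 1 the fit is nearly Lasso-like and sparse; decreasing α adds ridge-type stabilization, which is what allows clusters of correlated wavelengths to enter the model together.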

Lasso and elastic-net approaches result in a substantial number of non-zero coefficients with asymptotically non-ignorable bias. The estimation bias of the Lasso can be reduced by choosing the weights such that the variables with larger coefficients have smaller weights than variables with smaller coefficients. To mitigate this bias, the adaptive Lasso uses the weighted penalty approach as given below:

β ^ ( AdaLasso ) = arg min β ∈ ℝ p { ‖ y − X β ‖ 2 2 + λ ∑ j = 1 p w ^ j | β j | } (8)

where w ^ j = | β ^ j ( ini ) | − γ is a weighting parameter calculated from the data (the initial estimate raised to the power − γ ), γ is a positive constant, and the initial estimates β ^ j ( ini ) are obtained by ridge regression or least squares.

This approach allows the adaptive Lasso to execute a varying amount of shrinkage for different variables: coefficients with small initial estimates are shrunk more heavily, while coefficients with large initial estimates are shrunk less, reducing their bias. Adaptive elastic net is a mixture of the adaptive Lasso and the elastic net. Before calculating the adaptive weights, we estimate the elastic-net β ^ ( Enet ) , and use a positive constant γ. Using this information, we can calculate the adaptive weights:

w ^ j = | β ^ j ( enet ) | − γ , j = 1 , 2 , ⋯ , p (9)

Now we can estimate the adaptive elastic net by the equation below:

β ^ ( AdaEnet ) = ( 1 + λ 2 / n ) arg min β ∈ ℝ p { ‖ y − X β ‖ 2 2 + λ 1 ∑ j = 1 p w ^ j | β j | + λ 2 ‖ β ‖ 2 2 } (10)

The presence of the ℓ_{2}-norm ensures that the adaptive elastic net is able to overcome the collinearity problem while retaining the variable selection consistency and asymptotic Gaussian properties of the adaptive Lasso. The use of multi-step estimation achieves higher selection accuracy (truly zero coefficients are estimated as zero) by pursuing more iterative steps and using separate tuning parameters for each step. The estimates of the multi-step adaptive elastic net (MSA-Enet) approach are given by:

β ^ ( MSAEnet ) = ( 1 + λ 2 / n ) arg min β ∈ ℝ p { ‖ y − X β ‖ 2 2 + λ 1 ( k ) ∑ j = 1 p w ^ j ( k − 1 ) | β j | + λ 2 ( k ) ‖ β ‖ 2 2 } (11)

where k = number of iterations (stages). For MSA-Enet, we use k ≥ 3. By considering k = 2, we can estimate the adaptive elastic net, and for k = 1, we obtain the normal elastic-net. We can obtain the values of λ 1 ( k ) and λ 2 ( k ) by using cross-validation.
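The stagewise weighting in Equations (9)-(11) can be sketched as follows; the stage k − 1 estimate below is a hypothetical vector used purely for illustration:

```python
import numpy as np

def adaptive_weights(beta_prev, gamma=1.0, eps=1e-8):
    """w_j = |beta_j|^(-gamma): large coefficients receive small weights
    (little extra shrinkage); zero coefficients an effectively infinite one.
    eps guards the division for exactly-zero coefficients."""
    return 1.0 / (np.abs(beta_prev) + eps) ** gamma

# hypothetical elastic-net estimate from stage k-1
beta_enet = np.array([2.1, -0.02, 0.8, 0.0])
w = adaptive_weights(beta_enet)
print(w.round(2))   # weights grow as |beta| shrinks
```

At each stage k, these weights feed the weighted ℓ_{1} term of Equation (11), so variables zeroed at one stage stay excluded at the next.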

Convex penalties fail to satisfy all three conditions of sparsity, continuity, and unbiasedness. Hence, they cannot produce truly parsimonious models. To overcome these limitations, Fan & Li [ ] proposed the Smoothly Clipped Absolute Deviation (SCAD) penalty. Both SCAD and MCP start out by applying the same rate of penalization as the ℓ_{1} penalty (Lasso) until | x | = λ, and then smoothly relax the penalization rate to zero as the absolute value of the coefficient increases, but they differ in the way that the transition takes place. The MCP relaxes the penalization rate immediately, whereas, for the SCAD, the penalization rate remains flat for a while before decreasing.

Zhang [

ρ ( t ; λ ) = λ ∫ 0 | t | ( 1 − x / ( γ λ ) ) + d x ,

which evaluates to

ρ ( t ; λ ) = { λ | t | − t 2 / ( 2 γ ) , if | t | ≤ λ γ , γ λ 2 / 2 , if | t | > λ γ ,

and has derivative (penalization rate)

ρ ′ ( t ; λ ) = { λ − | t | / γ , if | t | ≤ λ γ , 0 , if | t | > λ γ , (12)

for γ > 0 and λ > 0. Equation (12) clearly shows that the MCP initially applies the ℓ_{1} penalty (Lasso), but continuously relaxes the penalization rate until | t | > λ γ . At this stage, the rate of penalization drops to 0.
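A direct NumPy transcription of Equation (12) makes this behavior easy to verify: the penalty is continuous at the knot | t | = λ γ , and the penalization rate drops to zero beyond it.

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP penalty, Equation (12)."""
    a = np.abs(t)
    return np.where(a <= lam * gamma, lam * a - a ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)

def mcp_rate(t, lam, gamma):
    """Penalization rate rho'(t): lam - |t|/gamma until |t| = lam*gamma, then 0."""
    a = np.abs(t)
    return np.where(a <= lam * gamma, lam - a / gamma, 0.0)

lam, gamma = 1.0, 3.0
print(float(mcp_rate(0.0, lam, gamma)))   # 1.0: starts at the Lasso rate lam
print(float(mcp_rate(5.0, lam, gamma)))   # 0.0: no penalization beyond lam*gamma
print(float(mcp(3.0, lam, gamma)))        # 1.5: continuous at the knot (= gamma*lam^2/2)
```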

MCP minimizes the maximum concavity

κ ( ρ ) ≡ κ ( ρ ; λ ) ≡ sup 0 < t 1 < t 2 { ρ ′ ( t 1 ; λ ) − ρ ′ ( t 2 ; λ ) } / ( t 2 − t 1 ) (13)

subject to the following unbiasedness and feature selection conditions:

ρ ′ ( t ; λ ) = 0 , ∀ t ≥ γ λ , ρ ′ ( 0 + ; λ ) = λ (14)

The MCP achieves κ ( ρ ; λ ) = 1 / γ . A higher value of the regularization parameter γ reduces the concavity (since κ = 1 / γ ) but increases the bias, making the penalty more Lasso-like. According to [

β ^ ( MCP ) = arg min β ∈ ℝ p { 1 2 ‖ y − X β ‖ 2 2 + P γ ( β ; λ ) } (15)

where P γ ( β ; λ ) = ∑ j = 1 p p γ ( β j ; λ ) . Estimation of the coefficients using MCP depends on the selection of the parameters γ and λ, obtained through cross-validation. For each penalty value of λ > 0, MCP offers a continuum of penalties starting with the ℓ_{1}-norm at γ → ∞ and the ℓ_{0}-norm as γ → 0+ [

The convexity of penalties guarantees that coordinate descent converges to a unique global minimum and that β ^ is continuous with respect to λ. Convexity also ensures good starting values, which, in turn, reduce the number of iterations. However, when convexity fails, β ^ may not necessarily be continuous. In other words, a slight variation in the data may significantly change the coefficient estimate, and estimates obtained using a non-convex penalty generally have a large variance. Although unbiasedness and exact variable selection together preclude convex penalties, the MCP provides sparse convexity to the broadest extent by minimizing the maximum concavity.

For the high-dimensional grapevine dataset, global convexity is neither possible nor relevant. However, the objective function of the grapevine dataset is convex in the local region. The parsimonious solutions of this objective function have smooth coefficient paths with stable coefficients. The tuning parameter gamma ( γ = 3 ) for the MCP controls how fast the penalization rate goes to zero.

Penalized likelihood methods based on the coordinate descent algorithm fail to meet the concurrent demands of computational expediency, statistical accuracy, and algorithmic stability for extremely high dimensional datasets. In order to overcome this constraint, Fan & Lv [

According to Saldana & Feng [

Let X be a matrix with dimension n × p and ω = ( ω 1 , ⋯ , ω p ) T be a p-vector of marginal correlations of predictors with the response variable, acquired by component-wise regression, such that

ω = X T y (16)

We standardize the matrix X column-wise and rescale vector ω by the standard deviation of the response.

Let us sort p component-wise magnitudes of the vector ω in a decreasing order and define a sub-model M γ for any γ ∈ ( 0 , 1 ) ,

M γ = { 1 ≤ i ≤ p : | ω i | is among the first [ γ n ] largest of all } , (17)

where [ γ n ] signifies the integer part of γ n . In this way, we can reduce the dimension of the full model { 1 , ⋯ , p } to a sub-model M γ with size d = [ γ n ] < n . Hu & Lin [
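This screening step amounts to ranking | ω i | and keeping the top [ γ n ] indices; a minimal NumPy sketch on simulated data with p ≫ n (mimicking the structure, though not the content, of the grapevine data):

```python
import numpy as np

def sis(X, y, gamma=0.1):
    """Sure Independence Screening: rank predictors by the magnitude of
    their marginal correlation with the response, keep the top [gamma*n]."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize column-wise
    ys = (y - y.mean()) / y.std()               # rescale by response sd
    omega = Xs.T @ ys / n                       # marginal correlations, Eq. (16)
    d = int(gamma * n)                          # sub-model size d = [gamma*n] < n
    return np.argsort(-np.abs(omega))[:d]

rng = np.random.default_rng(3)
n, p = 100, 1000                                # p >> n
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 7] - 2.0 * X[:, 42] + 0.1 * rng.normal(size=n)
keep = sis(X, y, gamma=0.1)
print(7 in keep, 42 in keep)                    # the truly active predictors survive
```

No matrix inversion is required, which is what makes the method computationally attractive; the screened sub-model can then be passed to a finer selection method.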

Data collection technology has been recently developed to measure observations densely sampled over wavelength, space, time, and other continua [

When we consider a linear regression, with the response variable y and the predictor x_{ij} being a scalar, then the model takes the form:

y i = ∑ j = 0 p β j x i j + ε i , i = 1 , 2 , ⋯ , n (18)

This linear model fails to capture the smoothness of the X predictor variables with respect to the wavelength. However, if we replace at least one vector of predictor variable observations x i = ( x i 1 , ⋯ , x i p ) in the linear equation by a functional predictor x_{i}(t), we obtain a model consisting of an intercept term along with a single functional predictor variable [

Let t 1 , ⋯ , t q be a set of times, then we can discretize each of the n functional predictors x_{i}(t) on this set. Now fit the model:

y i = α 0 + ∑ j = 1 q x i ( t j ) β j + ε i (19)

If the time grid is refined further, the summation approaches an integral, and we obtain a functional linear regression model for the scalar response:

y i = α 0 + ∫ x i ( t ) β ( t ) d t + ε i , i = 1 , ⋯ , n ; y i ~ N ( μ , σ 2 ) (20)
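The passage from Equation (19) to Equation (20) is simply the refinement of a Riemann sum; a short NumPy check using a hypothetical predictor x ( t ) = e^{ − t } and coefficient function β ( t ) = sin ( π t ) on [0, 1]:

```python
import numpy as np

# Discretized sum from Eq. (19) on a fine grid versus the exact integral
# in Eq. (20), for hypothetical x(t) = exp(-t) and beta(t) = sin(pi*t).
t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]
x = np.exp(-t)
beta = np.sin(np.pi * t)
riemann = np.sum(x * beta) * dt                        # sum_j x(t_j) beta(t_j) dt
exact = np.pi * (1 + np.exp(-1)) / (1 + np.pi ** 2)    # closed-form integral
print(abs(riemann - exact) < 1e-4)                     # True: the sum converges to the integral
```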

where the functional regression tries to establish a relationship between a scalar outcome y_{i} and random functions x_{i}(t) [

Here the constant α 0 is the intercept term that adjusts for the origin of the response variable. The parameter β lies in the infinite-dimensional space of ℓ_{2} functions (the Hilbert space of all square-integrable functions over a particular interval) [

When a function belongs to ℓ_{2} space, it can be represented by a basis of known functions { ϕ k } k ∈ ℕ [

x ( t ) = ∑ k ∈ ℕ c k ϕ k ( t ) ≈ ∑ k = 1 K c k ϕ k ( t ) = c T Φ ( t ) (21)

The smoothing (or hat) matrix H is symmetric and square.

H = Φ ( Φ T Φ ) − 1 Φ T , (22)

The degrees of freedom (DF) for functional fit is given by

DF = trace ( H ) = K , (23)

moreover, the associated degrees of freedom for error is n - DF.
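Equations (22) and (23) are easy to verify numerically. The sketch below uses a small Fourier basis purely for convenience (the paper's non-periodic reflectance curves would call for a B-spline basis):

```python
import numpy as np

# Build a K-term Fourier basis on a wavelength-like grid and verify that the
# smoothing matrix H = Phi (Phi^T Phi)^(-1) Phi^T is symmetric with trace K.
t = np.linspace(0.0, 1.0, 200)
K = 7
cols = [np.ones_like(t)]
for k in range(1, (K - 1) // 2 + 1):
    cols.append(np.sin(2 * np.pi * k * t))
    cols.append(np.cos(2 * np.pi * k * t))
Phi = np.column_stack(cols)                    # 200 x K basis matrix

H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)  # hat matrix, Equation (22)
print(np.allclose(H, H.T))                     # True: H is symmetric
print(round(np.trace(H)))                      # 7: DF = trace(H) = K
```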

In spline smoothing, the mean squared error (MSE) is a method of assessing the quality of the estimate. We can reduce the MSE by accepting some bias, which lowers the sampling variance and thereby smooths the estimated curve. Since the estimates are expected to vary only slightly from one wavelength to another, the process is akin to borrowing information from neighboring data. This expresses our confidence in the smoothness of the function x. The sharing of information increases the bias but improves the stability of the estimated curve [

The functional approach of smoothing data performs well only when the number K of basis functions is significantly small as compared to the number of observations. Higher values of K will tend to overfit or undersmooth the data [
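One common, concrete way to choose K is generalized cross-validation, GCV ( K ) = n ⋅ SSE / ( n − K ) 2 , which trades fit against DF = trace ( H ) = K. The NumPy sketch below applies it to a simulated noisy curve; GCV is an illustrative criterion here, not necessarily the one used by the paper:

```python
import numpy as np

def fourier_basis(t, K):
    """K-term Fourier basis (constant plus sine/cosine pairs)."""
    cols = [np.ones_like(t)]
    for k in range(1, (K - 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

def gcv(t, x, K):
    """Generalized cross-validation score for a K-term least-squares smooth:
    GCV(K) = n * SSE / (n - K)^2, using DF = trace(H) = K."""
    Phi = fourier_basis(t, K)
    H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
    resid = x - H @ x
    n = len(x)
    return n * float(resid @ resid) / (n - K) ** 2

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 300)
x = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t) + 0.2 * rng.normal(size=300)
scores = {K: gcv(t, x, K) for K in (3, 5, 7, 9, 21, 51)}
best = min(scores, key=scores.get)
print(best not in (3, 51))   # True: GCV rejects both extremes
```

A too-small K underfits (large SSE) and a too-large K undersmooths (small n − K), so the criterion is minimized at a moderate K.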

A proper choice of selection methods, applied under appropriate conditions, helps to build consistent parsimonious models and estimate coefficients simultaneously for better prediction accuracy.

Here, we compare four regularized regression methods and one functional regression method: elastic net, multi-step adaptive elastic net, minimax concave penalty, sure independence screening, and functional data analysis, based on their predictive performance. To select the significant variables using the regularization paths of the elastic net, multi-step adaptive elastic net, minimax concave penalty, and sure independence screening, the R packages glmnet, msaenet, ncvreg, and SIS, respectively, have been used. Thereafter, the predictive performance has been enhanced using stepwise regression. Functional regression is performed using the R package “fda.usc”.

The ℓ_{1} penalty of the elastic net reduces the values of adjusted and predicted R-squared, thereby contradicting Xiao & Xu [

grapevine data. Selection of lambda is obtained using 10-fold cross-validation, based on the mean squared error criterion.

Since these grapevine datasets are high dimensional with multicollinearity, statistical inference is possible only by dimensionality reduction through sparse representation. A parsimonious model of significant predictors is selected by reducing the coefficients of unimportant predictors to zero, which improves the estimation accuracy and enhances the model interpretability. The first four models deal with linear regression, while the fifth one uses functional approach.

Elastic Net effectively averages the highly correlated wavelengths before incorporating the averaged cluster into the model. The predictive ability of the elastic net for this high dimensional grapevine dataset with high multicollinearity is the best among all the methods discussed above [

It is worth mentioning that the elastic net is also practically desirable because it provides interpretable output in the form of the solution path plot, which helps to visualize the variable selection.

As an example, analysis of the grapevine dataset reveals that the following wavelengths are important for predicting the nitrogen nutrient level: 334.3, 347.8, 571.3, 684.6, 756.9, 1434.4, 1826.4, 1858.4, 1872.6, 1893.8, 1903.8, 1906.6, 1912.2, 1928.9, 1934.5, 1942.8, 1956.6, 1962.1, 1994.8, 2355.2, 2371, 2386.7, 2393.3, 2419.7, 2426.2, 2430.5, 2439.1, 2483.6 nm. These values are in accordance with physiological expectations, although future studies could explore specific absorption features in detail. For instance, Elvidge & Chen [

Application of a data-driven weighted approach to the ℓ_{1}-penalty, with varying amounts of shrinkage for different variables, reduces the bias and the variance inflation factor. However, in the process, the coefficients of a large number of significant variables are reduced to zero, resulting in poor predictive ability for the multi-step adaptive elastic net.

Convergence and estimation of the coefficients using MCP depend upon the tuning parameters gamma (γ) and lambda (λ). For this highly correlated grapevine dataset, the chosen tuning parameters fail to capture all of the significant predictors, resulting in poor predictive ability.

Sure Independence Screening, based on component-wise regression or, equivalently, correlation learning, is computationally attractive because this approach does not require matrix inversion. However, SIS is known to select unimportant predictors that are highly correlated with the important predictors, while missing significant predictors that are only weakly correlated with the response. Hence, the predictive performance of SIS is adversely affected in the presence of multicollinearity.

The functional approach of smoothing data performs well only when the number K of basis functions is significantly small compared to the number of observations. In the presence of multicollinearity, the number of basis functions selected by the minimum-MSE criterion cannot be reduced significantly, which negatively affects the predictive ability of the functional data analysis of the grapevine dataset.

There has been a continuous endeavor to enhance the predictive ability of models for high dimensional data by refining the coefficient estimates. These modern variable selection techniques generate a sparse model based on the assumption that the predictor variables are independent, and they yield extremely good predictive accuracy when that assumption is satisfied. However, for a dataset like these hyperspectral reflectance grapevine data, which are highly correlated (~97%), with a large number of predictors (wavelengths) clustered together, only the Elastic Net has the ability to select groups of correlated variables. This outcome is especially critical to the burgeoning field of precision agriculture, which is making increasing use of such hyperspectral imaging datasets but cannot reach a large sample size through traditional field work (time and monetary constraints). In all such cases, the predictive ability of the Elastic Net is likely to outperform the other modern variable selection techniques.

We are grateful to Dr. Justine Vanden Heuvel (Cornell University) for her expertise in vineyard physiology, as well as the field teams from Rochester Institute of Technology and Cornell University for their help in collecting field data.

Jha, U.K., Bajorski, P., Fokoue, E., Heuvel, J.V., van Aardt, J. and Anderson, G. (2017) Dimensionality Reduction of High-Dimensional Highly Correlated Multivariate Grapevine Dataset. Open Journal of Statistics, 7, 702-717. https://doi.org/10.4236/ojs.2017.74049