Estimation Procedure for Reduced Rank Regression, PLSSVD

This paper presents a procedure for coefficient estimation in a multivariate regression model of reduced rank in the presence of multicollinearity. The procedure, denoted PLSSVD, permits the prediction of the dependent variables by taking advantage of both Partial Least Squares (PLS) and Singular Value Decomposition (SVD) methods. Global variability indices and prediction error sums are used to compare the results obtained by classical reduced rank regression (OLSSVD) and the PLSSVD procedure when applied to examples with different grades of multicollinearity (severe, moderate and light). In addition, simulations comparing the methods were performed with different sample sizes under four scenarios. The new PLSSVD method is shown to be more effective when the multicollinearity is severe, especially for small sample sizes.


Introduction
A new procedure for regression coefficient estimation under reduced rank conditions is developed using partial least squares (PLS) and compared with classical methods which use ordinary least squares (OLS). Hoskuldsson [11] asks the question, "What situations are typical of those where PLS methods can be expected to be good for modelling purposes?" and gives the reply, "They are the ones where there are many variables but not necessarily many samples or observations. This is a common situation in many laboratories. Typically it may take a lot of time to get a new sample, but each sample may give a large amount of information (variables)." He also replies to another important question, "Why can one expect PLS regression methods to perform better than multiple linear regression, ridge regression and other well known regression techniques?" The answer involves the increased stability of predictors derived from PLS methods. It turns out that the essential criterion for the predictability of models is to keep the number of variables as low as possible. In PLS, components are selected that give 'maximal' reduction in the covariance X′Y of the data. In that sense PLS will give the minimum number of necessary variables.
Xu et al. [30] also note that principal component regression (PCR) is similar to PLS in constructing new variables (components): each new component is a linear combination of x_1, x_2, ..., x_m, where x_i is the ith column of the data matrix X. Both of them shrink the estimated coefficient vector by leaving some of the components out of the model. The major difference between PCR and PLS is that PCR's components are determined only by X, whereas the PLS components are constructed using the information in both X and Y.

The Basic Model
Multivariate regression with reduced rank (RRR), a special case of the classic multivariate regression model derived from the works of Anderson [4], Izenman [12] and Davies and Tso [5], is based on the model

Y = XΘ + E, (1)

in which Y is a (n × q) matrix, X is a (n × p) matrix, Θ is a (p × q) matrix of parameters or regression coefficients and E is the (n × q) matrix of errors or random perturbations. Davies and Tso (1982) [5] base their procedure on the reduced rank regression model Y = XΘ + E assuming r = rank(Θ), with r ≤ p. The essential steps of their procedure are:

Reduced Rank Regression Procedure of Davies and Tso
1. Form data matrices X (n × p) and Y (n × q) containing the values of the regressor and response variables for n samples. Scale each response variable and subtract sample means from each variable.
2. Determine the unconstrained least-squares estimate Θ̂ = X⁺Y, where X⁺ is the Moore-Penrose generalized inverse of X (see, for example, Graybill, 1969 [8], Theorem 7.4.1, or Davies and Tso, 1982, p. 245 [5]). When X has full column rank this is equivalent to using the standard formula Θ̂ = (X′X)^{-1}X′Y.
3. Compute the fitted values Ŷ = XΘ̂ and their singular value decomposition Ŷ = UDV′, where U is a matrix of order n × q whose columns are the eigenvectors associated with the nonzero eigenvalues of the matrix ŶŶ′; D is a diagonal matrix of order q which contains the square roots of the nonzero eigenvalues of ŶŶ′ (or Ŷ′Ŷ); and V is an orthogonal matrix of order q whose columns are the eigenvectors of Ŷ′Ŷ. The matrices U and V satisfy U′U = I and V′V = I, and τ is the rank of Ŷ.
4. Evaluate the matrix P_s = Σ_{i=1}^{s} v_i v_i′ (P_s is a projection matrix of rank s, with s < τ).
5. Evaluate the reduced-rank fitted values Y*_s = Ŷ_s = XΘ̂P_s.
6. Compute the reduced-rank matrix of coefficients Θ* = Θ̂P_s.

This procedure by Davies and Tso has been the object of much subsequent research. For example, Anderson [3] compared the asymptotic distribution of the coefficient matrix with complete rank with that of reduced rank; the estimation model with reduced rank was shown to be superior in general. The same author in 2002 studied the influence of the rank in both models, finding that the model with reduced rank is most efficient when the rank is small. In the next section a different approach, based on Ter Braak and Looman (1994) [21], which uses a biplot of reduced rank regression, is presented. The need to factor the regression coefficient matrix using the techniques of Gabriel (1971) [7] obliged them to develop reduced rank regression using OLS and SVD.
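As an illustration, the steps above can be sketched numerically in Python with NumPy (a sketch under our own naming and synthetic data; `pinv` plays the role of the Moore-Penrose inverse X⁺):

```python
import numpy as np

def rrr_davies_tso(X, Y, s):
    """Sketch of the Davies-Tso reduced rank regression steps.

    X: (n, p) regressors, Y: (n, q) responses, s: target rank (s < rank(Yhat)).
    Returns the reduced-rank coefficient matrix and the fitted values.
    """
    # Step 1: remove column means.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Step 2: unconstrained least-squares estimate via the Moore-Penrose inverse.
    Theta = np.linalg.pinv(X) @ Y
    # Step 3: fitted values and their SVD, Yhat = U D V'.
    Yhat = X @ Theta
    U, d, Vt = np.linalg.svd(Yhat, full_matrices=False)
    # Step 4: projection matrix P_s built from the first s right singular vectors.
    V_s = Vt[:s].T
    P_s = V_s @ V_s.T
    # Steps 5-6: reduced-rank fitted values and coefficients.
    Theta_s = Theta @ P_s
    return Theta_s, X @ Theta_s

# Tiny illustration with synthetic reduced-rank data (names are hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
A_true = rng.normal(size=(4, 2))
B_true = rng.normal(size=(3, 2))
Y = X @ A_true @ B_true.T + 0.01 * rng.normal(size=(30, 3))
Theta_s, fits = rrr_davies_tso(X, Y, s=2)
print(np.linalg.matrix_rank(Theta_s))   # rank constrained to s = 2
```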

The Reduced Rank Regression Procedure of Ter Braak and Looman (OLSSVD)
Just as Davies and Tso did in 1982 [5], Ter Braak and Looman in 1994 [21] based their reduced rank regression model on the classical linear regression model Y = XΘ + E, under the assumption that the matrix Θ of order (p × q) and rank r does not have complete column rank, that is, r < q. It is thus possible to factor

Θ = AB′, (2)

where A and B are matrices of order (p × r) and (q × r) respectively. Inserting (2) in (1) we have

Y = XAB′ + E. (3)

This model can be written as in factor analysis (see Mardia et al., 1979 [15]) as

Y = GB′ + E, (4)

where G = XA is a matrix of order (n × r) that contains the scores of the factors and B is the matrix that contains the factorial weights (Ter Braak and Looman, 1994 [21]).
As in Davies and Tso [5], we have assumed the errors in the matrix E to be uncorrelated for different samples (i.e. across rows), but correlation between the errors in responses measured on the same sample may need to be considered. If the errors are normally distributed with unknown covariance matrix, it has been shown that canonical correlation analysis is applicable (Tso, 1981 [23]). Given an independent estimate of the covariance matrix Σ, maximum-likelihood estimation of Θ requires only a slight modification of the given least-squares procedure.
Let Σ have the eigenanalysis Σ = VD²V′, where Γ = VD^{-1} and D is understood to have positive elements along its main diagonal. Ter Braak and Looman (1994) [21] take this into consideration and replace the least-squares criterion by

min ∥(Y − XΘ)Γ∥². (5)

The consequence of this observation is the following. If in (5) Γ is a conveniently chosen symmetric positive definite matrix, then principal component analysis (PCA) and canonical correlation analysis (CCA) can be derived from RRR, as shown by Izenman [12]. The estimation in regression with reduced rank takes this observation into account when (3) is introduced in (5) to obtain the orthogonal sum

∥(Y − XAB′)Γ∥² = ∥(Y − Ŷ)Γ∥² + ∥(Ŷ − XAB′)Γ∥², (6)

where Ŷ is the matrix of fitted values given by separate multiple regressions (equivalent to a separate multiple regression for each of the q response variables) and Γ is a symmetric positive definite matrix selected conveniently according to the context. Considering (4), we obtain

∥(Y − GB′)Γ∥² = ∥(Y − Ŷ)Γ∥² + ∥(Ŷ − GB′)Γ∥². (7)

According to the decomposition (6), in order to do the estimation in reduced rank regression it is necessary to find the minimum of ∥(Y − XAB′)Γ∥². The least-squares solution is thus obtained by minimizing the right-hand term of (7) using singular value decomposition (SVD) (see Eckart and Young [6]). In other words,

ŶΓ = UDV′, (8)

where U and V are orthogonal matrices of order (n × t) and (q × t) respectively, with t = min(n, q), that contain the eigenvectors of (ŶΓ)(ŶΓ)′ and (ŶΓ)′(ŶΓ) respectively, and D is a diagonal matrix of order q formed by the singular values of ŶΓ in non-increasing order (λ_1 ≥ λ_2 ≥ ... ≥ λ_t ≥ 0). Remember also that Γ is a positive definite matrix.
The minimum of (7), when the rank of Θ is r < q, is the sum of the squares of the discarded singular values, Σ_{i=r+1}^{t} λ_i², attained at the rank-r approximation

ŶΓ ≈ U_r([VD]_r)′, (9)

where the matrices U_r and [VD]_r contain only the first r columns of the matrices U and VD of (8), respectively. By (4), G = XA and, by (9), Ĝ = U_r, so that it is possible to estimate A by ordinary least squares using the regression Ĝ = XA, which gives

Â_OLS = (X′X)^{-1}X′U_r. (10)

The matrix of regression coefficients in the model (3) is thus

Θ̂_{OLS-SVD} = Â_OLS B̂′_SVD. (11)

Observe that the matrix of regression coefficients is thus obtained by a procedure that combines ordinary least squares estimation and singular value decomposition. The steps of this OLSSVD procedure for the RRR model Y = XΘ + E can be summarized as follows (Ter Braak and Looman, 1994 [21]):

1. Center or standardize the matrices X (n × p) and Y (n × q).
2. Express the multivariate regression model of (1) in the factored form Y = XAB′ + E.
3. Compute the fitted values Ŷ of the unconstrained multivariate regression and the singular value decomposition of ŶΓ as in (8).
4. Set Ĝ = U_r and estimate Â_OLS by the regression of Ĝ on X as in (10).
5. Form the reduced-rank coefficient matrix Θ̂_{OLS-SVD} = Â_OLS B̂′_SVD as in (11).

This algorithm is very important since, by factorizing the matrix of regression coefficients with reduced rank as Θ̂_{OLS-SVD} = Â_OLS B̂′_SVD, it is possible to interpret it graphically by means of a biplot following the ideas of Ter Braak and Looman [21] (see, for example, Alvarez and Cardenas [1]). Ter Braak and Looman also note that in this algorithm, if Γ = I, the solutions can be obtained by a principal components analysis of the values Ŷ. Reduced-rank regression inherits from multiple regression the property that it is invariant under linear rescaling of the regressors, but from principal components analysis we know that it is not invariant under rescaling of the response variables (Jolliffe [14]). There are three methods to overcome this latter problem, corresponding to different choices of the weight matrix Γ. Method (1) leads to redundancy analysis (Van den Wollenberg, 1977 [25]), whereas method (2) leads to canonical correlation analysis, as was first noted by Velu et al. (1986) [27]. Method (3) was proposed by Ter Braak (1990) [22] as an intermediate between redundancy analysis and canonical correlation.
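A compact numerical sketch of the OLSSVD steps (our own illustrative code; with Γ = I and full rank r the estimate reduces to ordinary least squares, which provides a simple check):

```python
import numpy as np

def olssvd(X, Y, r, Gamma=None):
    """OLSSVD sketch: OLS fit, SVD of Yhat @ Gamma, rank-r truncation."""
    q = Y.shape[1]
    if Gamma is None:
        Gamma = np.eye(q)                      # default weight matrix
    # Unconstrained fitted values Yhat (X assumed of full column rank).
    Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    # SVD of Yhat @ Gamma = U D V' as in (8).
    U, d, Vt = np.linalg.svd(Yhat @ Gamma, full_matrices=False)
    U_r = U[:, :r]                             # G-hat = U_r, as in (9)
    VD_r = Vt[:r].T * d[:r]                    # first r columns of V D
    # A-hat by the OLS regression of U_r on X, as in (10).
    A_hat = np.linalg.lstsq(X, U_r, rcond=None)[0]
    return A_hat @ VD_r.T @ np.linalg.inv(Gamma)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=(20, 2))
Theta_r1 = olssvd(X, Y, r=1)                   # rank-1 coefficient matrix
Theta_full = olssvd(X, Y, r=2)                 # full rank: equals plain OLS
```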

Estimation in Multivariate Reduced Rank Regression using Partial Least Squares and Singular Value Decomposition
It was shown above that the coefficient matrix in reduced rank regression is given by Θ̂_{OLS-SVD} = Â_OLS B̂′_SVD. Â_OLS and B̂_SVD can be expanded using (9) and (10) to obtain (for Γ = I)

Θ̂_{OLS-SVD} = (X′X)^{-1}X′U_r([VD]_r)′.

When multicollinearity exists, if it is perfect then it is mathematically impossible to estimate Θ since (X′X)^{-1} does not exist. If the multicollinearity is not perfect but strong, the estimates obtained will have high standard errors, which results in values far from reality. Remember that, in the presence of multicollinearity, the variances, the covariances and the absolute values of the regression coefficients are very large.
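The effect described here is easy to reproduce numerically. In this small sketch (our own, with hypothetical data), two nearly identical regressors make the condition number of X′X explode, and the OLS coefficients change wildly from one noisy refit to the next:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)   # x2 nearly equal to x1: strong multicollinearity
X = np.column_stack([x1, x2])

# The condition number of X'X is enormous, so (X'X)^{-1} is numerically fragile.
print(np.linalg.cond(X.T @ X))

# Two refits on equally good noisy data give very different coefficients,
# even though the underlying relation y = x1 + noise is the same each time.
for _ in range(2):
    y = x1 + 0.1 * rng.normal(size=n)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)
```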

PLS Regression
Partial Least Squares (PLS) methods were introduced by Wold H. [28], and PLS regression by Wold S., Martens H. and Wold H. (1983) [29]. Since then many algorithms and applications have been proposed, chiefly in the fields of chemometrics, medicine, biology, petroleum, agriculture and continuous industrial processes. PLS regression (PLSR) is a technique that generalizes and combines the characteristics of principal component analysis and multivariate regression. It is particularly useful when it is necessary to predict a set of dependent variables in terms of a set of independent variables.
As noted previously, the classical multivariate regression procedures have serious stability problems in the presence of severe or strong multicollinearity. PLS regression is a good alternative in such circumstances, since it uses basic iterative concepts to fit bilinear models in several blocks of variables, giving a solution for multivariate statistics in the presence of many collinear variables and a modest number of observations. The stability problems in identification and convergence described by Martens et al. in [16] can thus be avoided.

The PLS Regression Model
Many estimation algorithms have been developed for this type of regression, as noted for example in Wold et al. [29]. One of the simplest algorithms is that of Hoskuldsson [11], described at a high level as follows in terms of the weights of the independent variables w_h, the weights of the dependent variables c_h, the score vectors of the independent variables t_h, the score vectors of the dependent variables u_h, and also the loading vectors p_h. These vectors resulting from the algorithm give the PLS estimation. We base the calculations on the modified PLSR algorithm described in Tenenhaus (1998) [20] and emphasize that our objective is centered on calculating the matrix of PLS regression coefficients. Before showing the modified algorithm, we show the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm, which is the basis for PLSR. It begins by standardizing the sets X and Y.

Stage 1: Set X_0 = X and Y_0 = Y.
In Tenenhaus (1998) [20], it is shown that the weight vectors w_h and c_h and the latent variables t_h and u_h are obtained as solutions of the following equations associated with the highest common eigenvalue λ_h:

X′_{h-1}Y_{h-1}Y′_{h-1}X_{h-1}w_h = λ_h w_h,
Y′_{h-1}X_{h-1}X′_{h-1}Y_{h-1}c_h = λ_h c_h,
X_{h-1}X′_{h-1}Y_{h-1}Y′_{h-1}t_h = λ_h t_h,
Y_{h-1}Y′_{h-1}X_{h-1}X′_{h-1}u_h = λ_h u_h.

The PLS regression principle can be summarized as obtaining the components t_h = X_{h-1}w_h of maximal covariance with the responses. The calculation of the Y_h residuals is unnecessary for the determination of the PLS components; rather, these residuals are used for choosing the number of components. Let us therefore look at the modified version of the PLS regression algorithm, which leads to the same vectors w_h, t_h, c_h and p_h as the classical algorithm but obtains the latent variables t_h and u_h directly as solutions of the eigenvalue equations above associated with the greatest common eigenvalue λ_h. The solution of each stage permits the construction of the following matrices:

T_H = [t_1, t_2, ..., t_H], U_H = [u_1, u_2, ..., u_H], C_H = [c_1, c_2, ..., c_H], P_H = [p_1, p_2, ..., p_H] and W̃_H = [w̃_1, w̃_2, ..., w̃_H].
To obtain the regression coefficient matrix and the regression equation we use the following expression relating T_H and W̃_H: T_H = XW̃_H.

To express the regression of Y on the PLS components t_1, ..., t_H in terms of the original variables X, substitute T_H = XW̃_H into Y = T_H C′_H + Y_H to obtain Y = XW̃_H C′_H + Y_H.
Finally, we can write the equation for the regression coefficient matrix Θ_PLS as

Θ̂_PLS = W̃_H C′_H. (15)

The matrix of regression coefficients of (15) is determined from the components t_h, which are mutually orthogonal, thus assuring that multicollinearity does not affect this matrix of parameters, in contrast with the estimation of regression coefficients using ordinary least squares. The idea of taking advantage of this fact will now be presented.
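The algorithm and the coefficient matrix of (15) can be sketched as follows (our own NIPALS-style implementation; X and Y are assumed centered, and the names W, P, C and W̃ follow the text):

```python
import numpy as np

def pls2(X, Y, H):
    """PLS regression sketch: returns Theta_PLS = W~_H C'_H of size (p, q)."""
    n, p = X.shape
    q = Y.shape[1]
    Xh, Yh = X.copy(), Y.copy()
    W = np.zeros((p, H)); P = np.zeros((p, H)); C = np.zeros((q, H))
    for h in range(H):
        u = Yh[:, [0]]                        # Stage 2.1: first column of Y_h
        for _ in range(500):                  # Stage 2.2: iterate to convergence
            w = Xh.T @ u / (u.T @ u)          # Stage 2.2.1
            w = w / np.linalg.norm(w)         # Stage 2.2.2: unit length
            t = Xh @ w                        # Stage 2.2.3 (w'w = 1)
            c = Yh.T @ t / (t.T @ t)          # Y-weights
            u_new = Yh @ c / (c.T @ c)
            if np.linalg.norm(u_new - u) < 1e-12:
                break
            u = u_new
        p_h = Xh.T @ t / (t.T @ t)            # X-loadings
        Xh = Xh - t @ p_h.T                   # deflate X
        Yh = Yh - t @ c.T                     # deflate Y (used to choose H)
        W[:, [h]], P[:, [h]], C[:, [h]] = w, p_h, c
    W_tilde = W @ np.linalg.inv(P.T @ W)      # so that T_H = X @ W_tilde
    return W_tilde @ C.T                      # Theta_PLS of equation (15)

# With H = p components and noiseless linear data the PLS fit is exact.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=(3, 2))
Theta = pls2(X, Y, H=3)
```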

Reduced Rank Regression Procedure PLSSVD
As a contribution to reduced rank regression in the presence of multicollinearity, the estimation procedure denoted PLSSVD is now presented. It is an integration of the PLS method and the procedure of Ter Braak and Looman (1994) [21]. The integrated procedure assures that the estimation of the regression coefficients is affected as little as possible by problems of multicollinearity. The procedure consists of the following steps:

1. Begin with the multivariate regression model Y = XΘ + E (with the variables of the two sets X and Y centered or standardized).
2. Factor the coefficient matrix as Θ = AB′, so that the multivariate reduced rank regression model becomes Y = XAB′ + E. As in equation (3), it is then possible to express the model as Y = GB′ + E, where G = XA.

3. Minimize ∥(Y − XAB′)Γ∥², using as an initial estimation for Y that obtained by SVD as was done in equation (8): ŶΓ = UDV′, where U and V are orthogonal matrices of order (n × t) and (q × t), with t = min(n, q), that contain the eigenvectors of (ŶΓ)(ŶΓ)′ and (ŶΓ)′(ŶΓ) respectively; D is a diagonal matrix of order q containing the singular values of ŶΓ in non-increasing order (λ_1 ≥ λ_2 ≥ ... ≥ λ_t ≥ 0); and Γ is a positive definite matrix.
4. Since Ĝ = U_r = XA, it is possible to make a comparison with the regression model with U_r representing the matrix of dependent variables, X the matrix of independent variables and A the matrix of regression coefficients. When the matrix of independent variables presents high multicollinearity, it is not convenient to do the estimation using the OLS model, and the application of the PLS estimation of A is proposed instead. According to the described PLS method, the estimate given by equation (15) is Â_PLS = W̃_H C′_H, obtained from the PLS regression of U_r on X.
5. The matrix of regression coefficients is obtained by combining the PLS estimate of A with the estimate of B given by SVD, so that the reduced rank regression coefficient matrix given by PLSSVD is Θ̂_{PLS-SVD} = Â_PLS B̂′_SVD. Finally, the estimates of Y are given by Ŷ = XΘ̂_{PLS-SVD}.

Note that the calculation of (X′X)^{-1} is not present in the PLSSVD procedure, so that multicollinearity does not affect the estimation of the coefficient matrix.
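Putting the pieces together, the PLSSVD procedure can be sketched as follows (our own code, with Γ = I; the compact `_pls_coef` helper is a NIPALS-style PLS as in the previous section, and the initial fitted values Ŷ are themselves obtained by PLS so that (X′X)^{-1} is never formed, which is one possible reading of the procedure):

```python
import numpy as np

def _pls_coef(X, Y, H):
    """Compact NIPALS-style PLS coefficient matrix (see the PLS section)."""
    Xh, Yh = X.copy(), Y.copy()
    W, P, C = [], [], []
    for _ in range(H):
        u = Yh[:, [0]]
        for _ in range(500):
            w = Xh.T @ u
            w = w / np.linalg.norm(w)
            t = Xh @ w
            c = Yh.T @ t / (t.T @ t)
            u_new = Yh @ c / (c.T @ c)
            if np.linalg.norm(u_new - u) < 1e-12:
                break
            u = u_new
        p = Xh.T @ t / (t.T @ t)
        Xh = Xh - t @ p.T
        Yh = Yh - t @ c.T
        W.append(w); P.append(p); C.append(c)
    W, P, C = np.hstack(W), np.hstack(P), np.hstack(C)
    return W @ np.linalg.inv(P.T @ W) @ C.T

def plssvd(X, Y, r, H):
    """PLSSVD sketch: PLS estimate of A combined with the SVD estimate of B."""
    # Step 3 (Gamma = I): initial fitted values and their SVD, as in (8).
    Yhat = X @ _pls_coef(X, Y, H)
    U, d, Vt = np.linalg.svd(Yhat, full_matrices=False)
    U_r = U[:, :r]                 # G-hat = U_r
    B_svd = Vt[:r].T * d[:r]       # [VD]_r, the SVD estimate of B
    # Step 4: estimate A by the PLS regression of U_r on X.
    A_pls = _pls_coef(X, U_r, H)
    # Step 5: reduced-rank coefficient matrix; no (X'X)^{-1} anywhere.
    return A_pls @ B_svd.T

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 4)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=(4, 3)) + 0.05 * rng.normal(size=(40, 3))
Theta = plssvd(X, Y, r=1, H=2)
```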

Comparison of the Procedures
In this section, comparisons are made between the reduced rank regression estimation procedure of Ter Braak and Looman [21], called hereafter OLSSVD, and the estimation procedure proposed in this article called PLSSVD.
The comparison is done using data already studied by other authors. Following the classification by condition index given in Montgomery and Peck [17], three scenarios are presented according to the grade of multicollinearity found in the data: severe multicollinearity, moderate to strong multicollinearity and light multicollinearity (see Table 1). All the analysis is performed using standardized data and weight matrices Γ = I and Γ = R_yy^{-1/2}.
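For reference, the condition index used for this classification can be computed as the ratio of the largest to the smallest eigenvalue of Z′Z for column-standardized Z (a sketch under that assumption; thresholds of roughly 100 and 1000 then separate light, moderate to strong, and severe multicollinearity, consistent with Table 1):

```python
import numpy as np

def condition_index(X):
    """Ratio lambda_max / lambda_min of Z'Z for column-standardized Z."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    lam = np.linalg.eigvalsh(Z.T @ Z)      # eigenvalues in ascending order
    return lam[-1] / lam[0]

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
X_bad = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=200),
                         rng.normal(size=200)])
X_ok = rng.normal(size=(200, 3))
print(condition_index(X_bad))   # enormous: severe multicollinearity
print(condition_index(X_ok))    # small: no serious multicollinearity
```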
Table 1. Condition Indices for the three data sets

Data Set                      Condition Index   Grade of Multicollinearity
Hervé Abdi (2003) [10]        5.2298 × 10^16    Severe
Alvin Rencher (2002) [19]     224.1492          Moderate to Strong
Valencia et al. (2004) [24]   59.5997           Light

The estimations are evaluated using indices which measure the global variability of the residual matrix, as explained by Peña (2002, p. 77) [18]. The first index is the trace (T) of the covariance matrix of the residual matrix; however, since this measure does not take into account the dependence structure among the variables, a second index, called the generalized variance (GV) and defined as the determinant of the covariance matrix of the residuals, is also used. A third index, called the generalized standard deviation (GSD), is obtained by taking the square root of the generalized variance. A fourth and last index, called the effective variance (EV), is defined as the generalized variance raised to the power 1/q. The mathematical expressions of these indices are found in Table 2; in particular, EV = [det(S)]^{1/q}, where S = cov(Ê) and Ê is the matrix of residuals generated by the estimation procedure.
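The four indices of Table 2 are straightforward to compute from a residual matrix (a sketch; the function name and the synthetic residuals are ours):

```python
import numpy as np

def global_variability(E):
    """T, GV, GSD and EV of Table 2 for a residual matrix E of size (n, q)."""
    S = np.cov(E, rowvar=False)            # S = cov(E-hat)
    q = E.shape[1]
    GV = np.linalg.det(S)                  # generalized variance
    return {"T": np.trace(S),              # trace of the covariance matrix
            "GV": GV,
            "GSD": np.sqrt(GV),            # generalized standard deviation
            "EV": GV ** (1.0 / q)}         # effective variance

rng = np.random.default_rng(6)
E = rng.normal(size=(500, 2))              # residuals of a hypothetical fit
idx = global_variability(E)
```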

Data with Severe Multicollinearity
The first set of variables, taken from Hervé Abdi (2003, p. 3 [10]), consists of measurements on five types of wine, with price (x_1), quantity of sugar (x_2), alcohol (x_3) and acidity (x_4) as independent variables, and, as dependent variables, the experts' classification of the pleasure received in drinking the wines, categorized in three groups (y_1: Pleasure, y_2: Meat and y_3: Dessert) according to their use in parties or to accompany meat or desserts. Table 1 shows that this data presents severe multicollinearity.

Data with Moderate to Strong Multicollinearity
The second set of data is given by 19 observations of an experiment involving a chemical reaction (Box and Youle, in Rencher Alvin, 2002, p. 340 [19]). As shown in Table 1, this data presents moderate to strong multicollinearity. The variables are identified as follows: Percentage of the initial material preserved or unchanged (x_1), Percentage converted to the desired product (x_2), Percentage of flawed product (x_3), Temperature (y_1), Concentration (y_2), Time (y_3).

Data with Light Multicollinearity
The third and last set of data (Table 1), taken from Valencia et al. (2004, p. 73 [24]), consists of three dependent variables (y_1, y_2, y_3), three independent variables (x_1, x_2, x_3) and seven observations. The condition index of this data (see Table 1) shows that the independent variables do not present serious problems of multicollinearity, so that the data set is said to have light multicollinearity.

Comparison of the PLSSVD and OLSSVD procedures using the weight matrices Γ = I and Γ = R_yy^{-1/2}
Table 3 shows the correlations of the independent variables of the three sets of data. In the data with severe multicollinearity, the variable price (x_1) has a strong negative correlation with the variables alcohol and acidity (x_3 and x_4) and a positive correlation of 0.32 with the quantity of sugar (x_2). The quantity of sugar has no relation with the other variables. Regarding the data with moderate multicollinearity, x_1 and x_3 have the highest correlation among the independent variables, followed by a moderate correlation between x_1 and x_2; lastly, x_2 and x_3 present a very low correlation. The data with light multicollinearity presents independent variables with high correlations.

M. = Multicollinearity
Table 4 gives the global variability measures already described for the matrix of residuals obtained by each of the procedures and for each of the two weight matrices. These measures show that both procedures give exactly the same results for light and moderate to strong multicollinearity, whereas for severe multicollinearity the PLSSVD procedure provides residuals with much less global variability than the OLSSVD procedure for both weight matrices. For both procedures, the variability indices are greater when Γ = R_yy^{-1/2} than when Γ = I.

Comparison of the predictive power of the two procedures for the given data
Another important aspect to consider is the predictive power of the two procedures. In order to compare them, the sum of squares of the prediction errors (PRESS) was calculated for each dependent variable and, as the variables were standardized, the sum of the PRESS values (Total PRESS) was also calculated to provide a single measure of the global predictive power of each procedure for each data set, following the ideas of Valencia et al. (2004). These indices were calculated for each of the three situations or data sets. The PLSSVD method has higher predictive power than the OLSSVD method when the data presents severe multicollinearity, as evidenced by the fact that for Γ = I, the Total PRESS of PLSSVD is 82.59 against a Total PRESS of 253.18 for OLSSVD. For the other two grades of multicollinearity the OLSSVD procedure performs better. Similar results are seen for Γ = R_yy^{-1/2}. Turning to the simulations, the OLSSVD procedure in some cases is not able to make estimates due to the multicollinearity and, although OLSSVD improves somewhat for large samples, it is not stable. In the second scenario (Figure 1(b)), the MSE for PLSSVD is again observed to be lower than that of OLSSVD for sample sizes less than 100, whereas the behavior of the two procedures is similar for larger sample sizes. For the third and fourth scenarios (Figures 1(c) and 1(d)), similar behavior, including the instability of OLSSVD, is again observed.
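The PRESS computation can be sketched with leave-one-out refitting (our own scheme; the paper does not state how the prediction errors were generated, so this is one common choice, shown here with plain OLS as the fitting procedure):

```python
import numpy as np

def press(X, Y, fit):
    """PRESS per response and Total PRESS by leave-one-out refitting.

    fit(X, Y) must return a coefficient matrix Theta; predictions are X @ Theta.
    """
    n, q = Y.shape
    errors = np.zeros((n, q))
    for i in range(n):
        mask = np.arange(n) != i
        Theta = fit(X[mask], Y[mask])       # refit without sample i
        errors[i] = Y[i] - X[i] @ Theta     # prediction error for sample i
    per_variable = (errors ** 2).sum(axis=0)
    return per_variable, per_variable.sum() # PRESS per y_j and Total PRESS

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(25, 2))
ols = lambda X, Y: np.linalg.lstsq(X, Y, rcond=None)[0]
per_var, total = press(X, Y, ols)
```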

Conclusions
The comparisons of the PLSSVD and OLSSVD procedures show that, in the presence of severe multicollinearity of the independent variables, the PLSSVD estimation procedure results in lower residual values than the OLSSVD procedure; however, when the multicollinearity of the independent variables decreases, the opposite is true. For both Γ = I and Γ = R_yy^{-1/2}, and for both the real data and the simulated data, PLSSVD has better predictive power than OLSSVD when the data presents severe multicollinearity. Even though in some cases the OLSSVD procedure presents good estimates, it is not consistently stable in the presence of severe multicollinearity.

Figure 1. Simulations of severe collinearity

The mean square error (MSE) was used to compare the OLSSVD and PLSSVD procedures for simulated data with severe multicollinearity. Multiple samples of different sizes, from 5 to 500, were compared in four different scenarios: the first using Γ = I and standardized data (Figure 1(a)); the second using Γ = I and centralized data (Figure 1(b)); the third using Γ = R_yy^{-1/2} and standardized data (Figure 1(c)); and the fourth using Γ = R_yy^{-1/2} and centralized data (Figure 1(d)). The results for each of the simulations are shown in Figure 1. In the first simulation (Figure 1(a)), the PLSSVD procedure is seen to have lower MSE than the OLSSVD procedure. Furthermore, the OLSSVD procedure in some cases is not able to make estimates due to the multicollinearity.
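A skeleton of such a simulation is easy to set up (our own sketch: the collinearity scheme, the rank-1 coefficient matrix and the noise level are assumptions, since the paper does not list its simulation design; any of the estimators discussed above can be plugged in as `estimator`):

```python
import numpy as np

def simulate_mse(estimator, n, trials=50, seed=0):
    """Average MSE of fitted values over severely collinear simulated data."""
    rng = np.random.default_rng(seed)
    mse = 0.0
    for _ in range(trials):
        x1, x2 = rng.normal(size=(2, n))
        x3 = x1 + x2 + 1e-7 * rng.normal(size=n)   # severe multicollinearity
        X = np.column_stack([x1, x2, x3])
        Theta_true = np.outer([1.0, 0.5, 0.0], [1.0, -1.0])  # rank-1 truth
        Y = X @ Theta_true + 0.1 * rng.normal(size=(n, 2))
        Theta_hat = estimator(X, Y)
        mse += np.mean((X @ Theta_true - X @ Theta_hat) ** 2)
    return mse / trials

# Example with a pseudoinverse-based fit as a stable baseline estimator.
pinv_fit = lambda X, Y: np.linalg.pinv(X) @ Y
m = simulate_mse(pinv_fit, n=20)
```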
The NIPALS-based algorithm proceeds as follows:

Stage 2: for h = 1, 2, ..., a:
  Stage 2.1: u_h = the first column of Y_{h-1}.
  Stage 2.2: Repeat until convergence of w_h:
    Stage 2.2.1: w_h = X′_{h-1}u_h/(u′_h u_h).
    Stage 2.2.2: Scale w_h to be of length one.
    Stage 2.2.3: t_h = X_{h-1}w_h/(w′_h w_h).
    Stage 2.2.4: c_h = Y′_{h-1}t_h/(t′_h t_h) and u_h = Y_{h-1}c_h/(c′_h c_h).
  Stage 2.3: Compute the loadings p_h = X′_{h-1}t_h/(t′_h t_h) and the residuals X_h = X_{h-1} − t_h p′_h and Y_h = Y_{h-1} − t_h c′_h, then repeat the h-th step beginning with the first stage.

One of the interesting properties of the previous algorithm is that the components t_h are orthogonal, which implies that the matrices X_h and Y_h are the respective residuals of the regressions of X and Y on the components t_1, t_2, ..., t_h. The following modifications give origin to the PLS regression algorithm:

Stage 2: for h = 1, 2, ..., a:
  Stage 2.1: ũ_h = the first column of Y.
  Stage 2.2: Repeat until convergence of w_h:
    Stage 2.2.1: w_h = X′_{h-1}ũ_h/(ũ′_h ũ_h).
    Stage 2.2.2: Scale w_h to be of length one.
    Stage 2.2.3: t_h = X_{h-1}w_h/(w′_h w_h).
    Stage 2.2.4: ũ_h = Yc_h/(c′_h c_h), with c_h = Y′t_h/(t′_h t_h).

Defining C_H = [c_1, c_2, ..., c_H], the regression equation becomes Y = T_H C′_H + Y_H, which in turn can be expressed as the regression of Y on the components t_1, ..., t_H:

Y = t_1 c′_1 + t_2 c′_2 + ... + t_H c′_H + Y_H. (13)

Stat., Optim. Inf. Comput. Vol. 4, June 2016. WILLIN ÁLVAREZ AND V. GRIFFIN

Table 2. Global Variability Measures for the covariance matrix of the residuals

Table 3. Correlation of the Independent variables of the three data sets

Table 4. Global Variability Indices of the residuals for the three data sets