Matrix Factorizations Based on Induced Norms

We decompose a matrix Y into a sum of bilinear terms in a stepwise manner by considering Y as a mapping from one finite dimensional Banach space into another. We provide transition formulas and represent them in a duality diagram, thus generalizing the well-known duality diagram of the French school of data analysis. As an application, we introduce a family of Euclidean multidimensional scaling models.


Introduction
Matrix factorization, also named matrix decomposition, is at the core of factor analysis in data analysis; one of its principal aims, as clearly stated by Hubert et al. (2000), is to visualize geometrically the statistical association existing among the rows or the columns of the matrix. So the way we factorize a matrix is of fundamental interest and concern in statistics. What is surprising is that the oldest method, the centroid factorization, see Burt (1917) and Thurstone (1931), has recently been rediscovered many times; see for instance proposal 1 in McCoy and Tropp (2011). Singular value decomposition (SVD) is the most used matrix decomposition method in statistics; the aim of this paper is to present in a coherent way the theory of SVD-like matrix factorizations based on subordinate or induced norms, and at the same time to review the existing literature. This presentation generalizes the SVD by embedding it in a larger family: the SVD belongs to the class of optimal biconjugate decompositions, which are based on the Wedderburn rank-one reduction theorem as described by Chu et al. (1995). Another generalization of the SVD, the GSVD, is presented by Hubert et al. (2000); it forms the basis of the French school of data analysis, as reviewed recently by Holmes (2008) and De La Cruz and Holmes (2011). We also incorporate the GSVD in our representation. This paper is organized as follows: Section 2 presents the preliminaries concerning induced or subordinate matrix norms; Section 3 presents the matrix factorizations based on induced norms; and we conclude in Section 4.
The proof of (N4) is based on the Hölder and Minkowski inequalities. We define the unit sphere to be S_p^n = {x ∈ R^n : ||x||_p = 1}, and (p, p_1) designates a conjugate pair, that is, 1/p + 1/p_1 = 1 for p ≥ 1 and p_1 ≥ 1.
Hölder inequality: for x ∈ l_p^n and x* ∈ l_{p_1}^n, |x*'x| ≤ ||x*||_{p_1} ||x||_p, where x*' is the transpose of the column vector x*; further, <x*, x> = x*'x represents a scalar product only when the conjugate pair (p, p_1) = (2, 2). The next result is an application of the Hölder inequality.
Lemma 1: For x ∈ l_p^n, ||x||_p = max{x*'x : x* ∈ S_{p_1}^n}; a maximizing x* will be denoted by φ(x).
Remark: In more general settings, Lemma 1 is proven as a corollary to the famous Hahn-Banach theorem, see for instance Kreyszig (1978, p.223).
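To make Lemma 1 concrete, the following minimal numerical sketch (in Python with NumPy; the helper name dual_vector is ours, not from the text) constructs, for 1 < p < ∞, a vector x* on the dual unit sphere S_{p_1}^n attaining equality in the Hölder inequality, that is, a norming vector φ(x).

```python
import numpy as np

def dual_vector(x, p):
    """A vector x* with ||x*||_{p1} = 1 and x*'x = ||x||_p, i.e. the
    Hoelder inequality x*'x <= ||x*||_{p1} ||x||_p holding with equality."""
    q = p / (p - 1)                          # conjugate exponent p1
    z = np.sign(x) * np.abs(x) ** (p - 1)
    return z / np.linalg.norm(z, q)          # rescale so that ||x*||_{p1} = 1

x = np.array([3.0, -1.0, 2.0])
for p in (1.5, 2.0, 3.0):
    xs = dual_vector(x, p)
    print(p, xs @ x, np.linalg.norm(x, p))   # the last two numbers coincide
```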
Let B(l_r^n, l_p^m) be the set of bounded linear maps (operators) from l_r^n to l_p^m, which we identify with the set of m × n real matrices in the usual way. If A ∈ B(l_r^n, l_p^m), then A' ∈ B(l_{p_1}^m, l_{r_1}^n). Let A ∈ B(l_r^n, l_p^m) be an operator; then its induced or subordinate norm is defined to be ||A||_{r→p} = max{||Au||_p : u ∈ S_r^n}. Then ||Au||_p ≤ ||A||_{r→p} ||u||_r for every u ∈ l_r^n, and the next theorem is a well-known central result.
Theorem 1: Let A ∈ B(l_r^n, l_p^m) and let λ_1 = ||A||_{r→p}. Then
λ_1 = max{||Au||_p : u ∈ S_r^n} = max{||A'v||_{r_1} : v ∈ S_{p_1}^m} = max{v'Au : u ∈ S_r^n, v ∈ S_{p_1}^m} = ||A'||_{p_1→r_1},
and at a maximizing pair (u_1, v_1) we have
Au_1 = a_1 with λ_1 = ||a_1||_p = v_1'a_1, so that v_1 = φ(a_1), (3)
A'v_1 = b_1 with λ_1 = ||b_1||_{r_1} = u_1'b_1, so that u_1 = φ(b_1); (4)
the last two equations are known as transition formulas.
Proof: For u ∈ S_r^n and v ∈ S_{p_1}^m, we consider the bilinear form v'Au. By the Hölder inequality,
v'Au ≤ ||v||_{p_1} ||Au||_p = ||Au||_p, (5)
and by Lemma 1 the bound is attained at v = φ(Au); hence
max{v'Au : u ∈ S_r^n, v ∈ S_{p_1}^m} = max{||Au||_p : u ∈ S_r^n} = ||A||_{r→p}. (6)
Now using (5) and (6) and replacing A by A', we have max{v'Au : u ∈ S_r^n, v ∈ S_{p_1}^m} = max{u'A'v : v ∈ S_{p_1}^m, u ∈ S_r^n} = ||A'||_{p_1→r_1}, which is the required result.
The transition formulas (3) and (4) can be represented in a duality diagram. a) The geometrical-statistical interpretation of Theorem 1 is that λ_1 is the largest dispersion value by which the operator A stretches an element u ∈ S_r^n; u_1 is called the first principal axis of the rows of A, and a_1 represents the projected values of the rows of A on u_1, which we name the first projected row factor or the first principal component. By duality, λ_1 is also the largest dispersion value by which the operator A' stretches an element v ∈ S_{p_1}^m; v_1 is called the first principal axis of the columns of A, and b_1 is the first projected column factor, which represents the projected values of the columns of A on v_1. Essentially, we are computing the quintuplet (a_1, b_1, u_1, v_1, λ_1).
b) The vectors a_1, b_1, u_1 and v_1 belong to four different spaces: a_1 ∈ l_p^m, b_1 ∈ l_{r_1}^n, u_1 ∈ S_r^n ⊂ l_r^n and v_1 ∈ S_{p_1}^m ⊂ l_{p_1}^m. c) The transition formulas provide us with an iterative algorithm to compute a maximum of {||Au||_p : u ∈ S_r^n}; this maximum value can be a relative (local) maximum, whereas the norm ||A||_{r→p} corresponds to the absolute maximum. The algorithm is named the power method for the l_p norm by Boyd (1974); Wold's (1966) NIPALS (nonlinear iterative partial least squares) algorithm, also named criss-cross regression by Gabriel and Zamir (1979), is a particular case. The algorithm can be summarized in the following way, where b is a starting value and φ(·) is the map of Lemma 1: Step 1: u = φ(b), a = Au and λ(a) = ||a||_p; Step 2: v = φ(a), b = A'v and λ(b) = ||b||_{r_1}; Step 3: if λ(b) − λ(a) > 0, go to Step 1; otherwise, stop.
The proof of the convergence of the algorithm is based on applying the Hölder inequality twice. Let u^(k), v^(k), a^(k) = Au^(k) and b^(k) = A'v^(k), for k ≥ 1, represent the kth iteration values; then
λ(a^(k)) = ||a^(k)||_p = v^(k)'a^(k) = v^(k)'Au^(k) = u^(k)'b^(k) ≤ ||b^(k)||_{r_1} = λ(b^(k)),
and, similarly, λ(b^(k)) = u^(k+1)'b^(k) = v^(k)'Au^(k+1) ≤ ||Au^(k+1)||_p = λ(a^(k+1)).
So the sequence of dispersion values is nondecreasing and bounded above by ||A||_{r→p}, hence it converges. The rows or the columns of A can be used as starting values for a or b.
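The criss-cross iteration translates directly into a few lines of code. The following is a minimal sketch (in Python with NumPy; the names norming and crisscross are ours), in which norming plays the role of φ and the stopping rule mirrors Step 3 up to a small numerical tolerance; the iteration converges to a relative maximum, which need not equal ||A||_{r→p}.

```python
import numpy as np

def norming(y, p):
    """Return z with ||z||_{p1} = 1 and z'y = ||y||_p (the map phi of Lemma 1)."""
    y = np.asarray(y, dtype=float)
    if p == 1:
        return np.sign(y)                          # z lies on the l_infinity sphere
    if p == np.inf:
        z = np.zeros_like(y)
        j = np.argmax(np.abs(y))
        z[j] = 1.0 if y[j] >= 0 else -1.0
        return z                                   # z lies on the l_1 sphere
    q = p / (p - 1)                                # conjugate exponent p1
    z = np.sign(y) * np.abs(y) ** (p - 1)
    return z / np.linalg.norm(z, q)

def crisscross(A, r, p, b=None, tol=1e-10, max_iter=1000):
    """Power (criss-cross) iteration for the induced norm ||A||_{r->p}."""
    A = np.asarray(A, dtype=float)
    r1 = 1.0 if r == np.inf else (np.inf if r == 1 else r / (r - 1))
    b = A[0, :] if b is None else np.asarray(b, dtype=float)   # a row of A as start
    while max_iter > 0:
        max_iter -= 1
        u = norming(b, r1)                         # u on S_r,   u'b = ||b||_{r1}
        a = A @ u
        lam_a = np.linalg.norm(a, p)
        v = norming(a, p)                          # v on S_{p1}, v'a = ||a||_p
        b = A.T @ v
        lam_b = np.linalg.norm(b, r1)
        if lam_b - lam_a <= tol:                   # Step 3: no further increase
            break
    return a, b, u, v, lam_b

A = np.array([[2.0, -1.0, 0.0], [1.0, 3.0, -2.0]])
a, b, u, v, lam = crisscross(A, r=np.inf, p=2)     # centroid-type norm ||A||_{inf->2}
```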

Particular norms
Let A ∈ B(l_r^n, l_p^m); then A' ∈ B(l_{p_1}^m, l_{r_1}^n). In general the conjugate pairs (r, r_1) and (p_1, p) are not equal, which implies that the geometry of the rows is different from the geometry of the columns. If (r, r_1) = (p_1, p), that is, r = p_1 and r_1 = p, then the geometric structure defined on the rows of A is identical to the geometric structure defined on the columns of A; this class was named transposition invariant by Choulakian (2005).
For two particular values of the conjugate pair (r, r_1), explicit formulas are available; the spectral norm is the best known of these, and it is transposition invariant. a) For (r, r_1) = (∞, 1) an explicit formula is available; the proof is based on the Hölder inequality, applied to any v ∈ S_{p_1}^m. b) The spectral norm: ||A||_{2→2} = √λ_max, where λ_max is the greatest eigenvalue of AA' or A'A; the proof is similar to the proof in a), again considering any v ∈ S_{p_1}^m. Drakakis and Pearlmutter (2009) and Lewis (2010) discuss the nine cases of ||A||_{r→p} for r, p ∈ {1, 2, ∞}, which can be easily deduced from the above results.
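As a brief numerical illustration (Python with NumPy assumed), the spectral-norm identity can be checked directly; for comparison, two other standard closed forms for induced norms are shown, namely the largest column p-norm when the domain is l_1 and the largest row r_1-norm when the range is l_∞ (these two are stated here as standard facts about induced norms).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))

# spectral norm: ||A||_{2->2} = sqrt(largest eigenvalue of A'A or A A')
print(np.sqrt(np.linalg.eigvalsh(A.T @ A).max()), np.linalg.norm(A, 2))

# ||A||_{1->p} = largest column p-norm (here p = 2)
print(max(np.linalg.norm(A[:, j], 2) for j in range(A.shape[1])))

# ||A||_{r->inf} = largest row r1-norm (here r = 2, so r1 = 2)
print(max(np.linalg.norm(A[i, :], 2) for i in range(A.shape[0])))
```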

Matrix factorizations
Let X ∈ B(l_r^n, l_p^m), and let (a_1, b_1) be the first projected factors associated with λ_1. We repeat the above procedure on the residual dataset
E_1 = X − a_1 b_1'/λ_1 = X − X u_1 v_1' X/(v_1' X u_1).
Equations (7), (8) and (9) are known as Wedderburn's rank-one reduction formula; see Chu, Funderlic and Golub (1995). By repeating the above procedure we get the data reconstitution formula for the matrix X as a function of the projected row and column factor coordinates (a_α, b_α) associated with the dispersion values λ_α, for α = 1, ..., k, and k = rank(X),
X = Σ_{α=1}^k a_α b_α'/λ_α, (10)
or elementwise x_ij = Σ_{α=1}^k a_{iα} b_{jα}/λ_α. Equation (10) represents the decomposition of X based on the l_r^n → l_p^m induced norm.
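A minimal sketch of the stepwise reduction follows (Python with NumPy; the names svd_first_factor and wedderburn_factorize are ours). To keep the block self-contained, the first factor is extracted with the l_2 → l_2 (SVD) rule, but any rule returning (a, b, λ) with λ = v'Eu ≠ 0, such as the criss-cross sketch above, could be plugged in; the residual matrix loses one unit of rank at each step and the reconstitution formula (10) is recovered exactly.

```python
import numpy as np

def svd_first_factor(X):
    """First factor under the l_2 -> l_2 (spectral) norm: a = X u1, b = X'v1, lam = sigma_1."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    u1, v1, lam = Vt[0, :], U[:, 0], s[0]
    return X @ u1, X.T @ v1, lam

def wedderburn_factorize(X, first_factor=svd_first_factor, k=None):
    """Stepwise Wedderburn rank-one reduction: E_0 = X, E_alpha = E_{alpha-1} - a b'/lam."""
    E = np.array(X, dtype=float)
    k = np.linalg.matrix_rank(E) if k is None else k
    factors = []
    for _ in range(k):
        a, b, lam = first_factor(E)
        E = E - np.outer(a, b) / lam            # rank drops by one at each step
        factors.append((a, b, lam))
    return factors, E

# data reconstitution: X = sum_alpha a_alpha b_alpha' / lam_alpha   (equation (10))
X = np.random.default_rng(0).normal(size=(6, 4))
factors, residual = wedderburn_factorize(X)
X_hat = sum(np.outer(a, b) / lam for a, b, lam in factors)
print(np.allclose(X, X_hat), np.linalg.norm(residual))
```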

The case of X symmetric
When the matrix X is symmetric, we can have a symmetric or a nonsymmetric factorization. a) If the norms are transposition invariant, that is, if the conjugate pairs satisfy (r, r_1) = (p_1, p), then the geometric structure defined on the rows of X is identical to the geometric structure defined on the columns of X. b) If the norms are not transposition invariant, that is, if (r, r_1) ≠ (p_1, p), then the geometric structure defined on the rows of X is different from the geometric structure defined on the columns of X.

A review
Here, we review cases discussed in the published statistical literature. a) The centroid decomposition is based on ||A||_{∞→2} = ||A'||_{2→1}; its transition formulas are Au_1 = a_1 and v_1 = a_1/√(a_1'a_1), and A'v_1 = b_1 and u_1 = sgn(b_1). It is the oldest decomposition to our knowledge: it was first used by Burt (1917), then by Thurstone (1931) to factorize covariance matrices, and it was used extensively in the psychometric literature before the advent of computers; see for instance Thurstone (1947), Horst (1965) and Harman (1967). The Burt-Thurstone formulation was based on the criterion (11); its relationship with the matrix norm formulation was shown by Choulakian (2003). In different, but related, contexts it is discussed by Galpin and Hawkins (1987), Chu and Funderlic (2002), Choulakian (2005), Kwak (2008), and McCoy and Tropp (2011). Further, Choulakian (2012) considered it as a particular MAXBET procedure which takes into account the block structure of the variables. b) ||A||_{1→1} = ||A'||_{∞→∞} is used by Galpin and Hawkins (1987); its transition formulas are Au_1 = a_1 and v_1 = sgn(a_1), and A'v_1 = b_1 and u_1 = e_α such that α = arg max_j |b_{1j}| = arg max_j ||A_{*j}||_1.
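For concreteness, a minimal sketch of the centroid transition formulas (Python with NumPy; the name centroid_first_factor is ours): it alternates u = sgn(b) and v = a/√(a'a) until the dispersion value stops increasing, using a row of A as the starting value.

```python
import numpy as np

def centroid_first_factor(A, max_iter=200, tol=1e-12):
    """First centroid factor, i.e. the transition formulas of ||A||_{inf->2}:
    u = sgn(b), a = A u, v = a/||a||_2, b = A'v."""
    A = np.asarray(A, dtype=float)
    b = A[0, :]                              # a row of A as starting value
    lam_old = -np.inf
    for _ in range(max_iter):
        u = np.sign(b)
        a = A @ u
        v = a / np.linalg.norm(a, 2)
        b = A.T @ v
        lam = np.linalg.norm(b, 1)           # equals ||a||_2 at convergence
        if lam - lam_old <= tol:
            break
        lam_old = lam
    return a, b, u, v, lam
```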
c) The taxicab decomposition is based on ||A||_{∞→1} = ||A'||_{∞→1}; its transition formulas are Au_1 = a_1 and v_1 = sgn(a_1), and A'v_1 = b_1 and u_1 = sgn(b_1). It is the most robust among all the transposition invariant induced norms considered in this paper, and it is used extensively by Choulakian and coworkers in developing taxicab correspondence analysis: Choulakian (2004, 2006, 2008a, 2008b) and Choulakian et al. (2006, 2013a, 2013b). We also note that the taxicab decomposition of a covariance matrix is equivalent to the centroid decomposition of the centred dataset. The taxicab norm was first considered by Grothendieck; see the interesting story of the Grothendieck theorem and its many versions in Pisier (2012). Here we cite this remarkable result.
Grothendieck inequality: Let A = (a_ij) be a real matrix of size m × n; then, for unit vectors x_i, y_j ∈ R^d with i = 1, ..., m and j = 1, ..., n,
max Σ_ij a_ij x_i'y_j ≤ K_d max{Σ_ij a_ij s_i t_j : s_i, t_j ∈ {−1, +1}} = K_d ||A||_{∞→1}, (12)
where K_d is a universal constant that depends on d for d = 2, 3, ..., but does not depend on m and n. By defining K_1 = 1, we see that K_d ≥ K_{d−1} for d = 2, 3, .... There exists a universal constant K_G such that (12) is true for all d; its exact value is an open problem, and only conjectured values are known. An elementary proof of the inequality is given by Blei (1987) or Jameson (1987). A randomization algorithm to compute ||A||_{∞→1} via the Grothendieck inequality is studied by Alon and Naor (2006), and it is used by McCoy and Tropp (2011) to compute (11). Rohn (2000) shows that the computation of ||A||_{∞→1} is NP-hard.
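Since the computation of ||A||_{∞→1} is NP-hard, its exact value is accessible only for a small number of columns; the brief sketch below (Python with NumPy; taxicab_norm_exact is our name) enumerates the 2^n sign vectors and then applies the transition formulas once to obtain (a_1, v_1, b_1).

```python
import numpy as np
from itertools import product

def taxicab_norm_exact(A):
    """Exact ||A||_{inf->1} = max{||A u||_1 : u in {-1,+1}^n}, by enumeration;
    only sensible for small n, since the problem is NP-hard in general."""
    A = np.asarray(A, dtype=float)
    n = A.shape[1]
    best, u1 = -np.inf, None
    for signs in product((-1.0, 1.0), repeat=n):
        u = np.array(signs)
        val = np.abs(A @ u).sum()
        if val > best:
            best, u1 = val, u
    a1 = A @ u1
    v1 = np.sign(a1)                 # transition formula v_1 = sgn(a_1)
    b1 = A.T @ v1                    # and b_1 = A'v_1, with ||b_1||_1 = best
    return best, u1, a1, v1, b1
```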

A family of Euclidean multidimensional scaling models
Let ∆ = (δ_ij) be a symmetric n × n matrix with nonnegative elements and zeros on the diagonal, representing the dissimilarities of n objects. The aim of multidimensional scaling (MDS) techniques is to find a configuration of n points that matches the original dissimilarities as closely as possible. We shall consider the framework of classical MDS, also named principal coordinate analysis, where we suppose that the dissimilarities represent Euclidean distances; this implies that there exists a set of n centered points in a Euclidean space, denoted by {f_i : i = 1, ..., n}, such that δ_ij = ||f_i − f_j||_2. We thus have the well-known relationship Q = −(1/2) H ∆^(2) H, where ∆^(2) = (δ_ij^2), F = [f_1, ..., f_n], and H = I_n − 1_n 1_n'/n is the centering matrix, I_n the identity matrix and 1_n the vector of ones. So the matrix Q is positive semidefinite, and it is equal to F'F, where F is unknown. Now suppose that F ∈ B(l_r^n, l_p^m). Then ||F||_{r→p} = max{||Fu||_p : u ∈ S_r^n} can be expressed as a function of Q if p = 2 and r ≥ 1, because ||Fu||_2^2 = u'F'Fu = u'Qu is linear in Q; that is,
||F||_{r→2} = max{(u'Qu)^{1/2} : u ∈ S_r^n}. (18)
Factorizing Q into F'F by (18), we obtain a family of Euclidean multidimensional scaling (MDS) models as a function of r ≥ 1. Three particular cases are worthy of mention: a) For r = 2, we obtain the classical MDS; see for instance Torgerson (1952) and Gower (1966). Each f_α is an eigenvector of Q; see also subsection 2.1 part c). b) For r = ∞, we get the centroid MDS, where we maximize the Burt-Thurstone criterion (11); see also subsection 2.1 part a). c) For r = 1, we get the dominant MDS; its computation is extremely simple and fast; see subsection 2.1 part b).
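A minimal sketch of the r = 2 member of the family (classical MDS) is given below, in Python with NumPy; the name classical_mds is ours. The r = ∞ (centroid) and r = 1 (dominant) members are obtained by replacing the eigenvector step with the corresponding transition formulas applied to Q.

```python
import numpy as np

def classical_mds(Delta, k=2):
    """Classical (Torgerson) MDS, the r = 2 member of the family:
    double-center the squared dissimilarities to get Q = F'F, then factorize Q."""
    Delta = np.asarray(Delta, dtype=float)
    n = Delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    Q = -0.5 * H @ (Delta ** 2) @ H                # Q = -1/2 H Delta^(2) H, psd
    w, V = np.linalg.eigh(Q)                       # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]                  # keep the k largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))   # n x k coordinates
```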
Example 3: We consider the Facial Expressions data found in Borg and Groenen (2005, p. 76), of dimension 13 × 13, where n = 13 is the number of facial expressions. The aim of the study is the correct identification of the intended emotional message from a person's facial expression. Furthermore, Table 4.3 (p. 75) in Borg and Groenen provides Schlosberg empirical scale values that classify the facial expressions into three classes: pleasant-unpleasant (PU), attention-rejection (AR) and tension-sleep (TS). Borg and Groenen (2005, subsection 4.3) found that the first two dimensions of ordinal MDS reproduced the three classes quite accurately, the first dimension representing PU and the second dimension representing AR and TS, because the correlations between the first two calculated dimensions and the Schlosberg empirical scale values are quite high for ordinal MDS. Table 1 compares the correlation values obtained by four MDS approaches; the ordinal MDS correlation values are reproduced from Borg and Groenen (2005, p. 77, Table 4.6). The centroid MDS produced results as good as the ordinal MDS.

The French school of data analysis

Benzécri (1973), who was a pure mathematician in geometry in the 1950s, is considered the father of the French school of data analysis; he developed a geometric generalized Euclidean framework for multidimensional data analysis by introducing two metric matrices (square and positive definite) M and N. In this setting, the duality diagram of Theorem 1 is built on the unit spheres S_2^n(N) = {x ∈ R^n : x'Nx = 1} and S_2^m(M) = {x ∈ R^m : x'Mx = 1}; note that, for N = I_n, S_2^n(I_n) = S_2^n. In this particular Euclidean setting, the duality diagram represents the following transition formulas: Au_1 = a_1 and v_1 = Ma_1/√(a_1'Ma_1), and A'v_1 = b_1 and u_1 = Nb_1/√(b_1'Nb_1). The solutions of the last two equations can be reexpressed as a generalized eigenvalue-eigenvector problem: from them one gets MANA'v_1 = λ_1 v_1, and similarly NA'MAu_1 = λ_1 u_1.
The last two eigenequations are expressed in terms of the principal axes u_1 and v_1; however, one can reexpress them in terms of the projected factor scores: ANA'Ma_1 = λ_1 a_1, and similarly A'MANb_1 = λ_1 b_1.
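These transition formulas also translate into an alternating iteration; the sketch below (Python with NumPy; metric_first_factor is our name, and M and N are assumed symmetric positive definite) computes the first factor and checks numerically that the converged a_1 is an eigenvector of ANA'M.

```python
import numpy as np

def metric_first_factor(A, M, N, max_iter=500, tol=1e-14):
    """Alternate the transition formulas with metrics M (rows) and N (columns):
    a = A u,  v = M a/sqrt(a'M a),  b = A'v,  u = N b/sqrt(b'N b)."""
    A, M, N = (np.asarray(Z, dtype=float) for Z in (A, M, N))
    b = A[0, :]
    lam_old = -np.inf
    for _ in range(max_iter):
        u = N @ b / np.sqrt(b @ N @ b)
        a = A @ u
        v = M @ a / np.sqrt(a @ M @ a)
        b = A.T @ v
        lam = np.sqrt(b @ N @ b)
        if lam - lam_old <= tol:
            break
        lam_old = lam
    return a, b, u, v, lam

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
M, N = np.eye(5), np.eye(4)                       # the ordinary Euclidean case
a, b, u, v, lam = metric_first_factor(A, M, N)
w = A @ N @ A.T @ M @ a
print(np.allclose(w, (w @ a / (a @ a)) * a))      # a_1 is an eigenvector of A N A' M
```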

Conclusion
We embedded the ordinary SVD into a larger family based on induced matrix norms, and provided the transition formulas and a simple criss-cross iterative procedure to compute the principal axes and principal factor scores. Given that there is an infinite number of SVD-like decompositions, depending on the underlying induced norm, one is tempted to ask which is the best. It is quite ironic that the centroid decomposition, the oldest method, was recently rediscovered and restudied as a robust method, see Choulakian (2005), Kwak (2008) and McCoy and Tropp (2011), after being dumped almost sixty years ago for the following reason given in Hubert et al. (2000, p. 76): "Comments in Guttman (1944) and elsewhere (e.g., Horst (1965) and Harman (1967)) with regard to this centroid strategy generally considered it a poor approximation to what could be generated from Hotelling's method that would choose successive unit length vectors to produce a rank reduction by identifying (through an iterative strategy) the eigenvector associated with the largest eigenvalue for each of the residual matrices successively obtained. At the time, however, the centroid method was computationally much less demanding than Hotelling's iterative (or power) method for obtaining each of the principal components (again, one-at-a-time and reducing rank at each iteration); for this reason alone, the centroid method was a very common factorization strategy until electronic computing capabilities became more widely available". This comment shows that the centroid method was a victim of the habit of using mathematical methods in statistics based on optimal criteria, as if optimality were a guarantee of efficiency.
The arguments advanced by Benzécri on the advantages of Euclidean geometry over taxicab geometry for multidimensional data analysis are both computational and metaphysical. On the use of the L1 distance in data analysis, Benzécri (1977, page 13) commented as follows: "it (L1) does not allow the use of multidimensional Euclidean geometry; it will give results that qualitatively resemble those obtained with the ... quadratic distance, but at the price of more complicated computations and in a less convenient form. Without letting the mathematical tool disfigure reality, one must concede to it that the transmission to the human mind of a vast set of data, synthesized (summarized; made perceptible by computation), has its own laws. (Recall that the primacy of Euclidean geometry is accepted by Torgerson)". The ease-of-computation argument is very similar to Gauss's argument for adopting the least squares criterion in the linear regression model. The metaphysical argument, if my understanding is correct, is that the transmission to the human mind of a synthesis of a collection of data has its own laws, which are based on Euclidean geometry.
Ten Berge (2005, personal communication) thought that the centroid method produced good results, but that it was not mathematically well understood during the last century. Benzécri (1973b, page 1 of the Avant-propos) considered data analysis an experimental science, and he had a predilection for the case studies in his books. A similar thought is also found in Tukey: "The test of a good procedure is how well it works, not how well it is understood".
As to the question of which decomposition is best: mathematically, the SVD is the best and is the reference, but it is quite sensitive to outlying observations; so we suggest the joint use of the SVD and the taxicab decomposition, or of the SVD and the centroid decomposition.