Parameter estimation of multivariate matrix regression models

In this paper, we consider the problem of estimating the parameter matrix of multivariate matrix regression models. We estimate the parameter matrix B and the covariance matrix Σ by the method of maximum likelihood, using the Kronecker product of matrices, the vectorization of matrices, and matrix derivatives.


Introduction
The linear model (LM), also called the linear regression model, is a basic tool of statistical analysis. Among linear models, multivariate linear models (MLM) are widely used in fields such as agriculture, engineering, pharmaceutical and chemical engineering, aerospace, theoretical research and data analysis (see e.g. [1,4,5,7,8,12,16,20]). The MLM is the case where the number of response factors is greater than one. As in the general LM, the MLM assumes that the response variable is a linear function of some explanatory variables (vectors or matrices). In an LM, the covariance matrix of the response variables and the parameter matrix B (which generally consists of the linear regression coefficients) are unknown; they are estimated by methods such as maximum likelihood (ML) or ordinary least squares (OLS) from the given data on the response and explanatory variables, and prediction follows the parameter estimation. The application of the linear model mainly involves the following two aspects.
• Prediction and minimization of errors. The regression function between the response and the explanatory factors is obtained from observations, and the regression coefficients are obtained by some estimation method.
• Correlation analysis. This may lead to a partitioning or clustering of the observed data, yielding a hierarchical dataset, or to a different model such as the envelope model [10].
The common approaches to estimating the parameters of a linear model are ordinary least squares (OLS) [4,5], maximum likelihood (ML) estimation [6,7,11], and minimization of an error norm, such as least absolute deviation regression [4], penalized least-squares minimization of the cost function [16,20] (ℓ2-norm penalty), the Lasso (ℓ1-norm penalty) [7,8], etc. Note that OLS can also be used to estimate the parameters of a nonlinear regression model [9]. The general MLM can be written as

Y = XB + ϵ,    (1.1)

where the following assumptions are satisfied:
1. The rows of the random matrix Y ∈ R^{n×d}, denoted Y_{i·}, are mutually independent.
2. The design matrix X ∈ R^{n×p} is fixed and known.
3. The parameter matrix B ∈ R^{p×d} is unknown and is to be estimated.
4. The response matrix Y ∈ R^{n×d} has a covariance matrix which is fixed and unknown.
5. The mean of the random error ϵ is 0.
6. The covariance matrix of the (column-vectorized) random error ϵ is Σ ⊗ I_n.
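As a numerical illustration of the model and assumptions above, the following NumPy sketch simulates one draw from (1.1); all concrete sizes, values of B and Σ, and the random seed are illustrative choices, not part of the text.

```python
import numpy as np

# Simulate Y = X B + E with Cov of the column-stacked error equal to Σ ⊗ I_n.
rng = np.random.default_rng(0)
n, p, d = 50, 3, 2

X = rng.normal(size=(n, p))          # fixed, known design matrix
B = rng.normal(size=(p, d))          # unknown parameter matrix (here: a simulated truth)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])       # covariance between the d responses

# draw the column-stacked error from N(0, Σ ⊗ I_n); column stacking
# corresponds to reshape with order='F'
vecE = rng.multivariate_normal(np.zeros(n * d), np.kron(Sigma, np.eye(n)))
E = vecE.reshape((n, d), order='F')

Y = X @ B + E                        # response matrix; rows Y_i. are independent
```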
The multivariate linear model under the above assumptions is called the Gauss-Markov model. Here A ⊗ B, called the direct product (Kronecker product) of the matrices A = (a_{ij}) ∈ R^{m×n} and B ∈ R^{p×q}, is defined as the block matrix

A ⊗ B := (a_{ij}B) ∈ R^{mp×nq},

whose (i, j)th block is a_{ij}B. Now we present some basic propositions related to the Kronecker product. The reader is referred to the first chapter of [13] for more detail on the Kronecker product.
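The block structure of the Kronecker product, and the standard mixed-product rule (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), can be checked numerically; the matrices below are arbitrary small examples of my own choosing.

```python
import numpy as np

# A ⊗ B replaces each entry a_ij of A by the block a_ij * B.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

K = np.kron(A, B)                    # 4 x 4 block matrix
assert np.array_equal(K[:2, :2], 1 * B)   # (1,1) block is a_11 * B
assert np.array_equal(K[:2, 2:], 2 * B)   # (1,2) block is a_12 * B

# mixed-product rule (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
C = np.array([[2, 0], [1, 1]])
D = np.array([[1, 1], [0, 2]])
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
```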

Proposition 1.2 Let A, B, C, D be matrices such that the operations below are defined, and let c be a scalar. Then
(1) (A ⊗ B)' = A' ⊗ B';
(2) (cA) ⊗ B = A ⊗ (cB) = c(A ⊗ B);
(3) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD);
(4) (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1} when A and B are both nonsingular.
Vectorization is the operation that turns any matrix A ∈ R^{m×n} into a column vector vec(A) ∈ R^{mn} by vertically stacking, in order, all the columns of A. Thus vec can be regarded as a one-to-one correspondence from the matrix space R^{m×n} to R^{mn}. For the vectorization we have the identity

vec(AXB) = (B' ⊗ A) vec(X).

Matrix vectorization plays an important role in solving the multivariate matrix linear regression model, and we will use it below to work out the parameter estimation in the model. Next we introduce the derivative of a matrix function. The matrix derivative is one of the key notions of matrix theory for multivariate analysis, appearing in extremum problems, maximum likelihood estimation, asymptotic expressions of the parameters of multivariate limit distributions, etc. The real starting point for the matrix derivative is Dwyer and MacPhail [9]; it was further developed by Bargmann [2] and MacRae [14]. A useful tool employed in matrix derivatives is the vectorization of matrices; see Neudecker [17], Tracy and Dwyer [18], McDonald and Swaminathan [15], and Bentler and Lee [3]. The notion of a matrix derivative is a realization of the Fréchet derivative known from functional analysis.
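The identity vec(AXB) = (B' ⊗ A) vec(X) can be verified directly; note that column stacking corresponds to NumPy's `order='F'` reshape (the default `order='C'` stacks rows instead). The sizes below are arbitrary.

```python
import numpy as np

# vec stacks the columns of a matrix: reshape(-1, order='F') in NumPy.
def vec(M):
    return M.reshape(-1, order='F')

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))

# the identity vec(A X B) = (B' ⊗ A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```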
We first recall the definition of the derivative of a vector y = (y_1, ..., y_n)' with respect to another vector x = (x_1, ..., x_m)':

dy/dx = (J_{ij}) ∈ R^{m×n}, where J_{ij} = ∂y_j/∂x_i for i = 1, ..., m; j = 1, ..., n.

Now consider a matrix Y = (y_{ij}) ∈ R^{m×n} each of whose entries y_{ij} is a differentiable function of X = (x_{st}) ∈ R^{p×q}, i.e., y_{ij} can be regarded as a function of the pq arguments x_{st}. Then we define

dY/dX := d vec'(Y) / d vec(X) ∈ R^{pq×mn}.

The second-order derivative d²Y/dX² is defined by

d²Y/dX² := d/dX (dY/dX),

so that d²Y/dX² ∈ R^{pq×mnpq}. Derivatives of any order can be defined by induction on the order. Indeed, we have already defined the first- and second-order derivatives of Y w.r.t. X; supposing the (k−1)th order derivative d^{k−1}Y/dX^{k−1} has been defined, the kth order derivative of Y w.r.t. X is defined by

d^k Y/dX^k := d/dX ( d^{k−1}Y/dX^{k−1} ).

We have

Proposition 1.5 Let X = (x_{ij}) ∈ R^{m×n}, where the elements of X are all independent variables, let A be a constant matrix of proper size, and let c be a constant. Then
(1) dX/dX = I_{mn}, where I_k denotes the k × k identity matrix;
(2) d(cX)/dX = c I_{mn};
(3) dA/dX = 0.
The following results concern the derivatives of the inverse, the determinant and the trace of a random matrix.
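A finite-difference check makes the convention concrete. Under the vector-derivative convention recalled above (entry (i, j) of dy/dx is ∂y_j/∂x_i), the linear map Y = AX satisfies vec(Y) = (I_q ⊗ A) vec(X), so the derivative of vec(Y) w.r.t. vec(X) is I_q ⊗ A'. The sketch below (sizes and seed are my own choices) approximates this derivative numerically.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order='F')   # stack columns

rng = np.random.default_rng(2)
m, p, q = 2, 3, 4
A = rng.normal(size=(m, p))
X = rng.normal(size=(p, q))

h = 1e-6
J = np.zeros((p * q, m * q))          # J[i, j] = ∂ vec(AX)_j / ∂ vec(X)_i
for i in range(p * q):
    dX = np.zeros(p * q)
    dX[i] = h
    Xp = X + dX.reshape((p, q), order='F')
    J[i, :] = (vec(A @ Xp) - vec(A @ X)) / h

# for Y = A X the derivative in this convention is I_q ⊗ A'
assert np.allclose(J, np.kron(np.eye(q), A.T), atol=1e-4)
```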
Maximum likelihood estimation of the parameter matrix

In this section, we use the maximum likelihood (ML) function, combined with the results obtained in the last section, to estimate B in (1.1). Let the random matrix Y satisfying model (1.1) obey the normal distribution with parameter matrix B and covariance matrix Σ. Then the corresponding density (likelihood) function is

L(B, Σ, Y) = (2π)^{−nd/2} |Σ|^{−n/2} exp{ −(1/2) tr[ Σ^{−1} (Y − XB)'(Y − XB) ] }.
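The log of this likelihood can be written down directly as a function; the sketch below does so in NumPy (the helper name `log_likelihood` and all simulated sizes are my own illustrative choices).

```python
import numpy as np

# ln L = -(nd/2) ln(2π) - (n/2) ln|Σ| - (1/2) tr[Σ^{-1} (Y - XB)'(Y - XB)]
def log_likelihood(B, Sigma, Y, X):
    n, d = Y.shape
    R = Y - X @ B                            # residual matrix
    _, logdet = np.linalg.slogdet(Sigma)     # ln|Σ|, numerically stable
    quad = np.trace(np.linalg.solve(Sigma, R.T @ R))
    return -0.5 * n * d * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

# quick sanity use: with Σ = I_d the quadratic term is just ||Y - XB||_F^2
rng = np.random.default_rng(4)
n, p, d = 30, 2, 2
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, d))
Y = X @ B + rng.normal(size=(n, d))
ll = log_likelihood(B, np.eye(d), Y, X)
```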

Theorem Assume model (1.1) with r := rank(X) = p ≤ n. Then the maximum likelihood estimator of B is

B̂ = (X'X)^{−1} X'Y.    (2.9)

Proof
We regard L(B, Σ, Y) as a function of B. To obtain the maximum likelihood estimate of B, we compute the derivative of the logarithm of L w.r.t. B. Since

ln L = −(nd/2) ln(2π) − (n/2) ln|Σ| − (1/2) tr[Σ^{−1}(Y − XB)'(Y − XB)],

we have

d ln L/dB = X'(Y − XB)Σ^{−1}.

Setting this derivative equal to zero yields the normal equations X'XB = X'Y. Consequently, we get (2.9) under the hypothesis r := rank(X) = p ≤ n, since then X'X is invertible.
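The closed-form estimator can be checked on simulated data; the sketch below also forms the ML covariance estimate Σ̂ = (Y − XB̂)'(Y − XB̂)/n (the sizes, the true B, and the noise level are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 200, 3, 2
X = rng.normal(size=(n, p))
B_true = np.array([[1.0, 0.0],
                   [0.0, 2.0],
                   [0.5, -1.0]])
Y = X @ B_true + 0.1 * rng.normal(size=(n, d))

# B_hat = (X'X)^{-1} X'Y, computed without forming the inverse explicitly
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Sigma_hat = (Y - X @ B_hat).T @ (Y - X @ B_hat) / n

# the ML estimator coincides with the least-squares solution when rank(X) = p
B_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(B_hat, B_ls)
```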
In order to estimate the parameter matrix B in (1.1) when X is not of full rank, i.e., rank(X) < p, we need to introduce a class of generalized inverses: the g-inverse, also called a {1}-inverse. There is a large literature on generalized inverses of matrices; we refer the reader to the first chapter of [13].
Given a matrix A ∈ C^{m×n}, a g-inverse of A, denoted A^−, is an n × m matrix satisfying the condition

A A^− A = A.

Note that the g-inverse of a matrix is usually non-unique. Another useful fact, which we will use in the proof of the next result, is that when A is a square nonsingular matrix, the g-inverse is exactly the inverse matrix A^{−1} (and is therefore unique). The following proposition gives an equivalent characterization of the g-inverses of a given matrix A in terms of solutions of linear systems.
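The defining condition A A^− A = A can be verified numerically using the Moore-Penrose pseudoinverse, which is one particular g-inverse; the example matrices are my own choices.

```python
import numpy as np

# A rank-deficient matrix has no ordinary inverse, but pinv(A) is a g-inverse,
# so it must satisfy A A^- A = A.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])          # rank 1
A_g = np.linalg.pinv(A)
assert np.allclose(A @ A_g @ A, A)

# for a nonsingular square matrix the g-inverse is unique and equals A^{-1}
S = np.array([[2.0, 1.0],
              [0.0, 1.0]])
assert np.allclose(np.linalg.pinv(S), np.linalg.inv(S))
```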

Proposition 2.2 Let A ∈ C^{m×n} and B ∈ C^{n×m}. Then B is a g-inverse of A if and only if, for every vector b ∈ Col(A), x = Bb is a solution to the linear system Ax = b, where Col(A) := {y = Ax : x ∈ C^n} denotes the range (column) space of A.