Polynomial Chaos based on the parallelized ensemble Kalman filter to estimate precipitation states

This article develops a methodology that combines numerical analysis and stochastic differential equations with computational algorithms to treat problems with complex nonlinear dynamics in high dimensions. A method to estimate the parameters and states of a dynamic system is proposed, inspired by the parallelized ensemble Kalman filter (PEnKF) and the Wiener-Askey polynomial chaos theory. The main advantage of the proposal is that it provides a precise, efficient algorithm with low computational cost. For the analysed data, the methods provide good spatial and temporal predictions of the unknown precipitation states for the first 24 hours. Two goodness of fit measures provide confidence in the quality of the model predictions. The performance of the parallel algorithm, measured by the acceleration and efficiency factors, shows a 7% increase in speed with respect to the sequential version and is most efficient for P = 2 threads.


Introduction
Stochastic differential equations are considered an effective tool to model physical processes in distinct fields of science. The equations are generally expressed in terms of a continuous-time process with random coefficients, random initial conditions or random base functions, and the solutions are expressed using random fields. For example, consider the stochastic differential equation given in [33] of the form

dx_t = f(x_t, t; θ) dt + L(x_t, t; θ) dW_t,    (1)

where x_t is a signal in continuous time, t is a time variable, θ is a parameter vector, f(.) is a nonlinear drift or transfer function, L(.) is a dispersion matrix, and W_t is an independent vector whose components represent a Wiener process. Given that equation (1) is defined in continuous time, the behaviour of the system is observed imperfectly by means of an experimental process y_t which is related to the process x_t through the equation

y_t = H x_t + v_t,    (2)

where y_t represents a vector of observations sampled at discrete times t_1, ..., t_k, H is a matrix defined in terms of the parameters and states, and v_t is a random sampling error with Gaussian distribution, mean 0 and covariance matrix R_t(θ). Instead of finding point estimators of the parameters θ = (θ_1, ..., θ_p) and the states x_t, we are interested in the filtered posterior distribution of the states given the observations.

The main contribution of this study consists in estimating and predicting the rainfall states p(x_t | y_{1:t}) at ten meteorological stations in Venezuela, using the stochastic differential equation proposed by [1] together with the polynomial chaos and parallelized ensemble Kalman filter (PEnKF) methodologies. To achieve this objective it is necessary to solve a stochastic differential equation in terms of the Fokker-Planck equation using a numerical method of finite differences ([7]). The solution is expressed using a polynomial chaos expansion. A data assimilation technique is then used to generate large samples of solution ensembles, which are used to filter the sampling error in the prediction model given by equation (2).
Two goodness of fit measures are calculated to validate the quality of the model predictions. Finally the acceleration and efficiency factors are calculated to measure the performance of the parallelized algorithm.
In an article related to this proposal, [47], Wuan used numerical methods based on Wiener's chaos expansion to solve stochastic partial differential equations, in which the stochastic solution is represented by a spectral expansion with respect to a set of random base functions. It was shown that the methods were more efficient and exact than Monte Carlo methods for short and medium times. In another work, [31], Li and Xiu implemented the EnKF using the generalized polynomial chaos expansion to approximate the solution of some known stochastic differential equations, obtaining as a result an algorithm with good estimates and low computational cost. Madankan, [32], proposed two recursive algorithms to estimate higher-order posterior moments for the parameters and states of a stochastic differential equation using the idea of a generalized polynomial chaos. He shows that the advantage of these methods is that it is possible to obtain confidence intervals as well as point estimates for the parameters and states.
The organization of this article is as follows. Section 2 briefly describes the polynomial chaos methodology. Section 3 develops the technical aspects of the Fokker-Planck stochastic differential equation and that of Allen. Section 4 describes the EnKF and PEnKF algorithms. Section 5 defines the goodness of fit measures. Section 6 shows the results and Section 7 summarizes the conclusions and final discussions.

Polynomial Chaos
The fundamental idea of the polynomial chaos is that the random processes of interest can be approximated by sums of orthogonal chaos polynomials of independent random variables. In this context, the uncertainty about the parameters and states can be considered as a second order random process X(ω) with finite variance that can be represented in terms of the polynomial chaos. In the original discrete version of the Wiener polynomial chaos expansion, this random process is defined by

X(ω) = Σ_{n=0}^{∞} Σ_{i_1,...,i_n} a_{i_1...i_n} H_n(ξ_{i_1}(ω), ..., ξ_{i_n}(ω)),    (3)

where ω is a random event and H_n(ξ_{i_1}(ω), ..., ξ_{i_n}(ω)) is a Hermite polynomial of order n, defined in terms of a multidimensional random variable ξ(ω) = (ξ_{i_1}(ω), ..., ξ_{i_n}(ω)). The Hermite polynomials are given in terms of normal random variables with mean zero and variance one. The general expression of the Hermite polynomials is

H_n(ξ_{i_1}, ..., ξ_{i_n}) = e^{(1/2) ξᵀξ} (−1)^n ∂^n / (∂ξ_{i_1} ⋯ ∂ξ_{i_n}) e^{−(1/2) ξᵀξ}.    (4)

For notational convenience, the equation given in (3) can be rewritten in the following way. Let (Ω, F, P) be a complete probability space, where Ω is a sample space, F is the σ-algebra of subsets of Ω, and P is a probability measure. Then define a second order random process X(ω) ∈ L²(Ω, F, P), the space of square integrable functions, in terms of a polynomial chaos as

X(ω) = Σ_{i=0}^{∞} ã_i ϕ_i(ξ(ω)),    (5)

where the ã_i are the deterministic coefficients of the polynomial chaos and ϕ_i(ξ(ω)) is a multidimensional Hermite polynomial. The family {ϕ_i} is an orthogonal basis in L²(Ω, F, P), and there exists a one to one correspondence between the Hermite polynomials {H_n(ξ)} and the base polynomials {ϕ_i(ξ)}. In real situations the series can be truncated to a finite number of terms, so that the general truncated expression of the Hermite chaos is

X(ω) ≈ Σ_{i=0}^{n_p} ã_i ϕ_i(ξ(ω)),    (6)

with the number of terms determined by

n_p + 1 = (n + p)! / (n! p!),    (7)

where p is the order of the expansion and n is the dimension of the random vector ξ(ω), so that n_p grows rapidly with the dimension and the order of the decomposition and a good approximation is obtained.
The orthogonality relation is expressed in terms of an inner product defined in a Hilbert space,

⟨f(ξ), g(ξ)⟩ = ∫ f(ξ) g(ξ) w(ξ) dξ,

where w(ξ) is the kernel of a standard Gaussian distribution, so that

⟨ϕ_i, ϕ_j⟩ = ⟨ϕ_i²⟩ δ_ij,

where δ_ij is the Kronecker delta.
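To make these definitions concrete, the following sketch (an illustration in Python, not code from the article, whose implementation is in ANSI C) computes the number of terms in the truncated expansion (7) and evaluates one-dimensional probabilists' Hermite polynomials through their three-term recurrence:

```python
from math import comb

def n_terms(n, p):
    """Total number of terms, n_p + 1 = (n + p)!/(n! p!), in a chaos
    expansion of dimension n and order p, as in equation (7)."""
    return comb(n + p, p)

def hermite_prob(k, x):
    """Probabilists' Hermite polynomial He_k(x), orthogonal under the
    standard Gaussian kernel, via He_{k+1} = x He_k - k He_{k-1}."""
    if k == 0:
        return 1.0
    h_prev, h = 1.0, x
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h
```

For example, a two-dimensional ξ expanded to order 3 needs 10 coefficients, and He_3(x) = x³ − 3x, illustrating how quickly the basis grows with dimension and order.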

Polynomial Chaos of Wiener-Askey
The Hermite chaos expansion works well when solving stochastic differential equations driven by Gaussian random variables, but in general, when the random variables do not have Gaussian distributions, the exponential convergence rate is not achieved. Other general chaos polynomials exist and are recommended: the Laguerre chaos, associated with Gamma random variables; the Jacobi chaos, associated with the Beta distribution; the Charlier chaos, associated with the Poisson distribution; the Meixner chaos, associated with the negative Binomial distribution; the Krawtchouk chaos, associated with the Binomial distribution; and the Hahn chaos, associated with the Hypergeometric distribution (see [48] for more details). The polynomial chaos expansion used for these random processes is that of Wiener-Askey, defined as

X(ω) = Σ_{n=0}^{∞} Σ_{i_1,...,i_n} a_{i_1...i_n} I_n(ζ_{i_1}(ω), ..., ζ_{i_n}(ω)),    (10)

where I_n(ζ(ω)) denotes the Wiener-Askey polynomial chaos of order n, and ζ(ω) = (ζ_{i_1}(ω), ..., ζ_{i_n}(ω)) is a random vector. Again for notational convenience, the equation given in (10) can be rewritten as

X(ω) = Σ_{j=0}^{∞} a_j ϕ_j(ζ(ω)),    (11)

where a one to one correspondence exists between the functions I_n(ζ(ω)) and the polynomials ϕ_j(ζ). The Wiener-Askey chaos polynomials form a complete basis in the Hilbert space, which guarantees that each chaos expansion converges to some functional in the L²(Ω, F, P) sense. The orthogonality relation of the Wiener-Askey polynomial chaos is given in terms of the inner product

⟨ϕ_i(ζ), ϕ_j(ζ)⟩ = ∫ ϕ_i(ζ) ϕ_j(ζ) ρ(ζ) dζ = ⟨ϕ_i²⟩ δ_ij,    (12)

where ρ(ζ(ω)) is the probability distribution of the random variables ζ(ω). A general procedure is now described for applying the Wiener-Askey polynomial chaos to solve a stochastic differential equation. Consider the stochastic differential equation

ℓ(x_t, t, ζ(ω); u) = f(x_t, t; ζ(ω)),    (13)

where u = u(x_t, t; ζ(ω)) is the solution of the equation and f(x_t, t; ζ(ω)) is a source term. The operator ℓ generally involves differentiations in space and time and is possibly nonlinear.
Appropriate initial and boundary conditions should be assumed, although the presence of the random vector ζ(ω) permits the introduction of uncertainty in the system from different sources of variation, including the initial and boundary conditions. The solution, considered to be a random process, can be expressed by the Wiener-Askey polynomial chaos as

u(x_t, t; ζ(ω)) = Σ_{i=0}^{n_p} u_i(x_t, t) ϕ_i(ζ(ω)),    (14)

where u_i(x_t, t) is a coefficient to be determined and ϕ_i(ζ) is a Wiener-Askey polynomial. In particular, Laguerre polynomials are used in this work, given that the data used to illustrate the application are precipitation observations measured at meteorological stations in Venezuela. Finally, substituting (14) into (13), we obtain

ℓ(x_t, t, ζ(ω); Σ_{i=0}^{n_p} u_i(x_t, t) ϕ_i(ζ(ω))) = f(x_t, t; ζ(ω)).    (15)

The representation given in (14) can be considered as a spectral expansion between the coefficients u_i(x_t, t) and the vector ζ(ω) evaluated in the orthogonal polynomials ϕ_i. The total number of terms in the expansion is determined by the equation defined in (7).
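Since the precipitation application uses the Laguerre chaos, a small sketch (illustrative Python, independent of the article's implementation) shows how the basis polynomials can be evaluated through their three-term recurrence:

```python
def laguerre(k, x):
    """Laguerre polynomial L_k(x), orthogonal on [0, inf) under the
    weight e^{-x}, via (j+1) L_{j+1} = (2j+1-x) L_j - j L_{j-1}."""
    if k == 0:
        return 1.0
    l_prev, l = 1.0, 1.0 - x
    for j in range(1, k):
        l_prev, l = l, ((2 * j + 1 - x) * l - j * l_prev) / (j + 1)
    return l
```

The first few members are L_0 = 1, L_1(x) = 1 − x, and L_2(x) = (x² − 4x + 2)/2.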

The Stochastic Differential Equation of Fokker Planck
When a dynamic system is governed by a stochastic differential equation, the evolution in time of the probability density of the state x_t satisfies the Fokker-Planck equation. This equation, when discretized on a grid, permits the modeling of physical processes that have nonlinear structures, non-Gaussian distributions and multi-modal distributions ([27]). In order to numerically model the evolution of the states in time, the density can be truncated, which permits the use of finite degrees of freedom. One way to do this is to project the density onto a mixture of base functions that evolve with time, such as convolution or wave methods or an integro-difference equation model ([19], [45]). To illustrate the procedure, suppose that it is required to estimate the states of the system given by the simplest stochastic differential equation of an Ito process,

dx_t = f(x_t, t) dt + L(x_t, t) dW_t,    (16)

where x_t represents the state vector of the system, f(.) is a possibly nonlinear vectorial function, L(.) is a matrix of real values, and dW_t is a vector of a Brownian motion process with E(dW_t dW_tᵀ) = Q_t dt. The function f(.) represents the deterministic part of the system while L(.) represents the random part of the dynamics. Given that the model defined in (16) is a continuous process, it can be assumed that the data are obtained through an observation process as described in (2).
On the other hand, by the Fokker-Planck theorem, if p(x_t, t) is the probability density of a solution of the equation given in (16), then it satisfies

∂p/∂t = −Σ_i ∂/∂x_i [f_i(x_t, t) p] + (1/2) Σ_{i,j} ∂²/(∂x_i ∂x_j) [(L Q_t Lᵀ)_{ij} p].    (17)

The uncertainty associated with the states x_{0:t} is characterized by the marginal probability density function p(x_t, t).
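The connection between (16) and (17) can be checked empirically: for constant coefficients the Fokker-Planck density is Gaussian with mean µt and variance σ²t. The following Euler-Maruyama sketch (an illustration only; the article solves the density equation, not sample paths) reproduces these moments:

```python
import random
from math import sqrt

def euler_maruyama(f, L, x0, T=1.0, n_steps=100, n_paths=2000, seed=0):
    """Simulate dx_t = f(x_t, t) dt + L dW_t (a scalar instance of (16))
    with the Euler-Maruyama scheme; returns the terminal values x_T."""
    rng = random.Random(seed)
    dt = T / n_steps
    terminal = []
    for _ in range(n_paths):
        x = x0
        for k in range(n_steps):
            x += f(x, k * dt) * dt + L * sqrt(dt) * rng.gauss(0.0, 1.0)
        terminal.append(x)
    return terminal
```

With f ≡ µ = 1 and L = σ = 1, the sample mean and variance of x_T at T = 1 are close to 1, matching the Gaussian Fokker-Planck solution.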
In this study, estimation of the states is performed with filtering techniques to determine the filtered posterior distribution p(x_t | y_{1:t}). To implement a recursive filter, the Chapman-Kolmogorov prediction step is replaced by the Fokker-Planck equation (17), and the update equation is obtained by Bayes' theorem,

p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t−1}) / p(y_t | y_{1:t−1}).

In the following section a precipitation model is defined and described in terms of the Fokker-Planck stochastic differential equation. The Wiener-Askey polynomial chaos is then used to expand the solution, and finally a finite difference method is used to solve the resulting system.

Stochastic Differential Equation for Precipitation
Consider the simple stochastic model for rainfall described in [1], where x_t denotes the total rainfall at time t at a locality s = (x, y) ∈ R². Suppose that during a very short time interval ∆t the change in total rainfall takes the values (∆x)_1 = γ and (∆x)_2 = 0 with probabilities p_1 = λ∆t and p_2 = 1 − λ∆t, respectively. The expected value and the mean squared change in the total precipitation are then

E(∆x) = γλ∆t,   E((∆x)²) = γ²λ∆t.

Letting µ = γλ and σ = sqrt(γ²λ), a stochastic differential equation to model the total precipitation x_t is obtained:

dx_t = µ dt + σ dW_t,    (20)

where dW_t is a vector of a Brownian process with mean 0 and covariance matrix Q_t. Given that the model defined in (20) is a continuous process, the data are assumed to be obtained according to the discrete equation given in (2). The samples taken at time t_k are generated and denoted by y_{1:k}. Given that the probability density function represents all the statistical information of the states x_{0:t_k} conditioned on the observations y_{1:k}, the prior density for the system exists and is continuously differentiable with respect to the time t_k and the state vector x_{0:t_k}. The evolution in time of the marginal density of the state is given by the posterior marginal distribution p(x_{t_k} | y_{1:k−1}), which satisfies the Fokker-Planck equation

∂p/∂t = L p,    (21)

where L is a diffusion operator defined by

L p = −µ ∂p/∂x + (σ²/2) ∂²p/∂x²,    (22)

with p = p(x_{t_k} | y_{1:k−1}) and initial condition p(x_{t_k−1} | y_{1:k−1}). Applying the Fokker-Planck equation to the equation (20), we obtain

∂p(x_t, t)/∂t = −µ ∂p(x_t, t)/∂x + (σ²/2) ∂²p(x_t, t)/∂x².    (23)

The next step is to find a solution to the equation given in (23). To do this, a truncated expansion of the polynomial chaos is used as in the equation (14).
Specifically,

p(x_t, t, ξ) = Σ_{i=0}^{n_p} p_i(x_t, t) ϕ_i(ξ),    (24)

where p(x_t, t, ξ) is a solution of the equation given in (23), the p_i(x_t, t) are deterministic coefficients, ξ is a vector of orthogonal random variables, and ϕ_i(ξ) is a Laguerre chaos polynomial. Substituting the expansion (24) into (23) and applying the orthogonality relations yields a system of partial differential equations for the coefficients p_i(x_t, t). There are many ways to solve this system. This article proposes the finite difference method developed by [7] to solve the partial differential equations numerically. With i = {0, ..., n_p} indexing the coefficients and m the time step, the discretized equations form the system

A p^{m+1} = B p^m,    (28)

where the elements of the matrices A and B are obtained from the finite difference discretization. The matrices A and B are tridiagonal due to the structure of the base polynomials, and A⁻¹ exists for some set of n_p + 1 distinct points. The system p^{m+1} = A⁻¹ B p^m is then solved after generating an initial state p^{m_0} ∼ Unif(0, 1). The solution is expressed in terms of a spectral expansion in the uncertainty variables using Laguerre polynomials, given that the application data are precipitations, with ξ ∼ Gamma(a, b) and p_i(x_t, t) = p^{m+1}. Once the system of stochastic differential equations is solved, the ensembles of the solution states are generated using a data assimilation technique. The general ensemble Kalman filter procedure will now be defined and expressed in terms of the polynomial chaos to serve as the basis for the PEnKF algorithm.
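The article does not reproduce the stencil of [7], so as an assumed illustration the sketch below uses a Crank-Nicolson discretization of (23), which yields exactly the tridiagonal structure A p^{m+1} = B p^m described above; the resulting system is solved with the Thomas algorithm:

```python
from math import exp

def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal linear system by the Thomas algorithm."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def gaussian_profile(nx, dx, center, width):
    """A smooth initial density on the grid, away from the boundaries."""
    return [exp(-0.5 * ((i * dx - center) / width) ** 2) for i in range(nx)]

def crank_nicolson_step(p, mu, sigma, dx, dt):
    """One step of A p^{m+1} = B p^m for
    dp/dt = -mu dp/dx + (sigma^2/2) d2p/dx2, zero Dirichlet boundaries."""
    n = len(p)
    alpha = sigma ** 2 * dt / (4 * dx ** 2)  # diffusion weight, D dt / (2 dx^2)
    beta = mu * dt / (4 * dx)                # advection weight
    sub = [-(alpha + beta)] * n              # sub-diagonal of A
    diag = [1 + 2 * alpha] * n               # main diagonal of A
    sup = [-(alpha - beta)] * n              # super-diagonal of A
    rhs = [0.0] * n                          # right-hand side B p^m
    for i in range(n):
        lo = p[i - 1] if i > 0 else 0.0
        hi = p[i + 1] if i < n - 1 else 0.0
        rhs[i] = (alpha + beta) * lo + (1 - 2 * alpha) * p[i] + (alpha - beta) * hi
    return thomas(sub, diag, sup, rhs)
```

Both A and B are tridiagonal, as stated above, and the scheme conserves probability mass up to boundary leakage while transporting the mean of the density at rate µ.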

Ensemble Kalman filter (EnKF)
The EnKF is a sequential method that originated in the data assimilation techniques used in many areas of applied science, such as ocean modelling, weather prediction in meteorology and the development of models in Geoscience. It has been found especially useful in situations involving complex dynamic systems in high dimensional spaces. To fix notation, suppose that the states x_{0:t} at a time t are obtained from the previous states x_{0:t−1} generated at time t − 1. This step is called the prior knowledge in the assimilation cycle t and is denoted by x^f_t. Given the experimental observations y_{1:t}, the Bayes linear adjustment is used to update the states. This process, called the analysis, is denoted by x^a_t. The EnKF developed by [10], [11], [12], [13] is an algorithm based on sequential Monte Carlo methods used to approximate the posterior distribution of the states of high dimensional nonlinear systems. The algorithm starts with an approximation by an ensemble of random state solutions. Let {x^f_{t,i}}_{i=1}^N be an ensemble of state forecasts x^f_t, where each member of the ensemble is indexed by i = 1, ..., N and is obtained from a solution of equation (20). The analysis step of the EnKF is given by

x^a_{t,i} = x^f_{t,i} + K_t (y_t + v_{t,i} − H x^f_{t,i}),   K_t = P^f_t Hᵀ (H P^f_t Hᵀ + R_t)⁻¹,

where P^f_t and P^a_t are the approximated covariances of the prediction and analysis respectively, and R_t = E(v vᵀ) is the covariance of the observation error.

Ensemble Kalman Filter based on the Polynomial Chaos
In this section the EnKF based on the polynomial chaos methodology is introduced. Consider the equations defined in (2) and (20) and let p(x_0, 0, ξ), called an initial condition, be a solution of the equation (20), where ξ is a set of random variables parameterizing the initial condition with density function ρ(ξ) (see [31] for more details). The prediction states are stochastic variables and can be parameterized by the same set of random variables. The polynomial chaos expansion of order n_p of the solution of the equation given in (20) can be expressed as

p_{n_p}(x_t, t, ξ) = Σ_{i=0}^{n_p} p̂_i(x_t, t) ϕ_i(ξ),

where the ϕ_i(ξ) are orthogonal polynomial base functions constructed as products of single variable polynomials ϕ_{i_k}(ξ_k). The ϕ_{i_k}(ξ_k) are orthogonal polynomials of order i_k in the dimension ξ_k which satisfy

⟨ϕ_i, ϕ_j⟩ = δ_ij,

where δ_ij is the Kronecker delta and the polynomials ϕ_i are normalized.
The expansion coefficients p̂_i(x_t, t) are approximated by the orthogonal projection

p̂_i(x_t, t) = ⟨p(x_t, t, ξ), ϕ_i(ξ)⟩.

Classical approximation theory guarantees that this is the best approximation, in the mean square sense, in the linear space of polynomials of degree n_p.
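This projection can be sketched numerically. In practice Gauss quadrature would be used; for illustration, a trapezoidal rule under the Laguerre weight e^{-ξ} (an assumption consistent with the Laguerre basis used in this article) recovers the coefficients of a simple test function:

```python
from math import exp

def laguerre(k, x):
    """Laguerre polynomial L_k(x) via its three-term recurrence."""
    if k == 0:
        return 1.0
    l_prev, l = 1.0, 1.0 - x
    for j in range(1, k):
        l_prev, l = l, ((2 * j + 1 - x) * l - j * l_prev) / (j + 1)
    return l

def project(f, k, upper=60.0, n=6000):
    """Approximate <f, L_k> = int_0^inf f(x) L_k(x) e^{-x} dx by the
    trapezoidal rule; the L_k are orthonormal under this weight."""
    h = upper / n
    total = 0.0
    for i in range(n + 1):
        x = i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * f(x) * laguerre(k, x) * exp(-x)
    return total * h
```

For f(ξ) = ξ the exact expansion is ξ = L_0(ξ) − L_1(ξ), so the projections return approximately 1, −1, and 0 for all higher modes.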
The EnKF algorithm can be summarized as follows. To fix notation, let x̄^b_t and P̂^b_t be the mean and covariance matrix of the filtered posterior distribution at time t − 1, while x̄^a_t and P̂^a_t are the updated mean and covariance matrix of the filtered posterior distribution at time t. Then perform the following steps.

Step 1. Propagation.
Begin by generating N initial states that are solutions of the equation (20) obtained with the finite difference method, say x^f_{0,i} = p(x_0, 0, ξ), i = 1, ..., N. The mean of the ensemble is given by x̄^f_0 = (1/N) Σ_{i=1}^N x^f_{0,i}.

Step 2. Predicting the ensemble.
The N members of the ensemble forecast at time t + 1 are generated by a solution of the equation (20), using the finite difference method, say x^f_{t+1,i} = p(x_{t+1}, t + 1, ξ). The forecast error is estimated by x̃^f_{t+1,i} = x^f_{t+1,i} − x̄^f_{t+1}, and the unbiased estimator of the sample covariance is

P̂^b_{t+1} = 1/(N − 1) Σ_{i=1}^N x̃^f_{t+1,i} (x̃^f_{t+1,i})ᵀ.

In practice it is not common to estimate P̂^b_{t+1} directly. Instead one calculates P̂^cr_{t+1} and P̂^pr_{t+1}, where P̂^cr_{t+1} is the sample cross covariance between the previous ensemble and its predicted projection in the observation space, while P̂^pr_{t+1} is the sample covariance of the predicted projection of the previous ensemble within the observation space.
If no data is available at time t, the update step is x^a_{t,i} = x^f_{t,i}. If observations are available at time t, then the update is done using perturbed observations as follows. First generate observations using the observation equation, y_{t,i} = y_t + v_{t,i}. This supplies a sample of the joint distribution P(x_t, y_t | y_{1:t−1}). The update is performed using the Bayes linear adjustment,

x^a_{t,i} = x^f_{t,i} + K̂_t (y_{t,i} − H x^f_{t,i}),

where K̂_t = P̂^cr_t (P̂^pr_t + R_t)⁻¹ is the Kalman gain matrix.
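The two-step cycle above can be sketched for a scalar state with H = 1 (a simplified illustration of the perturbed-observation update, not the article's C implementation):

```python
import random

def enkf_update(ensemble, y, r, seed=0):
    """Perturbed-observation EnKF analysis for a scalar state, H = 1:
    x_a = x_f + K (y + v - x_f) with K = P_f / (P_f + R)."""
    rng = random.Random(seed)
    n = len(ensemble)
    mean = sum(ensemble) / n
    p_f = sum((x - mean) ** 2 for x in ensemble) / (n - 1)  # forecast variance
    k = p_f / (p_f + r)                                     # Kalman gain
    return [x + k * (y + rng.gauss(0.0, r ** 0.5) - x) for x in ensemble]
```

With a forecast ensemble of variance P_f ≈ 1 and R = 1, the gain is K ≈ 1/2, so the analysis mean moves halfway toward the observation and the analysis variance is roughly halved.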

Parallelized Ensemble Kalman Filter (PEnKF)
In contrast to previous decades, computers now have one or more CPUs with multi-core processors, which permits the reprogramming of sequential algorithms in parallel. One of the advantages of parallel programming is that it is possible to divide large data sets into smaller ones which can be evaluated separately. The results are then combined by simple summing operations. The paradigm of P-thread programming on multi-core processors plays an important role in understanding and improving software through multitasking. For scientific computing it is thus possible, in principle, to divide the calculations into groups, with each group executed on its own processor.
In particular, the POSIX Pthread library of the ANSI C programming environment provides an interface which permits the interaction of p-threads within the execution of a program [46]. In this article the EnKF is parallelized with the POSIX library to perform the filtering of the large precipitation data sets: the data is divided into small lots and each lot is filtered independently. The approach executes the EnKF simultaneously and independently for several meteorological stations. The high level algorithm is:

a. Initialize
• Read the daily precipitation data of the meteorological stations.
• Assign the precipitation data to r independent variables.
• Define the Fokker-Planck equation for the precipitation model given by the equations (20) and (2).
• Read the parameters Q_t, µ, σ, ∆x and ∆t for the system obtained from the Fokker-Planck equation given in (28).

b. Create threads
• For each variable, create P threads.
• The P threads execute the EnKF as functions.
- In the prediction ensemble step of the EnKF function:
* Solve the system of equations p^{m+1} = A⁻¹ B p^m given in (28) for each time t using the GSL scientific numerical library.
* Using the values p_i(x_t, t) = p^{m+1} obtained in the solution of the system (28), calculate the truncated chaos expansion (24) to obtain the approximate value of the solution of the stochastic differential equation given in (20).

To measure the calculation speed of the PEnKF, the acceleration and efficiency factors as defined in [46] were used.
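The article implements this with POSIX pthreads in ANSI C; the same station-level decomposition can be sketched in Python (an illustrative analogue in which a placeholder running-mean filter stands in for the full EnKF of Section 4):

```python
from concurrent.futures import ThreadPoolExecutor

def filter_station(series):
    """Placeholder per-station filter; a real implementation would run
    the full EnKF on this station's precipitation series."""
    # Illustrative stand-in: a running mean as the "filtered" state.
    out, acc = [], 0.0
    for i, y in enumerate(series, start=1):
        acc += y
        out.append(acc / i)
    return out

def filter_all(stations, n_threads=2):
    """Filter each station's series on its own thread (P = n_threads),
    mirroring the station-independent parallelization described above."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(filter_station, stations))
```

Because the stations are treated as independent, the threads share no state and the per-station results can be collected in input order.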

Criteria for Model Validation
To compare the quality of the predictions and forecasts obtained from the adjusted model, we use the two validation criteria proposed in [5]: the root mean square error (RMSE) and the mean absolute error (MAE). These validation criteria are defined as follows.

1. Root mean square error:

RMSE = sqrt( (1/N) Σ_{j=1}^N (x̂^j_t − x^j_t)² ).

2. Mean absolute error:

MAE = (1/N) Σ_{j=1}^N |x̂^j_t − x^j_t|,

where
• x^j_t is the true state for the j-th simulation, j = 1, ..., N;
• x̂^j_t is the Monte Carlo estimator of E(x_t | y_{1:t}) for the j-th signal, j = 1, ..., N.
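The two criteria reduce to a few lines of code; the sketch below (illustrative, following the definitions above) computes them for paired true and estimated states:

```python
from math import sqrt

def rmse(truth, estimate):
    """Root mean square error between true states and their estimates."""
    n = len(truth)
    return sqrt(sum((e - t) ** 2 for t, e in zip(truth, estimate)) / n)

def mae(truth, estimate):
    """Mean absolute error between true states and their estimates."""
    n = len(truth)
    return sum(abs(e - t) for t, e in zip(truth, estimate)) / n
```

Since RMSE weights large errors more heavily than MAE, a small gap between the two measures indicates that no individual prediction deviates strongly from the true signal.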

Results
In this work, a space-time model based on the stochastic differential equation was implemented for the prediction of rainfall, using the daily precipitation series from January 2011 through May 2012 for ten (10) meteorological stations in Venezuela: three located in Aragua state (Ceniap, Tamarindo and Tucutunemo), two in Portuguesa state (Araure and Turen), one in Merida state (Mucuchies), one in Barinas state (La Planta), one in Guarico state (San Pedro), one in Tachira state (Pueblo Hondo) and another in Yaracuy state (Yaritagua) (see Figure 1). The data is available at http://agrometeorologia.inia.gob.ve/. Missing data was estimated by generating independent random variables with the Gibbs sampling algorithm, sampling from a truncated normal distribution (see [37] for more details). The EnKF algorithm was programmed in parallel using the POSIX (Pthread) library in the ANSI C programming environment, on an Intel Core i7 3.6GHz CPU with 16GB RAM running 64-bit Debian Linux.
The EnKF was parallelized under the supposition that the meteorological stations are independent. This supposition is reasonable since the stations are located in different regions, distant from one another, and it permits performing each analysis separately. Even though it is possible to consider a covariance structure among the stations and perform a joint analysis, this is not studied in this article because the objective is to study the prediction capacity of the PEnKF. To initialize the algorithm in the stations of Aragua state, the following prior distributions were used:
• Ceniap station: Q_t = 1, σ²_v = 100, µ = 18.56, σ = 5 × 10⁴, ∆x = 10⁻² and ∆t = 10⁻⁶.
• Tamarindo station: Q_t = 1, σ²_v = 10, µ = 18.56, σ = 10, ∆x = 10⁻¹ and ∆t = 10⁻¹.
• Tucutunemo station: Q_t = 1, σ²_v = 1000, µ = 18.56, σ = 10⁻¹, ∆x = 10⁻¹ and ∆t = 10⁻¹.
In Figure (2) the real daily precipitation data is shown for the Ceniap station located in Aragua state, together with the posterior means of the states estimated by the PEnKF. A good approximation is observed between the real and estimated data. The other stations, Tamarindo and Tucutunemo, show similar behaviour.
To initialize the PEnKF for the stations of Portuguesa state, the following prior distributions were used. To initialize the PEnKF for the station in Merida state, the following prior distribution was used:
• Mucuchies station: Q_t = 1, σ²_v = 15 × 10³, µ = 18.56, σ = 4 × 10⁴, ∆x = 10⁻¹ and ∆t = 10⁻¹.
Figure (3) shows the real daily precipitation data for the Araure station located in Portuguesa state, together with the posterior mean of the states estimated by the PEnKF. The graph shows a good correspondence between the predicted and real data. The Turen station shows similar behaviour. Figure (4) shows the real daily precipitation data for the Mucuchies station located in Merida state, together with the posterior states estimated by the PEnKF. As in the other cases, the algorithms for the reconstruction of the data perform well.
To initialize the PEnKF for the stations of Barinas state, the following prior distributions were used.

For the rainfall predictions, the data series from January 2011 to May 2012 were used to predict the precipitation for the first day of June 2012. Figure 6 shows the distribution of the average daily rainfall in millimeters (mm) for the first day of June at the 10 meteorological stations under consideration; the left panel shows a map of the real rainfall and the right panel shows a map of the predicted rainfall. Considering that only a small sample of stations was used, the precipitation model does a good job of reproducing the main spatial features. The range of rainfall greater than 50 mm occurs in San Pedro (Guarico) and is represented by only one station. Analogously, the range of rainfall less than 2 mm occurs in the stations of Yaritagua and Tucutunemo.
The rainfall volume for Guarico state is 60 mm, while the rest of the stations present a volume of less than 15 mm. The model reproduces the main features of the rainfall pattern, although precipitation varies greatly both spatially and temporally in Venezuela. The topography introduces significant variations in the eastern zone (Guarico state), where an increase in rainfall is seen, in contrast to the western zone (Tachira state), which presents a decrease in rainfall.
The rainfall predictions for the Turen, Araure, Mucuchies, La Planta, Tamarindo, and San Pedro stations correspond exactly to the real values, whereas the predicted rainfall for the Yaritagua, Tucutunemo, Ceniap, and Pueblo Hondo stations is close to the real values. Tables (I) and (II) show two goodness of fit criteria for the prediction model applied to the meteorological stations under consideration. The low error levels and the small difference between the two measures indicate that the estimator is very close to the true signal. Table (III) shows a faster execution of the PEnKF algorithm with respect to the sequential version for P = 2, P = 4 and P = 32 threads. As the number of P-threads increases, the efficiency decreases. Considering this, the ideal case would be P = 2, which provides a 7% increase in speed with respect to the sequential algorithm.
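For reference, the acceleration (speedup) and efficiency factors of [46] are simple ratios; a 7% speed increase for P = 2 corresponds to S_2 = 1.07 and E_2 = 0.535 (an illustrative sketch; the actual timings are those of Table (III)):

```python
def speedup(t_seq, t_par):
    """Acceleration factor S_P = T_sequential / T_parallel."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Efficiency E_P = S_P / P for P threads."""
    return speedup(t_seq, t_par) / p
```

Efficiency below 1/P of linear scaling as P grows is consistent with the observation that adding threads beyond P = 2 lowers efficiency for this workload.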

Conclusions and Discussions
This article presents an implementation of the PEnKF algorithm based on the Wiener-Askey polynomial chaos theory. After using a finite difference method to determine a solution of the Fokker-Planck stochastic differential equation which models rainfall, the solution is expressed in terms of a truncated polynomial chaos expansion, which is then sampled to generate the ensembles. The state estimation uses Bayesian inference in a two step data assimilation process. First, the states of the system are predicted using the prior information at time t − 1, and second, the posterior mean and covariance of the predicted states are updated at time t using the Bayes linear adjustment. A graphical representation of the real data is shown together with the predicted states (see Figures (2), (3), (4), and (5)), showing that the algorithm reconstructs the true states with high precision. Maps are used to show the real rainfall at the ten stations and the posterior means of the estimated rainfall (see Figure (6)). These maps show that the principal spatial features are reproduced exactly for six of the ten stations and almost exactly for the other four. To validate the quality of the estimations, the root mean square error and the mean absolute error of the posterior states were calculated; little variability between the two goodness of fit measures was observed, showing that the model predicts adequately. Finally, the performance of the parallel algorithm was measured by determining the acceleration and efficiency factors, which showed an optimal performance for P = 2 threads, with a speed increase of 7% with respect to the sequential version of the algorithm.