Spatial assessment of water river pollution using the stochastic block model: Application in different station in the Litani river, Lebanon

Water pollution is a major global environmental problem. In Lebanon, water pollution threatens public health and biological diversity. In this work, a non-classical classification method was used to assess water pollution in a Mediterranean River. A clustering proposal method based on the stochastic block model ( SBM ) was used as an application on physicochemical parameters in three stations of the Litani River to regroup these parameters in different clusters and identify the evolution of the physicochemical parameters between the stations. Results showed that the used method gave advanced findings on the distribution of parameters between inter and intra stations. This was achieved by calculating the estimated connection matrices between the obtained clusters and the probability vector of belonging of the physicochemical parameters to each cluster in the different stations. In each of the three stations, the same two clusters were obtained, the difference between them was in the estimated connection matrices and the estimated cluster membership vectors. The power of SBM proposed methods is demonstrated in simulation studies and a new real application to the sampling physicochemical parameters in Litani River. First, we compare the proposed method to the classical principal component analysis (PCA) method then to the Hierarchical and the K-means clustering methods. Results showed that these classical methods gave the same two clusters as the proposed method. However, unlike the proposed SBM method, classical approaches are not able to show the blocks structure of the three stations.


Introduction
Water is an essential natural resource for any ecosystem. Maintaining its quality is a major concern for society, especially with the increasing future water needs. Freshwater ecosystems are natural compartments necessary for the continuity of life. They are essential for various activities such as the production of drinking water, industrial and agriculture activities, hydropower generation [Ridzuan et al.(2020)], and recreational activities [Simpi et al.(2011)]. Unfortunately, they are among the most seriously threatened ecosystems due to anthropogenic activities over the past century [Dudgeon et al. (2006)]. However, the quality of groundwater and surface water in Lebanon, in most cases, does not meet international standards levels. The impact of human activities in the sense of simplification, modification, and alteration of natural ecosystems is becoming increasingly alarming 1206 SPATIAL ASSESSMENT OF WATER RIVER POLLUTION USING THE STOCHASTIC BLOCK MODEL aims to produce classes, called blocks, or more generally clusters in networks. Several authors have proposed some generalization of this model [Mariadassou et al.(2010)] have treated the case of weighted networks by using the SBM model ], [Ng and Murphy (2021)] and then [Latouche et al.(2011)] have focused on the SBM model with overlapping clusters. More recently, Barbillon et al.[Barbillon et al.(2017)] have dealt with the case of multiplex networks, where several edges of different types can exist between a pair of nodes and [Zreik et al.(2017)] and [Matias and Miele (2017)] have extended the model to deal with dynamic networks. However, ] have proposed a binomial SBM to deal with co-citation networks. We are interested here in estimating the parameters in the SBM model as well as in clustering the vertices of the considered network. Several authors have focused on estimation methods. First, [Snijders and Nowicki (1997)] has proposed a maximum likelihood inference based on the expectation-maximization (EM) algorithm to estimate the probabilities of connection between nodes and to predict the clusters in the SBM model having only two blocks. Then, [Nowicki and Snijders (2001)] have generalized the previous work by proposing a Bayesian approach based on Gibbs sampling to deal with the case of the SBM model with an arbitrary number of blocks. However, the EM approach is intractable in the case of the SBM model because of the dependency of the edges in the network. To tackle this issue, [Daudin et al.(2008)] has proposed a variational approach by introducing the variational expectation maximization (VEM) algorithm while [Latouche et al.(2012)] has proposed a variational Bayesian approach based on variational Bayes EM algorithm (VBEM) to estimate these parameters. To prove the validity of this method in the environmental field, we introduce some applications of the proposed approach using first simulated data, then using the physicochemical parameters. We develop and implement the method on simulated data sets (to validate the procedure) and to show the efficiency of our approach by calculating the root mean square error and the Adjusted Rank Index. Also, we propose to use the variational approach developed by [Daudin et al.(2008)] to estimate the parameters as well as to classify the nodes in a Gaussian SBM model. This approach is consistent under the SBM model according to Celisse et al.[Celisse et al. (2012)].

Site Selection and Description
Three sites have been selected in the Litani basin for the sampling to achieve the SBM study, Jeb-Jenin station, Lake Qaraoun, and Ghzayel. The selection of these 3 stations in the Litani river is due to two main concepts; the first one is, the stationed distribution of zones, one at the upper (Jeb-Jenin, Ghzayel), of the river, the second in the middle (Lake Qaraoun). The second concept is due to the importance of the three zones, while the (Jeb-Jenin and Ghzayel) station in the Upper Litani River is located in the zone with complex industries and a huge population, the second is Lake Qaraoun, the only lake in Lebanon, located in the middle of the river and divided it into two parts (see Figure1). The Litani River has 16 tributaries with Berdawni, Chtoura, and Ghzayel rivers being the most important ones. The average discharge of the Ghzayel stations is 2.8 (m 3 /s) and 891 (m) as an altitude, with the coordinate: N 33 • 43 ′ 56 ′′ N (Latitude), E35 • 56 ′ 52 ′′ E (Longitude). Noted that the description of the sites Jeb-Jenin (Upper Litani River), with the coordinate: N 33 • 38 ′ 331 ′′ E35 • 46 ′ 79 ′′ with elevation above the sea level equal to 920 (m). Jeb-Jenin is the largest and most populated town in its district. Lake Qaraoun (Middle of Litani River), with the coordinate: N 33 • 46 ′ 42 ′′ (Latitude) E35 • 53 ′ 26 ′′ (Longitude), with elevation above the sea level equal to 864 (m). Qaraoun is the biggest lake in Lebanon with a total capacity of 220 Million cube meters.

Sampling test
Several parameters such as Temperature, DO, pH, conductivity, total dissolved solids (TDS), Salinity, Ammonia, Nitrite, Nitrate, SO4, PO4, were analyzed for each water sample collected during the 2008-2018 periods in three stations using spectrophotometry (Table 1) [Fadel et al.(2019)] [Fadel et al.(2021)]. All samplings were taken from the same points in the three stations. The same sampling method was used for all samples, 2 (m) from the side, and 20 (cm) deep from the sub-surface. Each sample consists of 5 trials, while each measurement was performed 5 times and the average is recorded. These samples were taken at the beginning of each month.  In this section, we present the necessary tools to develop our methodology to clustering the physicochemical parameters. For this reason, we consider a weighted undirected network represented by G : ([n], X), where [n] is the set of weighted nodes {1, . . . , n} for all n ≥ 1 and X is the symmetric weighted matrix of dimensions n × n encoding the of intensity of the observed interactions between nodes. In this context, the weighted nodes denotes the physicochemical parameters and the adjacency matrix X encodes the interaction between these parameters such as, for all i, j ∈ {1, . . . , n}, X ij = m ij if the nodes i and j interact with an interaction weight m ij 0 otherwise.
We denote by Q the number of clusters (Q > 1) and by Z the binary indicator matrix labeling the assignment of the physicochemical parameters into groups. We have for all i ∈ {1, . . . , n} and q ∈ {1, . . . , Q}, Z iq = 1 if node i belongs to group q 0 otherwise.

Mixture model with latent classes
We propose to generate the stochastic block model as follows: • The (latent) vectors Z i , for i ∈ {1, . . . , n}, are independent and sampled from a multinomial distribution as follows where α = (α 1 , . . . , α Q ) is the vector of class proportions of dimension 1 ×Q such as Q q=1 α q = 1.
• The (observed) variables {X ij , i, j ∈ [n], i < j} are independent conditionally on {Z i = q, Z j = l}, and are sampled from a Gaussian distribution as follows where µ ql and σ 2 ql denotes respectively the mean and the covariance parameters associated to the Gaussian distribution.

Inference
We are interested here in estimating the parameter θ = (α, µ, Σ) of the model in a weighted undirected network. However, we claim that all results obtained in this paper can be extended to directed networks.
Since the variable Z is latent, our model belongs to the class of incomplete data models. The log-likelihood of the incomplete data can be expressed as follows where P θ (X, Z) is the joint distribution such that The equation (1) is intractable for large networks since it requires a summation over all the possible values of Z. Thus, we propose to use the expectation-maximization (EM) algorithm. It is an iterative method that consists in computing P θ (Z | X). Hence, it is intractable in this context due to the dependency of the variables X ij . Therefore, we use in the sequel the variational expectation maximization (VEM) algorithm. This method overcomes the issue by maximizing a lower bound of the log-likelihood based on an approximation of the true conditional distribution of the latent variable Z given the observed variable X. We rely on a variational decomposition of the incomplete log-likelihood as follows The log likelihood of the incomplete data log P θ (X) does not depend on the distribution R X (Z), thus, maximizing the lower bound J θ with respect to R X (Z) is equivalent to minimize the Kullback Leibler divergence KL.

SPATIAL ASSESSMENT OF WATER RIVER POLLUTION USING THE STOCHASTIC BLOCK MODEL
According to [Blei et al. (2003)], the approximate distribution R X (Z) can be factorized over the latent variables Z i as follows where {τ i ∈ [0, 1] Q , i = 1, . . . , n} are the variational parameters associated with {Z i , i = 1, . . . , n} such as q τ iq = 1, ∀i ∈ {1, . . . , n} and h is the multinomial distribution with parameters τ i . By combining equations (1), (3) and (4), we obtain The VEM algorithm alternates between the optimization of τ and θ = (α, µ, σ) until the convergence of the lower bound. During the E-step, the parameter θ of the model is fixed. We maximize J θ (R X ) with respect to τ . Under the condition q τ iq = 1, for i ∈ {1, . . . , n}, we obtainτ by a fixed point relation The estimation of τ is obtained from (6) by iterating a fixed-point algorithm until convergence.
During the M-step, the parameter τ is fixed. We maximize first J θ (R X ) with respect to α. Under the condition Then, we maximize J θ (R X ) with respect to µ and Σ respectively, we obtain

Choice of the number of groups
In practice, the number of groups is unknown and should be estimated. We use the integrated classification likelihood (ICL) criterion in order to perform the selection of the most adequate number of blocksQ. Roughly, this criterion is proposed by [Daudin et al.(2008)] and is based on the complete data variational log-likelihood penalized by the number of parameters.
The (ICL) is of the form The VEM algorithm is run for different values of Q thenQ is chosen such that ICL is maximized Q = argmax Q (ICL(Q)).

Simulated data
First, before applying the SBM on physicochemical parameters, we perform the stochastic block model using simulated data with a Gaussian output distribution. The graph has n = 100 vertices. We choose a number of clusters Q equal to three. We use in the simulation the following parameters: Now we sample S = 100 random graphs according to the same mixture model. Then we calculate in Table 2, Table  3 and Table 4, for each parameter, the estimated Root Mean Square Error (RMSE) defined by: where the superscript s labels the estimates obtained in simulation s. According to Table 2, Table 3 and Table 4, we can clearly show that the RMSE of the model's parameters are close to zero. This means that the obtained estimated parameters are close to the observed parameters. In order to compare the estimated clustering results to the simulated ones, we propose to calculate the Adjusted Rand Index (ARI) proposed by [Hubert et al.(1985)]. It is a measure of agreement between two data partitions. The ARI has a value between 0 and 1, with 0 indicating that the two data clustering do not agree on any pair of points and 1 indicating that the data clustering is exactly the same.
The average of the ARI between the simulated clustering results and the estimated clustering results obtained using the proposed method is equal to 0.88. This means a high agreement between the two partitions of the nodes.

Clustering in environmental network
The data has the form of a physicochemical parameter-by-river station matrix. Each cell represents the value of the physicochemical parameter for each river station. We apply the SBM with Gaussian distributed weight method to cluster the physicochemical parameter of the considered network. Thus, we transform the matrix data into a physicochemical parameter-by-physicochemical parameter matrix of dimension 11 × 11. The associated network has 11 vertices connected by weighted edges. A weight associated with a pair of physicochemical parameters represents the Wasserstein distance between the values of these two physicochemical parameters. The Wasserstein distance (also known as the earth mover's distance) is a measure of the distance between two probability distributions. It is available in the package "transport" of the software R under the name "Wasserstein". In each station, the physicochemical parameters data are collected monthly between 2008 and 2018. It has the form of a matrix of dimension 95×11, where each cell C i,j of this matrix represents the value of the j th physicochemical parameter at the i th month. We transform this matrix into a weighted matrix X of dimension 11×11, where each cell m i,j represents the Wasserstein distance between the i th and the j th physicochemical parameters. The network associated with this matrix is formed by 11 nodes, where each node represents one of the 11 physicochemical parameters, and an edge presented between two of these parameters is weighted by the computed Wasserstein distance between this pair of parameters. In the following, we are interested in comparing the clustering of the networks associated with the three river stations: Qaraoun, Jeb-Jenin, and Ghzayel. By applying the Gaussian SBM, we show that it reveals two clusters for each Jeb-Jenin station as shown in figure 2 by using "Gephi" software with the layout algorithm "Force Atlas". Several classical statistical methods are used in the literature to show the relationship between the physicochemical properties, among them we count the principal component analysis (PCA), the hierarchical classification, etc. We used in this article the SBM method, which is suitable for clustering big data to recover the community structure. We noted that hierarchical clustering requires the measure of similarity and dissimilarity between clusters. So, it is required to determine the proximity matrix which contains the distance between each pair of physicochemical parameters using a distance function (i.e. Euclidean distance, Manhattan distance, etc.). In the following tables, we give the weights matrix associated with the nodes. Based on the above tables we can conclude the following: In the three stations, two clusters are obtained after applying the SBM with Gaussian distributed weight method. TDS, salinity, and conductivity form the first cluster, and the rest of the parameters form the second one. Although the clusters contain the same parameters in all three stations, the weight matrix differs from one station to another. In other words, the relation between the parameter is more or less strong depending on the type of activity around each station.
First, we will explain the relationship between the parameters of each cluster, then we will demonstrate how can this relationship be affected by the surrounding environment.
In the first cluster, TDS, salinity, and conductivity are directly related because firstly, TDS is the dissolved solids in water, of which a high percentage is usually mineral salts, therefore, water salinity increases with high TDS levels. Secondly, water salinity boosts its conductivity, the more saline the water, the higher its ability to pass an electrical flow.
In the second cluster, ammonia, nitrite, and nitrate are part of the nitrogen cycle. Ammonia goes through the nitrification process to produce nitrite, which will then transform into nitrate after oxidation using the dissolved oxygen in the water. This cycle plays an important role in the fluctuation of water's pH, because the presence of ammonia increases water's pH, and after nitrification, pH levels decrease again.
pH is an important parameter since it is responsible for the biodiversity in the water, which in turn affects the amount of dissolved oxygen (DO). Also, phosphate and nitrogen products initiate the eutrophication phenomenon responsible for the growth of plants and algae, and this is directly related to the decrease in DO levels, as shown by [Fadel et al.(2015)].
We can see a correlation between temperature and other parameters. Temperature affects water solubility in general, that is why we can see connections between temperature and all the other parameters, whether it is a positive or negative relation, the temperature has a direct effect on the presence of many elements in the water, and thus on its overall quality. We can also find an indirect relation, in fact, the change in some physicochemical parameters is mainly due to precipitations and springs during the wet season, where, by default, the temperature is lower than usual.
Regarding the weight matrix, the intensity of a relationship between a parameter and another is mainly affected by the type of activity around the station we are studying. For example, Qaraoun lake was initially created to generate hydropower, it is then surrounded by factories and engines. On the other hand, Jeb-Jenin is located in the Bekaa valley, where the Lebanese agricultural activity is mainly practiced, and the use of fertilizers and pesticides is uncontrolled. Contrarily, the Ghzayel river mainly contains springs and it is considered a touristic attraction, which makes it less subjected to chemicals and industrial wastes, but it is still subjected to pollution from wastewater. This difference in human activities resulted in different interactions between the physicochemical parameters. In Jeb-Jenin station, a strong relation between nitrite and conductivity is tracked, and this can be explained as follows: during the wet season, springs carry the dissolved solids into the river increasing water is conductivity, and since this station is surrounded by agricultural activity, springs will also carry phosphate, which is present in most fertilizers, from cultivated land to the river. The presence of phosphate at a high concentration will eventually lower the amount of DO as explained before, thus the oxidation of nitrite into nitrate will be inhibited resulting in higher concentrations of nitrite. In Qaraoun lake, we can see the same correlation as in Jeb-Jenin, because the type of activity and the sources of pollution around these two stations are similar. This relationship between nitrite and conductivity doesn't exist in the Ghzayel river. Instead, we can see a strong relation between ammonia and conductivity, which is a sign that the origin of pollution in the Ghzayel river comes from wastewater.

Estimation of the parameters
We compare here the estimated parameters obtained by applying the proposed method to the three different stations of the Litani river: Qaraoun, Jeb-Jenin and Ghzayel. We compute now the error for the estimated parameters. Letĝ be the estimated community membership vectors, N k = {i : 1 ≤ i ≤ n,ĝ i = k} the set of all the physicochemical parameters that belong to the cluster K and n k =|N k | for all 1 ≤ k ≤ Q. The plug-in estimator of µ and σ 2 are respectivelỹ We compute for the three river stations the mean error E µ between the estimated mean connection matrixμ ql and the plug-in estimatorμ ql and the mean error E σ betweenσ ql andσ ql defined respectively as: We obtain E µ = 8.75 × 10 −4 , E µ = 5 × 10 −5 and E µ = 0.011 for the Qaraoun, Jeb-Jenin and Ghzayel stations respectively. However for the standard deviation parameters, we obtain E σ = 0.0113, E σ = 0.0822 and E σ = 4.45 × 10 −3 for these stations respectively. Thus, E µ and E σ are close to 0 and then the results obtained using the proposed SBM method are satisfying.

Classical clustering method
We have already used the SBM method to classify the physicochemical parameters in the three stations of the Litani river in order to assess the pollution of the Litani river. The SBM method was successfully treated on a basis of data available over 11 years, thus it would be more interesting and useful if we apply the method of SBM on a bigdata. In order to give more of the SBM method's advantages compared to the classical classification methods which are used many times in the literature, we summarize below the different classification methods which have been used to classify the physicochemical parameters at the Quaraoun station as an example.

PCA method
Unfortunately, there is no specific method for deciding how many main axes are sufficient. In general, the choice of axes in the PCA will depend on the specific field of application and data set. In practice, we tend to look at the first major axes with the highest percentage of inertia in order to find interesting patterns in the data [Jolliffe (1986)]. For that, we choose the first two principal components which explain approximately 52% of the variation.  Figure 3 shows the relationships between the physicochemical parameters. First, the positively correlated parameters are grouped together (TDS, Conductivity, Salinity). Then the negatively correlated parameters are positioned on opposite sides of the origin of the graph (opposite quadrants). The distance between the parameters and the origin measures the quality of the representation of the parameters. Parameters that are far from the origin are well represented by the PCA. We point out that the quality of representation of the parameters on the PCA graph is called cos 2 (cosine squared); the parameters with the moderate values of cos 2 will be colored in "blue" and the parameters with high values of cos 2 will be colored in "red". So, Conductivity, Salinity, and TDS are close to a circle which indicates a good representation of these parameters on the main axes. PO4 and SO4 have a low cos 2 (close to the center of the circle) which indicates that these two parameters are not perfectly represented by the first two main axes. So, we can see that the parameters are contributed on 5 axes in the below figure 4. We notice that TDS, Salinity, and Conductivity are related just between them by axis 1 (Dim1), while the other parameters are related to each other on different axes, we can classify the parameters according to two clusters (cluster 1: Conductivity, TDS, Salinity) and (cluster 2: Temperature, pH, DO, Ammonia, Nitrite, Nitrate, SO4, PO4).

Hierarchical Cluster
In this subsection, we present the classical hierarchical method to classify the physicochemical parameters of the Quaraoun station. Note first that the hierarchical classification does not require determining the number of classes previously, unlike the k-means method. Indeed, we can choose a number of classes by observing the dendrogram.  Based on the figure 5, we classify the physicochemical parameters into two clusters. We also notice that the hierarchical classification gives the same distribution of the physicochemical parameters in two clusters but without knowing the intensity of connection between these parameters. In addition, the hierarchical classification has a strong algorithmic complexity in time and space. Hierarchical clustering is therefore more suitable for small samples.

K-means Cluster
The k-means method is generally simple, understandable, and applicable to large data. We note that the number of the classes must be fixed at the start by the method of k-means, in addition, it does not detect the noisy data, and the results depend on the initial drawing of the center's classes [Bradley et al. (1998)]. According to the figure 6, the k-means method classifies the physicochemical parameters into two clusters; the distances between the parameters of cluster 1 are close. Contrarily, in cluster 2 the distance between conductivity and the two parameters Salinity and TDS is far. We notice that the two k-means and hierarchical methods classify the parameters into two groups similar to the SBM method, then the three classification methods are indeed applicable to the datasets of the physicochemical parameters of the Quaraoun over a period between 2008 and 2018. Note that each of these methods has different characteristics that are applicable according to the types of data.

Conclusion
In this article, we have introduced a new approach in the environmental field based on statistical models for the classifications of physicochemical parameters at the Litani River in Lebanon, in particular the SBM. This method differs from the other classical methods since it can show the block structure of the three station networks by estimating the connection matrices between the obtained clusters and the membership vectors belonging of the physicochemical parameters to each cluster of these three stations. Using the SBM, we were able to identify two clusters in each station. The clusters were similar in each station, however, the significance of the correlation between the parameters of each network was different. We noticed that the type of human activity around the stations was mainly responsible for this difference in correlations. We explained the different types of pollution caused by certain activities and how they can change the behavior of these parameters and how they react together. Thus, we found that to make a good plan to improve water quality, it is best to locate the station correctly, study the type of activity around it, and specify the type of pollution to which the station is subjected. After analyzing the results, we can deal with the parameters as groups of related parameters instead of treating each parameter separately, which is cost-effective and time-saving.