Community Detection in Social Networks Using Consensus Clustering

Community detection is one of the most appealing research fields in computer science. Although many different methods have been proposed to cluster the nodes of a graph, none of them is complete: each has strengths and weaknesses in extracting highly coherent groups of nodes (i.e. communities or clusters). The differences among methods typically stem from two main factors: 1) the structure of the network they operate on, and 2) the strategy they use to find clusters. Since no single method is optimal, it is natural to combine them so as to exploit their strengths while minimizing their weaknesses. In this paper, we present a new approach to the community detection problem based on an ensemble of community detection methods, which we call "Mitra". The base methods employed in Mitra use different techniques and strategies to find communities for different applications of network data analysis. Mitra runs several known base community detection methods on a network and builds a bipartite network to combine the communities they find. A fast projection technique then compresses and summarizes the bipartite network into a new unipartite one, whose communities are detected in the final step. We evaluate our approach on real and artificial datasets and compare our method with each of the base methods. The artificial datasets include a diverse collection of large-scale benchmark graphs. The main experimental evaluation function in this work is Normalized Mutual Information (NMI). We also use several measures to compare the quality and properties of the final community structures of the partitions found by all methods.


Several community detection algorithms give satisfactory results when tested on networks. Nevertheless, since some methods exhibit serious limitations and biases, and most algorithms are likely to fail in some limit, an ideal method to detect communities in graphs does not exist [30]. To overcome this problem, it is natural to exploit the strengths of different existing methods while avoiding their weaknesses. Several approaches have been proposed to reach this goal. One of them is ensemble clustering (also known as consensus clustering), inspired by ensemble learning, in which multiple community detection algorithms run as an ensemble and the identified communities are then combined into an improved result. Ensemble clustering is also used in fields such as machine learning to merge several clustering results into one [25]. In this work, we use ensemble clustering to combine the outputs of several well-known community detection algorithms (called base classifiers or base methods), compress their results, and reach a consensus among the viewpoints of these algorithms on a community detection problem. This fusion process decreases the generalization error, because the more the predictors differ, the better the ensemble performs. This also explains why an ensemble of classifiers can perform better than a more advanced single classifier: the error rate can be decreased by increasing the number of classifiers included in the ensemble [62]. To create our consensus method Mitra, we select a number of widely used community detection algorithms that perform well on graphs and collect their results on the same network. A bipartite network is then generated to combine the information provided by the different base methods in the consensus pool.
As the next step, a special projection [42] is applied to this bipartite network to convert it into a unipartite network, called the consensus graph. Finally, we detect the communities of the consensus graph with one of the existing community detection algorithms.
We evaluate our approach on several artificial networks and compare it with each of the methods used in the fusion. The comparison between the reference communities and the detected ones is based on NMI values [21] (see Subsection 5.1). To evaluate the results further, we also investigate several quality scores on the detected communities. It is known that there is no single perfect quality metric for comparing the communities detected by different algorithms [15]. Therefore, we use a number of structural quality functions, such as conductance [40] and modularity [51], to evaluate the quality of the detected communities. Before we proceed, it is worthwhile to clarify some nomenclature: we use "consensus clustering" and "ensemble clustering" interchangeably, and "cluster" is equivalent to "community".
The rest of this paper is organized as follows. Section 2 reviews related work on ensemble clustering. Our new approach Mitra is explained in Section 3, together with the employed base community detection methods and our fast projection method. Section 4 describes the network datasets. Evaluation criteria are explained in Section 5. Implementation results of the proposed method and comparisons with the base methods are reported in Section 6, and, finally, the paper is concluded in Section 7.

Ensemble Clustering
In recent years a large number of methods for detecting community structures have been developed, drawing on knowledge from many different fields, e.g. computer science, statistical mechanics, discrete mathematics, statistics, and sociology. These methods have also been extended to handle weighted, directed, and multigraphs. For a comprehensive review of the field as a whole, the reader is referred to [31,28,64]. Another approach for improving community detection is ensemble clustering, which is inspired by ensemble learning: multiple community detection algorithms run as an ensemble and the identified communities are combined to improve community quality. By considering an ensemble of clustering methods, it is possible to take different definitions of community structure into account. In addition, more effective algorithms can be obtained by merging (aggregating) many runs of fast stochastic algorithms, as well as several runs of the same algorithm with different settings. The latter can also be used to analyze the community structure of the network at many different scales, providing insight into the relations between community structures at different levels. The integration of consensus clustering with popular existing techniques leads to more accurate partitions than those delivered by the methods alone on LFR benchmark graphs [43]. Interestingly, this holds even for methods whose direct application gives poor results on the same graphs, like modularity optimization [31]. In other words, when identifying communities one is often confronted with a large number of potential partitions and often no good way to select a single best one. Consensus clustering attempts to mitigate this problem by identifying common features of an ensemble of partitions [37].
One widely applied method for ensemble clustering is based on constructing a consensus graph, which summarizes the set of partitions to be combined, as returned by the base community detection algorithms [27,60]. The consensus graph G_cons is defined over the set of nodes of G. Two nodes v_i, v_j ∈ V are linked in G_cons if there is at least one partition in which both nodes are in the same cluster, and each link (v_i, v_j) is weighted by the frequency with which v_i and v_j are placed in the same cluster. The obtained graph is not necessarily connected. If there is no a priori information about the relative importance of the individual groupings, a reasonable goal for the ensemble result is to seek a clustering that shares the most information with the original clusterings [60]. Different approaches can be applied to compute the aggregated clustering out of the consensus graph; some of them are pointed out in the following. In [60], the authors transform the consensus graph into a complete one by adding the missing links with null weight; the nodes are then partitioned into clusters using agglomerative hierarchical clustering with some linkage rule, or a classical graph partitioning method such as the Kernighan-Lin algorithm [49]. In another work [52], the authors search for a strong partition by maximizing the overlap of the weak input partitions. They show that if the restarting process begins from maximal overlaps of the initial partitions, the quality of the initial weak partitions matters little and the final result will be good. The drawback is that if the base algorithms do not agree on the communities, the method returns no ensemble solution. The authors of [43] show that consensus clustering can be combined with any existing method in a self-consistent way, enhancing the stability and the accuracy of the resulting partitions.
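As a concrete illustration, the weighted co-occurrence consensus graph described above can be sketched in a few lines of Python. This is a minimal sketch under our own naming and data layout: link weights are normalized here to the fraction of partitions in which a pair of nodes co-occurs.

```python
from collections import defaultdict
from itertools import combinations

def consensus_graph(partitions):
    """Build the weighted co-occurrence consensus graph.

    partitions: list of partitions, each a list of clusters (sets of nodes).
    Returns a dict mapping node pairs (u, v), u < v, to the fraction of
    partitions that place u and v in the same cluster.
    """
    counts = defaultdict(int)
    for partition in partitions:
        for cluster in partition:
            # every unordered pair inside a cluster co-occurs once
            for u, v in combinations(sorted(cluster), 2):
                counts[(u, v)] += 1
    r = len(partitions)
    return {pair: c / r for pair, c in counts.items()}

# Two toy partitions of the node set {0, 1, 2, 3}
p1 = [{0, 1}, {2, 3}]
p2 = [{0, 1, 2}, {3}]
w = consensus_graph([p1, p2])
# (0, 1) co-occurs in both partitions, so its weight is 1.0
```

Nodes 2 and 3 co-occur in only one of the two partitions, so their link gets weight 0.5; pairs that never co-occur get no link, which is why the consensus graph need not be connected.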
That method integrates consensus clustering into a given method, which differs from our approach of combining several different algorithms. The ensemble clustering method proposed in [38] works on network partitions computed around a seed instead of growing communities around selected seeds: an ensemble ranking method fusing different local modularity functions computes the local communities, and an iterative agglomerative algorithm expands the seeds. One drawback of this work is that the networks used for evaluating the method are all small. In [39], given a set of r base partitions, the authors compute an r × r pairwise clustering similarity matrix M. A similarity graph G_V is defined over the set of base partitions using M, and the given samples are then clustered by a community detection algorithm applied on the similarity graph. A modularity-driven ensemble-based approach to multilayer community detection is proposed in [61], which aggregates the community structures generated separately for each network layer. This method is proposed only for multilayer networks and differs from our approach; another difference is that such consensus methods use different specifications of the network nodes to partition a network, whereas in our work all the base methods use the same node specification. In [24], the proposed ensemble consists of only two clustering algorithms and is evaluated on networks with simple structures (small values of the network parameter µ), not on sophisticated networks. A modified version of modularity, with a null model based on the ensemble of partitions, is used for the consensus clustering step in [37]; the authors propose a hierarchical consensus clustering procedure based on this modified modularity. A genetic algorithm for detecting community structure in attributed graphs is proposed in [53]; it optimizes a fitness function that combines node similarity and structural connectivity.
The communities obtained by the method are composed of nodes having both similar attributes and high link density.
The key distinguishing feature of our ensemble clustering procedure compared to these existing approaches is that we use different community detection algorithms instead of multiple runs of one algorithm. Moreover, our approach is neither an agglomerative nor an expanding method. Notably, some of the recently proposed ensemble approaches employ fundamentally different strategies from Mitra: they use only one community detection algorithm instead of several different ones [43,52], work on other types of networks such as multilayer networks [61], rely on multiple specifications of the network nodes [61] rather than a single specification as in our work, or are designed to reach a dissimilar target function [52]. A comparison of Mitra with these recent approaches therefore does not seem fair, and we do not compare Mitra with other ensemble algorithms in our experiments.
In this paper, we perform an empirical comparison and quality evaluation of the communities identified by a variety of community detection algorithms that use different strategies to find communities, and we compare their results with those of our consensus clustering algorithm. Our approach differs from the others in several aspects, such as using different community detection algorithms instead of only one. The base algorithms are limited to the same node specification, unlike some of the similar approaches which use several different specifications. Moreover, our approach is neither an agglomerative nor an expanding method. We propose to use a bipartite network as an intermediate network to fuse the previously generated results and to reach a consensus graph by the fast projection method [42]. The next section explains our ensemble approach in more detail.

Our Approach
In this section an overview of the proposed algorithm is presented. Then the base methods are introduced briefly and the fast projection technique [42] is explained.

Our algorithm
Our approach "Mitra" consists of the following phases:

1. Apply k − 1 base community detection algorithms {A_1, A_2, ..., A_{k−1}} on G to create a set P of partitions {P_1, P_2, ..., P_{k−1}} produced by these algorithms
2. Combine the partitions of P by constructing a bipartite graph G_b
3. Create a unipartite graph G_u by applying fast projection on G_b
4. Apply A_k, which is also one of the base community detection algorithms, on G_u to find the final partition P*

Figure 1 depicts the above algorithm. In the first phase, a number of popular community detection algorithms (called base methods) are applied to the network graph G to extract node communities, but one desired base method (called A_k) is held out for use in the last step. The set of base methods consists of algorithms that use different strategies to find communities; selecting different methods adds diversity to the consensus and brings different strategies and objective functions into the search for clusters. Each base method independently produces a partition P_i of the nodes. If sufficient resources are available, this phase can be run in parallel to decrease the execution time. The base method set includes CNM [19], Walktrap [54], CONCLUDE [22], Copra [35], Oslom [45], LPM [56], MCL [23], Infomap [57], and Louvain [16]; they are briefly introduced in Subsection 3.2. In the second phase, a bipartite network G_b is created based on the original network G. The bipartite network G_b includes two types of nodes: the real nodes of G as primary nodes, and secondary nodes associated with the communities found by the base community detection methods. In other words, for each base method we add as many community nodes as the communities found by that method, and connect each community node to all its real node members in G_b.
As a consequence, at the end of this step, G_b consists of all the real nodes of G and all the community nodes produced by all the base methods. G_b is a weighted graph: each community node divides probability 1 evenly among all the real nodes connected to it. The number of community nodes is Σ_{i=1}^{k−1} |P_i|, so the total number of nodes of G_b is N + Σ_{i=1}^{k−1} |P_i|, where N is the number of real nodes of G and |P_i| is the number of communities in partition P_i. In the third step, G_b is reduced to a unipartite network G_u by the fast projection technique [42]; G_u is our consensus graph. Only the real nodes of G are kept and the community nodes are removed, in such a manner that the effect of the removed community nodes is preserved in the weights of the links of G_u. Therefore G_u consists of the real nodes of G, but its links are different from the links of G: they encode the compressed result produced by the partitions of all the base methods. In principle G_u could be constructed in four different ways, as an undirected/directed and unweighted/weighted graph. Since fast projection builds links between primary nodes according to two-step walks and selects links according to the highest probabilities, G_u is a weighted directed graph. In the last step, the final partition P* is detected by applying A_k on G_u, where A_k is one of the base algorithms not applied in the first step. Each base algorithm can be selected as A_k; however, since G_u is essentially weighted and directed, some base methods are not capable of detecting communities on it. For example, CNM can only be applied to connected graphs, so it is eliminated from the set of candidate methods for the last step. Walktrap is not able to work on directed graphs, so it could only be used in the last step if G_u is considered undirected.
There are some techniques for converting a directed graph to an undirected one, such as those in Ref. [39]. We report the case of an undirected and unweighted consensus graph in our implementations. Note that although some base methods are capable of finding overlapping communities, only the versions able to find disjoint communities are used.
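The four phases above can be summarized as a simple data-flow sketch in Python. The helpers passed in below are degenerate stand-ins of our own devising, just to exercise the flow; the real constructions of G_b and G_u are described in the rest of this section.

```python
def mitra(G, base_algorithms, build_bipartite, fast_projection, final_algorithm):
    """Data-flow sketch of the four Mitra phases.

    Each algorithm is modelled as a function mapping a graph to a
    partition (a list of node sets)."""
    # Phase 1: run the k-1 base methods (parallelisable)
    partitions = [A(G) for A in base_algorithms]
    # Phase 2: combine the partitions into a bipartite graph G_b
    G_b = build_bipartite(G, partitions)
    # Phase 3: compress G_b into the unipartite consensus graph G_u
    G_u = fast_projection(G_b)
    # Phase 4: cluster the consensus graph with the held-out method A_k
    return final_algorithm(G_u)

# Degenerate stand-ins, only to exercise the data flow on a toy graph
toy_graph = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
base = [lambda G: [{0, 1}, {2, 3}],
        lambda G: [{0, 1}, {2}, {3}]]
result = mitra(toy_graph, base,
               build_bipartite=lambda G, ps: ps,
               fast_projection=lambda G_b: G_b,
               final_algorithm=lambda G_u: G_u[0])
# with these identity-style stand-ins, result is just the first base partition
```

The point of the sketch is the interface: every phase consumes exactly what the previous phase produces, which is also what allows phase 1 to run in parallel.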
To describe "Mitra" more formally, let G = <V, E> be an undirected simple graph, where V is the set of nodes and E is the set of edges. Suppose we have a set of base community detection algorithms A = {A_1, A_2, ..., A_k}. We apply the algorithms {A_1, A_2, ..., A_{k−1}} on G to generate a set of different partitions P = {P_1, P_2, ..., P_{k−1}} defined over the same set V, i.e. P_i is the partition of V generated by the base algorithm A_i applied on graph G. G_b is the bipartite graph generated to capture the relations between real nodes and community nodes, and it is reduced to a unipartite graph G_u, the consensus graph. A_k is the member of the base algorithms applied on the consensus graph G_u to produce the consensus clustering P*. We denote by P_G the ground truth partition, i.e. the reference partition we are looking for. The goal of our ensemble clustering function is to reach a consensus clustering P* such that the disagreement between P* and P_G is smaller than the average disagreement between the P_i's and P_G; in other words, Mitra does not produce a simple mean of the ensemble results but improves on the average. Formally, we require

NMI(P*, P_G) ≥ (1 / (k − 1)) Σ_{i=1}^{k−1} NMI(P_i, P_G),

where NMI() is a function measuring the similarity between two partitions [21]. The NMI value equals 1 if the partitions are identical and has an expected value of 0 if the partitions are independent; the closer the NMI value is to 1, the more similar the partitions are. The NMI function is described formally in Subsection 5.1. In the following, the set of employed base algorithms is introduced.

Base Methods
In this work, a wide spectrum of community detection methods, which we call base methods, is evaluated. The base methods come from a variety of different theoretical frameworks, as we try to select a comprehensive set of community detection algorithms exploiting some of the most interesting ideas and techniques developed over the last years. These base algorithms use different techniques such as modularity, label propagation, and random walks. Some of these algorithms guide their clustering by employing objective functions and some do not. The base method set includes CNM [19], Walktrap [54], CONCLUDE [22], Copra [35], Oslom [45], LPM [56], MCL [23], Infomap [57], and Louvain [16]. The implementation code of all the algorithms is publicly available in [3, 7, 9, 8, 11, 6, 1, 10, 4], respectively. Here is a short introduction of the base methods. CNM is based on the maximization of modularity. Walktrap uses a measure of similarity between vertices based on random walks, combined with modularity optimization. CONCLUDE generates random, non-backtracking walks of finite length to compute edge centralities. Copra is an extension of the label propagation technique. Oslom is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations. LPM simulates the spreading of labels based on the simple rule that at each iteration a given vertex takes the most frequent label in its neighborhood. MCL simulates a peculiar process of flow diffusion in a graph. Infomap uses a random walk as a proxy for information flow on a network and minimizes a map equation over all the network clusters. Louvain optimizes modularity by means of a hierarchical approach.

Fast projection method
In the following, a bipartite network is defined and our fast projection method (which is applicable to bipartite networks) is explained; then our algorithm is presented. A bipartite network is a graph whose nodes are divided into two disjoint sets such that no two nodes within the same set are adjacent. We call these node sets primary nodes and secondary nodes. In this work, the real nodes of the original network are the primary nodes, and community nodes are the secondary nodes of the bipartite network. Each community node is associated with a detected community and has links to its node members.
In the following, the construction steps of the bipartite network G_b associated with a simple network graph G are presented. G_b is initially assumed to be a null graph.
1. Add all the nodes of G to G_b as real nodes
2. Add a community node to G_b for each community detected by each base method applied on G
3. Connect each community node to all the real nodes of that community
4. Assign a weight to each link such that each community node divides weight 1 equally among its neighbors
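Steps 2-4 above can be sketched in a few lines of Python (names and data layout are ours; the real nodes of step 1 are represented implicitly through their appearance in the partitions):

```python
def build_bipartite(partitions):
    """Construct G_b following steps 2-4 above.

    partitions: one partition per base method, each a list of node sets.
    Returns (community_nodes, edges): community_nodes maps a community-node
    id to its member set; edges maps (community_id, real_node) to the
    link weight 1/|community| of step 4."""
    community_nodes, edges = {}, {}
    cid = 0
    for partition in partitions:          # step 2: one node per community
        for cluster in partition:
            community_nodes[cid] = set(cluster)
            w = 1.0 / len(cluster)        # step 4: weight 1 split evenly
            for v in cluster:             # step 3: link to all members
                edges[(cid, v)] = w
            cid += 1
    return community_nodes, edges

# Two base partitions of the node set {0, 1, 2, 3}
comms, edges = build_bipartite([[{0, 1}, {2, 3}], [{0, 1, 2, 3}]])
# three community nodes; each link of the size-4 community has weight 0.25
```

Note that the number of community nodes is exactly the sum of the partition sizes, matching the node count of G_b given in Subsection 3.1.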
The fast projection approach is based on sampling important links: most links in a weighted projection carry redundant information for community detection, so only the important, non-redundant links should be sampled. Communities are considered an indication of the similarity of real nodes. Therefore, we apply the fast projection technique to identify similar real nodes frequently visited in sequence by a random walker on the bipartite network. In this work, to expose the hidden directed relations between real nodes, the bipartite network is compressed by a one-mode projection called fast projection [42]. The projected network is a unipartite network that includes only the real nodes. Two nodes of the unipartite network are connected only if they have at least one common neighboring community node in the associated bipartite network. In the fast projection technique, a random walk is deployed to compress the information of the bipartite network G_b and produce the unipartite one G_u. A random walker on G_b moves from a random real node to a community node, then from the community node to another real node, passing two consecutive links. The walker is thus forced to pass through a community node in each step, which is (one of) the common neighbors of the two real nodes.
Here an illustrative example is given. Let C be a community detected in the original graph G by a base method, and let v_i and v_j be two real nodes belonging to C. In the process of building G_b, one community node labeled C and two links, from C to v_i and from C to v_j, are added to G_b (see Fig. 3a). Suppose C consists of five real nodes; then the weight of each link from C to the associated real nodes is 0.2. After generating G_b, fast projection builds G_u from G_b. Suppose a two-step random walker starts on G_b and walks from v_i to C in the first step and from C to v_j in the second step. Then the weight of the candidate link between v_i and v_j becomes 0.4 in G_u. Another candidate link may connect v_i and v_j with a different weight as a result of another community node, say D, detected by another base method. At the end of fast projection, v_i and v_j are added to G_u, the highest probability among the candidates for the pair (v_i, v_j) is selected as w_ij, and a permanent weighted link with weight w_ij is placed between v_i and v_j in G_u. Figure 2 clarifies these concepts. In detail, fast projection associates each community node with the top X real nodes selected by link weights. For each real node, we take the top X real nodes associated with each of its connected community nodes and include them in a candidate set. For each node in the candidate set, we compute the two-step random walk probability of reaching the other nodes in the candidate set and create links to the top Y nodes.
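The candidate-weight computation of the example can be sketched as follows. This is an assumption-laden simplification: the doubling rule below merely reproduces the 0.2 → 0.4 numbers of the worked example, and the top-X / top-Y selection is replaced by keeping the single highest candidate weight per ordered pair.

```python
def candidate_weights(community_nodes):
    """Candidate link weights for G_u from two-step walks
    real node -> community node -> real node.

    community_nodes: dict mapping a community-node id to its member set.
    A community of size s contributes weight 2 * (1/s) to each ordered
    pair of its members; for every pair only the highest candidate
    weight survives, as in the selection of w_ij above."""
    best = {}
    for members in community_nodes.values():
        w = 2.0 / len(members)            # mirrors 0.2 per link -> 0.4
        for u in members:
            for v in members:
                if u != v and w > best.get((u, v), 0.0):
                    best[(u, v)] = w
    return best

# One community C of five nodes, as in the example above
cand = candidate_weights({"C": {1, 2, 3, 4, 5}})
# every ordered pair of members gets candidate weight 2 * (1/5) = 0.4
```

A smaller community D containing the same pair would contribute a larger candidate weight, so the pair would keep D's value, matching the "highest probability" rule.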

Network datasets
To evaluate the performance of our method, two types of datasets are deployed: real and artificial. Zachary [66], Polbooks [2] and Football [33], comprising 34, 105 and 115 nodes respectively, are the real datasets. These datasets are well known, and most studies in the community detection field are tested on them. We also use the state-of-the-art artificial benchmark graphs with built-in community structure, i.e. the LFR benchmark [44]. The main reason for using benchmark graphs is the lack of ground truth information for the communities in real-world networks. LFR graphs are characterized by power-law distributions of vertex degree and community size, features frequently observed in real-world networks. An LFR graph has a clear community structure, so it serves as a baseline reference for a network with known and detectable structure. The LFR networks were created with the standard LFR code [5]. One parameter of an LFR network is the network size, which determines the number of nodes. Another is the mixing parameter µ, a measure of the degree of fuzziness of the clusters: it is the ratio of the external degree of each node (with respect to the node's cluster) to the total degree of the node, so it varies from 0 to 1. Values of µ close to zero correspond to well-separated clusters, whereas values near 1 indicate a system with very mixed communities (thus hard to identify).
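For readers who wish to reproduce such benchmarks, LFR graphs can also be generated with networkx's `LFR_benchmark_graph`. The parameter values below are illustrative only and are not the values of Table 1.

```python
import networkx as nx

# Generate one small LFR benchmark graph
# (networkx.generators.community.LFR_benchmark_graph).
n, tau1, tau2, mu = 250, 3, 1.5, 0.1  # size, degree exp., community-size exp., mixing
G = nx.LFR_benchmark_graph(n, tau1, tau2, mu,
                           average_degree=5, min_community=20, seed=10)

# The planted ground-truth community of each node is stored as a node attribute
communities = {frozenset(G.nodes[v]["community"]) for v in G}
```

The built-in ground truth in the `community` attribute is what makes the NMI comparison of Section 5 possible on these graphs.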

Evaluation criteria
We compare the different clustering algorithms with our approach according to two aspects. First, we are interested in the community structures that the various methods are able to find; basically, we would like to understand how well the algorithms perform in terms of NMI, whose maximization is the objective of this work. Second, we are interested in the quality of the clusters identified by the algorithms, which clarifies the behavior of the methods. These two aspects are explained in the following subsections.

Comparing community structures
Testing an algorithm on any graph with built-in community structure implies defining a quantitative criterion to estimate the goodness of the answer given by the algorithm as compared to the real answer that is expected. This can be done by using suitable similarity measures. For reviews of similarity measures see Refs. [18,28,48].
Here Normalized Mutual Information (NMI), borrowed from information theory [21], is selected, which has proved to be a reliable measure to assess our approach. To evaluate the Shannon information content [47] of a partition, one starts by considering the community assignments x_i and y_i, where x_i and y_i indicate the cluster labels of vertex i in partitions X and Y, respectively. One assumes that the labels x and y are values of two random variables X and Y, with joint distribution P(x, y) = P(X = x, Y = y) = n_xy / n, which implies that P(x) = P(X = x) = n_x^X / n and P(y) = P(Y = y) = n_y^Y / n, where n_x^X, n_y^Y and n_xy are the sizes of the clusters labeled by x, by y, and of their overlap, respectively. The mutual information I(X, Y) of the two random variables is defined as

I(X, Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ].

The measure I(X, Y) tells how much we learn about X if we know Y, and vice versa. Indeed, I(X, Y) = H(X) − H(X|Y), where H(X) = −Σ_x P(x) log P(x) is the Shannon entropy of X and H(X|Y) = −Σ_{x,y} P(x, y) log P(x|y) is the conditional entropy of X given Y. The mutual information is not ideal as a similarity measure: in fact, given a partition X, all partitions derived from X by further partitioning (some of) its clusters would have the same mutual information with X, even though they could be very different from each other. In this case the mutual information would simply equal the entropy H(X), because the conditional entropy would be systematically zero. To avoid this, Danon et al. adopted the normalized mutual information [21]

NMI(X, Y) = 2 I(X, Y) / (H(X) + H(Y)),    (3)

which equals 1 if the partitions are identical and has an expected value of 0 if the partitions are independent. The normalized mutual information is currently very often used in tests of community detection algorithms. In our experiments we use max(H(X), H(Y)) instead of (H(X) + H(Y))/2 in Eq. (3), which gives a strict version of NMI. Therefore, the following version of NMI is implemented in this work:

NMI(X, Y) = I(X, Y) / max(H(X), H(Y)).
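The strict NMI above can be implemented directly from the definitions. This is a sketch of our own: label lists play the role of the partitions X and Y, with labels_x[i] the cluster label of vertex i.

```python
from collections import Counter
from math import log

def nmi(labels_x, labels_y):
    """Strict NMI: I(X, Y) / max(H(X), H(Y)).

    labels_x, labels_y: cluster labels of the same n vertices.
    Returns 1 for identical partitions (up to relabeling) and
    ~0 for independent ones."""
    n = len(labels_x)
    px, py = Counter(labels_x), Counter(labels_y)      # cluster sizes n_x, n_y
    pxy = Counter(zip(labels_x, labels_y))             # overlap sizes n_xy

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    i_xy = sum(c / n * log((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())
    denom = max(entropy(px), entropy(py))
    return i_xy / denom if denom > 0 else 1.0

# Identical partitions up to label renaming give NMI = 1
score = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```

With the max normalization, splitting a cluster of X into smaller pieces raises H(Y) without raising I(X, Y), so the refined partition is penalized, exactly the behavior the strict version is chosen for.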

Quality Scores
In order to investigate more properties of the methods, we analyze the quality of the communities detected by the base methods and Mitra according to several criteria. In general, there are two criteria of interest when judging how good a cluster a set of nodes is: the number of internal edges between the members of the cluster, and the number of external edges between members of the cluster and the remainder of the network [46]. This work follows the framework presented in [46], which distinguishes two types of objective functions: the first combines both criteria (the number of internal edges and the number of edges crossing the boundary) into a single objective function, while the second employs only one of the two criteria (e.g., the number of edges cut). • Conductance: f(S) = c_S / (2 m_S + c_S) measures the fraction of total edge volume that points outside the cluster, where m_S is the number of edges inside S and c_S is the number of boundary edges [59,40]. Although various functions have been proposed to model community scores, we primarily work with conductance, arguably the simplest notion of cluster quality. Conductance can be thought of as the ratio between the number of edges leaving the cluster and the total edge volume of the cluster [59,40,46]. Many experiments have shown that conductance is the best scoring function for networks with well-separated and non-overlapping communities [65,46,26]. More formally, the conductance φ(S) of a set of nodes S is φ(S) = c_S / min(Vol(S), Vol(V \ S)), where c_S denotes the size of the edge boundary and Vol(S) = Σ_{u∈S} d_u, with d_u the degree of node u. In particular, more community-like sets of nodes have lower conductance. Further notions of community quality, based on a single one of the two criteria mentioned at the beginning of this subsection, compare the number of internal edges with the expected number of edges between the nodes of S in a random graph with the same node degree sequence. The most popular quality function of this kind is the modularity of Newman and Girvan [51].
Modularity measures the number of internal community edges relative to a null model of a random graph with the same degree distribution. It is based on the idea that a random graph is not expected to have a cluster structure, so the possible existence of clusters is revealed by comparing the actual density of edges in a subgraph with the density one would expect if the vertices of the graph were attached regardless of community structure. This expected edge density depends on the chosen null model, i.e. a copy of the original graph keeping some of its structural properties but without community structure. Although modularity has been shown to have a resolution limit [29], some of the most popular clustering algorithms use it as an objective function [16,63]. Modularity is given by

Q = (1 / 2m) Σ_{ij} [ A_ij − k_i k_j / (2m) ] δ(C_i, C_j),    (5)

where the sum runs over all pairs of vertices, A is the adjacency matrix, m is the total number of edges of the graph, and k_i represents the degree of node i. The δ-function yields one if vertices i and j are in the same community and zero otherwise. Since the only contributions to the sum come from vertex pairs belonging to the same cluster, it is convenient to group these contributions together and rewrite the sum over vertex pairs as a sum over the clusters [28]:

Q = Σ_{S=1}^{k} [ m_S / m − (d_S / 2m)² ],    (6)

where k is the number of clusters, m_S is the total number of edges joining vertices of module S, and d_S is the sum of the degrees of the vertices of S. In Eq. (6), the first term of each summand is the fraction of edges of the graph inside the module, whereas the second term represents the expected fraction of edges that would be there if the graph were a random graph with the same expected degree for each vertex. The reader is referred to Ref. [26] for a good example of modularity computation.
A higher modularity is often taken as an indication of a better community structure, as it deviates more strongly from the random null model. As such, modularity can only be used as a comparative measure [20] and has drawbacks: for example, random networks usually have higher maximum modularity than real-world networks, and modularity suffers from a resolution limit [29]. It can be shown that modularity optimization, i.e. finding the community structure of maximum modularity, is an NP-complete problem [17]. It is also possible to show that modularity has many local maxima [34], making the identification of a global maximum very difficult.

Experimental results
Experimental comparisons of different community detection algorithms are presented in this section. We compare the Louvain, Infomap, CNM, Walktrap, LPM, MCL, Copra, Oslom and CONCLUDE algorithms with our Mitra algorithm, on real datasets and on LFR benchmark graphs. Infomap is used as the last base community detection method in our implementation; however, as explained in Section 3, any of the other base methods could serve as the last base method. In the following, the implementation of the LFR benchmarks is explained and the NMI results are provided. After that, the quality-score results for real and artificial networks are reported.

LFR Dataset Implementation
Each LFR network is characterized by parameters such as the network size N and the mixing parameter µ, which determine the structural properties of the network. We run our method on benchmark graphs with three network sizes, N = 1000, 5000 and 10000 nodes. The mixing parameter µ takes values in the range {0.1, 0.2, ..., 0.9}. We generate 100 realizations of the LFR benchmark for each value of µ and for each network size, creating 2700 graphs in total (i.e. 100 samples for each pair of N and µ values). In other words, 900 network samples are produced for each network size (i.e. 100 samples for each value of µ). All the values in the diagrams and tables are averaged over these 100 samples. All the generated networks are undirected and unweighted. Table 1 shows the parameters used for the generated networks. The construction method of an LFR network is provided in [44], and the selected parameter values are based on the literature [44,45]. In LFR graphs, the node degree (i.e. the number of edges connected to a node) and the community size (i.e. the number of nodes belonging to a community) follow power-law distributions. As shown in Table 1, the community sizes are taken from a power-law distribution with exponent β, and each node is given a degree taken from a power-law distribution with exponent γ. The remaining parameters vary with the graph size N (see Table 1).
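The power-law sampling of degrees and community sizes used by LFR-style benchmarks can be sketched with inverse-transform sampling on a truncated power law. This is a generic illustration assuming an exponent different from 1, not the actual LFR generator of [44].

```python
import random


def sample_power_law(exponent, x_min, x_max, n, seed=0):
    """Draw n integers in [x_min, x_max] with P(x) proportional to x^(-exponent).

    Uses inverse-transform sampling on the truncated continuous power law,
    then floors to integers. Assumes exponent != 1 (otherwise a != 0 fails).
    """
    rng = random.Random(seed)
    a = 1.0 - exponent
    lo, hi = x_min ** a, (x_max + 1) ** a
    samples = []
    for _ in range(n):
        x = (lo + rng.random() * (hi - lo)) ** (1.0 / a)
        # Clamp to guard against floating-point rounding at the endpoints.
        samples.append(min(x_max, max(x_min, int(x))))
    return samples
```

For example, `sample_power_law(2.0, 10, 50, 1000)` could stand in for drawing node degrees with γ = 2 between k_min = 10 and k_max = 50; small values dominate, as expected of a power law.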

NMI results
Experimental results on both real and artificial datasets are presented in this section. Table 2 reports the NMI values for all the base methods, for Mitra, and the average value produced by the base methods. All the methods are tested on the Zachary, Polbooks and Football datasets, three well-known datasets in the community detection field, containing 34, 105 and 115 nodes respectively. The Average column in the table reports the mean NMI value produced by the base methods. Table 2 shows that all the methods perform almost identically; it does not show a clear advantage for Mitra over the base methods. However, these datasets are special cases, and it is necessary to evaluate our method on a more varied collection of datasets. Since artificial datasets are used in most research in our field, we analyze them in the following part.
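The NMI values reported here compare a detected partition against the reference partition via the joint label counts. A minimal sketch follows, using the common normalization by the arithmetic mean of the two partition entropies (one of several normalization conventions in the literature):

```python
import math
from collections import Counter


def nmi(part_a, part_b):
    """Normalized Mutual Information between two partitions of the same nodes.

    Each partition is given as {node: community_label}.
    Returns 2 * I(A; B) / (H(A) + H(B)), in [0, 1].
    """
    n = len(part_a)
    ca = Counter(part_a.values())                      # cluster sizes in A
    cb = Counter(part_b.values())                      # cluster sizes in B
    joint = Counter((part_a[v], part_b[v]) for v in part_a)
    mi = sum(n_ab / n * math.log(n * n_ab / (ca[a] * cb[b]))
             for (a, b), n_ab in joint.items())        # mutual information
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    return 2 * mi / (h_a + h_b) if h_a + h_b > 0 else 1.0
```

Identical partitions score 1; a partition that lumps every node into one community carries no information about the reference and scores 0, which matches the degenerate cases discussed below for some base methods.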

LFR dataset results
In the following, the partitions produced by the base methods are compared with the ground-truth (reference) partition according to NMI. The base methods are Louvain, Infomap, CNM, Walktrap, LPM, MCL, Copra, Oslom and CONCLUDE. The comparison is carried out on the LFR benchmark graphs. All the base methods are applied to the original graph (i.e. G), whereas Mitra is applied to the consensus graph (i.e. G_u). The average curve is the average NMI value of all the base methods. Figures 3-5 report the results; for the larger networks the community sizes range between 20 and 100 nodes, whereas in the diagrams corresponding to N = 1000 nodes the range goes from 10 to 50 nodes (see Table 1).
As an important point, Figures 3b, 4b and 5b show that for all networks the average curve of the base methods always lies below the Mitra curve. This means that Mitra is not a simple average of the base methods and performs better than the average, especially when µ increases and the networks become more complicated. This is a notable conclusion: it shows that the base methods reinforce each other through the designed consensus; in other words, Mitra reinforces the strong points of the base methods. The distance between the NMI of Mitra and the average NMI of the base methods is close to 0.2 for the most complicated networks, and the difference is even larger for µ = 0.7. The results show a quality enhancement of about 0.2 in NMI for the most complicated networks, where µ goes from 0.6 to 0.9. The same or a greater distance is seen for larger networks. In other words, Mitra works better for more complicated networks, which demonstrates its power.
The NMI diagrams (Figures 3a, 4a and 5a) show that Oslom, CONCLUDE and MCL perform well. However, for larger values of µ, MCL assigns almost every node to a separate community, which is not informative; Oslom shows this behavior for N = 1000 as well. Another notable property of Mitra appears here: Mitra produces solutions with an acceptable structure and a number of communities close to the ground truth in all the networks. Moreover, Mitra masks the effect of the low-quality solutions of some base methods, which shows its superiority over the base methods. This is observable, for example, in Figure 3a: for N = 1000 and µ = 0.9, the number of ground-truth communities is 45; the corresponding number is 130 for Mitra, but 927 for MCL and 1000 for Oslom. The same conclusion holds for MCL in the cases of N = 5000 and N = 10000. As another example, for MCL with µ = 0.9 and N = 10000, the NMI value is 0.56, but the number of communities is 10000 (each node is in a separate community), whereas the number of communities is 1459 for Mitra and 190 for the ground truth. CONCLUDE has a number of communities close to that of Mitra and a close NMI value in all the networks in Figures 3-5. To avoid excessive detail and keep this paper consistent, the numbers of communities are not reported in full. We now investigate the properties and behavior of the base methods. To improve readability, Figures 3a, 4a and 5a are referred to as the NMI diagrams in the following. Generally, the NMI diagrams show that all methods start very well, with NMI values close to 1 for smaller values of µ, but all follow a decreasing trend and approach zero for larger values of µ. For example, in Figure 3a almost all methods find the correct clusters for small µ; as µ grows, the networks become more complicated and the chance of finding the correct clustering decreases.
A comparison of Figures 3-5 shows that almost all of the algorithms perform better on larger networks. For example, Oslom is not stable for N = 1000 (see Figure 3a), but it is stable and performs well in Figures 4a and 5a; for this reason, Oslom is removed from the ensemble for the case of N = 1000 nodes. As another example, for N = 1000 the performance of LPM and Copra drops to zero starting from µ = 0.6 (see Figure 3a), but for N = 5000 and 10000 (Figures 4a and 5a) their performance reaches zero only from µ = 0.7 and µ = 0.8 respectively. As illustrated in Figure 3a, most algorithms perform well except CNM. The diagrams also show that the performance of MCL decreases faster than that of the competing methods. Nevertheless, in terms of NMI, MCL performs better than the other methods for the largest values of µ (e.g. µ = 0.9): the NMI of the other methods drops to 0 faster, while the NMI of MCL stays above 0. The NMI diagrams show that CNM always generates an approximately linear curve, which is almost the worst among the base methods. Moreover, the performance of LPM and Copra reaches zero before that of all the other base algorithms. On the other hand, some methods, e.g. CONCLUDE, are able to find communities even in the most difficult cases, and their NMI never becomes zero. CONCLUDE is one of the best-performing methods according to the NMI diagrams, and it also works well according to the quality scores reported in a later subsection.
The NMI diagrams show that NMI becomes zero for some methods. In these cases, the method is unable to find meaningful clusters and, in some cases (e.g. Copra), groups all the nodes into one cluster. It is also interesting that although the NMI of Infomap on G becomes zero at some points (i.e. the method is unable to find the clusters of G), Infomap is still able to find a good partition of the consensus graph G_u when applied as the last base method. Notably, most of the observations are similar across Figures 3-5; they differ only in some details. Therefore, we do not repeat our reasoning for the larger networks.
Finally, we conclude from these findings that employing good base methods is necessary for obtaining a good consensus result: removing the better-quality base methods produces a weak consensus method. Moreover, using a weaker base method as the final algorithm A produces weak results, even if the initial base methods generate results of sufficiently good quality.

Quality scores results
To further investigate the properties of the methods, this part presents the quality-score plots for all the methods. These functions model community scores. For each benchmark network, the average value of each score is computed over all the communities detected by each community detection method. This computation is repeated for all 100 samples of each network configuration, and the mean value over these 100 samples is plotted for each score as a function of µ. A lower value of a score f(S) signifies a more community-like set of nodes (except for modularity and modularity ratio, where larger values are better). The score results are reported for real and artificial datasets.
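The two-level averaging described above (mean score over the communities of one network, then mean over the realizations of one configuration) can be sketched as follows; `score_fn` stands in for any of the community scores and is a hypothetical placeholder name.

```python
def mean_score_per_network(score_fn, adj, communities):
    """Average of score_fn over all detected communities of one network."""
    vals = [score_fn(adj, S) for S in communities]
    return sum(vals) / len(vals)


def mean_score_over_samples(score_fn, samples):
    """Average of the per-network means over all realizations.

    samples: list of (adj, communities) pairs, e.g. the 100 realizations
    generated for one (N, mu) configuration.
    """
    return sum(mean_score_per_network(score_fn, adj, comms)
               for adj, comms in samples) / len(samples)
```

One such averaged value per (N, µ) configuration yields a single point on each score curve of the diagrams.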

Real datasets results
As described earlier, three real datasets are selected for evaluation. Tables 3, 4 and 5 show the quality-score results for the real datasets. In these tables, for the first five scores, smaller values signify a more community-like set of nodes; for modularity and modularity ratio, larger values represent better community-like groups. The tables do not show a clear superiority for Mitra; nevertheless, they are provided for completeness. It is apparent that all these scores try to capture the same basic intuition: they reward sets of nodes that have many internal edges and few external edges pointing to other clusters. Although the score definitions differ in value, they are highly correlated [65]. Figures 6-8 illustrate that for simpler networks (i.e. smaller µ values) all the methods behave similarly, whereas for the more complicated networks (i.e. larger µ values) the methods produce diverse results and exhibit dispersion. In the following, the focus is on Figures 6-8, for N = 1000, 5000 and 10000 nodes respectively; these figures are referred to as the score diagrams. The diagrams illustrate that cut ratio and expansion are highly correlated scoring metrics, as are conductance and normalized cut; this confirms the conclusion of Ref. [65]. Consequently, cut ratio and normalized cut are not investigated further.
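The expansion and cut-ratio scores discussed here are single-criterion functions of the boundary edge count c_S, following the definitions in [46]. A minimal sketch (the adjacency-dictionary representation is an illustrative assumption):

```python
def expansion(adj, S):
    """Expansion f(S) = c_S / n_S: boundary edges per node of S."""
    S = set(S)
    c_S = sum(1 for u in S for v in adj[u] if v not in S)  # boundary edges
    return c_S / len(S)


def cut_ratio(adj, S):
    """Cut ratio f(S) = c_S / (n_S * (n - n_S)): fraction of all possible
    boundary edges of S that actually exist."""
    S = set(S)
    n = len(adj)
    c_S = sum(1 for u in S for v in adj[u] if v not in S)
    return c_S / (len(S) * (n - len(S)))


# Example: two triangles joined by a bridge edge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

Since both scores are scalar multiples of the same boundary count c_S for clusters of similar size, their curves move together, which is consistent with the correlation between cut ratio and expansion observed in the score diagrams.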
Considering the score diagrams, the ground truth and Mitra show very similar trends in all the diagrams, for all network sizes, whereas MCL, Copra, Infomap and LPM show different trends. Interestingly, this means the Mitra clusters are similar to the reference clusters, confirming that Mitra performs well according to the quality scores.
In the score diagrams, conductance has an increasing trend. Although smaller values are preferred, a value of zero means that all the nodes are placed in one cluster, and such a clustering is not informative. This is seen in LPM, Copra and Infomap for higher values of µ, and in CNM for generated networks of all sizes. As another point, MCL and Copra show the worst conductance among all the base methods, and their distance to the other methods is considerable in all figures. This means that for MCL and Copra, the fraction of edge volume leaving the clusters rises quickly. In the case of expansion (Figures 6b, 7b and 8b), MCL is again the worst-performing method: for MCL, the normalized number of edges pointing outside the clusters rises quickly. Although MCL performs well according to NMI, a deeper investigation of its results makes the method unfavorable with respect to the number of clusters found and the quality scores. As illustrated in Figures 6c, 7c and 8c, the internal-density values are dispersed across the different methods, although, as with the other scores, the base methods show a similar behavior, especially for simpler networks. CNM behaves differently, and its distance to the other methods increases in all of the networks with N = 1000, 5000 and 10000; this occurs more often in larger networks, where CNM shows many fluctuations in its values. Generally speaking, CNM is not preferred for any of the networks according to NMI and internal density. Moreover, the edges-cut of CNM is very high and extremely far from the other methods in all of the figures, which is not favorable. Another general observation is that modularity and modularity ratio tend to decrease roughly monotonically (Figures 6f, 7f, 8f, 6g, 7g and 8g).
This is not surprising, as the networks gradually become more complicated and the algorithms become unable to find the correct communities. Although the modularity-ratio values of the methods are clearly different, the methods show very similar behavior. This suggests that modularity ratio is not a practical measure for detecting communities and may therefore not be preferred. Considering edges-cut, some methods (e.g. Oslom and CNM in Figure 6h) produce a wide range of values, from a few edges to more than 10000; for example, edges-cut is 7500 for CNM and 4200 for Oslom in Figure 8h. This diversity of edges-cut values shows that the structure of the clusters produced by Oslom and CNM differs from that of the other methods. Another important result is that, for larger networks (e.g. N = 100,000), Walktrap is not able to finish its execution and halts, and CONCLUDE generates a solution only after a very long time (several days) for these network sizes. We can also draw conclusions about the general behavior of the base algorithms: according to NMI and the quality scores, almost all of the algorithms follow the same trend for lower values of µ, although they use different definitions of communities and apply different strategies to find them.

Conclusion and Future work
In this work, we proposed Mitra, a new ensemble clustering approach that enhances community detection by employing bipartite networks. The proposed method fuses the results of base community detection methods and provides a more accurate community structure for a network. We performed an empirical comparison and evaluated the quality of the communities identified by a variety of community detection algorithms. Moreover, we compared the performance and quality of the methods according to Normalized Mutual Information (NMI) and extended our study by evaluating several community scores.
We can draw conclusions about the general behavior of our Mitra method. The method performs well, in the sense that its NMI never becomes zero and is far better than the average performance of the employed base methods. This confirms our initial assumption that, with the Mitra approach, the weak points of one base method are covered by the strengths of the others; Mitra thus helps mitigate the limitations of the base methods. Moreover, the experimental evaluations show that Mitra is not a simple average of the base methods and is far better than the average, especially when the networks become more complicated. This remarkable conclusion shows that Mitra reinforces the strong points of the base methods. Furthermore, Mitra produces solutions with an acceptable structure and a number of communities close to the ground truth in all the networks; in other words, its behavior is very close, in both trend and value, to the ground truth.
As a final point, all base methods currently play the same role in Mitra. It would therefore be appealing to assign different priorities to the base methods in order to find a better ensemble method; we plan to investigate this in future work. Moreover, employing more evaluation criteria for quality estimation in community detection methods is part of our plan for continuing this work.