Automated Noise Detection in a Database Based on a Combined Method

Data quality has several dimensions, of which accuracy is the most important. Data cleaning is one of the preprocessing steps in data mining and consists of detecting errors and repairing them. Noise is a common type of error that occurs in databases. This paper proposes an automated noise detection method based on k-means clustering. First, each attribute (A_j) is temporarily removed from the data and k-means clustering is applied to the remaining attributes. Next, the k-nearest neighbors algorithm is used within each cluster, and a value for A_j is predicted in each record from its nearest neighbors. The proposed method detects noisy attributes using the predicted values. It is able to identify several noisy values in a single record and can detect noise in fields of different data types. Experiments show that the method detects 92% of the noise in the data on average. The proposed method is compared with a noise detection method based on association rules; the results indicate that it improves noise detection by 13% on average.


Introduction
The steady growth of data has left organizations facing large volumes of data as well as heterogeneous and distributed sources. These data are used in decision-making and knowledge acquisition. The potential business value of these decisions depends on the quality of the data used to make them [31]. When data are transferred from different sources and systems to another system, errors or problems such as heterogeneous formats or domains may arise. Making decisions based on low-quality data not only damages the structure of organizations but also imposes high costs on them. According to studies, more than 30% of real-world data is of poor quality, causing losses of three trillion dollars per year to the US government [26]. That is why data quality is highly important in data sources. Although preparing high-quality data is time-consuming and costly [8], it is significantly better than making mistakes because of low-quality data.
Data quality has various dimensions, of which accuracy is the most important. Problems such as noise, incompleteness, inconsistency, and missing values cause this dimension to be violated. Data correction is a process that detects noisy, incomplete, and inconsistent data and improves data quality by correcting the detected errors [15]. The data correction procedure may be tedious and time-consuming; however, it cannot be overlooked [30]. Given the high volume of data, interactive correction approaches are inapplicable, and automated approaches are therefore required.
Data mining is a key technique for data correction [15]. Various approaches have been proposed for data correction using ontologies [5], classifiers such as decision trees [17,16] and neural networks [4], rules [26,13,19], ensemble learning [4], Markov logic networks [10], functional dependencies [12], and conditional correction. Some of these approaches can detect and correct only a particular type of data [17,16], whereas others can detect and correct all types of data. In this section, some of these approaches and their disadvantages are explained.
The method proposed in [5] uses an ontology to detect valid and invalid values in records. Invalid values are detected and presented to a user together with all valid values existing in the ontology. The system then asks the user to select a value for the erroneous field from among the valid values. The values selected by the user are saved together with the invalid values. When the number of erroneous fields corrected by the user reaches a predetermined threshold, rules are extracted from them. Thereafter, the extracted rules are used to correct fields that the ontology detects as erroneous. As the method needs to interact with the user in the first phase, it is not appropriate for large amounts of data.
In [22] an approach has been proposed to impute missing data using grey based fuzzy c-means, mutual information based feature selection, and regression model. At first, this algorithm finds the priority of each missing attribute. Then, data are clustered using Fuzzy c-Means. Next, the algorithm chooses clusters which satisfy a minimum condition. After that, mutual information is used to select highly dependent features of instances within each cluster. To provide estimations for a missing value, regression models will be applied to the selected features. Finally, the missing value is imputed through a weighted average of estimated values obtained from the previous step.
The CMIM † method estimates missing values using correlation maximization [23]. At first, a base set is selected from the complete instances. This method uses ten algorithms to maximize correlation. Each maximization algorithm attempts to find subsets of the data that are strongly correlated with the missing attribute of an incomplete instance. Finally, the method imputes the missing value by applying a regression model to the highly correlated subset. The DMI method has been developed based on the EM ‡ algorithm and the C4.5 decision tree [17,16]. This method is used to impute missing values. For this purpose, records with missing values are detected and a decision tree is built for each attribute using the complete records. The decision tree classifies records into several leaves in such a way that the records in a leaf have the same class label. Records containing missing values are assigned to each tree, and the label of the related leaf is selected as the correct value. This method can correct only missing values and is limited to numeric and categorical data.
The SiMI algorithm is an extension of the DMI algorithm [17]. In this method, the decision tree is replaced by a decision forest. After building the trees, the method searches for the intersections of different leaves. SiMI merges the smallest intersection with the intersection whose records are most similar to it. In addition, it maximizes the dependency between attributes when intersections are merged. In the last step, SiMI assigns each record with a missing value to an intersection in order to find a replacement value. As this algorithm is an extension of DMI, it can correct only missing values of numeric and categorical types.
In [5], the researcher introduces bagging predictors as a cleaning method, in which multiple versions of a predictor are generated and a final prediction is made based on the votes of all predictors. The multiple predictors are formed by repeatedly bootstrapping the training set; the resulting samples are then used as new training sets. In this method, the correction and detection phases cannot be separated. Therefore, the user is not able to employ another approach for the correction phase. The SCARE § algorithm combines machine learning algorithms with likelihood to correct erroneous data [32] using correct values. To use likelihood, two criteria are employed: maximizing the likelihood benefit and minimizing the cost of changes. For this purpose, a classifier is built for each attribute and the value of each dirty attribute is predicted. In the next step, a graph is constructed for each record, where the nodes are predicted values and the edges represent the dependency between attributes. After constructing the graph, the node whose edges have the minimum weight is deleted repeatedly until only one value of each attribute remains. This method assumes that the error has already been identified, and the approach only corrects erroneous records.

† Correlation Maximization-based Imputation Methods
‡ Expectation Maximization
§ SCalable Automatic REpairing

The algorithm provided in [26] uses predefined rules for the correction of an inconsistent data source. These rules consist of three components. The first and second components form the left side of the rule and the third forms the right side. The first part, called the evidence pattern, indicates attributes that are related to each other. The second part of the fixing rule consists of negative patterns indicating wrong values of attributes. The last part is the actual value, indicating the correct value for each wrong value. At first, each attribute in the evidence pattern is selected as a key. The keys are then saved together with their corresponding values in a list. In each record, the keys saved in the list are searched. If they match, the second part of the rule is considered. If the second part of the rule is found in the record, the record is corrected by the third part of the rule. This procedure is repeated for all records.
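The matching procedure described above can be sketched in Python as follows. The rule representation (a dictionary with the hypothetical keys `evidence`, `attribute`, `negative`, and `fix`) and the sample records are assumptions made for illustration only; they are not taken from [26].

```python
def apply_fixing_rule(record, rule):
    """Return a copy of the record, repaired by the rule if the rule fires."""
    # Evidence pattern: every listed attribute must carry the listed value.
    if any(record.get(attr) != val for attr, val in rule["evidence"].items()):
        return dict(record)
    # Negative pattern: the target attribute holds one of the known wrong values.
    if record.get(rule["attribute"]) in rule["negative"]:
        repaired = dict(record)
        repaired[rule["attribute"]] = rule["fix"]  # actual (correct) value
        return repaired
    return dict(record)


# Hypothetical rule and records, used only to exercise the function.
rule = {
    "evidence": {"country": "China"},
    "attribute": "capital",
    "negative": {"Shanghai", "Hongkong"},
    "fix": "Beijing",
}
records = [
    {"country": "China", "capital": "Shanghai"},   # evidence and negative match
    {"country": "China", "capital": "Beijing"},    # already correct
    {"country": "Canada", "capital": "Shanghai"},  # evidence does not match
]
repaired = [apply_fixing_rule(r, rule) for r in records]
```

Only the first record is changed here, because it is the only one that matches both the evidence pattern and a negative pattern.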
In [13], an approach has been introduced for correcting inconsistencies using rules. The rules existing in a data source are extracted using available algorithms. Then, the confidence of each rule is calculated. Any transaction violating these rules is suspected of being an error. The algorithm assigns a score to each transaction based on the number of rules it violates, so that a transaction which violates more rules receives a higher score. Finally, the transactions are displayed to users together with their scores, enabling the users to decide about each transaction based on its score. As this method requires an expert in the correction phase, it is not appropriate for large amounts of data.
In [27,28,29], the authors try to identify attributes that are suspected of being noisy and correct them with a polishing algorithm. This algorithm has two phases: prediction and adjustment. In the prediction phase, one of the classification algorithms is selected, and the data are split by ten-fold cross-validation. In this phase, the value of the considered attribute is predicted for each record by each of the ten classifiers. If the original value in a record is inconsistent with the predicted value, the predicted value is saved for correction in the next phase. In the adjustment phase, ten-fold cross-validation is performed on the attributes. Then, the incorrect value of each record identified in the previous phase is corrected based on the values predicted by the ten classifiers. The correct value can be selected only from the values predicted in the previous phase. In this method, the correction and detection phases cannot be separated, and therefore a user cannot employ another approach for the correction phase.
In [14], the constraints of databases have been employed for the correction of inconsistencies. At first, constraints existing in a data source are extracted, and then, records violating constraints are found. Records having inconsistencies are corrected by a greedy algorithm. As the extraction of constraints requires the interaction with a user, this method is not efficient for a large amount of data.
The method introduced in [15] is a rule-based approach for error detection. At first, all attributes are converted to binary form. After the binarization of attributes, the rules with the minimum support are extracted from the data source. Then, the number of times each rule is violated by the data of that source is calculated. Rules that are violated more often than a specified threshold are deleted. In the next step, rules having a parent are deleted. That is, if the rule set contains two rules of the form x, y → z and x → z, then x → z is considered the parent of x, y → z, and x, y → z is therefore deleted from the rule set. Thereafter, the number of rules violated by each record is calculated. Records that violate more rules than a threshold are detected as erroneous records.
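The parent-pruning and record-scoring steps can be illustrated with a short Python sketch. It assumes that each rule is stored as an (antecedent, consequent) pair over binarized items and that a record violates a rule when it contains the whole antecedent but not the consequent; both the representation and the sample data are assumptions for illustration, not details given in [15].

```python
def prune_children(rules):
    """Drop every rule that has a parent, i.e. another rule with the same
    consequent whose antecedent is a proper subset of its own."""
    return [
        (ante, cons)
        for ante, cons in rules
        if not any(c2 == cons and a2 < ante for a2, c2 in rules)
    ]


def violations(record_items, rules):
    """Count the rules whose antecedent holds in the record but whose consequent does not."""
    return sum(
        1 for ante, cons in rules
        if ante <= record_items and cons not in record_items
    )


# Hypothetical rule set: x, y -> z has the parent x -> z and is removed.
rules = [
    (frozenset({"x", "y"}), "z"),
    (frozenset({"x"}), "z"),
    (frozenset({"w"}), "y"),
]
kept = prune_children(rules)

# Hypothetical binarized records, each given as the set of items that are true.
records = [{"x", "z"}, {"x", "w"}, {"w", "x", "y", "z"}]
threshold = 1
flagged = [i for i, items in enumerate(records) if violations(items, kept) > threshold]
```

With these inputs only the second record violates more than one of the remaining rules, so it is the only one flagged.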
The method proposed in [7] repairs data using a knowledge base and crowdsourcing. The table containing errors, the knowledge base, and the crowd are the inputs of this method. At first, a table pattern is created for mapping the data table to the knowledge base. Then, the best table pattern is chosen using crowdsourcing. After that, the data are categorized into three classes: (1) correct tuples that are identified using the knowledge base, (2) correct tuples identified using the knowledge base and the crowd, and (3) dirty tuples identified using the knowledge base and the crowd. Finally, the top k mappings from the knowledge base are presented as correct values for the erroneous data. This method uses an expert if there is not enough information for selecting the k corrections.
A repairing method based on constraints has been proposed in [25]. The method finds a minimum data repair that satisfies at least one of the constraints which has more variety than the other constraints. It employs both predicate insertion and deletion for repairing constraints. Compared to the related approach with the trust parameter that controls the portion of trusted data, it can correct numeric data types.
The interaction between data correction and record matching has been studied as a new problem in [9]. This approach indicates that data correction can help data matching effectively. The method presents a framework that combines correction and matching to obtain a corrected data set based on constraints, verification rules, and master data. It uses conditional functional dependencies and matching dependencies as constraints for detecting inconsistencies. The framework proposes three algorithms for error identification.
In [2] a novel data repairing approach is proposed based on constraints and ensemble learning. The proposed approach consists of two main steps. In the first step, functional dependencies are extracted. Then, noisy records that violate functional dependencies more often than a threshold are identified. After that, the repeated values are extracted from the consistent records for each FD. The repeated values are used to detect noisy attributes. In the second step, an ensemble learning model is used to correct the noisy attributes.
A cost-based model has been presented in [6] in which data and constraints are treated equally. The model uses functional dependencies as constraints and proposes two separate algorithms for repairing data and constraints. The data repair algorithm looks for correct values for inconsistent records that require a minimum number of changes to the original data. The model proposes two approaches for correcting records that are inconsistent with a functional dependency. In the first approach, it finds a repair for two inconsistent records which have different values for the right-side attribute of the functional dependency; the attribute values in these records are then set equal. In the second approach, it sets different values for the left-side attribute of the functional dependency in these records. The algorithm for repairing inconsistencies in functional dependencies has two approaches, too. In the first approach, it searches for an attribute which, when added to the functional dependency, removes the inconsistency. In the second approach, a set of attribute values is selected from records satisfying the functional dependencies. These values determine the subset of records that are consistent with the functional dependencies. Then a cost-based model is run to select the best correction. The main goal of this model is to select a correction with minimum changes.
A functional dependency based integration system has been introduced in [24] for inconsistent data identification. This approach corrects data or functional dependencies for fixing inconsistencies. It uses three techniques for functional dependencies correction: adding attributes to a functional dependency, transforming a functional dependency to a conditional functional dependency, and redundancy identification in a functional dependency. The system consists of four components: violation detector, data repair generator, constraint repair generator and unified repair engine. At first, data and functional dependencies are sent to the violation detector module to identify inconsistent data. Then, inconsistent records with each functional dependency are detected. The inconsistent record is passed to the data repair generator and violated functional dependency is sent to the constraint repair generator. Data repair module compares records patterns and passes a set of corrections to the repair engine. Functional dependency correction module sends a set of corrections to the repair engine. Correction engine selects a repair with minimum cost.
Functional dependencies have been used widely for error detection. Different repairs can be employed for each identified inconsistency but just one repair has to be chosen as the final repair. To find the optimal repair, a cost and diversity function are used as two parameters. The algorithm provided in [12] uses both parameters as its objectives for the first time. To compute diversities, a distance function is used that computes dissimilarities between records. To calculate costs, the number of changes in the original data source is computed.
In [3], a method has been proposed for correcting inconsistencies by sampling from the repair space of conditional functional dependencies. This method proposes more than one value for correcting each inconsistency. Firstly, it detects the clean cells of each tuple. Then, for each functional dependency, it generates a set of consistent cells. Finally, it randomly selects an efficient correction from the space of cardinality-set-minimal repairs using the sampling algorithm. This method can apply user-determined constraints in addition to the functional dependencies and conditional functional dependencies for corrections. It only corrects inconsistencies and does not delete the records that contain them.
In [11] a novel repair diversification approach has been presented. The aim of this approach is to generate a set of repairs that minimize the cost and maximize the diversity. The generated repairs are dissimilar to each other in order to prevent redundancy. In this approach, a user defines a parameter to keep a balance between cost and diversity.
In [20] a framework is introduced for data repairing by means of a probabilistic inference engine. It unifies integrity constraints, external data, and quantitative statistics to repair errors in structured data sets. Probability theory is used to combine these methods. The framework generates a probabilistic model to detect inconsistencies over the records in the data set. Statistical learning and probabilistic inference are used to clean the errors.

The methods mentioned above differ in whether they are automated or semi-automated, in their ability to detect noise, and in the type and number of errors they can detect. This paper aims to propose an automatic noise detection method for data sets that contain different data types. In the next section, we first detail our approach and then compare it with the approach presented in [15]. This method was selected for comparison because it detects noise automatically and separately in different types of data.

Proposed Method
In this section, we propose an automated noise detection method based on k-means clustering and k-nearest neighbors classification. At first, each attribute (A_j) is temporarily removed from the data and k-means is applied. After that, in each cluster, the k-nearest neighbors algorithm is used to predict a value for A_j in each record. Records in which A_j has an incorrect value are detected as noisy. The procedure is applied to all attributes. The proposed method is able to detect noise in different data types. It only detects noise; any other approach can be employed to correct it. The method consists of four phases, explained in the following subsections. Algorithm 1 illustrates subsections 3.1 to 3.3, which form the main part of the proposed algorithm. In the next subsections, each phase of Algorithm 1 is explained.

First Phase: Clustering
Suppose that the input data set (D) has been defined on a set of attributes {A_1, A_2, ..., A_m}. In this paper, D_ij refers to the j-th attribute of the i-th record in D. In this phase, k-means clustering is applied once for each of the m attributes. That is, for each attribute A_j, A_j is temporarily taken out of the data source and k-means clustering is applied to the other attributes (all attributes except A_j). To select k, one of the cluster validity indices [1] can be employed. In this paper, the Silhouette Index has been used to determine the number of clusters [21]. After each clustering, the second, third, and fourth phases are carried out for the attribute. Specifically, in this phase, one attribute (A_j) is removed from the data source in order to detect records having incorrect values for their j-th attribute. In this paper, it is assumed that no master data or prior knowledge about the examined data set is available for detecting noisy fields. k-means clustering is therefore used to find similar records. In the data source, attributes other than A_j may also be incorrect in some records, so some correct attributes may be detected as noise. To solve this problem, the proposed approach is iterated (fourth phase), because k-means produces different outputs in different iterations on the same data source. Lines 8 and 9 of Algorithm 1 show the first phase of the algorithm. In line 8, the considered attribute is deleted from the data. In line 9, k-means clustering is carried out on the other attributes.
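As an illustration of this phase, the following Python sketch removes one attribute and clusters the remaining ones with a small hand-written k-means. Several points are assumptions rather than details from the paper: the function names are hypothetical, k is passed in directly instead of being chosen with the Silhouette Index, all remaining attributes are taken to be numeric, and a deterministic farthest-point seeding (controlled by the hypothetical parameter `first`) replaces random initialization so that the example is reproducible.

```python
def drop_attribute(rows, j):
    """Return the rows with column j removed (A_j is set aside for later phases)."""
    return [row[:j] + row[j + 1:] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, first=0, max_iter=50):
    """Plain k-means; returns one cluster label per point."""
    # Assumption: farthest-point seeding starting from points[first].
    centers = [list(points[first])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move again
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical data: three attributes, the last one (j = 2) is set aside.
rows = [
    [1.0, 1.1, 5.0], [1.2, 0.9, 5.1], [0.9, 1.0, 4.9],
    [8.0, 8.1, 20.0], [8.2, 7.9, 20.5], [7.9, 8.0, 19.5],
]
j = 2
reduced = drop_attribute(rows, j)
labels = kmeans(reduced, k=2)
```

In a setting closer to the paper, the seeding would be random, which is what makes repeated runs produce different clusterings.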

Second Phase: Comparison
After each clustering, the nearest neighbors of each record (r) are found among the records in the same cluster as r. After that, A_j is set as the target for each record, and a value is predicted for it in record r from the nearest neighbors, based on a majority vote or on the average value. That is, the k-nearest neighbors algorithm is applied to all records in all clusters. Suppose that a record r is in cluster i. To apply the k-nearest neighbors algorithm, the k′ records which are in cluster i and have the least distance from record r are selected for record r. The value of k′ is determined experimentally. If the attribute A_j is numeric, the average of A_j over the k′ neighbors is predicted as the value of A_j in record r. However, if A_j is of ordinal or nominal type, the most frequent value of A_j among these k′ neighbors is predicted. The predicted value is called β. These values are used in the next phase to detect the records in which A_j has an incorrect value. In effect, the nearest records are assumed to be correct data and are used to predict a value for A_j in each record.
Lines 10 to 22 and 27 to 32 of Algorithm 1 show this phase of the algorithm. In lines 10 to 16, the distance of record r in cluster i from the other records is calculated, and the k′ records which have the least distance from record r in cluster i are selected. Lines 17 to 22 and 27 to 32 estimate values for numeric attributes and for ordinal and nominal attributes, respectively. The Euclidean distance is used to calculate distances in this method.
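A minimal Python sketch of this prediction step is given below. The function name `predict_beta`, its parameters, and the sample data are hypothetical, and the sketch assumes that record r itself is excluded from its own neighbor list, which the text does not state explicitly.

```python
import math
from collections import Counter


def predict_beta(i, reduced, labels, target, k_prime, numeric):
    """Predict the set-aside attribute of record i from its k' nearest cluster mates."""
    # Candidate neighbors: other records in the same cluster as record i.
    mates = [n for n in range(len(reduced)) if n != i and labels[n] == labels[i]]
    # Sort by Euclidean distance on the remaining attributes and keep the k' closest.
    mates.sort(key=lambda n: math.dist(reduced[i], reduced[n]))
    nearest = mates[:k_prime]
    if not nearest:
        return None  # singleton cluster: no prediction is possible
    values = [target[n] for n in nearest]
    if numeric:
        return sum(values) / len(values)          # average for numeric attributes
    return Counter(values).most_common(1)[0][0]   # most frequent value otherwise


# Hypothetical data: attributes other than A_j, and cluster labels from the first phase.
reduced = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [8.0, 8.1], [8.2, 7.9], [7.9, 8.0]]
labels = [0, 0, 0, 1, 1, 1]
numeric_target = [5.0, 5.2, 50.0, 20.0, 20.5, 19.5]   # record 2 looks suspicious
nominal_target = ["a", "a", "b", "c", "c", "c"]

beta_numeric = predict_beta(2, reduced, labels, numeric_target, k_prime=2, numeric=True)
beta_nominal = predict_beta(2, reduced, labels, nominal_target, k_prime=2, numeric=False)
```

For record 2 the two neighbors are records 0 and 1, so the numeric prediction is their average (about 5.1) and the nominal prediction is "a".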

Third Phase: Error Detection
If the attribute A_j is numeric and the difference between its value in record r and β is higher than a threshold (ϕ), that record is detected as noisy. Moreover, if the attribute is of ordinal or nominal type and the value of A_j in record r differs from β, A_j is detected as noise in record r. In effect, the proposed algorithm detects the records having an incorrect A_j in this phase. In Algorithm 1, lines 23 to 25 perform the noise detection for numeric attributes and lines 33 to 36 perform it for nominal and ordinal attributes. It is noteworthy that the k-nearest neighbors algorithm predicts a value close to the real value, but the predicted value is not sufficient for correcting the noise. In other words, the value predicted by the k-nearest neighbors is an initial estimate, and it is suggested to use other approaches [17,16,32] to correct the detected noise.
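The detection rule of this phase is small enough to state directly in code. The sketch below is illustrative only: the function name, the handling of a missing prediction (`None`), and the sample values are assumptions.

```python
def detect_noise(values, betas, numeric, phi=0.0):
    """Return the indices of records whose value of A_j disagrees with the prediction."""
    flagged = []
    for i, (value, beta) in enumerate(zip(values, betas)):
        if beta is None:
            continue  # assumption: records without a prediction are not flagged
        if numeric:
            noisy = abs(value - beta) > phi   # numeric: difference above threshold
        else:
            noisy = value != beta             # ordinal/nominal: any disagreement
        if noisy:
            flagged.append(i)
    return flagged


# Hypothetical observed values and predicted values (beta) for one attribute.
numeric_flags = detect_noise([5.0, 5.2, 50.0], [27.6, 27.5, 5.1], numeric=True, phi=30.0)
nominal_flags = detect_noise(["a", "a", "b"], ["a", "a", "a"], numeric=False)
```

The numeric example also shows why ϕ matters: the outlier in record 2 distorts the predictions for its neighbors, and a threshold that is too small would flag records 0 and 1 as well.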

Fourth Phase: Error Reduction
This algorithm detects not only noisy attributes but also some correct ones as noisy. To minimize the number of correct attributes identified as noisy, steps 3-1 to 3-3 are run at least 2 times and at most 5 times with k′ neighbors. The number of iterations has been obtained from experiments conducted on the different data sets. Because numeric attributes are handled differently from nominal and ordinal attributes, two different approaches have been proposed for reducing errors. For numeric attributes, the iteration showing the most errors is selected. This iteration is called n_1 and is compared with the other iterations. Suppose that n is the number of iterations; if an erroneous attribute in n_1 occurs in fewer than n − 1 iterations, its value in the record is considered correct.
For nominal and ordinal attributes, just as for numeric attributes, the iteration with the most noisy attributes is selected. It is called n_1 and compared with the other iterations. The elements occurring in fewer than n − 1 iterations are deleted, and the others are saved in temporary memory. Steps 3-1 to 3-3 are then run again at least 2 and at most 5 times, this time with k′ − 1 neighbors, and the results of these iterations are compared with the elements in the temporary memory. Any element of the memory which appears in the results of the n iterations is detected as a final error.
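The voting scheme can be sketched as follows, with each iteration represented as a set of flagged (record, attribute) pairs. The function names are hypothetical, and two readings of the text are assumed: the occurrence count includes iteration n_1 itself, and an element is confirmed in the second round only if it appears in every one of the n repeated runs.

```python
def reduce_errors(runs):
    """Keep the flags of the largest run (n_1) that occur in at least n - 1 runs."""
    n = len(runs)
    n1 = max(runs, key=len)  # iteration that reported the most errors
    return {e for e in n1 if sum(e in run for run in runs) >= n - 1}


def confirm_nominal(candidates, second_runs):
    """Second round for nominal/ordinal attributes: keep candidates found in all reruns."""
    return {e for e in candidates if all(e in run for run in second_runs)}


# Hypothetical results of three iterations with k' neighbors.
runs = [
    {(2, "income"), (0, "income")},
    {(2, "income")},
    {(2, "income"), (4, "income")},
]
final_numeric = reduce_errors(runs)

# Hypothetical second round with k' - 1 neighbors for a nominal attribute.
candidates = {(1, "grade"), (3, "grade")}
second_runs = [{(1, "grade"), (3, "grade")}, {(1, "grade")}]
final_nominal = confirm_nominal(candidates, second_runs)
```

Flags that appear in only one iteration, such as (0, "income") above, are treated as false detections and dropped.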

Time Complexity
In this section, the time complexity of the proposed algorithm is calculated. Suppose that a data set D includes s records and m attributes, and that k is the number of clusters of the k-means algorithm. In line 7, there is a general loop which runs lines 8-38 of the algorithm m times. At first, the k-means algorithm is applied to cluster the data set. The order of the k-means algorithm is O(kst), where t is the number of iterations of the main body of the k-means algorithm. In line 12, there is a loop which iterates s times. Inside this loop, in lines 12-14, the distance of each record r from the other records in the same cluster is calculated. If k = s, these lines will run once, and if k = 1, they will iterate s times.
In line 16, the obtained distances are sorted in ascending order. If the first condition (k = s) is met, the order is O(1), and if the second condition (k = 1) is met, the order is O(s log s). In lines 19-21 and 29-31, there is a loop which is iterated k′ times. If the first condition is met, k′ = 1 and the loop is not iterated. However, if the second condition is met, k′ = s and this loop is iterated s times. Therefore, the order of this algorithm is O(ks^2 log s) in the worst case. As error detection and correction is a preprocessing phase, the order of the correction algorithm does not affect the main data processing step.

Experiments
In this section, five different data sets taken from the UCI Machine Learning Repository are used to assess the performance of the proposed method. Table 1 shows a brief summary of these data sets. These five data sets are correct and free from any error. MATLAB R2014a was used to implement the method. The first data set is about wholesale customers ¶ and consists of eight attributes. The second data set is about user knowledge modeling ∥ and consists of six attributes. The third data set is about the income of people based on census data ** and consists of fifteen attributes. The fourth data set is about the Indonesia contraceptive prevalence †† and consists of nine attributes. The fifth data set is about predicting the cellular localization sites of proteins ‡‡ and consists of nine attributes.

¶ https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
∥ https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling
** https://archive.ics.uci.edu/ml/datasets/Adult
†† https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice
‡‡ https://archive.ics.uci.edu/ml/datasets/Yeast

Algorithm 1 The pseudo-code of the steps 3-1 to 3-4 of the proposed algorithm
Input parameters
  D: A data set defined on schema {A_1, A_2, ..., A_m} with m attributes and s records
  k: Number of k-means clusters
  ϕ: Threshold for the difference between the predicted value and the considered value
Output: A set of erroneous records
Function
  for all ...
    for all records r in D′ do
      ClusterLabel ← find cluster of record r from RecordsClusterLabel
      for all records r which cluster of it is ClusterLabel do
        dis(r) ← Euclidean distance between records in ClusterLabel and r
      Indexs ← sort records ∈ ClusterLabel according to dis in ascending order
      ...
      ... ) then
        out ← out ∪ r
  return out
ENDFunction

In this paper, the performance of the proposed algorithm is evaluated by five criteria. Before introducing the criteria, the parameters used by them are introduced. N and P denote the number of erroneous fields and the number of correct fields, respectively. TN and TP are respectively the numbers of erroneous and correct fields which have been labeled correctly by the algorithm. FP and FN are respectively the numbers of erroneous and correct fields which have been labeled wrongly by the algorithm. The criteria are as follows. The first criterion, shown in equation (1), is the false alarm rate. This rate is the number of errors which have not been detected divided by the total number of errors. The second criterion, shown in equation (2), is the error rate. For this purpose, the number of errors which have not been detected plus the number of correct fields which have been wrongly detected as erroneous is divided by the total number of all records. Equation (3) calculates the true negative rate. This rate equals the number of detected errors divided by the number of detected errors plus the number of errors which have not been detected. Equation (4) calculates the recall. For this purpose, the number of correct attributes which have been detected correctly is divided by the number of correct attributes that have been detected correctly plus the number of correct attributes that have been detected as erroneous. Equation (5) calculates the precision. This rate equals the number of correct attributes which have been detected correctly divided by the number of correct attributes that have been detected correctly plus the number of errors which have not been detected. High values of the true negative rate, precision, and recall, together with low values of the false alarm rate and error rate, indicate high efficiency of the algorithm.
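Read directly from the verbal definitions above, the five criteria can be written out as below. This is a reconstruction under the stated conventions (erroneous fields form the negative class, so N = TN + FP and P = TP + FN); the denominator of the error rate is taken to be the total number of fields, N + P, which is an assumption.

```latex
\begin{align}
\text{False alarm rate}   &= \frac{FP}{FP + TN} \tag{1}\\
\text{Error rate}         &= \frac{FP + FN}{N + P} \tag{2}\\
\text{True negative rate} &= \frac{TN}{TN + FP} \tag{3}\\
\text{Recall}             &= \frac{TP}{TP + FN} \tag{4}\\
\text{Precision}          &= \frac{TP}{TP + FP} \tag{5}
\end{align}
```

Under this reading, the false alarm rate and the true negative rate sum to one, since both are fractions of the same N erroneous fields.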
In each of the error-free data sets, noise was created at rates of 5%, 10%, 15%, and 20%, and the performance was calculated using the criteria mentioned above. A random function is used to create noise in nominal and ordinal attributes; it generates random values from the domain of the attribute. For numeric attributes, equation (6) is used to create noise, where Rand() generates a random number between 0 and 1. Figures 1 to 5 show the performance of the proposed method on the five data sets using the mentioned criteria. As shown in these figures, over all noise rates, the true negative rate, recall, false alarm rate, error rate, and precision are on average 92%, 96%, 8%, 5%, and 98.5%, respectively. Table 2 shows the performance of the proposed method with a noise rate of 20% in each data set and for different iterations. The true negative rate is more than 89% in all iterations and the false alarm rate is 9% on average. In iteration 1, the maximum value of the error rate over all data sets is 17%; after increasing the number of iterations, this value decreases to 9%. The best balance between parameters for the first two data sets, Wholesale Customer and User Knowledge Modeling, has been achieved in iteration 3, for Adult and Contraceptive Method Choice in iteration 5, and for Yeast in iteration 4.
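For nominal and ordinal attributes, the noise injection can be sketched in Python as follows. The function name, the fixed seed, and the sample column are hypothetical, and the sketch assumes that the replacement value is drawn from the attribute's domain excluding the current value, so that every selected field actually changes; the text only says that values are drawn from the domain. The numeric case is not sketched, because it depends on equation (6).

```python
import random


def inject_nominal_noise(column, rate, seed=0):
    """Replace a fraction `rate` of the values with other values from the domain."""
    rng = random.Random(seed)
    domain = sorted(set(column))
    n_noisy = round(rate * len(column))
    positions = rng.sample(range(len(column)), n_noisy)
    noisy = list(column)
    for i in positions:
        # Assumption: draw from the domain without the original value.
        alternatives = [v for v in domain if v != column[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy, sorted(positions)


# Hypothetical clean column with a three-value domain and 20 records.
column = ["a", "b", "c"] * 6 + ["a", "b"]
noisy, positions = inject_nominal_noise(column, rate=0.10)
```

Keeping the list of modified positions provides the ground truth against which the detected fields can later be compared.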
For numeric attributes, ϕ was introduced as the allowed difference between the value predicted by the k−nearest neighbors and the actual value of the attribute in the considered record. To obtain the best value of ϕ, 10% noise was created in a numeric attribute, and the five criteria were evaluated for different values of ϕ. Table 3 shows the results for different values of ϕ. As shown, a low value of ϕ yields a true negative rate of 100% but raises the error rate to its maximum of 25%. Increasing ϕ reduces the error rate dramatically, while the true negative rate decreases from 100% to a value still higher than 89%. The value of ϕ that achieves the best balance between the criteria is 2350 for the Wholesale Customer data set, 0.285 for User Knowledge Modeling, 9 for Adult, 0.31 for Contraceptive Method Choice, and 0.13 for Yeast. Any higher value of ϕ reduces the true negative rate, so these are the best values of ϕ for these data sets. According to the experiments, the best value lies between the mean and the covariance value of each attribute.
For Table 4, a numeric attribute was selected from each of the Wholesale Customer, Adult, and Yeast data sets, and an ordinal attribute from each of the User Knowledge Modeling and Contraceptive Method Choice data sets. The table tests different numbers of neighbors to find the best one. As shown in Table 4, with few neighbors the true negative rate, precision, and recall are low and the false alarm rate and error rate are high. Increasing the number of neighbors increases the true negative rate, precision, and recall and decreases the false alarm rate and error rate. The best balance between the criteria is achieved with 8 neighbors for the Wholesale Customer data set, 4 for User Knowledge Modeling, 20 for Adult, 22 for Contraceptive Method Choice, and 30 for Yeast. At these points, the true negative rate, precision, and recall reach their maximum values, and the false alarm rate and error rate are considerably reduced.
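The roles of ϕ and the number of neighbors can be illustrated with a small sketch for a numeric attribute. It rests on several assumptions that are not stated in this section: the records passed in already belong to a single k−means cluster, distance is Euclidean over the remaining attributes, the predicted value is the mean over the k nearest records, and a field is flagged as noisy when the absolute difference from the prediction exceeds ϕ. The names `knn_predict` and `is_noisy` are hypothetical.

```python
import math


def knn_predict(target_index, features, values, k):
    """Predict the attribute of one record from its k nearest neighbors.

    features : list of tuples, the remaining attributes of each record
               in the cluster (the attribute under test is excluded)
    values   : list of floats, the attribute under test for each record
    Assumption: the prediction is the mean over the k nearest records.
    """
    distances = []
    for i, row in enumerate(features):
        if i == target_index:
            continue  # a record is not its own neighbor
        distances.append((math.dist(features[target_index], row), i))
    distances.sort()
    nearest = [values[i] for _, i in distances[:k]]
    return sum(nearest) / len(nearest)


def is_noisy(predicted, actual, phi):
    """Flag the field when the prediction error exceeds the threshold phi."""
    return abs(predicted - actual) > phi


if __name__ == "__main__":
    # Toy cluster: one feature per record; record 3 holds an injected outlier.
    features = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
    values = [10.0, 11.0, 9.0, 50.0, 10.0]
    for idx in range(len(values)):
        pred = knn_predict(idx, features, values, k=2)
        print(idx, pred, is_noisy(pred, values[idx], phi=5.0))
```

The sketch also shows why both parameters need tuning: a very small ϕ flags almost every field, so all errors are caught at the cost of many correct fields being reported, while too few neighbors let a single noisy neighbor distort the prediction.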
In Figure 6, the proposed method is compared with the automated noise detection method using association rules [15]. The rule-based approach was selected for comparison for three reasons: first, it has only a detection phase; second, it is able to detect noise in all types of data; and third, it detects noisy fields automatically, without user interaction. The automated noise detection method using association rules [15] was implemented in C# with Visual Studio 2010. The average performance of the proposed algorithm is compared with the average performance of the rule-based method. In all data sets, 20% noise was created.
As shown in Figure 6, the recall of the proposed method equals that of the rule-based method, but the proposed method gives better results on the remaining criteria. Compared with the rule-based method, the true negative rate and precision of the proposed method are higher by 13% and 2.7% on average, respectively, and the false alarm rate and error rate are lower by 13% and 0.1%, respectively. The experiments indicate that the proposed method is more efficient than the rule-based method [15].

Conclusion
Data accuracy is an important dimension of data quality. Decision making based on incorrect data can impose high costs and failures on organizations. The data cleaning process consists of two phases: error detection and error correction. It detects inconsistencies, missing values, and duplicates, and corrects the detected errors. Given the high volume of data, interaction with the user during error detection is infeasible. Therefore, an automatic approach to error detection has been proposed in this paper. The proposed method is based on k−means clustering. In this approach, clustering was carried out for each attribute, and the k−nearest neighbors method was then applied within each cluster. This approach can detect errors in different data types. In addition, because attributes are considered separately, it can detect several erroneous attributes in a record. According to the experiments, the true negative rate of this method is 92% on average. Moreover, the true negative rate of the proposed algorithm is 13% higher than that of the rule-based method [15].