Improved Mean Methods of Imputation

Replacing the missing values of a variable with the mean of the non-missing values is a simple and natural way to impute values when the data are missing completely at random. Following a short review of this method, we consider three possible improvements: the shrinkage method, the weighted interval method, and the known variance method. Estimates of the population mean obtained from each of these methods are compared with those from the mean method, both analytically and by means of numerical examples.


Introduction
The problem of missing data is very common in real surveys as well as in experimental studies, and may arise for any number of reasons. If the data are missing completely at random (MCAR), things can be very hard to handle in practice, particularly when no auxiliary information is available. Examples of MCAR data are a laboratory sample that is dropped, or a questionnaire that is lost in a mail survey, so that the resulting observations are missing. Another recent example would be plant studies that were affected by Hurricane Matthew in Florida. MCAR is a more difficult situation than missing at random (MAR), or cases where there is deliberate non-response. More effort is required to develop better imputation methods for MCAR, where additional information may be of little help or could lead to wrong predictions. In addition, in some situations a cheap and fast imputation technique is frequently used to substitute for missing values in order to improve the inferential properties of an estimator. For more detail on the concept of MCAR, one could refer to [4,11,3].
A general problem with many frequently employed imputation methods is the introduction of bias and an increase in the variance of the resultant estimators. A search for unbiased estimators under imputation should therefore be of interest. For details on the history of imputation methods, one could refer to [7], where one finds that several imputation methods, such as the hot deck, nearest neighbour, cold deck, warm deck, ratio, regression and power methods of imputation, have been proposed. These methods either make use of a "deck" from a past or the present survey, or make use of auxiliary information. A critical review in [7] shows that efforts have not been made to improve the mean method of imputation in the absence of auxiliary information. We also conclude that it is not an easy task to improve the mean method of imputation in the absence of any additional information at hand. In this paper we suggest three new imputation methods in the absence of auxiliary information.
Consider a finite population, Ω, of N units, Ω = {ω_1, ω_2, ..., ω_i, ..., ω_N}. Let y be the variable of interest in the population, and $y_i$ the value of y for unit i. Let
$$\bar{Y} = \frac{1}{N}\sum_{i \in \Omega} y_i \tag{1}$$
be the true population mean of the study variable y. Assume a simple random sample without replacement (SRSWOR), s, of size n is drawn from the population Ω. Assume it was possible to collect information on only r units out of the n sampled units. In particular, let the set of r responding units be denoted by A ⊆ s and that of the (n − r) non-responding units be denoted by $A^c$. For every unit i ∈ A, the value of $y_i$ is observed, and for every unit i ∈ $A^c$, the value of $y_i$ is missing. Thus the sample data consist of r observed values and (n − r) missing values. Now the first choice is to drop the (n − r) missing data values in the set $A^c$ from the sample s of n data values and consider an estimator of the population mean $\bar{Y}$ as:
$$\bar{y}_r = \frac{1}{r}\sum_{i \in A} y_i, \tag{3}$$
which is the sample mean of the r values in the responding set A.
Assuming the data are missing completely at random (MCAR), and applying the concept of two-phase sampling as given in [1], it is easy to verify that the sample mean $\bar{y}_r$ in (3) is an unbiased estimator of the population mean $\bar{Y}$ with conditional variance, for a given value of r, given by:
$$V(\bar{y}_r \mid r) = \left(\frac{1}{r} - \frac{1}{N}\right) S_y^2, \tag{4}$$
where
$$S_y^2 = \frac{1}{N-1}\sum_{i \in \Omega} (y_i - \bar{Y})^2$$
is the population mean squared error (or population variance) of the study variable.
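Now consider imputing the missing data values by the mean method of imputation as follows:
$$y_{\cdot i} = \begin{cases} y_i & \text{if } i \in A, \\ \bar{y}_r & \text{if } i \in A^c, \end{cases} \tag{5}$$
where $y_{\cdot i}$ denotes the observed or imputed value for unit i; that is, all the missing values are replaced by the sample mean $\bar{y}_r$ of the responding values. Now consider the point estimator of the population mean, given by:
$$\bar{y}_s = \frac{1}{n}\sum_{i \in s} y_{\cdot i}. \tag{6}$$
On using (5) in (6) we have
$$\bar{y}_s = \frac{1}{n}\left[ r\,\bar{y}_r + (n - r)\,\bar{y}_r \right] = \bar{y}_r. \tag{7}$$
From (7) and (3), the mean method of imputation leads to the same estimator ($= \bar{y}_r$) of the population mean $\bar{Y}$, with the same variance as given in (4). Thus, although the mean method of imputation is helpful in completing the missing values in a sample, it does not provide any additional benefit in drawing inferences from the results.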
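The equivalence between the respondent mean and the estimator obtained after mean imputation can be checked numerically. The following is a minimal Python sketch; the sample values and the helper name `mean_impute` are hypothetical and serve only as an illustration.

```python
from statistics import mean

def mean_impute(sample):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [y for y in sample if y is not None]
    y_bar_r = mean(observed)
    return [y_bar_r if y is None else y for y in sample]

# Hypothetical sample of n = 6 units with r = 4 respondents.
sample = [12.0, None, 9.5, 11.0, None, 10.5]
completed = mean_impute(sample)

y_bar_r = mean(y for y in sample if y is not None)  # mean of the respondents
y_bar_s = mean(completed)                           # mean after imputation

print(y_bar_r, y_bar_s)
```

Both means coincide, which illustrates why mean imputation completes the data set without changing the point estimate.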
In the following sections, we propose new imputing techniques which fill the missing data values with more accurate predicted values, and which lead to more efficient estimators of the population mean under various situations. In section 2, we introduce a new shrinkage estimation technique. In section 3, we introduce a new weighted interval method of imputation, and in section 4 we introduce a new method of imputation for the case where the population variance of the study variable is known.

SHRINKAGE METHOD OF IMPUTATION
Following [12,13], we propose a shrinkage method of imputation in which the imputed values depend on a constant λ, called the shrinkage parameter, which is to be determined by some criterion, such as that the resultant estimator has minimum mean squared error. Under the proposed shrinkage method of imputation, the point estimator of the population mean $\bar{Y}$ is given by
$$\bar{y}_{shrink} = \lambda\,\bar{y}_r. \tag{9}$$
The percent relative bias in the proposed shrinkage estimator $\bar{y}_{shrink}$ is given by
$$RB(\bar{y}_{shrink}) = \frac{E(\bar{y}_{shrink}) - \bar{Y}}{\bar{Y}} \times 100\% = (\lambda - 1) \times 100\%. \tag{10}$$
It may be worth pointing out that the value of the percent relative bias is free from the value of the population mean.
The mean squared error of the proposed shrinkage estimator $\bar{y}_{shrink}$ is given by
$$MSE(\bar{y}_{shrink}) = \lambda^2 \left(\frac{1}{r} - \frac{1}{N}\right) S_y^2 + (\lambda - 1)^2\, \bar{Y}^2. \tag{11}$$
The optimum value of λ which minimises the mean squared error of the proposed shrinkage estimator $\bar{y}_{shrink}$ is given by
$$\lambda_o = \frac{1}{1 + \left(\frac{1}{r} - \frac{1}{N}\right) C_y^2}, \tag{12}$$
where $C_y = S_y / \bar{Y}$ denotes the coefficient of variation of the study variable. The resultant minimum mean squared error of the shrinkage estimator $\bar{y}_{shrink}$ is given by
$$MSE(\bar{y}_{shrink})_o = \frac{\left(\frac{1}{r} - \frac{1}{N}\right) S_y^2}{1 + \left(\frac{1}{r} - \frac{1}{N}\right) C_y^2}. \tag{13}$$
The optimum percent relative bias in the proposed shrinkage estimator $\bar{y}_{shrink}$ is given by
$$RB(\bar{y}_{shrink})_o = -\frac{\left(\frac{1}{r} - \frac{1}{N}\right) C_y^2}{1 + \left(\frac{1}{r} - \frac{1}{N}\right) C_y^2} \times 100\%. \tag{14}$$
The optimum percent relative efficiency of the proposed shrinkage method of imputation over the mean method of imputation is given by
$$RE(\bar{y}_{shrink})_o = \frac{V(\bar{y}_r \mid r)}{MSE(\bar{y}_{shrink})_o} \times 100\% = \left[1 + \left(\frac{1}{r} - \frac{1}{N}\right) C_y^2\right] \times 100\%. \tag{15}$$
Note that the percent relative efficiency is an increasing function of the coefficient of variation $C_y$ and of the difference between $1/r$ and $1/N$. The proposed shrinkage method of imputation will therefore perform better than the mean method of imputation when the number of respondents r is low and the coefficient of variation $C_y$ is large. It seems that the proposed shrinkage method of imputation will be very useful when it is very expensive to obtain responses from the respondents and the variation among the units in the population is large. Note that the percent relative efficiency $RE(\bar{y}_{shrink})_o$ is a function of r, N and $C_y$, and the percent relative bias $RB(\bar{y}_{shrink})$ is a function of r, N, $C_y^2$ and $S_y^2$.
We investigated the behaviour of $RE(\bar{y}_{shrink})_o$, $RB(\bar{y}_{shrink})_o$ and the optimum value of λ for various choices of the parameters. In the study, we considered nine different populations, with values of the coefficient of variation $C_y$ equal to 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9, each of size N = 1000 units. From each population we considered a sample of size n = 15 units, and assumed several different values of the number of respondents r, equal to 3, 5, 7 and 9. It is noted that for r = 3, as $C_y$ varies between 0.1 and 0.5, the absolute value of the optimum relative bias $RB(\bar{y}_{shrink})_o$ remains less than 10% [1] and the percent relative efficiency $RE(\bar{y}_{shrink})_o$ varies between 100.33% and 108.31%; for r = 5, as $C_y$ varies between 0.1 and 0.7, the absolute value of $RB(\bar{y}_{shrink})_o$ remains less than 10% and $RE(\bar{y}_{shrink})_o$ varies between 100.20% and 109.75%; for r = 7, as $C_y$ varies between 0.1 and 0.8, the absolute value of $RB(\bar{y}_{shrink})_o$ remains less than 10% and $RE(\bar{y}_{shrink})_o$ varies between 100.14% and 109.08%; and for r = 9, as $C_y$ varies between 0.1 and 0.9, the absolute value of $RB(\bar{y}_{shrink})_o$ remains less than 10% and $RE(\bar{y}_{shrink})_o$ varies between 100.11% and 108.92%. From the analysis, we conclude that if a good guess at, or the true value of, the coefficient of variation $C_y$ of the study variable is available, then it can be used to obtain better imputed values than those from the mean method of imputation. It may be worth mentioning that if the value of λ is unknown, it can be replaced by a consistent estimator computed from the sample.
In the next section, we introduce a new interval method of imputation that makes use of the sample standard deviation of the responding units in addition to the sample mean. The standard interval method of estimating a population mean, available in almost all introductory statistics textbooks, motivated the authors to ask whether such a method can be constructed to impute a pair of values for each non-respondent instead of a single value.
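The shrinkage quantities discussed in this section can be tabulated with a few lines of Python. The sketch below rests on an assumption: the shrinkage estimator is taken to have the Searls form $\lambda \bar{y}_r$, so that, with $k = (1/r - 1/N) C_y^2$, the optimum shrinkage parameter is $1/(1+k)$, the optimum percent relative bias is $-100k/(1+k)$ and the optimum percent relative efficiency is $100(1+k)$. The function name `shrinkage_summary` is hypothetical. Under this assumption the sketch reproduces the efficiencies quoted in the text for N = 1000.

```python
def shrinkage_summary(r, N, cv):
    """Optimum lambda, percent relative bias and percent relative efficiency.

    Assumes the Searls-type estimator lambda * y_bar_r (an assumption of
    this sketch), with k = (1/r - 1/N) * cv**2.
    """
    k = (1.0 / r - 1.0 / N) * cv ** 2
    lam = 1.0 / (1.0 + k)          # optimum shrinkage parameter
    rb = (lam - 1.0) * 100.0       # optimum percent relative bias
    re = (1.0 + k) * 100.0         # optimum percent relative efficiency
    return lam, rb, re

N = 1000
for r, cv in [(3, 0.1), (3, 0.5), (5, 0.7), (7, 0.8), (9, 0.9)]:
    lam, rb, re = shrinkage_summary(r, N, cv)
    print(f"r={r}, C_y={cv}: lambda={lam:.4f}, RB={rb:.2f}%, RE={re:.2f}%")
```

The pairs of r and $C_y$ in the loop are the end points of the ranges reported in the numerical study.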

A NEW WEIGHTED INTERVAL METHOD OF IMPUTATION
In this section, we suggest a new method of imputation based on a weighted interval method of estimation, given by (17), where $\alpha_1$ and $\alpha_2$ are real constants such that $\alpha_1 + \alpha_2 = 1$. In (17), two values are imputed for each non-respondent, one value to the left of the sample mean and another to the right of the sample mean. If one chooses $\alpha_1 = \alpha_2 = 1/2$, then the imputation method in (17) reduces to the mean method of imputation. Thus the question is how to decide on the best possible choice of $\alpha_1$ and $\alpha_2$. One possibility is to determine the values of $\alpha_1$ and $\alpha_2$ such that the mean squared error of the resultant estimator is minimum.
The point estimator (6) under the weighted interval method of imputation in (17) is denoted by $\bar{y}_{w(int)}$. In order to study the asymptotic properties of the newly proposed estimator $\bar{y}_{w(int)}$, its bias and mean squared error are obtained to the first order of approximation. Minimising the mean squared error gives the optimum values of $\alpha_1$ and $\alpha_2$ and the corresponding minimum mean squared error, to the first order of approximation.
Note that if the value of the coefficient of skewness is positive, then the left side interval estimate should be given greater weight than the right side interval estimate, so that the overall imputed value lies below the sample mean. Thus for a data set skewed to the right, the imputed value of a non-respondent should be smaller than the sample mean value of the responding units. If the value of the coefficient of skewness is negative, then the left side interval estimate should be given less weight and the right side interval estimate more weight, so that the overall imputed value lies above the sample mean. Thus for a data set skewed to the left, the imputed value of a non-respondent should be greater than the sample mean value of the responding units. Further note that if the value of the coefficient of skewness is zero, then the left side interval estimate should be given the same weight as the right side interval estimate. Thus for a data set which is symmetric (say, normally distributed), the imputed value of a non-respondent reduces to the sample mean value of the responding units.
The percent relative bias of the proposed weighted interval method of imputation is expressed in terms of the coefficient of skewness $\beta_1$, the coefficient of kurtosis $\beta_2$ and the coefficient of variation $C_y$. Note that if $\beta_1 = 0$ then there is no relative bias, and the proposed weighted method of imputation reduces to the usual mean method of imputation for the optimum values of $\alpha_1$ and $\alpha_2$. If $\beta_1 < 0$ then the distribution is skewed to the left, and if $\beta_1 > 0$ then the distribution is skewed to the right. If $\beta_2 = 3$ then the distribution is mesokurtic, as is the normal distribution; if $\beta_2 > 3$ then the curve is more peaked than the normal curve and is called leptokurtic, while if $\beta_2 < 3$ then the curve is flatter than the normal curve and is called platykurtic. The percent relative efficiency of the proposed weighted interval method of imputation with respect to the mean method of imputation is defined as the ratio of the variance of the mean method estimator to the minimum mean squared error of $\bar{y}_{w(int)}$, expressed as a percentage. From (28), one obvious observation is that the value of the coefficient of skewness should be a real number satisfying the condition (29). Thus we would be interested in studying the behaviour of the percent relative bias and percent relative efficiency for values of $\beta_1$ satisfying the condition (29), for various values of $C_y$, and for a few values of $\beta_2$ representing both platykurtic and leptokurtic curves.
In order to look at the magnitude of the gain in efficiency for different choices of the parameters involved in the percent relative bias and percent relative efficiency expressions, we generated four populations, each of size N = 10000 units, by using the model:
$$y_i = 10 + y_i^{*}, \tag{30}$$
where $y_i^{*} \sim \text{Gamma}(\alpha, \beta)$, that is, $y_i^{*}$ follows a gamma distribution with parameters α and β. In other words, in each of the generated populations, the study variable $y_i$ follows a gamma distribution with its mean shifted up by 10 units. The four populations are generated with four different choices of the shape parameter α, equal to 0.05, 0.15, 0.25 and 0.35, and only one choice of the scale parameter, β = 1. The purpose of adding 10 to the population values is to reduce the value of the coefficient of variation to a reasonable value of around 10%, following [1]. A graphical presentation of the four populations is given in Figure 1.
For the population with α = 0.05, two sets of parameter values were obtained: one found from the generated finite population and one that follows from the gamma model. Either one of these two sets of parameters can be used to study the percent relative efficiency and percent relative bias, as both lead to essentially identical results; the difference between them is due to the finite population of size N = 10000 taken from such super-populations. To be more precise, we used the alternatively produced parameters with N = 10000 in order to find the percent relative efficiency and percent relative bias. We consider n = 500 and r = 50, 100, 150, 200, 250, 300, 350, 400 and 450.
We found that if α = 0.05, the population is highly leptokurtic, with $\beta_2 = 123$, skewed to the right, with $\beta_1 = 8.944$, and has a reasonable coefficient of variation, $C_y = 14.43\%$; the percent relative efficiency (RE) then increases from 104.4% to 166.1% as the value of r decreases from 450 to 50. The respective values of the percent relative bias (RB) remain negligible, between −0.1369% and −1.2839%. If the value of α is increased to 0.15, the population is still leptokurtic, with $\beta_2 = 43.0$, skewed to the right, with $\beta_1 = 5.164$, and has a high coefficient of variation, $C_y = 23.47\%$; the percent relative efficiency (RE) then increases from 101.4% to 115.3% as the value of r decreases from 450 to 50. The value of the percent relative bias (RB) still remains negligible, between −0.1455% and −1.2061%. If the value of α is increased to 0.35, the population is still leptokurtic, with $\beta_2 = 20.1$, skewed to the right, with $\beta_1 = 3.381$, and has a high coefficient of variation, $C_y = 31.98\%$; the percent relative efficiency (RE) then increases from 100.6% to 106.0% as the value of r decreases from 450 to 50. The value of the percent relative bias (RB) remains negligible, between −0.1147% and −1.0757%. It seems that if the variation in a population is too large, then it will be hard to impute missing values irrespective of the method one uses. Thus the consistency of a population must be investigated before implementing any imputation method.
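The skewness and kurtosis values quoted above can be cross-checked against the standard closed-form moments of the gamma distribution: for shape parameter α, the coefficient of skewness is $2/\sqrt{\alpha}$ and the coefficient of kurtosis is $3 + 6/\alpha$, and neither changes when a constant is added or the scale is changed. The following Python sketch, with the hypothetical helper name `gamma_skew_kurt`, evaluates them for the four shape parameters used in the study.

```python
import math

def gamma_skew_kurt(shape):
    """Coefficients of skewness (beta_1) and kurtosis (beta_2) of a gamma
    variable; both are unaffected by the scale parameter and by adding a
    constant to the variable."""
    beta1 = 2.0 / math.sqrt(shape)
    beta2 = 3.0 + 6.0 / shape
    return beta1, beta2

for shape in (0.05, 0.15, 0.25, 0.35):
    beta1, beta2 = gamma_skew_kurt(shape)
    print(f"alpha={shape}: beta1={beta1:.3f}, beta2={beta2:.1f}")
```

For α = 0.05 and α = 0.15 these closed forms agree with the values of $\beta_1$ and $\beta_2$ reported in the text.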
We conclude that the proposed weighted interval method of imputation can be useful in imputing missing values if the coefficient of skewness and the coefficient of kurtosis of the population are known. The proposed weighted interval method of imputation can perform better than the mean method of imputation if the distribution is skewed to the right or to the left with a high value of $\beta_2$. In practice, the distribution of income is found to be skewed to the right in many populations of interest, and if it is also leptokurtic, then the proposed weighted method of imputation can be useful in imputing missing values in such surveys. Among others, [6] made use of the known population variance of the study variable in improving the estimator of the population mean of the same study variable. This motivated the authors to construct, in the next section, a new method of imputation in the presence of a known population variance of the same variable.

A NEW IMPUTATION METHOD WHEN POPULATION VARIANCE IS KNOWN
In this section, we introduce a naive variance dependent imputing method, given by (33), where γ is a constant to be determined such that the variance of the final estimator is minimum, and
$$S_y^2 = \frac{1}{N-1}\sum_{i \in \Omega} (y_i - \bar{Y})^2$$
is the known population variance of the study variable y.
If γ = 0, then the proposed variance dependent imputing method reduces to the usual mean method of imputation. Under the proposed variance dependent imputation method in (33), the point estimator (6) is denoted by $\bar{y}_{n(var)}$. The proposed variance dependent estimator $\bar{y}_{n(var)}$ is an unbiased estimator of the population mean. For the optimum value of γ, the minimum variance of the proposed naive variance dependent estimator $\bar{y}_{n(var)}$ is given by
$$V(\bar{y}_{n(var)})_o = \left(\frac{1}{r} - \frac{1}{N}\right) S_y^2 \left[1 - \frac{\beta_1^2}{\beta_2 - 1}\right]. \tag{35}$$
The percent relative efficiency of the proposed naive variance dependent estimator $\bar{y}_{n(var)}$ with respect to the mean method of imputation is defined as:
$$RE(\bar{y}_{n(var)}) = \frac{V(\bar{y}_r \mid r)}{V(\bar{y}_{n(var)})_o} \times 100\% = \frac{100\%}{1 - \beta_1^2/(\beta_2 - 1)}. \tag{36}$$
It is interesting to note that the value of the percent relative efficiency is a function of only two parameters, $\beta_1$ and $\beta_2$. We used the same four populations, with α equal to 0.05, 0.15, 0.25 and 0.35, as in the previous section. The computed optimum values of γ are found to be 0.328, 0.317, 0.308 and 0.299 respectively, with percent relative efficiency (RE) values of 290.5%, 273.9%, 260.0% and 248.1%. A further look at the behaviour of the percent relative efficiency as a function of α finds that as α increases from 1 to 49, the value of $\beta_1$ decreases from 2 to 0.286, the value of $\beta_2$ decreases from 9 to 3.122 (leptokurtic), the value of $C_y$ decreases from 0.4 to 0.139, the optimum value of γ decreases from 0.25 to 0.019, and the percent relative efficiency decreases from 200.0% to 104.0%.
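The dependence of the percent relative efficiency on $\beta_1$ and $\beta_2$ alone can be illustrated numerically. The Python sketch below assumes that the efficiency has the regression-type form $100 / \{1 - \beta_1^2/(\beta_2 - 1)\}$ and that the populations follow the gamma model, for which $\beta_1 = 2/\sqrt{\alpha}$ and $\beta_2 = 3 + 6/\alpha$; the function names are hypothetical. Under these assumptions the sketch reproduces the efficiencies quoted in the text.

```python
import math

def naive_var_re(beta1, beta2):
    """Percent relative efficiency over the mean method, assuming the
    regression-type form 100 / (1 - beta1**2 / (beta2 - 1))."""
    return 100.0 / (1.0 - beta1 ** 2 / (beta2 - 1.0))

def gamma_re(shape):
    """Same efficiency for a gamma population with the given shape."""
    beta1 = 2.0 / math.sqrt(shape)   # gamma skewness
    beta2 = 3.0 + 6.0 / shape        # gamma kurtosis
    return naive_var_re(beta1, beta2)

for shape in (0.05, 0.15, 0.25, 0.35, 1, 49):
    print(f"alpha={shape}: RE={gamma_re(shape):.1f}%")
```

For a gamma population the expression simplifies to $100(\alpha + 3)/(\alpha + 1)$, which makes the decline in efficiency with increasing α explicit.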
The following remark answers several valuable questions raised by one of the reviewers:

Remark
(1) Are the new methods competitive with the EM or MI algorithm?
(a) EM-Algorithm: As per our understanding, the EM algorithm assumes a known distribution for the data being imputed, as is done in mathematical statistics. For example, refer to [5]; it seems it was introduced by [2]. In survey sampling methodology, we do not make any assumption about whether the distribution of the data is known or unknown. All the ratio, product and regression type estimators are free from such assumptions. However, sometimes they assume known values of a few constants used at the estimation stage, which are functions of the parameters being estimated. It is then shown that replacing those unknown constants with their consistent estimators does not alter the final mean squared errors to the first order of approximation. For example, the difference estimator depends on a constant β whose optimum value is given by $\beta = S_{xy}/S_x^2$; it is estimated by $\hat{\beta} = s_{xy}/s_x^2$, and the difference estimator becomes a regression estimator given by
$$\bar{y}_{reg} = \bar{y} + \hat{\beta}\,(\bar{X} - \bar{x}).$$
It is shown in all textbooks, see [1], that $V(\bar{y}_{dif}) = MSE(\bar{y}_{reg})$. One could also refer to [14].
In the EM algorithm, it is not clear how one EM algorithm can be proved to be better than another except through a simulation study based on a couple of thousand iterations.
(i.) If we are imputing missing values for several variables, then the Searls imputing method proposed in this paper will always provide better results than the mean method of imputation for each variable being imputed. If missingness is considered a very sensitive issue, then one should not impute several variables simultaneously. Imputation needs to be done very carefully for each variable separately, and if possible every single missing value should be imputed very carefully for every variable.
(ii.) If we are imputing missing values several times for a single variable, then no doubt an EM algorithm will provide a different value at each iteration of imputation, depending on the random start, which raises the question of which imputed value should be used. In contrast, each of the mean, ratio, regression, or proposed Searls methods of imputation gives a unique imputed value for the method being used. There is no confusion about which imputed value to use.
(2) In what way can a measure of variability between imputations be produced from the new methods?
In the mean, ratio and regression type methods of imputation, there is no question of variability between the imputed values. The imputed values are unique for a given method. However, the variability between the values imputed by an EM algorithm raises the question of which imputed value should be used, and why.
(3) Is it possible to produce multiple imputations using any of the three methods?
It depends on how one defines Multiple Imputations: if one variable is being imputed, then the answer is yes, otherwise no. Here "no" is better than "yes", because the imputed value is unique and removes any type of confusion.
(4) Can the new methods be applied only if the mechanism of missing values is completely random (MCAR)? What should be done with the new methods when the mechanism of missing values is MAR or MCAR?
As far as the mean method of imputation and the proposed Searls type method of imputation are concerned, they should be applied only in the MCAR situation. However, if some auxiliary information is available, the proposed method can be extended along the lines of the recent work of [8,9,10]. To our knowledge, the idea of the Searls method of imputation, and the other two methods in the paper, are completely new.

Figure 1. Four populations considered in the study.

Those simulation results would change from data set to data set. In contrast, the ratio or regression type methods of imputation are comparable on the basis of a fixed criterion, the mean squared error (MSE), which is easy to derive and allows comparison on theoretical grounds. It leads to one standard result; for example, in the present paper it is shown that the Searls method of imputation always has a smaller MSE than the mean method of imputation, irrespective of the data set being used for imputation. Please see equation (15). Such theoretical justification is not possible in the case of EM algorithms.
(b) Multiple Imputations: As per our understanding, Multiple Imputations can be interpreted in two different ways: