Evaluating the Goodness of the Sample Coefficient of Variation via Discrete Uniform Distribution

The discrete uniform distribution (DUD) is one of the simplest probability models, but it is now being introduced as a main tool for evaluating resampling techniques, which are rapidly entering data analysis and uncovering useful information for researchers. In this paper we evaluate whether the sample coefficient of variation (CV) is a good estimator of the population CV when the random variable (r.v.) follows the DUD. A method is proposed to obtain the percentage of samples whose CV lies within the bounds of the corresponding population CV, and this value is used as a measure of goodness. Samples both with and without replacement are examined, indicating that the goodness of the sample CV estimator increases with the sample size. The overall study gives a good idea of whether the sample CV is generally a good estimator. A real-life data set is analyzed to demonstrate the applicability of the proposed method in practice and the results are interpreted.


Introduction
The CV of a r.v. X is given by
$$CV = \frac{\sigma}{\mu}, \tag{1}$$
where $\sigma$ is the population standard deviation and $\mu \neq 0$ is the population mean of the r.v. X [21,22], while the value of the CV obtained from a sample is
$$CV = \frac{s}{\bar{x}}, \tag{2}$$
where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.
The CV is unit-free and thus it is one of the most widely used statistical tools in various scientific fields. It is used as a tool for quality improvement [12], as a tool for managing loan portfolio risks [17], as a tool for measuring athletic performance [25], as a tool for calculating the sample size needed for carrying out studies [29], as a tool for obtaining a distribution model [20], as an intra-observer variability assessment tool [7] and so on.

Discrete uniform distributions play a naturally important role in many classical problems of probability theory [1,13]. Discrete random variables are of interest because their study is theoretical and can be exhaustive, since only a small number of values is involved. Another advantage of a discrete r.v. is that a continuous variable can be transformed into a discrete one by appropriately dividing its range. A continuous lifetime is not necessarily measured on a continuous scale and is often recorded as a discrete random variable [10]. In survival analysis the survival function may be a function of a count r.v. that is a discrete version of the underlying continuous r.v. For example, the length of stay in an observation ward is counted in days and the survival time of leukemia patients is counted in weeks. In such cases, the use of a discrete distribution is more realistic than the use of a continuous one. Extending the study to a larger number of values can also predict the limiting behavior of the continuous r.v. as the number of values increases, or even tends to infinity. The importance of the CV in a DUD is that it can be used as a main evaluation tool for resampling techniques, like bootstrapping and permutation testing [9,23], which are very powerful in all areas of data analysis, such as fraud detection, product categorization and disease diagnosis.
We investigate the case of the discrete r.v. X following the DUD $DU\{0, 1, \ldots, N-1\}$. It can be easily seen that the CV of X is given by
$$CV = \frac{\sigma}{\mu} = \frac{\sqrt{(N^2-1)/12}}{(N-1)/2} = \sqrt{\frac{N+1}{3(N-1)}}. \tag{3}$$
From (3), we conclude that the CV is independent of the difference between the consecutive terms and that, as the population size, N, increases, the CV tends to $\sqrt{3}/3 = 0.5774$. Let $CV(N)$ and $CV(N+1)$ denote the CV for random variables following the discrete uniform distributions $DU\{0, 1, \ldots, N-1\}$ and $DU\{0, 1, \ldots, N\}$, respectively. Thus, we have
$$\frac{CV(N+1)^2}{CV(N)^2} = \frac{(N+2)(N-1)}{N(N+1)} = 1 - \frac{2}{N(N+1)} < 1,$$
and we conclude that the CV is a strictly decreasing function of N and has an upper limit, $CV \le 1$, for all $N > 2$.
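The behaviour of the population CV can be checked numerically. The sketch below assumes the standard moments of $DU\{0, 1, \ldots, N-1\}$, mean $(N-1)/2$ and variance $(N^2-1)/12$, which give $CV = \sqrt{(N+1)/(3(N-1))}$; the function names are illustrative and not taken from the paper.

```python
import math

def population_cv(N: int) -> float:
    """Closed-form CV of DU{0, 1, ..., N-1} (assumes N >= 2 so the mean is non-zero)."""
    if N < 2:
        raise ValueError("N must be at least 2 so that the mean is non-zero")
    return math.sqrt((N + 1) / (3 * (N - 1)))

def population_cv_brute_force(N: int) -> float:
    """Same quantity computed directly from the N equally likely values."""
    values = list(range(N))
    mu = sum(values) / N
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / N)
    return sigma / mu

# Print the CV for a few population sizes next to the limiting value sqrt(3)/3.
for N in (2, 3, 4, 5, 8, 100, 10_000):
    print(N, round(population_cv(N), 4), round(population_cv_brute_force(N), 4))
print("limit:", round(math.sqrt(3) / 3, 4))
```

The two functions agree, and the printed values decrease from 1 at N = 2 towards 0.5774.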
In the related literature, little work has been done on the discrete case and, consequently, published work on the CV and the DUD is rare. [11] recently developed procedures for interval estimation and hypothesis testing for the coefficient of variation in the continuous uniform distribution, [2,3,19] studied the convolution powers of the DUD, [5,4] studied order statistics from a DUD, [16] obtained bounds for the population CV in the DUD along with other distributions and [6] defined goodness-of-fit statistics to test fit to a DUD. In the present study, the goodness of the sample CV is ultimately attributed to the percentage of samples whose CV lies within the bounds of the population CV, $\sqrt{3}/3 < CV \le 1$ [16], or approximately in the interval (0.5774, 1]. The rest of this paper is organized as follows. In Section 2, we describe the sampling methods used. In Section 3, we provide an efficient algorithm for computing this percentage, called the rate of return, and apply it for different sample sizes and population sizes. In Section 4, regression and correlation analyses are carried out between the sample size and the rate of return. In Section 5, a real-data application illustrates the effectiveness of the proposed algorithm. Section 6 concludes the paper and proposes future directions.

Sampling Method
The sampling method selected is random sampling (RS) [8,15], i.e. every element of the population has the same probability of being drawn as any other element. In the RS process it is usually assumed that the selected element does not return to the population, so it cannot be re-selected. This is known as sampling without replacement. However, an element may instead be returned to the population after its value is recorded, so that it can be re-selected. This process is known as sampling with replacement.
It has been proved that $\mathrm{Var}(\bar{X}_r) > \mathrm{Var}(\bar{X})$, where $\bar{X}_r$ is the mean of the sample with replacement and $\bar{X}$ is the mean of the sample without replacement; more precisely, it is larger by a factor of $(N-1)/(N-n)$ [8], where N and n are the population size and the sample size, respectively. Therefore, in the case of random sampling with replacement, we have less information from a sample of size n.
The number of different samples taken during sampling without replacement is equal to the number of possible combinations of n elements from a population of size N, which is denoted by $\binom{N}{n}$, while in sampling with replacement it is equal to $N^n$.
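Both the sample counts and the variance ratio can be confirmed by brute force on a small population. The following is a minimal sketch that assumes the population $\{0, 1, \ldots, N-1\}$ and uses exact fractions; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def variance(values):
    """Population variance of a list of exact Fractions."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def sample_mean_variances(N, n):
    """Enumerate every sample of size n from {0, ..., N-1} and return
    (variance of the mean with replacement, variance of the mean without)."""
    population = range(N)
    means_with = [Fraction(sum(s), n) for s in product(population, repeat=n)]
    means_without = [Fraction(sum(s), n) for s in combinations(population, n)]
    assert len(means_with) == N ** n          # ordered samples with replacement
    assert len(means_without) == comb(N, n)   # combinations without replacement
    return variance(means_with), variance(means_without)

N, n = 5, 3
var_with, var_without = sample_mean_variances(N, n)
# The ratio of the two variances should equal (N - 1) / (N - n).
print(var_with / var_without, Fraction(N - 1, N - n))
```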

Proposed Method
The proposed method for evaluating the sample CV as an estimator is described in the following steps:
1. Recording the samples as n-dimensional vectors $(x_{i1}, x_{i2}, \ldots, x_{in})$, where i denotes the sample number.
2. Calculating the number of samples by solving the following equation:
$$\sum_{j=0}^{N-1} B_j = n,$$
where $B_j \in \{0, 1, \ldots, n\}$ is a non-negative integer and denotes the number of times that we observe the number j in our sample.
3. Calculating the number of permutations that can be formed with the values of each n-dimensional vector.
4. Calculating the CV of each sample. It can be shown that:
$$CV = \frac{s}{\bar{x}} = \frac{1}{\sum_{j} j B_j}\sqrt{\frac{n\left(n\sum_{j} j^2 B_j - \left(\sum_{j} j B_j\right)^2\right)}{n-1}}.$$
5. Calculating the percentage (%) of the number of samples where the CV lies within the interval $(\sqrt{3}/3, 1]$. This percentage will henceforth be called rate of return for the sake of brevity.
The probability distribution of the CV is given by the proportion of samples attaining each value c:
$$P(CV = c) = \frac{\text{number of samples with } CV = c}{\text{total number of samples}}.$$
We apply the proposed method when using both sampling methods, but the following propositions are useful in the case of sampling with replacement.
Proposition 1. In the case of sampling with replacement, the proposed algorithm examines $\binom{N+n-1}{n}$ samples.
Proof. As mentioned in Section 2, the number of possible samples of n elements drawn with replacement from a population of size N is equal to $N^n$. We will, however, prove that the proposed algorithm examines $\binom{N+n-1}{n}$ samples, which is much smaller than $N^n$.
Assume that $(x_{i1}, x_{i2}, \ldots, x_{in})$ follow a DUD $DU\{0, \ldots, N-1\}$. Since the sample CV does not depend on the order of the sample elements, we only need to find the number of ordered samples $(x_{i1}, x_{i2}, \ldots, x_{in})$ where $x_{i1} \le x_{i2} \le \ldots \le x_{in}$. It is easy to see that the number of ordered samples is equal to the number of non-negative integer solutions of the equation $\sum_{j=0}^{N-1} B_j = n$ [26], which is $\binom{N+n-1}{n}$ and completes the proof of Proposition 1.
Proposition 2. In the case of sampling with replacement, the number of distinct values of CV is $\binom{N+n-1}{n} - n(N-2) - 1$.
Proof. Samples $(i, 0, 0, \ldots, 0)$ for $i = 1, \ldots, N-1$ have the same CV because we have $B_i = 1$, $B_0 = n-1$ and therefore:
$$CV = \sqrt{n},$$
which does not depend on i. Similarly, considering samples $(i, i, 0, \ldots, 0)$ for $i = 1, \ldots, N-1$, we have $B_i = 2$, $B_0 = n-2$ and therefore:
$$CV = \sqrt{\frac{n(n-2)}{2(n-1)}}.$$
We can see generally that when $B_i = m$, $B_0 = n-m$, for $m = 1, \ldots, n$, then:
$$CV = \sqrt{\frac{n(n-m)}{m(n-1)}}.$$
As a result, we can neglect $N-2$ replications of CV for each sample with $B_i = m$, $B_0 = n-m$. Finally, the case of $(0, 0, \ldots, 0)$ is also removed as it has the same CV as the samples $(i, i, \ldots, i)$. In this way, we show that the number of repeated CV values is $n(N-2) + 1$, which completes the proof of Proposition 2.
Stat., Optim. Inf. Comput. Vol. 7, December 2019
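The two counts can be reproduced by direct enumeration. This is a sketch under stated assumptions: the sample standard deviation uses the $n-1$ divisor, the all-zero sample is assigned CV = 0 (the convention implied by the proof of Proposition 2), the stars-and-bars count is written as $\binom{N+n-1}{n}$, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb

def cv_squared(sample):
    """Exact squared sample CV, s^2 / mean^2, with the n-1 divisor in s^2."""
    n = len(sample)
    total = sum(sample)
    if total == 0:
        return Fraction(0)  # assumption: the all-zero sample is given CV = 0
    sum_sq = sum(x * x for x in sample)
    return Fraction(n * (n * sum_sq - total ** 2), (n - 1) * total ** 2)

def examined_and_distinct(N, n):
    """Return (number of sorted samples, number of distinct CV values)."""
    sorted_samples = list(combinations_with_replacement(range(N), n))
    distinct = {cv_squared(s) for s in sorted_samples}
    return len(sorted_samples), len(distinct)

N, n = 3, 6
examined, distinct = examined_and_distinct(N, n)
# Sorted samples examined vs. the stars-and-bars count and the full N**n.
print(examined, comb(N + n - 1, n), N ** n)
# Distinct CV values vs. the count after removing n(N-2)+1 repeats.
print(distinct, comb(N + n - 1, n) - n * (N - 2) - 1)
```

Comparing squared CVs as exact fractions avoids floating-point ties; since all values are non-negative, distinct squared CVs correspond to distinct CVs.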

DU{0,1,2}
The goodness of the sample CV estimator of the r.v. X following a DUD $DU\{0, 1, 2\}$ is evaluated, i.e. for N = 3. The case of samples of size 6, which are taken from sampling with replacement, is fully described below.
The equation $\sum_{j=0}^{N-1} B_j = 6$ has 28 solutions. Each of the 28 solutions is represented in Table 1 in the form of a 6-dimensional vector $(x_{i1}, x_{i2}, \ldots, x_{i6})$, $i = 1, \ldots, 28$, the coordinates of which can take the values 0, 1 and/or 2 in all possible ways. The number of permutations that can be formed with the values given in each row of the table is denoted by r.
All in all, we get $N^n = 3^6 = 729$ samples of size 6. We notice that in 390 of them, which are marked in bold, the CV lies within the interval $(\sqrt{3}/3, 1]$. Finally, the number of distinct values of the CV is 21, which is found by removing 7 duplicated values (see Proposition 2). We repeat the same procedure for samples of size n = 2, 3, 4, 5, 7, 8, 9 and 10 drawn with replacement and record the total number of samples as well as the number and the percentage of the samples where the CV lies within the interval $(\sqrt{3}/3, 1]$ (Table 2). It is noticed that the percentage of the values of the sample CV within the bounds of the population CV increases in parallel with the sample size (Figure 1).
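The count of 390 out of 729 can be reproduced by weighting each sorted sample by its number of permutations r. The sketch below assumes the sample standard deviation uses the $n-1$ divisor and treats the all-zero sample as lying outside the interval; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial

# Squared bounds of the interval (sqrt(3)/3, 1], kept exact.
LOWER_SQ, UPPER_SQ = Fraction(1, 3), Fraction(1)

def cv_squared(sample):
    """Exact squared sample CV, or None for the all-zero sample."""
    n, total = len(sample), sum(sample)
    if total == 0:
        return None
    sum_sq = sum(x * x for x in sample)
    return Fraction(n * (n * sum_sq - total ** 2), (n - 1) * total ** 2)

def permutations_of(sample):
    """Number r of distinct orderings: n! / (B_0! B_1! ... B_{N-1}!)."""
    r = factorial(len(sample))
    for count in Counter(sample).values():
        r //= factorial(count)
    return r

def rate_of_return_with_replacement(N, n):
    """Return (samples with CV in (sqrt(3)/3, 1], total samples N**n)."""
    inside = 0
    for sample in combinations_with_replacement(range(N), n):
        cv_sq = cv_squared(sample)
        if cv_sq is not None and LOWER_SQ < cv_sq <= UPPER_SQ:
            inside += permutations_of(sample)
    return inside, N ** n

inside, total = rate_of_return_with_replacement(3, 6)
print(inside, total, round(100 * inside / total, 2))
```

Only 28 sorted samples are visited for N = 3 and n = 6, yet the permutation weights recover the count over all 729 ordered samples.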

DU{0,1,2,3}
The goodness of the sample CV estimator of the r.v. X following a DUD $DU\{0, 1, 2, 3\}$ is evaluated, i.e. for N = 4. Table 3 presents the results of sampling with replacement for samples of size n = 2, 3, 4, 5, 6 and 7. An increasing trend in the percentage of the samples where the CV lies within the bounds of the population CV is obvious (Figure 2).

DU{0,1,2,3,4}
The goodness of the sample CV estimator of the r.v. X following a DUD $DU\{0, 1, 2, 3, 4\}$ is evaluated, i.e. for N = 5. Table 4 presents the results of sampling without replacement for samples of size n = 2, 3, 4 and 5. We notice that there is only one sample of size 5, which is the population itself. In this case, the sample CV and the population CV coincide and are equal to 0.7906. The percentage of the samples where the CV lies within the interval $(\sqrt{3}/3, 1]$ is still increasing and has a strong linear trend with respect to the sample size (Figure 3).

DU{0,1,2,3,4,5,6,7}
Finally, the goodness of the sample CV estimator of the r.v. X following a DUD $DU\{0, 1, 2, 3, 4, 5, 6, 7\}$ is evaluated, i.e. for N = 8. Table 5 presents the results of sampling without replacement for samples of size n = 2, 3, . . ., 8. Similar to the case of N = 5, we notice that there is only one sample of size 8, which is the population itself. In this case, the sample CV and the population CV coincide and are equal to 0.6999. In general, the rate of return has an increasing trend with respect to the sample size (Figure 4).
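For sampling without replacement every combination is equally likely, so no permutation weights are needed. The following sketch, which assumes the $n-1$ divisor in the sample standard deviation and uses illustrative function names, enumerates all $\binom{N}{n}$ samples and reproduces the values 0.7906 and 0.6999 for the full-population samples.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, sqrt

def cv_squared(sample):
    """Exact squared sample CV with the n-1 divisor (requires n >= 2)."""
    n, total = len(sample), sum(sample)
    sum_sq = sum(x * x for x in sample)
    return Fraction(n * (n * sum_sq - total ** 2), (n - 1) * total ** 2)

def rate_of_return_without_replacement(N, n):
    """Return (samples with CV in (sqrt(3)/3, 1], total samples C(N, n))."""
    if n < 2:
        raise ValueError("the sample CV needs at least two observations")
    samples = list(combinations(range(N), n))
    inside = sum(1 for s in samples if Fraction(1, 3) < cv_squared(s) <= 1)
    return inside, len(samples)

for N in (5, 8):
    for n in range(2, N + 1):
        inside, total = rate_of_return_without_replacement(N, n)
        print(N, n, total, inside, round(100 * inside / total, 2))

# The single sample of size N is the population itself.
print(round(sqrt(cv_squared(tuple(range(5)))), 4))
print(round(sqrt(cv_squared(tuple(range(8)))), 4))
```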

Correlation between sample size and rate of return
After steps 1 to 5 have been completed, we investigate the increase in the rate of return when the sample size, n, increases by one unit.
The case of N = 3 and sampling with replacement is fully described, where the value of the correlation coefficient exceeds 0.8 (Table 6). In other words, at least 64% of the variation in the rate of return is explained by the change in the sample size. The value of the correlation coefficient indicates that there is a linear relationship between the rate of return and the sample size, and the linear regression equation is given by
$$\hat{y} = b_0 + b_1 n,$$
where $\hat{y}$ is the estimate of the rate of return, n is the sample size and $b_0$, $b_1$ are the regression coefficients. The value of the coefficient $b_1$ is derived from the pairs (sample size, rate of return (%)) of Table 2 and on average is slightly higher than 2. This allows us to assume an increase in the rate of return of approximately 2% for each one-unit increase in the sample size. Therefore, with an initial goodness of the CV estimator around 35%, a sample size of 35 to 40 elements is required in order to approach 100%. The value of the coefficient $b_1$ resulting from regression with data including samples of size n > 10 is marginally decreasing. Consequently, the average will fall below 2. Table 6 presents the values of the correlation coefficient and the coefficient of determination for both sampling cases examined. All values of the correlation coefficients are positive, therefore there is a positive linear relationship between the two variables [14], the rate of return and the sample size, whether the sample is taken with replacement or without.
The value of the correlation coefficient for N = 5 and sampling without replacement (r = 0.9959) confirms that there is almost a perfect linear relationship between the sample size and the rate of return, as seen in Figure 3. The corresponding value of the coefficient of determination confirms the fit of the regression line to the DUD data.
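The regression and correlation step can be sketched as follows. The code recomputes the rates of return for N = 3 with replacement by enumeration (assuming the $n-1$ divisor in the sample standard deviation and counting the all-zero sample as outside the interval) and then fits an ordinary least-squares line; the function names are illustrative, and the fitted values are not claimed to match Tables 2 and 6 digit for digit.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial, sqrt

def rate_of_return(N, n):
    """Percentage of the N**n samples with CV in (sqrt(3)/3, 1]."""
    inside = 0
    for sample in combinations_with_replacement(range(N), n):
        total = sum(sample)
        if total == 0:
            continue  # CV undefined for the all-zero sample; counted as outside
        sum_sq = sum(x * x for x in sample)
        cv_sq = Fraction(n * (n * sum_sq - total ** 2), (n - 1) * total ** 2)
        if Fraction(1, 3) < cv_sq <= 1:
            weight = factorial(n)  # permutations of this sorted sample
            for count in Counter(sample).values():
                weight //= factorial(count)
            inside += weight
    return 100 * inside / N ** n

def fit_line(xs, ys):
    """Least-squares intercept b0, slope b1 and Pearson r for paired data."""
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1, sxy / sqrt(sxx * syy)

sizes = list(range(2, 11))
rates = [rate_of_return(3, n) for n in sizes]
b0, b1, r = fit_line(sizes, rates)
print("b0 =", round(b0, 3), "b1 =", round(b1, 3), "r =", round(r, 4), "r^2 =", round(r * r, 4))
```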

Application
From 1 January until 31 December 2017, 840 patients were hospitalised in the RSA IGEA - Rehabilitation Residential Centre in Trieste, Italy. They were given the Barthel Index (BI) questionnaire [18] both at the beginning and at the end of the rehabilitation. This questionnaire contains 10 questions and assesses the degree of independence in daily activities, such as feeding, clothing and personal hygiene.
83 patients were excluded because their final data are not available; as we were informed by the Medical Director, Dr Paolo Da Col, these patients were discharged before completing the rehabilitation path. The sample consists of 757 patients with an average age of 84.09 ± 8.93 years, of whom 522 (69%) are females and 235 (31%) are males.
We divided the questions into the following 3 categories based on the possible answers:
i. questions about help with personal hygiene and bathing were answered with 0 (dependent) or 1 (independent).
ii. questions about help with feeding, dressing/undressing, using the toilet and climbing stairs were answered with 0 (dependent), 1 (with help) or 2 (independent) and questions about fecal and urinary incontinence were answered with 0 (incontinent), 1 (occasionally) or 2 (continent).
iii. the question about transfer from chair to bed was answered with 0 (not able), 1 (major help), 2 (minor help) or 3 (independent) and the question about walking was answered with 0 (immobile), 1 (moving with wheelchair), 2 (with help of a person) or 3 (with aids for disabled).
The above answers are coded on equally spaced levels; therefore, assuming that the expected average, approximated by the sample estimator, is meaningful, we calculate the coefficients of variation of all the questions before and after the rehabilitation in each category, as well as in total (Table 7).
We first notice a decrease in the CV in 9 of the 10 cases investigated, which indicates the improvement of the patients' condition. The increase in the CV in the case of help with bathing was expected and the reason is, as we were informed by the Medical Director, that the patients rarely stay alone in the bathroom. We can also point out that in the case of personal hygiene of patients, which is a binary variable, the final CV has become less than 1, which means that the patients will become independent in their personal hygiene with higher probability after the end of the rehabilitation path.
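The link between a CV below 1 and the proportion of independent patients can be made explicit for a binary variable. The short derivation below is a sketch that assumes the sample standard deviation uses the $n-1$ divisor; $k$ denotes the number of patients answering 1 among $n$, and $\hat{p} = k/n$ is notation introduced here for illustration.

```latex
\bar{x} = \frac{k}{n} = \hat{p}, \qquad
s^2 = \frac{1}{n-1}\left(k - \frac{k^2}{n}\right) = \frac{n\,\hat{p}\,(1-\hat{p})}{n-1},
\qquad
CV^2 = \frac{s^2}{\bar{x}^2} = \frac{n\,(1-\hat{p})}{(n-1)\,\hat{p}} .

CV < 1 \iff n(1-\hat{p}) < (n-1)\,\hat{p} \iff \hat{p} > \frac{n}{2n-1}.
```

For n = 757 the threshold is $757/1513 \approx 0.5003$, so a CV below 1 indicates that slightly more than half of the patients answered 1 (independent).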
From Table 7, we get the following rates of return by category of questions: 25% for N = 2, 50% for N = 3 and 25% for N = 4. Finally, we observe that while the initial CV does not fall within the bounds of the population CV, the final CV does, which coincides with the improvement of the patients' condition.
We also used likelihood ratio tests (LRTs) to test whether there is a statistically significant difference between the coefficients of variation before and after the rehabilitation. The results indicate that there is enough evidence to reject the null hypothesis at the 5% level of significance.
Remark: Only in the case of help with climbing stairs is the value of the test statistic not available. Furthermore, paired samples t-tests were used for comparing the average responses before and after the rehabilitation and the results indicate that there is enough evidence to reject the null hypothesis at the 1% level of significance, apart from the case of help with bathing, which is not surprising as explained earlier.
Table 8 adds the information that the average responses before and after the rehabilitation are all positively correlated and reports the 95% confidence intervals that confirm the improvement of the patients' condition.

Conclusions and Future Directions
In this paper, we proposed a method to evaluate the goodness of the sample CV and investigated the case of the discrete r.v. X following the DUD when random samples are taken from the population. [16] investigated only the case of N = 3 and n = 4 when random samples are taken with replacement from the population, while we further investigated the cases of (a) N = 3 and n = 2, 3, . . ., 10 and (b) N = 4 and n = 2, 3, . . ., 7 and, when it comes to samples without replacement, the cases of (c) N = 5 and n = 2, 3, 4, 5 and (d) N = 8 and n = 2, 3, . . ., 8.
The value of the sample CV is strictly associated with the number of zeros included in each sample and thus, we stated and proved a proposition that derives the number of distinct values of the CV in the case of sampling with replacement and we also introduced a model of low computational cost describing the relationship between the CV and the sample. As a result, given the sample size, n, we can calculate the number of non-zero elements required in order for the sample CV to lie within the bounds of the population CV, which makes it a good estimator. As the population size, N, increases, the percentage of samples, taken both with replacement and without replacement, giving the sample CV within the bounds of the population CV, is also increasing. The results are of interest for the scientific community, as the sample CV can be used from now on as an evaluation measure for resampling techniques and the discrete distributions are gaining ground.
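For the samples considered in Proposition 2, with m equal non-zero elements and n - m zeros, the required number of non-zero elements follows from $CV^2 = n(n-m)/(m(n-1))$. The sketch below covers only this special case and assumes the $n-1$ divisor in the sample standard deviation; the function name is hypothetical.

```python
from fractions import Fraction

def nonzero_counts_within_bounds(n):
    """Values of m for which a sample with m equal non-zero elements and
    n - m zeros has CV in (sqrt(3)/3, 1], using CV^2 = n(n-m) / (m(n-1))."""
    return [m for m in range(1, n + 1)
            if Fraction(1, 3) < Fraction(n * (n - m), m * (n - 1)) <= 1]

for n in (3, 6, 10, 40):
    print(n, nonzero_counts_within_bounds(n))
```

Solving the two inequalities gives $n^2/(2n-1) \le m < 3n^2/(4n-1)$, so roughly between half and three quarters of the elements must be non-zero.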
Regression analysis further describes the relationship between the sample size and the rate of return. The value of the correlation coefficient indicates that there is a stronger positive linear correlation between the sample size and the rate of return for samples taken without replacement than for samples taken with replacement. Linear regression models predict that samples of size n > 40 give high rates of samples where the sample CV lies within the bounds of the population CV. The numerical study showed that this further investigation can easily evaluate the variation in discrete data.
Directions for future research include using the proposed method to construct confidence intervals for the CV and examining the proposed method in (a) larger populations and (b) other families of discrete uniform distributions, like the Marshall-Olkin discrete uniform (MODU) distribution [27], the generalized DUD, known as Harris Discrete Uniform (HDU) distribution, [24] and the exponentiated Marshall-Olkin discrete uniform (E-MO-U) distribution [28].
Here m = 1, 2, . . ., n. Since i = 1, 2, . . ., N − 1, there are N − 1 samples having the same CV and thus we can remove N − 2 repeated values of the CV; since m = 1, 2, . . ., n, we can remove n(N − 2) repeated values among the possible values of the CV.

Table 1. Description of samples and values of CV for N = 3 and n = 6.

Table 6. Values of correlation coefficient and coefficient of determination.

Table 8. Paired samples correlations and 95% CI for Before-After.