Estimation of Zero-Inflated Population Mean with Highly Skewed Nonzero Component: A Bootstrapping Approach

This paper adopts a bootstrap procedure within the maximum pseudo-likelihood method under probability sampling designs. It estimates the mean of a population that is a mixture of excess zeros and a skewed nonzero sub-population. Simulation studies show that the bootstrap confidence intervals for a zero-inflated log-normal population consistently capture the true mean. The proposed method is applied to a real-life data set.


Introduction
An underlying population that contains many zeros appears in a wide range of literature, from social sciences to natural sciences. Data with too many zero values are common in several applications such as insurance, reliability, meteorology, ecology, auditing, and manufacturing; Paneru [15] cited examples of data with too many zero values in various disciplines. In meteorology, for example, a geographical region may have many days with no rainfall (zero precipitation is recorded); the rest of the days are recorded with a value of positive precipitation. Similarly, in a production process, a manufacturing plant aims to produce a large proportion of non-defective items (a zero-defect is recorded) and a small proportion of defective items where the number of defects follows a probability distribution [11,14]. In general, zero-inflated data has a large spike of zero values with a smaller proportion of nonzero values and is thus called a zero-inflated population (ZIP). These examples suggest that a zero-inflated population could consist of a nonzero sub-population with continuous or count data.
Excess zeros and skewness in the zero-inflated population are commonly encountered phenomena that limit traditional methods for estimating the population mean. Similar issues arise in developing regression models for zero-inflated data. Estimating the population mean based on the normal approximation or the central limit theorem (CLT) gives poor estimates because of apparent skewness caused by zero values, skewness in the population's nonzero component, or both. Different approaches and techniques have been developed to overcome the issue of zero inflation. In a two-component mixture model, the population is split into zero and nonzero components, where the nonzero component follows a known probability distribution [10]. Under parametric assumptions, likelihood ratio statistics and bootstrap techniques are applied to construct confidence intervals for the mean of diagnostic test charge data containing many zeros [26]. Other approaches to constructing confidence intervals for a zero-inflated population mean include a pseudo empirical likelihood approach in complex sampling surveys [2] and a mixture model of true zero exposures and a log-normal distribution with left censoring [22]. The zero-inflated log-normal distribution is important in real life because many populations are highly skewed, such as rainfall, air contamination exposure, and diagnostic test charge, and different approaches exist to compare the means of such populations [27,23,25,26,12]. Satter and Zhao [20] proposed empirical and adjusted empirical likelihood methods to construct nonparametric confidence intervals for the mean of a zero-inflated population. Satter and Zhao [21] further extended their empirical methods to the jackknife empirical likelihood method to estimate the difference in means of two zero-inflated skewed populations. Pittman et al. [19] describe different methods for analyzing overdispersed zero-inflated count data and illustrate their application to cigarette and marijuana smoking data.
Furthermore, the modeling of zero-inflated count data and its applications have been studied by many researchers, e.g., [24,9,3,13,4]. However, many of the existing methods for estimating the mean of ZIPs do not address data produced through complex probability sampling designs such as stratified sampling and cluster sampling, which are commonly used in survey sampling. Chen et al. [1] proposed a maximum pseudo-likelihood approach under complex probability sampling designs for interval estimation of the zero-inflated population mean. Paneru and Chen [16,17] extended the pseudo-likelihood approach to develop regression models for zero-inflated data.
Although the pseudo-likelihood approach addresses complex sampling designs and yields confidence intervals with better coverage than existing methods, constructing those intervals from the pseudo-likelihood ratio test statistic is mathematically and computationally complex. To overcome this complexity, Paneru et al. [18] adopted the pseudo-likelihood approach in a bootstrap procedure as an alternative method for constructing interval estimates for the mean of the zero-inflated population. The bootstrap procedure is mathematically, computationally, and intuitively more straightforward than existing methods. Paneru et al. [18] apply it to the normal model, assuming that the nonzero component is normally distributed.
This paper extends Paneru et al. [18] to highly skewed data, where the nonzero component brings extra skewness to a zero-inflated population. It describes the two-component mixture model, the maximum pseudo-likelihood function, and the derivation of parameter estimates in the zero-inflated log-normal mixture model. It then explains the bootstrap procedure and reports the simulation results. Finally, it demonstrates the application to actual data on extra work hours from the Canadian Labour Force Survey-2000.

Methodology
This paper follows the concept of the two-component parametric mixture model from [10], the maximum pseudo-likelihood approach developed in [1], and the bootstrap procedure adopted in [18]. In ZIPs, two components exist: one consists solely of zero values, and the second consists of nonzero values that follow some known probability distribution. The two-component mixture model for the zero-inflated population is defined by

$$f(y; \alpha, \mu, \sigma) = (1 - \alpha)\, I(y = 0) + \alpha\, g(y; \mu, \sigma)\, I(y \neq 0), \tag{1}$$

where $\alpha$ is the proportion of nonzero values, $g(\cdot\,; \mu, \sigma)$ is the probability density of the nonzero component, $\mu$ is the mean and $\sigma$ is the nuisance parameter of the nonzero component, and $I$ is the indicator function. The parameter of interest, the mean of the zero-inflated population, is $\theta = \alpha\mu$.

ESTIMATION OF ZERO-INFLATED POPULATION MEAN
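Consider a survey population of $N$ units whose values $y_1, \dots, y_N$ are independently generated from the model (1). Writing $g(\cdot\,; \mu, \sigma)$ for the density of the nonzero component, the log-likelihood function of all $N$ units of the survey population is

$$l(\alpha, \mu, \sigma) = \sum_{i=1}^{N} \Big[ I(y_i = 0) \log(1 - \alpha) + I(y_i \neq 0) \big\{ \log \alpha + \log g(y_i; \mu, \sigma) \big\} \Big].$$

Let $s$ be a random subset of $n$ sampling units with values $y_1, \dots, y_n$ selected from the survey population. Let $m$ ($< n$) be the number of zero values among the $n$ observed units, arranged as

$$y_1 = \cdots = y_m = 0, \qquad y_{m+1}, \dots, y_n \neq 0.$$

As explained in [1], consider a probability sampling design in which the random subset $s$ of $n$ sampling units is obtained from the survey population with inclusion probabilities $\pi_i$, $i = 1, \dots, n$. Under the probability sampling design, the pseudo-likelihood function, an estimate of the log-likelihood function $l(\alpha, \mu, \sigma)$, is defined by

$$\hat{l}(\alpha, \mu, \sigma) = \log(1 - \alpha) \sum_{i=1}^{m} w_i + \log \alpha \sum_{i=m+1}^{n} w_i + \sum_{i=m+1}^{n} w_i \log g(y_i; \mu, \sigma),$$

where the sampling weights $w_i = 1/\pi_i$, $i = 1, \dots, n$, are chosen such that $E(\hat{l}) = l$.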
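In real-world applications, zero-inflated data are complex to analyze because of the spike at zero combined with additional skewness in the nonzero component. To address the issue, this paper considers the nonzero component to follow a log-normal distribution in the two-component model (1). The pdf of the log-normal component of the model (1) is

$$g(y; \xi, \sigma) = \frac{1}{y \sigma \sqrt{2\pi}} \exp\left\{ -\frac{(\log y - \xi)^2}{2\sigma^2} \right\}, \qquad y > 0,$$

where $\xi$ and $\sigma$ are the mean and standard deviation of $\log y$, so that the mean of the nonzero component is $\mu = \exp(\xi + \sigma^2/2)$. The pseudo-likelihood function and the estimation of parameters for the zero-inflated log-normal population are given below.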
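As a minimal sketch of the zero-inflated log-normal model (not the authors' code), the block below draws values that are zero with probability 1 − α and log-normal otherwise, then compares the sample mean with θ = α·exp(ξ + σ²/2). The function names `draw_ziln` and `ziln_mean` and the parameter values are assumptions chosen for illustration only.

```python
import math
import random


def draw_ziln(n, alpha, xi, sigma, rng):
    """Draw n values: zero with probability 1 - alpha, else log-normal(xi, sigma)."""
    return [rng.lognormvariate(xi, sigma) if rng.random() < alpha else 0.0
            for _ in range(n)]


def ziln_mean(alpha, xi, sigma):
    """Mixture mean theta = alpha * mu, where mu = exp(xi + sigma^2 / 2)."""
    return alpha * math.exp(xi + sigma ** 2 / 2.0)


rng = random.Random(2022)
alpha, xi, sigma = 0.2, 3.0, 0.5      # illustrative values, not taken from the paper
y = draw_ziln(200_000, alpha, xi, sigma, rng)

theta = ziln_mean(alpha, xi, sigma)                   # theoretical mixture mean
sample_mean = sum(y) / len(y)                         # empirical counterpart
zero_share = sum(1 for v in y if v == 0.0) / len(y)   # should be close to 1 - alpha
print(theta, sample_mean, zero_share)
```

With a large number of draws, the sample mean and the share of zeros should sit close to θ and 1 − α, which is the sense in which θ = αµ summarizes both components of the mixture.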
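Stat., Optim. Inf. Comput. Vol. 10, September 2022

Pseudo-Likelihood Estimates of Parameters: Under the log-normal model with log-scale parameters $\xi$ and $\sigma$, and with the $m$ zero values listed first, the pseudo-likelihood function is

$$\hat{l}(\alpha, \xi, \sigma) = \log(1 - \alpha) \sum_{i=1}^{m} w_i + \log \alpha \sum_{i=m+1}^{n} w_i - \sum_{i=m+1}^{n} w_i \left[ \log y_i + \log \sigma + \tfrac{1}{2} \log(2\pi) + \frac{(\log y_i - \xi)^2}{2\sigma^2} \right].$$

By setting $\partial \hat{l} / \partial \alpha = 0$, that is, $-\sum_{i=1}^{m} w_i / (1 - \alpha) + \sum_{i=m+1}^{n} w_i / \alpha = 0$, the estimate of the proportion of non-zeros, $\alpha$, is given by

$$\hat{\alpha} = \frac{\sum_{i=m+1}^{n} w_i}{\sum_{i=1}^{n} w_i}.$$

We derive the estimate of $\xi$ by setting $\partial \hat{l} / \partial \xi = 0$:

$$\frac{\partial \hat{l}}{\partial \xi} = \frac{1}{\sigma^2} \sum_{i=m+1}^{n} w_i (\log y_i - \xi) = 0, \qquad \text{so} \qquad \hat{\xi} = \frac{\sum_{i=m+1}^{n} w_i \log y_i}{\sum_{i=m+1}^{n} w_i}.$$

Similarly, the estimate of $\sigma^2$ is derived from

$$\frac{\partial \hat{l}}{\partial \sigma} = \sum_{i=m+1}^{n} w_i \left[ -\frac{1}{\sigma} + \frac{(\log y_i - \xi)^2}{\sigma^3} \right].$$

Then, by setting $\partial \hat{l} / \partial \sigma = 0$ and replacing $\xi$ with $\hat{\xi}$,

$$\hat{\sigma}^2 = \frac{\sum_{i=m+1}^{n} w_i (\log y_i - \hat{\xi})^2}{\sum_{i=m+1}^{n} w_i}.$$

In summary, the pseudo-likelihood estimates of the parameters under the log-normal model are $\hat{\alpha}$, $\hat{\xi}$, and $\hat{\sigma}^2$ as given above. Thus, the pseudo-likelihood estimate of the mean of the zero-inflated log-normal population is given by

$$\hat{\theta} = \hat{\alpha} \exp\left( \hat{\xi} + \frac{\hat{\sigma}^2}{2} \right).$$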
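The closed-form estimates of this section can be sketched in a few lines. The block below is an illustration under stated assumptions, not the authors' implementation: the function names `pseudo_estimates` and `pseudo_loglik` and the toy data are hypothetical, and the weights are taken to be w_i = 1/π_i. It computes the weighted estimates and evaluates the weighted log-likelihood of the zero-inflated log-normal model, so the estimates can be checked against nearby parameter values.

```python
import math


def pseudo_estimates(y, w):
    """Weighted (pseudo-likelihood) estimates for a zero-inflated log-normal sample.

    y: observed values (zeros allowed); w: sampling weights w_i = 1 / pi_i.
    Returns (alpha_hat, xi_hat, sigma2_hat, theta_hat).
    """
    nz = [(math.log(v), wi) for v, wi in zip(y, w) if v > 0]
    w_nz = sum(wi for _, wi in nz)
    alpha = w_nz / sum(w)                                  # weighted share of nonzeros
    xi = sum(wi * lv for lv, wi in nz) / w_nz              # weighted mean of log y
    sigma2 = sum(wi * (lv - xi) ** 2 for lv, wi in nz) / w_nz
    theta = alpha * math.exp(xi + sigma2 / 2.0)
    return alpha, xi, sigma2, theta


def pseudo_loglik(y, w, alpha, xi, sigma2):
    """Weighted log-likelihood of the zero-inflated log-normal model."""
    total = 0.0
    for v, wi in zip(y, w):
        if v > 0:
            lv = math.log(v)
            total += wi * (math.log(alpha) - lv
                           - 0.5 * math.log(2 * math.pi * sigma2)
                           - (lv - xi) ** 2 / (2 * sigma2))
        else:
            total += wi * math.log(1 - alpha)
    return total


# Toy data, made up for illustration: four zeros and three positive values.
y = [0, 0, 0, 0, 12.0, 30.0, 75.0]
w = [10, 10, 20, 20, 10, 20, 40]
alpha_hat, xi_hat, sigma2_hat, theta_hat = pseudo_estimates(y, w)
best = pseudo_loglik(y, w, alpha_hat, xi_hat, sigma2_hat)
print(alpha_hat, xi_hat, sigma2_hat, theta_hat, best)
```

Evaluating `pseudo_loglik` at values slightly away from the estimates should give a smaller value than `best`, which is a quick numerical check on the derivation.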

Bootstrap Estimation and Simulation Results
The bootstrap technique was introduced by Efron [5], and further developments are found in Efron [6,7]. The logic of the bootstrap is that computational power can overcome the analytical complexity of standard statistical approaches [8]. This paper uses the bootstrap approach where the data follow the two-component mixture model (1), and it presents simulation results for both parametric and nonparametric bootstrap estimates. We conduct simulation studies to calculate percentile bootstrap confidence intervals of θ = αµ, the mean of zero-inflated populations. The percentile method uses the empirical distribution of the bootstrap replicates as the reference sampling distribution of $\hat{\theta}$, the pseudo-likelihood estimate under complex probability sampling designs given in equation (6).

For the simulation studies of the zero-inflated log-normal model, finite populations of size N = 10,000 are randomly generated and divided into four (k = 4) strata of sizes 500, 1500, 3000, and 5000. For a random sample of size n = 500, the inclusion probabilities for the strata are set to $\pi_1 = 140/500$, $\pi_2 = 120/1500$, $\pi_3 = 130/3000$, and $\pi_4 = 110/5000$. The corresponding weights are $w_1 = 500/140$, $w_2 = 1500/120$, $w_3 = 3000/130$, and $w_4 = 5000/110$, respectively. A random sample of size $n_j$, $j = 1, \dots, 4$, is drawn from each stratum using simple random sampling without replacement.
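The stratified design and the nonparametric percentile bootstrap described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the log-scale parameters, the way the population is cut into strata, the choice to resample with replacement within each stratum, and the helper names (`pseudo_theta`, `percentile_ci`, `flatten`) are all assumptions. Only the stratum sizes and sample sizes come from the text.

```python
import math
import random


def pseudo_theta(y, w):
    """Pseudo-likelihood estimate of theta = alpha * exp(xi + sigma^2 / 2)."""
    logs = [(math.log(v), wi) for v, wi in zip(y, w) if v > 0]
    if not logs:                      # a resample with no positive values
        return 0.0
    w_nz = sum(wi for _, wi in logs)
    xi = sum(wi * lv for lv, wi in logs) / w_nz
    s2 = sum(wi * (lv - xi) ** 2 for lv, wi in logs) / w_nz
    return (w_nz / sum(w)) * math.exp(xi + s2 / 2.0)


def percentile_ci(reps, level=0.95):
    """Percentile interval read off the sorted bootstrap replicates."""
    reps = sorted(reps)
    lo = reps[math.floor((1 - level) / 2 * (len(reps) - 1))]
    hi = reps[math.ceil((1 + level) / 2 * (len(reps) - 1))]
    return lo, hi


def flatten(samples):
    """Turn [(values, weight), ...] per stratum into parallel lists y, w."""
    y = [v for vals, _ in samples for v in vals]
    w = [wt for vals, wt in samples for _ in vals]
    return y, w


rng = random.Random(10)
alpha, xi, sigma = 0.2, 4.56, 0.29           # illustrative, not the paper's exact setup
population = [rng.lognormvariate(xi, sigma) if rng.random() < alpha else 0.0
              for _ in range(10_000)]
stratum_sizes = [500, 1500, 3000, 5000]      # N_j from the text
sample_sizes = [140, 120, 130, 110]          # n_j implied by the inclusion probabilities

# Cut the population into strata; draw an SRS without replacement in each.
strata_samples, start = [], 0
for N_j, n_j in zip(stratum_sizes, sample_sizes):
    stratum = population[start:start + N_j]
    start += N_j
    strata_samples.append((rng.sample(stratum, n_j), N_j / n_j))  # weight = 1 / pi_j

theta_hat = pseudo_theta(*flatten(strata_samples))

# Nonparametric bootstrap (assumed scheme): resample with replacement within strata.
B = 1000
reps = []
for _ in range(B):
    boot = [(rng.choices(vals, k=len(vals)), wt) for vals, wt in strata_samples]
    reps.append(pseudo_theta(*flatten(boot)))

lower, upper = percentile_ci(reps)
print(theta_hat, lower, upper)
```

Resampling within strata keeps each stratum's sample size and weight fixed across replicates, so the weights sum to N = 10,000 in every bootstrap sample.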
In the first population condition, the nonzero component is generated using µ = 100 and σ = 30. Table 1 and Figure 1 present the resulting bootstrap CIs and distributions, respectively. We find that the parametric bootstrap method results in narrower intervals across all levels of the nonzero proportion, α. At low levels of α, bootstrap percentile intervals are less centered on the population value than those at high levels of α. In the second population condition, the nonzero component is generated using µ = 50 and σ = 25. Table 2 and Figure 2 present the resulting bootstrap CIs and distributions, respectively. Results are similar to the first condition. A notable difference is that the nonparametric bootstrap CIs are more heavily weighted towards zero. Because of this tendency, all nonparametric CIs are shifted to the left of the population value, although the intervals still contain the parameter of interest. In the final population condition, the nonzero component is generated using µ = 20 and σ = 15, and results are presented in Table 3 and Figure 3. Results are similar to the previous conditions. For α = 0.50, we observe only a narrow overlap between the nonparametric and parametric bootstrap CIs, which can be seen in the bootstrap distributions presented in Figure 3. Caution may be needed in interpreting bootstrap intervals in populations with a small mean.
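For comparison with the nonparametric scheme, a parametric percentile bootstrap can be sketched as below. The paper does not spell out how parametric resamples are generated under the stratified design, so this block rests on an assumption: each resample keeps the original weights and regenerates all values from the fitted zero-inflated log-normal model. The helper names and the data-generating values are hypothetical.

```python
import math
import random


def fit_ziln(y, w):
    """Pseudo-likelihood fit; returns (alpha_hat, xi_hat, sigma2_hat)."""
    logs = [(math.log(v), wi) for v, wi in zip(y, w) if v > 0]
    w_nz = sum(wi for _, wi in logs)
    xi = sum(wi * lv for lv, wi in logs) / w_nz
    s2 = sum(wi * (lv - xi) ** 2 for lv, wi in logs) / w_nz
    return w_nz / sum(w), xi, s2


def theta_of(alpha, xi, s2):
    """Mean of the zero-inflated log-normal model."""
    return alpha * math.exp(xi + s2 / 2.0)


def parametric_percentile_ci(y, w, B, rng, level=0.95):
    """Fit the model, simulate B samples from the fit, and take percentiles."""
    alpha, xi, s2 = fit_ziln(y, w)
    sigma = math.sqrt(s2)
    reps = []
    while len(reps) < B:
        y_star = [rng.lognormvariate(xi, sigma) if rng.random() < alpha else 0.0
                  for _ in y]
        if not any(v > 0 for v in y_star):
            continue                  # skip: the model cannot be fit without positives
        reps.append(theta_of(*fit_ziln(y_star, w)))
    reps.sort()
    lo = reps[math.floor((1 - level) / 2 * (B - 1))]
    hi = reps[math.ceil((1 + level) / 2 * (B - 1))]
    return theta_of(alpha, xi, s2), lo, hi


rng = random.Random(7)
# Weights from the simulation design in the text; the values are illustrative draws.
w = [500 / 140] * 140 + [1500 / 120] * 120 + [3000 / 130] * 130 + [5000 / 110] * 110
y = [rng.lognormvariate(3.8, 0.47) if rng.random() < 0.2 else 0.0 for _ in w]

theta_hat, lower, upper = parametric_percentile_ci(y, w, 1000, rng)
print(theta_hat, lower, upper)
```

Because every replicate is drawn from a smooth fitted distribution rather than from the observed values, the replicates do not inherit the sample's gaps and ties, which is one plausible reason parametric intervals come out narrower when the log-normal assumption holds.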
In Tables 1, 2, and 3, $\hat{\theta}$ is the pseudo-maximum likelihood estimate from simulated data, $LL_{.025}$ and $UL_{.975}$ represent the lower limit and upper limit, respectively, of the confidence intervals, and $\bar{X}$ is the bootstrap sample mean. Similarly, in Figures 1, 2, and 3, vertical dashed lines represent the true value of θ at each level of α.

In the simulation study, the coverage rates of bootstrap confidence intervals were investigated using the population described above. Across conditions, we varied the proportion of nonzero values (α) among 0.05, 0.1, 0.2, and 0.5, and we varied the sample size through the number of cases randomly sampled from each stratum. The estimated coverage rates for nonparametric percentile CIs are given in Table 4; the coverage rates increase consistently as the proportion of nonzero values (α) gets higher (e.g., when α = 0.5, coverage was approximately at the nominal level).
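A coverage study of this kind can be sketched as a loop over repeated stratified samples, counting how often the percentile interval contains the target. The block below is a small-scale illustration under several assumptions: the target is taken to be the finite-population mean, resampling is done within strata, the log-scale parameters are illustrative, and the numbers of repetitions (R = 60) and bootstrap replicates (B = 150) are kept deliberately small so that the sketch runs quickly. It is not a reproduction of Table 4.

```python
import math
import random


def pseudo_theta(y, w):
    """Pseudo-likelihood estimate of theta = alpha * exp(xi + sigma^2 / 2)."""
    logs = [(math.log(v), wi) for v, wi in zip(y, w) if v > 0]
    if not logs:
        return 0.0
    w_nz = sum(wi for _, wi in logs)
    xi = sum(wi * lv for lv, wi in logs) / w_nz
    s2 = sum(wi * (lv - xi) ** 2 for lv, wi in logs) / w_nz
    return (w_nz / sum(w)) * math.exp(xi + s2 / 2.0)


def bootstrap_ci(samples, B, rng):
    """95% percentile CI; samples is a list of (values, weight), one per stratum."""
    reps = []
    for _ in range(B):
        y, w = [], []
        for vals, wt in samples:
            y.extend(rng.choices(vals, k=len(vals)))   # resample within the stratum
            w.extend([wt] * len(vals))
        reps.append(pseudo_theta(y, w))
    reps.sort()
    return reps[math.floor(0.025 * (B - 1))], reps[math.ceil(0.975 * (B - 1))]


rng = random.Random(5)
alpha, xi, sigma = 0.5, 4.56, 0.29           # illustrative values, alpha = 0.5 case
pop = [rng.lognormvariate(xi, sigma) if rng.random() < alpha else 0.0
       for _ in range(10_000)]
cuts = [0, 500, 2000, 5000, 10_000]          # strata of sizes 500, 1500, 3000, 5000
strata = [pop[a:b] for a, b in zip(cuts, cuts[1:])]
n_j = [140, 120, 130, 110]
target = sum(pop) / len(pop)                 # assumed target: finite-population mean

R, B, hits = 60, 150, 0
for _ in range(R):
    samples = [(rng.sample(s, n), len(s) / n) for s, n in zip(strata, n_j)]
    lo, hi = bootstrap_ci(samples, B, rng)
    hits += lo <= target <= hi               # True counts as 1
coverage = hits / R
print(coverage)
```

With so few repetitions the estimated coverage is noisy (its standard error is roughly 0.03 near the nominal level), so a real study would use far larger R and B.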

Application
Due to the complexity of applying the pseudo-likelihood function detailed in the methodology, we apply the bootstrapping method to the Canadian Labour Force Survey-2000 for Ontario [1]. We looked at employees' overtime hours, which can be either paid or unpaid hours worked outside the main job's work schedule. The data provide information on 17,415 participants; overtime hours are highly skewed. As shown in Table 5, the six strata vary greatly in means and sizes, indicating that inclusion probability is highly associated with the response. In the data, only 3,677 respondents have nonzero overtime hours, with a mean of 92.15, which results in the nonzero proportion α = 0.211 and the mean θ = αµ = 19.46. As in [1], we treat the data set as the underlying population, with θ = 19.46, for illustration. Random samples of sizes 110, 110, 70, 70, 40, and 40 were taken from the six strata, respectively; the inclusion probabilities are listed in Table 5. As in the simulation, pseudo-likelihood estimates are computed, and nonparametric and parametric bootstrap confidence intervals are constructed. The point estimate is 21.09, and the 95% percentile confidence intervals are (14.79, 28.19) and (16.12, 25.90) for the nonparametric and parametric methods, respectively. Figure 4 gives the bootstrap distributions for both techniques, parametric and nonparametric. Note. The vertical dashed lines in Figure 4 represent the 95% CIs of each bootstrap distribution of θ̂, and the middle (red) line represents the population mean of interest.
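The population quantities quoted above follow from simple arithmetic, which the short check below reproduces from the figures in the text. The variable names are illustrative, and the nonzero mean 92.15 is used as reported (already rounded).

```python
n_total = 17_415        # participants in the data set
n_nonzero = 3_677       # respondents with nonzero overtime hours
mu_nonzero = 92.15      # mean overtime among the nonzero respondents (as reported)

alpha = n_nonzero / n_total            # proportion of nonzero values
theta = alpha * mu_nonzero             # population mean, theta = alpha * mu

sample_sizes = [110, 110, 70, 70, 40, 40]
n_sample = sum(sample_sizes)           # total sample size across the six strata

print(round(alpha, 3), round(theta, 2), n_sample)   # 0.211 19.46 440
```

The total sample of 440 units is about 2.5% of the 17,415 records treated as the population.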
The paper applies a bootstrap procedure, which is both mathematically and computationally simple, to the maximum pseudo-likelihood approach under complex sampling designs developed in [1]. It is therefore worth comparing the confidence intervals from the bootstrap procedure with those from the existing approach in [1]. The 95% confidence interval of θ (= 19.46) under the existing approach [1] is estimated to be (9.47, 24.16). The bootstrap distribution of θ̂ gives 95% percentile confidence intervals of (14.79, 28.19) and (16.12, 25.90) for the nonparametric and parametric methods, respectively. The application shows that the new, simpler approach gives a consistent and more symmetric confidence interval than the existing method.
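To make the comparison concrete, the widths of the three intervals reported in this section can be computed directly, together with a check that each contains θ = 19.46. The dictionary labels are illustrative; the numbers are those quoted in the Application section.

```python
theta = 19.46   # population mean treated as the true value

intervals = {
    "existing approach [1]": (9.47, 24.16),
    "nonparametric bootstrap": (14.79, 28.19),
    "parametric bootstrap": (16.12, 25.90),
}

# Width of each interval and whether it contains theta.
widths = {name: round(hi - lo, 2) for name, (lo, hi) in intervals.items()}
covers = {name: lo <= theta <= hi for name, (lo, hi) in intervals.items()}

for name in intervals:
    print(name, widths[name], covers[name])
```

All three intervals contain 19.46; the parametric bootstrap interval is the narrowest (width 9.78), followed by the nonparametric bootstrap (13.40) and the existing approach (14.69).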

Conclusion
Paneru et al. [18] describe the bootstrap method as an alternative approach to computing confidence intervals for a zero-inflated population mean; the approach avoids the complexity of the maximum pseudo-likelihood ratio statistic and its asymptotic distribution. Paneru et al. [18] only provide an application to the normal model, where the nonzero component is symmetrically distributed. However, real-life data are often not normally distributed. This paper extends the bootstrap approach to the case where the nonzero part of the two-component model follows a log-normal distribution. Since the pseudo-likelihood function used in the methodology assumes a known probability distribution for the nonzero component, we recommend the parametric bootstrap for applications. We also compute nonparametric bootstrap confidence intervals for comparison. Under the assumption of a log-normal nonzero component, the parametric bootstrap method gives narrower confidence intervals than the nonparametric bootstrap method. Simulation results show that both bootstrap confidence intervals, parametric and nonparametric, consistently capture the true population mean. The bootstrap method was applied to real data from the Canadian Labour Force Survey-2000 for Ontario. As that application illustrates, however, the pseudo-likelihood function is constrained by the sample sizes and inclusion probabilities. Care should be taken when selecting sample sizes, and careful consideration of the variation in inclusion probabilities helps the constructed bootstrap distributions to be symmetrical.