Flexibility of Using COM-Poisson Regression Model for Count Data

The Poisson regression model is the most common model for fitting count data. However, it is suitable only for modeling equi-dispersed data. The Conway-Maxwell-Poisson (COM-Poisson) regression model allows both over- and under-dispersed data to be modeled. The purpose of this study is to demonstrate the flexibility of the COM-Poisson regression model on simulated data and on algae data.


Introduction
The number of occurrences of an event within a specified time can be described as count data. When the dependent variable is a count and the researcher is interested in how this count changes as the explanatory variables change, a count data regression model is used. Count data models have been widely used in actuarial science: Aitkin et al. (1990) [1] and Renshaw (1994) [19] fitted Poisson regression to two different sets of U.K. motor claim data. In biostatistics and demography, Frome (1983) [8] modelled the lung cancer death rates among British physicians who were regular cigarette smokers. In recent years such models have been used frequently in economics, political science, and sociology: Lord (2006) [12] modeled motor vehicle crashes using a Poisson-Gamma model, and Riphahn et al. (2003) [20] fitted the model to German Socioeconomic Panel (GSOEP) data.

Because counts are non-negative integers, the Poisson distribution, rather than the normal, is appropriate, particularly for rare events. However, it is suitable only for modeling equi-dispersed data (i.e., data with equal mean and variance). Many real data sets do not adhere to this assumption (they are over- or under-dispersed), and the inappropriate imposition of a Poisson regression model may underestimate the standard errors and overstate the significance of the regression coefficients. For over-dispersed data, the Generalized Estimating Equations (GEE) method with the negative binomial distribution is a popular choice and increases the efficiency of the estimates [12,9]. Other over-dispersion models include Poisson mixtures [15] and the quasi-Poisson model, which is characterized by its first two moments (mean and variance) [25]. However, these models are not suitable for under-dispersed data. Under-dispersed data are less commonly observed. Under-dispersion can arise when the sample is small and the sample mean is very low, and it can also be caused by a data-generating process that is independent of the sample size or mean [18].

A few models exist that allow for both over- and under-dispersed data. One example is the restricted generalized Poisson regression model of Famoye (1993) [6]. It is called restricted because it belongs to an exponential family only under the condition that the dispersion parameter is constant. The Conway-Maxwell-Poisson (COM-Poisson) regression model is an alternative model for fitting data sets of varying dispersion. The COM-Poisson distribution is a two-parameter generalization of the Poisson distribution that allows for over- and under-dispersion and also includes the Bernoulli and geometric distributions as special cases. The distribution was briefly introduced by Conway and Maxwell in 1962 for modeling queuing systems with state-dependent service rates. The statistical properties of the COM-Poisson distribution, as well as methods for estimating its parameters, were established by Shmueli et al. (2005) [23]. The COM-Poisson distribution has been used in a variety of count data applications [4,3,7,24,5,23,14,13,11].

Poisson Regression Model
Poisson regression is a special case of the Generalized Linear Model (GLM) framework. The simplest distribution used for modeling count data is the Poisson distribution. Let Y denote a random variable from the Poisson distribution with parameter λ, with probability function

$$P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{1}$$

One of the properties of the Poisson distribution is that the ratio of consecutive probabilities is linear in y:

$$\frac{P(Y = y - 1)}{P(Y = y)} = \frac{y}{\lambda} \tag{2}$$

The canonical link is g(λ) = log(λ), resulting in a log-linear relationship between the mean and the linear predictor. The variance in the Poisson regression model is identical to the mean [27]. The mean of the Poisson regression model is assumed to follow a log link,

$$\log(\lambda_i) = x_i'\beta,$$

where x_i denotes the vector of explanatory variables and β the vector of regression parameters. The maximum likelihood estimates can be obtained by maximizing the log-likelihood.
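The relations in this section can be checked numerically. The following is a minimal Python sketch, not the code used in the study (the analyses here were run in R); the function names `poisson_pmf` and `poisson_loglik` and the value λ = 2.5 are illustrative assumptions.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson(lam) variable, computed on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def poisson_loglik(beta, X, y):
    """Log-likelihood of a Poisson regression with log link, log(lambda_i) = x_i' beta."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))  # linear predictor x_i' beta
        total += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return total

# Ratio of consecutive probabilities P(Y = y-1) / P(Y = y) should equal y / lambda.
lam = 2.5  # illustrative value
ratios = [poisson_pmf(y - 1, lam) / poisson_pmf(y, lam) for y in range(1, 6)]
expected = [y / lam for y in range(1, 6)]
```

Maximizing `poisson_loglik` over `beta` with any general-purpose optimizer would give the maximum likelihood estimates mentioned above.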

Negative Binomial Models
A second way of modeling over-dispersed count data is to assume a Negative Binomial (NB) distribution for Y, which can arise as a Gamma mixture of Poisson distributions. One parameterization of its probability function is

$$f(y; \mu, \theta) = \frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!} \cdot \frac{\mu^{y}\, \theta^{\theta}}{(\mu + \theta)^{y + \theta}}, \qquad y = 0, 1, 2, \ldots$$

with mean μ and shape parameter θ; Γ(·) is the Gamma function. For every fixed θ, this is another special case of the GLM framework. It also has dispersion parameter ϕ = 1, but with variance function V(μ) = μ + μ²/θ [27]. The mean of the NB regression model can also be assumed to follow the log link, log(μ_i) = x_i'β, and the maximum likelihood estimates can be obtained by maximizing the log-likelihood.
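As a numerical illustration of the variance function, the sketch below evaluates the NB probability function in the mean and shape parameterization and recovers its first two moments by direct summation. It is a hedged example only: the names `nb_pmf` and `nb_moments`, the truncation point, and the values μ = 3, θ = 2 are assumptions chosen for illustration.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial P(Y = y) with mean mu and shape theta (log-scale evaluation)."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + y * math.log(mu) + theta * math.log(theta)
             - (y + theta) * math.log(mu + theta))
    return math.exp(log_p)

def nb_moments(mu, theta, upper=2000):
    """Mean and variance obtained by summing the pmf over 0..upper-1 (truncated sum)."""
    probs = [nb_pmf(y, mu, theta) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mu, theta = 3.0, 2.0            # illustrative values
mean, var = nb_moments(mu, theta)
# The variance should match mu + mu^2 / theta, which exceeds the mean (over-dispersion).
```

Because μ²/θ is always positive, this model can only inflate the variance relative to the Poisson case, which is why it cannot represent under-dispersion.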

Geometric Regression Model
The geometric distribution can be viewed as a generalization of the Poisson distribution obtained by including a gamma noise variable (a Poisson-gamma mixture, i.e., the negative binomial) whose dispersion parameter is set to one. The probability function of the geometric distribution is

$$P(Y = y) = \frac{\mu^{y}}{(1 + \mu)^{y + 1}}, \qquad y = 0, 1, 2, \ldots$$

In the geometric regression model, the mean of Y is determined by μ_i = exp(x_i'β), where x_i denotes the vector of explanatory variables and β the vector of regression parameters. The regression coefficients are estimated using the method of maximum likelihood [2].

COM-Poisson Regression Model
The Conway-Maxwell-Poisson (COM-Poisson) distribution is a generalization of the Poisson distribution which can model both under-dispersed and over-dispersed data. In some applications, the ratio of consecutive probabilities in equation (2) may not be linear in y, i.e., the distribution may have a thicker or thinner tail than the Poisson. Suppose that, instead of equation (2),

$$\frac{P(Y = y - 1)}{P(Y = y)} = \frac{y^{\upsilon}}{\lambda} \tag{5}$$

is set for a random variable Y. The resulting distribution for which equation (5) holds, called the Conway-Maxwell-Poisson distribution [26], is given by

$$P(Y = y) = \frac{\lambda^{y}}{(y!)^{\upsilon}} \left[ \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\upsilon}} \right]^{-1}, \qquad y = 0, 1, 2, \ldots$$

for λ > 0 and υ ≥ 0. This satisfies the conditions for a probability function. The parameter υ is the dispersion parameter: υ > 1 represents under-dispersion and υ < 1 over-dispersion.
Minka et al. (2003) [16] denote the infinite sum in the denominator by

$$Z(\lambda, \upsilon) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\upsilon}}.$$

The COM-Poisson distribution includes three well-known distributions as special cases: Poisson (υ = 1), geometric (υ = 0, λ < 1), and Bernoulli (υ → ∞, with probability λ/(1 + λ)) [23]. Taking a GLM approach, Sellers and Shmueli (2010) [21] proposed a COM-Poisson regression model using the link function

$$\eta(E(Y)) = \log(\lambda) = X'\beta.$$

Accordingly, this function indirectly models the relationship between E(Y) and X′β, and allows β and υ to be estimated via the associated normal equations. Because the normal equations are complex, they are solved iteratively, using the Poisson estimates β^(0) and υ^(0) = 1 as starting values. The equations can be solved via an appropriate iteratively reweighted least squares procedure (or by maximizing the likelihood function directly using an optimization program) to determine the maximum likelihood estimates of β and υ. The associated standard errors of the estimated coefficients are derived using the Fisher information matrix [22].
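The three special cases can be verified numerically. The sketch below is an illustration under stated assumptions, not the implementation in the "COMPoissonReg" package: the names `cmp_log_z` and `cmp_pmf` are hypothetical, and the infinite sum Z(λ, υ) is simply truncated at a fixed number of terms, which is adequate for the parameter values shown but would be inaccurate for υ = 0 with λ close to 1.

```python
import math

def cmp_log_z(lam, nu, terms=500):
    """log Z(lam, nu), with the infinite sum truncated at `terms` terms (log-sum-exp)."""
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

def cmp_pmf(y, lam, nu, terms=500):
    """COM-Poisson P(Y = y) = lam^y / ((y!)^nu * Z(lam, nu))."""
    return math.exp(y * math.log(lam) - nu * math.lgamma(y + 1)
                    - cmp_log_z(lam, nu, terms))

# nu = 1 reduces to the Poisson distribution.
poisson_case = cmp_pmf(2, 3.0, 1.0)            # compare with exp(-3) * 3^2 / 2!
# nu = 0 with lam < 1 reduces to the geometric distribution (1 - lam) * lam^y.
geometric_case = cmp_pmf(3, 0.4, 0.0)          # compare with 0.6 * 0.4^3
# A large nu approaches a Bernoulli distribution with P(Y = 1) = lam / (1 + lam).
bernoulli_case = cmp_pmf(1, 1.5, 30.0)         # compare with 1.5 / 2.5
```

Working on the log scale avoids overflow in the factorial terms, which matters when υ is large.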
Poisson and COM-Poisson coefficients are on different scales. To compare them, Sellers and Shmueli (2010) [21] suggested dividing the COM-Poisson coefficients by υ, because E(Y^υ) = λ.
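The identity E(Y^υ) = λ behind this rescaling can be confirmed by direct summation, as in the Python sketch below. The helper names and all numeric values are hypothetical and serve only as an illustration; in particular, `beta_cmp` and `nu_hat` are made-up numbers, not estimates from this study.

```python
import math

def cmp_probs(lam, nu, terms=500):
    """COM-Poisson probabilities for y = 0..terms-1 (normalizing sum truncated)."""
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def expected_power(lam, nu):
    """E(Y^nu) under COM-Poisson(lam, nu), for nu > 0; should equal lam."""
    return sum((y ** nu) * p for y, p in enumerate(cmp_probs(lam, nu)))

def rescale_coefficients(beta, nu):
    """Divide COM-Poisson coefficients by nu to put them on the Poisson scale."""
    return [b / nu for b in beta]

check = expected_power(4.0, 1.7)                 # numerically close to 4.0

# Hypothetical coefficients and dispersion estimate, for illustration only.
beta_cmp = [0.6, 1.2]
nu_hat = 1.5
beta_scaled = rescale_coefficients(beta_cmp, nu_hat)
rate_ratios = [math.exp(b) for b in beta_scaled]  # multiplicative effects on the rescaled scale
```

Exponentiating the rescaled coefficients gives quantities that can be read in the same way as exponentiated Poisson coefficients, which is how the seasonal and station effects are interpreted later in the paper.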

Testing for Variable Dispersion
In GLMs, a simple way to detect over- or under-dispersion is the rule of thumb that the mean deviance, that is, the deviance divided by its degrees of freedom, should be close to unity. The deviance theoretically allows one to determine whether the fitted GLM is significantly worse than the saturated model [17]. Sellers and Shmueli (2010) [21] established a hypothesis testing procedure to determine whether significant data dispersion exists, thus demonstrating the need for a COM-Poisson regression model over a simple Poisson regression model; in other words, they test H_0: υ = 1 against H_1: υ ≠ 1 [22]. The test statistic is

$$C = -2 \log \Lambda = -2\left[\log L\big(\hat{\beta}^{(0)}, \hat{\upsilon} = 1\big) - \log L\big(\hat{\beta}, \hat{\upsilon}\big)\right],$$

where Λ is the likelihood ratio test statistic, β̂^(0) are the maximum likelihood estimates obtained under H_0: υ = 1 (i.e., the Poisson estimates), and (β̂, υ̂) are the maximum likelihood estimates under the general state space for the COM-Poisson distribution. Under the null hypothesis, C has an approximate chi-squared distribution with 1 degree of freedom. For small samples, the distribution of the test statistic can be estimated via the bootstrap [21].
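Given the two maximized log-likelihoods, the statistic and its asymptotic p-value are straightforward to compute. The Python sketch below assumes the chi-squared approximation with 1 degree of freedom (not the bootstrap variant); the function name `lr_dispersion_test` and the log-likelihood values are hypothetical.

```python
import math

def lr_dispersion_test(loglik_poisson, loglik_cmp):
    """Likelihood ratio test of H0: nu = 1 (Poisson) against the COM-Poisson model.

    Returns the statistic C = -2 [log L0 - log L1] and its p-value under a
    chi-squared distribution with 1 degree of freedom.
    """
    c = max(0.0, -2.0 * (loglik_poisson - loglik_cmp))  # guard against tiny negative values
    # For 1 degree of freedom, P(chi2 > c) = erfc(sqrt(c / 2)).
    p_value = math.erfc(math.sqrt(c / 2.0))
    return c, p_value

# Hypothetical maximized log-likelihoods, for illustration only.
c_stat, p_val = lr_dispersion_test(loglik_poisson=-120.0, loglik_cmp=-115.0)
```

The closed form for the p-value uses the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so no statistical library is needed.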

Akaike Information Criteria (AIC)
When several models are available, their performance can be compared using one of several likelihood-based measures proposed in the statistical literature. One of the most widely used measures is the AIC. The AIC penalizes a model with a larger number of parameters and is defined as

$$\mathrm{AIC} = -2 \ln L + 2p \tag{10}$$

where ln L denotes the fitted log-likelihood and p the number of parameters [10]. A relatively small value of AIC indicates a more favorable fitted model.
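A short Python illustration of equation (10) follows. The log-likelihoods and parameter counts are hypothetical; the only assumption about the models is that a COM-Poisson fit has one more parameter (υ) than a Poisson fit with the same covariates.

```python
def aic(loglik, n_params):
    """Akaike Information Criterion: -2 ln L + 2p."""
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical fits with the same covariates (illustrative numbers only).
aic_poisson = aic(loglik=-250.0, n_params=6)   # regression coefficients only
aic_cmp = aic(loglik=-240.0, n_params=7)       # same coefficients plus the dispersion parameter
preferred = "COM-Poisson" if aic_cmp < aic_poisson else "Poisson"
```

In this made-up case the gain in log-likelihood outweighs the penalty for the extra parameter, so the COM-Poisson model has the smaller AIC.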

Simulated Data
To demonstrate the flexibility of the COM-Poisson distribution, 500 observations were generated from each of the Poisson, geometric, negative binomial and COM-Poisson distributions. The inversion method is a particularly simple way to sample an integer value from the COM-Poisson distribution. The COM-Poisson probabilities are summed, starting from P(Y = 0), until the sum exceeds the value of a simulated Uniform(0,1) variable; the value of y at which this happens is then an observation from the COM-Poisson distribution [16]. Goodness-of-fit results (with associated p-values in parentheses) for each distribution, together with the estimated parameters, are given in Table 1. The analyses were performed in R, using the glm() function from the "stats" package and the cmp() function from the "COMPoissonReg" package, respectively.
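The inversion method described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the R code used to produce Table 1; the function name `rcmp`, the seed, the truncation of the normalizing sum at 500 terms, and the parameter values λ = 4, υ = 2 are all assumptions.

```python
import math
import random

def rcmp(n, lam, nu, rng, terms=500):
    """Draw n COM-Poisson(lam, nu) values by inversion of the cumulative probabilities."""
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    probs = [w / total for w in weights]   # P(Y = 0), P(Y = 1), ...
    draws = []
    for _ in range(n):
        u = rng.random()                   # Uniform(0, 1) variate
        y, cum = 0, probs[0]
        # Add probabilities until the running sum reaches u (capped at the truncation point).
        while cum < u and y < terms - 1:
            y += 1
            cum += probs[y]
        draws.append(y)
    return draws

rng = random.Random(2024)                  # fixed seed so the sketch is reproducible
sample = rcmp(500, lam=4.0, nu=2.0, rng=rng)
mean = sum(sample) / len(sample)
var = sum((y - mean) ** 2 for y in sample) / (len(sample) - 1)
# With nu = 2 the sample variance is expected to fall below the sample mean.
```

Setting `nu=1.0` in the same function gives Poisson draws, which is a convenient sanity check on the sampler.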
Table 1 illustrates that, while the Poisson and geometric distributions fit well only the data generated from the same distribution, the negative binomial distribution was flexible enough to fit data from most of the distributions, except under-dispersed data. The COM-Poisson distribution, however, was flexible enough to fit all of the distributions considered. Furthermore, the estimated parameters were almost the same for the Poisson and negative binomial distributions.

Cocconeis Placentula Ehrenberg Data
Cocconeis Placentula Ehrenberg is one of the epilithic algae that occur in freshwater (more oligotrophic) habitats, including slightly humic waters. Between June 2013 and May 2014, counts of Cocconeis Placentula Ehrenberg were collected from four different stations located on the Batlama stream. The descriptive statistics are given in Table 3.
In order to model the effect of station and season on the number of Cocconeis Placentula Ehrenberg, the models described above were applied to the data set. At the end of the section, all fitted models are compared, highlighting that the modeled mean functions are similar but the fitted likelihoods and AIC values differ.
Results of regressing the number of Cocconeis Placentula Ehrenberg on station and season for the Poisson and COM-Poisson models are given in Table 4, which shows the parameter estimates, standard errors and AIC values. After dividing the COM-Poisson coefficients by the dispersion parameter (υ), the results in Table 4 indicate that the regression parameters of the two models have similar magnitudes.
The estimated dispersion parameter is 0.3015 for the Poisson regression model and υ = 1.1790 for the COM-Poisson model, indicating under-dispersion. To determine whether the dispersion is significant, the hypothesis test established by Sellers and Shmueli (2010) [21] was used. The p-value was found to be 0.0000, and the 95% bootstrap confidence interval for υ (using 1000 samples) does not include the value 1. This indicates dispersion that requires a COM-Poisson regression instead of a Poisson regression.
The same seasons (winter and spring) were statistically significant in both regression models, whereas the fourth station was statistically significant only in the Poisson regression model. In terms of log-likelihood and AIC, the COM-Poisson regression provided the best fit to the data.
In terms of model interpretation, after dividing the COM-Poisson coefficients by υ and exponentiating each rescaled coefficient, the COM-Poisson regression indicates that, relative to the summer reference category, Cocconeis Placentula Ehrenberg is about 9 times as common in winter and about 23 times as common in spring. Cocconeis Placentula Ehrenberg is about twice as common at the first station as at the third station.

Conclusion
This study concerns regression models in which the response variable of interest is a count, that is, it takes non-negative integer values. For count data, the most widely used regression model is Poisson regression, but it is limited by its equi-dispersion assumption. More recently, COM-Poisson regression has been used when the data display over- or under-dispersion. To demonstrate the flexibility of the COM-Poisson distribution, 500 observations were generated from each of the Poisson, geometric, negative binomial and COM-Poisson distributions. The goodness-of-fit tests showed that, while the Poisson and geometric distributions fit well only the data generated from the same distribution, the negative binomial distribution was flexible enough to fit data from most of the distributions, except under-dispersed data. The COM-Poisson distribution, however, was flexible enough to fit all of the distributions considered.

In order to model the effect of station and season on the number of Cocconeis Placentula Ehrenberg, Poisson and COM-Poisson regression models were fitted. The results indicated that the regression parameters of the two models had similar estimates. The test established by Sellers and Shmueli (2010) [21] for the significance of the dispersion parameter indicated that COM-Poisson regression was more adequate than the Poisson model. The same seasons (winter and spring) were statistically significant in both regression models, whereas the fourth station was statistically significant only in the Poisson regression model. In terms of log-likelihood and AIC, the COM-Poisson regression model provided the best fit to the data. In terms of model interpretation, relative to the summer reference category, Cocconeis Placentula Ehrenberg is about 9 times as common in winter and about 23 times as common in spring, and it is about twice as common at the first station as at the third station.