overdisp: A Stata (and Mata) Package for Direct Detection of Overdispersion in Poisson and Negative Binomial Regression Models

Stata has several procedures that can be used in analyzing count-data regression models and, more specifically, in studying the behavior of the dependent variable, conditional on explanatory variables. Identifying overdispersion in count-data models is one of the most important procedures that allow researchers to correctly choose estimations such as Poisson or negative binomial, given the distribution of the dependent variable. The main purpose of this paper is to present a new command for the identification of overdispersion in the data as an alternative to the procedure presented by Cameron and Trivedi [5], since it directly identifies overdispersion in the data, without the need to previously estimate a specific type of count-data model. When estimating Poisson or negative binomial regression models in which the dependent variable is quantitative, with discrete and non-negative values, the new Stata package overdisp helps researchers to directly propose more consistent and adequate models. As a second contribution, we also present a simulation to show the consistency of the overdispersion test using the overdisp command. Findings show that, if the test indicates equidispersion in the data, there is consistent evidence that the distribution of the dependent variable is, in fact, Poisson. If, on the other hand, the test indicates overdispersion in the data, researchers should investigate more deeply whether the dependent variable actually exhibits better adherence to the Poisson-Gamma distribution or not.


Introduction
Many situations have as an outcome of interest a nonnegative integer, or a count, denoted by $y$, with $y \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$. The benchmark model for the analysis of integer count data is the Poisson regression model, which restricts the variance of the data to be equal to the mean, conditional on explanatory variables [5,6,7]. When this restriction fails, researchers can estimate the parameters under more general distributions, such as the negative binomial.
Many commonly used count-data models are implemented in a variety of software packages, such as the poisson and nbreg commands in Stata [19], the glm and glm.nb functions in R [16] and the GENMOD procedure in SAS [18], and many applications of these models can be found in economics, finance, actuarial science, ecology, demography, sociology, psychology, and health, among other relevant fields of knowledge [7,11,14,20,21].
Following the test implemented in Stata through a sequence of four commands proposed by Cameron and Trivedi [5], we present the new package overdisp to directly identify overdispersion in Stata. There are notable advantages to running overdisp in this way. First, it provides a simple, intuitive, fast and easy command that allows users to choose between Poisson and negative binomial estimations in the presence of count data. Second, prior to fitting a particular model, users can take advantage of Stata's excellent statistics and data management commands to prepare and descriptively analyze their data [13]. Last, all analyses can be reproduced and documented for publication and review by typing the command into a file and running it directly.
One of the key advantages of such a package for count-data regression is that, for the development of new models, overdispersion can be identified in one line of code rather than having to program the test.
The remainder of the article is structured as follows. Section 2 briefly reviews count-data models. Section 3 formally presents the overdispersion test. Section 4 describes how to install overdisp in Stata, presents the overdisp command syntax and describes the estimation options. Section 5 illustrates overdisp by replicating examples presented in Cameron and Trivedi [5] and Fávero and Belfiore [7]. Section 6 performs a simulation study using overdisp to demonstrate an important research finding. Section 7 concludes.

Review of count-data models
According to Rabe-Hesketh and Skrondal [17], a general count-data regression model can be written as follows:

$$\ln(u_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} \qquad (1)$$

where $\beta_0$ represents the constant, $\beta_j$ ($j = 1, 2, \ldots, k$) are the estimated parameters for each explanatory variable $X_j$, $u$ is the expected number of occurrences or the estimated incidence rate ratio for the phenomenon under study for a given exposure (period, area, region, among other examples) and for a given observation $i$ ($i = 1, 2, \ldots, n$), and $n$ is the sample size.

Poisson regression models
The starting point for the study of count-data regression models is the Poisson distribution, which, for a given observation $i$, assigns the following probability to a count $m$ in a given exposure:

$$p(y_i = m) = \frac{e^{-u_i}\, u_i^{m}}{m!} \qquad (2)$$

where $m = 0, 1, 2, \ldots$. As discussed by Cameron and Trivedi [5], Avci [2] and Fávero and Belfiore [7], in the Poisson distribution the mean and the variance of the variable under study are both equal to $u$:

$$E(y_i) = u_i \qquad (3)$$

$$\mathrm{Var}(y_i) = u_i \qquad (4)$$

Poisson regression model parameters can be estimated by maximum likelihood, where the dependent variable follows a Poisson distribution [9]. With equation (2) giving the probability of occurrence of a specific count $m$ in a given exposure, the log-likelihood function for Poisson regression models can be written as

$$LL = \sum_{i=1}^{n} \left[ -u_i + y_i \ln(u_i) - \ln(y_i!) \right] \qquad (5)$$
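The equality of mean and variance under the Poisson distribution can be checked numerically. The following is a minimal sketch, not part of the original analysis: the seed, the sample size and the mean of 3 are arbitrary values assumed for illustration.

```stata
* Minimal sketch: simulated Poisson counts have sample mean close to sample variance.
* The seed, number of observations and mean (3) are illustrative assumptions.
clear
set seed 12345
set obs 10000
generate y = rpoisson(3)
summarize y
display "mean = " r(mean) "   variance = " r(Var)
```

With draws of this kind the two displayed values should be close to each other, which is the equidispersion property that the Poisson regression model imposes.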

Negative binomial regression models
Negative binomial regression models are also count-data regression models, but their estimation takes into account the existence of overdispersion in the dependent variable, conditional on explanatory variables [1,10]. The probability function of the negative binomial distribution (also known as the Poisson-Gamma distribution), which gives the probability of a count $m$ in a given exposure, can be written as:

$$p(y_i = m) = \frac{\Gamma(m + \alpha^{-1})}{\Gamma(m + 1)\,\Gamma(\alpha^{-1})} \left( \frac{\alpha^{-1}}{\alpha^{-1} + u_i} \right)^{\alpha^{-1}} \left( \frac{u_i}{u_i + \alpha^{-1}} \right)^{m} \qquad (6)$$

where $\alpha$ is the inverse of the shape parameter of the Gamma distribution ($\alpha > 0$). So, in the presence of overdispersion, we have:

• Mean: $E(y_i) = u_i \qquad (7)$
• Variance: $\mathrm{Var}(y_i) = u_i + \alpha u_i^2 \qquad (8)$

The second term of the variance of the negative binomial distribution (equation (8)) represents overdispersion and, if $\alpha \to 0$, this phenomenon will not be present in the data, favoring the estimation of the Poisson regression model. However, if $\alpha$ is statistically greater than zero, the existence of overdispersion in the dependent variable, conditional on explanatory variables, indicates that a negative binomial regression model should be estimated.
Overdispersion is often encountered when fitting parametric models. The Poisson distribution has one free parameter and does not allow for the variance to be adjusted independently of the mean (equation (4)). If overdispersion is a feature, an alternative model with additional free parameters may provide a better fit. In the case of count data, a negative binomial regression model can be proposed instead, in which the mean of the Poisson distribution can itself be thought of as a random variable drawn, in this case, from the Gamma distribution, thereby introducing an additional free parameter (the term $\alpha u^2$ in equation (8)).
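The Poisson-Gamma construction described above can be illustrated with a short simulation. This is a sketch under assumed values (mean 4, overdispersion parameter 0.5, an arbitrary seed and sample size), none of which come from the paper.

```stata
* Minimal sketch of a Poisson-Gamma mixture; all numeric values are assumptions.
clear
set seed 2020
set obs 100000
local u = 4          // assumed mean of the count
local alpha = 0.5    // assumed overdispersion parameter
* Gamma draw with shape 1/alpha and scale alpha*u: mean u, variance alpha*u^2
generate double lambda = rgamma(1/`alpha', `alpha'*`u')
* Poisson draw whose mean is itself the Gamma variate
generate y = rpoisson(lambda)
summarize y
display "sample variance = " r(Var) "   u + alpha*u^2 = " `u' + `alpha'*`u'^2
```

The sample variance should land near the value implied by the negative binomial variance function, well above the sample mean, in contrast to a pure Poisson draw.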
In this sense, overdispersion can be defined as excess variability (statistical dispersion) in a variable in comparison with its mean. According to Fávero and Belfiore [7], high variation in the data is due to heterogeneous or non-uniform samples, to the presence of relevant outliers and/or, in the specific case of count data, to high levels of exposure for a quantitative variable with discrete and non-negative values. For instance, counts per month instead of counts per day, or counts per square kilometer instead of per square meter, can generate overdispersion in the data.
The parameters of equation (1), in the presence of overdispersion, can also be estimated by maximum likelihood [3,12], and the log-likelihood function for negative binomial regression models is

$$LL = \sum_{i=1}^{n} \left[ y_i \ln\!\left( \frac{\alpha u_i}{1 + \alpha u_i} \right) - \frac{\ln(1 + \alpha u_i)}{\alpha} + \ln\Gamma(y_i + \alpha^{-1}) - \ln\Gamma(y_i + 1) - \ln\Gamma(\alpha^{-1}) \right] \qquad (9)$$

Therefore, before the parameters are estimated, the statistical significance of the $\alpha$ term must first be established. For this, Cameron and Trivedi [5] propose a test to verify the existence of overdispersion in the dependent variable, conditional on explanatory variables, which requires the previous estimation of a Poisson regression model.
In the following sections, we formally discuss this test and introduce a new Stata command (overdisp), which can be implemented without first estimating Poisson models and thus contributes to the identification of overdispersion in the data, since it detects the phenomenon more quickly and easily.

Overdispersion test
The formal test of the null hypothesis of equidispersion, $\mathrm{Var}(y|X) = E(y|X)$, against the alternative of overdispersion, was first introduced by Cameron and Trivedi [4], and is based on the following equation:

$$\mathrm{Var}(y|X) = u + \alpha u^2 \qquad (10)$$

which is the variance function for the negative binomial distribution, as shown in equation (8). So, we have to test the significance of the parameter $\alpha$ ($H_1: \alpha > 0$) against the null hypothesis ($H_0: \alpha = 0$). According to the authors, to implement the test, a new variable $y^*$ first needs to be generated, as follows:

$$y_i^* = \frac{(y_i - \hat{u}_i)^2 - y_i}{\hat{u}_i} \qquad (11)$$

where $\hat{u}_i = \exp(X_i'\hat{\beta})$ and $\hat{\beta}$ is the vector of parameters estimated through the model presented by equation (1). The test can be implemented by a regression of $y^*$ on $\hat{u}$, without an intercept term. The t test of the coefficient of $\hat{u}$ indicates whether significant overdispersion is present. If the p-value (reported by Stata as P > |t|) is greater than the significance level, $H_0$ is not rejected, which indicates equidispersion (no significant overdispersion) and favors Poisson estimation; if the p-value is less than or equal to the significance level, $H_0$ is rejected in favor of $H_1$, which indicates significant overdispersion and favors negative binomial estimation.
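The reason the slope of this auxiliary regression speaks to $\alpha$ can be sketched in one line. The sketch treats $u$ as known, which is an assumption made here only for exposition (in practice it is replaced by the fitted value), and takes $y^* = ((y - u)^2 - y)/u$ together with the variance function $\mathrm{Var}(y|X) = u + \alpha u^2$:

$$E\left[ y^* \mid X \right] = \frac{E\left[ (y - u)^2 \mid X \right] - E\left[ y \mid X \right]}{u} = \frac{(u + \alpha u^2) - u}{u} = \alpha u$$

Under equidispersion ($\alpha = 0$) this expectation is zero for every observation, so the coefficient of $u$ in a regression of $y^*$ on $u$ without an intercept should not differ significantly from zero; under overdispersion the same coefficient estimates $\alpha > 0$.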

The overdisp Stata command
Cameron and Trivedi [5] presented the overdispersion test in Stata through the application of four commands in sequence, following the logic discussed in the previous section. The first command is

poisson depvar indepvar

which estimates a Poisson regression model. The term depvar denotes the dependent or response variable and indepvar denotes the list of covariates appearing in the model. The following command generates the predicted number of events ($\hat{u}$, called uhat in Stata):

predict uhat
Based on equation (11), the next command creates a new variable ($y^*$, called ystar in Stata), as follows:

generate ystar = ((depvar-uhat)^2 - depvar)/uhat

Finally, following the authors, an auxiliary simple regression of $y^*$ (ystar) on $\hat{u}$ (uhat), without an intercept, can be estimated:

regress ystar uhat, noconstant

The overdisp command consolidates these four commands and was developed in Mata, the matrix programming language available in Stata since version 9. Appendix 1 offers the Stata and Mata ado code for the overdisp command.

Stat., Optim. Inf. Comput. Vol. 8, September 2020
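As a rough illustration of the four-command sequence, the sketch below runs it end to end on simulated equidispersed data. The variable names (x1, x2, y), the coefficients, the seed and the sample size are assumptions chosen only for illustration and do not come from Cameron and Trivedi [5].

```stata
* Minimal sketch of the four-command test on simulated Poisson data.
* Variable names, coefficients, seed and sample size are illustrative assumptions.
clear
set seed 8
set obs 5000
generate x1 = rnormal()
generate x2 = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x1 - 0.2*x2))

poisson y x1 x2                              // 1. Poisson regression
predict uhat                                 // 2. predicted number of events
generate ystar = ((y - uhat)^2 - y)/uhat     // 3. auxiliary dependent variable
regress ystar uhat, noconstant               // 4. auxiliary regression without intercept
```

The row for uhat in the last output carries the t statistic and the P>|t| value on which the decision between Poisson and negative binomial estimation is based.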

Installation
overdisp is available from Stata 15 and can be installed from the Statistical Software Components (SSC) archive [8] by typing the following command within a net-aware version of Stata:

ssc install overdisp
Two files will be installed on your computer: overdisp.ado, a Stata ado file which defines the command; and overdisp.sthlp, a Stata help file which documents the command. Note that these files will be installed onto your adopath, the path where Stata searches for the files it needs. If you have already installed overdisp from the SSC, you can check that you are using the latest version by typing the following command:

adoupdate overdisp
The overdisp command follows standard Stata syntax for estimation commands. We restrict our discussion here to the most common options. A complete description is provided in the overdisp help file. You may type the following at any point to view this help file:

help overdisp

Syntax
In the syntax of the overdisp command, overdisp is the name of the command, depvar denotes the dependent or response variable, indepvar denotes the list of covariates appearing in the model, square brackets indicate optional arguments, and the comma separates the specification of the model from the specification of any modeling or estimation options listed in options.

Options
The options are defined as follows:

level(#) sets the confidence level; the default is level(95).

The null hypothesis (H0) of the test is equidispersion.
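A call to the command might look like the sketch below, which assumes that overdisp has already been installed from SSC and that it accepts the dependent variable followed by the covariates, as described above. The simulated data, variable names and coefficient values are assumptions for illustration only; the counts are generated with Gamma heterogeneity so that overdispersion is present by construction.

```stata
* Minimal usage sketch; assumes overdisp is installed (ssc install overdisp).
* Data, variable names and coefficients are illustrative assumptions.
clear
set seed 2020
set obs 5000
generate x1 = rnormal()
generate x2 = rnormal()
* Poisson-Gamma counts: multiplicative Gamma term with mean 1 and variance 0.5
generate y = rpoisson(exp(0.5 + 0.3*x1 - 0.2*x2) * rgamma(2, 0.5))

overdisp y x1 x2               // default confidence level, level(95)
overdisp y x1 x2, level(90)    // same test with a 90% confidence level
```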

Examples
We illustrate the application of the overdisp command using the two examples below, in which we show the use of the new command and the interpretation of its output in different settings. Given the illustrative purpose of this section, we closely follow the sources of the examples [5,7] when describing the data and discussing the possible overdispersion in the dependent variable, conditional on explanatory variables, of the proposed models. We contribute by proposing an alternative, direct way to test overdispersion.

Example 1: Determinants of annual number of doctor visits
In the first example, we use the dataset mus17data.dta when describing the data and discussing the existence of overdispersion in the dependent variable, conditional on explanatory variables, replicating part of the results from chapter 17 of Cameron and Trivedi [5].
The application we consider here is the analysis of the determinants of the annual number of doctor visits (docvis) of the Medicare population aged 65 and older from the sample of the U.S. Medical Expenditure Panel Survey for 2003. The covariates include age (age), squared age (age2), years of education (educyr), presence of activity limitation (actlim), number of chronic conditions (totchr), having private insurance that supplements Medicare (private), and having public Medicaid insurance for low-income individuals that supplements Medicare (medicaid).
Thus, the proposed count-data model here is

$$\ln(\widehat{\text{docvis}}_i) = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{age2}_i + \beta_3\,\text{educyr}_i + \beta_4\,\text{actlim}_i + \beta_5\,\text{totchr}_i + \beta_6\,\text{private}_i + \beta_7\,\text{medicaid}_i \qquad (12)$$

and our objective is to verify the existence of overdispersion in the dependent variable docvis, conditional on the seven explanatory variables. We start by loading the data with the use command, specifying the clear option to replace any data currently in memory.

use http://www.stata-press.com/data/mus/mus17data, clear

We then use the codebook command with the compact option to compactly describe the data contents [13].
The dependent variable docvis is quantitative, discrete and has non-negative values. The tab command, which is frequently used to obtain the frequency distribution of a qualitative variable, can therefore be used in this case, given that the dependent variable takes only nonnegative integer values.

tab docvis

(output omitted)
The hist command displays the histogram of the dependent variable, presented in Figure 1. The discrete option indicates that the dependent variable takes only integer values. Before fitting any count-data regression model, it is useful to check whether the mean and variance of the dependent variable are equal, or at least close. This gives an idea of whether Poisson regression is adequate or whether a negative binomial regression model will be needed. In this initial diagnostic, the variance of the dependent variable is about 8 times greater than its mean. Although this fact suggests the existence of overdispersion in the data, at this point we cannot yet verify this phenomenon in the dependent variable, conditional on the explanatory variables.
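The descriptive steps for docvis can be sketched as follows. The dataset URL and variable name are those given in the text; the use of tabstat for the mean and variance is an assumption, since any command that reports both statistics would serve.

```stata
* Minimal sketch of the descriptive diagnostics for docvis.
* tabstat is one possible choice for the mean/variance comparison (assumption).
use http://www.stata-press.com/data/mus/mus17data, clear
codebook docvis, compact                      // compact description of the variable
tab docvis                                    // frequency distribution of the counts
hist docvis, discrete freq                    // histogram for an integer-valued variable
tabstat docvis, statistics(mean variance)     // unconditional mean and variance
```

This comparison is unconditional; it does not yet account for the explanatory variables, which is what the formal test addresses.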
Cameron and Trivedi [5] recommend that all modeling in which the dependent variable contains count data start with the estimation of a Poisson regression model. The poisson command estimates a Poisson regression model by maximum likelihood. Just as for multiple, binary and multinomial logistic regression models, if the researcher does not specify the desired confidence level for the parameter interval estimates, the default is 95%. The following command generates the predicted number of events, $\hat{u}$:

predict uhat
Next, based on equation (11), we create a new variable in the dataset, called ystar:

generate ystar = ((docvis-uhat)^2 - docvis)/uhat

The auxiliary regression of ystar on uhat, without an intercept, then provides the overdispersion test. Although an indication of overdispersion may lead researchers to estimate a Poisson model with the vce(robust) option, we recommend modeling this feature with the negative binomial model.
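Put together, the sequence for this example might look like the sketch below. The covariate order follows the list given in the data description and is an assumption, as is the final overdisp call, which presumes the package is installed and takes the same variable list as poisson.

```stata
* Minimal sketch of Example 1; covariate order and the overdisp call are assumptions.
use http://www.stata-press.com/data/mus/mus17data, clear

* Four-command sequence of Cameron and Trivedi
poisson docvis age age2 educyr actlim totchr private medicaid
predict uhat
generate ystar = ((docvis - uhat)^2 - docvis)/uhat
regress ystar uhat, noconstant

* Negative binomial alternative recommended in the text
nbreg docvis age age2 educyr actlim totchr private medicaid

* Direct test with the new command (requires: ssc install overdisp)
overdisp docvis age age2 educyr actlim totchr private medicaid
```

The nbreg output also reports a likelihood-ratio test of alpha = 0, which offers a further check on the same hypothesis after the negative binomial model has been fit.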

Example 2: Determinants of annual number of durable good purchases through installment closed-end credit
The second application we consider is Fávero and Belfiore's [7] analysis of the determinants of the quantity of durable good purchases made using installment closed-end credit in the last year per consumer (purchases). The finance department of a large appliance retailer wants to know whether consumer income and age explain the use of financing when purchasing goods such as cellular telephones, tablets, laptops, televisions, videogames and DVD/Blu-ray players, in order to develop a marketing campaign for this form of financing based on customer profile. To this end, a survey was conducted on a random sample of 200 clients. Thus, covariates include the monthly consumer income in US$ (income) and the consumer age in years (age). We use the dataset Financing.dta when describing the data and discussing the existence of overdispersion in the dependent variable, conditional on the covariates, replicating results from chapter 15 of Fávero and Belfiore [7].
The proposed count-data model here is

$$\ln(\widehat{\text{purchases}}_i) = \beta_0 + \beta_1\,\text{income}_i + \beta_2\,\text{age}_i \qquad (13)$$

and now our objective is to verify the existence of overdispersion in the dependent variable purchases, conditional on the two explanatory variables. First, we can use the codebook command with the compact option to compactly describe the data. The dependent variable is quantitative, discrete and has non-negative values. In this case, as discussed in the previous example, we can use the tab and hist commands, as follows:

tab purchases

(output omitted)

hist purchases, discrete freq

A comparison of the mean and variance of the dependent variable then allows us to investigate preliminarily whether they are equal, or at least close. Because the mean and variance are quite close, we can suppose that the Poisson regression model will be suitable in this case.
Following again the logic proposed by Cameron and Trivedi [5], let us first estimate the Poisson regression model and then investigate the existence of overdispersion in the data, using the four commands presented in Section 4. The t test of the coefficient of $\hat{u}$ indicates equidispersion, since the p-value (P > |t|) is greater than 0.05, favoring Poisson estimation.

The prcounts command, to be typed after the poisson command, generates, for each observation, variables corresponding to the occurrence probabilities of each possible value of the dependent variable. If the prcounts command has not been installed in Stata, the researcher should type findit prcounts and install it. The command is:

prcounts prpoisson, plot

Variables that correspond, respectively, to the observed and predicted probabilities of counts 0 to 9 for the whole sample (prpoissonobeq and prpoissonpreq) are created. Finally, the prpoissonval variable holds the values 0 to 9 (Stata default) to which the observed and predicted probabilities relate. The following command allows the observed and predicted probabilities for counts 0 to 9 to be compared visually:

graph twoway (scatter prpoissonobeq prpoissonpreq prpoissonval, connect(l l))

To verify the quality of the final estimated model (goodness of fit), we can perform a $\chi^2$ test to compare the two curves presented in Figure 3, by running the poisgof command after estimating the Poisson model. The result supports the quality of the final Poisson regression model; that is, there are no statistically significant differences between the predicted and observed values for the number of durable good purchases made using installment closed-end credit.
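For this example, the four-command sequence might be sketched as follows. The sketch assumes that Financing.dta has been obtained from Fávero and Belfiore [7] and saved in the current working directory; the file location is not stated in the text. The tabstat line is likewise an assumed way of comparing the mean and variance.

```stata
* Minimal sketch of Example 2.
* Assumes Financing.dta is in the current working directory (location not given in the text).
use Financing, clear
tabstat purchases, statistics(mean variance)           // preliminary comparison

poisson purchases income age                           // 1. Poisson regression
predict uhat                                           // 2. predicted number of events
generate ystar = ((purchases - uhat)^2 - purchases)/uhat   // 3. auxiliary variable
regress ystar uhat, noconstant                         // 4. auxiliary regression
```

A P>|t| value above the chosen significance level in the last output is what the text reads as equidispersion.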
The overdisp command makes unnecessary the use of the four commands proposed by Cameron and Trivedi [5], and even the prcounts and poisgof commands, since it directly identifies overdispersion in the data, without the need to previously estimate a specific type of count-data model, and consequently allows researchers to propose correct models, such as Poisson or negative binomial, directly.

Simulation study
In this section, we present a simulation study to show the consistency of the overdispersion test using the overdisp command, from the generation of dependent variables with Poisson or negative binomial distributions, as a function of two random explanatory variables.
We analyze 1,0000,000 replications of 25,000 observations. On each replication, we generate the responses, $y_i$, and two explanatory variables $X_i$, which are standard normal variates defined at observation $i$. Appendix 2 presents the Stata code for the proposed simulations.
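A scaled-down version of such a simulation can be sketched with Stata's simulate prefix. This is not the authors' Appendix 2 code: the program name odsim, the coefficients, the 1,000 observations and the 200 replications are all assumptions chosen to keep the run short, and the test is computed through the auxiliary regression rather than through overdisp itself.

```stata
* Minimal simulation sketch (not the authors' Appendix 2 code).
* Program name, coefficients, sample size and replication count are assumptions.
capture program drop odsim
program define odsim, rclass
    drop _all
    set obs 1000
    generate x1 = rnormal()
    generate x2 = rnormal()
    * Poisson-distributed response: equidispersion holds by construction
    generate y = rpoisson(exp(1 + 0.2*x1 + 0.2*x2))
    quietly poisson y x1 x2
    predict double uhat, n
    generate double ystar = ((y - uhat)^2 - y)/uhat
    quietly regress ystar uhat, noconstant
    * two-sided p-value, matching the P>|t| column reported by regress
    local t = _b[uhat]/_se[uhat]
    return scalar p = 2*ttail(e(df_r), abs(`t'))
end

set seed 12345
simulate p = r(p), reps(200) nodots: odsim
generate byte reject = p < 0.05     // 1 if equidispersion is rejected at the 5% level
summarize reject                    // the mean is the share of rejections
```

Replacing the rpoisson() line with a Poisson-Gamma draw gives the negative binomial counterpart of the experiment.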
We can verify that 5.24% of the simulations in which the dependent variable follows a Poisson distribution indicated the existence of overdispersion in the data at the 5% significance level (8.83% of these simulations indicated overdispersion in the dependent variable, conditional on the explanatory variables, at the 10% significance level, a figure that includes the previous 5.24%). On the other hand, no simulation rejected the hypothesis of overdispersion in the dependent variable when it follows a negative binomial distribution. Table 1 summarizes the findings of our simulations. Simulation results indicate that, although the overdispersion test proposed here through the new Stata command overdisp is quite effective at detecting this phenomenon for dependent variables with a negative binomial distribution, it can still generate distortions in the interpretation of the results in cases in which the dependent variable follows a Poisson distribution.
This finding represents an important contribution: if the test indicates equidispersion in the data, there is consistent evidence that the distribution of the dependent variable is, in fact, Poisson. On the other hand, if the test indicates overdispersion in the data, the researcher should investigate more deeply whether the dependent variable actually exhibits better adherence to the Poisson-Gamma distribution or not.

Conclusion
We have presented the overdisp command in Stata, for count-data regression models. Through the specification of Poisson or negative binomial models, we have illustrated how to implement the overdispersion test directly. In essence, the new command overdisp contributes to the identification of overdispersion in the data since it consolidates the four commands presented by Cameron and Trivedi [5], which renders the process of detecting overdispersion in count-data models faster and easier.
Compared with chi2gof [15], overdisp can be implemented without the necessity of previously estimating Poisson models, and can be used even if negative binomial models prevail. In other words, while chi2gof compares the actual frequencies of the dependent variable with the predicted frequencies obtained from the proposed model after the estimation of a Poisson or a negative binomial model, researchers can apply the overdisp command without the necessity of estimating Poisson or negative binomial models. The use of the overdisp command also makes unnecessary the use of the prcounts and poisgof commands, as shown in Section 5.2.
Finally, it is important to mention that the choice of the significance level of the overdispersion test is left to the user, but a higher significance level may cause distortions in the interpretation of the results for dependent variables with Poisson distribution.