The package
discretefit
implements fast Monte Carlo simulations for
goodness-of-fit (GOF) tests for discrete distributions. This includes
tests based on the root-mean-square statistic, the Chi-squared
statistic, the log-likelihood-ratio (G2) statistic, the
Freeman-Tukey (Hellinger-distance) statistic, the Kolmogorov-Smirnov
statistic, and the Cramer-von Mises statistic.
Simulations are written in C++ (utilizing Rcpp
) and are
considerably faster than the simulated Chi-squared GOF test in the R
stats
package.
The GOF tests in discretefit
function on a vector of
counts, x, and a vector of probabilities, p. In the below example, x
represents a vector of counts for five categories, and p represents a
vector of probabilities for each corresponding category. The GOF tests
provides p-values for the null hypothesis that x is a random sample of
the discrete distribution defined by p.
library(discretefit)
library(bench)
x <- c(42, 0, 13, 2, 109)
p <- c(0.2, 0.05, 0.1, 0.05, 0.6)
chisq_gof(x, p)
#>
#> Simulated Chi-squared goodness-of-fit test
#>
#> data: x
#> Chi-squared = 17.082, p-value = 0.0024
rms_gof(x, p)
#>
#> Simulated root-mean-square goodness-of-fit test
#>
#> data: x
#> RMS = 1.731, p-value = 0.0415
g_gof(x, p)
#>
#> Simulated log-likelihood-ratio goodness-of-fit test
#>
#> data: x
#> G2 = 27.362, p-value = 9.999e-05
ft_gof(x, p)
#>
#> Simulated Freeman-Tukey goodness-of-fit test
#>
#> data: x
#> FT = 45.599, p-value = 9.999e-05
ks_gof(x, p)
#>
#> Simulated Kolmogorov-Smirnov goodness-of-fit test
#>
#> data: x
#> KS = 0.056627, p-value = 0.2464
cvm_gof(x, p)
#>
#> Simulated Cramer-von Mises goodness-of-fit test
#>
#> data: x
#> W2 = 0.12578, p-value = 0.1859
In a surprising number of cases, a simulated GOF test based on the root-mean-square statistic outperforms the Chi-squared test and other tests in the Cressie-Read power divergence family. This has been demonstrated by Perkins, Tygert, and Ward (2011). They provide the following toy example.
Take a discrete distribution with 50 bins (or categories). The probability for the first and second bin is 0.25. The probability for each of the remaining 48 bins is 0.5 / 48 (~0.0104).
Now take the observed counts of 15 for the first bin, 5 for the second bin, and zero for each of the remaining 48 bins. It’s obvious that these observations are very unlikely to occur for a random sample from the above distribution. However, the Chi-squared, G2, and Freeman-Tukey tests fail to reject the null hypothesis.
x <- c(15, 5, rep(0, 48))
p <- c(0.25, 0.25, rep(1/(2 * 50 -4), 48))
chisq_gof(x, p)
#>
#> Simulated Chi-squared goodness-of-fit test
#>
#> data: x
#> Chi-squared = 30, p-value = 0.9644
g_gof(x, p)
#>
#> Simulated log-likelihood-ratio goodness-of-fit test
#>
#> data: x
#> G2 = 32.958, p-value = 0.6674
ft_gof(x, p)
#>
#> Simulated Freeman-Tukey goodness-of-fit test
#>
#> data: x
#> FT = 50.718, p-value = 0.1342
By contrast, the root-mean-square test convincingly rejects the null hypothesis.
rms_gof(x, p)
#>
#> Simulated root-mean-square goodness-of-fit test
#>
#> data: x
#> RMS = 5.1042, p-value = 9.999e-05
For additional examples, also see Perkins, Tygert, and Ward (2011) and Ward and Carroll (2014).
The simulated Chi-squared GOF test in discretefit
produces identical p-values to the simulated Chi-squared GOF test in the
stats
package that is part of base R.
set.seed(499)
chisq_gof(x, p, reps = 2000)$p.value
#> [1] 0.9685157
set.seed(499)
chisq.test(x, p = p, simulate.p.value = TRUE)$p.value
#> [1] 0.9685157
However, because Monte Carlo simulations in discretefit
are implemented in C++, chisq_gof
is much faster than
chisq.test
, especially when a large number of simulations
are required.
bench::system_time(
chisq_gof(x, p, reps = 20000)
)
#> process real
#> 85.3ms 85.4ms
bench::system_time(
chisq.test(x, p = p, simulate.p.value = TRUE, B = 20000)
)
#> process real
#> 1.22s 1.22s
Additionally, the simulated GOF tests in base R is vectorized, so for
large vectors attempting a large number of simulations may not be
possible because of memory constraints. Since the functions in
discretefit
are not vectorized, memory use is
minimized.
Let N represent the number of observations, let k represent the number of bins, let pi represent the expected fraction of observations in the ith bin, and let qi represent the observed fraction of observations in the ith bin. Pearson’s chi-squared is defined as follows (Pearson, 1900):
$$ X^2 = \sum_{i = 1}^{k} \frac{(q_{i} - p_{i})^2} {p_{i}} $$ The log-likelihood ratio statistic is defined as follows under the convention that qi ln($\frac{q_{i}} {p_{i}}$) = 0 when qi = 0.
$$ G^2 = 2N \sum_{i = 1}^{k} q_{i} \ln(\frac{q_{i}} {p_{i}}) $$ The Freeman-Tukey statistic was introduced by Freeman and Tukey (1950). For application to GOF tests see Cressie and Read (1984).
$$ FT = 4N \sum_{i = 1}^{k} (\sqrt{q_{i}} - \sqrt{p_{i}})^2 $$ Finally, the root-mean-square statistic is defined as follows:
$$ RMS = \sqrt{N^{-1} \sum_{i = 1}^{k} (q_{i} - p_{i})^2}$$
For statistics based on the empirical cumulative distribution function, let Si correspond to the empirical cumulative distribution function for the observed data, Ti correspond to the cumulative distribution function for the theoretical distribution, and Zi = Si - Ti.
The Kolmogorov-Smirnov statistics for discrete distributions as defined by Pettitt and Stephens is as follows (1977):
KS = max1 ≤ i ≤ k|Zi|
The cvm_gof
function utilizes the formula for the
Cramer-von Mises statistic for discrete distributions originally
introduced by Choulakian, Lockhart, and Stephens (1994).
$$ CVM = N^{-1} \sum_{i = 1}^{k} Z_{i}^2p_{i} $$ For an alternative formula see Lockhart, Spinelli, and Stephens (2007).
All p-values calculated by the functions in discretefit
follow the formula for simulated p-values proposed by Dwass (1957). For
the below equation, let m represent the number of simulations and let B
represent the number of simulations where the test statistic for the
simulated data is greater than or equal to the test statistic for the
observed data.
$${p}_{u} = P(B <= b) = \frac{b + 1}{m + 1}$$
This is the equation used to calculate simulated p-values in the
chisq.test
function in the stats
package but
some implementations of simulated p-values, for example the
dgof
package, use the unbiased estimator, $\frac{B}{m}$. For an explanation of why the
biased estimator yields a test of the correct size and the unbiased
estimator does not, see Phipson and Smyth (2011).
Several other packages implement GOF tests for discrete distributions.
As noted above, the stats
package in base R implements a
simulated Chi-squared GOF test.
I’m not aware of an R package that implements a simulated G2 GOF test but the
packages RVAideMemoire
and DescTools
implement
GOF tests that utilize approximations based on the Chi-squared
distribution.
The dgof
package (Anderson and Emerson 2011) implements
simulated Kolmogorov-Smirnov GOF tests and simulated Cramer-von Mises
GOF tests . The cvmdisc
package also implements a simulated
Cramer-von Mises GOF test.
The KSgeneral
package (Dimitrova, Kaishev, and Tan,
2020) implements exact Kolmogorov-Smirnov GOF tests and fast, simulated
GOF tests utilizing the algorithm introduced by Wood and Altavela
(1978).
I’m not aware of another R package that implements a root-mean-square GOF test.
Arnold, Taylor B. and John W. Emerson. “Nonparametric goodness-of-fit tests for discrete null distributions.” R Journal. https://doi.org/10.32614/rj-2011-016
Choulakian, V., Richard. A. Lockhart, and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions.” Canadian Journal of Statistics, 1994. https://doi.org/10.2307/3315828
Cressie, N. and Read, T.R. “Multinomial goodness-of-fit tests.” Journal of the Royal Statistical Society, 1984. https://doi.org/10.1111/j.2517-6161.1984.tb01318.x
Dimitrova, Dimitrina S., Vladimir K. Kaishev, and Senren Tan. “Computing the Kolmogorov-Smirnov distribution when the underlying CDF is purely discrete, mixed, or continuous.” Journal of Statistical Software, 2020. https://doi.org/10.18637/jss.v095.i10
Dwass, Meyer. “Modified randomization tests for nonparametric hypotheses.” Annuls of Mathematical Statistics, 1957. https://doi.org/10.1214/aoms/1177707045
Eddelbuettel, Dirk and Romain Francois. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software, 2011. https://www.jstatsoft.org/article/view/v040i08
Freeman, Murray F. and John W. Tukey. “Transformations related to the angular and the square root.” The Annals of Mathematical Statistics, 1950. https://doi.org/10.1214/aoms/1177729756
Lockhart, Richard A., John J. Spinelli and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions with unknown parameters.” Canadian Journal of Statistics, 2007. https://doi.org/10.1002/cjs.5550350111
Pearson, Karl. “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.” Philosophical Magazine, 1900.
Perkins, William, Mark Tygert, and Rachel Ward. “Computing the confidence levels for a root-mean-square test of goodness-of-fit.” Applied Mathematics and Computation, 2011. https://doi.org/10.1016/j.amc.2011.03.124
Pettitt, Anthony N., Stephens, Michael A. “The Kolmogorov–Smirnov goodness-of-fit statistic with discrete and grouped data.” Technometrics, 1977. https://doi.org/10.2307/1268631
Phipson, Belinda, and Gordon K. Smyth. “Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 2010. https://doi.org/10.2202/1544-6115.1585
Ward, Rachel and Raymond J. Carroll. “Testing Hardy–Weinberg equilibrium with a simple root-mean-square statistic.” Biostatistics, 2014. https://doi.org/10.1093/biostatistics/kxt028
Wood, Constance L., and Michele M. Altavela. “Large-Sample Results for Kolmogorov–Smirnov Statistics for Discrete Distributions.” Biometrika, 1978. https://doi.org/10.1093/biomet/65.1.235