|
![](/i/fill.gif) |
Suppose you flip a coin 1000 times. What is the probability of getting
402 heads?
The answer is of course that the head count follows a binomial
distribution. That means that you need to compute the probability mass
function. Unfortunately, in this instance the PMF in question is
Pr(402 heads) = (1000 choose 402) / 2^1000
Now this is a problem. 2^1000 is roughly the square of the number of
atoms in the observable universe, while 1000 choose 402 is defined as
1000 choose 402 = 1000! / (402! * 598!)
In principle it looks plausible that you could compute the factorial of
a big number. It might just be a bit slow, eh? Well... no. It turns out
that 1000! is over 3 million bits in length, for a start.
Even if you could compute 1000 choose 402, it's again almost as huge as
2^1000. And taking the quotient of two vast numbers is unlikely to be
very numerically stable.
Well, perhaps we should instead look at the cumulative density function?
Unfortunately, the CDF for a binomial distribution is even worse. It's
defined in terms of the regularised incomplete beta function, which a
quotient of the beta function (which involves more factorials - or
worse, the gamma function) and the incomplete beta function (which is
defined as a complicated integral having no closed-form solutions).
Without an advanced degree in special functions, it's not even possible
to derive an *algorithm* for the CDF, never mind one that's numerically
stable.
Alternatively, we can construct a histogram, with
Heads Tails
402 598
and compare this with the "expected outcomes",
Heads Tails
500 500
Applying Pearson's chi-squared test, we have to compute
X^2 = (402-500)^2 / 500 + (598-500)^2 / 500
This is easily determined to be 38.416. But what's the *probability* of
that happening? The answer is that X^2 follows a chi-squared
distribution, and consulting the PDF, we discover... that it requires
the gamma function at non-integer arguments. (Well, half-integer anyway.
There *is* a closed-form solution for that... but it requires a double
factorial. Again.)
OK, so let's try the CDF... Yeah, no-go. That requires the gamma
function *and* the lower incomplete gamma function too.
Ah, but wait. Many, many probability distributions approximate the
so-called "normal distribution" for sufficiently large values of one
parameter or another. And indeed, we find that the binomial distribution
for a very large number of trails (such as our 1000) is approximately a
normal distribution with a mean of 1000/2 and a standard deviation of
Sqrt(1000)/2.
The PDF for a normal distribution isn't too bad. It's basically
exp(-x^2) with a few twiddle-factors. Using this we should be able to
compute a probability. (Need a continuity correction though. So *that*
should be easy to get right...)
The CDF is like your worst nightmare. It involves the "error function",
erf(). (Or, equivalently, erfc().) This, it seems, can be defined in
terms of... *sigh* the gamma function again.
In short, if you wanted a p-value for our 402 heads, good luck with
that. You can compute a Z-score easily enough, but you won't get a p-value.
It's kind of frustrating that the "normal" distribution, the one most
commonly encountered, is also the hardest one to find p-values for.
Post a reply to this message
|
![](/i/fill.gif) |