|
|
|
|
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Suppose you flip a coin 1000 times. What is the probability of getting
402 heads?
The answer is of course that the head count follows a binomial
distribution. That means that you need to compute the probability mass
function. Unfortunately, in this instance the PMF in question is
Pr(402 heads) = (1000 choose 402) / 2^1000
Now this is a problem. 2^1000 is roughly the square of the number of
atoms in the observable universe, while 1000 choose 402 is defined as
1000 choose 402 = 1000! / (402! * 598!)
In principle it looks plausible that you could compute the factorial of
a big number. It might just be a bit slow, eh? Well... no. It turns out
that 1000! is over 3 million bits in length, for a start.
Even if you could compute 1000 choose 402, it's again almost as huge as
2^1000. And taking the quotient of two vast numbers is unlikely to be
very numerically stable.
Well, perhaps we should instead look at the cumulative density function?
Unfortunately, the CDF for a binomial distribution is even worse. It's
defined in terms of the regularised incomplete beta function, which a
quotient of the beta function (which involves more factorials - or
worse, the gamma function) and the incomplete beta function (which is
defined as a complicated integral having no closed-form solutions).
Without an advanced degree in special functions, it's not even possible
to derive an *algorithm* for the CDF, never mind one that's numerically
stable.
Alternatively, we can construct a histogram, with
Heads Tails
402 598
and compare this with the "expected outcomes",
Heads Tails
500 500
Applying Pearson's chi-squared test, we have to compute
X^2 = (402-500)^2 / 500 + (598-500)^2 / 500
This is easily determined to be 38.416. But what's the *probability* of
that happening? The answer is that X^2 follows a chi-squared
distribution, and consulting the PDF, we discover... that it requires
the gamma function at non-integer arguments. (Well, half-integer anyway.
There *is* a closed-form solution for that... but it requires a double
factorial. Again.)
OK, so let's try the CDF... Yeah, no-go. That requires the gamma
function *and* the lower incomplete gamma function too.
Ah, but wait. Many, many probability distributions approximate the
so-called "normal distribution" for sufficiently large values of one
parameter or another. And indeed, we find that the binomial distribution
for a very large number of trails (such as our 1000) is approximately a
normal distribution with a mean of 1000/2 and a standard deviation of
Sqrt(1000)/2.
The PDF for a normal distribution isn't too bad. It's basically
exp(-x^2) with a few twiddle-factors. Using this we should be able to
compute a probability. (Need a continuity correction though. So *that*
should be easy to get right...)
The CDF is like your worst nightmare. It involves the "error function",
erf(). (Or, equivalently, erfc().) This, it seems, can be defined in
terms of... *sigh* the gamma function again.
In short, if you wanted a p-value for our 402 heads, good luck with
that. You can compute a Z-score easily enough, but you won't get a p-value.
It's kind of frustrating that the "normal" distribution, the one most
commonly encountered, is also the hardest one to find p-values for.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 06/29/11 13:05, Invisible wrote:
> Suppose you flip a coin 1000 times. What is the probability of getting
> 402 heads?
>
> The answer is of course that the head count follows a binomial
> distribution. That means that you need to compute the probability mass
> function. Unfortunately, in this instance the PMF in question is
>
> Pr(402 heads) = (1000 choose 402) / 2^1000
>
> Now this is a problem. 2^1000 is roughly the square of the number of
> atoms in the observable universe, while 1000 choose 402 is defined as
>
> 1000 choose 402 = 1000! / (402! * 598!)
>
> In principle it looks plausible that you could compute the factorial of
> a big number. It might just be a bit slow, eh? Well... no. It turns out
> that 1000! is over 3 million bits in length, for a start.
>
> Even if you could compute 1000 choose 402, it's again almost as huge as
> 2^1000. And taking the quotient of two vast numbers is unlikely to be
> very numerically stable.
Division by a power of 2 is numerically very stable for binary
arithmetics: It is just subtraction of the exponent. :-)
Moreover: If you feed the formula above into a symbolic calculator it is
able to
a) compute the binomial more efficient than by division of three
factorials.
b) eliminate each factor of 2 in the numerator by one in the
denominator so the division does not operates on this high numbers.
So the program 'maxima' prints after a very short time:
(%i1) binomial(1000,402)/(2^1000);
(%o1) 138375989102164508869118744263820047102478631240107164548096353\
437104485769496863239251132205776942412106967230343812796549218291211\
502370211070909495252274618853753553269750995170002584212749043301928\
705163749324086671038996613530725409460218145152701829040905136436576\
746556205530054031875 / 133938575898283415118553131132500226320175601\
463191700930468798546293881390617015311649797351961982265949334114694\
143353148393160711539255449807219683732185049182097185302887317763432\
563279639273474427276913080937294774265842484594489569299325963286432\
1399559710817770957553728956578048354650708508672
(I put spaces around the / to make it clearer)
To get a floating point approximation of the quotient:
(%i2) float(binomial(1000,402)/(2^1000));
(%o2) 1.0331302104275842E-10
So where is the problem in calculating such expressions?
Lars R.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
> Division by a power of 2 is numerically very stable for binary
> arithmetics: It is just subtraction of the exponent. :-)
OK. But that still leaves me with the problem of computing the binomial
coefficient in the first place. Simply generating two vast numbers and
then expecting their quotient to be computed accurately isn't a very
good idea. I could probably use Pascal's triangle to compute it more
directly. (Let's face it, computing the factorials is iterative anyway.)
Given that the number I eventually want to arrive at is actually 10^-10,
it seems quite daft to arrive at that by computing huge integers.
(And this still doesn't help me if I want a P-value.)
> So where is the problem in calculating such expressions?
I don't have a copy of Maxima.
But that doesn't matter, because Wolfram Alpha can similarly give you a
numerical answer almost instantly.
The real problem is that I want to write computer software which
computes P-values for the results it obtains. The fact that all these
distributions are so numerically intractable to operate with makes that
a rather difficult task.
PS. Apparently Boost has functions for the PDF and CDF of many common
distributions. Who knew?
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Le 29/06/2011 14:21, Invisible a écrit :
>> Division by a power of 2 is numerically very stable for binary
>> arithmetics: It is just subtraction of the exponent. :-)
>
> OK. But that still leaves me with the problem of computing the binomial
> coefficient in the first place. Simply generating two vast numbers and
> then expecting their quotient to be computed accurately isn't a very
> good idea. I could probably use Pascal's triangle to compute it more
> directly. (Let's face it, computing the factorials is iterative anyway.)
>
> Given that the number I eventually want to arrive at is actually 10^-10,
> it seems quite daft to arrive at that by computing huge integers.
>
> (And this still doesn't help me if I want a P-value.)
>
>> So where is the problem in calculating such expressions?
>
> I don't have a copy of Maxima.
Included in any decent distribution... let's troll an OS war again (not!)
That's not Mathematica's license fee.
>
> But that doesn't matter, because Wolfram Alpha can similarly give you a
> numerical answer almost instantly.
>
> The real problem is that I want to write computer software which
> computes P-values for the results it obtains. The fact that all these
> distributions are so numerically intractable to operate with makes that
> a rather difficult task.
If you want to play with big numbers by yourself, there is C libraries
about that. They are often used in cryptography but can be used to
perform computation on unlimited integers.
Have a look at GNU MP for instance.
--
Software is like dirt - it costs time and money to change it and move it
around.
Just because you can't see it, it doesn't weigh anything,
and you can't drill a hole in it and stick a rivet into it doesn't mean
it's free.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
>>> So where is the problem in calculating such expressions?
>>
>> I don't have a copy of Maxima.
>
> Included in any decent distribution... let's troll an OS war again (not!)
> That's not Mathematica's license fee.
I'd still have to learn Maxima, and I'd still have the problem that what
I actually want is to write a program myself which computes the answers
as part of its work. I don't really want an external dependency on a
sophisticated math package.
>> The real problem is that I want to write computer software which
>> computes P-values for the results it obtains. The fact that all these
>> distributions are so numerically intractable to operate with makes that
>> a rather difficult task.
>
> If you want to play with big numbers by yourself, there is C libraries
> about that. They are often used in cryptography but can be used to
> perform computation on unlimited integers.
>
> Have a look at GNU MP for instance.
I don't really want to play with big numbers; I want to compute a
P-value. It's just that the formula for the binomial distribution
happens to involve large numbers as intermediate results. The other
distributions have the problem that they all involve exotic functions
which nobody can compute (most prominently the gamma function).
If there were a way to approximate these distributions reasonably well
(especially in the tails), that would be preferable.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
> […]
> I don't really want to play with big numbers; I want to compute a
> P-value. It's just that the formula for the binomial distribution
> happens to involve large numbers as intermediate results.
You can choose:
1) Either use exact results that needs "arbitrary precision" libraries.
Even if you calculate (n choose p) more efficiently by alternating
multiplications and (integer) divisions. The integers still becomes
quite big.
2) Use floating point arithmetics. The precision of IEEE double is far
enough for your calculations. :-)
Lars R.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 29/06/2011 12:05 PM, Invisible wrote:
> Suppose you flip a coin 1000 times. What is the probability of getting
> 402 heads?
>
> The answer is of course that the head count follows a binomial
> distribution. That means that you need to compute the probability mass
> function. Unfortunately, in this instance the PMF in question is
>
> Pr(402 heads) = (1000 choose 402) / 2^1000
>
> Now this is a problem. 2^1000 is roughly the square of the number of
> atoms in the observable universe, while 1000 choose 402 is defined as
>
> 1000 choose 402 = 1000! / (402! * 598!)
Just for giggles, I tried it.
It turns out that the GMP can easily handle factorials of this size. I
was even able to convert this to a double-precision floating point
number and divide by 2^1000 to get the correct answer.
However, if the parameters become sufficiently large, it eventually
stops working. Apparently the "choose" function yields its largest
result when k = n/2. 1028 choose 514 works OK. However, 1030 choose 515
is so huge that it registers as "infinity". The rest of the calculation
then fails to work as a result.
Performing the division by 2^n in infinite-precision arithmetic and then
converting to double-precision allowed me to get the correct probability
value for 1030 flips with 515 heads. However, if we now look at the
lowest probability, around 1060 flips the probability reported for zero
heads began to get progressively less accurate. (Denormal numbers?) At
1070 flips 0 heads the probability actually came out as "zero". (It
should in fact be 7.7 * 10^326.) Still, the fraction representation was
quite correct; it's merely that double-precision cannot handle such tiny
numbers.
In short, it seems there are limits to what can be done without some
very clever numerical programming. Still, I got a lot further than I
ever expected to get...
--
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Invisible <voi### [at] devnull> writes:
> Pr(402 heads) = (1000 choose 402) / 2^1000
>
> Now this is a problem. 2^1000 is roughly the square of the number of
> atoms in the observable universe, while 1000 choose 402 is defined as
>
> 1000 choose 402 = 1000! / (402! * 598!)
>
> In principle it looks plausible that you could compute the factorial of
> a big number. It might just be a bit slow, eh? Well... no. It turns out
> that 1000! is over 3 million bits in length, for a start.
>
> Even if you could compute 1000 choose 402, it's again almost as huge as
> 2^1000. And taking the quotient of two vast numbers is unlikely to be
> very numerically stable.
Am I missing something? This is trivial to calculate:
(1000/(598*2)) * (999/(597*2) * (998/(596/2)) * ...
when you've exhausted the 598!, continue the sequence with 402!
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 6/29/2011 6:58 AM, Lars R. wrote:
>
>
> So the program 'maxima' prints after a very short time:
>
>
> (%i1) binomial(1000,402)/(2^1000);
> (%o1) 138375989102164508869118744263820047102478631240107164548096353\
> 437104485769496863239251132205776942412106967230343812796549218291211\
> 502370211070909495252274618853753553269750995170002584212749043301928\
> 705163749324086671038996613530725409460218145152701829040905136436576\
> 746556205530054031875 / 133938575898283415118553131132500226320175601\
> 463191700930468798546293881390617015311649797351961982265949334114694\
> 143353148393160711539255449807219683732185049182097185302887317763432\
> 563279639273474427276913080937294774265842484594489569299325963286432\
> 1399559710817770957553728956578048354650708508672
>
> (I put spaces around the / to make it clearer)
>
> To get a floating point approximation of the quotient:
>
> (%i2) float(binomial(1000,402)/(2^1000));
> (%o2) 1.0331302104275842E-10
>
Cool :) Another Maxima user!
--
~Mike
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
|
|