POV-Ray: Newsgroups: povray.off-topic: It's all statistics

POV-Ray : Newsgroups : povray.off-topic : It's all statistics		Server Time 13 Jul 2025 02:34:39 EDT (-0400)

From: Invisible
Subject: It's all statistics
Date: 29 Jun 2011 07:05:21
Message: <4e0b06f1$1@news.povray.org>

Suppose you flip a coin 1000 times. What is the probability of getting 
402 heads?

The answer is of course that the head count follows a binomial 
distribution. That means that you need to compute the probability mass 
function. Unfortunately, in this instance the PMF in question is

   Pr(402 heads) = (1000 choose 402) / 2^1000

Now this is a problem. 2^1000 is roughly the square of the number of 
atoms in the observable universe, while 1000 choose 402 is defined as

   1000 choose 402 = 1000! / (402! * 598!)

In principle it looks plausible that you could compute the factorial of 
a big number. It might just be a bit slow, eh? Well... no. It turns out 
that 1000! is over 3 million bits in length, for a start.

Even if you could compute 1000 choose 402, it's again almost as huge as 
2^1000. And taking the quotient of two vast numbers is unlikely to be 
very numerically stable.



Well, perhaps we should instead look at the cumulative density function? 
Unfortunately, the CDF for a binomial distribution is even worse. It's 
defined in terms of the regularised incomplete beta function, which a 
quotient of the beta function (which involves more factorials - or 
worse, the gamma function) and the incomplete beta function (which is 
defined as a complicated integral having no closed-form solutions). 
Without an advanced degree in special functions, it's not even possible 
to derive an *algorithm* for the CDF, never mind one that's numerically 
stable.



Alternatively, we can construct a histogram, with

   Heads Tails
    402   598

and compare this with the "expected outcomes",

   Heads Tails
    500   500

Applying Pearson's chi-squared test, we have to compute

   X^2 = (402-500)^2 / 500 + (598-500)^2 / 500

This is easily determined to be 38.416. But what's the *probability* of 
that happening? The answer is that X^2 follows a chi-squared 
distribution, and consulting the PDF, we discover... that it requires 
the gamma function at non-integer arguments. (Well, half-integer anyway. 
There *is* a closed-form solution for that... but it requires a double 
factorial. Again.)

OK, so let's try the CDF... Yeah, no-go. That requires the gamma 
function *and* the lower incomplete gamma function too.



Ah, but wait. Many, many probability distributions approximate the 
so-called "normal distribution" for sufficiently large values of one 
parameter or another. And indeed, we find that the binomial distribution 
for a very large number of trails (such as our 1000) is approximately a 
normal distribution with a mean of 1000/2 and a standard deviation of 
Sqrt(1000)/2.

The PDF for a normal distribution isn't too bad. It's basically 
exp(-x^2) with a few twiddle-factors. Using this we should be able to 
compute a probability. (Need a continuity correction though. So *that* 
should be easy to get right...)

The CDF is like your worst nightmare. It involves the "error function", 
erf(). (Or, equivalently, erfc().) This, it seems, can be defined in 
terms of... *sigh* the gamma function again.

In short, if you wanted a p-value for our 402 heads, good luck with 
that. You can compute a Z-score easily enough, but you won't get a p-value.

It's kind of frustrating that the "normal" distribution, the one most 
commonly encountered, is also the hardest one to find p-values for.

Post a reply to this message

From: Lars R
Subject: Re: It's all statistics
Date: 29 Jun 2011 07:58:57
Message: <4e0b1381$1@news.povray.org>

On 06/29/11 13:05, Invisible wrote:
> Suppose you flip a coin 1000 times. What is the probability of getting
> 402 heads?
> 
> The answer is of course that the head count follows a binomial
> distribution. That means that you need to compute the probability mass
> function. Unfortunately, in this instance the PMF in question is
> 
>   Pr(402 heads) = (1000 choose 402) / 2^1000
> 
> Now this is a problem. 2^1000 is roughly the square of the number of
> atoms in the observable universe, while 1000 choose 402 is defined as
> 
>   1000 choose 402 = 1000! / (402! * 598!)
> 
> In principle it looks plausible that you could compute the factorial of
> a big number. It might just be a bit slow, eh? Well... no. It turns out
> that 1000! is over 3 million bits in length, for a start.
> 
> Even if you could compute 1000 choose 402, it's again almost as huge as
> 2^1000. And taking the quotient of two vast numbers is unlikely to be
> very numerically stable.

Division by a power of 2 is numerically very stable for binary
arithmetics: It is just subtraction of the exponent. :-)

Moreover: If you feed the formula above into a symbolic calculator it is
able to

  a) compute the binomial more efficient than by division of three
     factorials.
  b) eliminate each factor of 2 in the numerator by one in the
     denominator so the division does not operates on this high numbers.


So the program 'maxima' prints after a very short time:


(%i1) binomial(1000,402)/(2^1000);
(%o1) 138375989102164508869118744263820047102478631240107164548096353\
437104485769496863239251132205776942412106967230343812796549218291211\
502370211070909495252274618853753553269750995170002584212749043301928\
705163749324086671038996613530725409460218145152701829040905136436576\
746556205530054031875 / 133938575898283415118553131132500226320175601\
463191700930468798546293881390617015311649797351961982265949334114694\
143353148393160711539255449807219683732185049182097185302887317763432\
563279639273474427276913080937294774265842484594489569299325963286432\
1399559710817770957553728956578048354650708508672

(I put spaces around the / to make it clearer)

To get a floating point approximation of the quotient:

(%i2) float(binomial(1000,402)/(2^1000));
(%o2)                       1.0331302104275842E-10


So where is the problem in calculating such expressions?


			Lars R.

Post a reply to this message

From: Invisible
Subject: Re: It's all statistics
Date: 29 Jun 2011 08:21:39
Message: <4e0b18d3$1@news.povray.org>

> Division by a power of 2 is numerically very stable for binary
> arithmetics: It is just subtraction of the exponent. :-)

OK. But that still leaves me with the problem of computing the binomial 
coefficient in the first place. Simply generating two vast numbers and 
then expecting their quotient to be computed accurately isn't a very 
good idea. I could probably use Pascal's triangle to compute it more 
directly. (Let's face it, computing the factorials is iterative anyway.)

Given that the number I eventually want to arrive at is actually 10^-10, 
it seems quite daft to arrive at that by computing huge integers.

(And this still doesn't help me if I want a P-value.)

> So where is the problem in calculating such expressions?

I don't have a copy of Maxima.

But that doesn't matter, because Wolfram Alpha can similarly give you a 
numerical answer almost instantly.

The real problem is that I want to write computer software which 
computes P-values for the results it obtains. The fact that all these 
distributions are so numerically intractable to operate with makes that 
a rather difficult task.

PS. Apparently Boost has functions for the PDF and CDF of many common 
distributions. Who knew?

Post a reply to this message

From: Le Forgeron
Subject: Re: It's all statistics
Date: 29 Jun 2011 08:54:59
Message: <4e0b20a3@news.povray.org>

Le 29/06/2011 14:21, Invisible a écrit :
>> Division by a power of 2 is numerically very stable for binary
>> arithmetics: It is just subtraction of the exponent. :-)
> 
> OK. But that still leaves me with the problem of computing the binomial
> coefficient in the first place. Simply generating two vast numbers and
> then expecting their quotient to be computed accurately isn't a very
> good idea. I could probably use Pascal's triangle to compute it more
> directly. (Let's face it, computing the factorials is iterative anyway.)
> 
> Given that the number I eventually want to arrive at is actually 10^-10,
> it seems quite daft to arrive at that by computing huge integers.
> 
> (And this still doesn't help me if I want a P-value.)
> 
>> So where is the problem in calculating such expressions?
> 
> I don't have a copy of Maxima.

Included in any decent distribution... let's troll an OS war again (not!)
That's not Mathematica's license fee.

> 
> But that doesn't matter, because Wolfram Alpha can similarly give you a
> numerical answer almost instantly.
> 
> The real problem is that I want to write computer software which
> computes P-values for the results it obtains. The fact that all these
> distributions are so numerically intractable to operate with makes that
> a rather difficult task.

If you want to play with big numbers by yourself, there is C libraries
about that. They are often used in cryptography but can be used to
perform computation on unlimited integers.

Have a look at GNU MP for instance.

-- 
Software is like dirt - it costs time and money to change it and move it
around.

Just because you can't see it, it doesn't weigh anything,
and you can't drill a hole in it and stick a rivet into it doesn't mean
it's free.

Post a reply to this message

From: Invisible
Subject: Re: It's all statistics
Date: 29 Jun 2011 09:02:17
Message: <4e0b2259$1@news.povray.org>

>>> So where is the problem in calculating such expressions?
>>
>> I don't have a copy of Maxima.
>
> Included in any decent distribution... let's troll an OS war again (not!)
> That's not Mathematica's license fee.

I'd still have to learn Maxima, and I'd still have the problem that what 
I actually want is to write a program myself which computes the answers 
as part of its work. I don't really want an external dependency on a 
sophisticated math package.

>> The real problem is that I want to write computer software which
>> computes P-values for the results it obtains. The fact that all these
>> distributions are so numerically intractable to operate with makes that
>> a rather difficult task.
>
> If you want to play with big numbers by yourself, there is C libraries
> about that. They are often used in cryptography but can be used to
> perform computation on unlimited integers.
>
> Have a look at GNU MP for instance.

I don't really want to play with big numbers; I want to compute a 
P-value. It's just that the formula for the binomial distribution 
happens to involve large numbers as intermediate results. The other 
distributions have the problem that they all involve exotic functions 
which nobody can compute (most prominently the gamma function).

If there were a way to approximate these distributions reasonably well 
(especially in the tails), that would be preferable.

Post a reply to this message

From: Lars R
Subject: Re: It's all statistics
Date: 29 Jun 2011 10:12:32
Message: <4e0b32d0$1@news.povray.org>

> […]
> I don't really want to play with big numbers; I want to compute a
> P-value. It's just that the formula for the binomial distribution
> happens to involve large numbers as intermediate results.

You can choose:

1) Either use exact results that needs "arbitrary precision" libraries.
Even if you calculate (n choose p) more efficiently by alternating
multiplications and (integer) divisions. The integers still becomes
quite big.

2) Use floating point arithmetics. The precision of IEEE double is far
enough for your calculations. :-)

			Lars R.

Post a reply to this message

From: Orchid XP v8
Subject: Re: It's all statistics
Date: 29 Jun 2011 17:42:49
Message: <4e0b9c59@news.povray.org>

On 29/06/2011 12:05 PM, Invisible wrote:
> Suppose you flip a coin 1000 times. What is the probability of getting
> 402 heads?
>
> The answer is of course that the head count follows a binomial
> distribution. That means that you need to compute the probability mass
> function. Unfortunately, in this instance the PMF in question is
>
> Pr(402 heads) = (1000 choose 402) / 2^1000
>
> Now this is a problem. 2^1000 is roughly the square of the number of
> atoms in the observable universe, while 1000 choose 402 is defined as
>
> 1000 choose 402 = 1000! / (402! * 598!)

Just for giggles, I tried it.

It turns out that the GMP can easily handle factorials of this size. I 
was even able to convert this to a double-precision floating point 
number and divide by 2^1000 to get the correct answer.

However, if the parameters become sufficiently large, it eventually 
stops working. Apparently the "choose" function yields its largest 
result when k = n/2. 1028 choose 514 works OK. However, 1030 choose 515 
is so huge that it registers as "infinity". The rest of the calculation 
then fails to work as a result.

Performing the division by 2^n in infinite-precision arithmetic and then 
converting to double-precision allowed me to get the correct probability 
value for 1030 flips with 515 heads. However, if we now look at the 
lowest probability, around 1060 flips the probability reported for zero 
heads began to get progressively less accurate. (Denormal numbers?) At 
1070 flips 0 heads the probability actually came out as "zero". (It 
should in fact be 7.7 * 10^326.) Still, the fraction representation was 
quite correct; it's merely that double-precision cannot handle such tiny 
numbers.

In short, it seems there are limits to what can be done without some 
very clever numerical programming. Still, I got a lot further than I 
ever expected to get...

-- 
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*

Post a reply to this message

From: Neeum Zawan
Subject: Re: It's all statistics
Date: 9 Jul 2011 13:18:58
Message: <8762nbs781.fsf@fester.com>

Invisible <voi### [at] devnull> writes:

>   Pr(402 heads) = (1000 choose 402) / 2^1000
>
> Now this is a problem. 2^1000 is roughly the square of the number of
> atoms in the observable universe, while 1000 choose 402 is defined as
>
>   1000 choose 402 = 1000! / (402! * 598!)
>
> In principle it looks plausible that you could compute the factorial of
> a big number. It might just be a bit slow, eh? Well... no. It turns out
> that 1000! is over 3 million bits in length, for a start.
>
> Even if you could compute 1000 choose 402, it's again almost as huge as
> 2^1000. And taking the quotient of two vast numbers is unlikely to be
> very numerically stable.

Am I missing something? This is trivial to calculate:

(1000/(598*2)) * (999/(597*2) * (998/(596/2)) * ... 

when you've exhausted the 598!, continue the sequence with 402!

Post a reply to this message

From: Mike Raiford
Subject: Re: It's all statistics
Date: 19 Jul 2011 10:14:46
Message: <4e259156$1@news.povray.org>

On 6/29/2011 6:58 AM, Lars R. wrote:

>
>
> So the program 'maxima' prints after a very short time:
>
>
> (%i1) binomial(1000,402)/(2^1000);
> (%o1) 138375989102164508869118744263820047102478631240107164548096353\
> 437104485769496863239251132205776942412106967230343812796549218291211\
> 502370211070909495252274618853753553269750995170002584212749043301928\
> 705163749324086671038996613530725409460218145152701829040905136436576\
> 746556205530054031875 / 133938575898283415118553131132500226320175601\
> 463191700930468798546293881390617015311649797351961982265949334114694\
> 143353148393160711539255449807219683732185049182097185302887317763432\
> 563279639273474427276913080937294774265842484594489569299325963286432\
> 1399559710817770957553728956578048354650708508672
>
> (I put spaces around the / to make it clearer)
>
> To get a floating point approximation of the quotient:
>
> (%i2) float(binomial(1000,402)/(2^1000));
> (%o2)                       1.0331302104275842E-10
>

Cool :) Another Maxima user!

-- 
~Mike

Post a reply to this message