Thorsten Froehlich wrote:
> Well, it would fit into the eight available stack places. It is not as
> trivial as doing it with (at least) eight real registers.
>
Yes. Especially since there are also some restrictions -- e.g. several
functions can only be performed on the top-of-stack element.
It's similar to the POV VM, which (IIRC) calls maths functions only on r0.
> The good thing is that most of these redundant moves can be removed
> without too much work using peephole optimisation.
>
Correct. After sending my last posting, I played with the idea
of actually implementing a peephole optimization step...
> However, it turns out that
> this produces hardly any performance gain (for neither VM nor JIT code)
>
...but I see that you already tried.
(Good that I hadn't started on it yet.)
> but does
> make the compiling/assembling much more complicated. What you get is
> below the measurement error - about two percent raw function performance,
> which translates into at most 0.5% (18 seconds per hour) speed improvement
> in a heavy isosurface-using scene.
>
Which clearly is not worth the effort.
> The trick is that having the code inline reduces call overhead, which
> accounts for about 10% to 15% for the average isosurface function.
>
This is correct. Actually, the dynamically linked function call
overhead seems to be the only performance disadvantage of my approach.
I don't know exactly what the CPU cycles are spent on:
(*int_func)(bar); // <-- internal function
(*ext_func)(bar); // <-- external function (dynamically linked)
In the C source both calls look completely identical, but the second
one has at least 10 times more call overhead.
> Thus,
> while you perhaps gain 10% function speed by using gcc (thus 2.5% for the
> scene), the call overhead has to be really low to make a difference. To
> reach this with dynamic linking is not as easy as with truly inline
> compiled
> code. So most likely the total difference is close to zero.
>
Well, it's probably more like this: the more complex a single function,
the more you gain. For trivial functions, the "gain" may even be
negative (i.e. a loss).
> The main reason for this is that unlike integer instructions, changing
> floating-point operation order tends to also change precision, which in
> turn
> will quickly result in the compiled representation not being equivalent.
>
Well, since the i386 internally has 80-bit FP registers, the accuracy
of the compiled version can be expected to be at least as good as
that of the interpreted version. But of course that does not guarantee
equivalent images. OTOH, all this numerics is more or less a
trade-off between accuracy and runtime. And scenes should not
depend on the last couple of bits anyway, because they would then
look completely different when built with different compilers or run
on different architectures. - You already mentioned something similar
further down.
> As the basic principle of compiler construction is to generate equivalent
> code, compilers either perform very few optimisations at all, or, like
> newer compilers do, allow disabling the strict equivalency requirement for
> floating-point operations.
>
I'm not sure what GCC really does. I'm compiling with -ffast-math,
which allows some IEEE violations, but I'm not sure whether it does
many "dangerous" things. At least I verified that the register moves
are optimized away.
>> But 2 reasons suggest against that [SSE2]:
>> (1) Not yet widely available.
>
> Well, every processor sold at Aldi today offers it, doesn't it? ;-)
>
IMHO it depends on whether they are currently selling AthlonXP or
Pentium 4 CPUs... [At least my 1.47 GHz AthlonXP does not have SSE2.]
Wolfgang