  Re: [announce] JITC: Really fast POVRay FPU  
From: Thorsten Froehlich
Date: 4 Aug 2004 16:28:45
Message: <411146fd@news.povray.org>
In article <41112aa8@news.povray.org>, Wolfgang Wieser
<wwi### [at] nospamgmxde> wrote:

> Well, I had a look at the PPC JIT Compiler in the Mac version. If I
> understand it correctly, then it is actually assembling the binary code for
> the PPC in memory and then executing it by jumping into that created
> code.

Yes, that is what it does.
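
For illustration, here is a minimal sketch of the same idea on
x86-64/POSIX (not the actual Mac code generator, which of course emits
PPC instructions): map a writable and executable buffer, write machine
code into it, and jump into it through a function pointer.

  #include <sys/mman.h>
  #include <cstring>
  #include <cstdio>

  int main()
  {
      // "mov eax, 42 ; ret" -- a real JIT would emit the translated
      // function body here instead of this fixed stub.
      unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

      void* buf = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED)
          return 1;
      std::memcpy(buf, code, sizeof(code));

      int (*fn)() = reinterpret_cast<int (*)()>(buf);
      std::printf("%d\n", fn());        // prints 42

      munmap(buf, 4096);
      return 0;
  }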

> After thinking about it, I found 2 reasons which can keep me from
> doing something like that for i386 architecture:
>
> (1) When I saw the POV VM instruction set it immediately reminded me
>   somehow of the PPC instruction set or a similar RISC instruction set
>   with a number of general-purpose registers etc. (I do not know the PPC
>   instructions in detail but this opinion was based on the feeling I had
>   of it from reading the computer magazines and from my experiences with
>   other RISCs.)
>   So, compiling this code into PPC code turns out to be pretty
>   straightforward. In contrast, the i387 does not seem to have
>   these general-purpose registers but instead it uses a register stack with
>   IMO 8 registers and there is a top-of-stack pointer and so on.
>   Furthermore, I am by far no expert in i386/7 assembly and I do not want
>   to hack tons of error-prone code to perform correct translation of
>   POV-VM-ASM into i387-ASM. [r0 should be top of stack...]

Well, it would fit into the eight available stack places, but it is not as
trivial as doing it with (at least) eight real registers.
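
To make the difference concrete: a single register-based VM instruction
like "r0 = r1 + r2" has to be routed through the x87 stack roughly as in
the sketch below (the emit() helper and the operand layout are only
assumptions for the illustration, not POV-Ray code).

  #include <cstdio>

  // Hypothetical "append one instruction" helper; here it just prints.
  void emit(const char* text) { std::puts(text); }

  void emit_vm_add(/* r0 = r1 + r2 */)
  {
      emit("fld   qword ptr [r1]");   // push r1 onto st(0)
      emit("fadd  qword ptr [r2]");   // st(0) = st(0) + r2
      emit("fstp  qword ptr [r0]");   // pop st(0) into r0
  }

With real registers the same instruction is a single add; on the x87 you
also have to keep track of what currently sits where on the stack.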

> (2) GCC does a decent job of optimizing. The POV VM compiler produces
>   assembly which IMO has plenty of (seemingly?) pointless register moves.

Yes, it does generate plenty of them, which is an artifact of compiling
directly from an expression tree into the final instruction set without an
intermediate representation.

The good thing is that most of these redundant moves can be removed without
too much work using peephole optimisation.  However, it turns out that this
produces hardly any performance gain (for either VM or JIT code) but does
make the compiling/assembling much more complicated.  What you get is below
the measurement error: about two percent raw function performance, which
translates into at most 0.5% (18 seconds per hour) speed improvement in a
scene that makes heavy use of isosurfaces.
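
Just to sketch what such a pass looks like (the instruction layout here
is an assumption, not the actual POV VM opcode format): drop a move whose
result is immediately copied straight back.

  #include <vector>

  struct Instr { int op, dst, src; };
  enum { OP_MOVE = 1 /* ... */ };

  // Remove "move rB, rA" immediately followed by "move rA, rB";
  // the second instruction is a redundant copy back.
  std::vector<Instr> peephole(const std::vector<Instr>& in)
  {
      std::vector<Instr> out;
      for (const Instr& i : in)
      {
          if (i.op == OP_MOVE && !out.empty())
          {
              const Instr& p = out.back();
              if (p.op == OP_MOVE && p.dst == i.src && p.src == i.dst)
                  continue;             // redundant, skip it
          }
          out.push_back(i);
      }
      return out;
  }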

>   Compiling this assembly directly into i387 code would probably not give
>   as good runtime performance as asking GCC would.
>   And I do not want to implement an optimizer especially since there is
>   really little chance to get better than GCC if we're only 1 or 2 people...

The trick is that having the code inline reduces call overhead, which
accounts for about 10% to 15% of the cost of the average isosurface
function.  So while you perhaps gain 10% function speed by using gcc (about
2.5% for the scene), the call overhead has to be really low for that to make
a difference.  Reaching that with dynamic linking is not as easy as with
truly inline compiled code, so most likely the total difference is close to
zero.
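
Just to sketch the gcc-based alternative (the file and symbol names are
made up): the function would live in a shared object and be reached
through dlopen()/dlsym(), so every evaluation goes through a dynamically
resolved, indirect call with a full C calling convention.

  #include <dlfcn.h>

  typedef double (*iso_fn)(double x, double y, double z);

  iso_fn load_function()
  {
      void* lib = dlopen("./isofunc.so", RTLD_NOW);   // gcc-compiled function
      return lib ? reinterpret_cast<iso_fn>(dlsym(lib, "user_function"))
                 : nullptr;
  }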

The main reason gcc will not gain all that much on the function code itself
is that, unlike with integer instructions, changing the order of
floating-point operations tends to also change precision, which quickly
makes the compiled representation no longer equivalent.  As the basic
principle of compiler construction is to generate equivalent code, compilers
either perform very few floating-point optimisations at all or, like newer
compilers, allow the strict equivalence requirement for floating-point
operations to be disabled.
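
A trivial example of the problem: merely re-associating a sum already
changes the result in doubles, so a compiler bound to exact equivalence
cannot do it.

  #include <cstdio>

  int main()
  {
      double a = 1e20, b = -1e20, c = 1.0;
      std::printf("%g\n", (a + b) + c);   // prints 1
      std::printf("%g\n", a + (b + c));   // prints 0 (c is lost next to 1e20)
      return 0;
  }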

The neat thing is that with isosurfaces, precision at the level compilers
have to preserve is only of secondary importance.  Thus the function
compiler already performs many of the possible optimisations itself, and all
that is left when compiling are relatively few storage instructions that
could still be optimised away.  As I already pointed out, measuring (on Macs
with VM and JIT compiler) revealed that those hardly cost any performance.

> Maybe it would be easier to translate the POV-VM-ASM into SSE2 instructions.

Yes, that would be much easier than targeting an i387-style FPU.
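
With SSE2 the eight xmm registers behave like flat scalar double
registers, so "r0 = r1 + r2" maps onto a single ADDSD instead of a
push/compute/pop stack sequence.  A sketch using intrinsics (a JIT would
of course emit the raw opcodes directly):

  #include <emmintrin.h>

  double add_sse2(double r1, double r2)
  {
      __m128d a = _mm_set_sd(r1);                 // r1 into the low lane
      __m128d b = _mm_set_sd(r2);
      return _mm_cvtsd_f64(_mm_add_sd(a, b));     // ADDSD: r0 = r1 + r2
  }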

> But 2 reasons speak against that:
> (1) Not yet widely available.

Well, every processor sold at Aldi today offers it, doesn't it? ;-)

> (2) My box only has SSE1 and hence I could not test it.

I can see how this would be a problem...

    Thorsten

____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde

Visit POV-Ray on the web: http://mac.povray.org

