From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 05:48:20
Message: <4110b0e3@news.povray.org>
Thorsten Froehlich wrote:
> In article <41101157$1@news.povray.org> , Nicolas Calimet
> <pov### [at] freefr> wrote:
>
>>> You are aware that the official POV-Ray for Mac OS does include a
>>> just-in-time compiler, aren't you?
>>
>> It's very likely Wolfgang never tried the Mac OS version ;-)
>
> That is true, but using the core code SYS_FUNCTIONS macro family would
> have made integration of a JIT compiler possible without hacking
> fnpovfpu.cpp. In particular because they are documented to exist exactly
> for that purpose.
>
Well, "they are documented to..." is a bit of an exaggeration.
All I found in the sources about these macros was a couple of lines
describing just what I already knew from reading the code, and one
"cryptic" allusion that they can somehow be used to allow JIT
compilation. And that sounded to me like "we could use that for JIT
some time, if we ever find the time to do so"...
> Would at least have made the whole implementation
> easier...
>
Okay... could somebody please make the POVRay-for-MacOS
source code available in a file format which I can read?
Or, alternatively, point me to a _working_ .sit unpacker for Linux.
The last time I used the StuffIt expander, I swore to ban it from
my HD: first I unpacked the archive, which took quite long for just
a few hundred kB. The binaries were okay, but it had messed up
the newlines in the text files. So I unpacked it again with changed
flags, which really took _ages_ this time, and now the text files were
okay but the binaries were corrupt. Oh dear...
> Certainly! On the other hand, calling gcc is overkill to make JIT
> compilation possible :-)
>
I wonder how the Mac version does it...
Wolfgang
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 05:59:25
Message: <4110b37c@news.povray.org>
ABX wrote:
> On Wed, 04 Aug 2004 11:28:22 +0200, Wolfgang Wieser
> <wwi### [at] nospamgmxde> wrote:
>> Christoph Hormann wrote:
>> > I think the work probably would have been better invested in
>> > implementing an internal JIT compiler using the existing hooks as
>> > Thorsten explained. This would work on all x86 systems (and for Mac an
>> > implementation already exists).
>>
>> Sounded interesting. I'll have a look at that.
>> Why didn't anybody put that on a publically available todo list? ;)
>
> I have considered this task for future versions of MegaPOV, though I
> would be veeeeery happy if you could be faster :-)
>
IMO, in its present state, this patch is not suitable for inclusion in
MegaPOV. I need to get rid of some unclean "hacks" first.
Or, maybe better, use the existing SYS_ macros if I manage to figure out
how to use them.
BTW, will the features from MLPov be included in the next MegaPOV version?
Wolfgang
On Wed, 04 Aug 2004 11:58:19 +0200, Wolfgang Wieser <wwi### [at] nospamgmxde>
wrote:
> BTW, will the features from MLPov be included in the next MegaPOV version?
Well... we try to be trendy ;-)
ABX
From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 06:26:41
Message: <4110b9e1@news.povray.org>
In article <4110b0e3@news.povray.org> , Wolfgang Wieser
<wwi### [at] nospamgmxde> wrote:
> Or, alternatively, point me to a _working_ .sit unpacker for Linux.
<http://www.stuffit.com/cgi-bin/stuffit_loginpage.cgi?stuffitunix>
> The last time I was using the stuffit expander, I sweared to ban it from
> my HD: First, I unpacked the archive which took quite long for just
> some hundret kb. The binaries were okay but I saw that it messed up
> the newlines in the text. So, I un-packed it again with changed flags
> which really took _ages_ now and finally the text files were okay but
> the binaries were corrupt. Oh dear...
Yes, the Windows version used to come with a Mac configuration as default.
Not very useful to get Mac line endings on Windows...
Thorsten
____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde
Visit POV-Ray on the web: http://mac.povray.org
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 14:27:53
Message: <41112aa8@news.povray.org>
Christoph Hormann wrote:
> I think the work probably would have been better invested in
> implementing an internal JIT compiler using the existing hooks as
> Thorsten explained. This would work on all x86 systems (and for Mac an
> implementation already exists).
>
Well, I had a look at the PPC JIT compiler in the Mac version. If I
understand it correctly, it actually assembles the binary code for
the PPC in memory and then executes it by jumping into that generated
code.
After thinking about it, I found two reasons that keep me from
doing something like that for the i386 architecture:
(1) When I saw the POV VM instruction set, it immediately reminded me
of the PPC instruction set, or a similar RISC instruction set
with a number of general-purpose registers etc. (I do not know the PPC
instructions in detail; this impression is based on what I picked up
from computer magazines and on my experience with other RISCs.)
So compiling this code into PPC code turns out to be pretty
straightforward. In contrast, the i387 does not have these
general-purpose registers; instead it uses a register stack of
eight registers with a top-of-stack pointer and so on.
Furthermore, I am by far no expert in i386/387 assembly, and I do not
want to hack tons of error-prone code to perform a correct translation
of POV-VM-ASM into i387-ASM. [r0 would have to be top of stack...]
(2) GCC does a decent job of optimizing. The POV VM compiler produces
assembly which IMO has plenty of (seemingly?) pointless register moves.
(Don't get me wrong, Thorsten: good register allocation is a really
tough job.) Take for example this part of the paramflower on my homepage.
------------------------
r0 = sqrt(r0);
r5 = r5 * r0;
r0 = r5;
r5 = r6; <-- completely useless
r5 = r0; <-- useless as well
r0 = r2;
r0 = sqrt(r0);
r5 = r5 + r0;
r6 = r5;
r0 = POVFPU_Consts[k];
r5 = r0; <-- (skip)
r7 = r5; <-- why not r7=r0
r0 = r2; <-- (skip)
r5 = r0; <-- why not r5=r2
r0 = r5; <-- hmm?!
r5 = r5 * r0;
r5 = r5 * r0;
r0 = r5;
r5 = r7;
------------------------
Compiling this assembly directly into i387 code would probably not give
as good runtime performance as asking GCC would.
And I do not want to implement an optimizer, especially since there is
really little chance of beating GCC if we're only one or two people...
Maybe it would be easier to translate the POV-VM-ASM into SSE2
instructions, but two reasons speak against that:
(1) SSE2 is not yet widely available.
(2) My box only has SSE1, so I could not test it.
Wolfgang
From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 16:28:45
Message: <411146fd@news.povray.org>
In article <41112aa8@news.povray.org> , Wolfgang Wieser
<wwi### [at] nospamgmxde> wrote:
> Well, I had a look at the PPC JIT Compiler in the Mac version. If I
> understand it correctly, then it is actually assembling the binary code for
> the PPC in memory and then executing it by jumping into that created
> code.
Yes, that is what it does.
> After thinking about it, I found 2 reasons which can keep me from
> doing something like that for i386 architecture:
>
> (1) When I saw the POV VM instruction set it immediately reminded me
> somehow on the PPC instruction set or a similar RISC instruction set
> with a number of general-purpose registers etc. (I do not know the PPC
> instructions in detail but this opinion was based on the feeling I had
> of it from reading the computer magazines and from my experiences with
> other RISCs.)
> So, compiling this code into PPC code turns out to be pretty
> straight-forward. In contrast, the i387 does not seem to have
> these general purpose registers but instead it uses a register stack with
> IMO 8 registers and there is a top-of-stack pointer and so on.
> Furthermore, I am by far no expert in i386/7 assembly and I do not want
> to hack tons of error-prone code to perform correct translation of
> POV-VM-ASM into i387-ASM. [r0 should be top of stack...]
Well, it would fit into the eight available stack slots, but it is not as
trivial as doing it with (at least) eight real registers.
> (2) GCC does a decent work in optimizing. The POV VM compiler produces
> assembly which IMO has plenty of (seemingly?) pointless register moves.
Yes, it does generate plenty of them, which is an artifact of compiling
directly from an expression tree into the final instruction set without
intermediate code.
The good thing is that most of these redundant moves can be removed
without too much work using peephole optimisation. However, it turns out
that this produces hardly any performance gain (neither for VM nor for
JIT code) but does make the compiling/assembling much more complicated.
What you get is below the measurement error - about two percent raw
function performance, which translates into at most 0.5% (18 seconds per
hour) speed improvement in a scene making heavy use of isosurfaces.
> Compiling this assembly directly into i387 code would probably not give
> as good runtime performance as asking GCC would.
> And I do not want to implement an optimizer especially since there is
> really little chance to get better than GCC if we're only 1 or 2 people...
The trick is that having the code inline reduces call overhead, which
accounts for about 10% to 15% for the average isosurface function. Thus,
while you perhaps gain 10% function speed by using gcc (thus 2.5% for the
scene), the call overhead has to be really low to make a difference.
Reaching this with dynamic linking is not as easy as with truly inline
compiled code. So most likely the total difference is close to zero.
The main reason for this is that unlike integer instructions, changing
floating-point operation order tends to also change precision, which in turn
will quickly result in the compiled representation not being equivalent. As
the basic principle of compiler construction is to generate equivalent code,
compilers either perform very few optimisations at all, or, like newer
compilers do, allow disabling the strict equivalency requirement for
floating-point operations.
The neat thing is that with isosurfaces, precision in the range compilers
have to preserve is only of secondary importance. Thus the function
compiler already performs many of the possible optimisations, and when
compiling, all that is left are relatively few storage instructions that
can be optimised away. As I already pointed out, measuring (on Macs, with
VM and JIT compiler) revealed that those hardly reduce performance.
> Maybe it would be easier to translate the POV-VM-ASM into SSE2 instructions.
Yes, that would be much easier than targeting an i387-style FPU.
> But 2 reasons suggest against that:
> (1) Not yet widely available.
Well, every processor sold at Aldi today offers it, doesn't it? ;-)
> (2) My box only has SSE1 and hence I could not test it.
I can see how this would be a problem...
Thorsten
____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde
Visit POV-Ray on the web: http://mac.povray.org
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 6 Aug 2004 12:34:33
Message: <4113b318@news.povray.org>
Thorsten Froehlich wrote:
> Well, it would fit into the eight available stack places. It is not as
> trivial as doing it with (at least) eight real registers.
>
Yes. Especially since there are also some restrictions -- e.g. several
operations can only be performed on the top-of-stack element.
It's like the POV VM calling maths functions only on r0 (IIRC).
> The good thing is that most of these redundant moves can be removed
> without too much work using peephole optimisation.
>
Correct. After sending my last posting, I played with the idea
of actually implementing a peephole optimization step...
> However, it turns out that
> this produces hardly any performance gain (for neither VM nor JIT code)
>
...but I see that you already tried that.
(Good that I had not yet begun with it.)
> but does
> make the compiling/assembling much more complicated. What you get is
> below the measurement error - about two percent raw function performance,
> which translates into at most 0.5% (18 seconds per hour) speed improvement
> in a heavy isosurface-using scene.
>
Which clearly is not worth the effort.
> The trick is that having the code inline reduces call overhead, which
> accounts for about 10% to 15% for the average isosurface function.
>
This is correct. Actually, it seems the dynamically linked function call
overhead is the only performance disadvantage of my approach.
I actually don't know exactly where the CPU cycles get spent:
(*int_func)(bar); // <-- internal function
(*ext_func)(bar); // <-- external function (dynamically linked)
In C code the two look completely identical, but the second one has
at least 10 times more call overhead.
> Thus,
> while you perhaps gain 10% function speed by using gcc (thus 2.5% for the
> scene), the call overhead has to be really low to make a difference. To
> reach this with dynamic linking is not as easy as with truly inline
> compiled
> code. So most likely to total difference is close to zero.
>
Well, it's probably more like this: the more complex a single function,
the more you gain. For trivial functions, the "gain" may even be negative
(i.e. a loss).
> The main reason for this is that unlike integer instructions, changing
> floating-point operation order tends to also change precision, which in
> turn
> will quickly result in the compiled representation not being equivalent.
>
Well, since the i386 internally has 80-bit FP registers, the accuracy
of the compiled version can be expected to be at least as good as
that of the interpreted version. But of course that does not guarantee
equivalent images. OTOH, all that numerics is more or less a
trade-off between accuracy and runtime. And scenes should not
depend on the last couple of bits anyway, because then they would
look completely different when built with different compilers or on
different architectures. - You already mentioned something similar
further down.
> As the basic principle of compiler construction is to generate equivalent
> code, compilers either perform very few optimisations at all, or, like
> newer compilers do, allow disabling the strict equivalency requirement for
> floating-point operations.
>
I'm not sure what GCC really does. I'm compiling with -ffast-math, which
allows some IEEE violations, but I'm not sure whether it does lots of
"dangerous" things. At least I verified that the register moves are
optimized away.
>> But 2 reasons suggest against that [SSE2]:
>> (1) Not yet widely available.
>
> Well, every processor sold at Aldi today offers it, doesn't it? ;-)
>
IMHO it depends on whether they are currently selling AthlonXP or
PentiumIV CPUs... [At least my 1.47GHz AthlonXP does not have SSE2.]
Wolfgang
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 7 Aug 2004 06:18:49
Message: <4114ac88@news.povray.org>
Wolfgang Wieser wrote:
>> The trick is that having the code inline reduces call overhead, which
>> accounts for about 10% to 15% for the average isosurface function.
>>
> This is correct. Actually, it seems the dynamically linked function call
> overhead is the only disadvantage of my approach concerning performance.
> I actually don't know exactly what the CPU cycles get spent on:
>
> (*int_func)(bar); // <-- internal function
> (*ext_func)(bar); // <-- external function (dynamically linked)
>
> In C code it looks completely identical but the second one has
> at least 10 times more call overhead.
>
This is wrong. I tricked myself: in the above program, int_func was
not a pointer to a function but the function itself.
More measurements show that the overhead is NOT due to external shared
object "linkage" but due to the function being called via a function
pointer - regardless of whether the function lives in the primary code
or in a shared object.
Wolfgang
> More measurements show that the overhead is NOT due to
> external shared object "linkage" but due to the function being called via
> a function pointer - not depending on whether the function is in the
> primary code or in a shared object.
Interesting. But also surprising (to me).
Could you explain why it takes an order of magnitude longer to
jump to a function via a pointer than via a direct reference?
(Note: I'm not very knowledgeable in low-level programming - I have
just a tiny idea of some assembly instructions.)
- NC
From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 11:13:09
Message: <41164305@news.povray.org>
In article <41163f3c$1@news.povray.org> , Nicolas Calimet
<pov### [at] freefr> wrote:
> Interesting. But also surprising (to me).
> Could you explain why it takes an order of magnitude longer to
> jump to a function via a pointer as compared to a direct reference ?
> (note: I'm not very knowledgeable in low-level programming, just a
> tiny idea of some assembly instructions).
It should not at all.
Thorsten
____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde
Visit POV-Ray on the web: http://mac.povray.org