povray.unofficial.patches : [announce] JITC: Really fast POVRay FPU (Message 14 to 23 of 23)
From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 06:26:41
Message: <4110b9e1@news.povray.org>
In article <4110b0e3@news.povray.org> , Wolfgang Wieser 
<wwi### [at] nospamgmxde>  wrote:

> Or, alternatively, point me to a _working_ .sit unpacker for Linux.

<http://www.stuffit.com/cgi-bin/stuffit_loginpage.cgi?stuffitunix>

> The last time I was using the Stuffit Expander, I swore to ban it from
> my HD: First, I unpacked the archive, which took quite long for just
> a few hundred kb. The binaries were okay, but I saw that it had messed up
> the newlines in the text. So I unpacked it again with changed flags,
> which now really took _ages_, and finally the text files were okay but
> the binaries were corrupt. Oh dear...

Yes, the Windows version used to come with a Mac configuration as the
default.  Mac line endings are not very useful on Windows...

    Thorsten

____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde

Visit POV-Ray on the web: http://mac.povray.org



From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 14:27:53
Message: <41112aa8@news.povray.org>
Christoph Hormann wrote:
> I think the work probably would have been better invested in
> implementing an internal JIT compiler using the existing hooks as
> Thorsten explained.  This would work on all x86 systems (and for Mac an
> implementation already exists).
> 
Well, I had a look at the PPC JIT compiler in the Mac version. If I 
understand it correctly, it actually assembles the binary code for 
the PPC in memory and then executes it by jumping into that generated 
code. 

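[For readers who have not seen the technique: the whole idea fits into a 
few lines. The following is a minimal, hypothetical sketch -- not the 
actual Mac implementation -- assuming 32-bit x86 Linux and the cdecl 
calling convention. It "assembles" a trivial double identity(double) 
into a buffer and then jumps into it.]

---<jit_sketch.cc>------------------------------------------------------
// Minimal JIT illustration: assemble machine code in memory, then
// jump into it.  Hypothetical sketch, NOT the Mac implementation;
// assumes 32-bit x86 Linux and the cdecl calling convention.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main()
{
    // Machine code for:  double identity(double x) { return x; }
    //   fldl 4(%esp)   ; load the caller's double argument onto st(0)
    //   ret            ; return; cdecl returns doubles in st(0)
    static const unsigned char code[]={ 0xDD,0x44,0x24,0x04, 0xC3 };
    
    // Get a page which is both writable and executable. 
    void *mem=mmap(0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
        MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
    if(mem==MAP_FAILED)
    {  perror("mmap");  return(1);  }
    memcpy(mem,code,sizeof(code));
    
    // "Jump into the created code" via a function pointer. 
    double (*func)(double)=(double (*)(double))mem;
    printf("%g\n",(*func)(42.0));   // prints 42
    
    munmap(mem,sizeof(code));
    return(0);
}
------------------------------------------------------------------------
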
After thinking about it, I found two reasons that keep me from 
doing something like that for the i386 architecture: 

(1) When I saw the POV VM instruction set, it immediately reminded me 
  somewhat of the PPC instruction set or a similar RISC instruction set 
  with a number of general-purpose registers etc. (I do not know the PPC 
  instructions in detail; this impression was based on the feeling I got 
  from reading computer magazines and from my experience with 
  other RISCs.) 
  So, compiling this code into PPC code turns out to be pretty
  straightforward. In contrast, the i387 does not have 
  these general-purpose registers; instead it uses a register stack with 
  8 registers, a top-of-stack pointer and so on. 
  Furthermore, I am by no means an expert in i386/7 assembly and I do not 
  want to hack tons of error-prone code to perform a correct translation 
  of POV-VM-ASM into i387-ASM. [r0 should be top of stack; a sketch 
  follows after point (2).]

(2) GCC does a decent job of optimizing. The POV VM compiler produces 
  assembly which IMO has plenty of (seemingly?) pointless register moves. 
  (Don't get me wrong, Thorsten: good register allocation is a really 
  tough job.) Take, for example, this part of the paramflower function 
  from my homepage. 
------------------------
        r0 = sqrt(r0);
        r5 = r5 * r0;
        r0 = r5;
        r5 = r6;  <-- completely useless
        r5 = r0;  <-- useless as well
        r0 = r2;
        r0 = sqrt(r0);
        r5 = r5 + r0;
        r6 = r5;
        r0 = POVFPU_Consts[k];
        r5 = r0;  <-- (skip)
        r7 = r5;  <-- why not r7=r0
        r0 = r2;  <-- (skip)
        r5 = r0;  <-- why not r5=r2
        r0 = r5;  <-- hmm?!
        r5 = r5 * r0;
        r5 = r5 * r0;
        r0 = r5;
        r5 = r7;
------------------------
  Compiling this assembly directly into i387 code would probably not give 
  as good runtime performance as asking GCC would. 
  And I do not want to implement an optimizer, especially since there is 
  really little chance of beating GCC if we're only 1 or 2 people...

Maybe it would be easier to translate the POV-VM-ASM into SSE2 instructions. 
But two reasons speak against that: 
(1) Not yet widely available. 
(2) My box only has SSE1 and hence I could not test it. 

Wolfgang



From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 16:28:45
Message: <411146fd@news.povray.org>
In article <41112aa8@news.povray.org> , Wolfgang Wieser 
<wwi### [at] nospamgmxde>  wrote:

> Well, I had a look at the PPC JIT compiler in the Mac version. If I
> understand it correctly, it actually assembles the binary code for
> the PPC in memory and then executes it by jumping into that generated
> code.

Yes, that is what it does.

> After thinking about it, I found two reasons that keep me from
> doing something like that for the i386 architecture:
>
> (1) When I saw the POV VM instruction set, it immediately reminded me
>   somewhat of the PPC instruction set or a similar RISC instruction set
>   with a number of general-purpose registers etc. (I do not know the PPC
>   instructions in detail; this impression was based on the feeling I got
>   from reading computer magazines and from my experience with
>   other RISCs.)
>   So, compiling this code into PPC code turns out to be pretty
>   straightforward. In contrast, the i387 does not have
>   these general-purpose registers; instead it uses a register stack with
>   8 registers, a top-of-stack pointer and so on.
>   Furthermore, I am by no means an expert in i386/7 assembly and I do not
>   want to hack tons of error-prone code to perform a correct translation
>   of POV-VM-ASM into i387-ASM. [r0 should be top of stack...]

Well, it would fit into the eight available stack slots.  It is just not as
trivial as doing it with (at least) eight real registers.

> (2) GCC does a decent job of optimizing. The POV VM compiler produces
>   assembly which IMO has plenty of (seemingly?) pointless register moves.

Yes, it does generate plenty of them, which is an artifact of compiling
directly from an expression tree into the final instruction set without
intermediate code.

The good thing is that most of these redundant moves can be removed without
too much work using peephole optimisation.  However, it turns out that this
produces hardly any performance gain (for either VM or JIT code) but does
make the compiling/assembling much more complicated.  What you get is below
the measurement error - about two percent raw function performance, which
translates into at most 0.5% (18 seconds per hour) speed improvement in a
scene making heavy use of isosurfaces.
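
[As an illustration of such a pass: the toy sketch below works on a
move-only instruction stream.  The representation is invented for
illustration; a real pass would also have to treat arithmetic
instructions as barriers for the registers they touch.]

---<peephole_sketch.cc>-------------------------------------------------
// Toy peephole pass: drop self-moves (rA = rA) and moves that merely
// undo the previous one (rB = rA directly after rA = rB).  Invented
// representation, for illustration only.
#include <stdio.h>

struct Move { int dest,src; };   // dest = src

static int Peephole(const Move *in,int n,Move *out)
{
    int m=0;
    for(int i=0; i<n; i++)
    {
        if(in[i].dest==in[i].src) continue;   // rA = rA is a no-op
        if(m>0 && out[m-1].dest==in[i].src && out[m-1].src==in[i].dest)
            continue;                         // rB = rA after rA = rB
        out[m++]=in[i];
    }
    return(m);
}

int main()
{
    // r5 = r0;  r0 = r5;  r7 = r7;  r6 = r5;
    static const Move prog[]={ {5,0},{0,5},{7,7},{6,5} };
    Move opt[4];
    int n=Peephole(prog,4,opt);
    for(int i=0; i<n; i++)                    // prints: r5 = r0 / r6 = r5
        printf("r%d = r%d\n",opt[i].dest,opt[i].src);
    return(0);
}
------------------------------------------------------------------------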

>   Compiling this assembly directly into i387 code would probably not give
>   as good runtime performance as asking GCC would.
>   And I do not want to implement an optimizer, especially since there is
>   really little chance of beating GCC if we're only 1 or 2 people...

The trick is that having the code inline reduces call overhead, which
accounts for about 10% to 15% for the average isosurface function.  Thus,
while you perhaps gain 10% function speed by using gcc (thus 2.5% for the
scene), the call overhead has to be really low to make a difference.  To
reach this with dynamic linking is not as easy as with truly inline compiled
code.  So most likely the total difference is close to zero.

The main reason for this is that unlike integer instructions, changing
floating-point operation order tends to also change precision, which in turn
will quickly result in the compiled representation not being equivalent.  As
the basic principle of compiler construction is to generate equivalent code,
compilers either perform very few optimisations at all, or, like newer
compilers do, allow disabling the strict equivalency requirement for
floating-point operations.
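
[A tiny self-contained example -- not from the thread -- makes the
reordering problem concrete.  On an i387 the first result additionally
depends on whether the intermediate value stays in an 80-bit register,
which is the same non-equivalence problem in another guise.]

---<fp_order.cc>--------------------------------------------------------
// Floating-point addition is not associative, so reordering changes
// results -- exactly the (non-)equivalence problem described above.
#include <stdio.h>

int main()
{
    volatile double big=1e16,small=1.0;  // volatile: force runtime evaluation

    double a=(big+small)-big;  // 0.0 with strict 64-bit doubles, because
                               // 1e16+1 rounds to 1e16 (double spacing
                               // near 1e16 is 2.0); can be 1.0 if the
                               // intermediate stays in an 80-bit register
    double b=(big-big)+small;  // always 1.0

    printf("a=%g  b=%g\n",a,b);
    return(0);
}
------------------------------------------------------------------------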

The neat thing is that with isosurfaces, the precision compilers are
required to preserve is only of secondary importance.  Thus, the function
compiler already performs many of the possible optimisations, and so all
that is left when compiling are relatively few storage instructions that
can be optimised away.  As I already pointed out, measuring (on Macs with VM
and JIT compiler) revealed that those hardly affect performance.

> Maybe it would be easier to translate the POV-VM-ASM into SSE2 instructions.

Yes, that would be much easier than targeting an i387-style FPU.

> But two reasons speak against that [SSE2]:
> (1) Not yet widely available.

Well, every processor sold at Aldi today offers it, doesn't it? ;-)

> (2) My box only has SSE1 and hence I could not test it.

I can see how this would be a problem...

    Thorsten

____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde

Visit POV-Ray on the web: http://mac.povray.org



From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 6 Aug 2004 12:34:33
Message: <4113b318@news.povray.org>
Thorsten Froehlich wrote:
> Well, it would fit into the eight available stack places.  It is not as
> trivial as doing it with (at least) eight real registers.
> 
Yes. Especially since there are also some restrictions -- e.g. several 
operations can only be performed on the top-of-stack element. 
It's like the POV VM calling maths functions only on r0 (IIRC). 

> The good thing is that most of these redundant moves can be removed
> without too much work using peephole optimisation.  
>
Correct. After having sent the last posting, I played with the idea 
of actually implementing a peephole optimization step...

> However, it turns out that 
> this produces hardly any performance gain (for either VM or JIT code)
>
...but I see that you already tried it. 
(Good that I had not yet begun with it.)

> but does
> make the compiling/assembling much more complicated.  What you get is
> below the measurement error - about two percent raw function performance,
> which translates into at most 0.5% (18 seconds per hour) speed improvement
> in a scene making heavy use of isosurfaces.
> 
Which clearly is not worth the effort. 

> The trick is that having the code inline reduces call overhead, which
> accounts for about 10% to 15% for the average isosurface function.  
>
This is correct. Actually, it seems the dynamically linked function call 
overhead is the only disadvantage of my approach concerning performance. 
I actually don't know exactly what the CPU cycles are spent on: 

  (*int_func)(bar);   // <-- internal function
  (*ext_func)(bar);  // <-- external function (dynamically linked)

In C the two calls look completely identical, but the second one has 
at least 10 times more call overhead. 

> Thus, 
> while you perhaps gain 10% function speed by using gcc (thus 2.5% for the
> scene), the call overhead has to be really low to make a difference.  To
> reach this with dynamic linking is not as easy as with truly inline
> compiled
> code.  So most likely the total difference is close to zero.
> 
Well, it is probably more like this: the more complex a single function, 
the more you gain. For trivial functions, the "gain" may even be negative 
(i.e. a loss). 

> The main reason for this is that unlike integer instructions, changing
> floating-point operation order tends to also change precision, which in
> turn
> will quickly result in the compiled representation not being equivalent. 
>
Well, since the i387 internally has 80-bit FP registers, the accuracy 
of the compiled version can be expected to be at least as good as 
that of the interpreted version. But of course that does not guarantee 
equivalent images. OTOH, all this numerics business is more or less a 
trade-off between accuracy and runtime. And scenes should not 
depend on the last couple of bits anyway, because they would then 
look completely different when different compilers or different 
architectures are used. - You already mentioned something similar further down. 

> As the basic principle of compiler construction is to generate equivalent
> code, compilers either perform very few optimisations at all, or, like
> newer compilers do, allow disabling the strict equivalency requirement for
> floating-point operations.
> 
I'm not sure what GCC really does. I'm compiling with -ffast-math, which 
allows some IEEE violations, but I'm not sure if it does lots of "dangerous" 
things. At least I verified that the register moves are optimized 
away. 

>> But 2 reasons suggest against that [SSE2]:
>> (1) Not yet widely available.
> 
> Well, every processor sold at Aldi today offers it, doesn't it? ;-)
> 
IMHO that depends on whether they are currently selling AthlonXP or PentiumIV 
CPUs... [At least my AthlonXP-1.47GHz does not have SSE2.]

Wolfgang



From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 7 Aug 2004 06:18:49
Message: <4114ac88@news.povray.org>
Wolfgang Wieser wrote:
>> The trick is that having the code inline reduces call overhead, which
>> accounts for about 10% to 15% for the average isosurface function.
>>
> This is correct. Actually, it seems the dynamically linked function call
> overhead is the only disadvantage of my approach concerning performance.
> I actually don't know exactly what the CPU cycles are spent on:
> 
>   (*int_func)(bar);   // <-- internal function
>   (*ext_func)(bar);  // <-- external function (dynamically linked)
> 
> In C the two calls look completely identical, but the second one has
> at least 10 times more call overhead.
> 
This is wrong. I tricked myself: in the above program, int_func was 
not a pointer to a function but the function itself. 

More measurements show that the overhead is NOT due to the 
external shared object "linkage" but due to the function being called via 
a function pointer - regardless of whether the function is in the 
main program or in a shared object. 

Wolfgang



From: Nicolas Calimet
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 10:57:00
Message: <41163f3c$1@news.povray.org>
> More measurements show that the overhead is NOT due to the 
> external shared object "linkage" but due to the function being called via 
> a function pointer - regardless of whether the function is in the 
> main program or in a shared object. 

	Interesting.  But also surprising (to me).
	Could you explain why it takes an order of magnitude longer to
jump to a function via a pointer compared to a direct reference?
(Note: I'm not very knowledgeable in low-level programming; I have just a
tiny idea of some assembly instructions.)

	- NC



From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 11:13:09
Message: <41164305@news.povray.org>
In article <41163f3c$1@news.povray.org> , Nicolas Calimet 
<pov### [at] freefr>  wrote:

>  Interesting.  But also surprising (to me).
>  Could you explain why it takes an order of magnitude longer to
> jump to a function via a pointer compared to a direct reference?
> (Note: I'm not very knowledgeable in low-level programming; I have just a
> tiny idea of some assembly instructions.)

It should not at all.

    Thorsten

____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde

Visit POV-Ray on the web: http://mac.povray.org



From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 13:51:21
Message: <41166817@news.povray.org>
Thorsten Froehlich wrote:
> In article <41163f3c$1@news.povray.org> , Nicolas Calimet
> <pov### [at] freefr>  wrote:
>>  Interesting.  But also surprising (to me).
>>  Could you explain why it takes an order of magnitude longer to
>> jump to a function via a pointer compared to a direct reference?
>> (Note: I'm not very knowledgeable in low-level programming; I have just a
>> tiny idea of some assembly instructions.)
> 
> It should not at all.
> 
Hmm... After this question, I looked further into the issue. 

First of all, quoting the GCC info: 
---------------------------------------------------------------------
Note that you will still be paying the penalty for the call through a
function pointer; on most modern architectures, such a call defeats the
branch prediction features of the CPU.  This is also true of normal
virtual function calls.
---------------------------------------------------------------------

But this cannot account for the huge difference I measured. 
And actually, my second posting on the issue must be considered partly wrong 
as well, because it turns out that GCC will now also inline functions which 
are declared extern _and_ appear further down in the code than the 
calling location -- even when marked with __attribute__((noinline))!
[GCC 3.4.2 20040724 (prerelease); seems I need to file a bug report...]

And since I did not verify that these 3 precautions would successfully 
prevent the compiler from inlining the code, what I actually measured was 
the time difference between an extern and an inlined call, which obviously 
shows up as a difference in speed. 

Okay, so let's do some really clean benchmarks this time - finally. 

Oh dear. Maybe somebody could do some independent tests concerning this 
issue? Because I will now tell you that calling an external function in 
an external library is actually _faster_ than calling it directly in the 
code when certain compiler flags are used. 
I have attached my test code for review. 

So here are the timings: 

Function call      | OPT1  | OPT2
-------------------+-------+-------
int_foo(44.0);     | 3.95s | 3.58s
(*int_fooP)(44.0); | 3.57s | 3.46s
(*ext_fooP)(44.0); | 3.57s | 4.13s
-none-             | 0.37s | 0.37s

OPT1 = -ffast-math -O2 -fno-rtti
OPT2 = -ffast-math -O2 -fno-rtti -march=athlon-xp

All these values have been measured repeatedly and are reproducible to 
+-1 in the last digit given - the differences are significant. 

Hence, I think we can conclude that there is no overhead for a 
dynamically linked external library function call. 
[At least until somebody proves that something went wrong... :| ]

I also verified the case where the external library calls back 
into the main code: again, there is no real difference. 

Wolfgang

Here are the generated assembler instructions in all measured 
cases: 

----------<OPT1>------------<*ext_fooP>---------<OPT2>----------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    *%esi
    call    *%esi                |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
----------------------------<*int_fooP>-------------------------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    *%esi
    call    *%esi                |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
----------------------------<int_foo()>-------------------------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    int_foo
    call    int_foo              |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
-----------------------------<-none->---------------------------------
.L7:                             |  .L7:
    decl    %eax                 |      decl    %eax
    jns .L7                      |      jns .L7
---------------------------------^------------------------------------

Here are the test programs: 

---<Makefile>---------------------------------------------------------
MAINFLAGS = -ffast-math -O2 -fno-rtti
LIBFLAGS = -ffast-math -O2 -fno-rtti
#MAINFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp
#LIBFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp

all:
    g++ $(MAINFLAGS) -DMODULE=0 -DMAIN -c dl.cc -o dl.o
    g++ $(MAINFLAGS) -DMODULE=0 -DFOO -c dl.cc -o foo.o
    g++ $(MAINFLAGS) -o test dl.o foo.o -rdynamic -ldl -lm
    g++ $(LIBFLAGS) -nostartfiles -shared -DMODULE=1 dl.cc -o foo.so
    time ./test

asm:
    gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DMAIN -S dl.cc -o dl.S
    gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DFOO -S dl.cc -o foo.S
------------------------------------------------------------------------

---<dl.cc>--------------------------------------------------------------
// dl.cc - Written by Wolfgang Wieser. 

#if MODULE==0
//------------------
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

extern "C" double int_foo(double x) __attribute__((noinline));

#ifdef FOO
double int_foo(double x)
{
    //fprintf(stderr,"int_foo\n");
    return(x);
}
#endif  // FOO

#ifdef MAIN
int main()
{
    void *hdl=dlopen("./foo.so",RTLD_NOW | RTLD_LOCAL);
    if(!hdl)
    {  fprintf(stderr,"dlopen: %s\n",dlerror());  exit(1);  }
    
    dlerror();
    void *sym=dlsym(hdl,"ext_foo");
    const char *err;
    if((err=dlerror()))
    {  fprintf(stderr,"dlsym: %s\n",err);  exit(1);  }
    double (*ext_fooP)(double)=(double (*)(double))sym;
    
    double (*int_fooP)(double)=&int_foo;
    
    // These calls make the assembler output easier to compare because 
    // they prevent the function pointers from being optimized away as 
    // "unneeded variables". 
    int_foo(23.0);
    (*ext_fooP)(23.0);
    (*int_fooP)(23.0);
    
    for(int i=0; i<0xfffffff; i++)
    {
        //int_foo(44.0);
        //(*int_fooP)(44.0);
        (*ext_fooP)(44.0);
    }
    
    return(0);
}
#endif  // MAIN

#else  // MODULE!=0
//------------------
#include <stdio.h>

extern "C" double ext_foo(double x)
{
    //fprintf(stderr,"ext_foo\n");
    return(x);
}
#endif
------------------------------------------------------------------------



From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 14:15:43
Message: <41166dcf@news.povray.org>
In article <41166817@news.povray.org> , Wolfgang Wieser 
<wwi### [at] nospamgmxde>  wrote:

> First of all, quoting the GCC info:
> ---------------------------------------------------------------------
> Note that you will still be paying the penalty for the call through a
> function pointer; on most modern architectures, such a call defeats the
> branch prediction features of the CPU.  This is also true of normal
> virtual function calls.

That is completely outdated information and wrong for anything available in
the past decade.  It assumes completely static branch prediction.

The abstraction penalty is commonly measured with the so-called Stepanov
benchmark.  Google will probably find the source code for it, as well as
current measurements on current compilers and systems.

    Thorsten

____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde

Visit POV-Ray on the web: http://mac.povray.org



From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 15:09:56
Message: <41167a83@news.povray.org>
Thorsten Froehlich wrote:

> In article <41166817@news.povray.org> , Wolfgang Wieser
> <wwi### [at] nospamgmxde>  wrote:
> 
>> First of all, quoting the GCC info:
>> ---------------------------------------------------------------------
>> Note that you will still be paying the penalty for the call through a
>> function pointer; on most modern architectures, such a call defeats the
>> branch prediction features of the CPU.  This is also true of normal
>> virtual function calls.
> 
> That is completely outdated information and wrong for anything available in
> the past decade.  It assumes completely static branch prediction.
> 
Well, actually my measurements in the last posting already showed that 
the branch prediction of my AthlonXP seems to have no major problems 
with it...

But the rest of the results are far more interesting. 
It seems that there actually is no overhead in calling functions inside 
a dynamically linked object.

Wolfgang


