POV-Ray: Newsgroups: povray.unofficial.patches: [announce] JITC: Really fast POVRay FPU

From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 13:51:21
Message: <41166817@news.povray.org>

Thorsten Froehlich wrote:
> In article <41163f3c$1@news.povray.org> , Nicolas Calimet
> <pov### [at] freefr>  wrote:
>>  Interesting.  But also surprising (to me).
>>  Could you explain why it takes an order of magnitude longer to
>> jump to a function via a pointer as compared to a direct reference ?
>> (note: I'm not very knowledgeable in low-level programming, just a
>> tiny idea of some assembly instructions).
> 
> It should not at all.
> 
Hmm... After this question, I looked further into the issue. 

First of all, quoting the GCC info: 
---------------------------------------------------------------------
Note that you will still be paying the penalty for the call through a
function pointer; on most modern architectures, such a call defeats the
branch prediction features of the CPU.  This is also true of normal
virtual function calls.
---------------------------------------------------------------------

But this cannot account for the huge difference I measured. 
My second posting on the issue must also be considered partly wrong, 
because it turns out that GCC will now also inline functions which 
are declared extern _and_ appear further down in the code than the 
calling location -- even when marked with __attribute__((noinline)) !!
[GCC 3.4.2 20040724 (prerelease); seems I need to file a bug report...]

And since I did not verify that all 3 of these precautions would actually 
prevent the compiler from inlining the code, what I really measured was the 
time difference between an extern and an inlined call, which clearly yields 
a difference in speed. 
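
For the pointer cases, one way to make sure a real call happens on every 
iteration is to route it through a volatile function pointer. The following 
is only a minimal sketch, assuming GCC or Clang; sum_calls_via_pointer() is 
a hypothetical helper that is not part of the test program, and int_foo() 
is re-declared here as a stand-in. It does not cover the direct call, which 
can only be checked by reading the generated assembler. 

```cpp
#include <cassert>

// Stand-in for int_foo() from the test program (assumption: same signature).
extern "C" __attribute__((noinline)) double int_foo(double x) { return x; }

// Hypothetical helper: calls int_foo() n times through a volatile pointer.
// Because the pointer is volatile, the compiler has to reload it on every
// iteration and cannot resolve the target at compile time, so the call
// cannot be inlined or dropped.
double sum_calls_via_pointer(double x, long n)
{
    assert(n >= 0);
    double (*volatile fp)(double) = &int_foo;
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += fp(x);   // indirect call; the result is used
    return sum;
}
```

Accumulating the return value also keeps the calls from being treated as 
dead code. 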

Okay, so let's do some really clean benchmarks this time - finally. 

Oh dear. Could anybody perhaps run some independent tests on this 
issue? Because I will now tell you that calling an external function in 
an external library is actually _faster_ than calling it directly in the 
code when certain compiler flags are used. 
I attached my test code for review. 

So here are the timings: 

Function call      | OPT1  | OPT2
-------------------+-------+-------
int_foo(44.0);     | 3.95s | 3.58s
(*int_fooP)(44.0); | 3.57s | 3.46s
(*ext_fooP)(44.0); | 3.57s | 4.13s
-none-             | 0.37s | 0.37s

OPT1 = -ffast-math -O2 -fno-rtti
OPT2 = -ffast-math -O2 -fno-rtti -march=athlon-xp

All these values were measured repeatedly and are reproducible to within 
+-1 in the last digit shown, so the differences are significant. 
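
To put the table in per-call terms, the empty-loop time can be subtracted 
and the rest divided by the iteration count (0xfffffff in the test program). 
This is just a sketch of that arithmetic; per_call_ns() is a hypothetical 
helper, not part of the test code. 

```cpp
#include <cassert>

// Hypothetical helper: cost of one call in nanoseconds, given the total
// run time, the time of the empty loop ("-none-" row), and the loop count.
double per_call_ns(double total_s, double empty_loop_s, unsigned long iterations)
{
    assert(iterations > 0);
    return (total_s - empty_loop_s) / static_cast<double>(iterations) * 1e9;
}
```

With the OPT1 numbers, per_call_ns(3.57, 0.37, 0xfffffff) comes out at 
roughly 11.9 ns for either pointer call, and per_call_ns(3.95, 0.37, 
0xfffffff) at roughly 13.3 ns for the direct call. 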

Hence, I think we can conclude that there is no overhead for a 
dynamically linked external library function call. 
[At least until somebody proves that something went wrong... :| ]

I also verified the case where the external library is calling back 
into the main code: There is no real difference again. 

Wolfgang

Here are the generated assembler instructions in all measured 
cases: 

----------<OPT1>------------<*ext_fooP>---------<OPT2>----------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    *%esi
    call    *%esi                |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
----------------------------<*int_fooP>-------------------------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    *%esi
    call    *%esi                |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
----------------------------<int_foo()>-------------------------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    int_foo
    call    int_foo              |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
-----------------------------<-none->---------------------------------
.L7:                             |  .L7:
    decl    %eax                 |      decl    %eax
    jns .L7                      |      jns .L7
---------------------------------^------------------------------------

Here are the test programs: 

---<Makefile>---------------------------------------------------------
MAINFLAGS = -ffast-math -O2 -fno-rtti
LIBFLAGS = -ffast-math -O2 -fno-rtti
#MAINFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp
#LIBFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp

all:
    g++ $(MAINFLAGS) -DMODULE=0 -DMAIN -c dl.cc -o dl.o
    g++ $(MAINFLAGS) -DMODULE=0 -DFOO -c dl.cc -o foo.o
    g++ $(MAINFLAGS) -o test dl.o foo.o -rdynamic -ldl -lm
    g++ $(LIBFLAGS) -nostartfiles -shared -DMODULE=1 dl.cc -o foo.so
    time ./test

asm:
    gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DMAIN -S dl.cc -o dl.S
    gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DFOO -S dl.cc -o foo.S
------------------------------------------------------------------------

---<dl.cc>--------------------------------------------------------------
// dl.cc - Written by Wolfgang Wieser. 

#if MODULE==0
//------------------
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

extern "C" double int_foo(double x) __attribute__((noinline));

#ifdef FOO
double int_foo(double x)
{
    //fprintf(stderr,"int_foo\n");
    return(x);
}
#endif  // FOO

#ifdef MAIN
int main()
{
    void *hdl=dlopen("./foo.so",RTLD_NOW | RTLD_LOCAL);
    if(!hdl)
    {  fprintf(stderr,"dlopen: %s\n",dlerror());  exit(1);  }
    
    dlerror();
    void *sym=dlsym(hdl,"ext_foo");
    const char *err;
    if((err=dlerror()))
    {  fprintf(stderr,"dlsym: %s\n",err);  exit(1);  }
    double (*ext_fooP)(double)=(double (*)(double))sym;
    
    double (*int_fooP)(double)=&int_foo;
    
    // These make the assembler easier to compare because it prevents 
    // function pointers from getting optimized away as "unneeded 
    // variables". 
    int_foo(23.0);
    (*ext_fooP)(23.0);
    (*int_fooP)(23.0);
    
    for(int i=0; i<0xfffffff; i++)
    {
        //int_foo(44.0);
        //(*int_fooP)(44.0);
        (*ext_fooP)(44.0);
    }
    
    return(0);
}
#endif  // MAIN

#else  // MODULE!=0
//------------------
#include <stdio.h>

extern "C" double ext_foo(double x)
{
    //fprintf(stderr,"ext_foo\n");
    return(x);
}
#endif
------------------------------------------------------------------------

From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 14:15:43
Message: <41166dcf@news.povray.org>

In article <41166817@news.povray.org> , Wolfgang Wieser 
<wwi### [at] nospamgmxde>  wrote:

> First of all, quoting the GCC info:
> ---------------------------------------------------------------------
> Note that you will still be paying the penalty for the call through a
> function pointer; on most modern architectures, such a call defeats the
> branch prediction features of the CPU.  This is also true of normal
> virtual function calls.

That is completely outdated information and wrong on anything available in
the past decade.  It assumes completely static branch prediction.

The abstraction penalty is commonly measured with the so-called Stepanov
benchmark.  Google will probably find the source code as well as current
measures on current compilers and systems for it.

    Thorsten

____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde

Visit POV-Ray on the web: http://mac.povray.org

From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 15:09:56
Message: <41167a83@news.povray.org>

Thorsten Froehlich wrote:

> In article <41166817@news.povray.org> , Wolfgang Wieser
> <wwi### [at] nospamgmxde>  wrote:
> 
>> First of all, quoting the GCC info:
>> ---------------------------------------------------------------------
>> Note that you will still be paying the penalty for the call through a
>> function pointer; on most modern architectures, such a call defeats the
>> branch prediction features of the CPU.  This is also true of normal
>> virtual function calls.
> 
> That is completely outdated information and wrong on anything available in
> the past decade.  It assumes completely static branch prediction.
> 
Well, actually my measurements in the last posting already showed that 
the branch prediction of my AthlonXP seems to have no major problems 
with it...

But the rest of the results are far more interesting. 
It seems that there actually is no overhead in calling functions inside 
a dynamically linked object.

Wolfgang
