Thorsten Froehlich wrote:
> In article <41163f3c$1@news.povray.org> , Nicolas Calimet
> <pov### [at] freefr> wrote:
>> Interesting. But also surprising (to me).
>> Could you explain why it takes an order of magnitude longer to
>> jump to a function via a pointer as compared to a direct reference ?
>> (note: I'm not very knowledgeable in low-level programming, just a
>> tiny idea of some assembly instructions).
>
> It should not at all.
>
Hmm... After this question, I looked further into the issue.
First of all, quoting the GCC info:
---------------------------------------------------------------------
Note that you will still be paying the penalty for the call through a
function pointer; on most modern architectures, such a call defeats the
branch prediction features of the CPU. This is also true of normal
virtual function calls.
---------------------------------------------------------------------
But this cannot account for the huge difference I measured.
And actually, my second posting on the issue must be considered partly wrong
as well. Because it turns out that GCC will now also inline functions which
are declared extern _and_ appear further down in the code than the
calling location -- even when marked with __attribute__((noinline)) !!
[GCC 3.4.2 20040724 (prerelease); Seems I need to file a bug report...]
And since I did not verify that all three of these precautions actually
prevented the compiler from inlining the code, what I really measured was
the time difference between an extern call and an inlined call, which
clearly yields a difference in speed.
Okay, so let's do some really clean benchmarks this time - finally.
Oh dear. Could anybody maybe run some independent tests on this
issue? Because I will now tell you that calling an external function in
an external library is actually _faster_ than calling it directly in the
code when certain compiler flags are used.
I attached my test code for review.
So here are the timings:
Function call      | OPT1  | OPT2
-------------------+-------+-------
int_foo(44.0);     | 3.95s | 3.58s
(*int_fooP)(44.0); | 3.57s | 3.46s
(*ext_fooP)(44.0); | 3.57s | 4.13s
-none-             | 0.37s | 0.37s
OPT1 = -ffast-math -O2 -fno-rtti
OPT2 = -ffast-math -O2 -fno-rtti -march=athlon-xp
All these values were measured repeatedly and are reproducible to within
+-1 in the last digit given, so the differences are significant.
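For anybody who wants to repeat this kind of measurement independently,
here is a minimal sketch of an in-process timing loop, assuming a C++11
compiler. It is not the attached test program: identity() and
time_indirect_calls() are hypothetical names, and the volatile pointer and
sink are only there so that the optimizer cannot inline or drop the calls.

```cpp
#include <chrono>

// Hypothetical stand-in for int_foo()/ext_foo(): returns its argument.
double identity(double x)
{
    return x;
}

// Calls fn(44.0) `iterations` times through a volatile function pointer
// and returns the elapsed wall-clock time in seconds.
// The last return value is stored in *last so the caller can check it.
double time_indirect_calls(double (*fn)(double), long iterations, double *last)
{
    double (*volatile fp)(double) = fn;  // re-read on every call, not inlinable
    volatile double sink = 0.0;          // keeps the result "used"

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; i++)
        sink = fp(44.0);
    auto stop = std::chrono::steady_clock::now();

    *last = sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Passing a pointer obtained from dlsym() instead of &identity would time the
external call with the same loop, which makes the two cases directly
comparable.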
Hence, I think we can conclude that there is no overhead for a
dynamically linked external library function call.
[At least until somebody proves that something went wrong... :| ]
I also verified the case where the external library is calling back
into the main code: There is no real difference again.
Wolfgang
Here are the generated assembler instructions in all measured
cases:
----------<OPT1>------------<*ext_fooP>---------<OPT2>----------------
.L7:                        | .L7:
  movl $0, (%esp)           |   movl $0, (%esp)
  movl $1078329344, %eax    |   movl $1078329344, 4(%esp)
  movl %eax, 4(%esp)        |   call *%esi
  call *%esi                |   ffreep %st(0)
  fstp %st(0)               |   decl %ebx
  decl %ebx                 |   jns .L7
  jns .L7                   |
----------------------------<*int_fooP>-------------------------------
.L7:                        | .L7:
  movl $0, (%esp)           |   movl $0, (%esp)
  movl $1078329344, %eax    |   movl $1078329344, 4(%esp)
  movl %eax, 4(%esp)        |   call *%esi
  call *%esi                |   ffreep %st(0)
  fstp %st(0)               |   decl %ebx
  decl %ebx                 |   jns .L7
  jns .L7                   |
----------------------------<int_foo()>-------------------------------
.L7:                        | .L7:
  movl $0, (%esp)           |   movl $0, (%esp)
  movl $1078329344, %eax    |   movl $1078329344, 4(%esp)
  movl %eax, 4(%esp)        |   call int_foo
  call int_foo              |   ffreep %st(0)
  fstp %st(0)               |   decl %ebx
  decl %ebx                 |   jns .L7
  jns .L7                   |
-----------------------------<-none->---------------------------------
.L7:                        | .L7:
  decl %eax                 |   decl %eax
  jns .L7                   |   jns .L7
---------------------------------^------------------------------------
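A side note for reading the listings above: the immediate 1078329344 is the
upper 32 bits of the double constant 44.0, and the $0 is its lower 32 bits,
so every variant pushes the same argument. A minimal sketch to check this,
assuming a platform with 64-bit IEEE-754 doubles; the helper names
double_bits(), high_word() and low_word() are made up for illustration:

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Returns the raw bit pattern of a double (assumes IEEE-754 binary64).
std::uint64_t double_bits(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// Upper 32 bits: sign, exponent and the top of the mantissa.
std::uint32_t high_word(double x)
{
    return static_cast<std::uint32_t>(double_bits(x) >> 32);
}

// Lower 32 bits: the remaining mantissa bits.
std::uint32_t low_word(double x)
{
    return static_cast<std::uint32_t>(double_bits(x) & 0xffffffffu);
}
```

high_word(44.0) gives 1078329344 (0x40460000) and low_word(44.0) gives 0,
matching the two movl instructions that set up the argument on the stack.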
Here are the test programs:
---<Makefile>---------------------------------------------------------
MAINFLAGS = -ffast-math -O2 -fno-rtti
LIBFLAGS = -ffast-math -O2 -fno-rtti
#MAINFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp
#LIBFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp

all:
	g++ $(MAINFLAGS) -DMODULE=0 -DMAIN -c dl.cc -o dl.o
	g++ $(MAINFLAGS) -DMODULE=0 -DFOO -c dl.cc -o foo.o
	g++ $(MAINFLAGS) -o test dl.o foo.o -rdynamic -ldl -lm
	g++ $(LIBFLAGS) -nostartfiles -shared -DMODULE=1 dl.cc -o foo.so
	time ./test

asm:
	gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DMAIN -S dl.cc -o dl.S
	gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DFOO -S dl.cc -o foo.S
------------------------------------------------------------------------
---<dl.cc>--------------------------------------------------------------
// dl.cc - Written by Wolfgang Wieser.
#if MODULE==0
//------------------
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

extern "C" double int_foo(double x) __attribute__((noinline));

#ifdef FOO
double int_foo(double x)
{
    //fprintf(stderr,"int_foo\n");
    return(x);
}
#endif // FOO

#ifdef MAIN
int main()
{
    void *hdl=dlopen("./foo.so",RTLD_NOW | RTLD_LOCAL);
    if(!hdl)
    { fprintf(stderr,"dlopen: %s\n",dlerror()); exit(1); }

    dlerror(); // clear any stale error state before dlsym()
    void *sym=dlsym(hdl,"ext_foo");
    const char *err;
    if((err=dlerror()))
    { fprintf(stderr,"dlsym: %s\n",err); exit(1); }

    double (*ext_fooP)(double)=(double (*)(double))sym;
    double (*int_fooP)(double)=&int_foo;

    // These calls make the assembler easier to compare because they
    // prevent the function pointers from getting optimized away as
    // "unneeded variables".
    int_foo(23.0);
    (*ext_fooP)(23.0);
    (*int_fooP)(23.0);

    // Enable exactly one of the three calls per measurement.
    for(int i=0; i<0xfffffff; i++)
    {
        //int_foo(44.0);
        //(*int_fooP)(44.0);
        (*ext_fooP)(44.0);
    }

    return(0);
}
#endif // MAIN

#else // MODULE!=0
//------------------
#include <stdio.h>

extern "C" double ext_foo(double x)
{
    //fprintf(stderr,"ext_foo\n");
    return(x);
}
#endif
------------------------------------------------------------------------
In article <41166817@news.povray.org> , Wolfgang Wieser
<wwi### [at] nospamgmxde> wrote:
> First of all, quoting the GCC info:
> ---------------------------------------------------------------------
> Note that you will still be paying the penalty for the call through a
> function pointer; on most modern architectures, such a call defeats the
> branch prediction features of the CPU. This is also true of normal
> virtual function calls.
That is completely outdated information and wrong on anything available in
the past decade. It implies a completely static branch prediction.
The abstraction penalty is commonly measured with the so-called Stepanov
benchmark. Google will probably find the source code as well as current
measurements on current compilers and systems for it.
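To illustrate what "abstraction penalty" means here, a minimal sketch
follows. It is not the Stepanov benchmark itself, only the same idea in
miniature: the same sum is computed once on raw doubles and once through a
wrapper type, and a good compiler should generate equally fast code for
both. The names Boxed, sum_raw() and sum_boxed() are hypothetical.

```cpp
#include <cstddef>

// Hypothetical wrapper around a double, standing in for an abstraction layer.
struct Boxed
{
    double value;
};

inline Boxed operator+(Boxed a, Boxed b)
{
    return Boxed{a.value + b.value};
}

// Baseline: plain loop over raw doubles.
double sum_raw(const double *data, std::size_t n)
{
    double acc = 0.0;
    for (std::size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

// Same computation, but every value goes through the wrapper type.
double sum_boxed(const double *data, std::size_t n)
{
    Boxed acc{0.0};
    for (std::size_t i = 0; i < n; i++)
        acc = acc + Boxed{data[i]};
    return acc.value;
}
```

Timing both functions over a large array and taking the ratio gives a rough
measure of the penalty for this particular wrapper.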
Thorsten
____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde
Visit POV-Ray on the web: http://mac.povray.org
Thorsten Froehlich wrote:
> In article <41166817@news.povray.org> , Wolfgang Wieser
> <wwi### [at] nospamgmxde> wrote:
>
>> First of all, quoting the GCC info:
>> ---------------------------------------------------------------------
>> Note that you will still be paying the penalty for the call through a
>> function pointer; on most modern architectures, such a call defeats the
>> branch prediction features of the CPU. This is also true of normal
>> virtual function calls.
>
> That is completely outdated information and wrong on anything available in
> the past decade. It implies a completely static branch prediction.
>
Well, actually my measurements in the last posting already showed that
the branch prediction of my AthlonXP seems to have no major problems
with it...
But the rest of the results are far more interesting.
It seems that there actually is no overhead in calling functions inside
a dynamically linked object.
Wolfgang