From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 05:48:20
Message: <4110b0e3@news.povray.org>
Thorsten Froehlich wrote:
> In article <41101157$1@news.povray.org> , Nicolas Calimet
> <pov### [at] freefr> wrote:
>
>>> You are aware that the official POV-Ray for Mac OS does include a
>>> just-in-time compiler, aren't you?
>>
>> It's very likely Wolfgang never tried the Mac OS version ;-)
>
> That is true, but using the core code SYS_FUNCTIONS macro family would
> have made integration of a JIT compiler possible without hacking
> fnpovfpu.cpp. In particular because they are documented to exist exactly
> for that purpose.
>
Well, "they are documented to..." is a bit of an exaggeration.
All I found in the sources about these macros was a couple of lines
describing just what I already knew from reading the code, and one
"cryptic" allusion that they can somehow be used to allow JIT
compilation. And that sounded to me like "we could use that for JIT
some time, if we ever find the time to do so"...
> Would at least have made the whole implementation
> easier...
>
Okay... could somebody please make the POVRay-for-MacOS
source code available in a file format which I can read?
Or, alternatively, point me to a _working_ .sit unpacker for Linux.
The last time I used the StuffIt expander, I swore to ban it from
my HD: first I unpacked the archive, which took quite long for just
a few hundred kB. The binaries were okay, but it had messed up
the newlines in the text files. So I unpacked it again with changed
flags, which really took _ages_ this time, and now the text files were
okay but the binaries were corrupt. Oh dear...
> Certainly! On the other hand, calling gcc is overkill to make JIT
> compilation possible :-)
>
I wonder how the Mac version does it...
Wolfgang
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 05:59:25
Message: <4110b37c@news.povray.org>
ABX wrote:
> On Wed, 04 Aug 2004 11:28:22 +0200, Wolfgang Wieser
> <wwi### [at] nospamgmxde> wrote:
>> Christoph Hormann wrote:
>> > I think the work probably would have been better invested in
>> > implementing an internal JIT compiler using the existing hooks as
>> > Thorsten explained. This would work on all x86 systems (and for Mac an
>> > implementation already exists).
>>
>> Sounded interesting. I'll have a look at that.
>> Why didn't anybody put that on a publically available todo list? ;)
>
> I have considered this task for future versions of MegaPOV, though I
> would be veeeeery happy if you could be faster :-)
>
IMO, in its present state, this patch is not suitable for inclusion in
MegaPOV. I need to get rid of some unclean "hacks" first.
Or, maybe better, use the existing SYS_ macros if I manage to figure out
how to use them.
BTW, will the features from MLPov be included in the next MegaPOV version?
Wolfgang
On Wed, 04 Aug 2004 11:58:19 +0200, Wolfgang Wieser <wwi### [at] nospamgmxde>
wrote:
> BTW, will the features from MLPov be included in the next MegaPOV version?
Well... we try to be trendy ;-)
ABX
From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 06:26:41
Message: <4110b9e1@news.povray.org>
In article <4110b0e3@news.povray.org> , Wolfgang Wieser
<wwi### [at] nospamgmxde> wrote:
> Or, alternatively, point me to a _working_ .sit unpacker for Linux.
<http://www.stuffit.com/cgi-bin/stuffit_loginpage.cgi?stuffitunix>
> The last time I was using the stuffit expander, I sweared to ban it from
> my HD: First, I unpacked the archive which took quite long for just
> some hundret kb. The binaries were okay but I saw that it messed up
> the newlines in the text. So, I un-packed it again with changed flags
> which really took _ages_ now and finally the text files were okay but
> the binaries were corrupt. Oh dear...
Yes, the Windows version used to come with a Mac configuration as default.
Not very useful to get Mac line endings on Windows...
Thorsten
____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde
Visit POV-Ray on the web: http://mac.povray.org
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 14:27:53
Message: <41112aa8@news.povray.org>
Christoph Hormann wrote:
> I think the work probably would have been better invested in
> implementing an internal JIT compiler using the existing hooks as
> Thorsten explained. This would work on all x86 systems (and for Mac an
> implementation already exists).
>
Well, I had a look at the PPC JIT compiler in the Mac version. If I
understand it correctly, it actually assembles the binary code for
the PPC in memory and then executes it by jumping into that generated
code.
After thinking about it, I found two reasons that keep me from
doing something like that for the i386 architecture:
(1) When I saw the POV VM instruction set, it immediately reminded me
of the PPC instruction set, or a similar RISC instruction set
with a number of general-purpose registers etc. (I do not know the PPC
instructions in detail; this impression is based on what I picked up
from computer magazines and on my experience with other RISCs.)
So compiling this code into PPC code turns out to be pretty
straightforward. In contrast, the i387 does not have these
general-purpose registers; instead it uses a register stack of
eight registers with a top-of-stack pointer and so on.
Furthermore, I am by far no expert in i386/387 assembly, and I do not
want to hack tons of error-prone code to perform a correct translation
of POV-VM-ASM into i387-ASM. [r0 would have to be top of stack...]
(2) GCC does a decent job of optimizing. The POV VM compiler produces
assembly which IMO has plenty of (seemingly?) pointless register moves.
(Don't get me wrong, Thorsten: good register allocation is a really
tough job.) Take for example this part of the paramflower on my homepage.
------------------------
r0 = sqrt(r0);
r5 = r5 * r0;
r0 = r5;
r5 = r6; <-- completely useless
r5 = r0; <-- useless as well
r0 = r2;
r0 = sqrt(r0);
r5 = r5 + r0;
r6 = r5;
r0 = POVFPU_Consts[k];
r5 = r0; <-- (skip)
r7 = r5; <-- why not r7=r0
r0 = r2; <-- (skip)
r5 = r0; <-- why not r5=r2
r0 = r5; <-- hmm?!
r5 = r5 * r0;
r5 = r5 * r0;
r0 = r5;
r5 = r7;
------------------------
Compiling this assembly directly into i387 code would probably not give
as good runtime performance as asking GCC would.
And I do not want to implement an optimizer, especially since there is
really little chance of beating GCC if we're only one or two people...
Maybe it would be easier to translate the POV-VM-ASM into SSE2
instructions, but two reasons speak against that:
(1) SSE2 is not yet widely available.
(2) My box only has SSE1, so I could not test it.
Wolfgang
From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 4 Aug 2004 16:28:45
Message: <411146fd@news.povray.org>
In article <41112aa8@news.povray.org> , Wolfgang Wieser
<wwi### [at] nospamgmxde> wrote:
> Well, I had a look at the PPC JIT Compiler in the Mac version. If I
> understand it correctly, then it is actually assembling the binary code for
> the PPC in memory and then executing it by jumping into that created
> code.
Yes, that is what it does.
> After thinking about it, I found 2 reasons which can keep me from
> doing something like that for i386 architecture:
>
> (1) When I saw the POV VM instruction set it immediately reminded me
> somehow on the PPC instruction set or a similar RISC instruction set
> with a number of general-purpose registers etc. (I do not know the PPC
> instructions in detail but this opinion was based on the feeling I had
> of it from reading the computer magazines and from my experiences with
> other RISCs.)
> So, compiling this code into PPC code turns out to be pretty
> straight-forward. In contrast, the i387 does not seem to have
> these general purpose registers but instead it uses a register stack with
> IMO 8 registers and there is a top-of-stack pointer and so on.
> Furthermore, I am by far no expert in i386/7 assembly and I do not want
> to hack tons of error-prone code to perform correct translation of
> POV-VM-ASM into i387-ASM. [r0 should be top of stack...]
Well, it would fit into the eight available stack slots, but it is not as
trivial as doing it with (at least) eight real registers.
> (2) GCC does a decent work in optimizing. The POV VM compiler produces
> assembly which IMO has plenty of (seemingly?) pointless register moves.
Yes, it does generate plenty of them, which is an artifact of compiling
directly from an expression tree into the final instruction set without
intermediate code.
The good thing is that most of these redundant moves can be removed
without too much work using peephole optimisation. However, it turns out
that this produces hardly any performance gain (neither for VM nor for
JIT code) but does make the compiling/assembling much more complicated.
What you get is below the measurement error - about two percent raw
function performance, which translates into at most 0.5% (18 seconds per
hour) speed improvement in a scene making heavy use of isosurfaces.
> Compiling this assembly directly into i387 code would probably not give
> as good runtime performance as asking GCC would.
> And I do not want to implement an optimizer especially since there is
> really little chance to get better than GCC if we're only 1 or 2 people...
The trick is that having the code inline reduces call overhead, which
accounts for about 10% to 15% for the average isosurface function. Thus,
while you perhaps gain 10% function speed by using gcc (thus 2.5% for the
scene), the call overhead has to be really low to make a difference.
Reaching this with dynamic linking is not as easy as with truly inline
compiled code. So most likely the total difference is close to zero.
The main reason for this is that unlike integer instructions, changing
floating-point operation order tends to also change precision, which in turn
will quickly result in the compiled representation not being equivalent. As
the basic principle of compiler construction is to generate equivalent code,
compilers either perform very few optimisations at all, or, like newer
compilers do, allow disabling the strict equivalency requirement for
floating-point operations.
The neat thing is that with isosurfaces, precision in the range compilers
have to preserve is only of secondary importance. Thus the function
compiler already performs many of the possible optimisations, and when
compiling, all that is left are relatively few storage instructions that
can be optimised away. As I already pointed out, measuring (on Macs, with
VM and JIT compiler) revealed that those hardly reduce performance.
> Maybe it would be easier to translate the POV-VM-ASM into SSE2 instructions.
Yes, that would be much easier than targeting an i387-style FPU.
> But 2 reasons suggest against that:
> (1) Not yet widely available.
Well, every processor sold at Aldi today offers it, doesn't it? ;-)
> (2) My box only has SSE1 and hence I could not test it.
I can see how this would be a problem...
Thorsten
____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde
Visit POV-Ray on the web: http://mac.povray.org
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 6 Aug 2004 12:34:33
Message: <4113b318@news.povray.org>
Thorsten Froehlich wrote:
> Well, it would fit into the eight available stack places. It is not as
> trivial as doing it with (at least) eight real registers.
>
Yes. Especially since there are also some restrictions -- e.g. several
operations can only be performed on the top-of-stack element.
It's like the POV VM calling maths functions only on r0 (IIRC).
> The good thing is that most of these redundant moves can be removed
> without too much work using peephole optimisation.
>
Correct. After sending my last posting, I played with the idea
of actually implementing a peephole optimization step...
> However, it turns out that
> this produces hardly any performance gain (for neither VM nor JIT code)
>
...but I see that you already tried that.
(Good that I had not yet begun with it.)
> but does
> make the compiling/assembling much more complicated. What you get is
> below the measurement error - about two percent raw function performance,
> which translates into at most 0.5% (18 seconds per hour) speed improvement
> in a heavy isosurface-using scene.
>
Which clearly is not worth the effort.
> The trick is that having the code inline reduces call overhead, which
> accounts for about 10% to 15% for the average isosurface function.
>
This is correct. Actually, it seems the dynamically linked function call
overhead is the only performance disadvantage of my approach.
I actually don't know exactly where the CPU cycles get spent:
(*int_func)(bar); // <-- internal function
(*ext_func)(bar); // <-- external function (dynamically linked)
In C code the two look completely identical, but the second one has
at least 10 times more call overhead.
> Thus,
> while you perhaps gain 10% function speed by using gcc (thus 2.5% for the
> scene), the call overhead has to be really low to make a difference. To
> reach this with dynamic linking is not as easy as with truly inline
> compiled
> code. So most likely to total difference is close to zero.
>
Well, it's probably more like this: the more complex a single function,
the more you gain. For trivial functions, the "gain" may even be negative
(i.e. a loss).
> The main reason for this is that unlike integer instructions, changing
> floating-point operation order tends to also change precision, which in
> turn
> will quickly result in the compiled representation not being equivalent.
>
Well, since the i386 internally has 80-bit FP registers, the accuracy
of the compiled version can be expected to be at least as good as
that of the interpreted version. But of course that does not guarantee
equivalent images. OTOH, all that numerics is more or less a
trade-off between accuracy and runtime. And scenes should not
depend on the last couple of bits anyway, because then they would
look completely different when built with different compilers or on
different architectures. - You already mentioned something similar
further down.
> As the basic principle of compiler construction is to generate equivalent
> code, compilers either perform very few optimisations at all, or, like
> newer compilers do, allow disabling the strict equivalency requirement for
> floating-point operations.
>
I'm not sure what GCC really does. I'm compiling with -ffast-math, which
allows some IEEE violations, but I'm not sure whether it does lots of
"dangerous" things. At least I verified that the register moves are
optimized away.
>> But 2 reasons suggest against that [SSE2]:
>> (1) Not yet widely available.
>
> Well, every processor sold at Aldi today offers it, doesn't it? ;-)
>
IMHO it depends on whether they are currently selling AthlonXP or
PentiumIV CPUs... [At least my 1.47GHz AthlonXP does not have SSE2.]
Wolfgang
From: Wolfgang Wieser
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 7 Aug 2004 06:18:49
Message: <4114ac88@news.povray.org>
Wolfgang Wieser wrote:
>> The trick is that having the code inline reduces call overhead, which
>> accounts for about 10% to 15% for the average isosurface function.
>>
> This is correct. Actually, it seems the dynamically linked function call
> overhead is the only disadvantage of my approach concerning performance.
> I actually don't know exactly what the CPU cycles get spent on:
>
> (*int_func)(bar); // <-- internal function
> (*ext_func)(bar); // <-- external function (dynamically linked)
>
> In C code it looks completely identical but the second one has
> at least 10 times more call overhead.
>
This is wrong. I tricked myself: in the above program, int_func was
not a pointer to a function but the function itself.
More measurements show that the overhead is NOT due to external shared
object "linkage" but due to the function being called via a function
pointer - regardless of whether the function lives in the primary code
or in a shared object.
Wolfgang
> More measurements show that the overhead is NOT due to
> external shared object "linkage" but due to the function being called via
> a function pointer - not depending on whether the function is in the
> primary code or in a shared object.
Interesting. But also surprising (to me).
Could you explain why it takes an order of magnitude longer to
jump to a function via a pointer than via a direct reference?
(Note: I'm not very knowledgeable in low-level programming - I have
just a tiny idea of some assembly instructions.)
- NC
From: Thorsten Froehlich
Subject: Re: [announce] JITC: Really fast POVRay FPU
Date: 8 Aug 2004 11:13:09
Message: <41164305@news.povray.org>
In article <41163f3c$1@news.povray.org> , Nicolas Calimet
<pov### [at] freefr> wrote:
> Interesting. But also surprising (to me).
> Could you explain why it takes an order of magnitude longer to
> jump to a function via a pointer as compared to a direct reference ?
> (note: I'm not very knowledgeable in low-level programming, just a
> tiny idea of some assembly instructions).
It should not at all.
Thorsten
____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde
Visit POV-Ray on the web: http://mac.povray.org