POV-Ray: Newsgroups: povray.beta-test: Radiosity Status: Giving Up...

From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 07:24:52
Message: <495b6494@news.povray.org>

Thorsten Froehlich <tho### [at] trfde> wrote:
> clipka wrote:
> > Don't expect all these to be "naive hardware implementation" in the same sense
> > as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever.

> Exactly that is why you ought to be looking at the SSE2/3 floating-point 
> registers and associated hardware support. The x87 FPU is only there for 
> legacy support and rather inefficient.

  Unless I'm mistaken, the way SSE works, it's a bit difficult to use it
from portable C/C++ directly.

  Any decent C/C++ compiler will be directly able to use the FPU with any
floating point arithmetic you have written in your code. (Of course there's
a lot of optimization that can be done in order to use the FPU stack more
efficiently, but in principle making C/C++ code use the FPU is a relatively
trivial thing for a compiler to do.)

  However, using SSE efficiently is a lot more complicated. Unless I'm
mistaken, unlike with the FPU, there's no direct C/C++ FP -> SSE algorithm
which would work well in all cases. Modern compilers do already have SSE
optimization support, but AFAIK it's rather limited.

  If you are only going to use SSE as a direct substitute for the FPU,
I assume that would be possible, but you probably won't get any significant
speed benefit (perhaps even the contrary). In order to truly benefit
from SSE, you need to vectorize the calculations so that you can calculate
many things in parallel. This is extremely hard, if not impossible, for a
compiler to do with arbitrary C/C++ code.

  And even if you knew perfectly what you were doing and how you want your
code to be SSE-vectorized, there's no portable way of expressing that. The
only thing you can do is to try to write code in such way that some compiler
might be able to SSE-vectorize it when enough optimizations have been turned
on. And even then there's only so much you can do.
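
  As a rough illustration, here is a minimal sketch of the difference. The
function names are made up for this example and are not POV-Ray code. The
first loop has independent iterations over contiguous arrays, which is the
kind of pattern an auto-vectorizer may turn into SSE code when optimizations
are enabled. The second carries a dependency from one iteration to the next,
which typically blocks straightforward vectorization. Whether either one
actually gets vectorized depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: y[i] += a * x[i].
// Every iteration is independent and memory access is unit-stride,
// so a compiler is free to process several elements per instruction.
void scaled_add(double a, const std::vector<double>& x, std::vector<double>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double* xp = x.data();
    double* yp = y.data();
    for (std::size_t i = 0; i < n; ++i)
        yp[i] += a * xp[i];
}

// Hypothetical counter-example: an in-place running sum.
// Iteration i needs the result of iteration i - 1, so the elements
// cannot simply be computed side by side.
void running_sum(std::vector<double>& y)
{
    for (std::size_t i = 1; i < y.size(); ++i)
        y[i] += y[i - 1];
}
```

  Both functions are plain portable C++. Nothing in the source says "use SSE
here", which is exactly the limitation described above.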

-- 
                                                          - Warp

From: Thorsten Froehlich
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 08:06:48
Message: <495b6e68@news.povray.org>

Warp wrote:
> Thorsten Froehlich <tho### [at] trfde> wrote:
>> clipka wrote:
>>> Don't expect all these to be "naive hardware implementation" in the same sense
>>> as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever.
> 
>> Exactly that is why you ought to be looking at the SSE2/3 floating-point 
>> registers and associated hardware support. The x87 FPU is only there for 
>> legacy support and rather inefficient.
> 
>   Unless I'm mistaken, the way SSE works, it's a bit difficult to use it
> from portable C/C++ directly.

You are mistaken. All modern x86 compilers (gcc,icc,vc) can use it instead 
of the x87 FPU. I think since versions 3.2, 8, and 7.1 respectively.

	Thorsten

From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 08:11:04
Message: <495b6f68@news.povray.org>

Thorsten Froehlich <tho### [at] trfde> wrote:
> You are mistaken. All modern x86 compilers (gcc,icc,vc) can use it instead 
> of the x87 FPU. I think since versions 3.2, 8, and 7.1 respectively.

  "Can use" doesn't really tell how effectively they can use it.

-- 
                                                          - Warp

From: clipka
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 10:20:01
Message: <web.495b8d2bcd9d1e75483cfa400@news.povray.org>

Thorsten Froehlich <tho### [at] trfde> wrote:
> clipka wrote:
> > Don't expect all these to be "naive hardware implementation" in the same sense
> > as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever.
>
> Exactly that is why you ought to be looking at the SSE2/3 floating-point
> registers and associated hardware support. The x87 FPU is only there for
> legacy support and rather inefficient.

Hah! Say that again...

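Actually, from what I gather from the "Intel® 64 and IA-32 Architectures
Software Developer's Manual", all the fancy stuff from MMX through SSE, SSE2,
SSE3, SSSE3 to SSE4 is rather primitive compared to what the x87 FPU can do -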
except when it comes to bulk add, subtract, multiply or divide. Which is what
they're designed for: Vectors and matrices. That's why they're called Streaming
SIMD (= Single Instruction Multiple Data) Extensions.

Search for trigonometric or logarithmic functions - you'll not find any in the
SSE2 or SSE3 sections. You'll probably find that these still rely on good old
x87 FPU instructions.

>  > Even a floating-point addition is a non-trivial thing.
>
> Actually, it is not more complex than integer addition and multiplication.

- Check for NaNs, infinities and other such things
- Shift the smaller number's mantissa so that its exponent matches the larger one's
- Add the mantissae
- Check for mantissa overflow, re-normalizing the number if necessary
- Check for number format overflow

Doesn't sound as trivial to me as shoving two sets of bits into an array of
properly wired bit adders with carry in- and outputs, and then reading their
output lines.
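
To make the steps listed above concrete, here is a toy software model of a
single-precision add. It is only a sketch under simplifying assumptions: the
names `Unpacked`, `unpack` and `toy_add` are invented for this example, the
result is rounded by truncation instead of IEEE round-to-nearest-even, and no
real FPU is wired like this.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <utility>

// Toy representation: value == mant * 2^exp, with |mant| < 2^24.
struct Unpacked {
    std::int64_t mant;
    int exp;
};

// Split a finite, non-zero float into an integer mantissa and an exponent.
Unpacked unpack(float f)
{
    assert(std::isfinite(f) && f != 0.0f);
    int e = 0;
    const float frac = std::frexp(f, &e);  // f == frac * 2^e, 0.5 <= |frac| < 1
    return {static_cast<std::int64_t>(std::ldexp(frac, 24)), e - 24};
}

float toy_add(float a, float b)
{
    const float pos_inf = std::numeric_limits<float>::infinity();
    const float quiet_nan = std::numeric_limits<float>::quiet_NaN();

    // Step 1: NaNs, infinities and zeros are handled before any arithmetic.
    if (std::isnan(a) || std::isnan(b)) return quiet_nan;
    if (std::isinf(a) || std::isinf(b)) {
        if (std::isinf(a) && std::isinf(b) && (a > 0) != (b > 0)) return quiet_nan;
        return std::isinf(a) ? a : b;
    }
    if (a == 0.0f) return b;
    if (b == 0.0f) return a;

    // Step 2: bring both operands to the smaller exponent.
    Unpacked hi = unpack(a);
    Unpacked lo = unpack(b);
    if (hi.exp < lo.exp) std::swap(hi, lo);
    const int diff = hi.exp - lo.exp;
    if (diff > 30)  // toy shortcut: the smaller operand is treated as negligible
        return std::ldexp(static_cast<float>(hi.mant), hi.exp);

    // Step 3: add the (signed) mantissas as plain integers.
    std::int64_t sum = hi.mant * (std::int64_t{1} << diff) + lo.mant;
    int exp = lo.exp;
    if (sum == 0) return 0.0f;

    // Step 4: renormalize so the mantissa fits into 24 bits again
    // (integer division truncates, i.e. rounds toward zero).
    const std::int64_t limit = std::int64_t{1} << 24;
    while (sum >= limit || sum <= -limit) {
        sum /= 2;
        ++exp;
    }

    // Step 5: exponent too large for the format, report overflow as infinity.
    if (exp > 104) return sum > 0 ? pos_inf : -pos_inf;
    return std::ldexp(static_cast<float>(sum), exp);
}
```

For example, `toy_add(1.5f, 2.25f)` aligns the two mantissas, adds them, shifts
the sum back by one bit and returns 3.75. An integer add has no counterpart to
steps 1, 2, 4 and 5.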

IMUL is a different thing already. We're talking about something here that I'd
probably not want to implement in pure, non-clocked hardware.

From: clipka
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 10:35:01
Message: <web.495b9047cd9d1e75483cfa400@news.povray.org>

Warp <war### [at] tagpovrayorg> wrote:
>   Unless I'm mistaken, the way SSE works, it's a bit difficult to use it
> from portable C/C++ directly.

Not really.

>   Any decent C/C++ compiler will be directly able to use the FPU with any
> floating point arithmetic you have written in your code. (Of course there's
> a lot of optimization that can be done in order to use the FPU stack more
> efficiently, but in principle making C/C++ code use the FPU is a relatively
> trivial thing for a compiler to do.)

(I'm not sure whether it will be so easy for the compiler to make good use of
the FSINCOS instruction.)

>   However, using SSE efficiently is a lot more complicated. Unless I'm
> mistaken, unlike with the FPU, there's no direct C/C++ FP -> SSE algorithm
> which would work well in all cases. Modern compilers do already have SSE
> optimization support, but AFAIK it's rather limited.

It doesn't seem to be very difficult with software like POV-Ray: The Intel C++
compiler keeps spitting out lots of "code was VECTORIZED" at me every time I
compile the POV source code. Which is the compiler's way of saying that it
used SSE2 instructions for that piece of code.

I actually read the compiler doc about *that* stuff, and it seems to vectorize
quite a lot when allowed to. All you have to do is perform suitable math
operations on arrays of numbers.

Basically, it seems that everything that *looks* like some vector or matrix
math is actually vectorized by the compiler. And guess what: POV-Ray has a lot
of such stuff :)
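
As a sketch of what "math operations on arrays of numbers" can look like, here
is a plain loop that applies a 3x3 matrix to a batch of points. The
`PointBatch` layout and the function name are assumptions made for this
example, not taken from the POV-Ray source, and whether a given compiler
reports the loop as vectorized depends on its version and options.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays layout: one contiguous array per coordinate.
struct PointBatch {
    std::vector<double> x, y, z;
};

// Apply the row-major 3x3 matrix m to every point of the batch.
// Each iteration only touches element i, so iterations are independent.
void transform_points(const double m[9], const PointBatch& in, PointBatch& out)
{
    const std::size_t n = in.x.size();
    assert(in.y.size() == n && in.z.size() == n);
    out.x.resize(n);
    out.y.resize(n);
    out.z.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double px = in.x[i], py = in.y[i], pz = in.z[i];
        out.x[i] = m[0] * px + m[1] * py + m[2] * pz;
        out.y[i] = m[3] * px + m[4] * py + m[5] * pz;
        out.z[i] = m[6] * px + m[7] * py + m[8] * pz;
    }
}
```

The code contains no SSE-specific constructs at all. It merely keeps the data
contiguous and the loop body free of branches and cross-iteration dependencies.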

From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 10:35:38
Message: <495b914a@news.povray.org>

clipka <nomail@nomail> wrote:
> Actually, from what I gather from the "Intel® 64 and IA-32 Architectures
> Software Developer's Manual", all the fancy stuff from MMX through SSE, SSE2,
> SSE3, SSSE3 to SSE4 is rather primitive compared to what the x87 FPU can do -
> except when it comes to bulk add, subtract, multiply or divide. Which is what
> they're designed for: Vectors and matrices. That's why they're called Streaming
> SIMD (= Single Instruction Multiple Data) Extensions.

  Since POV-Ray performs a lot of matrix multiplications (as well as
vector x matrix multiplications) it could theoretically benefit from SSE
optimizations. Of course it's quite difficult to say in (portable) C++
"calculate this matrix multiplication in the most optimal way using SSE".

  OTOH, I wonder how much that would really speed it up, because AFAIK
POV-Ray spends most of its time calculating ray-boundingbox and ray-surface
intersections rather than multiplying vectors and matrices.
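
  For reference, a ray-boundingbox test usually looks something like the
"slab" test sketched below. This is a generic textbook formulation with
made-up names (`Vec3`, `ray_hits_box`), not POV-Ray's actual code. Note how
much of it is comparisons and early exits rather than bulk arithmetic.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>

struct Vec3 {
    double x, y, z;
};

// Slab test: intersect the ray with the three pairs of axis-aligned planes
// and check that the resulting parameter intervals overlap for t >= 0.
bool ray_hits_box(const Vec3& origin, const Vec3& dir,
                  const Vec3& bmin, const Vec3& bmax)
{
    assert(bmin.x <= bmax.x && bmin.y <= bmax.y && bmin.z <= bmax.z);
    const double o[3] = {origin.x, origin.y, origin.z};
    const double d[3] = {dir.x, dir.y, dir.z};
    const double lo[3] = {bmin.x, bmin.y, bmin.z};
    const double hi[3] = {bmax.x, bmax.y, bmax.z};

    double tnear = 0.0;  // only accept hits in front of the origin
    double tfar = std::numeric_limits<double>::infinity();

    for (int i = 0; i < 3; ++i) {
        if (d[i] == 0.0) {
            // Ray is parallel to this slab: it must start inside it.
            if (o[i] < lo[i] || o[i] > hi[i]) return false;
            continue;
        }
        double t0 = (lo[i] - o[i]) / d[i];
        double t1 = (hi[i] - o[i]) / d[i];
        if (t0 > t1) std::swap(t0, t1);
        tnear = std::max(tnear, t0);
        tfar = std::min(tfar, t1);
        if (tnear > tfar) return false;  // intervals no longer overlap
    }
    return true;
}
```

  A single such test works on three components at most and branches several
times, so it offers far less obvious SIMD parallelism than a long loop over an
array.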

-- 
                                                          - Warp

From: nemesis
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 10:50:00
Message: <web.495b93d0cd9d1e75180057960@news.povray.org>

Damn!  Isn't it exciting to see this much talk about actual povray code and
improvement rather than just read Orchid's blog posts all day?  No offense,
Andrew! :D

From: clipka
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 11:15:01
Message: <web.495b99cfcd9d1e75483cfa400@news.povray.org>

Warp <war### [at] tagpovrayorg> wrote:
>   Since POV-Ray performs a lot of matrix multiplications (as well as
> vector x matrix multiplications) it could theoretically benefit from SSE
> optimizations. Of course it's quite difficult to say in (portable) C++
> "calculate this matrix multiplication in the most optimal way using SSE".
>
>   OTOH, I wonder how much that would really speed it up, because AFAIK
> POV-Ray spends most of its time calculating ray-boundingbox and ray-surface
> intersections rather than multiplying vectors and matrices.

On my AMD64 Linux machine, POV-Ray 3.6 (a generic i686 binary) runs the
benchmark in ~1770 seconds. The MegaPOV 1.2.1 AMD64 binary does the same stunt
in ~1423 seconds. I expect this to be mainly due to SSE2.

No special SSE2 "optimization hints" have been coded into MegaPOV. It's just
plain POV 3.6 code with some functionality added. And compiled with different
options. (Both were compiled using g++)

From: clipka
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 11:15:01
Message: <web.495b9a27cd9d1e75483cfa400@news.povray.org>

"nemesis" <nam### [at] gmailcom> wrote:
> Damn!  Isn't it exciting to see this much talk about actual povray code and
> improvement rather than just read Orchid's blog posts all day?  No offense,
> Andrew! :D

Who is Orchid? Have I missed something...?

From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 11:21:12
Message: <495b9bf8@news.povray.org>

clipka <nomail@nomail> wrote:
> On my AMD64 Linux machine, POV-Ray 3.6 runs the benchmark in ~1770 seconds,
> being a generic i686 binary. MegaPOV 1.2.1 AMD64 binary does the same stunt in
> ~1423 seconds. I expect this to be mainly due to SSE2.

> No special SSE2 "optimization hints" have been coded into MegaPOV. It's just
> plain POV 3.6 code with some functionality added. And compiled with different
> options. (Both were compiled using g++)

  That isn't very telling if they were compiled with different options.

  They should be compiled with exactly the same options, except for the SSE2
optimizations, in order for the measurement to be reliable.
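
  A minimal sketch of such a controlled comparison: compile one and the same
source twice and change only the floating point code generation. The workload
below is a made-up stand-in (a dot product), the file name `bench.cpp` is
hypothetical, and the build lines in the comment assume GCC targeting 32-bit
x86.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Stand-in workload for the example: a plain dot product.
double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += a[i] * b[i];
    return s;
}

// Runs the workload `repeats` times and returns the elapsed wall-clock
// seconds. The results are accumulated into `sink` so that the compiler
// cannot discard the computation.
double time_dot(const std::vector<double>& a, const std::vector<double>& b,
                int repeats, double& sink)
{
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        sink += dot(a, b);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Example builds (assumed setup), identical except for the FP unit used:
//   g++ -O2 -m32 -mfpmath=387         bench.cpp -o bench_x87
//   g++ -O2 -m32 -msse2 -mfpmath=sse  bench.cpp -o bench_sse2
```

  Comparing a 32-bit build against a 64-bit build changes more than the
floating point unit (register count and calling convention, for instance), so
a difference between those two cannot be attributed to SSE2 alone.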

-- 
                                                          - Warp
