POV-Ray : Newsgroups : povray.beta-test : Radiosity Status: Giving Up...
  Radiosity Status: Giving Up... (Messages 61 to 70 of 194)
From: andrel
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 05:41:04
Message: <495B4CA0.2030302@hotmail.com>
On 31-Dec-08 4:55, clipka wrote:
> andrel <a_l### [at] hotmailcom> wrote:
>> sqrt takes at worst N/2 iterations, with N the number of significant
>> bits. That is for a naive implementation; I thought there are even
>> faster algorithms. Cheap calculators often have a sqrt button, which
>> probably implies that they have a (naive) hardware implementation. I
>> don't know whether it is also available on more complex processors.
> 
> From the "Intel(R) 64 and IA-32 Architectures Software Developer's Manual,
> Volume 1: Basic Architecture":
> 
> ------------------------
> 8.3.5 Basic Arithmetic Instructions
> 
> The following floating-point instructions perform basic arithmetic operations on
> floating-point numbers. Where applicable, these instructions match IEEE Standard
> 754:
> 
> FADD/FADDP Add floating point
> FIADD Add integer to floating point
> FSUB/FSUBP Subtract floating point
> FISUB Subtract integer from floating point
> FSUBR/FSUBRP Reverse subtract floating point
> FISUBR Reverse subtract floating point from integer
> FMUL/FMULP Multiply floating point
> FIMUL Multiply integer by floating point
> FDIV/FDIVP Divide floating point
> FIDIV Divide floating point by integer
> FDIVR/FDIVRP Reverse divide
> FIDIVR Reverse divide integer by floating point
> FABS Absolute value
> FCHS Change sign
> FSQRT Square root                          <--
> FPREM Partial remainder
> FPREM1 IEEE partial remainder
> FRNDINT Round to integral value
> FXTRACT Extract exponent and significand
> 
> [...]
> 
> 8.3.7 Trigonometric Instructions
> 
> The following instructions perform four common trigonometric functions:
> 
> FSIN Sine
> FCOS Cosine
> FSINCOS Sine and cosine
> FPTAN Tangent
> FPATAN Arctangent
> 
> [...]
> 
> 8.3.9 Logarithmic, Exponential, and Scale
> 
> The following instructions provide two different logarithmic functions, an
> exponential function and a scale function:
> 
> FYL2X Logarithm
> FYL2XP1 Logarithm epsilon
> F2XM1 Exponential
> FSCALE Scale
> 
> [...]
> 
> ------------------------

I didn't know that even sin and cos are hardware-supported these days;
interesting. That lets you pinpoint the last time I did any assembler
programming to within two years. Another source gives the 8087 as the
first chip to implement FSQRT and the 80387 as the first with FSIN etc.
That makes sense.
If we could now also find the average timing of these instructions, we
might even be able to answer Warp's question of how much faster SQRT is
compared to e.g. sin.
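For illustration, a rough C++ micro-benchmark along those lines (a sketch
only: the loop count and the volatile accumulator are arbitrary choices,
and on a modern out-of-order CPU the numbers give a coarse ratio at best):

#include <chrono>
#include <cmath>
#include <cstdio>

// Times one math function over many calls; 'sink' is volatile so the
// optimizer cannot delete the loop body.
template <typename F>
static double time_loop(F f)
{
    volatile double sink = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 1; i <= 10000000; ++i)
        sink = sink + f(static_cast<double>(i));
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    double t_sqrt = time_loop([](double x) { return std::sqrt(x); });
    double t_sin  = time_loop([](double x) { return std::sin(x); });
    std::printf("sqrt: %.2fs  sin: %.2fs  sin/sqrt: %.2f\n",
                t_sqrt, t_sin, t_sin / t_sqrt);
}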

> 
> Don't expect all these to be "naive hardware implementation" in the same sense
> as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever. Even a
> floating-point addition is a non-trivial thing. Rather consider the FPU
> (Floating Point Unit) a computer in the computer: You feed some data into it,
> give it an instruction what to do with that data, and then basically wait for
> the (fast and highly optimized) hard-wired FPU "programlets" to complete. Or do
> some other useful stuff in the meantime.

Microcode may be the word you are looking for.

> To my knowledge, a lot of work is saved by using Look-up-tables (like people
> used in old times before pocket calculators had a "sin", "log" or "sqrt"
> button) to optimize away a few iterations of the algorithms.
> 
> BTW, your typical pocket calculator will do just about the same. AFAIK there is
> no non-iterative algorithm for computing the square root of an arbitrary
> floating point number.

It is an iterative procedure, but it needs the same hardware as a divide.
So once you have built something that can do division in hardware with
some shift registers and a comparator, there is bound to be enough space
left on the chip to add the few gates for a sqrt as well. You can do both
a sqrt and a division in pure hardware, without microcode.
I wouldn't be surprised if current FPUs use some clever tricks in
microcode to speed up sqrt with a more sophisticated algorithm. But, as I
said, I don't do much programming close to the machine these days.
Hmm, I haven't even read Dr. Dobb's for a few years. I have been thinking
of renewing my subscription for almost as long, but never got round to it.
Let's spend some cash before the year ends.
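For reference, the shift-and-subtract idea in code - a toy C++ version of
the classic binary digit-by-digit square root (my sketch of the textbook
scheme, not how any particular FPU implements FSQRT):

#include <cstdint>
#include <cstdio>

// Digit-by-digit binary square root: one trial subtraction and shift per
// result bit, i.e. N/2 iterations for an N-bit input - the same kind of
// shift/compare/subtract hardware a restoring divider uses.
static uint32_t isqrt(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit  = 1u << 30;        // highest power of four in 32 bits

    while (bit > n) bit >>= 2;       // position the first trial bit

    while (bit != 0) {
        if (n >= root + bit) {       // trial subtraction, as in division
            n   -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;                     // floor(sqrt(original n))
}

int main()
{
    std::printf("isqrt(1987) = %u\n", isqrt(1987));   // prints 44
}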



From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 05:50:28
Message: <495b4e74@news.povray.org>
andrel <a_l### [at] hotmailcom> wrote:
> I didn't know even sin and cos are now hardware supported, interesting.

  "Now"? In which cave are you living?-)

  When was the 80387 introduced? According to wikipedia, 1987. That's like
21 years ago...

-- 
                                                          - Warp



From: andrel
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 06:00:11
Message: <495B511B.3080109@hotmail.com>
On 31-Dec-08 11:50, Warp wrote:
> andrel <a_l### [at] hotmailcom> wrote:
>> I didn't know even sin and cos are now hardware supported, interesting.
> 
>   "Now"? In which cave are you living?-)
> 
>   When was the 80387 introduced? According to wikipedia, 1987. That's like
> 21 years ago...
> 
I invited you to do the math ;) 1987 was more or less the last time I
wrote any machine code, and that was not for a 386/387 system. As far as
I am concerned, the hardware and software developers for the PC
compatibles never really invited programming close to the hardware.



From: Thorsten Froehlich
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 07:13:51
Message: <495b61ff$1@news.povray.org>
clipka wrote:
> Don't expect all these to be "naive hardware implementation" in the same sense
> as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever.

That is exactly why you ought to be looking at the SSE2/3 floating-point
registers and the associated hardware support. The x87 FPU is only there
for legacy support and is rather inefficient.

 > Even a floating-point addition is a non-trivial thing.

Actually, it is not more complex than integer addition and multiplication.

	Thorsten


Post a reply to this message

From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 07:24:52
Message: <495b6494@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> clipka wrote:
> > Don't expect all these to be "naive hardware implementation" in the same sense
> > as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever.

> That is exactly why you ought to be looking at the SSE2/3 floating-point
> registers and the associated hardware support. The x87 FPU is only there
> for legacy support and is rather inefficient.

  Unless I'm mistaken, the way SSE works, it's a bit difficult to use it
from portable C/C++ directly.

  Any decent C/C++ compiler will be directly able to use the FPU with any
floating point arithmetic you have written in your code. (Of course there's
a lot of optimization that can be done in order to use the FPU stack more
efficiently, but in principle making C/C++ code use the FPU is a relatively
trivial thing for a compiler to do.)

  However, using SSE efficiently is a lot more complicated. Unless I'm
mistaken, unlike with the FPU, there's no direct C/C++ FP -> SSE mapping
that works well in all cases. Modern compilers already have SSE
optimization support, but AFAIK it's rather limited.

  If you are only going to use SSE as a direct substitute for the FPU,
I assume that would be possible, but you probably won't get any significant
speed benefit (perhaps even the contrary). To truly benefit from SSE, you
need to vectorize the calculations so that you can calculate many things in
parallel. That is extremely hard, if not impossible, for a compiler to do
with arbitrary C/C++ code.

  And even if you knew perfectly well what you were doing and how you want
your code to be SSE-vectorized, there's no portable way of expressing that.
The only thing you can do is try to write code in such a way that some
compiler might be able to SSE-vectorize it when enough optimizations have
been turned on. And even then there's only so much you can do.
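For illustration, this is what spelling the vectorization out by hand looks
like with Intel's SSE intrinsics from <xmmintrin.h> - a minimal sketch, and
a vendor extension rather than standard C++, which is exactly the
portability problem:

#include <xmmintrin.h>

// Adds four pairs of floats with a single packed ADDPS instruction.
void add4(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);               // load 4 unaligned floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));    // one add for all 4 lanes
}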

-- 
                                                          - Warp



From: Thorsten Froehlich
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 08:06:48
Message: <495b6e68@news.povray.org>
Warp wrote:
> Thorsten Froehlich <tho### [at] trfde> wrote:
>> clipka wrote:
>>> Don't expect all these to be "naive hardware implementation" in the same sense
>>> as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever.
> 
>> That is exactly why you ought to be looking at the SSE2/3 floating-point
>> registers and the associated hardware support. The x87 FPU is only there
>> for legacy support and is rather inefficient.
> 
>   Unless I'm mistaken, the way SSE works, it's a bit difficult to use it
> from portable C/C++ directly.

You are mistaken. All modern x86 compilers (gcc, icc, vc) can use it
instead of the x87 FPU. I think since versions 3.2, 8, and 7.1 respectively.

	Thorsten



From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 08:11:04
Message: <495b6f68@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> You are mistaken. All modern x86 compilers (gcc,icc,vc) can use it instead 
> of the x87 FPU. I think since versions 3.2, 8, and 7.1 respectively.

  "Can use" doesn't really tell how effectively they can use it.

-- 
                                                          - Warp



From: clipka
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 10:20:01
Message: <web.495b8d2bcd9d1e75483cfa400@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> clipka wrote:
> > Don't expect all these to be "naive hardware implementation" in the same sense
> > as, say, an integer addition, shift, bit-wise AND/OR/XOR or whatever.
>
> That is exactly why you ought to be looking at the SSE2/3 floating-point
> registers and the associated hardware support. The x87 FPU is only there
> for legacy support and is rather inefficient.

Hah! Say that again...

Actually, from what I gather from the "Intel(R) 64 and IA-32 Architectures
Software Developer's Manual", all the fancy stuff from MMX through SSE,
SSE2, SSE3, SSSE3 to SSE4 is rather primitive compared to what the x87 FPU
can do - except when it comes to bulk add, subtract, multiply or divide.
Which is what they're designed for: vectors and matrices. That's why
they're called Streaming SIMD (= Single Instruction, Multiple Data)
Extensions.

Search for trigonometric or logarithmic functions - you won't find any in
the SSE2 or SSE3 sections. You'll probably find that these still rely on
the good old x87 FPU instructions.


>  > Even a floating-point addition is a non-trivial thing.
>
> Actually, it is not more complex than integer addition and multiplication.

- Check for NaNs, infinities and other such things
- Normalize the smaller number to match the larger one's exponent
- Add the mantissae
- Check for mantissa overflow, re-normalizing the number if necessary
- Check for number format overflow

Doesn't sound as trivial to me as shoving two sets of bits into an array of
properly wired bit adders with carry in- and outputs, and then reading their
output lines.
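To make those steps concrete, a toy C++ sketch for two positive, *normal*
IEEE 754 single-precision values (rounding, NaN/infinity and subnormal
handling deliberately left out - which is where most of the real
complexity hides):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <utility>

// Toy FADD following the steps above. Positive normal inputs only.
float toy_fadd(float a, float b)
{
    uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);

    int ea = (ua >> 23) & 0xFF;                  // biased exponents
    int eb = (ub >> 23) & 0xFF;
    uint32_t ma = (ua & 0x7FFFFFu) | 0x800000u;  // restore implicit 1 bit
    uint32_t mb = (ub & 0x7FFFFFu) | 0x800000u;

    // Normalize the smaller number to the larger one's exponent.
    if (ea < eb) { std::swap(ea, eb); std::swap(ma, mb); }
    int d = ea - eb;
    mb = (d < 24) ? (mb >> d) : 0;               // shifted-out bits drop

    uint32_t m = ma + mb;                        // add the mantissae

    // Mantissa overflow: re-normalize and bump the exponent.
    if (m & 0x1000000u) { m >>= 1; ++ea; }

    if (ea >= 0xFF) return HUGE_VALF;            // number format overflow

    uint32_t ur = (uint32_t(ea) << 23) | (m & 0x7FFFFFu);
    float r;
    std::memcpy(&r, &ur, sizeof r);
    return r;
}

int main()
{
    std::printf("%g (exact %g)\n", toy_fadd(1.5f, 2.25f), 1.5f + 2.25f);
}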

IMUL is already a different matter. We're talking about something here that
I'd probably not want to implement in pure, non-clocked hardware.



From: clipka
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 10:35:01
Message: <web.495b9047cd9d1e75483cfa400@news.povray.org>
Warp <war### [at] tagpovrayorg> wrote:
>   Unless I'm mistaken, the way SSE works, it's a bit difficult to use it
> from portable C/C++ directly.

Not really.

>   Any decent C/C++ compiler will be directly able to use the FPU with any
> floating point arithmetic you have written in your code. (Of course there's
> a lot of optimization that can be done in order to use the FPU stack more
> efficiently, but in principle making C/C++ code use the FPU is a relatively
> trivial thing for a compiler to do.)

(I'm not sure whether it will be so easy for the compiler to make good use of
the FSINCOS command.)

>   However, using SSE efficiently is a lot more complicated. Unless I'm
> mistaken, unlike with the FPU, there's no direct C/C++ FP -> SSE mapping
> that works well in all cases. Modern compilers already have SSE
> optimization support, but AFAIK it's rather limited.

It doesn't seem to be very difficult with software like POV-Ray: The Intel
C++ compiler keeps spitting out lots of "code was VECTORIZED" messages at
me every time I compile the POV-Ray source code. That is the compiler's
way of saying that it inserted SSE2 instructions.

I actually read the compiler documentation about *that* stuff, and it
seems to vectorize quite a lot when allowed to. All you have to do is
perform suitable math operations on arrays of numbers.

Basically, it seems that everything which *looks* like vector or matrix
math is actually vectorized by the compiler. And guess what: POV-Ray has a
lot of such stuff :)
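A typical example of the kind of loop it reports on - independent
iterations doing plain math over arrays (my sketch; with icc, or gcc -O3,
this usually compiles to packed SSE2 instructions):

// Plain arithmetic over arrays, no dependency between iterations: prime
// material for the auto-vectorizer. Note that possible aliasing between
// 'out' and the inputs can force the compiler to emit runtime checks.
void scale_add(const double* a, const double* b, double* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = 0.5 * a[i] + b[i];
}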



From: Warp
Subject: Re: Radiosity Status: Giving Up...
Date: 31 Dec 2008 10:35:38
Message: <495b914a@news.povray.org>
clipka <nomail@nomail> wrote:
> Actually, from what I gather from the "Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual", all the fancy stuff from MMX through SSE, SSE2,
> SSE3, SSSE3 to SSE4 is rather primitive compared to what the x87 FPU can do -
> except when it comes to bulk add, subtract, multiply or divide. Which is what
> they're designed for: vectors and matrices. That's why they're called Streaming
> SIMD (= Single Instruction, Multiple Data) Extensions.

  Since POV-Ray performs a lot of matrix multiplications (as well as
vector x matrix multiplications), it could theoretically benefit from SSE
optimizations. Of course it's quite difficult to say in (portable) C++
"calculate this matrix multiplication in the most optimal way using SSE".

  OTOH, I wonder how much that would really speed it up, because AFAIK
POV-Ray spends most of its time calculating ray-bounding-box and
ray-surface intersections rather than multiplying vectors and matrices.
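For what it's worth, a hand-written SSE sketch of such a 4x4 matrix *
vector product (column-major layout and 16-byte-aligned data assumed; my
illustration, not POV-Ray code):

#include <xmmintrin.h>

// out = M * v: four packed multiply/adds replace sixteen scalar
// multiplies and twelve scalar adds.
void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    __m128 c0 = _mm_load_ps(m +  0);   // column 0
    __m128 c1 = _mm_load_ps(m +  4);   // column 1
    __m128 c2 = _mm_load_ps(m +  8);   // column 2
    __m128 c3 = _mm_load_ps(m + 12);   // column 3

    __m128 r = _mm_mul_ps(c0, _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(v[3])));
    _mm_store_ps(out, r);
}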

-- 
                                                          - Warp



