CUDA - NVIDIA's massively parallel programming architecture (Messages 6 to 15 of 25)

From: Eero Ahonen
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 12 Feb 2007 16:42:07
Message: <45d0df2f$1@news.povray.org>
PaulSh wrote:
> POV-Ray is finally making its transition to SMP systems, which is great news
> for those of us who can afford them. However, in terms of raw CPU power the
> latest NVIDIA graphics cards would seem to completely blow away anything in
> the way of SMP solutions this side of a research lab. Their CUDA GPU
> architecture is claimed to allow up to 128 independent processing units
> each running at 1.35GHz to be thrown at computationally-intensive problems.
> So, my first thought was not SETI or protein folding, but POV-Ray. Given
> that V3.7 is going to be fully threaded, what would be the possibility of a
> CUDA version? I guess that will depend on the time and abilities of someone
> with a lot more time and a lot more ability than myself...

Dunno. But I'd guess it would be easier to compile 3.7 for one of these:
http://www.cray.com/products/xt4/specifications.html
http://www.cray.com/products/xd1/index.html

The latter one runs Linux (AFAIK x86_64), so POV should be a piece
of cake. Of course, the price may be somewhere above the "too high"
level :p.

-- 
Eero "Aero" Ahonen
   http://www.zbxt.net
      aer### [at] removethiszbxtnetinvalid



From: Andrew Price
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 24 Mar 2007 23:10:01
Message: <web.4605f4edefc6bb74b4888b4a0@news.povray.org>
I was just looking at this page:

http://gizmodo.com/gadgets/home-entertainment/breaking-ps3-folding-ps3-triples-folding-at-homes-computing-power-to-over-500-tflopspflops-in-spitting-range-246664.php

It would seem that each PS3 is contributing the same CPU horsepower as 24.6
PCs on average. Therefore a POV-Ray port for the PS3 might be well worthwhile.

I would be willing to try my best to help get this done.

:)



From:  theCardinal
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 18 Apr 2007 04:00:02
Message: <web.4625ceddefc6bb7494cf37ec0@news.povray.org>
Ben Chambers <ben### [at] pacificwebguycom> wrote:
> PaulSh wrote:
> > POV-Ray is finally making its transition to SMP systems, which is great news
> > for those of us who can afford them. However, in terms of raw CPU power the
> > latest NVIDIA graphics cards would seem to completely blow away anything in
> > the way of SMP solutions this side of a research lab. Their CUDA GPU
> > architecture is claimed to allow up to 128 independent processing units
> > each running at 1.35GHz to be thrown at computationally-intensive problems.
> > So, my first thought was not SETI or protein folding, but POV-Ray. Given
> > that V3.7 is going to be fully threaded, what would be the possibility of a
> > CUDA version? I guess that will depend on the time and abilities of someone
> > with a lot more time and a lot more ability than myself...
> >
> >
>
> No good, they're only single precision.  Plus, each shader unit would
> need access to the entire scene file, which would be a pain in the a**
> to code.
>
> ...Chambers

According to the CUDA programming guide published by NVidia the GPGPU
architecture is as competent as any 32 bit processor.  More precise
computations can be simulated using multiple registers for a computation
instead of a single register if my memory serves - so I seriously doubt
this is a real obstacle.

The device code is written in a simple extension of C, with host code either
written to match or working through the device drivers directly from any
language capable of such.
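
To give a flavour of that split, here is a minimal sketch - the names
(render_kernel, trace_pixel) are made up for illustration, not taken from
POV-Ray or the CUDA SDK:

  // Device code: a per-pixel kernel written in CUDA's C extension.
  __device__ float3 trace_pixel(int x, int y, int w, int h)
  {
      // Placeholder shading (a gradient) standing in for real ray tracing.
      return make_float3((float)x / w, (float)y / h, 0.0f);
  }

  __global__ void render_kernel(float3 *fb, int w, int h)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x < w && y < h)
          fb[y * w + x] = trace_pixel(x, y, w, h);
  }

  // Host code: plain C/C++ in the same source file, launching the kernel.
  void render(float3 *dev_fb, int w, int h)
  {
      dim3 block(16, 16);
      dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
      render_kernel<<<grid, block>>>(dev_fb, w, h);
      cudaThreadSynchronize();  // wait for the device to finish
  }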

The major difficulty involved would be preparing the pov-ray source to run
efficiently on a SIMD architecture - native code may run out of the box
through the provided compiler, but the results would be poor at best
without optimization to take advantage of the particular memory hierarchies
involved.

Sharing parse trees used in the ray tracing is actually highly efficient on
this architecture - it would be stored in memory read-only from the device
accessible from any thread (without locking).  The host processor is
typically responsible for loading the parse tree, since its preparation
is not likely to be efficient in parallel.
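
As a sketch of what that might look like (the Sphere struct is a stand-in;
a real POV-Ray scene would be far too large for the 64 KB of constant memory
and would have to live in global or texture memory instead):

  // Read-only scene data, visible to every thread with no locking needed.
  struct Sphere { float3 center; float radius; };

  __constant__ Sphere d_scene[256];

  // The host parses the scene, then copies it to the device once.
  void upload_scene(const Sphere *host_scene, int count)
  {
      cudaMemcpyToSymbol(d_scene, host_scene, count * sizeof(Sphere));
  }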

The biggest hurdle I am aware of is load-balancing threads to make efficient
use of the processing power available - simply subdividing the image into
unrelated render-blocks is obviously bounded by the worst-case running time
of any single block, which may be unacceptably slow for any sufficiently
complex scene.  Perhaps someone with more in-depth knowledge of the
algorithms can determine what the limiting factors of per-ray threading or
other techniques would be.  (This may or may not have been addressed for
a multi-core implementation - since at worst for the naive implementation
you still have 1/n CPU utilization for an n-core system.)
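
One possible way around that, sketched below: let thread blocks pull tiles
from a shared counter instead of taking a fixed assignment, so fast tiles
don't leave processors idle behind a slow one.  Hypothetical code, and note
it assumes a global atomicAdd(), which needs compute capability 1.1 or
later - the original G80/8800 parts are 1.0 and lack it:

  __device__ int next_tile;   // host zeroes this before the launch

  __global__ void render_tiles(float3 *fb, int tiles_total, int tiles_x,
                               int tile_w, int tile_h, int w, int h)
  {
      __shared__ int tile;
      for (;;) {
          if (threadIdx.x == 0 && threadIdx.y == 0)
              tile = atomicAdd(&next_tile, 1);  // grab the next unit of work
          __syncthreads();
          if (tile >= tiles_total)
              return;                           // whole block exits together
          int x = (tile % tiles_x) * tile_w + threadIdx.x;
          int y = (tile / tiles_x) * tile_h + threadIdx.y;
          if (x < w && y < h)
              fb[y * w + x] = trace_pixel(x, y, w, h);  // as sketched above
          __syncthreads();  // everyone done before tile is overwritten
      }
  }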

Justin

see:
http://developer.nvidia.com/object/cuda.html#documentation



From: Patrick Elliott
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 20 Apr 2007 16:22:30
Message: <MPG.2092cebe2e3c94f598a003@news.povray.org>
In article <web.4625ceddefc6bb7494cf37ec0@news.povray.org>, 
_the### [at] yahoocom says...
> Ben Chambers <ben### [at] pacificwebguycom> wrote:
> > PaulSh wrote:
> > > POV-Ray is finally making its transition to SMP systems, which is great news
> > > for those of us who can afford them. However, in terms of raw CPU power the
> > > latest NVIDIA graphics cards would seem to completely blow away anything in
> > > the way of SMP solutions this side of a research lab. Their CUDA GPU
> > > architecture is claimed to allow up to 128 independent processing units
> > > each running at 1.35GHz to be thrown at computationally-intensive problems.
> > > So, my first thought was not SETI or protein folding, but POV-Ray. Given
> > > that V3.7 is going to be fully threaded, what would be the possibility of a
> > > CUDA version? I guess that will depend on the time and abilities of someone
> > > with a lot more time and a lot more ability than myself...
> >
> > No good, they're only single precision.  Plus, each shader unit would
> > need access to the entire scene file, which would be a pain in the a**
> > to code.
> >
> > ...Chambers
> 
> According to the CUDA programming guide published by NVidia the GPGPU
> architecture is as competent as any 32 bit processor.  More precise
> computations can be simulated using multiple registers for a computation
> instead of a single register if my memory serves - so I seriously doubt
> this is a real obstacle.
> 
> The device code is written in a simple extension of C, with host code either
> written to match or working through the device drivers directly from any
> language capable of such.
> 
> The major difficulty involved would be preparing the pov-ray source to run
> efficiently on a SIMD architecture - native code may run out of the box
> through the provided compiler, but the results would be poor at best
> without optimization to take advantage of the particular memory hierarchies
> involved.
> 
> Sharing parse trees used in the ray tracing is actually highly efficient on
> this architecture - it would be stored in memory read-only from the device
> accessible from any thread (without locking).  The host processor is
> typically responsible for loading the parse tree, since its preparation
> is not likely to be efficient in parallel.
> 
> The biggest hurdle I am aware of is load-balancing threads to make efficient
> use of the processing power available - simply subdividing the image into
> unrelated render-blocks is obviously bounded by the worst-case running time
> of any single block, which may be unacceptably slow for any sufficiently
> complex scene.  Perhaps someone with more in-depth knowledge of the
> algorithms can determine what the limiting factors of per-ray threading or
> other techniques would be.  (This may or may not have been addressed for
> a multi-core implementation - since at worst for the naive implementation
> you still have 1/n CPU utilization for an n-core system.)
> 
> Justin
> 
> see:
> http://developer.nvidia.com/object/cuda.html#documentation
> 
Umm. Sorry, but: A) trying to use extra registers to emulate wider 
arithmetic adds more overhead than it saves, I think, and that presumes 
you can even do it effectively. B) Most cards are not 32-bit internally, 
or at least do not provide 32 bits for every register or operation they 
perform. C) POVRay is often now compiled for 64-bit, so... and D) It's 
hardly trivial to write code that uses multi-core processing when 
dealing with things that are often quite linear in implementation. A 
fact that *still* results in Radiosity and some other features being in 
the "not yet working" mode in the new multi-core CPU version of POVRay 
already being developed. Adding in a graphics card that would only be 
directly compatible with the 32-bit compile of it just adds more 
headaches.

And that is my fairly inexpert opinion. I suspect that the detailed 
explanation would go well past what I said and be more specific as to 
why it just won't work currently. In fact, I am certain of it, since you 
are like the third person to bring it up in the last year, so the 
explanation for why it won't work is someplace in the archives already.

-- 
void main () {

    call functional_code()
  else
    call crash_windows();
}

Get 3D Models, 3D Content, and 3D Software at DAZ3D!
http://www.daz3d.com/index.php?refid=16130551



From: Chambers
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 21 Apr 2007 14:29:24
Message: <462a5804@news.povray.org>
_theCardinal wrote:
> Ben Chambers <ben### [at] pacificwebguycom> wrote:
>> No good, they're only single precision.  Plus, each shader unit would
>> need access to the entire scene file, which would be a pain in the a**
>> to code.
>>
>> ...Chambers
> 
> According to the CUDA programming guide published by NVidia the GPGPU
> architecture is as competent as any 32 bit processor.  More precise
> computations can be simulated using multiple registers for a computation
> instead of a single register if my memory serves - so I seriously doubt
> this is a real obstacle.

But double precision is actually 64bit.  Until recently (I don't 
remember exactly which model), NVidia didn't even do full 32bit FP (that 
is, single precision), but only 24.  POV-Ray, for the last dozen years, 
has done 64bit FP (double precision), as the extra accuracy is necessary 
for the types of computations it does.

Sure, you can simulate it in the same way you can use two integer units 
to simulate a fixed point number, but the result is slow.  Perhaps if 
Intel surprises everyone, and releases their next graphics chip as a 
double precision FP monster, we'd be able to take advantage of that, 
but the current ATI / NVidia cards aren't up to the task of dealing with 
POV-Ray.
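
(For the curious: the usual trick is "double-single" arithmetic, keeping
each value as an unevaluated pair of floats - roughly 48 mantissa bits
instead of a true double's 53, and around ten operations where the hardware
would use one.  A sketch of just the addition step, using Knuth's two-sum;
the names are mine, not from any POV-Ray code:

  // A value is represented as hi + lo, with |lo| much smaller than |hi|.
  typedef struct { float hi, lo; } dsfloat;

  __device__ dsfloat ds_add(dsfloat a, dsfloat b)
  {
      float s = a.hi + b.hi;
      float v = s - a.hi;
      float e = (a.hi - (s - v)) + (b.hi - v);  // rounding error of s
      e += a.lo + b.lo;                         // fold in the low parts
      dsfloat r;
      r.hi = s + e;                             // renormalize the pair
      r.lo = e - (r.hi - s);
      return r;
  }

Roughly ten flops for one emulated add, which is exactly why it's slow.)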

> The major difficulty involved would be preparing the pov-ray source to run
> efficiently on a SIMD architecture - native code may run out of the box
> through the provided compiler, but the results would be poor at best
> without optimization to take advantage of the particular memory hierarchies
> involved.

Once the 3.7 source is out, it should be much easier, as the major task 
is simply fitting it to a parallel paradigm.  Having already done that, 
porting to different parallel architectures should be trivial (relative 
to the original threading support, that is).

-- 
...Ben Chambers
www.pacificwebguy.com



From: Warp
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 21 Apr 2007 14:36:17
Message: <462a59a1@news.povray.org>
Chambers <ben### [at] pacificwebguycom> wrote:
> Perhaps if 
> Intel surprises everyone, and releases their next graphics chip as a 
> double precision FP monster, we'd be able to take advantage of that, 

  Exactly how would it be different from current FPUs?

-- 
                                                          - Warp



From:  theCardinal
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 21 Apr 2007 21:35:01
Message: <web.462abb54efc6bb7494cf37ec0@news.povray.org>
Chambers <ben### [at] pacificwebguycom> wrote:
> _theCardinal wrote:
> > Ben Chambers <ben### [at] pacificwebguycom> wrote:
> >> No good, they're only single precision.  Plus, each shader unit would
> >> need access to the entire scene file, which would be a pain in the a**
> >> to code.
> >>
> >> ...Chambers
> >
> > According to the CUDA programming guide published by NVidia the GPGPU
> > architecture is as competent as any 32 bit processor.  More precise
> > computations can be simulated using multiple registers for a computation
> > instead of a single register if my memory serves - so I seriously doubt
> > this is a real obstacle.
>
> But double precision is actually 64bit.  Until recently (I don't
> remember exactly which model), NVidia didn't even do full 32bit FP (that
> is, single precision), but only 24.  POV-Ray, for the last dozen years,
> has done 64bit FP (double precision), as the extra accuracy is necessary
> for the types of computations it does.
>
> Sure, you can simulate it in the same way you can use two integer units
> to simulate a fixed point number, but the result is slow.  Perhaps if
> Intel surprises everyone, and releases their next graphics chip as a
> double precision FP monster, we'd be able to take advantage of that,
> but the current ATI / NVidia cards aren't up to the task of dealing with
> POV-Ray.
>
> > The major difficulty involved would be preparing the pov-ray source to run
> > efficiently on a SIMD architecture - native code may run out of the box
> > through the provided compiler, but the results would be poor at best
> > without optimization to take advantage of the particular memory hierarchies
> > involved.
>
> Once the 3.7 source is out, it should be much easier, as the major task
> is simply fitting it to a parallel paradigm.  Having already done that,
> porting to different parallel architectures should be trivial (relative
> to the original threading support, that is).
>
> --
> ...Ben Chambers
> www.pacificwebguy.com

Few things:

"But double precision is actually 64bit."
To be technical the number of bits used for a double is implementation
dependent.  The requirement is simply that a float <= double.  It is up to
compiler to decide how to interpret that.  Using double in lieu of float
simply indicates the desire for additional precision - not the requirement
(in C and C++).  Hence it is impossible in general to say povray is using
64 bits.  See: The C++ Programming Language (TCPL) 74-75.
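
That said, it is easy to check what a given platform actually provides - a
minimal host-side sketch using C's <float.h> (mainstream POV-Ray targets all
report the IEEE 754 values: 8 bytes, 53 mantissa bits):

  #include <stdio.h>
  #include <float.h>

  int main(void)
  {
      printf("sizeof(double) = %u bytes\n", (unsigned)sizeof(double));
      printf("DBL_MANT_DIG   = %d mantissa bits\n", DBL_MANT_DIG);
      printf("DBL_DIG        = %d decimal digits\n", DBL_DIG);
      return 0;
  }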

Compilers may have more than a few techniques to simulate 64-bit computation
on a 32-bit architecture, but I am not experienced enough in compiler design
to state them with confidence.  It's worth noting that the time lost in
doing 2 ops instead of 1 is easily regained in shifting from 1-2 processors
to an array of processors, so this is not a concern provided the utilization
of the array is sufficiently high.

CUDA is a new beast - designed for the 8800 or later generation of NVidia
cards.  This means that the vast majority of cards in use today do not
support it - and probably won't for the next 2-3 years.  According to the
specification I read, the registers are full 32-bit, not 24 as in earlier
cards.  For more detailed information, Google CUDA and browse the
documentation provided along with the SDK.

I would like to clarify that a dual-processor design is not likely to share
much in common with a SIMD (single instruction, multiple data) architecture.
In particular, reaching optimal utilization when sets of processors are
required to share the same instruction stream is not terribly easy; this
requirement doesn't exist at all for a dual-core architecture (and is one of
the reasons dual-core moved into consumer systems).  General ray-tracing does
show great potential, though, since it's intuitively rare to have an image
where the number of rays to evaluate varies rapidly across the image.
(Refracting through a 'bag of marbles' would be a good counterexample,
though - a small shift in direction would wildly vary the number of rays to
compute.  I doubt there is any tractable deterministic method to check
whether this is the case, though.)
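
To make the 'bag of marbles' worry concrete, here is a toy kernel with a
data-dependent loop.  Threads in a warp execute in lockstep on a SIMD
machine, so the whole warp takes as long as its deepest pixel.  The
bounce_count array is a hypothetical per-pixel workload, not anything
POV-Ray actually exposes:

  __global__ void shade(float3 *fb, const int *bounce_count, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      float3 c = make_float3(0.0f, 0.0f, 0.0f);
      // Threads that finish early sit idle until the slowest thread
      // in their warp has traced its last bounce.
      for (int b = 0; b < bounce_count[i]; ++b)
          c.x += 0.01f;  // stand-in for tracing one more ray
      fb[i] = c;
  }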

Personally I am less concerned with the usefulness of the implementation
than with its experimental value.

Thanks,

Justin



From:  theCardinal
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 21 Apr 2007 23:35:01
Message: <web.462ad74befc6bb7494cf37ec0@news.povray.org>
"Q: Does CUDA support Double Precision Floating Point arithmetic?

 A: CUDA supports the C "double" data type.  However on G80
    (e.g. GeForce 8800) GPUs, these types will get demoted to 32-bit
    floats.  NVIDIA GPUs supporting double precision in hardware will
    become available in late 2007."
Source: NVidia CUDA release notes version 0.8.

My guess is that where NVidia goes ATI has already gone or is not far
behind.

Justin



From:  theCardinal
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 21 Apr 2007 23:50:01
Message: <web.462ada77efc6bb7494cf37ec0@news.povray.org>
Warp <war### [at] tagpovrayorg> wrote:
> Chambers <ben### [at] pacificwebguycom> wrote:
> > Perhaps if
> > Intel surprises everyone, and releases their next graphics chip as a
> > double precision FP monsters, we'd be able to take advantage of that,
>
>   Exactly how would it be different from current FPUs?
>
> --
>                                                           - Warp

There are 2 typical variants of FPU around right now - 32-bit and 64-bit,
matching the type of processor they are included on.  Having a 64-bit FPU
is necessary for doing 64-bit arithmetic in hardware - but is not
sufficient.  It would also require a 64-bit operating system such as
Windows XP x64, Vista 64-bit, or a version of Unix compiled for 64-bit
systems.  All mainstream 64-bit processors can run in a limited 32-bit mode
to back-support 32-bit execution, in which case the precision available
is still only 32-bit in hardware.

Software packages exist for manipulating floating point data in various
extended formats - but as Mr. Chambers mentioned they are not as efficient
as hardware support.

Justin



From: Thorsten Froehlich
Subject: Re: CUDA - NVIDIA's massively parallel programming architecture
Date: 22 Apr 2007 00:49:56
Message: <462ae974@news.povray.org>
_theCardinal wrote:
> There are 2 typical variants of FPU around right now - 32-bit and 64-bit,
> matching the type of processor they are included on.  Having a 64-bit FPU
> is necessary for doing 64-bit arithmetic in hardware - but is not
> sufficient.  It would also require a 64-bit operating system such as
> Windows XP x64, Vista 64-bit, or a version of Unix compiled for 64-bit
> systems.  All mainstream 64-bit processors can run in a limited 32-bit mode
> to back-support 32-bit execution, in which case the precision available
> is still only 32-bit in hardware.

Wrong, unless you meant to write CPU.

FPUs on x86, as well as virtually all non-embedded-use-only processors,
support at least 64-bit floating point precision in hardware.

	Thorsten, POV-Team


