  Re: CUDA - NVIDIA's massively parallel programming architecture  
From:  theCardinal
Date: 18 Apr 2007 04:00:02
Message: <web.4625ceddefc6bb7494cf37ec0@news.povray.org>
Ben Chambers <ben### [at] pacificwebguycom> wrote:
> PaulSh wrote:
> > POV-Ray is finally making its transition to SMP systems, which is great news
> > for those of us who can afford them. However, in terms of raw CPU power the
> > latest NVIDIA graphics cards would seem to completely blow away anything in
> > the way of SMP solutions this side of a research lab. Their CUDA GPU
> > architecture is claimed to allow up to 128 independent processing units
> > each running at 1.35GHz to be thrown at computationally-intensive problems.
> > So, my first thought was not SETI or protein folding, but POV-Ray. Given
> > that V3.7 is going to be fully threaded, what would be the possibility of a
> > CUDA version? I guess that will depend on the time and abilities of someone
> > with a lot more time and a lot more ability than myself...
> >
> >
>
> No good, they're only single precision.  Plus, each shader unit would
> need access to the entire scene file, which would be a pain in the a**
> to code.
>
> ...Chambers

According to the CUDA programming guide published by NVIDIA, the GPU is
as capable as any 32-bit processor.  Higher-precision computations can be
emulated by holding a value across multiple single-precision registers
instead of a single one, if my memory serves - so I doubt precision is a
real obstacle.
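
For illustration, here is a minimal sketch of that trick - so-called
double-single arithmetic, holding one value as an unevaluated sum of two
floats.  The names and layout are mine, not anything from the CUDA toolkit:

    /* Sketch only: "double-single" addition.  One high-precision value
       is held as an unevaluated sum of two floats (hi + lo).  Uses
       Knuth's error-free two-sum; compile without unsafe floating-point
       optimizations (e.g. FMA contraction) or the error terms vanish. */
    struct dsfloat { float hi, lo; };

    __device__ dsfloat ds_add(dsfloat a, dsfloat b)
    {
        float s = a.hi + b.hi;
        float v = s - a.hi;
        float e = (a.hi - (s - v)) + (b.hi - v); /* rounding error of s */
        e += a.lo + b.lo;                        /* fold in the low parts */
        dsfloat r;
        r.hi = s + e;
        r.lo = e - (r.hi - s);                   /* renormalize */
        return r;
    }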

The device code is written in a simple extension of C; the host code can
either be written in the same dialect or drive the device through the
driver API directly, from any language capable of calling it.
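
To give a flavour of the extension, a trivial kernel and launch might look
like the following - all names are placeholders, none of this is POV-Ray
code:

    #include <cuda_runtime.h>

    /* One thread per pixel; a real trace() would replace the stub. */
    __global__ void render(float3 *fb, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height)
            return;
        fb[y * width + x] = make_float3(0.f, 0.f, 0.f); /* trace(x, y) */
    }

    /* Host side: allocate device memory, launch, copy the image back. */
    void render_image(float3 *out, int w, int h)
    {
        float3 *d_fb;
        cudaMalloc((void **)&d_fb, w * h * sizeof(float3));
        dim3 block(16, 16);
        dim3 grid((w + 15) / 16, (h + 15) / 16);
        render<<<grid, block>>>(d_fb, w, h);
        cudaMemcpy(out, d_fb, w * h * sizeof(float3),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_fb);
    }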

The major difficulty would be preparing the POV-Ray source to run
efficiently on a SIMD architecture - native code might compile out of the
box with the provided compiler, but the results would be poor at best
without optimization to take advantage of the particular memory hierarchies
involved.
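
As a rough sketch of what exploiting those hierarchies means in practice -
constant memory for values every thread reads, shared memory for per-block
scratch (all names here hypothetical):

    #include <cuda_runtime.h>

    /* Hypothetical packed camera constants: every thread reads them,
       so they belong in the cached, broadcast constant space. */
    __constant__ float c_camera[16];

    /* Assumes blockDim.x <= 256. */
    __global__ void shade(const float *rays, float *out, int n)
    {
        __shared__ float tile[256];  /* fast on-chip per-block scratch */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = rays[i];  /* coalesced global read */
        __syncthreads();
        if (i < n)
            out[i] = tile[threadIdx.x] * c_camera[0]; /* stand-in shading */
    }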

Sharing the parse trees used in the ray tracing is actually highly efficient
on this architecture - the tree would be stored in device memory as
read-only data, accessible from any thread without locking.  The host
processor would typically be responsible for building the parse tree, since
its preparation is unlikely to parallelize well.
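
Publishing the tree could be as simple as the following sketch, though the
node layout here is made up and the real POV-Ray structures would differ:

    #include <cuda_runtime.h>

    /* Hypothetical flattened tree node. */
    struct Node { int left, right; float bound[6]; };

    /* Flattened on the host, copied to the device once, then read by
       every thread without locking. */
    Node *upload_tree(const Node *host_nodes, int count)
    {
        Node *d_nodes;
        cudaMalloc((void **)&d_nodes, count * sizeof(Node));
        cudaMemcpy(d_nodes, host_nodes, count * sizeof(Node),
                   cudaMemcpyHostToDevice);
        return d_nodes;   /* pass to kernels as a const pointer */
    }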

The biggest hurdle I am aware of is load-balancing threads to make efficient
use of the available processing power - simply subdividing the image into
independent render blocks means the total time is bounded by the slowest
block, which may be unacceptably slow for any sufficiently complex scene.
Perhaps someone with more in-depth knowledge of the algorithms can determine
the limiting factors of per-ray threading or other techniques.  (This may or
may not have been addressed for the multi-core implementation - at worst,
the naive implementation still gives you 1/n CPU utilization on an n-core
system.)
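
One technique that might help - assuming hardware with global atomics - is
a work queue where blocks claim tiles dynamically rather than owning a
fixed region of the image.  A rough sketch, with the actual rendering
elided:

    /* Zeroed with cudaMemcpyToSymbol before each launch. */
    __device__ int next_tile;

    __global__ void render_tiles(float *fb, int tile_count)
    {
        __shared__ int my_tile;
        for (;;) {
            if (threadIdx.x == 0)
                my_tile = atomicAdd(&next_tile, 1); /* claim a tile */
            __syncthreads();
            if (my_tile >= tile_count)
                return;              /* queue drained, whole block exits */
            /* ... each thread shades its pixels of tile my_tile ... */
            __syncthreads();         /* finish before claiming again */
        }
    }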

Justin

see:
http://developer.nvidia.com/object/cuda.html#documentation

