POV-Ray: Newsgroups: povray.pov4.discussion.general: GPGPU again: Re: GPGPU again


From: Warp
Date: 23 May 2016 11:57:33
Message: <5743286c@news.povray.org>

jhu <nomail@nomail> wrote:
> So, CUDA and accompanying hardware has made great strides the past few years,
> including double-precision float and recursion support. How about having support
> for CUDA or OpenCL?

The problem is that CUDA (and, I assume, OpenCL) does not give you a
thousand generic CPUs, each able to run whatever code you want
independently of the others.

CUDA uses the so-called SIMT design (similar to SIMD, but not quite the
same). This means, roughly, that there is a single stream of executable
code that all the CUDA cores execute in parallel. Not only does this
mean that all the cores have to run the exact same code (i.e. you can't
just run one task on one core and a different one on another), it also
imposes certain limitations and inefficiencies.

One of the biggest inefficiencies is that conditionals may cause severe
speed penalties. That's because the CUDA cores need to stay "in sync"
with each other while executing that stream of code. If some cores
execute the body of a conditional while others don't, the others have to
wait for the ones that do, until they all "meet" at a common point.
The longer the conditional body is, the worse the penalty. (Essentially
it's as if every core were executing the longest conditional branch, even
if just one core actually does.)
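
To make the penalty concrete, here is a toy model of that lockstep
behaviour. It is plain Python, not GPU code, and only a sketch under
stated assumptions: the group size of 32 and the instruction counts are
illustrative, and warp_cost() is a made-up helper, not part of any CUDA
API. It assumes each side of a conditional is issued once for the whole
group if at least one core needs it, while the other cores sit idle.

```python
# Toy model of SIMT lockstep execution. This is not real GPU code.
# WARP_SIZE and all instruction counts are illustrative assumptions.

WARP_SIZE = 32


def warp_cost(takes_branch, then_cost, else_cost=0):
    """Instruction slots the whole group spends on one if/else.

    takes_branch: one bool per core in the group.
    Each side of the conditional is issued once if at least one
    core needs it; cores that do not need it idle in the meantime.
    """
    cost = 0
    if any(takes_branch):
        cost += then_cost  # somebody runs the "if" body
    if not all(takes_branch):
        cost += else_cost  # somebody runs the "else" body
    return cost


nobody = [False] * WARP_SIZE
just_one = [True] + [False] * (WARP_SIZE - 1)
everybody = [True] * WARP_SIZE

# A 100-instruction body with no else branch:
print(warp_cost(nobody, then_cost=100))
print(warp_cost(just_one, then_cost=100))
print(warp_cost(everybody, then_cost=100))

# With an else branch, a split group pays for both sides:
print(warp_cost(just_one, then_cost=100, else_cost=60))
```

In this model a single core taking the branch costs the group as much
as all 32 taking it, and a group that splits across an if/else pays for
both sides one after the other.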

(Although that "all cores must run the same code" rule might not be quite
that simple. If I'm not mistaken, graphics cards in fact have several
"pipelines" which are able to run independent code in parallel, each
using its own portion of the CUDA cores. Something like 8 such
"pipelines", each with 40 CUDA cores, meaning that you can run 8 different
tasks, each task being able to use 40 cores. Or something along those lines.
The main purpose of this is, AFAIK, to be able to render polygons with
different shaders in parallel, but CUDA repurposes this to run any code
you want.)

There are of course other limitations, such as the cost of transferring
data between main RAM and the GPU's RAM.
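
A back-of-the-envelope way to think about the transfer cost, again as a
plain Python sketch. Everything here is an assumption for illustration:
offload_time() and worth_offloading() are made-up helpers, and the byte
counts, bus speed and timings are placeholder numbers, not measurements
of any real card.

```python
# Rough cost model for offloading a job to the GPU.
# Every number below is a made-up placeholder, not a measurement.


def offload_time(bytes_in, bytes_out, bus_bytes_per_s, gpu_compute_s):
    """Seconds for copy-to-GPU + compute + copy-back, done one after another."""
    transfer_s = (bytes_in + bytes_out) / bus_bytes_per_s
    return transfer_s + gpu_compute_s


def worth_offloading(cpu_compute_s, **gpu_job):
    """True if the GPU route, transfers included, beats the CPU time."""
    return offload_time(**gpu_job) < cpu_compute_s


# Hypothetical job: 1 GB in, 1 GB out, a 10 GB/s bus, 0.05 s of GPU work.
job = dict(
    bytes_in=1_000_000_000,
    bytes_out=1_000_000_000,
    bus_bytes_per_s=10_000_000_000,
    gpu_compute_s=0.05,
)

print(offload_time(**job))
print(worth_offloading(cpu_compute_s=0.2, **job))
print(worth_offloading(cpu_compute_s=1.0, **job))
```

With these placeholder numbers the copies take four times longer than
the GPU computation itself, so the offload only pays off when the CPU
would need clearly longer than the transfers alone.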

Programs that use CUDA need to be specifically designed with these
limitations in mind.

-- 
                                                          - Warp
