POV-Ray : Newsgroups : povray.off-topic : Suggestion: OpenCL
Suggestion: OpenCL (Message 63 to 72 of 72)
From: Invisible
Subject: Re: Suggestion: OpenCL
Date: 18 Aug 2009 09:20:34
Message: <4a8aaaa2$1@news.povray.org>
>> I was under the impression that all cores in the bunch would have to 
>> take the same branch of the if - you can't have half go one way and 
>> half go the other. (That's what makes it a SIMD architecture.)
> 
> I'm sorry for jumping in, here, but surely you mean SMP, not SIMD?
> 
> SIMD is single-instruction, multiple-data, e.g. performing an add on 4 
> WORD values simultaneously with one instruction.

Indeed. And on the GPU, you can say "for all these thirty pixels you're 
processing, multiply each one by the corresponding texture pixel". For 
example. One instruction, executed on 30 different pairs of data values. 
SIMD.
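A toy model of that idea in Python, purely for illustration (the 30-wide lane count and the values are invented; real GPU hardware does this in lockstep with one instruction, not a loop):

```python
# One "instruction" applied across all 30 lanes at once. Plain Python
# has to simulate the lockstep with a loop, but conceptually there is
# a single multiply operating on 30 pairs of values.
def simd_mul(pixels, texels):
    return [p * t for p, t in zip(pixels, texels)]

pixels = list(range(30))   # 30 pixel values processed together
texels = [2] * 30          # the corresponding texture pixels
out = simd_mul(pixels, texels)
```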


Post a reply to this message

From: Invisible
Subject: Re: Suggestion: OpenCL
Date: 18 Aug 2009 09:24:20
Message: <4a8aab84$1@news.povray.org>
>> I was under the impression that all cores in the bunch would have to 
>> take the same branch of the if - you can't have half go one way and 
>> half go the other. (That's what makes it a SIMD architecture.)
> 
> In fact, I think NVidia does allow alternate code paths - but the unit 
> executes both code paths, and then discards the one you don't need.

That also seems plausible. (I'd think sorting the data into two separate 
queues would be more efficient. But obviously I haven't measured it...)
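A minimal sketch of that execute-both-paths scheme (predication), with invented branch bodies; the point is that every lane runs both sides and a per-lane mask keeps the result each lane actually wanted:

```python
# Predicated "if": all lanes execute both branches, then the mask
# discards the branch each lane didn't take.
def predicated_if(mask, values):
    then_results = [v * 2 for v in values]     # every lane runs the "then" side
    else_results = [v + 100 for v in values]   # ...and the "else" side
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]
```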

>> Again, depends on whether you're writing a shader, or using a GPGPU 
>> system.
> 
> Same thing nowadays; it's just that the shaders are written in a dialect 
> of C that's more capable than the ones used for graphics.

More like, if you use GPGPU, you don't have to pretend that your program 
is a "shader" that takes "pixels" and generates other "pixels". You can 
structure it more freely.

>> But they don't allow recursion. That's the issue.
> 
> To be precise, they don't allow indefinite recursion.  You can actually 
> have recursion where you specify the number of levels at compile time. 
> After that, however, you can't change it without compiling a new shader.

In other words, the compiler can unroll the loops for you. ;-)
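A quick sketch of what that unrolling amounts to, with an invented toy "shade" model (the real thing happens inside the shader compiler, and the depth is fixed when the shader is compiled):

```python
# Recursion bounded by a compile-time constant...
MAX_DEPTH = 3

def trace_recursive(strength, depth=MAX_DEPTH):
    if depth == 0:
        return 0.0
    # Each bounce contributes half the previous strength (made-up model).
    return strength + trace_recursive(strength * 0.5, depth - 1)

# ...is mechanically equivalent to straight-line code, which is all the
# GPU ever sees.
def trace_unrolled(strength):
    s0 = strength
    s1 = s0 * 0.5
    s2 = s1 * 0.5
    return s0 + s1 + s2
```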

>> Do you know how hard it is to draw a cube made of 8 cubes and measure 
>> all their sides? Do you know how long it takes to expand (x+y)^9 
>> manually, by hand? Do you have any idea how long it takes to figure 
>> out what the pattern is and where it's coming from?
>>
>> ...and then I discover some entry-level textbook that tells me how 
>> some dead guy already figured all this out several centuries ago. AND 
>> FOR ARBITRARY EXPONENTS!! >_<
> 
> Isn't that what Pascal's triangle is for?

Yes, I realise that *now*. :-P
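For the record, the result the textbook had: the coefficients of (x+y)^n are row n of Pascal's triangle, for any exponent, and the standard library can produce them directly:

```python
from math import comb

# Binomial theorem: (x+y)^n expands with coefficients C(n, k),
# i.e. row n of Pascal's triangle.
def binomial_coefficients(n):
    return [comb(n, k) for k in range(n + 1)]
```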

>> Why do I bother?
> 
> Be honest, it's because you enjoy it :)

Partly. But also because I keep hoping somebody will be impressed. 
(Stupid, I know...)


Post a reply to this message

From: Invisible
Subject: Re: Suggestion: OpenCL
Date: 18 Aug 2009 09:25:35
Message: <4a8aabcf@news.povray.org>
scott wrote:
>>> I was thinking more about how to store the scene on the GPU 
>>> efficiently, if you just have a triangle list it is relatively simple.
>>
>> Again, depends on whether you're writing a shader, or using a GPGPU 
>> system.
> 
> It still runs on the same hardware though, which is highly optimised for 
> graphics operations, you need to be aware of this no matter how you 
> shape your code.

Sure. I'm just saying GPGPU gives you a little more freedom (at the cost 
of, currently, being manufacturer-specific).

>> But they don't allow recursion.
> 
> Yes that was what I said earlier, for something like raytracing to 
> unlimited depths I think the CPU needs to step in every so often to 
> process the current set of rays (ie kill and spawn rays where appropriate).
> 
> Of course if you want to limit yourself to a fixed depth (ie a max trace 
> level of 8 or whatever) then you can "unroll" the recursion so it works 
> on the GPU.  Might not be as efficient as letting the CPU do it though.

I'm just thinking that if the CPU has to intervene that much, the PCI 
bus is rapidly going to become a bottleneck to the whole system.
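The scheme being discussed, roughly, as a hypothetical sketch (all names and the toy ray model are invented; on real hardware the inner loop would be one GPU pass, with the CPU only killing and spawning rays between passes):

```python
# Wavefront-style tracing: process one bounce for a whole batch of rays,
# then let the "CPU" cull dead rays and queue spawned ones for the next pass.
def wavefront_trace(rays, trace_one_bounce, max_passes=8):
    results = []
    for _ in range(max_passes):
        if not rays:
            break
        survivors = []
        for ray in rays:                  # stand-in for one GPU pass
            hit, spawned = trace_one_bounce(ray)
            results.append(hit)
            survivors.extend(spawned)     # CPU kills/spawns between passes
        rays = survivors
    return results
```

Every CPU/GPU round trip here is traffic over the bus, which is exactly the bottleneck worry above.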


Post a reply to this message

From: scott
Subject: Re: Suggestion: OpenCL
Date: 18 Aug 2009 09:51:32
Message: <4a8ab1e4$1@news.povray.org>
> Indeed. And on the GPU, you can say "for all these thirty pixels you're 
> processing, multiply each one by the corresponding texture pixel". For 
> example. One instruction, executed on 30 different pairs of data values. 
> SIMD.

If you've ever done array processing with e.g. MatLab, you will be familiar 
with how to restructure algorithms to work in this sort of environment.

Something like "OUTPUT = A * B + (1-A) * C" is a single instruction that 
can operate on every value of the array, but essentially lets you choose 
output B or C based on the value of A.  This is often very useful and fast 
for converting typical one-value-at-a-time algorithms.
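That blend trick, spelled out elementwise (values invented; in MatLab or on a GPU the whole thing is one array operation with no per-element branch):

```python
# OUTPUT = A*B + (1-A)*C: selects B where A is 1 and C where A is 0,
# using pure arithmetic instead of a branch. Fractional A blends.
def blend(a, b, c):
    return [ai * bi + (1 - ai) * ci for ai, bi, ci in zip(a, b, c)]
```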

Reminds me of a built-in MatLab function to convert an RGB image to HSV. 
The function was actually looping through every pixel and calling the 
convert function (which had several if's in it).  I rewrote the conversion 
function to work on whole arrays at a time and it was orders of magnitude 
faster.  That's what you need to do for GPU programming too.


Post a reply to this message

From: Invisible
Subject: Re: Suggestion: OpenCL
Date: 18 Aug 2009 09:57:56
Message: <4a8ab364$1@news.povray.org>
>> Indeed. And on the GPU, you can say "for all these thirty pixels 
>> you're processing, multiply each one by the corresponding texture 
>> pixel". For example. One instruction, executed on 30 different pairs 
>> of data values. SIMD.
> 
> If you've ever done array processing with e.g. MatLab, you will be 
> familiar with how to restructure algorithms to work in this sort of 
> environment.

Indeed, this is part of what I hated about Matlab: if it isn't an array, 
you can't do anything with it. (That and the absurd syntax...)

> Things like doing "OUTPUT = A * B + (1-A) * C" is a single instruction 
> that can operate on every value of the array, but essentially lets you 
> choose output B or C based on the value of A.  This is often very useful 
> and fast for converting typical one-value-at-a-time algorithms.

Me being me, I would have expected a conditional statement to be faster 
than a redundant computation.

I guess back in the days before FPUs, when the RAM was faster than the 
CPU, that might even have been true. But today it seems it doesn't 
matter how inefficient an algorithm is, just so long as it has good 
cache behaviour and doesn't stall the pipeline. *sigh*

> Reminds me of a built-in MatLab function to convert an RGB image to HSV. 
> The function was actually looping through every pixel and calling the 
> convert function (which had several if's in it).  I rewrote the 
> conversion function to work on whole arrays at a time and it was orders 
> of magnitude faster.  That's what you need to do for GPU programming too.

Whenever you have a system like MatLab or SQL which is inherently 
designed to do parallel processing, letting the intensively-tuned 
parallel engine do its stuff rather than explicitly looping yourself is 
always, always going to be faster. ;-)


Post a reply to this message

From: Darren New
Subject: Re: Suggestion: OpenCL
Date: 18 Aug 2009 11:26:21
Message: <4a8ac81d$1@news.povray.org>
scott wrote:
> That's what you need to do for GPU programming too.

And APL, and J. :-) I did the same thing with APL graphics the professor had 
written, and dropped the time from 15 minutes to 20 seconds.

-- 
   Darren New, San Diego CA, USA (PST)
   "We'd like you to back-port all the changes in 2.0
    back to version 1.0."
   "We've done that already. We call it 2.0."


Post a reply to this message

From: andrel
Subject: Re: Suggestion: OpenCL
Date: 18 Aug 2009 14:51:34
Message: <4A8AF83A.6050503@hotmail.com>
On 18-8-2009 15:57, Invisible wrote:
>>> Indeed. And on the GPU, you can say "for all these thirty pixels 
>>> you're processing, multiply each one by the corresponding texture 
>>> pixel". For example. One instruction, executed on 30 different pairs 
>>> of data values. SIMD.
>>
>> If you've ever done array processing with e.g. MatLab, you will be 
>> familiar with how to restructure algorithms to work in this sort of 
>> environment.
> 
> Indeed, this is part of what I hated about Matlab; If it isn't an array, 
> you can't do anything with it. (That and the absurd syntax...)

Sometimes I wish you would stop flaunting your ignorance; this might be 
one of those times.

>> Things like doing "OUTPUT = A * B + (1-A) * C" is a single instruction 
>> that can operate on every value of the array, but essentially lets you 
>> choose output B or C based on the value of A.  This is often very 
>> useful and fast for converting typical one-value-at-a-time algorithms.
> 
> Me being me, I would have expected a conditional statement to be faster 
> than a redundant computation.
> 
> I guess back in the days before FPUs, when the RAM was faster than the 
> CPU, that might even have been true. But today it seems it doesn't 
> matter how inefficient an algorithm is, just so long as it has good 
> cache behaviour and doesn't stall the pipeline. *sigh*

Actually it is a different way of thinking. Remember Battle Chess? The PC 
did not have a language or paradigm that would make something like that 
seem feasible. The Amiga did, with its combined CPU and blitter 
architecture. Having seen the Amiga example, it is easy to see how you 
can simulate that in software. So Battle Chess on the PC could not have 
been developed, not because it was technically impossible, but because 
it would have been too slow if you had designed it using the 
conventional un-parallelized paradigms.

I have probably mentioned it here before, but I once wrote an Ising 
model simulation on an Amiga, 256 by 256 (IIRC) at 7 frames per second 
(in 1988 +-1), using almost exclusively the blitter. For that I had to 
implement my own bitwise addition and subtraction, which is easy: just a 
number of AND and XOR blit operations (if you don't know how to do it, I 
know just the course for you ;) ). I also needed each spin to flip with 
a probability that depends on the temperature. I even solved that with 
almost only the blitter. (Left as an exercise to the reader/mascot.) I 
think it was several orders of magnitude faster than what I could have 
accomplished if I had done it the traditional way with loops, IFs and 
floating point comparisons to random numbers.
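The AND/XOR addition in question, written out for single words rather than blit planes (the blitter applies the same two operations to whole bitplanes at once):

```python
# Addition from only AND and XOR, as a blitter provides:
# XOR adds without carry, AND finds where the carries go.
def blit_add(a, b):
    while b:
        carry = (a & b) << 1   # carry bits, shifted into position
        a = a ^ b              # sum without carry
        b = carry              # repeat until no carries remain
    return a
```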

>> Reminds me of a built-in MatLab function to convert an RGB image to 
>> HSV. The function was actually looping through every pixel and calling 
>> the convert function (which had several if's in it).  I rewrote the 
>> conversion function to work on whole arrays at a time and it was 
>> orders of magnitude faster.  That's what you need to do for GPU 
>> programming too.
> 
> Whenever you have a system like MatLab or SQL which is inherently 
> designed to do parallel processing, letting the intensively-tuned 
> parallel engine do its stuff rather than explicitly looping yourself is 
> always, always going to be faster. ;-)
If only because the interpreter is rather slow because of the complexity 
of the language.


Post a reply to this message

From: scott
Subject: Re: Suggestion: OpenCL
Date: 19 Aug 2009 02:58:41
Message: <4a8ba2a1@news.povray.org>
> I guess back in the days before FPUs, when the RAM was faster than the 
> CPU, that might even have been true. But today it seems it doesn't matter 
> how inefficient an algorithm is, just so long as it has good cache 
> behaviour and doesn't stall the pipeline. *sigh*

See, to me, if an algorithm has good cache behaviour and doesn't stall the 
pipeline, that seems like an efficient algorithm *for modern PCs*.  Of 
course, if you use "how fast would it run on my Amiga" as your indicator of 
efficiency, then no, it doesn't need to be an "efficient" algorithm to be 
efficient on a modern PC.


Post a reply to this message

From: scott
Subject: Re: Suggestion: OpenCL
Date: 20 Aug 2009 06:02:59
Message: <4a8d1f53@news.povray.org>
> I'm just thinking that if the CPU has to intervene that much, the PCI bus 
> is rapidly going to become a bottleneck to the whole system.

It seems that current GPU raytracers are 5-6x slower than CPU ones:

http://graphics.stanford.edu/papers/gpu_kdtree/gpu_kdtree.ppt

So there doesn't seem to be much benefit, even if you could easily utilise 
the GPU from POV.


Post a reply to this message

From: Invisible
Subject: Re: Suggestion: OpenCL
Date: 20 Aug 2009 06:35:40
Message: <4a8d26fc$1@news.povray.org>
scott wrote:

> It seems that current GPU raytracers are 5-6x slower than CPU ones:
> 
> http://graphics.stanford.edu/papers/gpu_kdtree/gpu_kdtree.ppt
> 
> So there doesn't seem to be much benefit, even if you could easily 
> utilise the GPU from POV.

I vaguely recall this came up on one of the Haskell lists recently. 
Discussion along the lines of

- Team X wrote a raytracer and it was 5x slower on the GPU than on the CPU.

- Team Y wrote another raytracer which was 2x faster on the GPU than on 
the CPU. Team X just did the job badly.

I don't have any references tho...


Post a reply to this message


Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.