>> I was thinking more about how to store the scene on the GPU efficiently,
>> if you just have a triangle list it is relatively simple.
>
> Again, depends on whether you're writing a shader, or using a GPGPU
> system.
It still runs on the same hardware, though, which is highly optimised for
graphics operations; you need to be aware of this no matter how you shape
your code.
> But they don't allow recursion.
Yes, that's what I said earlier: for something like raytracing to unlimited
depths, I think the CPU needs to step in every so often to process the
current set of rays (i.e. kill and spawn rays where appropriate).
Of course, if you want to limit yourself to a fixed depth (i.e. a max trace
level of 8 or whatever), then you can "unroll" the recursion so it works on
the GPU. Might not be as efficient as letting the CPU do it, though.
Invisible wrote:
> I was under the impression that all cores in the bunch would have to
> take the same branch of the if - you can't have half go one way and half
> go the other. (That's what makes it a SIMD architecture.)
I'm sorry for jumping in here, but surely you mean SMP, not SIMD?
SIMD is single-instruction, multiple-data: e.g. performing an add on 4
WORD values simultaneously with one instruction.
--
~Mike
>> I was under the impression that all cores in the bunch would have to
>> take the same branch of the if - you can't have half go one way and
>> half go the other. (That's what makes it a SIMD architecture.)
>
> I'm sorry for jumping in, here, but surely you mean SMP, not SIMD?
>
> SIMD is single-instruction multiple data.. e.g. performing an add on 4
> WORD values simultaneously with one instruction.
Indeed. And on the GPU, you can say "for all these thirty pixels you're
processing, multiply each one by the corresponding texture pixel". For
example. One instruction, executed on 30 different pairs of data values.
SIMD.
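In NumPy terms (NumPy standing in here for the GPU's lockstep execution, and the
numbers being made up), that looks like:

```python
import numpy as np

# 30 pixel values and their 30 corresponding texture values: one vectorised
# multiply plays the role of the single SIMD instruction applied to all pairs.
pixels  = np.linspace(0.0, 1.0, 30)
texture = np.full(30, 0.5)
shaded  = pixels * texture    # 30 multiplications, no explicit loop
```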
>> I was under the impression that all cores in the bunch would have to
>> take the same branch of the if - you can't have half go one way and
>> half go the other. (That's what makes it a SIMD architecture.)
>
> In fact, I think NVidia does allow alternate code paths - but the unit
> executes both code paths, and then discards the one you don't need.
That also seems plausible. (I'd think sorting the data into two separate
queues would be more efficient. But obviously I haven't measured it...)
>> Again, depends on whether you're writing a shader, or using a GPGPU
>> system.
>
> Same thing nowadays; it's just that the shaders are written in a dialect of C
> that's more capable than the ones used for graphics.
More like, if you use GPGPU, you don't have to pretend that your program
is a "shader" that takes "pixels" and generates other "pixels". You can
structure it more freely.
>> But they don't allow recursion. That's the issue.
>
> To be precise, they don't allow indefinite recursion. You can actually
> have recursion where you specify the number of levels at compile time.
> After that, however, you can't change it without compiling a new shader.
In other words, the compiler can unroll the loops for you. ;-)
>> Do you know how hard it is to draw a cube made of 8 cubes and measure
>> all their sides? Do you know how long it takes to expand (x+y)^9
>> manually, by hand? Do you have any idea how long it takes to figure
>> out what the pattern is and where it's coming from?
>>
>> ...and then I discover some entry-level textbook that tells me how
>> some dead guy already figured all this out several centuries ago. AND
>> FOR ARBITRARY EXPONENTS!! >_<
>
> Isn't that what Pascal's triangle is for?
Yes, I realise that *now*. :-P
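For the record, the dead guy's result (the binomial theorem) is a one-liner
these days, using the standard library's `math.comb`:

```python
from math import comb

# Row 9 of Pascal's triangle = the coefficients of (x + y)**9,
# and the same formula works for an arbitrary exponent n.
n = 9
coefficients = [comb(n, k) for k in range(n + 1)]
print(coefficients)  # [1, 9, 36, 84, 126, 126, 84, 36, 9, 1]
```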
>> Why do I bother?
>
> Be honest, it's because you enjoy it :)
Partly. But also because I keep hoping somebody will be impressed.
(Stupid, I know...)
scott wrote:
>>> I was thinking more about how to store the scene on the GPU
>>> efficiently, if you just have a triangle list it is relatively simple.
>>
>> Again, depends on whether you're writing a shader, or using a GPGPU
>> system.
>
> It still runs on the same hardware though, which is highly optimised for
> graphics operations, you need to be aware of this no matter how you
> shape your code.
Sure. I'm just saying GPGPU gives you a little more freedom (at the cost
of, currently, being manufacturer-specific).
>> But they don't allow recursion.
>
> Yes that was what I said earlier, for something like raytracing to
> unlimited depths I think the CPU needs to step in every so often to
> process the current set of rays (ie kill and spawn rays where appropriate).
>
> Of course if you want to limit yourself to a fixed depth (ie a max trace
> level of 8 or whatever) then you can "unroll" the recursion so it works
> on the GPU. Might not be as efficient as letting the CPU do it though.
I'm just thinking that if the CPU has to intervene that much, the PCI
bus is rapidly going to become a bottleneck for the whole system.
> Indeed. And on the GPU, you can say "for all these thirty pixels you're
> processing, multiply each one by the corresponding texture pixel". For
> example. One instruction, executed on 30 different pairs of data values.
> SIMD.
If you've ever done array processing with e.g. MatLab, you'll be familiar
with how to restructure algorithms to work in this sort of environment.
Something like "OUTPUT = A * B + (1-A) * C" is a single instruction that
can operate on every value of the array, but essentially lets you choose
output B or C based on the value of A. This is often very useful and fast
for converting typical one-value-at-a-time algorithms.
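A NumPy sketch of the same trick (NumPy standing in for MatLab here, with made-up
values): the "if" is replaced by arithmetic that runs over the whole array at once.

```python
import numpy as np

# A acts as a mask: where A is 1 the result takes B, where A is 0 it takes C.
A = np.array([1.0, 0.0, 1.0, 0.0])
B = np.array([10.0, 10.0, 10.0, 10.0])
C = np.array([20.0, 20.0, 20.0, 20.0])
output = A * B + (1 - A) * C
print(output)  # [10. 20. 10. 20.]
```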
Reminds me of a built-in MatLab function to convert an RGB image to HSV.
The function was actually looping through every pixel and calling the
convert function (which had several if's in it). I rewrote the conversion
function to work on whole arrays at a time and it was orders of magnitude
faster. That's what you need to do for GPU programming too.
>> Indeed. And on the GPU, you can say "for all these thirty pixels
>> you're processing, multiply each one by the corresponding texture
>> pixel". For example. One instruction, executed on 30 different pairs
>> of data values. SIMD.
>
> If you've ever done array processing with e.g. MatLab, you'll be
> familiar with how to restructure algorithms to work in this sort of
> environment.
Indeed, this is part of what I hated about MatLab: if it isn't an array,
you can't do anything with it. (That and the absurd syntax...)
> Things like doing "OUTPUT = A * B + (1-A) * C" is a single instruction
> that can operate on every value of the array, but essentially lets you
> choose output B or C based on the value of A. This is often very useful
> and fast for converting typical one-value-at-a-time algorithms.
Me being me, I would have expected a conditional statement to be faster
than a redundant computation.
I guess back in the days before FPUs, when the RAM was faster than the
CPU, that might even have been true. But today it seems it doesn't
matter how inefficient an algorithm is, just so long as it has good
cache behaviour and doesn't stall the pipeline. *sigh*
> Reminds me of a built-in MatLab function to convert an RGB image to HSV.
> The function was actually looping through every pixel and calling the
> convert function (which had several if's in it). I rewrote the
> conversion function to work on whole arrays at a time and it was orders
> of magnitude faster. That's what you need to do for GPU programming too.
Whenever you have a system like MatLab or SQL which is inherently
designed for parallel processing, letting the intensively-tuned
parallel engine do its stuff rather than explicitly looping yourself is
always, always going to be faster. ;-)
scott wrote:
> That's what you need to do for GPU programming too.
And APL, and J. :-) I did the same thing with APL graphics the professor had
written, and dropped the time from 15 minutes to 20 seconds.
--
Darren New, San Diego CA, USA (PST)
"We'd like you to back-port all the changes in 2.0
back to version 1.0."
"We've done that already. We call it 2.0."
On 18-8-2009 15:57, Invisible wrote:
>>> Indeed. And on the GPU, you can say "for all these thirty pixels
>>> you're processing, multiply each one by the corresponding texture
>>> pixel". For example. One instruction, executed on 30 different pairs
>>> of data values. SIMD.
>>
>> If you've ever done a array processing with eg MatLab you will be
>> familiar with how to restructure algorithms to work in this sort of
>> environment.
>
> Indeed, this is part of what I hated about Matlab; If it isn't an array,
> you can't do anything with it. (That and the absurd syntax...)
Sometimes I wish you would stop flaunting your ignorance; this might be
one of those times.
>> Things like doing "OUTPUT = A * B + (1-A) * C" is a single instruction
>> that can operate on every value of the array, but essentially lets you
>> choose output B or C based on the value of A. This is often very
>> useful and fast for converting typical one-value-at-a-time algorithms.
>
> Me being me, I would have expected a conditional statement to be faster
> than a redundant computation.
>
> I guess back in the days before FPUs, when the RAM was faster than the
> CPU, that might even have been true. But today it seems it doesn't
> matter how inefficient an algorithm is, just so long as it has good
> cache behaviour and doesn't stall the pipeline. *sigh*
Actually, it is a different way of thinking. Remember Battle Chess? The PC
did not have a language or paradigm that would make something like that
seem feasible. The Amiga had, with its combined CPU-and-blitter
architecture. Having seen the Amiga example, it is easy to see how you
can simulate that in software. So Battle Chess on the PC could not have
been developed, not because it was technically impossible, but because
it would have been too slow if you had designed it using the
conventional un-parallelized paradigms.
I have probably mentioned it here before, but I once wrote an Ising
model simulation on an Amiga, 256 by 256 (IIRC), at 7 frames per second
(in 1988, plus or minus a year), using almost exclusively the blitter.
For that I had to implement my own bitwise addition and subtraction.
That part is easy: just a number of AND and XOR blit operations (if you
don't know how to do it, I know just the course for you ;) ). I also had
to give each spin a chance of flipping, a probability that depends on
the temperature. I even solved that with almost only the blitter. (Left
as an exercise for the reader/mascot.) I think it was several orders of
magnitude faster than what I could have accomplished if I'd done it the
traditional way, with loops, IFs, and floating-point comparisons to
random numbers.
>> Reminds me of a built-in MatLab function to convert an RGB image to
>> HSV. The function was actually looping through every pixel and calling
>> the convert function (which had several if's in it). I rewrote the
>> conversion function to work on whole arrays at a time and it was
>> orders of magnitude faster. That's what you need to do for GPU
>> programming too.
>
> Whenever you have a system like MatLab or SQL which is inherantly
> designed to do parallel processing, letting the intensively-tuned
> parallel engine do its stuff rather than explicitly looping yourself is
> always, always goig to be faster. ;-)
If only because the interpreter is rather slow because of the complexity
of the language.
> I guess back in the days before FPUs, when the RAM was faster than the
> CPU, that might even have been true. But today it seems it doesn't matter
> how inefficient an algorithm is, just so long as it has good cache
> behaviour and doesn't stall the pipeline. *sigh*
See, to me, if an algorithm has good cache behaviour and doesn't stall the
pipeline, that seems like an efficient algorithm *for modern PCs*. Of
course, if you use "how fast would it run on my Amiga" as your indicator of
efficiency, then no, it doesn't need to be an "efficient" algorithm to be
efficient on a modern PC.