Subject: Re: Optimizations...
From: Invisible
Date: 23 Jul 2009 06:42:32
Message: <4a683e98$1@news.povray.org>
>> I find it particularly disturbing that merely changing the order in 
>> which the bytes reside in memory produces a 50x speedup.
> 
> If you get the bytes in the "wrong" order then the CPU is going to be 
> stalled the whole time waiting for fetches from normal RAM.  Get them in 
> the "right" order and the CPU can whizz through at full speed, while the 
> cache is updated almost in parallel from normal RAM.

Oh, sure, I understand *why* this happens. But it is very, very 
disturbing to realise that merely executing the same instructions in a 
different order, or storing the same data in a different order, can have 
such a vast effect on performance. It seems we have reached an age where 
the complexity class of an algorithm is almost irrelevant, and the only 
thing that matters is the low-level detail of exactly how the algorithm 
has been implemented on a specific hardware platform.

This is the antithesis of high-level programming.
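(To make the access-order point concrete, here's a minimal Haskell 
sketch of my own; the names and sizes are made up, not anything from 
the benchmark under discussion. Both sums do identical arithmetic on 
identical data; only the traversal order differs, so only the cache 
behaviour differs.)

import Data.Array.Unboxed

n :: Int
n = 1000

-- A flat, row-major n*n matrix of doubles.
m :: UArray Int Double
m = listArray (0, n*n - 1) (map fromIntegral [0 .. n*n - 1])

sumRowMajor, sumColMajor :: Double
-- Consecutive indices: walks memory sequentially, cache-friendly.
sumRowMajor = sum [ m ! (i*n + j) | i <- [0..n-1], j <- [0..n-1] ]
-- Same elements, but striding by n each step: a cache miss per element.
sumColMajor = sum [ m ! (i*n + j) | j <- [0..n-1], i <- [0..n-1] ]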

>> This practically guarantees that no high-level programming language 
>> can ever be fast.
> 
> Surely the compiler just needs to be a bit clever about how it does stuff?

Sure. Because working out what order an arbitrary program wants to 
access its data in is not, in any way, equivalent to the halting 
problem. No sir!

Of course, matrix multiplication is a trivially simple task. (That's why 
they chose it.) But now consider building a huge, highly complex program 
that does a crapload of numerical work, and trying to figure out where 
the hell the performance is going, and how to fix it... Good luck with that.
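(For reference, the whole task really does fit in a few lines; this is 
my own naive Haskell rendering, reusing the flat row-major layout from 
the sketch above. Note the b ! (k*n + j) access: it strides through b 
by a whole row per step, which is exactly where the cache-aware 
versions win.)

import Data.Array.Unboxed

-- Naive O(n^3) multiply of two flat, row-major n*n matrices.
matMul :: Int -> UArray Int Double -> UArray Int Double -> UArray Int Double
matMul n a b = listArray (0, n*n - 1)
  [ sum [ a ! (i*n + k) * b ! (k*n + j) | k <- [0..n-1] ]
  | i <- [0..n-1], j <- [0..n-1] ]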

>> Question: How much faster does this go on the GPU? [I'm going to stick 
>> my neck out and say it'd take longer to transfer the data to the GPU 
>> than to actually do the calculation.]
> 
> According to Wikipedia the PCI-X bus is 1064 MB/s, but I don't know if 
> that would be the limiting factor or not.
> 
> Anyway I found this:
> 
> http://graphics.stanford.edu/papers/gpumatrixmult/
> 
> It seems that an ATI X800XT can beat a P4 3GHz, but I have no idea how 
> today's latest hardware would work out.

Hmm, interesting...
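(Back-of-envelope, assuming 1000x1000 matrices of doubles: each matrix 
is 8 MB, so shipping two operands in and one result out is 24 MB, or 
roughly 23 ms at 1064 MB/s. The multiply itself is 2 x 10^9 
floating-point operations, which is a sizeable fraction of a second 
even at several GFLOPS. So at this size the transfer probably *isn't* 
the bottleneck after all: O(n^2) data against O(n^3) work.)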

>> Do any current GPUs support double-precision yet?
> 
> Not very many; really only the ones aimed at non-graphics work.

Right, OK.

>>> Next assignment, do the same activity on the inverse matrix function :-)
>>
>> Hehehe, yeah... I'm guessing that's going to be a tad slower. ;-)
> 
> But also an important part of lots of simulation software.

I'm not disagreeing. ;-)

>> Well, the initial version took "only" 5 hours on a dual quad-Xeon 3.15 
>> GHz server. How slow can it possibly be on my ancient 32-bit 
>> single-core AMD Athlon 1700+ 1.5 GHz? :-P
> 
> I think the initial version was only using a single core.

No, *all* the versions except the final one were using only a single core.

> Also does 
> Haskell do any of the optimisations at all mentioned, or is it basically 
> going to be comparable performance to the initial version?

Depends.

Would you like me to use lists, immutable arrays, mutable arrays, 
unboxed arrays or parallel arrays? Should I use 2D arrays or 1D arrays? 
Should I allow both integers and doubles, or stick to integers only?

Much like in Java or C, there are multiple ways you could code this in 
Haskell, each with different benefits and drawbacks. (E.g., I can 
practically guarantee that using singly-linked lists will be 
astronomically inefficient in both time and space...)
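(Just to illustrate the spread, here's the idiomatic list-of-lists 
version; again my own sketch, not any of the benchmark code. It's the 
shortest to write, and also the one I'd expect to be astronomically 
slow, since every element is a heap-allocated, pointer-chased box.)

import Data.List (transpose)

-- Matrices as lists of rows. Short, pretty, and cache-hostile.
matMulL :: [[Double]] -> [[Double]] -> [[Double]]
matMulL a b = [ [ sum (zipWith (*) row col) | col <- bt ] | row <- a ]
  where bt = transpose b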

Haskell offers some interesting possibilities not available in Java or 
C. For example, Haskell has lightweight threads, and primitives for 
"easily" turning sequential code into parallel code. It also has 
parallel arrays, which are pretty much designed specifically for this 
exact kind of problem. [Then again, I don't have a multicore box...]
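(For instance, with the parallel package you can bolt parallelism onto 
the list version above in one line. A sketch, assuming the current 
Control.Parallel.Strategies API rather than whatever my GHC actually 
ships with:)

import Control.Parallel.Strategies (using, parListChunk, rdeepseq)

-- Evaluate the result rows in parallel, 64 rows per spark.
matMulPar :: [[Double]] -> [[Double]] -> [[Double]]
matMulPar a b = matMulL a b `using` parListChunk 64 rdeepseq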

It might be interesting to see how fast I can make this. It's kind of a 
pity I can't run their code on my PC for comparison, but I think if I 
can make Haskell do the same task in under a second I'll be happy.

(Hmm. Their fastest single-core implementation takes 0.5 seconds, on a 
leading-edge 3 GHz processor. Maybe expecting < 1 second on an obsolete 
1.5 GHz CPU is asking too much?)

In particular, parallel arrays do freaky things with memory layout which 
are supposed to improve performance. It'll be interesting to see whether 
that actually helps at all, even on a single-core box...
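(As I understand it, the freaky thing is flattening: the compiler turns 
an array of structures into a structure of arrays, so each component 
ends up contiguous in memory. Here's the idea in miniature, using plain 
unboxed arrays rather than the parallel-array machinery itself:)

import Data.Array.Unboxed

-- An array of pairs...
pairsAoS :: [(Double, Double)]
pairsAoS = [ (fromIntegral i, fromIntegral (i*i)) | i <- [0..9::Int] ]

-- ...flattened into a pair of arrays, each contiguous in memory.
pairsSoA :: (UArray Int Double, UArray Int Double)
pairsSoA = ( listArray (0,9) (map fst pairsAoS)
           , listArray (0,9) (map snd pairsAoS) )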

