From: Neeum Zawan
Subject: Optimizations...
Date: 22 Jul 2009 15:00:51
Message: <4a6761e3$1@news.povray.org>
http://stellar.mit.edu/S/course/6/fa08/6.197/courseMaterial/topics/topic2/lectureNotes/Intro_and_MxM/Intro_and_MxM.pdf


-- 
No, Taco Bell is NOT the Mexican Phone Company!



From: clipka
Subject: Re: Optimizations...
Date: 22 Jul 2009 15:40:00
Message: <web.4a676a1b9486820469042aac0@news.povray.org>
Neeum Zawan <m.n### [at] ieeeorg> wrote:
> http://stellar.mit.edu/S/course/6/fa08/6.197/courseMaterial/topics/topic2/lectureNotes/Intro_and_MxM/Intro_and_MxM.pdf

11 operations per clock cycle on a single core - man, that's good...



From: Invisible
Subject: Re: Optimizations...
Date: 23 Jul 2009 05:19:46
Message: <4a682b32$1@news.povray.org>
I find it quite scary how they can constantly gain 2x speedups. Every 
time they change something, it goes another 2x or 4x faster. You would 
think that there would be a physical limit to how fast it can possibly 
go, but no, it keeps getting faster without end. Hmm...

Just for giggles, I could implement this in Haskell and see how many 
million times slower it is. But then, since it isn't running on the same 
hardware, it would be kind of a moot comparison.



From: scott
Subject: Re: Optimizations...
Date: 23 Jul 2009 05:30:51
Message: <4a682dcb@news.povray.org>
> I find it quite scary how they can constantly gain 2x speedups. Every time 
> they change something, it goes another 2x or 4x faster. You would think 
> that there would be a physical limit to how fast it can possibly go, but 
> no, it keeps getting faster without end. Hmm...

Yeah, it makes you realise that just because you think you have some "fast" C 
code, there are probably still some big speedups to be made.

Next assignment, do the same activity on the inverse matrix function :-)

> Just for giggles, I could implement this in Haskell and see how many 
> million times slower it is.

I wonder if it would finish in one day? :-)



From: Invisible
Subject: Re: Optimizations...
Date: 23 Jul 2009 05:56:08
Message: <4a6833b8$1@news.povray.org>
>> I find it quite scary how they can constantly gain 2x speedups. Every 
>> time they change something, it goes another 2x or 4x faster. You would 
>> think that there would be a physical limit to how fast it can possibly 
>> go, but no, it keeps getting faster without end. Hmm...
> 
> Yeah, it makes you realise that just because you think you have some 
> "fast" C code, there are probably still some big speedups to be made.

I find it particularly disturbing that merely changing the order in 
which the bytes reside in memory produces a 50x speedup. This 
practically guarantees that no high-level programming language can ever 
be fast. It means that abstraction and maintainability will forever be 
the enemies of performance and efficiency.

Also... It's an 8-core server, but parallel processing gives only a 3.5x 
speedup? That's pretty interesting. ;-)

Question: How much faster does this go on the GPU? [I'm going to stick 
my neck out and say it'd take longer to transfer the data to the GPU 
than to actually do the calculation.] Do any current GPUs support 
double-precision yet?

> Next assignment, do the same activity on the inverse matrix function :-)

Hehehe, yeah... I'm guessing that's going to be a tad slower. ;-)

>> Just for giggles, I could implement this in Haskell and see how many 
>> million times slower it is.
> 
> I wonder if it would finish in one day? :-)

Well, the initial version took "only" 5 hours on a dual quad-Xeon 3.15 
GHz server. How slow can it possibly be on my ancient 32-bit single-core 
AMD Athlon 1700+ 1.5 GHz? :-P



From: scott
Subject: Re: Optimizations...
Date: 23 Jul 2009 06:22:49
Message: <4a6839f9@news.povray.org>
> I find it particularly disturbing that merely changing the order in which 
> the bytes reside in memory produces a 50x speedup.

If you get the bytes in the "wrong" order then the CPU is going to be 
stalled the whole time waiting for fetches from normal RAM.  Get them in the 
"right" order and the CPU can whizz through at full speed, while the cache 
is updated almost in parallel from normal RAM.
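
Just to illustrate (a sketch of my own, not the lecture's code; the flat 
row-major layout and the names are my own assumptions): keep the second 
matrix pre-transposed, and every inner loop becomes a purely sequential 
walk through memory.

   import qualified Data.Vector.Unboxed as U

   -- n x n matrices stored row-major in flat unboxed vectors; bT is B
   -- already transposed, so both slices below are contiguous in memory.
   multT :: Int -> U.Vector Double -> U.Vector Double -> U.Vector Double
   multT n a bT = U.generate (n * n) $ \ix ->
     let (i, j) = ix `divMod` n
         rowA  = U.slice (i * n) n a    -- row i of A, contiguous
         rowBT = U.slice (j * n) n bT   -- row j of B^T = column j of B
     in U.sum (U.zipWith (*) rowA rowBT)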

> This practically guarantees that no high-level programming language can 
> ever be fast.

Surely the compiler just needs to be a bit clever about how it does stuff?

> Question: How much faster does this go on the GPU? [I'm going to stick my 
> neck out and say it'd take longer to transfer the data to the GPU than to 
> actually do the calculation.]

According to Wikipedia the PCI-X bus is 1064 MB/s, but I don't know if that 
would be the limiting factor or not.

Anyway I found this:

http://graphics.stanford.edu/papers/gpumatrixmult/

It seems that an ATI X800XT can beat a P4 3GHz, but I have no idea how 
today's latest hardware would work out.

> Do any current GPUs support double-precision yet?

Not very many, really only the ones focussed on non-graphics work.

>> Next assignment, do the same activity on the inverse matrix function :-)
>
> Hehehe, yeah... I'm guessing that's going to be a tad slower. ;-)

But also an important part of lots of simulation software.

> Well, the initial version took "only" 5 hours on a dual quad-Xeon 3.15 GHz 
> server. How slow can it possibly be on my ancient 32-bit single-core AMD 
> Athlon 1700+ 1.5 GHz? :-P

I think the initial version was only using a single core.  Also, does 
Haskell do any of the optimisations mentioned at all, or is it basically 
going to be comparable in performance to the initial version?



From: Invisible
Subject: Re: Optimizations...
Date: 23 Jul 2009 06:42:32
Message: <4a683e98$1@news.povray.org>
>> I find it particularly disturbing that merely changing the order in 
>> which the bytes reside in memory produces a 50x speedup.
> 
> If you get the bytes in the "wrong" order then the CPU is going to be 
> stalled the whole time waiting for fetches from normal RAM.  Get them in 
> the "right" order and the CPU can whizz through at full speed, while the 
> cache is updated almost in parallel from normal RAM.

Oh, sure, I understand *why* this happens. But it is very, very 
disturbing to realise that just doing the same instructions in a 
different order, or storing the same data in a different order, can have 
such a vast effect on performance. It seems we have reached an age where 
the complexity class of an algorithm is almost irrelevant, and the only 
important property is the specific low-level details of exactly how the 
algorithm has been implemented on a specific hardware platform.

This is the antithesis of high-level programming.

>> This practically guarantees that no high-level programming language 
>> can ever be fast.
> 
> Surely the compiler just needs to be a bit clever about how it does stuff?

Sure. Because working out what order an arbitrary program wants to 
access its data in is not, in any way, equivalent to the halting 
problem. No sir!

Of course, matrix multiplication is a trivially simple task. (That's why 
they chose it.) But now consider building a huge, highly complex program 
that does a crapload of numerical work, and trying to figure out where 
the hell the performance is going, and how to fix it... Good luck with that.

>> Question: How much faster does this go on the GPU? [I'm going to stick 
>> my neck out and say it'd take longer to transfer the data to the GPU 
>> than to actually do the calculation.]
> 
> According to Wikipedia the PCI-X bus is 1064 MB/s, but I don't know if 
> that would be the limiting factor or not.
> 
> Anyway I found this:
> 
> http://graphics.stanford.edu/papers/gpumatrixmult/
> 
> It seems that an ATI X800XT can beat a P4 3GHz, but I have no idea how 
> today's latest hardware would work out.

Hmm, interesting...

>> Do any current GPUs support double-precision yet?
> 
> Not very many, really only the ones focussed on non-graphics work.

Right, OK.

>>> Next assignment, do the same activity on the inverse matrix function :-)
>>
>> Hehehe, yeah... I'm guessing that's going to be a tad slower. ;-)
> 
> But also an important part of lots of simulation software.

I'm not disagreeing. ;-)

>> Well, the initial version took "only" 5 hours on a dual quad-Xeon 3.15 
>> GHz server. How slow can it possibly be on my ancient 32-bit 
>> single-core AMD Athlon 1700+ 1.5 GHz? :-P
> 
> I think the initial version was only using a single core.

No, *all* the versions except the final one were using only a single core.

> Also, does 
> Haskell do any of the optimisations mentioned at all, or is it basically 
> going to be comparable in performance to the initial version?

Depends.

Would you like me to use lists, immutable arrays, mutable arrays, 
unboxed arrays or parallel arrays? Should I use 2D arrays or 1D arrays? 
Should I allow integers or doubles, or stick to only integers?

Much like in Java or C, there are multiple ways you could code this in 
Haskell, each with different benefits and drawbacks. (E.g., I can 
practically guarantee that using single-linked lists will be 
astronomically inefficient in time and space...)

Haskell offers some interesting possibilities not available in Java or 
C. For example, Haskell has lightweight threads, and primitives for 
"easily" turning sequential code into parallel code. It also has 
parallel arrays, which are pretty much designed specifically for this 
exact kind of problem. [Then again, I don't have a multicore box...]
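
For instance, a quick sketch of that "easy parallelism" using 
Control.Parallel.Strategies (my own toy, not the parallel arrays stuff; 
the list-of-rows representation is just for illustration): one spark per 
result row, and the runtime spreads the sparks over however many cores 
you give it.

   import Control.Parallel.Strategies (parMap, rdeepseq)
   import Data.List (transpose)

   -- Compile with -threaded and run with +RTS -N to use all the cores.
   mmulPar :: [[Double]] -> [[Double]] -> [[Double]]
   mmulPar a b = parMap rdeepseq row a
     where
       bT    = transpose b
       -- one spark per result row: dot the row with every column of B
       row r = map (sum . zipWith (*) r) bT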

It might be interesting to see how fast I can make this. It's kind of a 
pity I can't run their code on my PC for comparison, but I think if I 
can make Haskell do the same task in under a second I'll be happy.

(Hmm. Their fastest single-core implementation takes 0.5 seconds, on a 
leading-edge 3 GHz processor. Maybe expecting under a second on an obsolete 
1.5 GHz CPU is asking too much?)

In particular, parallel arrays do freaky things with memory layout which 
are supposed to improve performance. Be interesting to see if it 
actually works at all, even on a single-core box...



From: scott
Subject: Re: Optimizations...
Date: 23 Jul 2009 07:13:33
Message: <4a6845dd@news.povray.org>
> Of course, matrix multiplication is a trivially simple task. (That's why 
> they chose it.) But now consider building a huge, highly complex program 
> that does a crapload of numerical work, and trying to figure out where the 
> hell the performance is going, and how to fix it... Good luck with that.

Even in huge programs the bottlenecks are often due to only a tiny 
proportion of the code running a vast number of times (for instance, I am 
pretty sure almost all of the CPU time for my CFD solver is spent 
inverting matrices).  Compilers are already quite good at optimising 
such things, but it appears they can still do a lot more.  For example, 
they can already spot fairly basic patterns in read and write locations 
and rearrange the instructions to give better performance; adding more 
complexity just means more work for the compiler writers and handling more 
special cases.  Whether anyone can be bothered, though, is another matter...

> Would you like me to use lists, immutable arrays, mutable arrays, unboxed 
> arrays or parallel arrays? Should I use 2D arrays or 1D arrays? Should I 
> allow integers or doubles, or stick to only integers?

You could do a similar exercise to what the guy did.  Start off with a 
really "obvious" way of doing it, then try out different things and time 
each one.  Make a nice list like he did showing how much faster you can make 
it than the "initial" version.
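
For the timing itself, something like this bare-bones harness would do 
(the criterion library would be the more rigorous option; timeIt is my 
own helper, and forcing the whole result matters because of lazy 
evaluation):

   import Control.DeepSeq (NFData, force)
   import Control.Exception (evaluate)
   import Data.Time.Clock (getCurrentTime, diffUTCTime)

   -- Time one computation, forcing the entire result (not just its
   -- outermost constructor) so laziness can't defer the real work.
   timeIt :: NFData a => String -> a -> IO ()
   timeIt label x = do
     t0 <- getCurrentTime
     _  <- evaluate (force x)
     t1 <- getCurrentTime
     putStrLn (label ++ ": " ++ show (diffUTCTime t1 t0))

Then timeIt "naive" (multiply a b) and so on, one line per variant, where 
multiply is whichever version you're testing.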



From: Invisible
Subject: Re: Optimizations...
Date: 23 Jul 2009 07:30:14
Message: <4a6849c6$1@news.povray.org>
>> Of course, matrix multiplication is a trivially simple task. (That's 
>> why they chose it.) But now consider building a huge, highly complex 
>> program that does a crapload of numerical work, and trying to figure 
>> out where the hell the performance is going, and how to fix it... Good 
>> luck with that.
> 
> Even in huge programs the bottlenecks are often due to only a tiny 
> proportion of the code running a vast number of times (for instance, 
> I am pretty sure almost all of the CPU time for my CFD solver is spent 
> inverting matrices).  Compilers are already quite good at optimising 
> such things, but it appears they can still do a lot more.

I suspect few compilers are going to detect that you should transpose 
one matrix while performing matrix multiplication just by inspecting the 
source code for the matrix multiplication function.



Interestingly, in Haskell you could do something like

   matrix_multiply x y = core_multiply x (transpose y)

and then do something like

   {-# RULES "transpose^2" forall m. transpose (transpose m) = m #-}

What this does is tell the compiler that wherever it sees a matrix being 
transposed twice, it can remove both transpositions. So if you multiply 
two matrices and one of them is *already* transposed, matrix_multiply 
won't bother transposing it again. It just gets passed to core_multiply 
unchanged.

You could store certain matrices permanently transposed, and then 
transpose them back just before passing them to matrix_multiply. The 
compiler will optimise out the transpose operation automatically. And 
all this without needing to modify the compiler to add a special 
optimisation pass; it's just some plain ordinary Haskell source code.
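
To make that concrete, here's a self-contained toy (the list-based 
core_multiply and the names wT/answer are mine, purely for illustration):

   import Data.List (transpose)

   {-# RULES "transpose^2" forall m. transpose (transpose m) = m #-}

   type Matrix = [[Double]]

   core_multiply :: Matrix -> Matrix -> Matrix
   core_multiply a bT = [ [ sum (zipWith (*) r c) | c <- bT ] | r <- a ]

   matrix_multiply :: Matrix -> Matrix -> Matrix
   matrix_multiply x y = core_multiply x (transpose y)

   wT :: Matrix                      -- kept permanently transposed
   wT = transpose [[1,2],[3,4]]

   -- With -O, matrix_multiply inlines and the rule rewrites
   --   core_multiply v (transpose (transpose wT))
   -- to core_multiply v wT, so no transposing happens at run time.
   answer :: Matrix
   answer = matrix_multiply [[1,0],[0,1]] (transpose wT)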

>> Would you like me to use lists, immutable arrays, mutable arrays, 
>> unboxed arrays or parallel arrays? Should I use 2D arrays or 1D 
>> arrays? Should I allow integers or doubles, or stick to only integers?
> 
> You could do a similar exercise to what the guy did.  Start off with a 
> really "obvious" way of doing it, then try out different things and time 
> each one.  Make a nice list like he did showing how much faster you can 
> make it than the "initial" version.

Indeed, this is my plan. I'm just trying to get the naive list-based 
version to work. (Currently it gives the wrong answers, which obviously 
isn't a lot of use...)
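
(The shape I'm aiming for is something like this; forgetting to transpose 
the second matrix before taking the dot products is a classic way to get 
wrong answers:)

   import Data.List (transpose)

   -- Naive list-based multiply: each result entry is the dot product of
   -- a row of A with a column of B (i.e. a row of B transposed).
   mmul :: Num a => [[a]] -> [[a]] -> [[a]]
   mmul a b = [ [ sum (zipWith (*) r c) | c <- transpose b ] | r <- a ]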



From: scott
Subject: Re: Optimizations...
Date: 23 Jul 2009 08:02:08
Message: <4a685140@news.povray.org>
> Indeed, this is my plan. I'm just trying to get the naive list-based 
> version to work. (Currently it gives the wrong answers, which obviously 
> isn't a lot of use...)

I could have written a working C version by now :-D

Also I wonder how his results compare to something like MATLAB, which 
supposedly has very fast matrix handling functions built in...

