  Re: Three guesses why  
From: Invisible
Date: 17 Jun 2011 04:12:02
Message: <4dfb0c52@news.povray.org>
On 16/06/2011 10:40 PM, Warp wrote:

>    One thing I thought of:
>
>    A programming language/compiler cannot automatically parallelize all
> possible programs given to it in the most optimal way.

Sure. But parallelism doesn't necessarily have to be "automatic". Doing 
it manually is OK, so long as you make it really easy to specify what 
you actually want. That works.
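
(Purely as an illustration, and nothing from this thread: here's a
minimal C++ sketch of what "easy to specify manually" can look like.
You state explicitly which two calls may run concurrently, and
std::async takes care of the threads. The function name and workload
are made up.)

    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Hypothetical workload: sum a sub-range of a vector.
    long long sum_range(const std::vector<int>& v,
                        std::size_t lo, std::size_t hi) {
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
    }

    int main() {
        std::vector<int> data(1000000, 1);
        std::size_t mid = data.size() / 2;

        // Manually state that the two halves may be summed in parallel.
        auto left  = std::async(std::launch::async, sum_range,
                                std::cref(data), std::size_t(0), mid);
        auto right = std::async(std::launch::async, sum_range,
                                std::cref(data), mid, data.size());

        std::cout << left.get() + right.get() << "\n";
    }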

> Some problems are more
> easily parallelized in an automatic way, while others are much harder.

Quite. GPUs have been merrily parallelising a large class of important 
algorithms for decades now. It's no secret that only certain algorithms 
fit the GPU model, but when they do, boy do they go fast!

On the other hand, how do you make your word processor go faster? Hell, 
*can* you make your word processor go faster? From what I've seen, the 
thing that usually slows these down is I/O latency (possibly due to lack 
of RAM forcing you to use virtual memory). Adding more CPU power is 
going to do nothing about that.

> This is just a fact
> (and I wouldn't be surprised if such an optimization problem would be
> at the very least in the NP category

It took me a moment to think about this one. My initial reaction was 
"isn't the NP category only defined for *sequential* algorithms?" And 
then I realised that the process of transforming a program could very 
well be a sequential algorithm, and hence amenable to classification.

OOC, how many problems can you name for which there is no known optimal 
algorithm? How many problems provably cannot be solved optimally?

>    It may even be that in some situation you might even want just the
> program using one core if it can't use the others efficiently, so that
> you can run something else on the others, than the one program hogging
> everything for itself, making other programs run more slowly than they
> would have to.

Even for programs which do run faster in parallel, due to Amdahl's law, 
there's usually a point at which adding more cores does not improve 
performance. So you probably don't ever want to say "use as many cores 
as I have", because that tends to be pessimal. You probably want to 
choose the optimal number of cores for each particular program.
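
(A rough back-of-the-envelope sketch of my own, not from the post: if a
fraction p of the program parallelises, Amdahl's law caps the speedup on
n cores at 1 / ((1 - p) + p/n), and that curve flattens out surprisingly
fast.)

    #include <cstdio>

    // Amdahl's law: upper bound on the speedup when a fraction p of the
    // work parallelises perfectly across n cores.
    double amdahl_speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main() {
        const double p = 0.9;  // assume 90% of the program parallelises
        int cores[] = {1, 2, 4, 8, 16, 64, 1024};
        for (int n : cores)
            std::printf("%5d cores -> %.2fx\n", n, amdahl_speedup(p, n));
        // Even with 1024 cores the speedup stays below 10x, so past some
        // point extra cores buy essentially nothing for this program.
    }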

>    This automatic parallelization may also give the false illusion of
> efficiency, when in fact it's far from it. After all, you will see the
> program using 100% of all the cores. That's pretty efficient, isn't it?
> Of course this is just an illusion. CPU usage percentage doesn't necessarily
> directly translate to efficiency.

Quite.

> (And in fact, I think there are examples of programs which actually run
> *slower* on multiple cores than on one core, even though they are using
> 100% of CPU time on each, as contradictory as that might sound.)

This is not a theoretical idea. This is something which commonly happens 
in the real world.

Spawn two threads. Have one thread compute the odd-numbered elements of 
an array, and the other thread compute the even-numbered elements. This 
way, the work is split in half, right?

Except now each CPU core continuously invalidates the other core's L1 
cache lines, causing constant memory traffic and hence very high 
latency. It's probably vastly slower than just doing all the work in one 
thread.

Of course, the solution here is pretty trivial: rather than interleave 
the elements, split the array in the middle and have one thread process 
each half. If the elements really are independent, the runtime should 
now roughly halve.
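
(A minimal sketch of the two layouts, mine rather than anything from the
thread; the per-element "work" is made up. In the interleaved split the
two cores constantly fight over the same cache lines; the block split
gives each core its own contiguous region.)

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Toy per-element computation (hypothetical).
    static int compute(std::size_t i) {
        return static_cast<int>((i * i) % 97);
    }

    // Interleaved: thread 0 writes even indices, thread 1 writes odd ones.
    // Neighbouring ints share a cache line, so each write invalidates the
    // other core's copy of that line (false sharing).
    void interleaved(std::vector<int>& a) {
        auto worker = [&a](std::size_t start) {
            for (std::size_t i = start; i < a.size(); i += 2)
                a[i] = compute(i);
        };
        std::thread t0(worker, std::size_t(0)), t1(worker, std::size_t(1));
        t0.join(); t1.join();
    }

    // Blocked: each thread gets one contiguous half, so apart from the
    // boundary the two cores never touch the same cache lines.
    void blocked(std::vector<int>& a) {
        std::size_t mid = a.size() / 2;
        auto worker = [&a](std::size_t lo, std::size_t hi) {
            for (std::size_t i = lo; i < hi; ++i)
                a[i] = compute(i);
        };
        std::thread t0(worker, std::size_t(0), mid), t1(worker, mid, a.size());
        t0.join(); t1.join();
    }

    int main() {
        std::vector<int> a(1 << 24);
        interleaved(a);  // typically much slower: cache-line ping-pong
        blocked(a);      // typically close to 2x one thread's throughput
    }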

In case this sounds like the kind of mistake only a novice would make, 
consider GHC's new parallel garbage collector. Initially they programmed 
it to use all available cores. But they quickly discovered that due to 
locality issues, performance was rubbish. It turns out that it is 
sometimes *faster* to let cores sit idle than to generate lots of cache 
traffic.

(The solution they eventually implemented was that each core has its own 
heap, and garbage-collects it in parallel. At least, for the first 
generation. The second generation has a shared heap, which is still 
split into "chunks" and collected in parallel.)

