POV-Ray: Newsgroups: povray.off-topic: Three guesses why

POV-Ray : Newsgroups : povray.off-topic : Three guesses why		Server Time 15 Mar 2026 08:13:57 EDT (-0400)

<<< Previous 2 Messages

Goto Latest 10 Messages

Next 10 Messages >>>

From: Invisible
Subject: Re: Three guesses why
Date: 17 Jun 2011 04:12:02
Message: <4dfb0c52@news.povray.org>

On 16/06/2011 10:40 PM, Warp wrote:

>    One thing I thought of:
>
>    A programming language/compiler cannot automatically parallelize all
> possible programs given to it in the most optimal way.

Sure. But parallelism doesn't necessarily have to be "automatic". Doing 
it manually is OK, so long as you make it really easy to specify what 
you actually want. That works.

> Some problems are more
> easily parallelized in an automatic way, while others are much harder.

Quite. GPUs have been merrily parallising a large class of important 
algorithms for decades now. It's no secret that only certain algorithms 
fit the GPU model, but when they do, boy do they go fast!

On the other hand, how do you make your word processor go faster? Hell, 
*can* you make your word processor go faster? From what I've seen, the 
thing that usually slows these down is I/O latency (possibly due to lack 
of RAM forcing you to use virtual memory). Adding more CPU power is 
going to do nothing about that.

> This is just a fact
> (and I wouldn't be surprised if such an optimization problem would be
> at the very least in the NP category

It took me a moment to think about this one. My initial reaction was 
"isn't the NP category only defined for *sequential* algorithms?" And 
then I realised that the process of transforming a program could very 
well be a sequential algorithm, and hence amenable to classification.

OOC, how many problems can you name for which there is no known optimal 
algorithm? How many problems provably cannot be solved optimally?

>    It may even be that in some situation you might even want just the
> program using one core if it can't use the others efficiently, so that
> you can run something else on the others, than the one program hogging
> everything for itself, making other programs run more slowly than they
> would have to.

Even for programs which do run faster in parallel, due to Amdahl's law, 
there's usually a point at which adding more cores does not improve 
performance. So you probably don't ever want to say "use as many cores 
as I have", because that tends to be pessimal. You probably want to 
choose the optimal number of cores for each particular program.

>    This automatic parallelization may be also give the false illusion of
> efficiency, when in fact it's far from it. After all, you will see the
> program using 100% of all the cores. That's pretty efficient, isn't it?
> Of course this is just an illusion. CPU usage percentage doesn't necessarily
> directly translate to efficiency.

Quite.

> (And in fact, I think there are examples of programs which actually run
> *slower* on multiple cores than on one core, even though they are using
> 100% of CPU time on each, as contradictory as that might sound.)

This is not a theoretical idea. This is something which commonly happens 
in the real world.

Spawn two threads. Have one thread compute the odd-numbered elements of 
an array, and the other thread compute the even-numbered elements. This 
way, the work is split in half, right?

Except now each CPU core continuously invalidates the other core's L1 
cache lines, causing constant memory traffic and hence very high 
latency. It's probably vastly slower than just doing all the work in one 
thread.

Of course, the solution here is pretty trivial; rather than interleave 
the elements, split the array in the middle and have one thread process 
each half. If the elements really are independent, you should now see a 
50% speedup.

In case this sounds like the kind of mistake only a novice would make, 
consider GHC's new parallel garbage collector. Initially they programmed 
it to use all available cores. But they quickly discovered that due to 
locality issues, performance was rubbish. It turns out to be *faster* to 
let cores sit idle rather than generate lots of cache traffic sometimes.

(The solution they eventually implemented was that each core has its own 
heap, and garbage-collects it in parallel. At least, for the first 
generation. The second generation has a shared heap, which is still 
split into "chunks" and collected in parallel.)

Post a reply to this message

From: Darren New
Subject: Re: Three guesses why
Date: 17 Jun 2011 16:50:07
Message: <4dfbbdff$1@news.povray.org>

On 6/17/2011 1:12, Invisible wrote:
> On the other hand, how do you make your word processor go faster?

Is your word processor really not fast enough?

> *can* you make your word processor go faster?

Spell check in one thread, layout of pages you haven't looked at in another 
thread, etc.  I think they're already making it faster, if you actually 
watch Word or Reader open a big document, for example.

> From what I've seen, the thing
> that usually slows these down is I/O latency

I find that's almost all of the slow-down, actually. Have you seen the video 
of the SSD RAID array? Where they open all four office programs with 
documents in about half a second?  Where they open every program installed 
on the computer in about 10 seconds?

>> This is just a fact
>> (and I wouldn't be surprised if such an optimization problem would be
>> at the very least in the NP category
>
> It took me a moment to think about this one. My initial reaction was "isn't
> the NP category only defined for *sequential* algorithms?"

Sort of. A NDTM isn't exactly a sequential machine.

If it's NP, it's going to be NP on any fixed number of processors. Only if 
you let the number of processors scale with the size of the problem does it 
come into question. (Granted, that's how a lot of SIMD theory is defined.)

> OOC, how many problems can you name for which there is no known optimal
> algorithm? How many problems provably cannot be solved optimally?

Certainly all the problems that can't be solved can't be solved optimally. 
Otherwise, you can always brute force all possible answers and pick out the 
optimal one, assuming you have a way of generating answers and evaluating 
their relative merits.  I think your question is too ill-specified to be 
answered clearly, tho.

> Even for programs which do run faster in parallel, due to Amdahl's law,
> there's usually a point at which adding more cores does not improve
> performance.

Certainly having more cores than the number of instructions the program 
executes is going to be pointless, so mathematically speaking, the law 
*always* applies, even if it's silly in practice to take it to that 
exaggeration.

> Except now each CPU core continuously invalidates the other core's L1 cache
> lines, causing constant memory traffic and hence very high latency. It's
> probably vastly slower than just doing all the work in one thread.

And for people who don't have a technical background:

http://blogs.msdn.com/b/ericlippert/archive/2011/06/16/atomicity-volatility-and-immutability-are-different-part-three.aspx

> (The solution they eventually implemented was that each core has its own
> heap, and garbage-collects it in parallel. At least, for the first
> generation. The second generation has a shared heap, which is still split
> into "chunks" and collected in parallel.)

GC is definitely one of the harder problems out there. :-) One of those "in 
theory it's easy, in practice getting good performance is really hard."

-- 
Darren New, San Diego CA, USA (PST)
   "Coding without comments is like
    driving without turn signals."

Post a reply to this message

From: Warp
Subject: Re: Three guesses why
Date: 18 Jun 2011 07:49:53
Message: <4dfc90e1@news.povray.org>

Darren New <dne### [at] sanrrcom> wrote:
> Certainly having more cores than the number of instructions the program 
> executes is going to be pointless, so mathematically speaking, the law 
> *always* applies, even if it's silly in practice to take it to that 
> exaggeration.

  I don't think it works like that. Of course each CPU runs the *same* code,
but each on different input data (or partially different, as some of the
input data may well be shared among the threads, only some parameters being
different; POV-Ray 3.7 would be a perfect example of this).

  Perhaps you meant that having more cores than the amount of generated data
would be useless (because each processor has to output at least one bit of
data to be useful). Since generated data typically has quite many bits, it's
not like you would hit this limit any time soon.

-- 
                                                          - Warp

Post a reply to this message

From: Warp
Subject: Re: Three guesses why
Date: 18 Jun 2011 07:55:20
Message: <4dfc9227@news.povray.org>

Invisible <voi### [at] devnull> wrote:
> Quite. GPUs have been merrily parallising a large class of important 
> algorithms for decades now.

  I wonder what gives you that impression.

  The first GPU to support programmable shaders was the GeForce 3 Series,
released in 2001. Back then support for implementing more-or-less generic
algorithms with these shaders was very limited.

  As programmable shader languages were improved, the possibilities
increased, and there were quite many "hacks" to make programmable shaders
do something else than to paint pixels, but it probably was not until CUDA
was published that there was "official" support for generic algorithm
programming on GPUs. The first GPU supporting CUDA was the GeForce 8800
Series, released in 2006.

-- 
                                                          - Warp

Post a reply to this message

From: Darren New
Subject: Re: Three guesses why
Date: 18 Jun 2011 15:56:54
Message: <4dfd0306$1@news.povray.org>

On 6/18/2011 4:49, Warp wrote:
> Darren New<dne### [at] sanrrcom>  wrote:
>> Certainly having more cores than the number of instructions the program
>> executes is going to be pointless, so mathematically speaking, the law
>> *always* applies, even if it's silly in practice to take it to that
>> exaggeration.

>    Perhaps you meant that having more cores than the amount of generated data

I was speaking entirely theoretically. Amahl's law says you only get so much 
speed-up, no matter how many CPUs you throw at it. If the sequential program 
executes 100 billion instructions over the course of solving the problem, 
giving it more than 100 billion processors cannot possibly speed it up any 
farther.

Sort of like an O(n) question ignoring the constant factor sort of discussion.

> would be useless (because each processor has to output at least one bit of
> data to be useful). Since generated data typically has quite many bits, it's
> not like you would hit this limit any time soon.

Exactly. That's why I said "mathematically speaking."

When I studied SIMD algorithms in school, the assumption was that you always 
had N processors, where N was the same number you were giving to O(N) sorts 
of calculations.

-- 
Darren New, San Diego CA, USA (PST)
   "Coding without comments is like
    driving without turn signals."

Post a reply to this message

From: Invisible
Subject: Re: Three guesses why
Date: 21 Jun 2011 03:55:44
Message: <4e004e80$1@news.povray.org>

On 18/06/2011 12:55 PM, Warp wrote:
> Invisible<voi### [at] devnull>  wrote:
>> Quite. GPUs have been merrily parallising a large class of important
>> algorithms for decades now.
>
>    I wonder what gives you that impression.
>
>    The first GPU to support programmable shaders was the GeForce 3 Series,
> released in 2001.

OK. So the early ones were entirely hard-wired. They're still using 
parallel processing to solve a real-world problem. My point was that 
some kinds of algorithms are very easy to solve with parallel processing.

Post a reply to this message

From: Invisible
Subject: Re: Three guesses why
Date: 21 Jun 2011 04:31:13
Message: <4e0056d1$1@news.povray.org>

>> On the other hand, how do you make your word processor go faster?
>
> Is your word processor really not fast enough?

Sometimes. (Depending on which one I'm using.)

>> *can* you make your word processor go faster?
>
> Spell check in one thread, layout of pages you haven't looked at in
> another thread, etc. I think they're already making it faster, if you
> actually watch Word or Reader open a big document, for example.

I wonder if anybody has fixed that thing where closing the CD drive 
causes the entire Windows OS to lock up for 30 seconds yet?

>> From what I've seen, the thing
>> that usually slows these down is I/O latency
>
> I find that's almost all of the slow-down, actually.

I'm just bitter because our document management system requires you to 
open a Word document that lives on a Windows share at HQ. Of course, for 
all the people at HQ, that's just a subnet away. But for everyone 
else... SMB over a WAN is not fast. 5 Mbit/sec bandwidth and 105 ms 
latency is not fun with such a chatty protocol.

> Have you seen the
> video of the SSD RAID array? Where they open all four office programs
> with documents in about half a second? Where they open every program
> installed on the computer in about 10 seconds?

I do remember configuring my Amiga 1200 to boot from a RAM disk. Jesus, 
that was fast...

>> OOC, how many problems can you name for which there is no known optimal
>> algorithm? How many problems provably cannot be solved optimally?
>
> Certainly all the problems that can't be solved can't be solved
> optimally. Otherwise, you can always brute force all possible answers
> and pick out the optimal one, assuming you have a way of generating
> answers and evaluating their relative merits. I think your question is
> too ill-specified to be answered clearly, tho.

OK, so a problem would be provably impossible to solve if the search 
space is infinite or if there's no way to enumerate all possible solutions?

>> Even for programs which do run faster in parallel, due to Amdahl's law,
>> there's usually a point at which adding more cores does not improve
>> performance.
>
> Certainly having more cores than the number of instructions the program
> executes is going to be pointless, so mathematically speaking, the law
> *always* applies, even if it's silly in practice to take it to that
> exaggeration.

Sure. My point was that for some programs, adding more cores yields more 
speed, at least until you reach ridiculous limits. For other programs, 
adding just one or two cores means you already reach that point.

> GC is definitely one of the harder problems out there. :-) One of those
> "in theory it's easy, in practice getting good performance is really hard."

I think the fundamental thing is that GC wants to traverse pointers 
backwards, but the heap structure only allows you to traverse them forwards.

Other than that, we keep getting processors with more and more cores, 
but exactly the same amount of memory bandwidth. This seems very 
unsustainable.

Post a reply to this message

From: Warp
Subject: Re: Three guesses why
Date: 21 Jun 2011 13:41:44
Message: <4e00d7d8@news.povray.org>

Invisible <voi### [at] devnull> wrote:
> OK. So the early ones were entirely hard-wired. They're still using 
> parallel processing to solve a real-world problem. My point was that 
> some kinds of algorithms are very easy to solve with parallel processing.

  Scanline rendering is easy to parallelize because all the pixels can be
rendered independently of each other (not unlike raytracing).

-- 
                                                          - Warp

Post a reply to this message

From: Darren New
Subject: Re: Three guesses why
Date: 21 Jun 2011 13:55:45
Message: <4e00db21$1@news.povray.org>

On 6/21/2011 1:31, Invisible wrote:
> I wonder if anybody has fixed that thing where closing the CD drive causes
> the entire Windows OS to lock up for 30 seconds yet?

Yes. They call it "NT".  Your flaw is thinking the entire OS is locked up, 
when it's really just explorer waiting to read the disk.

> WAN is not fast. 5 Mbit/sec bandwidth and 105 ms latency is not fun with
> such a chatty protocol.

Yeah. Try starting up SunOS and X Windows on a 4-meg sparcstation with no 
internal disk paging over the 10Mbps ethernet.  Ninety seconds to switch 
window focus?

> I do remember configuring my Amiga 1200 to boot from a RAM disk. Jesus, that
> was fast...

Amazing how when you get rid of the moving parts and the simulation of 
moving parts it goes really fast.

>>> OOC, how many problems can you name for which there is no known optimal
>>> algorithm? How many problems provably cannot be solved optimally?
>>
>> Certainly all the problems that can't be solved can't be solved
>> optimally. Otherwise, you can always brute force all possible answers
>> and pick out the optimal one, assuming you have a way of generating
>> answers and evaluating their relative merits. I think your question is
>> too ill-specified to be answered clearly, tho.
>
> OK, so a problem would be provably impossible to solve if the search space
> is infinite or if there's no way to enumerate all possible solutions?

It would be provably impossible to solve via brute force, sure. That's 
exactly what keeps the halting problem from being solved, in at least some 
sense.

Beyond that, you'll have to say what you mean by "solving optimally". We 
know the optimal Big-O for a number of problems. Is that what you're talking 
about?

> Sure. My point was that for some programs, adding more cores yields more
> speed, at least until you reach ridiculous limits. For other programs,
> adding just one or two cores means you already reach that point.

Right. My comment was more of an aside.

I also realized yesterday, when driving past the building I tried to model 
for my game that slowed the frame rate to a crawl when I actually drew it, 
that the reason the universe doesn't have trouble keeping up with the 
frame-rate is that it really does have N processors for a problem of size N.

How do you calculate in real time exactly what it would look like to have 
light reflecting off that building built out of all those atoms?  Give each 
atom the ability to process its contribution to the reflected light in 
parallel.  How do you do the clipping and shadow casting in real time? Have 
each atom doing the clipping and shadow casting in parallel for all the 
pixels it affects. Etc.

(Yes, maybe obvious, but it kind of twinged with this conversation.)

> I think the fundamental thing is that GC wants to traverse pointers
> backwards, but the heap structure only allows you to traverse them forwards.

I don't think GC wants to traverse backwards, except perhaps in purely 
functional languages/heaps. GC works by starting at the roots and seeing 
what's still reachable, so that's all forward pointers.

> Other than that, we keep getting processors with more and more cores, but
> exactly the same amount of memory bandwidth. This seems very unsustainable.

Yep.  That's why NUMA is getting more popular and such.

-- 
Darren New, San Diego CA, USA (PST)
   "Coding without comments is like
    driving without turn signals."

Post a reply to this message

From: Invisible
Subject: Re: Three guesses why
Date: 22 Jun 2011 04:07:13
Message: <4e01a2b1$1@news.povray.org>

On 21/06/2011 06:41 PM, Warp wrote:
> Invisible<voi### [at] devnull>  wrote:
>> OK. So the early ones were entirely hard-wired. They're still using
>> parallel processing to solve a real-world problem. My point was that
>> some kinds of algorithms are very easy to solve with parallel processing.
>
>    Scanline rendering is easy to parallelize because all the pixels can be
> rendered independently of each other (not unlike raytracing).

Indeed, a great many graphical tasks parallise very well. For example, 
image scaling (whether bilinear or trilinear or anisotropic or 
whatever). Or DCT / inverse DCT operations [on independent pixel blocks].

Other tasks don't parallise so well.

Post a reply to this message

<<< Previous 2 Messages

Goto Latest 10 Messages

Next 10 Messages >>>