Something *REALLY* weird is going on with beta.30-rad1. Look at these
performance measurements (done with the Linux "time" command - I find the
built-in timing stats a bit off). The scene was the "rad-def-test.pov" sample
scene, modified to use the "2Bounce" settings from "rad_def.inc".
Timings are in seconds; "real" is wall-clock time; "user" and "sys" are CPU time
spent in user and kernel mode, respectively. The version used was actually not
exactly beta.30-rad1 but a slightly modified one, but I guess we'd see the same
effect:
beta.29 on 4 cores (just for reference):
real 7.00
user 25.71
sys 0.03
beta.30-rad1 on 4 cores:
real 53.99
user 195.49
sys 0.11
beta.30-rad1 throttled to use 1 core only:
real 292.78
user 292.48
sys 0.03
Uh - so running on 4 cores, beta.30-rad1 is not just faster, but actually
MORE EFFICIENT than running on a single core...?!? I guess that would qualify
for a Nobel Prize in informatics (if there were such a thing)...
I actually made the effort of cross-checking with a classic (analog) wristwatch
to make sure the "time" command isn't broken, but I got similar wall-clock
values, and the CPU times look very plausible.
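For anyone who wants to reproduce this: a sweep over +WT values could be
scripted roughly like the sketch below. This is only a sketch, not the exact
commands I ran - the povray invocation, resolution and options are assumptions.

#!/usr/bin/env python3
# Rough sketch: render the scene once per +WT value and report real/user/sys.
# Per-run CPU times come from differencing getrusage(RUSAGE_CHILDREN), which is
# essentially what the shell's "time" builtin reports for its child process.
import resource
import subprocess
import time

SCENE = "rad-def-test.pov"  # modified to use the 2Bounce settings from rad_def.inc

prev_user = prev_sys = 0.0
for threads in range(1, 7):
    cmd = ["povray", SCENE, "+W800", "+H600", f"+WT{threads}", "-D"]
    start = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    real = time.monotonic() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = usage.ru_utime - prev_user
    sys_time = usage.ru_stime - prev_sys
    prev_user, prev_sys = usage.ru_utime, usage.ru_stime
    print(f"+WT{threads}: real {real:.2f}  user {user:.2f}  sys {sys_time:.2f}")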
Some more values:
+WT    real      user      sys
 1     292.78    292.48    0.03
 2      93.00    184.70    0.21
 3      82.16    239.41    0.08
 4      53.99    195.49    0.11
 5      57.66    209.04    0.11
 6      54.81    200.58    0.17
Interestingly enough, there seems to be no clear correlation between the number
of threads and efficiency - but for some reason, particularly inefficient
operation seems to correlate with particularly little time spent in kernel mode.
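Just to make the anomaly explicit, here is a quick back-of-the-envelope check
(a throwaway sketch; the values are simply copied by hand from the table above,
and the box has 4 physical cores):

# Speedup relative to the 1-thread run, and per-thread "efficiency".
# With identical work per run, efficiency should never exceed 1.0.
real = {1: 292.78, 2: 93.00, 3: 82.16, 4: 53.99, 5: 57.66, 6: 54.81}

for n, t in sorted(real.items()):
    speedup = real[1] / t
    print(f"+WT{n}: speedup {speedup:.2f}x  efficiency {speedup / n:.2f}")

For +WT4 this comes out at roughly 5.4x speedup, i.e. about 1.36 threads' worth
of work per thread - each thread appears to do more than one thread's worth of
work, which is exactly the kind of nonsense I mean.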
Initially I observed the effect while I had four separate single-threaded
instances of POV-Ray working on different scenes in parallel. However, the
effect shows up independent of the total CPU workload.
Something's utterly wrong; I even wondered whether I might have messed up the
parameter order on some function call, causing the thread count (or thread
number) to drive some quality parameter... but then again the stats - number of
rays shot, samples gathered and what-have-you - stay pretty much the same
(except for the execution time, of course), varying by less than a 1% margin.
Only the distribution of radiosity samples per pretrace step varies by anything
more, but even there we see less than 10% deviation.
Anyone with *any* idea what might be going wrong here? (Even the weirdest ideas
are welcome, as they might happen to trigger some inspiration.)
clipka wrote:
> Something *REALLY* weird is going on with the beta.30-rad1. Look at these
> performance measurements (done with Linux "time" command - I find the builtin
> timing stats a bit off). The scene was the "rad-def-test.pov" sample scene,
> modified to use the "2Bounce" settings from "rad_def.inc".
>
> Timings are in seconds; "real" is wall clock time; "user" and "sys" is CPU time
> spent in user and kernel mode, respectively; version used was actually not
> exactly beta.30-rad1 but a slightly modified one, but I guess we see the same
> effect:
>
>
> beta.29 on 4 cores (just for reference):
>
> real 7.00
> user 25.71
> sys 0.03
>
>
> beta.30-rad1 on 4 cores:
>
> real 53.99
> user 195.49
> sys 0.11
>
>
> beta.30-rad1 throttled to use 1 core only:
>
> real 292.78
> user 292.48
> sys 0.03
>
>
> Uh - so running on 4 cores, the beta.30-rad1 is not just faster, but actually
> MORE EFFICIENT than running on a single core...?!? I guess that would qualify
> for a nobel prize in informatics (if there was such a thing)...
Forgive my stupid ignorance, but shouldn't it be expected that using more cores
to run more threads separately, truly in parallel, would naturally lead to such
a boost in performance? Isn't that what all this multicore hype is about?
clipka wrote:
> Someone with *any* idea what might go wrong here? (Even the weirdest ideas
> welcome, as they might happen to trigger some inspiration.)
>
Maybe just much better cache access and/or jump prediction as a result of using
multiple cores? There are, e.g., a few SSE2 instructions just for optimizing
memory access.
Properly aligned memory on 16-byte boundaries as a side effect of using 4 cores,
and misalignment when using just one?
OK, just wild thoughts, but speed optimization (or let's say the search for
reasons causing the lack of expected speed) on contemporary processors seems
quite tricky.
-Ive
Ive <"ive### [at] lilysoftorg"> wrote:
> Maybe just a much better cache access and/or jump prediction as a result
> of using multiple cores? There are e.g. a few SSE2 instructions just to
> optimize the memory access.
Hm... why should jump prediction or cache access suffer from running only a
single thread on a single core, while the other cores are basically just idle?
I'd agree if the system kept rotating the thread among multiple cores to keep an
even load per core. But Linux doesn't: it keeps the thread "pinned" to a single
core, which can "calibrate" itself to run the POV code without having to
"re-calibrate" due to task switches. There are plenty of idle cores to take care
of any other task that might occasionally pop up.
Quite the opposite: a multi-core system maxed out by 4 POV-Ray threads, with not
much blocking to do, will be forced to do some task switching in order to
"squeeze in" other jobs; in addition, it will probably kick out the POV threads
alternately, so no single thread can run undisturbed, and I'd actually expect
the assignment of threads to cores to change much more frequently.
> Proper aligned memory on 16-byte boundaries as a side effect when using
> 4 cores and misalignment when using just one?
Interesting theory. But I guess there are not many data structures that vary in
size depending on the thread count.
> Ok. Just wild thoughts,
I asked for it :)
I wouldn't bother much if I didn't have the "gut feeling" that this effect
might be related to the overall dramatic slowdown seen in beta.30-rad1 with so
many scenes.
nemesis <nam### [at] gmailcom> wrote:
> > beta.30-rad1 on 4 cores:
> >
> > real 53.99
> > user 195.49
> > sys 0.11
> >
> >
> > beta.30-rad1 throttled to use 1 core only:
> >
> > real 292.78
> > user 292.48
> > sys 0.03
>
> Forgive my stupid ignorance, but shouldn't it be expected that usage of
> more cores to handle more threads separately and truly running in
> parallel would lead naturally to such boost in performance? Isn't that
> what all this multicore hype is all about?
A multicore system is basically just a multiprocessor system with the
processors placed on a single die, to (a) reduce costs, (b) share more
components among the CPUs (e.g. cache) to reduce synchronization overhead, and
(c) speed up synchronization of the remaining components by reducing signal
path lengths.
The benefit, as in a multiprocessor system, comes from nothing more than
multiple workers doing the same job. So if you have N processors, you'd expect
a speed gain of roughly a factor of N, minus some overhead introduced by the
multithreading.
Look again at the figures above:
1 core -> 293 seconds
4 cores -> 54 seconds
Either my math is rusty, or this is a speed gain of more than the number of
cores...
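To spell out the arithmetic (a quick sketch; the numbers are taken straight
from the measurements above):

# Best case for 4 perfectly parallel workers with zero overhead, vs. measured.
t1, t4, cores = 292.78, 53.99, 4
print(f"ideal 4-core time:    {t1 / cores:.2f} s")  # ~73.2 s
print(f"measured 4-core time: {t4:.2f} s")          # 53.99 s - below the ideal bound
print(f"speed gain:           {t1 / t4:.2f}x")      # ~5.4x on 4 cores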
From: Warp
Subject: Re: Radiosity performance: thread count anomaly
Date: 19 Jan 2009 19:21:37
Message: <49751911@news.povray.org>
clipka <nomail@nomail> wrote:
> A multicore system is basically just a multiprocessor system, with the
> processors placed on a single die to (a) reduce costs, (b) share more
> components among the CPUs (e.g. cache) to reduce synchronization overhead, and
> (c) speed up synchronization of the remaining components by reducing signal
> path lengths.
Actually shared/non-shared caches can have a big effect and make a
notable difference between multiprocessor and multicore systems.
Sometimes a shared cache can be beneficial, especially if one single
program runs several threads, all of them sharing the same data. However,
sometimes a shared cache can be detrimental, especially for unrelated
processes which do not share data but must share cache space because
the cores don't have separate caches.
With POV-Ray I must assume that it benefits from a shared cache, or
at worst it is not hindered by it. (Given that most data POV-Ray 3.7
uses is read-only, it wouldn't make too much of a difference if each
core had its own independent cache.)
> Look again at the figures above:
> 1 core -> 293 seconds
> 4 cores -> 54 seconds
> Either my math is rusty, or this is a speed gain by more than the number of
> cores...
How many times was the test run? Was there a lot of variation?
It would be interesting if the test were made with something which takes
significantly longer to render (e.g. 15 minutes with 1 core or so).
--
- Warp
Warp <war### [at] tagpovrayorg> wrote:
> Actually shared/non-shared caches can have a big effect and make a
> notable difference between multiprocessor and multicore systems.
Sure, there is a performance impact related to caching; however, compared to
the ideal "N cores = N-fold performance" situation, a shared cache is a
non-hindrance at best - just like a non-shared cache is. Seen this way, neither
gives a performance *benefit*: both add overhead, which varies with how they
are used.
> With POV-Ray I must assume that it benefits from a shared cache, or
> at worst it is not hindered by it. (Given that most data POV-Ray 3.7
> uses is read-only, it wouldn't make too much of a difference if each
> core had its own independent cache.)
If we're talking about N*X MB shared by all threads vs. X MB per thread for N
threads, then I guess you're right that the shared N*X MB is of benefit for POV,
because more stuff fits into it. However, when talking about X MB shared by all
threads vs. X MB for each of N threads, the separate caches are probably of
benefit, because each thread has its own local data structures - stack, buffers
for optimization, and so on - which would reduce the space available for common
data in a shared cache.
> > Look again at the figures above:
>
> > 1 core -> 293 seconds
> > 4 cores -> 54 seconds
>
> > Either my math is rusty, or this is a speed gain by more than the number of
> > cores...
>
> How many times was the test run? Was there lot of variation?
Variation between different scenes - yes, lots of it. Some rendered in almost
identical time (talking about CPU time) regardless of the number of CPUs.
Variation in the render times themselves - nothing significant; something like
a swing of 5%, maybe 10%.
> It would be interesting it the test was made with something which takes
> significantly longer to render (eg. 15 minutes with 1 core or so.).
3 hours 47 minutes enough for your taste?
Compare the stats for rad_def_test.pov using the "IndoorHQ" settings:
****************************************************************************
4 cores:
Render Statistics
Image Resolution 800 x 600
----------------------------------------------------------------------------
Pixels: 550205 Samples: 71514 Smpls/Pxl: 0.13
Rays: 25547811 Saved: 0 Max Level: 800/600
----------------------------------------------------------------------------
Ray->Shape Intersection Tests Succeeded Percentage
----------------------------------------------------------------------------
Box 12875803 9499052 73.77
Cone/Cylinder 13638055 2543768 18.65
CSG Intersection 4454973 3421296 76.80
CSG Union 4454973 4034232 90.56
Plane 25547811 9317330 36.47
Sphere 26254590 25970944 98.92
Torus 4542688 4039987 88.93
Torus Bound 4542688 4265423 93.90
Bounding Box 413047062 60880325 14.74
----------------------------------------------------------------------------
Roots tested: 4265423 eliminated: 3179024
----------------------------------------------------------------------------
Radiosity samples calculated: 86116 (0.63 %)
Radiosity samples reused: 13643598
----------------------------------------------------------------------------
Radiosity (final) calculated: 44237 (0.48 %)
Radiosity (final) reused: 9152963
----------------------------------------------------------------------------
Pass Depth 0 Depth 1 Depth 2 Total
----------------------------------------------------------------------------
1 130 3440 2882 6452
2 475 3815 408 4698
3 1900 4762 247 6909
4 6386 4451 149 10986
5+ 9611 2894 329 12834
Final 35129 484 8624 44237
----------------------------------------------------------------------------
Total 53631 19846 12639 86116
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Render Time:
Photon Time: No photons
Radiosity Time: 0 hours 4 minutes 24 seconds (264.683 seconds)
using 20 thread(s) with 1577.354 CPU-seconds total
Trace Time: 0 hours 36 minutes 40 seconds (2200.203 seconds)
using 4 thread(s) with 7994.706 CPU-seconds total
POV-Ray finished
real 2595.37
user 9559.26
sys 7.86
****************************************************************************
1 core:
Render Statistics
Image Resolution 800 x 600
----------------------------------------------------------------------------
Pixels: 550205 Samples: 70785 Smpls/Pxl: 0.13
Rays: 25425517 Saved: 0 Max Level: 800/600
----------------------------------------------------------------------------
Ray->Shape Intersection Tests Succeeded Percentage
----------------------------------------------------------------------------
Box 12838763 9455865 73.65
Cone/Cylinder 13605760 2543618 18.70
CSG Intersection 4434377 3401653 76.71
CSG Union 4434377 4014406 90.53
Plane 25425517 9258652 36.41
Sphere 26132908 25858301 98.95
Torus 4495444 4001169 89.00
Torus Bound 4495444 4224994 93.98
Bounding Box 411171127 60673334 14.76
----------------------------------------------------------------------------
Roots tested: 4224994 eliminated: 3145389
----------------------------------------------------------------------------
Radiosity samples calculated: 86020 (0.63 %)
Radiosity samples reused: 13542466
----------------------------------------------------------------------------
Radiosity (final) calculated: 43905 (0.48 %)
Radiosity (final) reused: 9055291
----------------------------------------------------------------------------
Pass Depth 0 Depth 1 Depth 2 Total
----------------------------------------------------------------------------
1 130 3398 2844 6372
2 475 3775 387 4637
3 1900 4829 290 7019
4 6372 4463 472 11307
5+ 9590 2896 294 12780
Final 34818 490 8597 43905
----------------------------------------------------------------------------
Total 53285 19851 12884 86020
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Render Time:
Photon Time: No photons
Radiosity Time: 0 hours 39 minutes 13 seconds (2353.549 seconds)
using 5 thread(s) with 3330.858 CPU-seconds total
Trace Time: 3 hours 47 minutes 46 seconds (13666.809 seconds)
using 1 thread(s) with 13666.880 CPU-seconds total
POV-Ray finished
real 16998.36
user 16997.89
sys 0.43
****************************************************************************
Factor >6 here (16998 s / 2595 s is roughly 6.5), instead of the expected at
most 4.
I have to note, however, that in this case the results cannot be compared 100%:
the multi-core render was run with the fix for the mapped-and-transformed
texture issue, which turned out to have some impact on runtime, while the
single-core render was run before applying the fix, and I didn't bother to
re-run it yet. It doesn't change the general tendency, though.
Let's look at the problem from a different perspective: POV-Ray 3.7 beta 31
almost stalls when specifying recursion_limit above 1.
My CPU is an Intel quad core at 4 GHz running Vista. I rendered the above scene,
rad_def_test.pov, using the "Radiosity_OutdoorHQ" setting. As expected, 3.7
beta 31 was much faster than POV-Ray 3.6. But when I added the option
"recursion_limit 2", I get:
POV-Ray 3.6: Total time 561.40 seconds.
POV-Ray 3.7 beta 31: Total time 12960.69 seconds.
Yes, that is more than 23 times slower, despite the fact that 3.7 used all four
cores instead of one. I am sure I used the same rendering settings in both
versions:
512 x 384 +A0.3 +AM2 +R2.
The output images were identical.
Maybe this is a bug.
Correction:
POV-Ray 3.7 beta 31: Total *CPU* time 12960.69 seconds, total time 3335 seconds.
So the rendering took *23* times more CPU resources than POV-Ray 3.6, and *6*
times more execution time.
"grammophone" <eml### [at] ingr> wrote:
> Povray 3.6: Total time 561.40 seconds.
> Povray 3.7 beta 31: Total time 12960.69 seconds.
>
> Maybe this is a bug.
Not really - it's more of a known issue. The radiosity code needed quite an
overhaul to get rid of some ugliness, but one of the many replacement parts
(apparently the most delicate one) still needs some tuning.