Something *REALLY* weird is going on with beta.30-rad1. Look at these
performance measurements (done with the Linux "time" command - I find the
built-in timing stats a bit off). The scene was the "rad-def-test.pov" sample
scene, modified to use the "2Bounce" settings from "rad_def.inc".
Timings are in seconds; "real" is wall-clock time; "user" and "sys" are CPU time
spent in user and kernel mode, respectively. The version used was actually not
exactly beta.30-rad1 but a slightly modified one, but I guess we'd see the same
effect:
beta.29 on 4 cores (just for reference):
real 7.00
user 25.71
sys 0.03
beta.30-rad1 on 4 cores:
real 53.99
user 195.49
sys 0.11
beta.30-rad1 throttled to use 1 core only:
real 292.78
user 292.48
sys 0.03
Uh - so running on 4 cores, beta.30-rad1 is not just faster, but actually
MORE EFFICIENT than running on a single core...?!? I guess that would qualify
for a Nobel Prize in informatics (if there were such a thing)...
I actually made the effort of cross-checking with a classic (analog) wristwatch
to make sure the "time" command isn't broken, but I got similar wall-clock
values, and the CPU times look very plausible.
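For anyone who wants to reproduce this: a sweep over +WT values could be
scripted roughly like the sketch below. This is only a sketch, not the exact
commands I ran - the povray invocation, resolution and options are assumptions.

#!/usr/bin/env python3
# Rough sketch: render the scene once per +WT value and report real/user/sys.
# Per-run CPU times come from differencing getrusage(RUSAGE_CHILDREN), which is
# essentially what the shell's "time" builtin reports for its child process.
import resource
import subprocess
import time

SCENE = "rad-def-test.pov"  # modified to use the 2Bounce settings from rad_def.inc

prev_user = prev_sys = 0.0
for threads in range(1, 7):
    cmd = ["povray", SCENE, "+W800", "+H600", f"+WT{threads}", "-D"]
    start = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    real = time.monotonic() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = usage.ru_utime - prev_user
    sys_time = usage.ru_stime - prev_sys
    prev_user, prev_sys = usage.ru_utime, usage.ru_stime
    print(f"+WT{threads}: real {real:.2f}  user {user:.2f}  sys {sys_time:.2f}")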
Some more values:
+WT    real      user      sys
 1     292.78    292.48    0.03
 2      93.00    184.70    0.21
 3      82.16    239.41    0.08
 4      53.99    195.49    0.11
 5      57.66    209.04    0.11
 6      54.81    200.58    0.17
Interestingly enough, there seems to be no clear correlation between the number
of threads and efficiency - but for some reason, particularly inefficient
operation seems to correlate with particularly little time spent in kernel mode.
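Just to make the anomaly explicit, here is a quick back-of-the-envelope check
(a throwaway sketch; the values are simply copied by hand from the table above,
and the box has 4 physical cores):

# Speedup relative to the 1-thread run, and per-thread "efficiency".
# With identical work per run, efficiency should never exceed 1.0.
real = {1: 292.78, 2: 93.00, 3: 82.16, 4: 53.99, 5: 57.66, 6: 54.81}

for n, t in sorted(real.items()):
    speedup = real[1] / t
    print(f"+WT{n}: speedup {speedup:.2f}x  efficiency {speedup / n:.2f}")

For +WT4 this comes out at roughly 5.4x speedup, i.e. about 1.36 threads' worth
of work per thread - each thread appears to do more than one thread's worth of
work, which is exactly the kind of nonsense I mean.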
Initially I observed the effect while I had four separate single-threaded
instances of POV-Ray working on different scenes in parallel. However, the
effect shows up independent of the total CPU workload.
Something's utterly wrong; I even wondered whether I might have messed up the
parameter order on some function call, causing the thread count (or thread
number) to drive some quality parameter... but then again the stats - number of
rays shot, samples gathered and what-have-you - stay pretty much the same
(except for the execution time, of course), varying by less than a 1% margin.
Only the distribution of radiosity samples per pretrace step varies by anything
more, but even there we see less than 10% deviation.
Anyone with *any* idea what might be going wrong here? (Even the weirdest ideas
are welcome, as they might happen to trigger some inspiration.)
clipka wrote:
> Something *REALLY* weird is going on with the beta.30-rad1. Look at these
> performance measurements (done with Linux "time" command - I find the builtin
> timing stats a bit off). The scene was the "rad-def-test.pov" sample scene,
> modified to use the "2Bounce" settings from "rad_def.inc".
>
> Timings are in seconds; "real" is wall clock time; "user" and "sys" is CPU time
> spent in user and kernel mode, respectively; version used was actually not
> exactly beta.30-rad1 but a slightly modified one, but I guess we see the same
> effect:
>
>
> beta.29 on 4 cores (just for reference):
>
> real 7.00
> user 25.71
> sys 0.03
>
>
> beta.30-rad1 on 4 cores:
>
> real 53.99
> user 195.49
> sys 0.11
>
>
> beta.30-rad1 throttled to use 1 core only:
>
> real 292.78
> user 292.48
> sys 0.03
>
>
> Uh - so running on 4 cores, the beta.30-rad1 is not just faster, but actually
> MORE EFFICIENT than running on a single core...?!? I guess that would qualify
> for a nobel prize in informatics (if there was such a thing)...
Forgive my stupid ignorance, but shouldn't it be expected that using more cores
to run more threads separately, truly in parallel, would naturally lead to such
a boost in performance? Isn't that what all this multicore hype is about?
clipka wrote:
> Someone with *any* idea what might go wrong here? (Even the weirdest ideas
> welcome, as they might happen to trigger some inspiration.)
>
Maybe just much better cache access and/or jump prediction as a result of using
multiple cores? There are, e.g., a few SSE2 instructions just for optimizing
memory access.
Properly aligned memory on 16-byte boundaries as a side effect of using 4 cores,
and misalignment when using just one?
OK, just wild thoughts, but speed optimization (or let's say the search for
reasons causing the lack of expected speed) on contemporary processors seems
quite tricky.
-Ive
Ive <"ive### [at] lilysoftorg"> wrote:
> Maybe just a much better cache access and/or jump prediction as a result
> of using multiple cores? There are e.g. a few SSE2 instructions just to
> optimize the memory access.
Hm... why should jump prediction or cache access suffer from running only a
single thread on a single core, while the other cores are basically just idle?
I'd agree if the system kept rotating the thread among multiple cores to keep an
even load per core. But Linux doesn't: it keeps the thread "pinned" to a single
core, which can "calibrate" itself to run the POV code without having to
"re-calibrate" due to task switches. There are plenty of idle cores to take care
of any other task that might occasionally pop up.
Quite the opposite: a multi-core system maxed out by 4 POV-Ray threads, with not
much blocking to do, will be forced to do some task switching in order to
"squeeze in" other jobs; in addition, it will probably kick out the POV threads
alternately, so no single thread can run undisturbed, and I'd actually expect
the assignment of threads to cores to change much more frequently.
> Proper aligned memory on 16-byte boundaries as a side effect when using
> 4 cores and misalignment when using just one?
Interesting theory. But I guess there are not many data structures that vary in
size depending on the thread count.
> Ok. Just wild thoughts,
I asked for it :)
I wouldn't bother much if I didn't have the "gut feeling" that this effect
might be related to the overall dramatic slowdown seen in beta.30-rad1 with so
many scenes.
nemesis <nam### [at] gmailcom> wrote:
> > beta.30-rad1 on 4 cores:
> >
> > real 53.99
> > user 195.49
> > sys 0.11
> >
> >
> > beta.30-rad1 throttled to use 1 core only:
> >
> > real 292.78
> > user 292.48
> > sys 0.03
>
> Forgive my stupid ignorance, but shouldn't it be expected that usage of
> more cores to handle more threads separately and truly running in
> parallel would lead naturally to such boost in performance? Isn't that
> what all this multicore hype is all about?
A multicore system is basically just a multiprocessor system with the
processors placed on a single die, to (a) reduce costs, (b) share more
components among the CPUs (e.g. cache) to reduce synchronization overhead, and
(c) speed up synchronization of the remaining components by reducing signal
path lengths.
The benefit, as in a multiprocessor system, comes from nothing more than
multiple workers doing the same job. So if you have N processors, you'd expect
a speed gain of roughly a factor of N, minus some overhead introduced by the
multithreading.
Look again at the figures above:
1 core -> 293 seconds
4 cores -> 54 seconds
Either my math is rusty, or this is a speed gain of more than the number of
cores...
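To spell out the arithmetic (a quick sketch; the numbers are taken straight
from the measurements above):

# Best case for 4 perfectly parallel workers with zero overhead, vs. measured.
t1, t4, cores = 292.78, 53.99, 4
print(f"ideal 4-core time:    {t1 / cores:.2f} s")  # ~73.2 s
print(f"measured 4-core time: {t4:.2f} s")          # 53.99 s - below the ideal bound
print(f"speed gain:           {t1 / t4:.2f}x")      # ~5.4x on 4 cores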
From: Warp
Subject: Re: Radiosity performance: thread count anomaly
Date: 19 Jan 2009 19:21:37
Message: <49751911@news.povray.org>
clipka <nomail@nomail> wrote:
> A multicore system is basically just a multiprocessor system, with the
> processors placed on a single die to (a) reduce costs, (b) share more
> components among the CPUs (e.g. cache) to reduce synchronization overhead, and
> (c) speed up synchronization of the remaining components by reducing signal
> path lengths.
Actually shared/non-shared caches can have a big effect and make a
notable difference between multiprocessor and multicore systems.
Sometimes a shared cache can be beneficial, especially if one single
program runs several threads, all of them sharing the same data. However,
sometimes a shared cache can be detrimental, especially for unrelated
processes which do not share data but must share cache space because
the cores don't have separate caches.
With POV-Ray I must assume that it benefits from a shared cache, or
at worst it is not hindered by it. (Given that most data POV-Ray 3.7
uses is read-only, it wouldn't make too much of a difference if each
core had its own independent cache.)
> Look again at the figures above:
> 1 core -> 293 seconds
> 4 cores -> 54 seconds
> Either my math is rusty, or this is a speed gain by more than the number of
> cores...
How many times was the test run? Was there a lot of variation?
It would be interesting if the test were made with something which takes
significantly longer to render (e.g. 15 minutes with 1 core or so).
--
- Warp
Warp <war### [at] tagpovrayorg> wrote:
> Actually shared/non-shared caches can have a big effect and make a
> notable difference between multiprocessor and multicore systems.
Sure, there is a performance impact related to caching; however, compared to
the ideal "N cores = N-fold performance" situation, a shared cache is a
non-hindrance at best - just like a non-shared cache is. Seen this way, neither
gives a performance *benefit*: both add overhead, which varies with how they
are used.
> With POV-Ray I must assume that it benefits from a shared cache, or
> at worst it is not hindered by it. (Given that most data POV-Ray 3.7
> uses is read-only, it wouldn't make too much of a difference if each
> core had its own independent cache.)
If we're talking about N*X MB shared by all threads vs. X MB per thread for N
threads, then I guess you're right that the shared N*X MB is of benefit for POV,
because more stuff fits into it. However, when talking about X MB shared by all
threads vs. X MB for each of N threads, the separate caches are probably of
benefit, because each thread has its own local data structures - stack, buffers
for optimization, and so on - which would reduce the space available for common
data in a shared cache.
> > Look again at the figures above:
>
> > 1 core -> 293 seconds
> > 4 cores -> 54 seconds
>
> > Either my math is rusty, or this is a speed gain by more than the number of
> > cores...
>
> How many times was the test run? Was there lot of variation?
Variation between different scenes - yes, lots of it. Some rendered in almost
identical time (talking about CPU time) regardless of the number of CPUs.
Variation in the render times themselves - nothing significant; something like
a swing of 5%, maybe 10%.
> It would be interesting it the test was made with something which takes
> significantly longer to render (eg. 15 minutes with 1 core or so.).
3 hours 47 minutes enough for your taste?
Compare the stats for rad_def_test.pov using the "IndoorHQ" settings:
****************************************************************************
4 cores:
Render Statistics
Image Resolution 800 x 600
----------------------------------------------------------------------------
Pixels: 550205 Samples: 71514 Smpls/Pxl: 0.13
Rays: 25547811 Saved: 0 Max Level: 800/600
----------------------------------------------------------------------------
Ray->Shape Intersection Tests Succeeded Percentage
----------------------------------------------------------------------------
Box 12875803 9499052 73.77
Cone/Cylinder 13638055 2543768 18.65
CSG Intersection 4454973 3421296 76.80
CSG Union 4454973 4034232 90.56
Plane 25547811 9317330 36.47
Sphere 26254590 25970944 98.92
Torus 4542688 4039987 88.93
Torus Bound 4542688 4265423 93.90
Bounding Box 413047062 60880325 14.74
----------------------------------------------------------------------------
Roots tested: 4265423 eliminated: 3179024
----------------------------------------------------------------------------
Radiosity samples calculated: 86116 (0.63 %)
Radiosity samples reused: 13643598
----------------------------------------------------------------------------
Radiosity (final) calculated: 44237 (0.48 %)
Radiosity (final) reused: 9152963
----------------------------------------------------------------------------
Pass Depth 0 Depth 1 Depth 2 Total
----------------------------------------------------------------------------
1 130 3440 2882 6452
2 475 3815 408 4698
3 1900 4762 247 6909
4 6386 4451 149 10986
5+ 9611 2894 329 12834
Final 35129 484 8624 44237
----------------------------------------------------------------------------
Total 53631 19846 12639 86116
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Render Time:
Photon Time: No photons
Radiosity Time: 0 hours 4 minutes 24 seconds (264.683 seconds)
using 20 thread(s) with 1577.354 CPU-seconds total
Trace Time: 0 hours 36 minutes 40 seconds (2200.203 seconds)
using 4 thread(s) with 7994.706 CPU-seconds total
POV-Ray finished
real 2595.37
user 9559.26
sys 7.86
****************************************************************************
1 core:
Render Statistics
Image Resolution 800 x 600
----------------------------------------------------------------------------
Pixels: 550205 Samples: 70785 Smpls/Pxl: 0.13
Rays: 25425517 Saved: 0 Max Level: 800/600
----------------------------------------------------------------------------
Ray->Shape Intersection Tests Succeeded Percentage
----------------------------------------------------------------------------
Box 12838763 9455865 73.65
Cone/Cylinder 13605760 2543618 18.70
CSG Intersection 4434377 3401653 76.71
CSG Union 4434377 4014406 90.53
Plane 25425517 9258652 36.41
Sphere 26132908 25858301 98.95
Torus 4495444 4001169 89.00
Torus Bound 4495444 4224994 93.98
Bounding Box 411171127 60673334 14.76
----------------------------------------------------------------------------
Roots tested: 4224994 eliminated: 3145389
----------------------------------------------------------------------------
Radiosity samples calculated: 86020 (0.63 %)
Radiosity samples reused: 13542466
----------------------------------------------------------------------------
Radiosity (final) calculated: 43905 (0.48 %)
Radiosity (final) reused: 9055291
----------------------------------------------------------------------------
Pass Depth 0 Depth 1 Depth 2 Total
----------------------------------------------------------------------------
1 130 3398 2844 6372
2 475 3775 387 4637
3 1900 4829 290 7019
4 6372 4463 472 11307
5+ 9590 2896 294 12780
Final 34818 490 8597 43905
----------------------------------------------------------------------------
Total 53285 19851 12884 86020
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Render Time:
Photon Time: No photons
Radiosity Time: 0 hours 39 minutes 13 seconds (2353.549 seconds)
using 5 thread(s) with 3330.858 CPU-seconds total
Trace Time: 3 hours 47 minutes 46 seconds (13666.809 seconds)
using 1 thread(s) with 13666.880 CPU-seconds total
POV-Ray finished
real 16998.36
user 16997.89
sys 0.43
****************************************************************************
Factor >6 here (16998 s / 2595 s is roughly 6.5), instead of the expected at
most 4.
I have to note, however, that in this case the results cannot be compared 100%:
the multi-core render was run with the fix for the mapped-and-transformed
texture issue, which turned out to have some impact on runtime, while the
single-core render was run before applying the fix, and I didn't bother to
re-run it yet. It doesn't change the general tendency, though.
Let's look at the problem from a different perspective: POV-Ray 3.7 beta 31
almost stalls when specifying recursion_limit above 1.
My CPU is an Intel quad core at 4 GHz running Vista. I rendered the above scene,
rad_def_test.pov, using the "Radiosity_OutdoorHQ" setting. As expected, 3.7
beta 31 was much faster than POV-Ray 3.6. But when I added the option
"recursion_limit 2", I get:
POV-Ray 3.6: Total time 561.40 seconds.
POV-Ray 3.7 beta 31: Total time 12960.69 seconds.
Yes, that is more than 23 times slower, despite the fact that 3.7 used all four
cores instead of one. I am sure I used the same rendering settings in both
versions:
512 x 384 +A0.3 +AM2 +R2.
The output images were identical.
Maybe this is a bug.
Correction:
POV-Ray 3.7 beta 31: Total *CPU* time 12960.69 seconds, total time 3335 seconds.
So the rendering took *23* times more CPU resources than POV-Ray 3.6, and *6*
times more execution time.
"grammophone" <eml### [at] ingr> wrote:
> Povray 3.6: Total time 561.40 seconds.
> Povray 3.7 beta 31: Total time 12960.69 seconds.
>
> Maybe this is a bug.
Not really - it's more of a known issue. The radiosity code needed quite an
overhaul to get rid of some ugliness, but one of the many replacement parts
(apparently the most delicate one) still needs some tuning.