hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> Yep, I couldn't let these weird results go...
> ...
> As for why (2) where the raw SDL encoding is winning over an inbuilt
> compiled result. Looked at it with the linux perf profiling tools and it
> looks like when the pow() requests (done internally as exp()s and
> log()s) come at the hardware too fast, some are getting delayed. I see a
> big jump in pow() hardware cycle counts and under them irq and timer
> routines which are not there in the SDL hard coded case. Not sure if
> this 'hold up a minute' by the cpu is to control power or it's the way
> the hardware (an i3) handles too many overlapping pow() requests.
>
> Some better timing data below.
>
> I think going forward in povr ...
if it's any help, I'd be happy to compile 'povr'[*] and run your test scene(s)
on an i5, for comparison; also, can capture session(s) and send transcripts.
(assuming that if I configure + build under /tmp, povr will use installed v3.8
povray.{conf,ini} files)
[*] different configurations, if wanted.
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 07:26:01
Message: <5e9ae3c9@news.povray.org>
On 4/17/20 10:35 AM, jr wrote:
> hi,
>
> William F Pokorny <ano### [at] anonymousorg> wrote:
...
>
> if it's any help, I'd be happy to compile 'povr'[*] and run your test scene(s)
> on an i5, for comparison; also, can capture session(s) and send transcripts.
> (assuming that if I configure + build under /tmp, povr will use installed v3.8
> povray.{conf,ini} files)
>
> [*] different configurations, if wanted.
>
Of interest I think and easier for now would be if you (or others) could
run the attached v3.8 scene. You don't need povr to see or test for the
pow() pileup issue.
I'm thinking anyone on a system where the simd instructions are <=256
bits wide will probably see <= SDL speed for the inbuilt command though
it should be faster. Those with an avx512 instruction set cpu 'might' see
'really' fast results for both as IIRC with that set we get a hardware
exp() instruction.
Bill P.
//-------------------------------------------------
#version 3.8;
// Using recent v3.8, set +r<n> to get run times 60s+ maybe.
// Prefix the command with the system - not shell - time command.
//
// /usr/bin/time povray f_supreTest.pov +a0.0 +am1 +r2
// or
// \time povray f_supreTest.pov +a0.0 +am1 +r2
//
// Results for my Ubuntu 18.04 i3 system running the default 4
// threads below. v38 master at commit 74b3ebe, but any should do.
//
// The inbuilt result should be faster, but it's
// almost 24% slower for my system. User time.
//
global_settings { assumed_gamma 1 }
#declare Grey50 = srgb <0.5,0.5,0.5>;
background { color Grey50 }
#declare Camera00 = camera {
perspective
location <3,3,-3.001>
sky y
angle 35
right x*(image_width/image_height)
look_at <0,0,0>
}
#declare White = srgb <1,1,1>;
#declare Light00 = light_source { <50,150,-250>, White }
#declare Red = srgb <1,0,0>;
#declare CylinderX = cylinder { -1*x, 1*x, 0.01 pigment { Red } }
#declare Green = srgb <0,1,0>;
#declare CylinderY = cylinder { -1*y, 1*y, 0.01 pigment { Green } }
#declare Blue = srgb <0,0,1>;
#declare CylinderZ = cylinder { -1*z, 1*z, 0.01 pigment { Blue } }
#include "functions.inc"
// SDL coded version.
#declare EW = 1/3;
#declare NS = 1/4;
#declare P2 = (2.0/EW);
#declare P3 = EW*(1.0/NS);
#declare P4 = 2*(1/NS);
#declare P5 = (NS*0.5);
#declare Fn00 = function {
-1+pow((pow((pow(abs(x),P2)
+pow(abs(y),P2)),P3)
+pow(abs(z),P4)),P5)
}
#declare Iso99 = isosurface {
// function { Fn00(x,y,z) } // 154.544s
function { -f_superellipsoid(x,y,z,1/3,1/4) } // 191.359s +23.82%
contained_by { box { -2.0,2.0 } }
threshold 0
accuracy 0.0005
max_gradient 5.1
pigment { color Green }
}
//--- scene ---
camera { Camera00 }
light_source { Light00 }
object { CylinderX }
object { CylinderY }
object { CylinderZ }
object { Iso99 }
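For readers who want to check the arithmetic outside POV-Ray, the SDL-coded function above can be mirrored in a few lines of Python. This is only a sketch: the names `fn00` and `slowdown_pct` are made up for illustration, and it says nothing about how the inbuilt is implemented. The scene negates `f_superellipsoid()` to match `Fn00`, which suggests the inbuilt returns the opposite sign, but that is an inference from the scene and not checked against the POV-Ray source. Note the four pow() calls per evaluation, which is where the pileup discussed in this thread would come from.

```python
import math

EW = 1 / 3  # east-west exponent, as declared in the scene
NS = 1 / 4  # north-south exponent, as declared in the scene


def fn00(x, y, z, ew=EW, ns=NS):
    """Mirror of the scene's SDL Fn00: four pow() calls per evaluation."""
    p2 = 2.0 / ew
    p3 = ew * (1.0 / ns)
    p4 = 2 * (1 / ns)
    p5 = ns * 0.5
    inner = math.pow(abs(x), p2) + math.pow(abs(y), p2)
    return -1 + math.pow(math.pow(inner, p3) + math.pow(abs(z), p4), p5)


def slowdown_pct(baseline_seconds, other_seconds):
    """Percent by which other_seconds exceeds baseline_seconds."""
    return (other_seconds - baseline_seconds) / baseline_seconds * 100.0


if __name__ == "__main__":
    # Negative inside the shape, zero on the surface, positive outside.
    for point in [(0, 0, 0), (1, 0, 0), (2, 0, 0)]:
        print(point, fn00(*point))
    # The two timings quoted in the scene comments.
    print(round(slowdown_pct(154.544, 191.359), 2))
```

The last line reproduces the +23.82% figure from the timings in the scene comments.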
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> Of interest I think and easier for now would be if you (or others) could
> run the attached v3.8 scene.
see p.b.misc, same subject.
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 13:35:45
Message: <5e9b3a71$1@news.povray.org>
On 4/18/20 12:23 PM, jr wrote:
> hi,
>
> William F Pokorny <ano### [at] anonymousorg> wrote:
>> Of interest I think and easier for now would be if you (or others) could
>> run the attached v3.8 scene.
>
> see p.b.misc, same subject
>
Thank you. Interesting.
My 4th gen i3 at 22nm gives relative results a lot like your earlier 32nm
generation i3 results. My i3 is the same generation as your i5, but the
relative differences are larger. Oh! Your i5-4570 looks to be limited to
one thread per core, so yeah, that looks not too different from my 2
core results.
Looks to me like the performance difference more or less lines up with
what I see too - the SDL coded version is faster... Didn't say it
outright, but my guess is in the SDL method the pow()s are tossed at the
processor somewhat slower and so 'fewer/(none?)' are asked to wait for
some set time period.
Bill P.
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> On 4/18/20 12:23 PM, jr wrote:
> > ...
> Thank you. Interesting.
>
> My 4th gen i3 at 22nm gives relative results a lot like your earlier 32nm
> generation i3 results. My i3 is the same generation as your i5, but the
> relative differences are larger. Oh! Your i5-4570 looks to be limited to
> one thread per core, so yeah, that looks not too different from my 2
> core results.
yes. forgot to say, the 'povray.ini's on all machines set 'work_threads'. one
per core, except the goose which is set to '2', hence override.
> Looks to me like the performance difference more or less lines up with
> what I see too - the SDL coded version is faster... Didn't say it
> outright, but my guess is in the SDL method the pow()s are tossed at the
> processor somewhat slower and so 'fewer/(none?)' are asked to wait for
> some set time period.
would "spacing" with 'nanosleep(2)' help?
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 15:02:07
Message: <5e9b4eaf$1@news.povray.org>
On 4/18/20 2:12 PM, jr wrote:
> hi,
...
>
> would "spacing" with 'nanosleep(2)' help?
...
>
Thought about such things and, yes, expect something like that might help.
The solution I settled upon was to add a field to f_superellipsoid()
which lets me switch to a single float version of the code. The
hardware/alg/SIMD? lanes are wide enough that singles run fast like we'd
expect from an inbuilt. Except at the parameter edges (near zero, larger
value differences) of the EW,NS, it's working well enough that the
difference is impossible to spot unless you run value or image compares
of some kind. Single is nearly 2x faster than the SDL version and even
faster than the inbuilt at double float given the pow() bottleneck.
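A rough way to see why the single float version is hard to tell apart is to emulate it in Python. This is a sketch under assumptions, not the povr code: the helper names are hypothetical, it only uses the scene's EW=1/3 and NS=1/4, and single precision is approximated by rounding every input and intermediate through a 32-bit float with `struct`. It says nothing about speed, only about how far the values drift.

```python
import math
import struct


def f32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def shape_double(x, y, z, ew=1 / 3, ns=1 / 4):
    """The scene's Fn00 expression evaluated entirely in doubles."""
    inner = math.pow(abs(x), 2.0 / ew) + math.pow(abs(y), 2.0 / ew)
    outer = math.pow(inner, ew / ns) + math.pow(abs(z), 2.0 / ns)
    return -1.0 + math.pow(outer, ns * 0.5)


def shape_single(x, y, z, ew=1 / 3, ns=1 / 4):
    """Same expression with inputs and every intermediate rounded to single."""
    x, y, z, ew, ns = (f32(v) for v in (x, y, z, ew, ns))
    p2, p3 = f32(2.0 / ew), f32(ew / ns)
    p4, p5 = f32(2.0 / ns), f32(ns * 0.5)
    inner = f32(f32(math.pow(abs(x), p2)) + f32(math.pow(abs(y), p2)))
    outer = f32(f32(math.pow(inner, p3)) + f32(math.pow(abs(z), p4)))
    return f32(-1.0 + f32(math.pow(outer, p5)))


def max_difference(samples=(-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)):
    """Largest |single - double| over a small grid inside the scene's box."""
    return max(
        abs(shape_single(x, y, z) - shape_double(x, y, z))
        for x in samples for y in samples for z in samples
    )


if __name__ == "__main__":
    print(max_difference())
```

On this grid the largest difference stays well below the 0.0005 accuracy used in the test scene. The grid does not probe the parameter edges mentioned above.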
Trick helps enough, I wonder if some other inbuilts could benefit from a
float over double option too. But, I'm deleting many of the more obscure
built in functions(1). We have functions for shapes and 'things' that
are interesting to run - once - but not generally useful otherwise. Plus
the values and polarities are all over the place with them. Leaves not
many functions where the trick might apply.
Bill P.
(1) - Maybe at some point down the road I'll create a f_museum()
function and roll all of the obscure stuff into that one function by index.
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> Trick helps enough, I wonder if some other inbuilts could benefit from a
> float over double option too.
does the .. cost of extra speed, in context, matter so much? asking because
(and perhaps I'm completely off-track) only today was a post (by user 'guarnio')
where the problem is/was the range of float not being enough.
> But, I'm deleting many of the more obscure
> built in functions(1). We have functions for shapes and 'things' that
> are interesting to run - once - but not generally useful otherwise. Plus
> the values and polarities are all over the place with them. Leaves not
> many functions where the trick might apply.
>
> Bill P.
>
> (1) - Maybe at some point down the road I'll create a f_museum()
> function and roll all of the obscure stuff into that one function by index.
I think that if 'f_museum' is created first, and then various functions
"retired" there, they'll remain available at all times. ("v good" at voicing my
opinions :-))
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 19 Apr 2020 18:13:06
Message: <5e9cccf2$1@news.povray.org>
On 4/18/20 4:12 PM, jr wrote:
> hi,
>
> William F Pokorny <ano### [at] anonymousorg> wrote:
>> ...
>> Trick helps enough, I wonder if some other inbuilts could benefit from a
>> float over double option too.
>
> does the .. cost of extra speed, in context, matter so much? asking because
> (and perhaps I'm completely off-track) only today was a post (by user 'guarnio')
> where the problem is/was the range of float not being enough.
>
Not trying to be flippant, but I think it does when it does, and doesn't
when it doesn't. It's a judgement.
The scale and range of a scene with respect to accuracy as an issue is
always there relative to the accuracy you have available.
With functions and isosurfaces, the speed of even very fast inbuilt
functions matters because you mostly want to combine them with other
functions to create whatever. The performance of all those functions
mixed together mathematically is what can quickly get out of hand to the
point of being practically unusable performance wise.
With functions and isosurfaces, we already have an object with user
variable accuracy via the accuracy value passed which is often << 7/8
digits (I typically use 0.0005). I've done some limited testing of the
isosurface solver and - partly due to the types of functional input - it
cannot deliver more than 6-7 digits of accuracy as a rule, sometimes
less. With other object types and solvers you can get up in the 11/12
digit ranges, though often less. All at doubles.
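The digit counts above can be put next to what the two float formats carry. A small sketch, assuming IEEE 754 single and double (24 and 53 significand bits) and, as a simplification, treating the isosurface accuracy value as an absolute tolerance on a scene of roughly unit scale. The function names are hypothetical.

```python
import math


def decimal_digits(significand_bits):
    """Approximate decimal digits carried by a binary significand."""
    return significand_bits * math.log10(2)


def digits_requested(accuracy, scale=1.0):
    """Decimal digits implied by an absolute tolerance at a given scale."""
    return math.log10(scale / accuracy)


if __name__ == "__main__":
    print("single:", round(decimal_digits(24), 2))           # about 7.22
    print("double:", round(decimal_digits(53), 2))           # about 15.95
    print("accuracy 0.0005:", round(digits_requested(0.0005), 2))  # about 3.3
```

Under those assumptions an accuracy of 0.0005 asks for a little over three digits, comfortably inside what a single float carries.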
Relatedly, I believe in going after better performance continually in
software tools - otherwise you're on the slippery slope to poky. :-)
Bill P.
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> On 4/18/20 4:12 PM, jr wrote:
> > ... the problem is/was the range of float not being enough.
>
> Not trying to be flippant, but I think it does when it does, and doesn't
> when it doesn't. It's a judgement.
>
> The scale and range of a scene with respect to accuracy as an issue is
> always there relative to the accuracy you have available.
naively, I'd assumed some kind of upgrade/development "policy" that sees all
floats replaced with doubles, in time.
> ...
> Relatedly, I believe in going after better performance continually in
> software tools - otherwise you're on the slippery slope to poky. :-)
hmm, I probably "sit on the fence" on that. eg agree with you when it's a
compiler or other s/ware which has to take h/ware developments into account, but
kind of disagree for, say, programs not tied to h/ware, like 'sed'.
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 22 Apr 2020 08:30:34
Message: <5ea038ea@news.povray.org>
On 4/21/20 3:52 AM, jr wrote:
> hi,
>
...
>>
>> The scale and range of a scene with respect to accuracy as an issue is
>> always there relative to the accuracy you have available.
>
> naively, I'd assumed some kind of upgrade/development "policy" that sees all
> floats replaced with doubles, in time.
>
Maybe. I'm not aware of any such policy, but I'm not a core developer.
The code base is internally mostly at double floats. There are a few
places like bounding and color management where single floats get used.
Done to save storage in the former I think or where the additional
accuracy is of no practical value (to color results at least) in the
latter. On 'my' list to look at moving these to doubles.
For povr in the continuous pattern wave modification code I recently
moved a few pattern stored values from singles to doubles. Partly to
avoid the type conversions, but mostly because my grand plan is to flesh
out the function/pattern code so the interplay between functions and
patterns is as seamless as it can be. I didn't want functions modified
by a wave modifier to be getting single float parameters - in a way not
visible to the user - when the reasonable assumption is everything is at
double floats.
>> ...
>> Relatedly, I believe in going after better performance continually in
>> software tools - otherwise you're on the slippery slope to poky. :-)
>
> hmm, I probably "sit on the fence" on that. eg agree with you when it's a
> compiler or other s/ware which has to take h/ware developments into account, but
> kind of disagree for, say, programs not tied to h/ware, like 'sed'.
>
I'm with you I think. I failed to be clear (I 'was' too flippant :-)). I
am pushing for continual performance testing and especially an
unwillingness to take much slowdown due to changes over time without
compensating improvements somewhere.
What has happened intentionally - or not - with POV-Ray moving v37 to
v38 and the generic architecture compile shipped with linux
distributions is a 30-40% slow down with certain common types of scenes.
https://github.com/POV-Ray/povray/issues/363
This after running down a lot of stuff like dynamic casts in the ray
tracing code to recover performance seen in the benchmark scene.
In part, the benchmark scene covers only a small slice of functionality
in POV-Ray, and mostly this was all that was getting run for performance
testing.
I believe too, too many times we said this change is only a 1 or 2% slow
down... Do enough of those in a year and you are well on the way to poky
at year end. The 1 or 2% slowdown at year end is relative to current
performance. Many later changes, if looked at January 1st, might have
been rejected out of hand as being too much of a slow down.
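The compounding is easy to put numbers on. A small sketch, assuming a hypothetical twelve accepted changes in a year, each slowing things by the same fraction relative to the then-current runtime. The count of twelve and the function names are illustrative only.

```python
def runtime_multiplier(slowdown, changes):
    """Runtime relative to the start after a number of equal slowdowns."""
    return (1.0 + slowdown) ** changes


def last_step_vs_start(slowdown, changes):
    """Runtime added by the final change, as a fraction of the original runtime."""
    return runtime_multiplier(slowdown, changes) - runtime_multiplier(slowdown, changes - 1)


if __name__ == "__main__":
    for pct in (0.01, 0.02):
        total = runtime_multiplier(pct, 12)
        last = last_step_vs_start(pct, 12)
        print(f"{pct:.0%} x 12: runtime x{total:.3f}, last step costs {last:.2%} of the original")
```

Twelve 2% steps come to roughly 27% more runtime, and the last "2%" step on its own costs about 2.5% of the January 1st runtime, because it is measured against an already slower baseline.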
Aside: The GNU build methodology supports a code marking method for
hardware optimized versions of functions that get picked/set at 'load
time' depending upon your particular hardware or certain hardware
capabilities. Both compiler and hand optimized code can be implemented
in this way. Yes, this is a reason my personal povr version is headed to a
GNU only build(1) process. I want to play with this capability in povr
proper.
Bill P.
(1) - Our current vector template class looks to be somewhat in the way
of best 'compiler' hardware optimization...