From: William F Pokorny
Subject: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 15 Apr 2020 12:55:44
Message: <5e973c90$1@news.povray.org>
In my povr branch I'm flipping the shape polarity of f_superellipsoid
from positive to negative so it's a more standard implementation for a
function.
On seeing that many values in the code could be constants, I asked myself
how fast we could go if we C++ compiled for a particular superellipsoid.
The answer is a lot faster! See result (i) below. But yeah, not that
realistic as a general approach, as it looks to be basically inlining
enough of the function call to enable some compiler optimizations.
I then didn't leave well enough alone and proceeded down the rabbit
hole. The summary of this journey with respect to v38 is basically:
(1) Adopt the shadow cache fixes in my solver branch and discussed in
github pull request #358. For single, expensive shapes like this
especially, those shadow cache fixes speed things up. See (a -> c) and
(b -> h). 20+% speed up from v38 master.
(2) For reasons I do not understand it looks like coding
f_superellipsoid as a macro is a lot faster than any inbuilt method. See
(a -> b) and (c -> h). It works so do it in v38 as at least an option to
normal function calls. 24.5% speed up. (Anyone have a thought as to why?
We are passing fewer parameters, but that seems a stretch to explain the
bulk of the difference. I've not dug.)
(3) Sort of a general educational / learning thing - and perhaps a place
where v38 could be better. I often code equations in the function call
parameter positions. Where these can be SDL declare/local constants, do
the latter as it's a lot faster. See (d -> c) 12% speed up. Here the
parser, with enough smarts, should be able to fix the values itself on
the calls, but it doesn't, and so the function / vm looks to be doing
these evaluations on each call from the isosurface.
Some reference code / comments below.
Bill P.
//---
#declare EW = 0.5;
#declare NS = 0.5;
#declare P2 = (2.0/EW);
#declare P3 = EW*(1.0/NS);
#declare P4 = 2*(1/NS);
#declare P5 = (NS*0.5);
#declare Fn00 = function {
    -1+pow((pow((pow(abs(x),P2)
                +pow(abs(y),P2)),P3)
           +pow(abs(z),P4)),P5)
}
#declare Iso99 = isosurface {
//----- v38 master
//a) function { -f_superellipsoid(x,y,z,0.5,0.5) }
//a) 268.56s, 274.586s Extra negation? No.
//a) Primarily one of my shadow cache fixes.
//a) (double root evaluations)
//b) function { Fn00(x,y,z) }
//b) 206.44s, 210.211s // In v38 master too, a
//b) macro would be better.
//------ povr
//c) function { f_superellipsoid(x,y,z,0,0.5,0.5,0,0) }
//c) 207.772s // See above. Better -22% from master.
//d) function { f_superellipsoid(x,y,z,1,(2.0/EW),
// EW*(1.0/NS),2*(1/NS),(NS*0.5)) }
//d) 237.771s // Calcs in args. +13.70%
//e) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
//e) 209.125s // With conditional still slower. +0.65%
//f) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
//f) 207.478s // hard code conditional. -0.79%
//g) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
//g) 206.670s // Allocate new DBL vars. -0.53%
function { Fn00(x,y,z) } //h)
//h) 156.363s // Interesting. A macro would be better.
//i) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
//i) 65.552s // Compile with constants in place. 2-3x faster.
contained_by { box { -2.0,2.0 } }
threshold 0
accuracy 0.0005
max_gradient 5.1
pigment { color Green }
}
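Point (3) can be sketched outside SDL too. What follows is only a loose C++ analogy, not POV-Ray's function VM or the actual inbuilt: the names SuperExponents, make_exponents, superellipsoid and superellipsoid_recompute are made up for illustration. It uses the same P2..P5 expressions as the scene above and contrasts doing the exponent arithmetic once with redoing it on every sample, which is roughly what case (d) appears to pay for.

```cpp
#include <cassert>
#include <cmath>

// Exponents derived from the two traditional parameters EW and NS,
// using the same expressions as P2..P5 in the SDL scene.
struct SuperExponents {
    double p2, p3, p4, p5;
};

// Do the divisions once, outside the per-sample call.
SuperExponents make_exponents(double ew, double ns) {
    assert(ew > 0.0 && ns > 0.0);
    return {2.0 / ew, ew * (1.0 / ns), 2.0 * (1.0 / ns), ns * 0.5};
}

// Per-sample evaluation with precomputed exponents.
// Polarity as in Fn00: negative inside, zero on the surface.
double superellipsoid(double x, double y, double z, const SuperExponents& e) {
    const double xy = std::pow(std::fabs(x), e.p2) + std::pow(std::fabs(y), e.p2);
    return -1.0 + std::pow(std::pow(xy, e.p3) + std::pow(std::fabs(z), e.p4), e.p5);
}

// Same value, but the exponent arithmetic is redone on every call.
double superellipsoid_recompute(double x, double y, double z, double ew, double ns) {
    return superellipsoid(x, y, z, make_exponents(ew, ns));
}
```

In a real compiled build the optimizer may hoist some of this by itself; the SDL function VM evidently does not, which is the point of declaring P2..P5 up front.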
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 17 Apr 2020 09:23:49
Message: <5e99ade5$1@news.povray.org>
On 4/15/20 12:55 PM, William F Pokorny wrote:
...
>
> (2) For reasons I do not understand it looks like coding
> f_superellipsoid as a macro is a lot faster than any inbuilt method. See
> (a -> b) and (c -> h). It works so do it in v38 as at least an option to
> normal function calls. 24.5% speed up. (Anyone have a thought as to why?
> We are passing fewer parameters, but that seems a stretch to explain the
> bulk of the difference. I've not dug.)
>
Yep, I couldn't let these weird results go...
First, I was stupid in doing my initial C++ hard coded arguments compile
in choosing EW = 0.5 and NS = 0.5. These, of course, resolve to calling
some of the pow()s as pow(...,1.0). So, all the performance values below
were done with EW = 1/3 and NS = 1/4 and the hard coded arguments
compile is a lot less 'wow' faster.
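To make the pow(...,1.0) remark concrete, here is a small C++ sketch that evaluates the four exponents from the SDL declarations (P2 = 2/EW, P3 = EW/NS, P4 = 2/NS, P5 = NS/2) for a given EW and NS. The Exponents struct and both function names are illustrative only and are not taken from the POV-Ray source.

```cpp
#include <cassert>
#include <cstdio>

// The four exponents used by the SDL coded Fn00.
struct Exponents {
    double p2, p3, p4, p5;
};

// Same expressions as the #declare P2..P5 lines in the scene.
Exponents exponents_for(double ew, double ns) {
    assert(ew > 0.0 && ns > 0.0);
    return {2.0 / ew, ew * (1.0 / ns), 2.0 * (1.0 / ns), ns * 0.5};
}

// Print the exponents so degenerate cases are easy to spot.
void print_exponents(double ew, double ns) {
    const Exponents e = exponents_for(ew, ns);
    std::printf("EW=%g NS=%g -> P2=%g P3=%g P4=%g P5=%g\n",
                ew, ns, e.p2, e.p3, e.p4, e.p5);
}
```

With EW = NS = 0.5 the exponents come out as 4, 1, 4 and 0.25, so the P3 pow() is a pow(...,1.0). With EW = 1/3 and NS = 1/4 they are 6, 4/3, 8 and 1/8 (up to floating point rounding), so none of the four is a trivial power of one.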
As for why, in (2), the raw SDL encoding is winning over an inbuilt
compiled result: I looked at it with the linux perf profiling tools and it
looks like when the pow() requests (done internally as exp()s and
log()s) come at the hardware too fast, some are getting delayed. I see a
big jump in pow() hardware cycle counts and under them irq and timer
routines which are not there in the SDL hard coded case. Not sure if
this 'hold up a minute' by the cpu is to control power or it's the way
the hardware (an i3) handles too many overlapping pow() requests.
Some better timing data below.
I think going forward in povr I am going to stick with the two
traditional arguments, but offer a switch to get to a float over double
version of the code as this looks to be a big help - on my hardware at
least - perhaps not on all. Yes, the results are very slightly different
(1/255 value differences in my testing), but if it looks OK and is 40%+
faster it's mostly a who cares I think.
Aside: I looked at this with both g++ and clang. Here clang is a little
slower overall - but the strange overall 'relative' performance results
hold for both compilers. This aligns with my belief that this is a core
library scheduling relative to the hardware thing.
Optimization is, on modern hardware especially, a brutal business. I was
hoping in looking at this to find something in the vm implementation
which could be fixed for an across the board benefit, but this has
turned out to be something particular with doing so many pow()s at
nearly the same time.
(Lots of implications pop into my head for static compiles over dynamic
and changing performance as core library routines get optimized for the
most modern hardware... I am going to let these questions go.)
Bill P.
clang results. Actual cpu time in seconds, not elapsed.
----------------------------
Function coded in SDL (pow). 4 precalc args.
1 --> 2.431
2 --> 2.448
3 --> 3.384
4 --> 4.137
Function inbuilt with single floats (powf). 4 precalc args.
1 --> 1.363 (2.431 -> 1.363 -43.93%)
2 --> 1.379 (2.448 -> 1.379 -43.67%)
3 --> 1.864 (3.384 -> 1.864 -44.92%)
4 --> 2.235 (4.137 -> 2.235 -45.98%)
Function inbuilt with double floats (pow). 4 precalc args.
1 --> 3.121 (2.431 -> 3.121 +28.38%)
2 --> 3.117 (2.448 -> 3.117 +27.33%)
3 --> 4.122 (3.384 -> 4.122 +21.81%)
4 --> 4.857 (4.137 -> 4.857 +17.40%)
--- Traditional f_superellipsoid passing 2 args EW, NS.
Function inbuilt with single floats (powf). 2 traditional args.
1 --> 1.383 (2.431 -> 1.383 -43.11%)
2 --> 1.391 (2.448 -> 1.391 -43.18%)
3 --> 1.908 (3.384 -> 1.908 -43.62%)
4 --> 2.307 (4.137 -> 2.307 -44.23%)
Function inbuilt with double floats (pow). 2 traditional args.
1 --> 3.114 (2.431 -> 3.114 +28.10%)
2 --> 3.125 (2.448 -> 3.125 +27.66%)
3 --> 4.148 (3.384 -> 4.148 +22.58%)
4 --> 4.879 (4.137 -> 4.879 +17.94%)
---- Hard coded the args internal to f_superellipsoid.
(Again, this is just a what if, how fast, kinda thing)
Function inbuilt with double floats (pow). No args.
1 --> 1.806 (2.431 -> 1.806 -25.71%)
2 --> 1.797 (2.448 -> 1.797 -26.59%)
3 --> 2.463 (3.384 -> 2.463 -27.22%)
4 --> 2.973 (4.137 -> 2.973 -28.14%)
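For anyone wanting to redo the arithmetic on their own timings: the bracketed percentages are the relative change in cpu time against the SDL coded baseline at the same row. A minimal C++ helper, with the name percent_change being an illustrative choice:

```cpp
#include <cassert>

// Relative change of `measured` against `baseline`, in percent.
// Negative means less cpu time than the baseline, i.e. faster.
double percent_change(double baseline, double measured) {
    assert(baseline > 0.0);
    return (measured - baseline) / baseline * 100.0;
}
```

For example, percent_change(2.431, 1.363) is about -43.93 and percent_change(2.431, 3.121) is about +28.38, matching the first rows of the single float and double float blocks.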
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 17 Apr 2020 09:27:28
Message: <5e99aec0$1@news.povray.org>
On 4/17/20 9:23 AM, William F Pokorny wrote:
> On 4/15/20 12:55 PM, William F Pokorny wrote:
> ...
And, dang it, forgot to say the 1,2,3,4 in the left columns is the
number of threads used for the cpu time measured.
Bill P.
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> Yep, I couldn't let these weird results go...
> ...
> As for why (2) where the raw SDL encoding is winning over an inbuilt
> compiled result. Looked at it with the linux perf profiling tools and it
> looks like when the pow() requests (done internally as exp()s and
> log()s) come at the hardware too fast, some are getting delayed. I see a
> big jump in pow() hardware cycle counts and under them irq and timer
> routines which are not there in the SDL hard coded case. Not sure if
> this 'hold up a minute' by the cpu is to control power or it's the way
> the hardware (an i3) handles too many overlapping pow() requests.
>
> Some better timing data below.
>
> I think going forward in povr ...
if it's any help, I'd be happy to compile 'povr'[*] and run your test scene(s)
on an i5, for comparison; also, can capture session(s) and send transcripts.
(assuming that if I configure + build under /tmp, povr will use installed v3.8
povray.{conf,ini} files)
[*] different configurations, if wanted.
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 07:26:01
Message: <5e9ae3c9@news.povray.org>
On 4/17/20 10:35 AM, jr wrote:
> hi,
>
> William F Pokorny <ano### [at] anonymousorg> wrote:
...
>
> if it's any help, I'd be happy to compile 'povr'[*] and run your test scene(s)
> on an i5, for comparison; also, can capture session(s) and send transcripts.
> (assuming that if I configure + build under /tmp, povr will use installed v3.8
> povray.{conf,ini} files)
>
> [*] different configurations, if wanted.
>
Of interest, I think, and easier for now would be if you (or others) could
run the attached v3.8 scene. You don't need povr to see or test for the
pow() pileup issue.
I'm thinking anyone on a system where the simd instructions are <=256
bits wide will probably see <= SDL speed for the inbuilt command though
it should be faster. Those with an avx512 instruction set cpu 'might' see
'really' fast results for both as IIRC with that set we get a hardware
exp() instruction.
Bill P.
//-------------------------------------------------
#version 3.8;
// Using recent v3.8, set +r<n> to get run times 60s+ maybe.
// Prefix the command with the system - not shell - time command.
//
// /usr/bin/time povray f_supreTest.pov +a0.0 +am1 +r2
// or
// \time povray f_supreTest.pov +a0.0 +am1 +r2
//
// Results for my Ubuntu 18.04 i3 system running the default 4
// threads below. v38 master at commit 74b3ebe, but any should do.
//
// The inbuilt result should be faster, but it's
// almost 24% slower for my system. User time.
//
global_settings { assumed_gamma 1 }
#declare Grey50 = srgb <0.5,0.5,0.5>;
background { color Grey50 }
#declare Camera00 = camera {
perspective
location <3,3,-3.001>
sky y
angle 35
right x*(image_width/image_height)
look_at <0,0,0>
}
#declare White = srgb <1,1,1>;
#declare Light00 = light_source { <50,150,-250>, White }
#declare Red = srgb <1,0,0>;
#declare CylinderX = cylinder { -1*x, 1*x, 0.01 pigment { Red } }
#declare Green = srgb <0,1,0>;
#declare CylinderY = cylinder { -1*y, 1*y, 0.01 pigment { Green } }
#declare Blue = srgb <0,0,1>;
#declare CylinderZ = cylinder { -1*z, 1*z, 0.01 pigment { Blue } }
#include "functions.inc"
// SDL coded version.
#declare EW = 1/3;
#declare NS = 1/4;
#declare P2 = (2.0/EW);
#declare P3 = EW*(1.0/NS);
#declare P4 = 2*(1/NS);
#declare P5 = (NS*0.5);
#declare Fn00 = function {
    -1+pow((pow((pow(abs(x),P2)
                +pow(abs(y),P2)),P3)
           +pow(abs(z),P4)),P5)
}
#declare Iso99 = isosurface {
// function { Fn00(x,y,z) } // 154.544s
function { -f_superellipsoid(x,y,z,1/3,1/4) } // 191.359s +23.82%
contained_by { box { -2.0,2.0 } }
threshold 0
accuracy 0.0005
max_gradient 5.1
pigment { color Green }
}
//--- scene ---
camera { Camera00 }
light_source { Light00 }
object { CylinderX }
object { CylinderY }
object { CylinderZ }
object { Iso99 }
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> Of interest I think and easier for now would be if you (or others) could
> run the attached v3.8 scene.
see p.b.misc, same subject.
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 13:35:45
Message: <5e9b3a71$1@news.povray.org>
On 4/18/20 12:23 PM, jr wrote:
> hi,
>
> William F Pokorny <ano### [at] anonymousorg> wrote:
>> Of interest I think and easier for now would be if you (or others) could
>> run the attached v3.8 scene.
>
> see p.b.misc, same subject
>
Thank you. Interesting.
My 4th gen i3 (22nm) relative results look a lot like your earlier 32nm
generation i3 results. My i3 is the same generation as your i5, but the
relative differences are larger. Oh! Your i5-4570 looks to be limited to
one thread per core, so yeah, that looks not too different from my 2
core results.
Looks to me like the performance difference, more or less, lines up with
what I see too - the SDL coded version is faster... Didn't say it
outright, but my guess is in the SDL method the pow()s are tossed at the
processor somewhat slower and so 'fewer/(none?)' are asked to wait for
some set time period.
Bill P.
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> On 4/18/20 12:23 PM, jr wrote:
> > ...
> Thank you. Interesting.
>
> My 4th gen i3 at 22nm relative results a lot like your earlier 32nm
> generation i3 results. My i3 the same generation as your i5, but the
> relative differences are larger. Oh! your i5-4570 looks to be limited to
> one thread per core, so yeah, that looks not too different than my 2
> core results.
yes. forgot to say, the 'povray.ini's on all machines set 'work_threads'. one
per core, except the goose which is set to '2', hence override.
> Looks to me, more or less, lines up performance difference with what I
> see too - the SDL coded version is faster... Didn't say it outright, but
> my guess is in the SDL method the pow()s are tossed at the processor
> somewhat slower and so 'fewer/(none?)' are asked to wait for some set
> time period.
would "spacing" with 'nanosleep(2)' help?
regards, jr.
From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 15:02:07
Message: <5e9b4eaf$1@news.povray.org>
On 4/18/20 2:12 PM, jr wrote:
> hi,
...
>
> would "spacing" with 'nanosleep(2)' help?
...
>
Thought about such things and, yes, expect something like that might help.
The solution I settled upon was to add a field to f_superellipsoid()
which lets me switch to a single float version of the code. The
hardware/alg/SIMD? lanes are wide enough that singles run fast like we'd
expect from an inbuilt. Except at the parameter edges (near zero, larger
value differences) of the EW,NS, it's working well enough that the
difference is impossible to spot unless you run value or image compares
of some kind. Single is nearly 2x faster than the SDL version and even
faster than the inbuilt at double float given the pow() bottleneck.
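A rough C++ sketch of what such a switch might look like follows. This is not the povr code: the kernel is just the Fn00 formula from the test scene written as a template, and f_superellipsoid_sketch with its use_single argument is a hypothetical stand-in for the extra field. With T = float, std::pow resolves to the single precision overload (the powf() path); with T = double it is the usual pow().

```cpp
#include <cassert>
#include <cmath>

// Kernel written once and instantiated for float and double.
// Polarity as in Fn00: negative inside, zero on the surface.
template <typename T>
T superellipsoid_kernel(T x, T y, T z, T ew, T ns) {
    const T p2 = T(2) / ew;
    const T p3 = ew * (T(1) / ns);
    const T p4 = T(2) * (T(1) / ns);
    const T p5 = ns * T(0.5);
    const T xy = std::pow(std::fabs(x), p2) + std::pow(std::fabs(y), p2);
    return T(-1) + std::pow(std::pow(xy, p3) + std::pow(std::fabs(z), p4), p5);
}

// Hypothetical extra argument `use_single` selects the float path;
// the result is widened back to double for the caller.
double f_superellipsoid_sketch(double x, double y, double z,
                               double ew, double ns, bool use_single) {
    assert(ew > 0.0 && ns > 0.0);
    if (use_single) {
        return static_cast<double>(superellipsoid_kernel<float>(
            static_cast<float>(x), static_cast<float>(y), static_cast<float>(z),
            static_cast<float>(ew), static_cast<float>(ns)));
    }
    return superellipsoid_kernel<double>(x, y, z, ew, ns);
}
```

Calling it twice at the same point with use_single set to true and then false gives a quick value compare of the kind mentioned above. Whether the float path is actually faster will depend on the hardware and math library, as the timings in this thread suggest.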
Trick helps enough, I wonder if some other inbuilts could benefit from a
float over double option too. But, I'm deleting many of the more obscure
built in functions(1). We have functions for shapes and 'things' that
are interesting to run - once - but not generally useful otherwise. Plus
the values and polarities are all over the place with them. Leaves not
many functions where the trick might apply.
Bill P.
(1) - Maybe at some point down the road I'll create a f_museum()
function and roll all of the obscure stuff into that one function by index.
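Purely as an illustration of the "by index" idea, here is a tiny C++ dispatch table. Nothing here comes from povr: the two exhibit functions are placeholders (the thread does not say which inbuilts would be retired), and f_museum_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

using ScalarFn = double (*)(double, double, double);

// Placeholder entries standing in for whatever obscure functions get retired.
double exhibit_sphere(double x, double y, double z) {
    return std::sqrt(x * x + y * y + z * z) - 1.0;
}
double exhibit_plane(double, double y, double) {
    return y;
}

// One table, one entry per retired function; the index is the exhibit number.
const ScalarFn kMuseum[] = {exhibit_sphere, exhibit_plane};
const std::size_t kMuseumSize = sizeof(kMuseum) / sizeof(kMuseum[0]);

// Single entry point that selects the retired function by index.
double f_museum_sketch(std::size_t index, double x, double y, double z) {
    assert(index < kMuseumSize);
    return kMuseum[index](x, y, z);
}
```

The upside of a table like this is that retiring a function only adds a row, so old scenes could be pointed at an index instead of a name.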
hi,
William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> Trick helps enough, I wonder if some other inbuilts could benefit from a
> float over double option too.
does the .. cost of extra speed, in context, matter so much? asking because
(and perhaps I'm completely off-track) only today was a post (by user 'guarnio')
where the problem is/was the range of float not being enough.
> But, I'm deleting many of the more obscure
> built in functions(1). We have functions for shapes and 'things' that
> are interesting to run - once - but not generally useful otherwise. Plus
> the values and polarities are all over the place with them. Leaves not
> many functions where the trick might apply.
>
> Bill P.
>
> (1) - Maybe at some point down the road I'll create a f_museum()
> function and roll all of the obscure stuff into that one function by index.
I think that if 'f_museum' is created first, and then various functions
"retired" there, they'll remain available at all times. ("v good" at voicing my
opinions :-))
regards, jr.