POV-Ray : Newsgroups : povray.beta-test : v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
From: William F Pokorny
Subject: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 15 Apr 2020 12:55:44
Message: <5e973c90$1@news.povray.org>
In my povr branch I'm flipping the shape polarity of f_superellipsoid 
from positive to negative so it's a more standard implementation for a 
function.

On seeing that many of the values in the code could be constants, I
asked myself how fast we could go if we C++ compiled for a particular
superellipsoid. The answer is: a lot faster! See result (i) below. But
yeah, it's not realistic as a general approach; it looks to be
basically inlining enough of the function call to enable some compiler
optimizations.
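
For concreteness, a stand-alone C++ sketch of what I mean by compiling
for a particular superellipsoid. It mirrors the SDL Fn00 form below and
is not the actual fnintern code - the names are made up - but it shows
why baking EW and NS in as compile-time constants gives the optimizer
something to work with:

//---
#include <cmath>

// Generic form: the exponents arrive at run time, so nothing can be
// pre-folded and every call pays for the divisions and generic pow()s.
double superellipsoid_value(double x, double y, double z, double ew, double ns)
{
    const double p2 = 2.0 / ew;
    const double p3 = ew * (1.0 / ns);
    const double p4 = 2.0 * (1.0 / ns);
    const double p5 = ns * 0.5;
    return -1.0 + std::pow(std::pow(std::pow(std::fabs(x), p2)
                                  + std::pow(std::fabs(y), p2), p3)
                         + std::pow(std::fabs(z), p4), p5);
}

// Specialized form: EW and NS fixed at compile time, so the exponents are
// constants and the compiler is free to simplify and schedule the pow()s.
double superellipsoid_value_fixed(double x, double y, double z)
{
    constexpr double EW = 0.5, NS = 0.5;
    constexpr double p2 = 2.0 / EW;         // 4.0
    constexpr double p3 = EW * (1.0 / NS);  // 1.0
    constexpr double p4 = 2.0 * (1.0 / NS); // 4.0
    constexpr double p5 = NS * 0.5;         // 0.25
    return -1.0 + std::pow(std::pow(std::pow(std::fabs(x), p2)
                                  + std::pow(std::fabs(y), p2), p3)
                         + std::pow(std::fabs(z), p4), p5);
}
//---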

I then didn't leave well enough alone and proceeded down the rabbit
hole. The summary of this journey, with respect to v38, is basically:

(1) Adopt the shadow cache fixes in my solver branch, discussed in
github pull request #358. For single, expensive shapes like this one
especially, those shadow cache fixes speed things up. See (a -> c) and
(b -> h). 20+% speed up from v38 master. (For anyone unfamiliar with
the term, a conceptual sketch of the shadow-cache idea follows this
list.)

(2) For reasons I do not understand, it looks like coding
f_superellipsoid in SDL (as a function here; a macro would be better)
is a lot faster than any inbuilt method. See (a -> b) and (c -> h). It
works, so do it in v38, at least as an option alongside normal
function calls: a 24.5% speed up. (Anyone have a thought as to why? We
are passing fewer parameters, but that seems a stretch to explain the
bulk of the difference. I've not dug in.)

(3) Sort of a general educational / learning thing - and perhaps a
place where v38 could be better. I often code expressions in the
function call parameter positions. Where these can be SDL declare/local
constants, do the latter, as it's a lot faster. See (d -> c): a 12%
speed up. In principle the parser could fold those values itself on the
calls with enough smarts, but it doesn't, so the function VM looks to
be doing these evaluations on each call from the isosurface.
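
Just so the term is clear - a minimal, conceptual C++ sketch of the
classic shadow-cache idea. Types and names here are made up; this is
not POV-Ray's actual code, and it is not the specific #358 change
(which deals with avoiding double root evaluations), only the general
mechanism that change improves on:

//---
#include <vector>

struct Object
{
    virtual ~Object() = default;
    // True if this object blocks the current shadow ray.
    virtual bool BlocksShadowRay() const = 0;
};

struct Light
{
    const Object* shadow_cache = nullptr;  // last known blocker for this light
};

// Returns true if the shaded point is in shadow with respect to 'light'.
bool InShadow(Light& light, const std::vector<const Object*>& scene)
{
    // Cheap path: re-test the cached blocker first; shadow rays from nearby
    // pixels usually hit the same occluder, so this often succeeds.
    if (light.shadow_cache != nullptr && light.shadow_cache->BlocksShadowRay())
        return true;

    // Slow path: test the rest of the scene and refresh the cache.
    for (const Object* obj : scene)
    {
        if (obj != light.shadow_cache && obj->BlocksShadowRay())
        {
            light.shadow_cache = obj;
            return true;
        }
    }
    light.shadow_cache = nullptr;  // nothing blocks; clear the cache
    return false;
}
//---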

Some reference code / comments below.

Bill P.

//---
#declare EW = 0.5;
#declare NS = 0.5;
#declare P2 = (2.0/EW);
#declare P3 = EW*(1.0/NS);
#declare P4 = 2*(1/NS);
#declare P5 = (NS*0.5);
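// With EW = NS = 0.5 these resolve to P2 = 4, P3 = 1, P4 = 4 and P5 = 0.25.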
#declare Fn00 = function {
     -1+pow((pow((pow(abs(x),P2)
                 +pow(abs(y),P2)),P3)
                 +pow(abs(z),P4)),P5)
}
#declare Iso99 = isosurface {

//----- v38 master
//a) function { -f_superellipsoid(x,y,z,0.5,0.5) }
     //a) 268.56s, 274.586s  Extra negation? No.
     //a) Primarily one of my shadow cache fixes.
     //a) (double root evaluations)

//b) function { Fn00(x,y,z) }
     //b)  206.44s, 210.211s  // In v38 master too, a
     //b) macro would be better.

//------ povr
//c) function { f_superellipsoid(x,y,z,0,0.5,0.5,0,0) }
     //c) 207.772s  // See above. Better -22% from master.

//d) function { f_superellipsoid(x,y,z,1,(2.0/EW),
//             EW*(1.0/NS),2*(1/NS),(NS*0.5)) }
     //d) 237.771s  // Calcs in args. +13.70%

//e) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
     //e) 209.125s  // With conditional still slower. +0.65%

//f) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
     //f) 207.478s  // hard code conditional. -0.79%

//g) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
     //g) 206.670s  // Allocate new DBL vars. -0.53%

     function { Fn00(x,y,z) }  //h)
     //h) 156.363s  // Interesting. A macro would be better.

//i) function { f_superellipsoid(x,y,z,1,P2,P3,P4,P5) }
//i)  65.552s  // Compile with constants in place. 2-3x faster.

     contained_by { box { -2.0,2.0 } }
     threshold 0
     accuracy 0.0005
     max_gradient 5.1
     pigment { color Green }
}



From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 17 Apr 2020 09:23:49
Message: <5e99ade5$1@news.povray.org>
On 4/15/20 12:55 PM, William F Pokorny wrote:
...
> 
> (2) For reasons I do not understand, it looks like coding
> f_superellipsoid in SDL (as a function here; a macro would be better)
> is a lot faster than any inbuilt method. See (a -> b) and (c -> h). It
> works, so do it in v38, at least as an option alongside normal
> function calls: a 24.5% speed up. (Anyone have a thought as to why? We
> are passing fewer parameters, but that seems a stretch to explain the
> bulk of the difference. I've not dug in.)
> 

Yep, I couldn't let these weird results go...

First, I was stupid in doing my initial C++ hard-coded arguments
compile in choosing EW = 0.5 and NS = 0.5. These, of course, resolve to
calling some of the pow()s as pow(...,1.0). So, all the performance
values below were done with EW = 1/3 and NS = 1/4, and the hard-coded
arguments compile is a lot less 'wow' fast.

As for why, in (2), the raw SDL encoding is winning over an inbuilt
compiled result: I looked at it with the Linux perf profiling tools,
and it looks like when the pow() requests (done internally as exp()s
and log()s) come at the hardware too fast, some get delayed. I see a
big jump in pow() hardware cycle counts and, under them, irq and timer
routines which are not there in the SDL-coded case. I'm not sure
whether this 'hold up a minute' by the CPU is to control power or is
just the way the hardware (an i3) handles too many overlapping pow()
requests.
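
(For anyone wondering why pow() is the heavy hitter: for positive bases
it is essentially an exp() of a log(), so each call is two
transcendental evaluations. A minimal C++ illustration of the identity
only - real libm implementations add extra-precision steps and
special-case handling:)

//---
#include <cassert>
#include <cmath>

// For x > 0, pow(x, y) is mathematically exp(y * log(x)). This shows only
// the identity behind the cost, not how glibc actually codes it.
double pow_via_exp_log(double x, double y)
{
    assert(x > 0.0);
    return std::exp(y * std::log(x));
}
//---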

Some better timing data below.

I think going forward in povr I am going to stick with the two
traditional arguments, but offer a switch to a float-over-double
version of the code, as this looks to be a big help - on my hardware at
least, perhaps not on all. Yes, the results are very slightly different
- (1/255) values in my testing - but if it looks OK and is 40%+ faster,
it's mostly a 'who cares', I think.

Aside: I looked at this with both g++ and clang. Here clang is a little
slower overall, but the strange 'relative' performance results hold for
both compilers. This aligns with my belief that this is a matter of
core library scheduling relative to the hardware.

Optimization is, on modern hardware especially, a brutal business. I
was hoping, in looking at this, to find something in the VM
implementation which could be fixed for an across-the-board benefit,
but this has turned out to be something particular to doing so many
pow()s at nearly the same time.

(Lots of implications pop into my head for static compiles over dynamic
ones, and for performance changing as core library routines get
optimized for the most modern hardware... I am going to let these
questions go.)

Bill P.

clang results. Actual cpu time in seconds, not elapsed.
----------------------------

Function coded in SDL (pow). 4 precalc args.
1 -->  2.431
2 -->  2.448
3 -->  3.384
4 -->  4.137

Function inbuilt with single floats (powf). 4 precalc args.
1 -->  1.363  (2.431 -> 1.363  -43.93%)
2 -->  1.379  (2.448 -> 1.379  -43.67%)
3 -->  1.864  (3.384 -> 1.864  -44.92%)
4 -->  2.235  (4.137 -> 2.235  -45.98%)

Function inbuilt with double floats (pow).  4 precalc args.
1 -->  3.121  (2.431 -> 3.121  +28.38%)
2 -->  3.117  (2.448 -> 3.117  +27.33%)
3 -->  4.122  (3.384 -> 4.122  +21.81%)
4 -->  4.857  (4.137 -> 4.857  +17.40%)

--- Traditional f_superellipsoid passing 2 args EW, NS.

Function inbuilt with single floats (powf). 2 traditional args.
1 -->  1.383  (2.431 -> 1.383  -43.11%)
2 -->  1.391  (2.448 -> 1.391  -43.18%)
3 -->  1.908  (3.384 -> 1.908  -43.62%)
4 -->  2.307  (4.137 -> 2.307  -44.23%)

Function inbuilt with double floats (pow). 2 traditional args.
1 -->  3.114  (2.431 -> 3.114  +28.10%)
2 -->  3.125  (2.448 -> 3.125  +27.66%)
3 -->  4.148  (3.384 -> 4.148  +22.58%)
4 -->  4.879  (4.137 -> 4.879  +17.94%)

---- Hard coded the args internal to f_superellipsoid.
(Again, this is just a what-if, how-fast kind of thing.)

Function inbuilt with double floats (pow). No args.
1 -->  1.806  (2.431 -> 1.806  -25.71%)
2 -->  1.797  (2.448 -> 1.797  -27.22%)
3 -->  2.463  (3.384 -> 2.463  -27.22%)
4 -->  2.973  (4.137 -> 2.973  -28.14%)



From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 17 Apr 2020 09:27:28
Message: <5e99aec0$1@news.povray.org>
On 4/17/20 9:23 AM, William F Pokorny wrote:
> On 4/15/20 12:55 PM, William F Pokorny wrote:
> ...

And, dang it, I forgot to say that the 1, 2, 3, 4 in the left column is
the number of threads used for the CPU time measured.

Bill P.



From: jr
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 17 Apr 2020 10:40:01
Message: <web.5e99be1d1e176347827e2b3e0@news.povray.org>
hi,

William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> Yep, I couldn't let these weird results go...
> ...
> As for why, in (2), the raw SDL encoding is winning over an inbuilt
> compiled result: I looked at it with the Linux perf profiling tools,
> and it looks like when the pow() requests (done internally as exp()s
> and log()s) come at the hardware too fast, some get delayed. I see a
> big jump in pow() hardware cycle counts and, under them, irq and timer
> routines which are not there in the SDL-coded case. I'm not sure
> whether this 'hold up a minute' by the CPU is to control power or is
> just the way the hardware (an i3) handles too many overlapping pow()
> requests.
>
> Some better timing data below.
>
> I think going forward in povr ...

if it's any help, I'd be happy to compile 'povr'[*] and run your test scene(s)
on an i5, for comparison; also, can capture session(s) and send transcripts.
(assuming that if I configure + build under /tmp, povr will use installed v3.8
povray.{conf,ini} files)

[*] different configurations, if wanted.


regards, jr.



From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 07:26:01
Message: <5e9ae3c9@news.povray.org>
On 4/17/20 10:35 AM, jr wrote:
> hi,
> 
> William F Pokorny <ano### [at] anonymousorg> wrote:
...
> 
> if it's any help, I'd be happy to compile 'povr'[*] and run your test scene(s)
> on an i5, for comparison; also, can capture session(s) and send transcripts.
> (assuming that if I configure + build under /tmp, povr will use installed v3.8
> povray.{conf,ini} files)
> 
> [*] different configurations, if wanted.
> 

Of interest I think and easier for now would be if you (or others) could 
run the attached v3.8 scene. You don't need povr to see or test for the 
pow() pile-up issue.

I'm thinking anyone on a system where the SIMD instructions are <=256
bits wide will probably see the inbuilt run at or below the SDL speed,
though it should be faster. Those with an AVX-512 instruction set CPU
'might' see 'really' fast results for both, as IIRC with that set we
get a hardware exp() instruction.

Bill P.

//-------------------------------------------------
#version 3.8;
// Using recent v3.8, set +r<n> to get run times 60s+ maybe.
// Prefix the command with the system - not shell - time command.
//
// /usr/bin/time povray f_supreTest.pov +a0.0 +am1 +r2
//      or
// \time povray f_supreTest.pov +a0.0 +am1 +r2
//
// Results for my Ubuntu 18.04 i3 system running the default 4
// threads below. v38 master at commit 74b3ebe, but any should do.
//
// The inbuilt result should be faster, but it's
// almost 24% slower for my system. User time.
//

global_settings { assumed_gamma 1 }
#declare Grey50 = srgb <0.5,0.5,0.5>;
background { color Grey50 }
#declare Camera00 = camera {
     perspective
     location <3,3,-3.001>
     sky y
     angle 35
     right x*(image_width/image_height)
     look_at <0,0,0>
}
#declare White = srgb <1,1,1>;
#declare Light00 = light_source { <50,150,-250>, White }
#declare Red = srgb <1,0,0>;
#declare CylinderX = cylinder { -1*x, 1*x, 0.01 pigment { Red } }
#declare Green = srgb <0,1,0>;
#declare CylinderY = cylinder { -1*y, 1*y, 0.01 pigment { Green } }
#declare Blue = srgb <0,0,1>;
#declare CylinderZ = cylinder { -1*z, 1*z, 0.01 pigment { Blue } }

#include "functions.inc"

// SDL coded version.
#declare EW = 1/3;
#declare NS = 1/4;
#declare P2 = (2.0/EW);
#declare P3 = EW*(1.0/NS);
#declare P4 = 2*(1/NS);
#declare P5 = (NS*0.5);
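// With EW = 1/3 and NS = 1/4 these resolve to P2 = 6, P3 = 4/3, P4 = 8
// and P5 = 0.125.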
#declare Fn00 = function {
     -1+pow((pow((pow(abs(x),P2)
                 +pow(abs(y),P2)),P3)
                 +pow(abs(z),P4)),P5)
}

#declare Iso99 = isosurface {
// function { Fn00(x,y,z) }                       // 154.544s
    function { -f_superellipsoid(x,y,z,1/3,1/4) }  // 191.359s +23.82%
     contained_by { box { -2.0,2.0 } }
     threshold 0
     accuracy 0.0005
     max_gradient 5.1
     pigment { color Green }
}

//--- scene ---
     camera { Camera00 }
     light_source { Light00 }
     object { CylinderX }
     object { CylinderY }
     object { CylinderZ }
     object { Iso99 }



From: jr
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 12:25:01
Message: <web.5e9b29671e176347827e2b3e0@news.povray.org>
hi,

William F Pokorny <ano### [at] anonymousorg> wrote:
> Of interest I think and easier for now would be if you (or others) could
> run the attached v3.8 scene.

see p.b.misc, same subject.


regards, jr.



From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 13:35:45
Message: <5e9b3a71$1@news.povray.org>
On 4/18/20 12:23 PM, jr wrote:
> hi,
> 
> William F Pokorny <ano### [at] anonymousorg> wrote:
>> Of interest I think and easier for now would be if you (or others) could
>> run the attached v3.8 scene.
> 
> see p.b.misc, same subject
> 

Thank you. Interesting.

My 4th-gen i3 at 22nm gives relative results a lot like your earlier
32nm-generation i3 results. My i3 is the same generation as your i5,
but the relative differences are larger. Oh! Your i5-4570 looks to be
limited to one thread per core, so yeah, that looks not too different
from my 2-core results.

Looks to me like the performance difference more or less lines up with
what I see too - the SDL-coded version is faster... I didn't say it
outright, but my guess is that in the SDL method the pow()s are tossed
at the processor somewhat more slowly, and so fewer (none?) are asked
to wait for some set time period.

Bill P.



From: jr
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 14:15:00
Message: <web.5e9b42f21e176347827e2b3e0@news.povray.org>
hi,

William F Pokorny <ano### [at] anonymousorg> wrote:
> On 4/18/20 12:23 PM, jr wrote:
> > ...
> Thank you. Interesting.
>
> My 4th-gen i3 at 22nm gives relative results a lot like your earlier
> 32nm-generation i3 results. My i3 is the same generation as your i5,
> but the relative differences are larger. Oh! Your i5-4570 looks to be
> limited to one thread per core, so yeah, that looks not too different
> from my 2-core results.

yes.  forgot to say, the 'povray.ini's on all machines set 'work_threads': one
per core, except the goose, which is set to '2', hence the override.


> Looks to me like the performance difference more or less lines up with
> what I see too - the SDL-coded version is faster... I didn't say it
> outright, but my guess is that in the SDL method the pow()s are tossed
> at the processor somewhat more slowly, and so fewer (none?) are asked
> to wait for some set time period.

would "spacing" with 'nanosleep(2)' help?


regards, jr.



From: William F Pokorny
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 15:02:07
Message: <5e9b4eaf$1@news.povray.org>
On 4/18/20 2:12 PM, jr wrote:
> hi,
...
> 
> would "spacing" with 'nanosleep(2)' help?
...
> 

I thought about such things and, yes, I expect something like that might help.

The solution I settled upon was to add a field to f_superellipsoid()
which lets me switch to a single-float version of the code. The
hardware / algorithm / SIMD(?) lanes are wide enough that singles run
fast, like we'd expect from an inbuilt. Except at the parameter edges
of EW, NS (near zero, larger value differences), it's working well
enough that the difference is impossible to spot unless you run value
or image compares of some kind. Single float is nearly 2x faster than
the SDL version, and even faster than the inbuilt at double float,
given the pow() bottleneck.
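
To make that concrete, here is a sketch of the kind of single-precision
path I mean. Illustrative only - it is not the povr source and the name
is made up; with all-float arguments, std::pow / std::fabs resolve to
the single-precision overloads (the powf()/fabsf() path):

//---
#include <cmath>

// Same superellipsoid value as the double version, computed in single
// precision. Trades a little accuracy for speed.
float superellipsoid_value_sp(float x, float y, float z, float ew, float ns)
{
    const float p2 = 2.0f / ew;
    const float p3 = ew * (1.0f / ns);
    const float p4 = 2.0f * (1.0f / ns);
    const float p5 = ns * 0.5f;
    return -1.0f + std::pow(std::pow(std::pow(std::fabs(x), p2)
                                   + std::pow(std::fabs(y), p2), p3)
                          + std::pow(std::fabs(z), p4), p5);
}
//---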

The trick helps enough that I wonder whether some other inbuilts could
benefit from a float-over-double option too. But I'm deleting many of
the more obscure built-in functions(1). We have functions for shapes
and 'things' that are interesting to run - once - but not generally
useful otherwise. Plus, the values and polarities are all over the
place with them. That leaves not many functions where the trick might
apply.

Bill P.

(1) - Maybe at some point down the road I'll create a f_museum() 
function and roll all of the obscure stuff into that one function by index.



From: jr
Subject: Re: v3.8 Clean up TODOs. f_superellipsoid() / shadow cache.
Date: 18 Apr 2020 16:15:01
Message: <web.5e9b5f361e176347827e2b3e0@news.povray.org>
hi,

William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> The trick helps enough that I wonder whether some other inbuilts could
> benefit from a float-over-double option too.

does the .. cost of extra speed, in context, matter so much?  asking because
(and perhaps I'm completely off-track) only today there was a post (by user
'guarnio') where the problem is/was the range of float not being enough.

> But I'm deleting many of the more obscure built-in functions(1). We
> have functions for shapes and 'things' that are interesting to run -
> once - but not generally useful otherwise. Plus, the values and
> polarities are all over the place with them. That leaves not many
> functions where the trick might apply.
>
> Bill P.
>
> (1) - Maybe at some point down the road I'll create a f_museum()
> function and roll all of the obscure stuff into that one function by index.

I think that if 'f_museum' is created first, and then various functions
"retired" there, they'll remain available at all times.  ("v good" at voicing my
opinions :-))


regards, jr.



