On 4/15/20 12:55 PM, William F Pokorny wrote:
...
>
> (2) For reasons I do not understand it looks like coding
> f_superellipsoid as a macro is a lot faster than any inbuilt method. See
> (a -> b) and (c -> h). It works so do it in v38 as at least an option to
> normal function calls. 24.5% speed up. (Anyone have a thought as to why?
> We are passing fewer parameters, but that seems a stretch to explain the
> bulk of the difference. I've not dug.)
>
Yep, I couldn't let these weird results go...
First, I was careless in my initial C++ hard-coded-arguments compile in
choosing EW = 0.5 and NS = 0.5. Those values, of course, resolve some of
the pow() calls to pow(..., 1.0). So, all the performance values below
were re-done with EW = 1/3 and NS = 1/4, and the hard-coded-arguments
compile is a lot less 'wow' fast.
As for why, in (2), the raw SDL encoding is winning over an inbuilt
compiled result: I looked at it with the Linux perf profiling tools, and
it appears that when the pow() requests (done internally as exp()s and
log()s) come at the hardware too fast, some get delayed. I see a big
jump in pow() hardware cycle counts, and under them irq and timer
routines which are not there in the SDL hard-coded case. I'm not sure
whether this 'hold up a minute' by the CPU is there to control power, or
is just the way the hardware (an i3) handles too many overlapping pow()
requests.
Some better timing data below.
I think going forward in povr I am going to stick with the two
traditional arguments, but offer a switch to a float-over-double version
of the code, as this looks to be a big help - on my hardware at least,
perhaps not on all. Yes, the results are very slightly different (1/255)
values in my testing, but if the image looks OK and is 40%+ faster, it's
mostly a 'who cares', I think.
Aside: I looked at this with both g++ and clang. Here clang is a little
slower overall, but the strange 'relative' performance results hold for
both compilers. This aligns with my belief that this is a core-library
scheduling relative-to-the-hardware thing.
Optimization is, on modern hardware especially, a brutal business. I was
hoping, in looking at this, to find something in the vm implementation
which could be fixed for an across-the-board benefit, but it has turned
out to be something particular to doing so many pow()s at nearly the
same time.
(Lots of implications pop into my head for static compiles over dynamic
ones, and for performance changing as core library routines get
optimized for the most modern hardware... I am going to let these
questions go.)
Bill P.
clang results. Actual cpu time in seconds, not elapsed.
----------------------------
Function coded in SDL (pow). 4 precalc args.
1 --> 2.431
2 --> 2.448
3 --> 3.384
4 --> 4.137
Function inbuilt with single floats (powf). 4 precalc args.
1 --> 1.363 (2.431 -> 1.363 -43.93%)
2 --> 1.379 (2.448 -> 1.379 -43.67%)
3 --> 1.864 (3.384 -> 1.864 -44.92%)
4 --> 2.235 (4.137 -> 2.235 -45.98%)
Function inbuilt with double floats (pow). 4 precalc args.
1 --> 3.121 (2.431 -> 3.121 +28.38%)
2 --> 3.117 (2.448 -> 3.117 +27.33%)
3 --> 4.122 (3.384 -> 4.122 +21.81%)
4 --> 4.857 (4.137 -> 4.857 +17.40%)
--- Traditional f_superellipsoid passing 2 args EW, NS.
Function inbuilt with single floats (powf). 2 traditional args.
1 --> 1.383 (2.431 -> 1.383 -43.11%)
2 --> 1.391 (2.448 -> 1.391 -43.18%)
3 --> 1.908 (3.384 -> 1.908 -43.62%)
4 --> 2.307 (4.137 -> 2.307 -44.23%)
Function inbuilt with double floats (pow). 2 traditional args.
1 --> 3.114 (2.431 -> 3.114 +28.10%)
2 --> 3.125 (2.448 -> 3.125 +27.66%)
3 --> 4.148 (3.384 -> 4.148 +22.58%)
4 --> 4.879 (4.137 -> 4.879 +17.94%)
---- Hard coded the args internal to f_superellipsoid.
(Again, this is just a what-if, how-fast kind of thing.)
Function inbuilt with double floats (pow). No args.
1 --> 1.806 (2.431 -> 1.806 -25.71%)
2 --> 1.797 (2.448 -> 1.797 -27.22%)
3 --> 2.463 (3.384 -> 2.463 -27.22%)
4 --> 2.973 (4.137 -> 2.973 -28.14%)