On 4/15/20 12:55 PM, William F Pokorny wrote:
...
>
> (2) For reasons I do not understand it looks like coding
> f_superellipsoid as a macro is a lot faster than any inbuilt method. See
> (a -> b) and (c -> h). It works so do it in v38 as at least an option to
> normal function calls. 24.5% speed up. (Anyone have a thought as to why?
> We are passing fewer parameters, but that seems a stretch to explain the
> bulk of the difference. I've not dug.)
>
Yep, I couldn't let these weird results go...
First, I was careless in my initial C++ hard-coded-arguments compile in
choosing EW = 0.5 and NS = 0.5. Those values, of course, resolve some of
the pow() calls to pow(..., 1.0). So, all the performance values below
were re-done with EW = 1/3 and NS = 1/4, and the hard-coded-arguments
compile is a lot less 'wow' fast.
As for why, in (2), the raw SDL encoding is winning over an inbuilt
compiled result: I looked at it with the Linux perf profiling tools, and
it appears that when the pow() requests (done internally as exp()s and
log()s) come at the hardware too fast, some get delayed. I see a big
jump in pow() hardware cycle counts, and under them irq and timer
routines which are not there in the SDL hard-coded case. I'm not sure
whether this 'hold up a minute' by the CPU is there to control power, or
is just the way the hardware (an i3) handles too many overlapping pow()
requests.
Some better timing data below.
I think going forward in povr I am going to stick with the two
traditional arguments, but offer a switch to a float-over-double version
of the code, as this looks to be a big help - on my hardware at least,
perhaps not on all. Yes, the results are very slightly different (1/255)
values in my testing, but if the image looks OK and is 40%+ faster, it's
mostly a 'who cares', I think.
Aside: I looked at this with both g++ and clang. Here clang is a little
slower overall, but the strange 'relative' performance results hold for
both compilers. This aligns with my belief that this is a core-library
scheduling relative-to-the-hardware thing.
Optimization is, on modern hardware especially, a brutal business. I was
hoping, in looking at this, to find something in the vm implementation
which could be fixed for an across-the-board benefit, but it has turned
out to be something particular to doing so many pow()s at nearly the
same time.
(Lots of implications pop into my head for static compiles over dynamic
ones, and for performance changing as core library routines get
optimized for the most modern hardware... I am going to let these
questions go.)
Bill P.
clang results. Actual cpu time in seconds, not elapsed.
----------------------------
Function coded in SDL (pow). 4 precalc args.
1 --> 2.431
2 --> 2.448
3 --> 3.384
4 --> 4.137
Function inbuilt with single floats (powf). 4 precalc args.
1 --> 1.363 (2.431 -> 1.363 -43.93%)
2 --> 1.379 (2.448 -> 1.379 -43.67%)
3 --> 1.864 (3.384 -> 1.864 -44.92%)
4 --> 2.235 (4.137 -> 2.235 -45.98%)
Function inbuilt with double floats (pow). 4 precalc args.
1 --> 3.121 (2.431 -> 3.121 +28.38%)
2 --> 3.117 (2.448 -> 3.117 +27.33%)
3 --> 4.122 (3.384 -> 4.122 +21.81%)
4 --> 4.857 (4.137 -> 4.857 +17.40%)
--- Traditional f_superellipsoid passing 2 args EW, NS.
Function inbuilt with single floats (powf). 2 traditional args.
1 --> 1.383 (2.431 -> 1.383 -43.11%)
2 --> 1.391 (2.448 -> 1.391 -43.18%)
3 --> 1.908 (3.384 -> 1.908 -43.62%)
4 --> 2.307 (4.137 -> 2.307 -44.23%)
Function inbuilt with double floats (pow). 2 traditional args.
1 --> 3.114 (2.431 -> 3.114 +28.10%)
2 --> 3.125 (2.448 -> 3.125 +27.66%)
3 --> 4.148 (3.384 -> 4.148 +22.58%)
4 --> 4.879 (4.137 -> 4.879 +17.94%)
---- Hard coded the args internal to f_superellipsoid.
(Again, this is just a what-if, how-fast kind of thing.)
Function inbuilt with double floats (pow). No args.
1 --> 1.806 (2.431 -> 1.806 -25.71%)
2 --> 1.797 (2.448 -> 1.797 -27.22%)
3 --> 2.463 (3.384 -> 2.463 -27.22%)
4 --> 2.973 (4.137 -> 2.973 -28.14%)