povr_6e4ed6c2    Last October povr tarball
povr_0b0b91e5    Turning off all asserts for normal use compiles.
povr_1bc9c73e    Just prior to fltpt exception fix _w extra vec init
povr_9469d148    At the fltpt exception fix _w extra vec init
povrA (09126c05) Feb 8th but with extra vector zeroing. Bounding at doubles.
povrB (09126c05) Feb 8th without extra vector zeroing. Bounding at doubles.

All configured with:
...CXXFLAGS="-std=c++17 -O3 -ffast-math -march=native" --enable-lto

povrB is the only build that eliminates the extra vector component zeroing.
Eliminating the extra initializations was a performance win in every scene
measured (-0.1% to -4.7%).

The move to doubles, best seen in povr_1bc9c73e, is a win in many scenes
(-0.3% to -16.9%), but some simpler scenes, like povr's fog.pov, came up
slower with doubles (+1.8% to +2.9%). The reasons for this are not clear.

Of note too, the FS14 scene runs MUCH faster without any bounding at all
(-mb), because the spheres basically all overlap. There the bounding, no
matter how good, is all overhead.

The fix at povr_9469d148, which better handles floating point exceptions when
using -ffast-math, is clearly costly (+0.4% to +3.4%), though there may be
occasional exceptions. Scenes like FS14_3_v38.pov, which exercise the fixed
code more heavily, see a heavier impact. The universal gain from -ffast-math
is on the order of -6% to -10%, so being able to keep using that option
without side effects makes the fix a net win. Still, it is worth a look to see
whether the fix can be made faster, or whether the entire intersect_BBox
mechanism can be improved or changed in other ways that avoid the floating
point exceptions.
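
As a point of reference for that last question, here is a minimal sketch of
one common way a ray/box slab test can avoid dividing by a zero direction
component, which is a classic source of divide-by-zero and infinity trouble.
This is NOT povr's intersect_BBox code and not the povr_9469d148 fix; the
function name, argument layout and eps value are all hypothetical, and it is
written in Python only for brevity.

```python
import math


def ray_hits_box(origin, direction, box_min, box_max, eps=1e-12):
    """Slab test that never divides by a (near) zero direction component.

    Hypothetical illustration only; arguments are 3-tuples of floats.
    """
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < eps:
            # Ray is parallel to this pair of slabs: skip the division and
            # just check whether the origin lies between them.
            if o < lo or o > hi:
                return False
            continue
        t0 = (lo - o) / d
        t1 = (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval of ray parameters that is inside every slab.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True


if __name__ == "__main__":
    unit = ((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
    print(ray_hits_box((-5.0, 0.0, 0.0), (1.0, 0.0, 0.0), *unit))
    print(ray_hits_box((-5.0, 2.0, 0.0), (1.0, 0.0, 0.0), *unit))
```

The extra branch per axis is itself a cost, which is consistent with a guard
of this general kind being slower than an unguarded test. Note too that a C++
version built with -ffast-math could not safely rely on an infinity as the
starting far distance; a large finite value would be the safer assumption.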

intersect_BBox
--------------

The intersect_BBox mechanism itself is not lightweight performance-wise. A
couple of experiments follow.

The first experiment was with the sphere, which currently bypasses the
mechanism and uses the sphere intersection code directly. I turned ON the
intersect_BBox code for it and saw a slowdown of +5.6% on FS14.pov, so there
is significant overhead.

I wonder whether we've really identified all the shapes where it's a loss to
use the mechanism. This is perhaps especially a concern in povr, where
significant solver work has been done since the initial use/avoid
intersect_BBox choices were made in v3.7.
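
One way to frame which shapes lose from the mechanism is a simple expected
cost model. The sketch below is only that, a model: the cost units and the
hit probability are hypothetical inputs, not measurements from povr. With a
box pre-test costing c_box, a shape solver costing c_shape, and a fraction
p_hit of rays passing the box test, the pre-test only pays off when
c_box < (1 - p_hit) * c_shape.

```python
def expected_cost(c_box, c_shape, p_hit, use_pretest=True):
    """Expected per-ray cost, in arbitrary (hypothetical) units."""
    if not use_pretest:
        # Every ray goes straight to the shape's own solver.
        return c_shape
    # Every ray pays for the box test; only rays that hit the box go on to
    # pay for the shape solver as well.
    return c_box + p_hit * c_shape


def pretest_pays_off(c_box, c_shape, p_hit):
    """True when the box pre-test lowers the expected per-ray cost."""
    with_test = expected_cost(c_box, c_shape, p_hit, use_pretest=True)
    without_test = expected_cost(c_box, c_shape, p_hit, use_pretest=False)
    return with_test < without_test


if __name__ == "__main__":
    # Cheap solver (think sphere), most rays hit the box: pre-test loses.
    print(pretest_pays_off(c_box=1.0, c_shape=2.0, p_hit=0.9))
    # Expensive solver, half the rays miss the box: pre-test wins.
    print(pretest_pays_off(c_box=1.0, c_shape=50.0, p_hit=0.5))
```

Under this model a faster solver (smaller c_shape) or a tighter outer bounding
hierarchy (p_hit closer to 1 by the time rays reach the shape) both push a
shape toward the "bypass" side, which is why the original v3.7 choices may be
worth revisiting.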

In the second experiment I turned off the intersect_BBox for isosurfaces.
Isosurfaces are defined with contained_by bounding shapes of boxes or spheres.

The performance was nearly the same whether intersect_BBox was used or not!
More specifically, bypassing it ran slightly slower (+0.29% to +0.44%) where
the contained_by shape was a box. Where the isosurface had a sphere as the
contained_by shape, and that sphere was a tighter bound than a box would be,
bypassing the intersect_BBox code was faster (-0.31% to -0.35%).

I had expected bypassing intersect_BBox for isosurfaces would always be faster.

---

A surprise is that dropping the run-time assert code actually hurts
performance for SamBoxDiv_3.pov! My only guess at this point is that caching
is somehow in play. The asserts absolutely cost something to run, but are we
perhaps pulling data into the lower level caches that is needed for later
work in that one case?

Bounding method two (+bm2) was often faster. Where measured it paralleled
bounding method one with respect to the major performance changes since last
October's tarball.

Another surprise is that there seems to be some further speedup or slowdown
after povr_9469d148, depending upon the scene. This too I do not understand.

The results line up reasonably well with more direct performance comparisons
made with profiling, but the profiling is done with substantially different
compiler settings. The detailed profiling is very time-intensive, so only a
small collection of scenes was evaluated.


Flags: +w1200 +h900 +a0.0 +am2 +r3 -j -d -p -cc -fn -v
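
Reading the tables below: rows appear to follow the build order listed at the
top (povr_6e4ed6c2 first, povrB last). Recomputing the figures suggests the
two percentage columns are the change in user time and elapsed time relative
to the previous row, and the "(e ...)" note is the overall elapsed change from
the first row to the last. A minimal sketch of that arithmetic follows. The
function names are hypothetical, and the parser assumes the m:ss.ss elapsed
format seen in these tables.

```python
import re

# Matches e.g. "193.25user 0.04system 0:49.00elapsed 394%CPU"
TIME_RE = re.compile(r"([\d.]+)user\s+[\d.]+system\s+(\d+):([\d.]+)elapsed")


def parse_time_line(line):
    """Return (user_seconds, elapsed_seconds) from one `time` output line."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError("not a recognised time line: %r" % line)
    user = float(m.group(1))
    elapsed = 60 * int(m.group(2)) + float(m.group(3))
    return user, elapsed


def pct_change(old, new):
    """Percent change from old to new; negative means faster."""
    return (new - old) / old * 100.0


def row_deltas(lines):
    """(user %, elapsed %) for each row relative to the row before it."""
    rows = [parse_time_line(line) for line in lines]
    return [
        (pct_change(prev[0], cur[0]), pct_change(prev[1], cur[1]))
        for prev, cur in zip(rows, rows[1:])
    ]


def overall_elapsed(lines):
    """Elapsed percent change from the first row to the last row."""
    first = parse_time_line(lines[0])
    last = parse_time_line(lines[-1])
    return pct_change(first[1], last[1])


if __name__ == "__main__":
    sample = [
        "193.25user 0.04system 0:49.00elapsed 394%CPU",
        "190.66user 0.03system 0:48.36elapsed 394%CPU",
    ]
    for user_pct, elapsed_pct in row_deltas(sample):
        print("%+.2f%%  %+.2f%%" % (user_pct, elapsed_pct))
```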


FS14_3_v38.pov
--------------------------
193.25user 0.04system 0:49.00elapsed 394%CPU
190.66user 0.03system 0:48.36elapsed 394%CPU  -1.34%  -1.31%
185.03user 0.07system 0:46.95elapsed 394%CPU  -2.95%  -2.92%
191.23user 0.07system 0:48.55elapsed 393%CPU  +3.35%  +3.41%
192.16user 0.04system 0:48.77elapsed 394%CPU  +0.49%  +0.45%
187.65user 0.03system 0:47.63elapsed 394%CPU  -2.35%  -2.34%  (e -2.80%)

---Comparing to current state - last line above.
198.28user 0.04system 0:50.30elapsed 394%CPU  (No Intersect_BBox bypass e  +5.61%)
136.30user 0.04system 0:34.79elapsed 391%CPU  (-mb with above change e    -26.96%)
135.70user 0.03system 0:34.59elapsed 392%CPU  (-mb                        -27.38%)
--- v3.7 stable.
169.60user 0.04system 0:43.10elapsed 393%CPU  (Still -9.51% faster)
123.65user 0.04system 0:31.59elapsed 391%CPU  (-mb works v3.7 for spheres -27.09%)
--- v3.8 Beta 1.
207.49user 0.02system 0:52.56elapsed 394%CPU  (+10.35% to povr +21.95% to v3.7)
140.18user 0.03system 0:35.92elapsed 390%CPU  (-mb)

biscuit_3.pov
--------------------------
186.71user 0.03system 0:47.39elapsed 394%CPU
185.42user 0.07system 0:47.09elapsed 393%CPU  -0.69%  -0.63%
178.69user 0.04system 0:45.39elapsed 393%CPU  -3.63%  -3.61%
181.42user 0.06system 0:46.09elapsed 393%CPU  +1.53%  +1.54%
182.22user 0.07system 0:46.28elapsed 393%CPU  +0.44%  +0.41%
178.38user 0.08system 0:45.33elapsed 393%CPU  -2.11%  -2.05%  (e -4.35%)


SamBoxDiv_3.pov (A non-torus / non-polar version)
--------------------------
286.39user 0.08system 1:12.38elapsed 395%CPU
288.49user 0.12system 1:12.93elapsed 395%CPU   +0.73%  +0.76%
239.70user 0.10system 1:00.72elapsed 394%CPU  -16.91% -16.74%
241.15user 0.09system 1:01.28elapsed 393%CPU   +0.60%  +0.92%
239.00user 0.11system 1:00.57elapsed 394%CPU   -0.89%  -1.16%
238.69user 0.08system 1:00.46elapsed 394%CPU   -0.13%  -0.18% (e -16.47%)

SamBoxDiv_3.pov   (+bm2)
--------------------------
255.52user 0.08system 1:04.91elapsed 393%CPU  (-10.78% vs +bm1)
257.73user 0.07system 1:05.27elapsed 394%CPU   +0.86%  +0.55%
...
...
212.69user 0.05system 0:54.00elapsed 393%CPU  -17.48% -17.27%
212.34user 0.12system 0:53.90elapsed 394%CPU   -0.16%  -0.19% (e -16.96%)


rtr_kla.pov
--------------------------
25.09user 0.08system 0:07.52elapsed 334%CPU  8.75784 FPS
25.05user 0.10system 0:07.53elapsed 333%CPU  8.75912 FPS  -0.16%  +0.13%  +0.01%
24.98user 0.06system 0:07.46elapsed 335%CPU  8.81964 FPS  -0.28%  -0.93%  +0.69%
25.07user 0.10system 0:07.51elapsed 335%CPU  8.74253 FPS  +0.36%  +0.67%  -0.87%
24.97user 0.06system 0:07.47elapsed 335%CPU  8.79636 FPS  -0.40%  -0.53%  +0.62%
23.86user 0.05system 0:07.16elapsed 333%CPU  9.21517 FPS  -4.45%  -4.15%  +4.76%  (+5.22% faster fps)

(-mb)
36.10user 0.05system 0:10.47elapsed 345%CPU  6.13309 FPS
(+bm2)
23.61user 0.10system 0:07.13elapsed 332%CPU  9.27070 FPS  (+0.60% faster fps)


fog.pov
--------------------------
103.86user 0.05system 0:26.73elapsed 388%CPU
102.49user 0.03system 0:26.33elapsed 388%CPU  -1.32%  -1.50%
105.41user 0.06system 0:27.09elapsed 389%CPU  +2.85%  +2.89%
107.10user 0.07system 0:27.49elapsed 389%CPU  +1.60%  +1.48%
105.55user 0.06system 0:27.43elapsed 385%CPU  -1.45%  -0.22%
104.35user 0.14system 0:26.88elapsed 388%CPU  -1.14%  -2.01% (e overall +0.56%)

                                                -mb
103.67user 0.05system 0:26.63elapsed 389%CPU    219.21user 0.09system 0:55.51elapsed 395%CPU
102.80user 0.05system 0:26.43elapsed 389%CPU    219.22user 0.05system 0:55.51elapsed 394%CPU
105.68user 0.05system 0:27.14elapsed 389%CPU    213.63user 0.05system 0:54.11elapsed 394%CPU(* ?)
106.19user 0.06system 0:27.23elapsed 390%CPU    219.70user 0.07system 0:55.66elapsed 394%CPU

(*) - The float to doubles change is faster with bounding completely off(**),
a povr-only capability.
(**) - Off from the perspective of what the core shape/surface solvers see.
Some of the intersect_BBox overhead is always present.
