povr_6e4ed6c2     Last October povr tarball
povr_0b0b91e5     Turning off all asserts for normal use compiles.
povr_1bc9c73e     Just prior to fltpt exception fix _w extra vec init
povr_9469d148     At the fltpt exception fix _w extra vec init
povrA (09126c05)  Feb 8th but with extra vector zeroing. Bounding at doubles.
povrB (09126c05)  Feb 8th without extra vector zeroing. Bounding at doubles.

All configured with:

...CXXFLAGS="-std=c++17 -O3 -ffast-math -march=native" --enable-lto

povrB is the only compile eliminating the extra vector component zeroing.

Eliminating the extra initializations is always a win for performance
(-0.1% to -4.7%).

The move to doubles, best seen in povr_1bc9c73e, is a win in many scenes
(-0.3% to -16.9%), but in testing some simpler scenes, like povr's fog.pov,
the change to doubles came up slower (+1.8% to +2.9%). The reasons for this
are not clear. Of note too, the FS14 scene runs MUCH faster without any
bounding at all - the spheres basically all overlap, so the bounding, no
matter how good, is all overhead.

It's clear the fix at povr_9469d148, to better handle floating point
exceptions when using -ffast-math, is costly (+0.4% to +3.4%). Though there
are occasional exceptions too? Scenes like FS14_3_v38.pov, making heavier use
of it, see a heavier impact. The universal gain for -ffast-math is on the
order of -6% to -10% - so being able to continue to use that option without
side effects makes the fix a win. However, it is worth a look to see if the
fix can be made faster - or whether the entire intersect_BBox mechanism can
be improved or changed in other ways that avoid the floating point
exceptions.

intersect_BBox
--------------
The intersect_BBox mechanism itself is not lightweight performance-wise.
A couple of experiments. First, with the sphere, which currently bypasses the
mechanism and uses the sphere intersection code straight up. I turned ON the
intersect_BBox code and saw a slowdown of +5.6% on FS14.pov - so there is
significant overhead.
I wonder if we've really identified all the shapes where it's a loss to use
the mechanism. This is perhaps especially true in povr, where significant
solver work has been done since the initial use/avoid intersect_BBox choices
were made in v3.7.

In the second experiment I turned off the intersect_BBox for isosurfaces.
Isosurfaces are defined with contained_by bounding shapes of boxes or
spheres. The performance was nearly the same whether intersect_BBox was used
or not! More specifically, bypassing it seems to run slightly slower
(+0.29% to +0.44%) where the contained_by shape was a box. Where the
isosurfaces had a sphere as the contained_by shape, and this sphere was a
better bounding than a box would be, bypassing the intersect_BBox code was
faster (-0.31% to -0.35%). I had expected bypassing intersect_BBox for
isosurfaces would always be faster.

---

A surprise is that dropping run time assert code actually hurts the
performance for SamBoxDiv_3.pov! My only guess at this point is that caching
is in play somehow. The asserts absolutely cost to run, but are we perhaps
getting data into lower level caches needed for later work in that one case?

Bounding method two (+bm2) was often faster. Where measured, it paralleled
bounding method one with respect to the major performance changes since last
October's tarball.

A surprise is that there seems to be some further speed up / slow down after
povr_9469d148, depending upon the scene. This too I do not understand.

The results line up reasonably well with more direct performance comparisons
made with profiling, but the profiling is done with substantially different
compiler settings. The detailed profiling is very time intensive, so only a
small collection of scenes was evaluated.
Flags: +w1200 +h900 +a0.0 +am2 +r3 -j -d -p -cc -fn -v

FS14_3_v38.pov
--------------------------
193.25user 0.04system 0:49.00elapsed 394%CPU
190.66user 0.03system 0:48.36elapsed 394%CPU  -1.34%  -1.31%
185.03user 0.07system 0:46.95elapsed 394%CPU  -2.95%  -2.92%
191.23user 0.07system 0:48.55elapsed 393%CPU  +3.35%  +3.41%
192.16user 0.04system 0:48.77elapsed 394%CPU  +0.49%  +0.45%
187.65user 0.03system 0:47.63elapsed 394%CPU  -2.35%  -2.34%  (e -2.80%)

---Comparing to current state - last line above.
198.28user 0.04system 0:50.30elapsed 394%CPU  (No Intersect_BBox bypass e +5.61%)
136.30user 0.04system 0:34.79elapsed 391%CPU  (-mb with above change e -26.96%)
135.70user 0.03system 0:34.59elapsed 392%CPU  (-mb -27.38%)

--- v3.7 stable.
169.60user 0.04system 0:43.10elapsed 393%CPU  (Still -9.51% faster)
123.65user 0.04system 0:31.59elapsed 391%CPU  (-mb works v3.7 for spheres -27.09%)

--- v3.8 Beta 1.
207.49user 0.02system 0:52.56elapsed 394%CPU  (+10.35% to povr +21.95% to v3.7)
140.18user 0.03system 0:35.92elapsed 390%CPU  (-mb)

biscuit_3.pov
--------------------------
186.71user 0.03system 0:47.39elapsed 394%CPU
185.42user 0.07system 0:47.09elapsed 393%CPU  -0.69%  -0.63%
178.69user 0.04system 0:45.39elapsed 393%CPU  -3.63%  -3.61%
181.42user 0.06system 0:46.09elapsed 393%CPU  +1.53%  +1.54%
182.22user 0.07system 0:46.28elapsed 393%CPU  +0.44%  +0.41%
178.38user 0.08system 0:45.33elapsed 393%CPU  -2.11%  -2.05%  (e -4.35%)

SamBoxDiv_3.pov (A non-torus / non-polar version)
--------------------------
286.39user 0.08system 1:12.38elapsed 395%CPU
288.49user 0.12system 1:12.93elapsed 395%CPU  +0.73%  +0.76%
239.70user 0.10system 1:00.72elapsed 394%CPU  -16.91%  -16.74%
241.15user 0.09system 1:01.28elapsed 393%CPU  +0.60%  +0.92%
239.00user 0.11system 1:00.57elapsed 394%CPU  -0.89%  -1.16%
238.69user 0.08system 1:00.46elapsed 394%CPU  -0.13%  -0.18%  (e -16.47%)

SamBoxDiv_3.pov (+bm2)
--------------------------
255.52user 0.08system 1:04.91elapsed 393%CPU  (-10.78% vs +bm1)
257.73user 0.07system 1:05.27elapsed 394%CPU  +0.86%  +0.55%
...
...
212.69user 0.05system 0:54.00elapsed 393%CPU  -17.48%  -17.27%
212.34user 0.12system 0:53.90elapsed 394%CPU  -0.16%  -0.19%  (e -16.96%)

rtr_kla.pov
--------------------------
25.09user 0.08system 0:07.52elapsed 334%CPU  8.75784 FPS
25.05user 0.10system 0:07.53elapsed 333%CPU  8.75912 FPS  -0.16%  +0.13%  +0.01%
24.98user 0.06system 0:07.46elapsed 335%CPU  8.81964 FPS  -0.28%  -0.93%  +0.69%
25.07user 0.10system 0:07.51elapsed 335%CPU  8.74253 FPS  +0.36%  +0.67%  -0.87%
24.97user 0.06system 0:07.47elapsed 335%CPU  8.79636 FPS  -0.40%  -0.53%  +0.62%
23.86user 0.05system 0:07.16elapsed 333%CPU  9.21517 FPS  -4.45%  -4.15%  +4.76%  (+5.22% faster fps)

(-mb)  36.10user 0.05system 0:10.47elapsed 345%CPU  6.13309 FPS
(+bm2) 23.61user 0.10system 0:07.13elapsed 332%CPU  9.27070 FPS  (+0.60% faster fps)

fog.pov
--------------------------
103.86user 0.05system 0:26.73elapsed 388%
102.49user 0.03system 0:26.33elapsed 388%  -1.32%  -1.50%
105.41user 0.06system 0:27.09elapsed 389%  +2.85%  +2.89%
107.10user 0.07system 0:27.49elapsed 389%  +1.60%  +1.48%
105.55user 0.06system 0:27.43elapsed 385%  -1.45%  -0.22%
104.35user 0.14system 0:26.88elapsed 388%  -1.14%  -2.01%  (e overall +0.56%)

                                              -mb
103.67user 0.05system 0:26.63elapsed 389%CPU  219.21user 0.09system 0:55.51elapsed 395%CPU
102.80user 0.05system 0:26.43elapsed 389%CPU  219.22user 0.05system 0:55.51elapsed 394%CPU
105.68user 0.05system 0:27.14elapsed 389%CPU  213.63user 0.05system 0:54.11elapsed 394%CPU (* ?)
106.19user 0.06system 0:27.23elapsed 390%CPU  219.70user 0.07system 0:55.66elapsed 394%CPU

(*)  - Float to doubles change is faster with bounding completely
       (povr only capability) off(**).
(**) - Off from the perspective of what the core shape/surface solvers see.
       Some of the intersect_BBox overhead is always present.
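
The percentages in the timing tables read as relative changes against the
preceding row (user time first, then elapsed), with the "(e ...)" figure being
the overall elapsed change from first row to last. A minimal Python sketch of
that arithmetic follows, assuming the rows are in the default summary format
of GNU time. The helper names parse_time_line and pct_change are hypothetical,
made up for this illustration, and are not part of povr or POV-Ray.

```python
import re

# Assumed input format: GNU time's default summary line, for example
# "193.25user 0.04system 0:49.00elapsed 394%CPU".
TIME_RE = re.compile(
    r"([\d.]+)user\s+[\d.]+system\s+(?:(\d+):)?(\d+):([\d.]+)elapsed"
)


def parse_time_line(line):
    """Return (user_seconds, elapsed_seconds) parsed from one timing row."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError("not a recognised timing row: %r" % line)
    user = float(m.group(1))
    hours = int(m.group(2) or 0)  # elapsed is [hours:]minutes:seconds
    minutes = int(m.group(3))
    seconds = float(m.group(4))
    return user, hours * 3600 + minutes * 60 + seconds


def pct_change(old, new):
    """Relative change from old to new in percent; negative means faster."""
    return (new - old) / old * 100.0


# First, second and last FS14_3_v38.pov rows copied from the tables above.
rows = [
    "193.25user 0.04system 0:49.00elapsed 394%CPU",
    "190.66user 0.03system 0:48.36elapsed 394%CPU",
    "187.65user 0.03system 0:47.63elapsed 394%CPU",
]
first, second, last = [parse_time_line(r) for r in rows]

# Row-to-row change, user then elapsed.
print("%+.2f%% %+.2f%%" % (pct_change(first[0], second[0]),
                           pct_change(first[1], second[1])))
# prints: -1.34% -1.31%

# Overall elapsed change, first row to last.
print("(e %+.2f%%)" % pct_change(first[1], last[1]))
# prints: (e -2.80%)
```

Working from elapsed seconds rather than the m:ss.ss text keeps rows such as
1:12.38 comparable with rows under a minute.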