POV-Ray : Newsgroups : povray.unofficial.patches : SSE2 optymalization of Intersect_Triangle function

From: raven
Subject: SSE2 optymalization of Intersect_Triangle function
Date: 26 Jan 2005 06:20:00
Message: <web.41f77c5d8db80ef05400db670@news.povray.org>

We made this small patch for POV-Ray. It is an SSE2-optimized version of the
Intersect_Triangle function.

You can find the source code and a readme file here:
http://www.povray.republika.pl/Intersect_Triangle_SSE2_patch.zip

We appreciate any comments and suggestions.

Enjoy.

From: Warp
Subject: Re: SSE2 optymalization of Intersect_Triangle function
Date: 26 Jan 2005 07:05:59
Message: <41f787a7@news.povray.org>

It would be nice if you posted some measurements of the speed improvements
this patch provides.

-- 
#macro N(D)#if(D>99)cylinder{M()#local D=div(D,104);M().5,2pigment{rgb M()}}
N(D)#end#end#macro M()<mod(D,13)-6,mod(div(D,13),8)-3,10>#end blob{
N(11117333955)N(4254934330)N(3900569407)N(7382340)N(3358)N(970)}//  - Warp -

From: Nicolas Calimet
Subject: Re: SSE2 optymalization of Intersect_Triangle function
Date: 26 Jan 2005 10:00:48
Message: <41f7b0a0@news.povray.org>

> We made this small patch for POV-Ray. It is an SSE2-optimized version of the
> Intersect_Triangle function.

	As Warp suggests, it would be really interesting to see an
actual demonstration of the speedup from using your assembler code
rather than the code generated by gcc on a Pentium 4 machine (with the
optimization flags that ./configure sets for it).  You claim a 20% speedup,
which seems reasonable but needs to be supported by reproducible
test cases.  I will try it myself if time permits (also on a
k8 architecture).

	A few general comments after a very quick look at your code
(I'm not an assembler guru though):

- your assembly code looks a lot like what gcc-3.4.2 outputs, but
I didn't check things very carefully: could you point out what you
did optimize?
- separating Intersect_Triangle() from triangle.cpp makes you lose
the inlining that gcc does of it and of the other function calls within
All_Triangle_Intersections();
- you seem to call fabs where apparently gcc inlines the corresponding
assembly code too.

	Overall, the way you proceed with this optimization could itself
be optimized, e.g. by inlining the assembly code within triangle.cpp
(here you kind of mess with the build system to insert your own code).
That should also save you from writing some unnecessary code related to e.g.
the triangle structs.  If your code does improve speed as you suggest,
I'd be interested to see a rewrite of your patch according to the points
mentioned above.

	- NC

From: raven
Subject: Re: SSE2 optymalization of Intersect_Triangle function
Date: 26 Jan 2005 11:00:00
Message: <web.41f7bd6a4017f8d86e5b7ea20@news.povray.org>

We tested this function with one big triangle. Here is the POV-Ray code for it:
------------------------------------------------
#include "colors.inc"
camera {location <2.0 , 0.0 , 0.0>    // for X
//camera {location <2.0 , 0.0 , 0.0>  // for Y
//camera {location <2.0 , 0.0 , 0.0>  // for Z
        look_at  <0.0 , 0.0 , 0.0>}
light_source{<1,2,-2> color White}

triangle {
 <-11,-6,-8>,<-11,6,0>,<-11,-6,8>   // for X
 //<-8,-11,-6>,<0,-11,6>,<8,-11,-6> // for Y
 //<-8,-6,-11>,<0,6,-11>,<8,-6,-11> // for Z
 texture{
  pigment{color rgb<1,0.5,0>}
  finish{ambient 0.15 diffuse 0.85}
 }
}
------------------------------------------------

Rendering this triangle at the standard resolution (300x300) calls
Intersect_Triangle 109676 times, and it returns true 32678 times. We tested
each case (X, Y, Z) separately because of some small differences in the code.
We ran the original gcc function and our optimized one 1000 times each and
counted the number of processor cycles. We then took the 100 best results for
each function and calculated the average number of cycles. Here are our results:

gcc version for X: 74.1 million cycles
sse2 version for X: 59.2 million cycles (20.10 % better)
gcc version for Y: 76.3 million cycles
sse2 version for Y: 61.0 million cycles (20.00 % better)
gcc version for Z: 75.5 million cycles
sse2 version for Z: 59.8 million cycles (20.79 % better)

Mainly, we minimized the number of reads from memory and, more generally, the
number of instructions.
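
The "100 best of 1000 runs" averaging and the percentage figures can be sketched as follows. This is only an illustration of the arithmetic, not the measurement harness used for the patch; the function names best_k_mean and improvement_percent are made up for this sketch, and the only data in it are the averages quoted in this post.

```python
def best_k_mean(samples, k=100):
    """Average of the k smallest cycle counts (the 'best' runs)."""
    if len(samples) < k:
        raise ValueError("need at least k samples")
    best = sorted(samples)[:k]
    return sum(best) / k


def improvement_percent(baseline, optimized):
    """Reduction in cycles, as a percentage of the baseline."""
    return 100.0 * (baseline - optimized) / baseline


# (gcc, sse2) averages quoted in the post, in millions of cycles.
reported = [(74.1, 59.2), (76.3, 61.0), (75.5, 59.8)]
for gcc_cycles, sse2_cycles in reported:
    pct = improvement_percent(gcc_cycles, sse2_cycles)
    print(f"{gcc_cycles} -> {sse2_cycles}: {pct:.2f} % fewer cycles")
```

Recomputed this way, the three pairs come out at roughly 20.1 %, 20.1 % and 20.8 %, which is within rounding distance of the figures quoted above. Taking only the best runs filters out noise from interrupts and cache misses, but it also means the result describes a best case rather than a typical one.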

From: Nicolas Calimet
Subject: Re: SSE2 optymalization of Intersect_Triangle function
Date: 26 Jan 2005 11:54:49
Message: <41f7cb59$1@news.povray.org>

> gcc version for X: 74.1 million cycles
> sse2 version for X: 59.2 million cycles (20.10 % better)

	Please give the gcc version and compiler flags (CXXFLAGS) that
you used to do this comparison.

	- NC

From: Ryan Lamansky
Subject: Re: SSE2 optymalization of Intersect_Triangle function
Date: 27 Jan 2005 09:01:43
Message: <41f8f447$1@news.povray.org>

raven wrote:
> We test this function with one big triangle.

It's unlikely a scene will ever consist of a single triangle.  Even if
it did, POV-Ray can currently render it instantly, and 20% faster than
instant is still instant.

A better test would consist of tens of thousands of triangles.  My guess
is that there wouldn't be much (if any) improvement because the vista
buffer and bounding hierarchy will eliminate most of the intersection tests.

My opinion is that hand-optimized assembly for intersection tests is 
"cool" and "interesting", but also not the best place for the effort...

-Ryan
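
A scene of that size is easiest to produce with a small generator script. The sketch below is one hypothetical way to do it: triangle_grid_scene is a made-up helper, and the flat grid, camera and light placement are arbitrary assumptions, so a real mesh (such as the teapots often used for this) would be a more representative test.

```python
def triangle_grid_scene(n):
    """Return POV-Ray scene text for a flat n x n grid of quads, two triangles each."""
    lines = [
        "camera { location <0, 0, -3> look_at <0, 0, 0> }",
        "light_source { <1, 2, -2> color rgb <1, 1, 1> }",
    ]
    step = 2.0 / n  # the grid spans -1..1 in x and y
    for i in range(n):
        for j in range(n):
            x0, y0 = -1.0 + i * step, -1.0 + j * step
            x1, y1 = x0 + step, y0 + step
            # Split each quad into two triangles along its diagonal.
            for corners in (((x0, y0), (x1, y0), (x1, y1)),
                            ((x0, y0), (x1, y1), (x0, y1))):
                verts = ", ".join(f"<{x:.4f}, {y:.4f}, 0>" for x, y in corners)
                lines.append(
                    f"triangle {{ {verts} pigment {{ color rgb <1, 0.5, 0> }} }}"
                )
    return "\n".join(lines) + "\n"


scene = triangle_grid_scene(100)  # 100 * 100 * 2 = 20000 triangles
```

Writing the returned string to a .pov file gives a scene that can be rendered with both the original and the patched binary for comparison.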

From: Rangifer
Subject: Re: SSE2 optymalization of Intersect_Triangle function
Date: 29 Jan 2005 06:30:45
Message: <41fb73e5@news.povray.org>

Ryan Lamansky wrote:
> A better test would consist of tens of thousands of triangles.  My guess
> is that there wouldn't be much (if any) improvement because the vista
> buffer and bounding hierarchy will eliminate most of the intersection
> tests.

13680 antialiased triangles (9 teapots). Rough profiling gives
something like this. The total time was 8.39 sec, so in this case one might
save 20% of 2.86% and the total time would drop to 8.34... what a saving!

   %   cumulative   self
  time   seconds   seconds    calls  name
  13.47      1.13     1.13 16337788  pov::Check_And_Enqueue(...)
  10.25      1.99     0.86   561358  pov::Intersect_Light_Tree(...)
   9.18      2.76     0.77   480000  pov::intersect_vista_tree(...)
   3.93      3.09     0.33   593029  pov::compute_lighted_texture(...)
   3.81      3.41     0.32  6822351  pov::priority_queue_insert(...)
   3.34      3.69     0.28   863724  pov::Noise(...)
   3.34      3.97     0.28   215987  pov::Intersect_BBox_Tree(...)
   3.10      4.23     0.26   921538  pov::DNoise(...)
   2.86      4.47     0.24  2367736  pov::Intersect_Triangle(...)
...

-r
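
The estimate in this post is ordinary "speed up one part of the program" arithmetic, and it can be checked with a few lines. This is only a sketch: time_after_speedup is a name invented for it, and it assumes the claimed 20 % cycle reduction carries over unchanged to the 2.86 % share shown in the profile.

```python
def time_after_speedup(total_seconds, fraction, reduction):
    """Total run time if a part taking `fraction` of it becomes `reduction` cheaper."""
    return total_seconds * (1.0 - fraction * reduction)


total = 8.39       # seconds, total time from the profile
fraction = 0.0286  # share of time spent in Intersect_Triangle
reduction = 0.20   # claimed reduction in cycles for that function

new_total = time_after_speedup(total, fraction, reduction)
print(f"{new_total:.2f} s (saves {total - new_total:.3f} s)")
```

With these inputs the saving is about 0.05 s out of 8.39 s, so the whole render gets less than 1 % faster even if the function itself really is 20 % faster.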
