> We made this little patch for POV-Ray. It's an optimized version of
> the Intersect_Triangle function using SSE2.
As Warp suggests, it would be really interesting to see an actual
demonstration of the speedup from your assembler code compared with
the code gcc generates on a Pentium 4 machine (with the optimization
flags that ./configure sets for it). You claim a 20% speedup, which
seems reasonable but needs to be supported by reproducible test
cases. I will try it myself if time permits (also on a K8
architecture).
A few general comments after a very quick look at your code
(I'm not an assembler guru though):
- your assembly code looks a lot like what gcc-3.4.2 outputs, but
I didn't check things very carefully: could you point out what you
actually optimized?
- separating Intersect_Triangle() from triangle.cpp makes you lose
the inlining gcc does of it and of the other function calls within
All_Triangle_Intersections();
- you seem to call fabs where gcc apparently inlines the corresponding
assembly code as well.
Overall, the way you went about this optimization could itself be
optimized, e.g. by inlining the assembly code within triangle.cpp
(as it stands, you have to tamper with the build system to insert your own code).
That should also save you from writing some unnecessary code related to e.g.
the triangle structs. If your code does improve speed as you suggest,
I'd be interested to see a rewrite of your patch along the lines of the
points mentioned above.
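To illustrate the fabs point, here is a minimal sketch of an absolute-value helper that lives in the same translation unit, so the compiler remains free to inline it. It uses SSE2 intrinsics from <emmintrin.h> rather than hand-written assembly, which is a substitution on my part; the name abs_sse2 is hypothetical and is not part of POV-Ray or of the patch.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not from POV-Ray or the patch): absolute value of a
// double, computed by clearing the sign bit with an SSE2 bitwise operation.
// Being a static inline function in the same file, it can be inlined at
// every call site instead of costing a function call.
static inline double abs_sse2(double x) {
#if defined(__SSE2__)
    // -0.0 has only the sign bit set; andnot computes (~mask) & value,
    // which clears that bit in the low lane.
    const __m128d sign_mask = _mm_set_sd(-0.0);
    return _mm_cvtsd_f64(_mm_andnot_pd(sign_mask, _mm_set_sd(x)));
#else
    // Portable fallback when SSE2 is not enabled at compile time.
    return std::fabs(x);
#endif
}
```

Whether this beats what gcc already emits for fabs with the ./configure flags would have to be measured; the sketch only shows that no separate object file is needed.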
- NC
We tested this function with one big triangle. Here is the POV-Ray code for it:
------------------------------------------------
#include "colors.inc"
camera {location <2.0 , 0.0 , 0.0> // for X
//camera {location <2.0 , 0.0 , 0.0> // for Y
//camera {location <2.0 , 0.0 , 0.0> // for Z
look_at <0.0 , 0.0 , 0.0>}
light_source{<1,2,-2> color White}
triangle {
<-11,-6,-8>,<-11,6,0>,<-11,-6,8> // for X
//<-8,-11,-6>,<0,-11,6>,<8,-11,-6> // for Y
//<-8,-6,-11>,<0,6,-11>,<8,-6,-11> // for Z
texture{
pigment{color rgb<1,0.5,0>}
finish{ambient 0.15 diffuse 0.85}
}
}
------------------------------------------------
At the standard resolution (300x300), this scene calls Intersect_Triangle
109676 times, and 32678 of those calls return true. We tested each case
(X, Y, Z) separately because of small differences in the code.
We ran the original gcc function and our optimized one 1000 times each and
counted the number of processor cycles. We then took the 100 best results
for each function and calculated the average number of cycles. Here are our results:
gcc version for X: 74.1 million cycles
sse2 version for X: 59.2 million cycles (20.10% better)
gcc version for Y: 76.3 million cycles
sse2 version for Y: 61.0 million cycles (20.00% better)
gcc version for Z: 75.5 million cycles
sse2 version for Z: 59.8 million cycles (20.79% better)
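The selection step described above (keep the 100 best of 1000 runs, then average them) and the percentage calculation can be sketched in C++ roughly as follows. This is only an illustration of the arithmetic, not the harness actually used; the names mean_of_best and percent_fewer_cycles are hypothetical, and collecting the cycle counts is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: mean of the `keep` smallest samples, e.g. the
// 100 lowest cycle counts out of 1000 measured runs.
double mean_of_best(std::vector<double> samples, std::size_t keep) {
    assert(keep > 0 && keep <= samples.size());
    // Partially sort so that the `keep` lowest values come first.
    std::partial_sort(samples.begin(), samples.begin() + keep, samples.end());
    const double sum =
        std::accumulate(samples.begin(), samples.begin() + keep, 0.0);
    return sum / static_cast<double>(keep);
}

// Hypothetical helper: relative reduction in cycles, in percent.
double percent_fewer_cycles(double baseline, double optimized) {
    return 100.0 * (baseline - optimized) / baseline;
}
```

With the rounded X figures above, percent_fewer_cycles(74.1, 59.2) comes out at roughly 20.1, in line with the reported value.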
Mainly, we minimized the number of reads from memory and, more generally,
the number of instructions.