POV-Ray: Newsgroups: povray.programming: Fixing mandelx_pattern: Re: Fixing mandelx

POV-Ray : Newsgroups : povray.programming : Fixing mandelx_pattern : Re: Fixing mandelx_pattern		Server Time 12 Jul 2025 11:59:23 EDT (-0400)
From: Algo
Date: 22 Jun 2008 17:35:00
Message: <web.485ec47e56e5854ed9b2aafa0@news.povray.org>
How did NEWER_mandelx_pattern of exponent=2 manage to go 7% faster than
mandel_pattern?

THIS EXPLANATION IS HERE BECAUSE I SAID I WOULD FIGURE IT OUT.
FEEL FREE NOT TO READ THIS.  IT IS VERY TECHNICAL MATERIAL OF LITTLE
INTEREST EXCEPT TO HARD-CORE CODERS.  ALL THE LESS RELEVANT NOW,
SINCE MY FINAL CODE NOW BEATS mandel_pattern BY 19%.  MY ADVICE:
STUDY THAT INSTEAD.

The answer is by a combination of 75% Good Luck and 25% Right Living.  The
following are the disassembled code executing in the FPU on each pass through
the col loop (when neither internal "if" succeeds).

This code is MS VS 2008 C++, "release".

As expected, both codes succeed in keeping the loop totally free
from memory access.

Recall that "fld st(n)" is about three times faster than fadd, fmul or
fsub.

Recall that the first "fxch" is free as long as it executes after an
fadd, fmul or fsub.


             NEWER_mandelx_pattern:
                       NEXT_SQUARE;
             004017BE  fsubrp      st(1),st     add#1
             004017C0  fxch        st(3)
             004017C2  fadd        st(0),st     add#2
             004017C4  fmulp       st(1),st     mul#1
                       ReZ = ZTo2ToN[2] + ReX;
             004017C6  fxch        st(2)
             004017C8  fadd        st,st(3)     add#3
                       ImZ = ZTo2ToN[3] + ImX;
             004017CA  fxch        st(2)
             004017CC  fadd        st,st(1)     add#4
                       DBL AbsZ2 = ReZ*ReZ + ImZ*ImZ;
             004017CE  fld         st(0)        fld#1
             004017D0  fmul        st,st(1)     mul#2
             004017D2  fld         st(3)        fld#2
             004017D4  fmul        st,st(4)     mul#3
             004017D6  fld         st(0)        fld#3
             004017D8  fadd        st,st(2)     add#5


             mandel_pattern:
             b  = 2.0 * a * b + y;
             0040102B  fxch        st(4)        xch#1
             0040102D  fadd        st(0),st     add#1
             0040102F  fmulp       st(1),st     mul#1
             00401031  fadd        st,st(1)     add#2
                 a  = a2 - b2 + x;
             00401033  fxch        st(2)
             00401035  fsubrp      st(3),st     add#3
             00401037  fxch        st(2)
             00401039  fadd        st,st(3)     add#4

                 a2 = Sqr(a);
             0040103B  fld         st(0)        fld#1
             0040103D  fmul        st,st(1)     mul#2
                 b2 = Sqr(b);
             0040103F  fld         st(2)        fld#2
             00401041  fmul        st,st(3)     mul#3

             0040101F  fxch        st(2)
             00401021  fxch        st(4)        xch#2
             00401023  fxch        st(1)        xch#3
             00401025  fxch        st(3)        xch#4
             00401027  fxch        st(1)        xch#5
             00401029  fxch        st(2)        xch#6

A rough cycle count comparison is:
                                add     mul     fld     fxch    total
        NEWER_mandelx_pattern:  5*3     3*3     3*1     0       27
        mandel_pattern:         4*3     3*3     2*1     6*1     29

The time difference seems to be due mandel_pattern's chain of 6 fxch ops
at the end of the loop.  This is due to the cunning re-use of a^2 and b^2
from one pass through the loop to the next, upping the number of re-used
register values from 3 to 5 (out of 8).  These must be returned to the same
positions from pass to pass and that requires a bunch of register exchanges,
wasting more time than was saved by squeezing out that last FPU arithmetic op.

An infinitely clever compiler could have avoided these by unrolling the loop
in mandel-pattern.

---Algo
Post a reply to this message