How did NEWER_mandelx_pattern of exponent=2 manage to go 7% faster than
mandel_pattern?
THIS EXPLANATION IS HERE BECAUSE I SAID I WOULD FIGURE IT OUT.
FEEL FREE NOT TO READ THIS. IT IS VERY TECHNICAL MATERIAL OF LITTLE
INTEREST EXCEPT TO HARD-CORE CODERS. ALL THE LESS RELEVANT NOW,
SINCE MY FINAL CODE NOW BEATS mandel_pattern BY 19%. MY ADVICE:
STUDY THAT INSTEAD.
The answer: a combination of 75% Good Luck and 25% Right Living. The
following is the disassembled code that executes in the FPU on each pass
through the col loop (when neither internal "if" succeeds).
Both were built with MS VS 2008 C++, "release".
As expected, both codes succeed in keeping the loop totally free
from memory access.
Recall that "fld st(n)" is about three times faster than fadd, fmul or
fsub.
Recall that the first "fxch" is free as long as it executes after an
fadd, fmul or fsub.
NEWER_mandelx_pattern:
NEXT_SQUARE;
004017BE fsubrp st(1),st add#1
004017C0 fxch st(3)
004017C2 fadd st(0),st add#2
004017C4 fmulp st(1),st mul#1
ReZ = ZTo2ToN[2] + ReX;
004017C6 fxch st(2)
004017C8 fadd st,st(3) add#3
ImZ = ZTo2ToN[3] + ImX;
004017CA fxch st(2)
004017CC fadd st,st(1) add#4
DBL AbsZ2 = ReZ*ReZ + ImZ*ImZ;
004017CE fld st(0) fld#1
004017D0 fmul st,st(1) mul#2
004017D2 fld st(3) fld#2
004017D4 fmul st,st(4) mul#3
004017D6 fld st(0) fld#3
004017D8 fadd st,st(2) add#5
mandel_pattern:
b = 2.0 * a * b + y;
0040102B fxch st(4) xch#1
0040102D fadd st(0),st add#1
0040102F fmulp st(1),st mul#1
00401031 fadd st,st(1) add#2
a = a2 - b2 + x;
00401033 fxch st(2)
00401035 fsubrp st(3),st add#3
00401037 fxch st(2)
00401039 fadd st,st(3) add#4
a2 = Sqr(a);
0040103B fld st(0) fld#1
0040103D fmul st,st(1) mul#2
b2 = Sqr(b);
0040103F fld st(2) fld#2
00401041 fmul st,st(3) mul#3
0040101F fxch st(2)
00401021 fxch st(4) xch#2
00401023 fxch st(1) xch#3
00401025 fxch st(3) xch#4
00401027 fxch st(1) xch#5
00401029 fxch st(2) xch#6
A rough cycle count comparison is:
                        add   mul   fld   fxch  total
NEWER_mandelx_pattern:  5*3   3*3   3*1   0     27
mandel_pattern:         4*3   3*3   2*1   6*1   29
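For anyone who wants to check the tally, here is a minimal C++ sketch of the cost model used above: 3 cycles for fadd/fmul/fsub (including the "p" and "r" forms), 1 for fld, and 1 for an fxch unless it directly follows an arithmetic op. The names is_arith, rough_cycles, newer_ops and mandel_ops are hypothetical helpers, not part of either pattern's source, and the sketch assumes the loop wraps around, so the op "before" the first one is the last one of the previous pass. It is only as accurate as the rough costs quoted in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True for fadd/fmul/fsub and their variants (fmulp, fsubrp, ...).
static bool is_arith(const std::string& op) {
    return op.compare(0, 4, "fadd") == 0 ||
           op.compare(0, 4, "fmul") == 0 ||
           op.compare(0, 4, "fsub") == 0;
}

// Rough cycle count for one pass through a loop, using the costs
// quoted in the post. ASSUMPTION: the loop wraps, so the first op is
// treated as following the last op of the previous pass.
int rough_cycles(const std::vector<std::string>& ops) {
    assert(!ops.empty());
    int cycles = 0;
    std::string prev = ops.back();
    for (std::size_t i = 0; i < ops.size(); ++i) {
        const std::string& op = ops[i];
        if (is_arith(op)) {
            cycles += 3;
        } else if (op == "fld") {
            cycles += 1;
        } else if (op == "fxch" && !is_arith(prev)) {
            cycles += 1;  // an fxch right after an arithmetic op is free
        }
        prev = op;
    }
    return cycles;
}

// Mnemonics transcribed from the NEWER_mandelx_pattern listing.
std::vector<std::string> newer_ops() {
    static const char* const ops[] = {
        "fsubrp", "fxch", "fadd", "fmulp", "fxch", "fadd", "fxch",
        "fadd", "fld", "fmul", "fld", "fmul", "fld", "fadd"
    };
    return std::vector<std::string>(ops, ops + sizeof ops / sizeof ops[0]);
}

// Mnemonics transcribed from the mandel_pattern listing, in the order shown.
std::vector<std::string> mandel_ops() {
    static const char* const ops[] = {
        "fxch", "fadd", "fmulp", "fadd", "fxch", "fsubrp", "fxch",
        "fadd", "fld", "fmul", "fld", "fmul",
        "fxch", "fxch", "fxch", "fxch", "fxch", "fxch"
    };
    return std::vector<std::string>(ops, ops + sizeof ops / sizeof ops[0]);
}
```

With these sequences, rough_cycles(newer_ops()) comes to 27 and rough_cycles(mandel_ops()) comes to 29, matching the totals in the table.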
The time difference seems to be due to mandel_pattern's chain of 6 fxch ops
at the end of the loop. That chain is a side effect of the cunning re-use of
a^2 and b^2 from one pass through the loop to the next, which ups the number
of re-used register values from 3 to 5 (out of 8). These must be returned to
the same positions from pass to pass, and that requires a bunch of register
exchanges, wasting more time than was saved by squeezing out that last FPU
arithmetic op.
An infinitely clever compiler could have avoided these by unrolling the loop
in mandel_pattern.
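To show what "re-use of a^2 and b^2 from one pass to the next" means at source level, here is a minimal C++ sketch. It is not a reconstruction of either function: the names iterate_carried and iterate_recomputed are hypothetical, and the starting point z = 0 and the escape test a^2 + b^2 > 4 are my assumptions. The first version carries the squares in loop variables, so each pass needs them to be sitting in the same place at the top of the loop; the second recomputes them where needed.

```cpp
#include <cassert>

// Squares a2 and b2 are computed once per pass and carried into the
// next pass, where they feed a = a2 - b2 + x. Returns the pass on
// which the point escaped, or max_iter if it never did.
int iterate_carried(double x, double y, int max_iter) {
    assert(max_iter > 0);
    double a = 0.0, b = 0.0, a2 = 0.0, b2 = 0.0;
    for (int i = 0; i < max_iter; ++i) {
        b = 2.0 * a * b + y;  // must use the old a, so it comes first
        a = a2 - b2 + x;      // squares left over from the previous pass
        a2 = a * a;
        b2 = b * b;
        if (a2 + b2 > 4.0) return i;  // assumed escape test
    }
    return max_iter;
}

// Same iteration, but nothing except re and im survives from one
// pass to the next; the squares are recomputed as needed.
int iterate_recomputed(double x, double y, int max_iter) {
    assert(max_iter > 0);
    double re = 0.0, im = 0.0;
    for (int i = 0; i < max_iter; ++i) {
        double new_re = re * re - im * im + x;
        double new_im = 2.0 * re * im + y;
        re = new_re;
        im = new_im;
        if (re * re + im * im > 4.0) return i;  // assumed escape test
    }
    return max_iter;
}
```

Whether the carried version wins depends on what the compiler does with the register stack, which is the whole point of the comparison above: fewer arithmetic ops do not help if the saved values have to be shuffled back into place on every pass.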
---Algo