Hello
I have been using POV on and off since the early 90s. It was a fabulous product
then, and it is fabulous now.
Small as it is, I would like to offer a contribution of my own: fixing
mandelx_pattern. I have hacked together a demonstration of my proposed
improvement:
1.) Accuracy is not damaged. In fact it is improved in a typical test case.
2.) Speed is improved. The enclosed graph shows measured speed-ups ranging from
2x to 20x.
3.) Maintainability is improved. The code is easier to read and, for instance,
the binomial_coeff stuff is no longer needed.
If you would be interested, drop me a note.
Algo
Post a reply to this message
Algo <Gem### [at] hotmailcom> wrote:
> Small as it is, I would like to offer a contribution of my own: fixing
> mandelx_pattern.
What was broken with it, that needed fixing?
> I have hacked together a demonstration of my proposed improvement:
Where is it?
--
- Warp
Post a reply to this message
Warp <war### [at] tagpovrayorg> wrote:
> Algo <Gem### [at] hotmailcom> wrote:
> > Small as it is, I would like to offer a contribution of my own: fixing
> > mandelx_pattern.
>
> What was broken with it, that needed fixing?
[Speed-ups ranging from 2x to 20x]
[Also accuracy is improved]
[Less source code, too]
>
> > I have hacked together a demonstration of my proposed improvement:
>
> Where is it?
>
Do you want to see source code or an algorithm description?
Algo
Post a reply to this message
PROPOSED NEW VERSION OF mandelx_pattern
The basis of this improvement is the following algorithm. This algorithm could
not possibly be original to me, though I did not get it from any source.
PROPOSED ALGORITHM
How to compute powers in logarithmic time. The POV docs inaccurately imply that
computing the pth power takes p-1 multiplications. In fact it always takes
fewer than 2*log2(p) multiplications. Here is how that speedup works:
/////////////////////////////////////////////////////////////////
// Let z=a+ib, want zToP = z^p
// for example, take p=37
//
// Compute z, z^2, z^4, z^8, ... by successive squaring.
// Call this sequence zTo2ToN.
//
// Write these below the binary expansion of p:
//
//        1     0     0     1     0     1
//      z^32  z^16   z^8   z^4   z^2   z^1
//
// zToP = z^32 * z^4 * z^1 = z^37
//
// Notice that this takes:
// 5 complex multiplies to get the sequence zTo2ToN
// 2 complex multiplies accumulating -> zNew
// ----------------------
// 7 complex multiplies total
//
// i.e., WAY less than the 36 required if computed via
//
// z^1, z^2, z^3, ..., z^36, z^37
//
// In general, to compute a pth power, it takes
//
// CountZeros(p)+2*CountOnes(p)-2 complex multiplies
//
// For certain values of p (like 15 or 30), there are methods
// that are somewhat faster, but the extra code burden would
// probably not be worth it.
/////////////////////////////////////////////////////////////////
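To make the recipe concrete, here is a tiny standalone sketch (my own
illustration, independent of the POV code base) that computes z^p by successive
squaring and counts the complex multiplies as it goes. For p=37 it reports 7
multiplies, matching the tally above.
/********power-by-squaring sketch begins (illustration only)********************
#include <stdio.h>

typedef struct { double re, im; } Cmplx;

static Cmplx cmul(Cmplx a, Cmplx b)         // one complex multiply
{
    Cmplx r;
    r.re = a.re*b.re - a.im*b.im;
    r.im = a.re*b.im + a.im*b.re;
    return r;
}

// z^p by successive squaring, p > 0; *mults returns the multiply count
static Cmplx cpow_int(Cmplx z, unsigned p, int *mults)
{
    Cmplx zTo2ToN = z;                      // runs through z, z^2, z^4, z^8, ...
    Cmplx zToP;
    *mults = 0;
    for( ; (p & 1) == 0; p >>= 1)           // trailing zeros of p: just square
    {   zTo2ToN = cmul(zTo2ToN, zTo2ToN); ++*mults; }
    zToP = zTo2ToN;                         // least significant 1 in p
    for(p >>= 1; p != 0; p >>= 1)
    {
        zTo2ToN = cmul(zTo2ToN, zTo2ToN); ++*mults;          // next power of two
        if(p & 1) { zToP = cmul(zToP, zTo2ToN); ++*mults; }  // accumulate
    }
    return zToP;
}

int main(void)
{
    Cmplx z = { 1.01, 0.02 };
    int mults;
    Cmplx w = cpow_int(z, 37, &mults);
    printf("z^37 = %.6f + %.6fi using %d complex multiplies\n",
           w.re, w.im, mults);              // prints 7 multiplies
    return 0;
}
********power-by-squaring sketch ends******************************************/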
================================================================================
This obviates a LOT of code.....
I propose that the code below replaces:
1.) mandelx_pattern
2.) mandel_pattern
3.) mandel3_pattern
4.) mandel4_pattern
5.) MANDEL_PATTERN
6.) MANDEL3_PATTERN
7.) MANDEL4_PATTERN
and all that machinery.
The same trick applies to the Julia Set as well: a corresponding
NEW_juliax_pattern (sketched after the NEW_mandelx_pattern code below) would
eliminate:
8.) juliax_pattern
9.) julia_pattern
10.) julia3_pattern
11.) julia4_pattern
12.) JULIA_PATTERN
13.) JULIA3_PATTERN
14.) JULIA4_PATTERN
and all that machinery.
When these are both done, there is no further need for:
15.) binomial_coeff
16.) InitializeBinomialCoefficients
17.) BinomialCoefficients
18.) BinomialCoefficientsInited
and all that machinery.
================================================================================
And it goes a LOT faster. The enclosed figure, ALGO_TIMES.JPG, shows the
speedups.
Since I cannot post it to this forum, I will place it in some more permissive
forum on the POVRAY site.
================================================================================
Finally, it is a LOT more accurate. This is mostly a matter of peace of mind
for coders. I doubt very much that users will notice any artistic improvement
from this, but for exponent=30 there actually were two pixels different even
on my 401x401 hacked-up grid (no smaller exponent differed at all).
================================================================================
================================================================================
Here is a hacked-up version of the code. I have never operated it in situ (for
some reason [missing DLL?] I am unable to execute my POV compilations). But I
am pretty sure that it works.
/********NEW_mandelx_pattern begins********************************************
#define CMPLX_SQ(A,B)      {DBL w = A*A - B*B; B = 2.0*A*B; A = w;}
#define CMPLX_MUL(A,B,C,D) {DBL w = A*C - B*D; B = A*D + B*C; A = w;}

static DBL NEW_mandelx_pattern(double EPoint[3])
{// assert(exponent>0);
    int col, ExpShifted;
    DBL ReZ, ImZ, ReX, ImX, MinAbsZ2;

    ReZ = ReX = EPoint[X];                  // Think of ReZ+iImZ as z
    ImZ = ImX = EPoint[Y];                  // Think of ReX+iImX as x, start with z=x
    MinAbsZ2 = ReZ*ReZ + ImZ*ImZ;

    for(col = 0; col < it_max; col++)
    {
        DBL ReZTo2ToN = ReZ;                // Will become z^(2^bit_pos)
        DBL ImZTo2ToN = ImZ;

        for(ExpShifted = exponent; (ExpShifted&1) == 0; ExpShifted >>= 1)
            CMPLX_SQ(ReZTo2ToN, ImZTo2ToN); // Next trailing zero in exponent

        DBL ReZToE = ReZTo2ToN;             // Least significant 1 in exponent,
        DBL ImZToE = ImZTo2ToN;             // this will become z^exponent

        for(ExpShifted >>= 1; ExpShifted != 0; ExpShifted >>= 1)
        {// Next bit-position in exponent
            CMPLX_SQ(ReZTo2ToN, ImZTo2ToN);       // Square previous z^(2^bit_pos)
            if((ExpShifted&1) != 0)               // This bit-position = 1?
                CMPLX_MUL(ReZToE, ImZToE, ReZTo2ToN, ImZTo2ToN);  // -> z^exponent
        }

        ReZ = ReZToE + ReX;                 // z <= z^exponent + x
        ImZ = ImZToE + ImX;

        DBL AbsZ2 = ReZ*ReZ + ImZ*ImZ;
        if(AbsZ2 < MinAbsZ2) MinAbsZ2 = AbsZ2;
        if(AbsZ2 > 4.0)
            return(fractal_exterior_color(col, ReZ, ImZ));
    }
    return(fractal_interior_color(col, ReZ, ImZ, MinAbsZ2));
}
********NEW_mandelx_pattern ends**********************************************/
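As mentioned above, the julia version is the same machinery with a different
initialization: z starts at the evaluation point, and the added constant x is
the pattern's fixed 2-D julia parameter rather than the point itself. A rough
sketch follows; JuliaCoord[] is only a stand-in name, since I have not checked
what the pattern struct actually calls that member.
/********NEW_juliax_pattern sketch begins (illustration only)*******************
static DBL NEW_juliax_pattern(double EPoint[3])
{// assert(exponent>0);
    int col, ExpShifted;
    DBL ReZ, ImZ, ReX, ImX, MinAbsZ2;

    ReX = JuliaCoord[X];                    // x is the fixed julia parameter ("c");
    ImX = JuliaCoord[Y];                    //   JuliaCoord is a stand-in name
    ReZ = EPoint[X];                        // z starts at the evaluation point
    ImZ = EPoint[Y];
    MinAbsZ2 = ReZ*ReZ + ImZ*ImZ;

    for(col = 0; col < it_max; col++)
    {
        DBL ReZTo2ToN = ReZ;                // Will become z^(2^bit_pos)
        DBL ImZTo2ToN = ImZ;

        for(ExpShifted = exponent; (ExpShifted&1) == 0; ExpShifted >>= 1)
            CMPLX_SQ(ReZTo2ToN, ImZTo2ToN);

        DBL ReZToE = ReZTo2ToN;             // Will become z^exponent
        DBL ImZToE = ImZTo2ToN;

        for(ExpShifted >>= 1; ExpShifted != 0; ExpShifted >>= 1)
        {
            CMPLX_SQ(ReZTo2ToN, ImZTo2ToN);
            if((ExpShifted&1) != 0)
                CMPLX_MUL(ReZToE, ImZToE, ReZTo2ToN, ImZTo2ToN);
        }

        ReZ = ReZToE + ReX;                 // z <= z^exponent + x
        ImZ = ImZToE + ImX;

        DBL AbsZ2 = ReZ*ReZ + ImZ*ImZ;
        if(AbsZ2 < MinAbsZ2) MinAbsZ2 = AbsZ2;
        if(AbsZ2 > 4.0)
            return(fractal_exterior_color(col, ReZ, ImZ));
    }
    return(fractal_interior_color(col, ReZ, ImZ, MinAbsZ2));
}
********NEW_juliax_pattern sketch ends*****************************************/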
Post a reply to this message
Sounds cool.
You also claim that your version is faster than mandel_pattern().
Given that this function does nothing more than calculate the regular
z^2+c mandelbrot, I find it strange that a generic version which calculates
the same for any exponent will be faster. Are you sure?
--
- Warp
Post a reply to this message
No, I didn't claim that it is faster than the exponent=2 special code; I haven't
measured that. I have measured it against mandelx with exponent=2, and there it
turned out to be a few times faster.
If speed is the only consideration, then one picks the fastest choice,
obviously. I will measure the new code vs. the exponent=2 special code.
If there is a threshold for trading off some amount of speed in exchange for
some amount of code reduction, then its parameters are better judged by experts
like yourself who know the values of such trade-offs.
I'm an enthusiastic code pruner, myself.
Algo
Post a reply to this message
Algo <Gem### [at] hotmailcom> wrote:
> No I didn't claim that it is faster than the exponent=2 special code. I haven't
> measured that. I have measured vs. the mandelx with exponent = 2 and it turned
> out to be a few x faster there.
> If speed is the only consideration, then one picks the fastest choice,
> obviously. I will measure the new code vs. the exponent=2 special code.
> If there is a threshhold for trading off some amount of speed in exchange for
> some amount of code reduction, then it's parameters. Better judged by experts
> like yourself who know the values of such trade-offs.
> I'm an enthusiastic code pruner, myself.
As you may have noticed, the current source has specializations for
the exponents 2, 3 and 4, and the larger exponents are calculated with
the generalized function. Your improvement would basically replace the
latter.
It may be good to measure the speed of your generic function against the
speed of the specialized exponent-4 function to see which is faster
(my guess is that the specialized one is, but I can't know without
actually measuring). If the specialized function turns out to be slower,
then it may be removed as well.
--
- Warp
Post a reply to this message
A beautiful sunny Sunday in Seattle. Outside my window, ships go through the
locks, salmon swim up the ladder, a bagpipe band plays in the garden and I am
inside coding. Actually I love coding. And I have kept it down to two hours.
NEW_mandelx_pattern was identified all along as a hack. It is perfectly adequate
for demonstrating the improved power computations, but for timing purposes it
represents a worst-case promise rather than any basis for quantitative
comparison.
NEWER_mandelx_pattern is a speed-optimized version, ready for quantitative
comparisons. My main avenue of attack was to unroll my awful interior loops:
short and badly branch-predicted, they are ideal candidates. The measured result
is posted among the images.
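To give the flavor of that unrolling (this is only a sketch of the idea, not
the actual NEWER code, which I will post separately): special-case the smallest
exponents so that the two inner bit-scanning loops vanish and the common path
becomes straight-line code, roughly like this inside the col loop:
/********unrolling flavor begins (sketch only)**********************************
    DBL ReZToE = ReZ, ImZToE = ImZ;         // will become z^exponent
    switch(exponent)
    {
        case 2: CMPLX_SQ(ReZToE, ImZToE);                       // z^2
                break;
        case 3: CMPLX_SQ(ReZToE, ImZToE);                       // z^2
                CMPLX_MUL(ReZToE, ImZToE, ReZ, ImZ);            // z^3
                break;
        case 4: CMPLX_SQ(ReZToE, ImZToE);                       // z^2
                CMPLX_SQ(ReZToE, ImZToE);                       // z^4
                break;
        default: // general bit-scanning loops, as in NEW_mandelx_pattern
                break;
    }
    ReZ = ReZToE + ReX;                     // z <= z^exponent + x
    ImZ = ImZToE + ImX;
********unrolling flavor ends***************************************************/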
Also I have extended NEWER_mandelx_pattern to treat exponents up to 255.
The old mandel_pattern is a fine piece of code; I would have said speed-optimal,
in fact. I was hoping to come within 10% of its speed, but the measurements show
NEWER_mandelx_pattern to be 6.8% FASTER, for some unexpected reason. I will not
be satisfied until I have tracked down why this is.
Will post NEWER_mandelx_pattern after I understand it vs. mandel_pattern and
prune its code way down.
Algo
Post a reply to this message
Algo <Gem### [at] hotmailcom> wrote:
> Also I have extended NEWER_mandelx_pattern to treat exponents up to 255.
Since I assume that the extra exponents come "for free" (i.e. without
requiring any additional resources such as arrays), that's completely ok.
However, I really doubt anyone would use exponents that large, as the
resulting image comes closer and closer to being just a circle (and
the border of the fractal becomes more and more boring as the exponent
grows).
OTOH, I have never seen nor zoomed into a z = z^255+c mandelbrot.
Could be interesting, just out of curiosity. :)
--
- Warp
Post a reply to this message
How did NEWER_mandelx_pattern at exponent=2 manage to go 7% faster than
mandel_pattern?
THIS EXPLANATION IS HERE BECAUSE I SAID I WOULD FIGURE IT OUT.
FEEL FREE NOT TO READ THIS. IT IS VERY TECHNICAL MATERIAL OF LITTLE
INTEREST EXCEPT TO HARD-CORE CODERS. ALL THE LESS RELEVANT NOW,
SINCE MY FINAL CODE NOW BEATS mandel_pattern BY 19%. MY ADVICE:
STUDY THAT INSTEAD.
The answer is a combination of 75% Good Luck and 25% Right Living. The
following listings show the disassembled code executing in the FPU on each pass
through the col loop (when neither internal "if" succeeds).
This code is MS VS 2008 C++, "release".
As expected, both codes succeed in keeping the loop totally free
from memory access.
Recall that "fld st(n)" is about three times faster than fadd, fmul or
fsub.
Recall that the first "fxch" is free as long as it executes after an
fadd, fmul or fsub.
NEWER_mandelx_pattern:
NEXT_SQUARE;
004017BE fsubrp st(1),st add#1
004017C0 fxch st(3)
004017C2 fadd st(0),st add#2
004017C4 fmulp st(1),st mul#1
ReZ = ZTo2ToN[2] + ReX;
004017C6 fxch st(2)
004017C8 fadd st,st(3) add#3
ImZ = ZTo2ToN[3] + ImX;
004017CA fxch st(2)
004017CC fadd st,st(1) add#4
DBL AbsZ2 = ReZ*ReZ + ImZ*ImZ;
004017CE fld st(0) fld#1
004017D0 fmul st,st(1) mul#2
004017D2 fld st(3) fld#2
004017D4 fmul st,st(4) mul#3
004017D6 fld st(0) fld#3
004017D8 fadd st,st(2) add#5
mandel_pattern:
b = 2.0 * a * b + y;
0040102B fxch st(4) xch#1
0040102D fadd st(0),st add#1
0040102F fmulp st(1),st mul#1
00401031 fadd st,st(1) add#2
a = a2 - b2 + x;
00401033 fxch st(2)
00401035 fsubrp st(3),st add#3
00401037 fxch st(2)
00401039 fadd st,st(3) add#4
a2 = Sqr(a);
0040103B fld st(0) fld#1
0040103D fmul st,st(1) mul#2
b2 = Sqr(b);
0040103F fld st(2) fld#2
00401041 fmul st,st(3) mul#3
0040101F fxch st(2)
00401021 fxch st(4) xch#2
00401023 fxch st(1) xch#3
00401025 fxch st(3) xch#4
00401027 fxch st(1) xch#5
00401029 fxch st(2) xch#6
A rough cycle count comparison is:

                         add    mul    fld    fxch    total
NEWER_mandelx_pattern:   5*3    3*3    3*1    0       27
mandel_pattern:          4*3    3*3    2*1    6*1     29
The time difference seems to be due to mandel_pattern's chain of 6 fxch ops
at the end of the loop. This is due to the cunning re-use of a^2 and b^2
from one pass through the loop to the next, upping the number of re-used
register values from 3 to 5 (out of 8). These must be returned to the same
stack positions from pass to pass, and that requires a bunch of register
exchanges, wasting more time than was saved by squeezing out that last FPU
arithmetic op. An infinitely clever compiler could have avoided these by
unrolling the loop in mandel_pattern.
---Algo
Post a reply to this message