On 17.02.2019 at 14:07, William F Pokorny wrote:
> /// @def MAX_ORDER
> /// Maximum supported polynomial order.
> ///
> /// @todo
> /// This currently carries a large, fixed per polysolve() call
> /// memory allocation on the stack. Size is on the order of
> /// (MAX_ORDER+1)*int + PRECISE_FLOAT * (MAX_ORDER+1)^2
> /// which impacts performance and performance stability especially
> /// for threads > cores. Allocation based on current equation order
> /// would be better. My C++ attempts at this were all badly slower. C itself
> /// supports a struct "flexible array member" feature; however, this is
> /// not part of the C++11 standard, though some C++ compilers support it.
> ///
> #define MAX_ORDER 35

BTW, that "flexible array member" feature won't help you speed up
things, as it requires dynamic allocation on the heap, which is
super-expensive; whatever you allocate on the stack must necessarily
have a fixed size.

The feature is nothing more than syntactic sugar for accessing the
memory right after the struct as a pointer to a particular type, i.e.
allowing you to rewrite code like this:
typedef struct { /* ... */ } MyStructT;
MyStructT* myStructPtr;
...
unsigned char* startOfStruct = (unsigned char*)(myStructPtr);
unsigned char* endOfStruct = startOfStruct + sizeof(*myStructPtr);
T* myArray = (T*)(endOfStruct);
Foo = myArray[i];
as:
typedef struct { /* ... */ T myArray[]; } MyStructT;
...
Foo = myStructPtr->myArray[i];
but it doesn't magically solve the problem of how to allocate the proper
amount of memory for the object.
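
To make that explicit, the typical usage pattern still looks roughly
like the following (my own sketch, not actual POV-Ray code; in C++ the
flexible array member is only a compiler extension):

#include <cstdlib>   // std::malloc / std::free
#include <cstddef>   // std::size_t

struct MyStructT { int count; double myArray[]; };   // FAM; non-standard C++, but GCC/Clang accept it

MyStructT* makeMyStruct(std::size_t n)                // n is only known at run time
{
    // The array storage still has to come from the heap, placed right
    // behind the struct; the caller later releases it with std::free().
    return static_cast<MyStructT*>(std::malloc(sizeof(MyStructT) + n * sizeof(double)));
}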
Also, "threads > cores" is not a sane use case (unless you mean "threads
> physical cores"), as there's virtually no way this might increase
performance by any considerable amount.

> However, a substantial part of the performance degradation looked to be
> related to impacts on caching behavior - at least on my i3 machine. I've
> tested hard-coding to known orders. Doing this I see jumps in
> performance improvement as the size drops, which seem to be related to
> how well the allocations fit into some memory alignment.
>
> Said another way, it looks like sseq being 36 doubles + int in size for
> each polynomial means multiple equations are not optimally folded into
> some memory alignment. Allocating int + 5 doubles for the sseq size when
> that's all you need is substantially faster (2-3% as I remember - for
> what my memory is worth these days).

I can see two ways to address these things:
(A) Use more C++ instead of less, turning the solver into a template
with a maximum order parameter (a rough sketch of this idea follows below).
(B) Refactor the code to use a flat array, and place the data into the
first N*N elements.
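
For (A), I'm thinking of something roughly along these lines (a
hypothetical sketch with made-up names, not the actual polysolve()
code), where the maximum order becomes a compile-time parameter and the
Sturm sequence on the stack shrinks accordingly:

template<int N>                       // N = maximum order of this instantiation
struct polynomial_t
{
    int    ord;                       // order actually in use (<= N)
    double coef[N + 1];               // coefficients 0..N
};

template<int N>
int polysolve_t(const polynomial_t<N>& eq, double* roots)
{
    polynomial_t<N> sseq[N + 1];      // Sturm sequence, sized for N rather than MAX_ORDER
    sseq[0] = eq;                     // sseq[0] is the input polynomial, sseq[1] its derivative, ...
    (void)roots;                      // stub: sequence construction and root refinement omitted
    return 0;                         // number of real roots found
}

// The caller would dispatch the run-time order to a small set of
// instantiations, e.g. polysolve_t<4>, polysolve_t<15>, polysolve_t<MAX_ORDER>.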
Also, constructing `sseq` in such a way that its `polynomial` elements
end up aligned with cache line boundaries might already do the trick.
This might be achieved by declaring `polynomial` to be aligned to
multiples of 64 or even 128 bytes, using e.g. `struct alignas(64)
polynomial { ... };`.

Alternatively, we could reduce MAX_ORDER to the next lower value such
that sizeof(polynomial) ends up a multiple of 64 or 128, and hope that
the compiler is smart enough to align sseq on a cache line boundary.
This means we would want MAX_ORDER+2 to be a multiple of 8 or 16
(presuming sizeof(int)<=sizeof(double) and
alignof(double)==sizeof(double)), which would be satisfied by MAX_ORDER=30.
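
To put numbers on that (assuming `polynomial` is essentially an int
order field plus MAX_ORDER+1 double coefficients, matching the "36
doubles + int" figure above; PRECISE_FLOAT may of course be wider), the
two variants could be checked like this, with illustrative struct names:

// Variant 1: force cache-line alignment explicitly (MAX_ORDER stays 35).
struct alignas(64) polynomial_a
{
    int    ord;
    double coef[35 + 1];
};
static_assert(alignof(polynomial_a) == 64, "starts on a cache line");

// Variant 2: pick MAX_ORDER = 30 so the size is a whole number of cache
// lines: 8 bytes (int + padding) + 31 * 8 bytes = 256 bytes, i.e. exactly
// four 64-byte cache lines (relies on alignof(double) == 8).
struct polynomial_b
{
    int    ord;
    double coef[30 + 1];
};
static_assert(sizeof(polynomial_b) % 64 == 0, "whole cache lines");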
> And still I wonder about the value of having a max order 35 over 15...
> Did someone actually need 35? Accuracy degrades as order increases -
> even 15 is pushing it hard in the general case working with floats. If
> the >15 order polynomials are very sparse as presented to polysolve() or
> you're working with some polynomials that happen to be stable - maybe
> things work acceptably beyond 15, but I have doubts... All very high
> order solvers I've looked at are using exact - or extendable-accuracy -
> math (i.e. integers) for coefficient representation and calculation.

IIRC there was a use case for a `polynomial` primitive that needed an
order of 16, and Jerome decided it would be nonsense to increase it by 1
just to wait for the next use case to require an order of 17; so he went
ahead and increased the limit to something that was not totally
arbitrary. 35 was a natural choice because above this threshold the
`binomials` array of constants would overflow the data type on some
machines (32-bit `unsigned int`).
If we want to use a lower maximum order, 19 would be a logical choice,
as it's the threshold at which we could change the data type of the
`binomials` array from `unsigned int` to `unsigned short` to save some
memory.
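
For reference, those thresholds can be double-checked with a few lines
of code (I'm assuming here that `binomials` needs C(n,k) for n up to
MAX_ORDER-1; I haven't re-checked the exact indexing):

#include <cstdio>
#include <cstdint>

// Exact binomial coefficient via the multiplicative formula; each
// intermediate product is divisible by i, so the division is exact.
std::uint64_t binom(unsigned n, unsigned k)
{
    std::uint64_t b = 1;
    for (unsigned i = 1; i <= k; ++i)
        b = b * (n - k + i) / i;
    return b;
}

int main()
{
    std::printf("C(34,17) = %llu\n", (unsigned long long)binom(34, 17)); // 2333606220, fits 32-bit unsigned (max 4294967295)
    std::printf("C(35,17) = %llu\n", (unsigned long long)binom(35, 17)); // 4537567650, overflows 32-bit unsigned
    std::printf("C(18, 9) = %llu\n", (unsigned long long)binom(18,  9)); //      48620, fits unsigned short (max 65535)
    std::printf("C(19, 9) = %llu\n", (unsigned long long)binom(19,  9)); //      92378, overflows unsigned short
}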
Last but not least, don't forget that a performance change of 2% in the root
solver does not translate to a change of 2% in total render time. You
probably have a clearer picture of how much time POV-Ray spends in the
root solver, but remember that there is such a thing as over-optimization.