On 17.02.2019 at 14:07, William F Pokorny wrote:
> /// @def MAX_ORDER
> /// Maximum supported polynomial order.
> ///
> /// @todo
> /// This currently carries a large, fixed per polysolve() call
> /// memory allocation on the stack. Size is on the order of
> /// (MAX_ORDER+1)*int + PRECISE_FLOAT * (MAX_ORDER+1)^2
> /// which impacts performance and performance stability especially
> /// for threads > cores. Allocation based on current equation order
> /// would be better. My C++ attempts at this were all badly slower. C itself
> /// supports a struct "flexible array member" feature; however, this is
> /// not part of the C++11 standard, though some C++ compilers support it.
> ///
> #define MAX_ORDER 35

BTW, that "flexible array member" feature won't help you speed up
things, as it requires dynamic allocation on the heap, which is
super-expensive; whatever you allocate on the stack must necessarily
have a fixed size.

The feature is nothing more than syntactic sugar for accessing the
memory right after the struct as a pointer to a particular type, i.e.
allowing you to rewrite code like this:
typedef struct { /* ... */ } MyStructT;
MyStructT* myStructPtr;
...
unsigned char* startOfStruct = (unsigned char*)(myStructPtr);
unsigned char* endOfStruct = startOfStruct + sizeof(*myStructPtr);
T* myArray = (T*)(endOfStruct);
Foo = myArray[i];
as:
typedef struct { /* ... */ T myArray[]; } MyStructT;
...
Foo = myStructPtr->myArray[i];
but it doesn't magically solve the problem of how to allocate the proper
amount of memory for the object.
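
To make that explicit, the typical usage pattern still looks roughly
like the following (my own sketch, not actual POV-Ray code; in C++ the
flexible array member is only a compiler extension):

#include <cstdlib>   // std::malloc / std::free
#include <cstddef>   // std::size_t

struct MyStructT { int count; double myArray[]; };   // FAM; non-standard C++, but GCC/Clang accept it

MyStructT* makeMyStruct(std::size_t n)                // n is only known at run time
{
    // The array storage still has to come from the heap, placed right
    // behind the struct; the caller later releases it with std::free().
    return static_cast<MyStructT*>(std::malloc(sizeof(MyStructT) + n * sizeof(double)));
}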
Also, "threads > cores" is not a sane use case (unless you mean "threads
> physical cores"), as there's virtually no way this might increase
performance by any considerable amount.

> However, a substantial part of the performance degradation looked to be
> related to impacts on caching behavior - at least on my i3 machine. I've
> tested hard-coding to known orders. Doing this I see jumps in
> performance improvement as the size drops, which seem to be related to
> how well the allocations fit into some memory alignment.
>
> Said another way, it looks like sseq being 36 doubles + int in size for
> each polynomial means multiple equations are not optimally folded into
> some memory alignment. Allocating int + 5 doubles for the sseq size when
> that's all you need is substantially faster (2-3% as I remember - for
> what my memory is worth these days).

I can see two ways to address these things:
(A) Use more C++ instead of less, turning the solver into a template
with a maximum order parameter (a rough sketch of this idea follows below).
(B) Refactor the code to use a flat array, and place the data into the
first N*N elements.
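
For (A), I'm thinking of something roughly along these lines (a
hypothetical sketch with made-up names, not the actual polysolve()
code), where the maximum order becomes a compile-time parameter and the
Sturm sequence on the stack shrinks accordingly:

template<int N>                       // N = maximum order of this instantiation
struct polynomial_t
{
    int    ord;                       // order actually in use (<= N)
    double coef[N + 1];               // coefficients 0..N
};

template<int N>
int polysolve_t(const polynomial_t<N>& eq, double* roots)
{
    polynomial_t<N> sseq[N + 1];      // Sturm sequence, sized for N rather than MAX_ORDER
    sseq[0] = eq;                     // sseq[0] is the input polynomial, sseq[1] its derivative, ...
    (void)roots;                      // stub: sequence construction and root refinement omitted
    return 0;                         // number of real roots found
}

// The caller would dispatch the run-time order to a small set of
// instantiations, e.g. polysolve_t<4>, polysolve_t<15>, polysolve_t<MAX_ORDER>.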
Also, constructing `sseq` in such a way that its `polynomial` elements
end up aligned with cache line boundaries might already do the trick.
This might be achieved by declaring `polynomial` to be aligned to
multiples of 64 or even 128 bytes, using e.g. `struct alignas(64)
polynomial { ... };`.

Alternatively, we could reduce MAX_ORDER to the next lower value such
that sizeof(polynomial) ends up a multiple of 64 or 128, and hope that
the compiler is smart enough to align sseq on a cache line boundary.
This means we would want MAX_ORDER+2 to be a multiple of 8 or 16
(presuming sizeof(int)<=sizeof(double) and
alignof(double)==sizeof(double)), which would be satisfied by MAX_ORDER=30.
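
To put numbers on that (assuming `polynomial` is essentially an int
order field plus MAX_ORDER+1 double coefficients, matching the "36
doubles + int" figure above; PRECISE_FLOAT may of course be wider), the
two variants could be checked like this, with illustrative struct names:

// Variant 1: force cache-line alignment explicitly (MAX_ORDER stays 35).
struct alignas(64) polynomial_a
{
    int    ord;
    double coef[35 + 1];
};
static_assert(alignof(polynomial_a) == 64, "starts on a cache line");

// Variant 2: pick MAX_ORDER = 30 so the size is a whole number of cache
// lines: 8 bytes (int + padding) + 31 * 8 bytes = 256 bytes, i.e. exactly
// four 64-byte cache lines (relies on alignof(double) == 8).
struct polynomial_b
{
    int    ord;
    double coef[30 + 1];
};
static_assert(sizeof(polynomial_b) % 64 == 0, "whole cache lines");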
> And still I wonder about the value of having a max order 35 over 15...
> Did someone actually need 35? Accuracy degrades as order increases -
> even 15 is pushing it hard in the general case working with floats. If
> the >15 order polynomials are very sparse as presented to polysolve() or
> you're working with some polynomials that happen to be stable - maybe
> things work acceptably beyond 15, but I have doubts... All very high
> order solvers I've looked at are using exact - or extendable-accuracy -
> math (i.e. integers) for coefficient representation and calculation.

IIRC there was a use case for a `polynomial` primitive that needed an
order of 16, and Jerome decided it would be nonsense to increase it by 1
just to wait for the next use case to require an order of 17; so he went
ahead and increased the limit to something that was not totally
arbitrary. 35 was a natural choice because above this threshold the
`binomials` array of constants would overflow the data type on some
machines (32-bit `unsigned int`).
If we want to use a lower maximum order, 19 would be a logical choice,
as it's the threshold at which we could change the data type of the
`binomials` array from `unsigned int` to `unsigned short` to save some
memory.
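
For reference, those thresholds can be double-checked with a few lines
of code (I'm assuming here that `binomials` needs C(n,k) for n up to
MAX_ORDER-1; I haven't re-checked the exact indexing):

#include <cstdio>
#include <cstdint>

// Exact binomial coefficient via the multiplicative formula; each
// intermediate product is divisible by i, so the division is exact.
std::uint64_t binom(unsigned n, unsigned k)
{
    std::uint64_t b = 1;
    for (unsigned i = 1; i <= k; ++i)
        b = b * (n - k + i) / i;
    return b;
}

int main()
{
    std::printf("C(34,17) = %llu\n", (unsigned long long)binom(34, 17)); // 2333606220, fits 32-bit unsigned (max 4294967295)
    std::printf("C(35,17) = %llu\n", (unsigned long long)binom(35, 17)); // 4537567650, overflows 32-bit unsigned
    std::printf("C(18, 9) = %llu\n", (unsigned long long)binom(18,  9)); //      48620, fits unsigned short (max 65535)
    std::printf("C(19, 9) = %llu\n", (unsigned long long)binom(19,  9)); //      92378, overflows unsigned short
}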
Last but not least, don't forget that a performance change of 2% in the root
solver does not translate to a change of 2% in total render time. You
probably have a clearer picture of how much time POV-Ray spends in the
root solver, but remember that there is such a thing as over-optimization.