  Re: Fwd: Stack Size Testers Wanted  
From: William F Pokorny
Date: 17 Feb 2019 08:07:41
Message: <5c695c9d$1@news.povray.org>
On 2/16/19 4:39 PM, clipka wrote:
>> We still have some largish stack allocations - the sturm solver, when 
>> we increased the max order from 15 to 35 (why?) in v3.7, now causes a 
>> 600KB plus (??? don't recall exactly) allocation on the stack - at a 
>> considerable performance penalty I'll add(1) - no matter the actual 
>> incoming equation order. This alone might cause your suggested size of 
>> 512KB to fail. Well, except perhaps on Windows, where presumably stack 
>> splitting is still in place.
> 
> Where would those 600 kB be allocated?

No clue. :-( My memory not being so good of late is my bet... When I 
look at the comments I've added to polynomialsolver.h, they read:

/// @def MAX_ORDER
/// Maximum supported polynomial order.
///
/// @todo
///     This currently carries a large, fixed per-polysolve()-call
///     memory allocation on the stack. Size is on the order of
///     (MAX_ORDER+1)*sizeof(int) + (MAX_ORDER+1)^2*sizeof(PRECISE_FLOAT),
///     which impacts performance and performance stability, especially
///     for threads > cores. Allocation based on the current equation
///     order would be better. My C++ attempts at this were all badly
///     slower. C itself supports a struct "flexible array member"
///     feature; however, this is not part of the C++11 standard, though
///     some C++ compilers support it.
///
#define MAX_ORDER 35

which aligns with what you state below, so at some point last year I had 
it more or less right...
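
Plugging numbers into that formula as a sanity check (a standalone 
sketch only; PRECISE_FLOAT taken as plain double here - a long double 
build roughly doubles it, but nothing gets anywhere near 600 kB):

#include <cstdio>

// Stand-ins for the real build configuration; PRECISE_FLOAT may be
// double or long double depending on how POV-Ray is compiled, and the
// struct layout is my reconstruction from the header comment above.
typedef double PRECISE_FLOAT;
#define MAX_ORDER 35

struct polynomial                          // one Sturm sequence entry:
{
    int           ord;                     // 1 integer (the order) ...
    PRECISE_FLOAT coef[MAX_ORDER + 1];     // ... plus 36 coefficients
};

int main()
{
    // sseq in polysolve() is one polynomial per entry, 0..MAX_ORDER.
    polynomial sseq[MAX_ORDER + 1];
    std::printf("per element: %zu bytes\n", sizeof(polynomial));
    std::printf("whole sseq : %zu bytes (~%zu kB)\n",
                sizeof(sseq), sizeof(sseq) / 1024);
    return 0;
}

On a typical 64-bit build that prints 296 bytes per element and 10656 
bytes (~10 kB) total - your figure, not my 600.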

> 
> The only out-of-the-ordinary local variable I see in all of the 
> polynomial solver code is `sseq` in `polysolve()`, which is 36 elements 
> of type `polynomial`, which in turn consists of 1 integer and 36 
> doubles. That's about 10 kB, not 600.
> 
>> (1) - My C++-ish attempts to address this have all been really slow, 
>> or not worked at all when I try to get fancy/fast with raw memory 
>> allocations and pointers. 

I have it in my head that I tried thread_local... I'll try this again.
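
Roughly what I have in mind (a sketch only - the real polysolve() 
signature and struct layout differ, and the names are from memory):

#include <cstring>

typedef double PRECISE_FLOAT;   // stand-in; real builds may differ
#define MAX_ORDER 35

struct polynomial
{
    int           ord;
    PRECISE_FLOAT coef[MAX_ORDER + 1];
};

// Hypothetical rework: the Sturm sequence scratch space lives in
// thread-local storage, so each call stops paying the ~10 kB stack
// cost; the buffer instead persists for the life of the thread.
static int polysolve_sketch(int order, const PRECISE_FLOAT* coef)
{
    static thread_local polynomial sseq[MAX_ORDER + 1];

    // Only the first (order + 1) coefficients are ever touched, so the
    // oversized buffer adds no per-call work - just resident memory.
    sseq[0].ord = order;
    std::memcpy(sseq[0].coef, coef, (order + 1) * sizeof(PRECISE_FLOAT));

    // ... Sturm sequence construction and root isolation would follow ...
    return 0;
}

int main()
{
    const PRECISE_FLOAT quartic[5] = { 1, 0, -5, 0, 4 };  // x^4-5x^2+4
    return polysolve_sketch(4, quartic);
}

If I did try this before and it tested slow, the per-access TLS 
indirection on some platforms could be part of why.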

However, a substantial part of the performance degradation looked to be 
related to impacts on caching behavior - at least on my i3 machine. I've 
tested hard-coding to known orders. Doing this, I see jumps in 
performance improvement as the size drops, which seem to be related to 
how well the allocations fit some memory alignment.

Said another way, it looks like each sseq element being 36 doubles + an 
int in size means multiple equations are not optimally folded into some 
memory alignment. Allocating an int + 5 doubles per element when that's 
all you need is substantially faster (2-3%, as I remember - for what my 
memory is worth these days).

And still I wonder about the value of having a max order of 35 over 
15... Did someone actually need 35? Accuracy degrades as order increases 
- even 15 is pushing it hard in the general case when working with 
floats. If the >15-order polynomials are very sparse as presented to 
polysolve(), or you're working with some polynomials that happen to be 
stable - maybe things work acceptably beyond 15, but I have doubts... 
All the very high-order solvers I've looked at use exact - or 
extended-precision - math (i.e. integers) for coefficient representation 
and calculation.
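
Wilkinson's classic degree-20 polynomial is a standard illustration of 
how quickly doubles give out, well short of order 35:

#include <cstdio>

int main()
{
    // p(x) = (x-1)(x-2)...(x-20), expanded into double coefficients
    // stored leading-first. The constant term is 20!, about 2.4e18,
    // which already exceeds the 2^53 range a double holds exactly.
    double coef[21] = { 1.0 };              // start from p(x) = 1
    for (int k = 1; k <= 20; ++k)           // multiply by (x - k)
        for (int i = 20; i >= 1; --i)
            coef[i] -= k * coef[i - 1];

    // p(20) is exactly 0, but Horner evaluation of the expanded
    // coefficients cancels ~1e26-sized terms and lands nowhere near.
    double p = 0.0;
    for (int i = 0; i <= 20; ++i)
        p = p * 20.0 + coef[i];
    std::printf("p(20) = %g (exact answer: 0)\n", p);
    return 0;
}

The printed value varies by platform, but it is wildly nonzero - and 
that is only order 20, with a perfectly known polynomial.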

Bill P.

