POV-Ray : Newsgroups : povray.beta-test : Fwd: Stack Size Testers Wanted Server Time
6 Dec 2021 06:41:09 EST (-0500)
  Fwd: Stack Size Testers Wanted (Message 3 to 12 of 12)  
<<< Previous 2 Messages Goto Initial 10 Messages
From: clipka
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 16 Feb 2019 16:39:11
Message: <5c6882ff$1@news.povray.org>
Am 16.02.2019 um 16:17 schrieb William F Pokorny:

> This leads to me close with being somewhat worried about the suggested 
> test size of 512KB.

The "target" size of 512 kiB is an invariant - that's the per-thread 
stack size we would get on OS X by default.

> We still have some largish stack allocations - the 
> sturm solver when we increased the max order from 15 to 35(why ?) in 
> v3.7 now causes a 600KB plus (??? don't recall exactly) allocation on 
> the stack - at a considerable performance penalty I'll add(1) - no 
> matter the actual incoming equation order. This alone might cause your 
> suggested size of 512KB to fail. Well, except perhaps on windows where 
> presumably splitting still in place.

Where would those 600 kB be allocated?

The only out-of-the-ordinary local variable I see in all of the 
polynomial solver code is `sseq` in `polysolve()`, which is 36 elements 
of type `polynomial`, which in turn consists of 1 integer and 36 
doubles. That's about 10 kB, not 600.

> (1) - My C++ ish attempts to address this have all been really slow or 
> not worked at all when I try to get fancy/fast with raw memory 
> allocations and pointers.

If the variable is local to a function that is guaranteed to never be 
called recursively, the easiest solution to guarantee speedy allocation 
and not hogging the stack is to change the variable declaration to 
`thread_local`. This is essentially the same as `static`, but with each 
thread having its own copy of the variable.

> C and some C++ compilers (IBM's XLC being one) 
> support the needed dynamic structure array-size allocation mechanism 
> with high performance. This a reason - among others - why I'm toying 
> some with taking the common solver code back to straight C. We will 
> soon, I think? - move to a mixed C++/C mode in picking up FreeType 
> which, as I remember, is actually a C library - so perhaps not so crazy.

Veto to that: All of POV-Ray should be valid C++11 code, so that POV-Ray 
can be compiled with only a single C++ compiler. 3rd party libraries are 
another matter; we expect those to be present in binary form (except on 
Windows, where we expect their source code to be compatible with MSVC's 
dialect of C/C++), and the headers of modern C libs are almost 
invariably designed as hybrid C/C++ headers, in the sense that they use 
pre-processor features to behave as either C or C++ code, depending on 
the compiler used.

There is currently only one genuine C source file ihn POV-Ray proper, 
namely `povms.c`, and even that one is designed as a C/C++ hybrid and 
compiled by including it from a `.cpp` file. (The rationale for this C 
file's existence being that it - in theory - allows 3rd party genuine C 
programs to drive the POV-Ray back-end via POVMS.)


Post a reply to this message

From: clipka
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 16 Feb 2019 19:06:37
Message: <5c68a58d$1@news.povray.org>
Just to clear up some confusion:

On Windows, stack size is _not_ "virtually unlimited" - nor does Windows 
support split stacks. Rather, each thread's stack occupies a contiguous 
fixed block of address space. The only thing that's dynamic about it is 
that only a small portion is initially backed by physical (or even 
virtual) memory.

The per-thread stack size is configured at link time, and just happens 
to be set to a humongously large size in the POV-Ray project.


Post a reply to this message

From: clipka
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 16 Feb 2019 20:53:51
Message: <5c68beaf$1@news.povray.org>
Am 16.02.2019 um 16:17 schrieb William F Pokorny:

> I've attached an attractors.zip file to a comment to the github issue:
> 
> https://github.com/POV-Ray/povray/issues/239

Thanks - yes, that seems to be the right stuff for testing this.

To reduce parsing and rendering time, I'm testing a slightly modified 
version with 100 instead of 1000 iterations, i.e.:

clifford (1.6,1.9, 0.5,-1.4, -0.4,-0.1, -1.6,1.8, 100, 0.0005, 0.00375, 
cliff_mat, cliff_trans)


On my Windows machine and the x64 Debug version (which should be a bit 
more memory-hungry than the others), I've just compared commit accc779d 
(which did away with the `FixedSimpleVector`, back in 2017-11-10) with 
the commit immediately before, with the following results:

- Before: Required a POV_THREAD_STACK_SIZE of 2 MiB.
- After: Requires a lousy 16 kiB.

I also ran the last test with 1000 iterations just for giggles, with the 
same result: 16 kiB was still sufficient.


I also see no trouble with the Sturmian root solver at 16 kiB (tested 
with `objects/torus1.pov` sample scene, with `sturm` added to torus 
definitions).


So, bottom line: Judging from this first test, even OS X's 512 kiB 
should be enough for everybody now ;)


Post a reply to this message

From: William F Pokorny
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 17 Feb 2019 08:07:41
Message: <5c695c9d$1@news.povray.org>
On 2/16/19 4:39 PM, clipka wrote:
>> We still have some largish stack allocations - the sturm solver when 
>> we increased the max order from 15 to 35(why ?) in v3.7 now causes a 
>> 600KB plus (??? don't recall exactly) allocation on the stack - at a 
>> considerable performance penalty I'll add(1) - no matter the actual 
>> incoming equation order. This alone might cause your suggested size of 
>> 512KB to fail. Well, except perhaps on windows where presumably 
>> splitting still in place.
> 
> Where would those 600 kB be allocated?

No clue. :-( My mind not being so good again of late is my bet... When I 
look at the comments I've added to polynomialsolver.h they read:

/// @def MAX_ORDER
/// Maximum supported polynomial order.
///
/// @todo
///     This currently carries a large, fixed per polysolve() call
///     memory allocation on the stack. Size is on the order of
///     (MAX_ORDER+1)*int + PRECISE_FLOAT * (MAX_ORDER+1)^2
///     which impacts performance and performance stability especially
///     for threads > cores. Allocation based on current equation order
///     would be better. My C++ attempts at this all badly slower. C itself
///     supports a struct "flexible array member" feature, however this
///     not a C++11 standard though some C++ compilers support it.
///
#define MAX_ORDER 35

which aligns with what you state below so at some point last year I had 
it more or less right...

> 
> The only out-of-the-ordinary local variable I see in all of the 
> polynomial solver code is `sseq` in `polysolve()`, which is 36 elements 
> of type `polynomial`, which in turn consists of 1 integer and 36 
> doubles. That's about 10 kB, not 600.
> 
>> (1) - My C++ ish attempts to address this have all been really slow or 
>> not worked at all when I try to get fancy/fast with raw memory 
>> allocations and pointers. 

I have in my head I tried thread_local... I'll try this again.

However, a substantial part of the performance degrade looked to be 
related to impacts to caching behavior - at least on my i3 machine. I've 
tested hard coding to known orders. Doing this I see jumps in 
performance improvement as the size drops which seem to be related to 
how well the allocations fit to some memory alignment.

Said another way, it looks like sseg being 36 doubles + int in size for 
each polynomial means multiple equations are not optimally folded into 
some memory alignment. Allocating int + 5 doubles for the sseq size when 
that's all you need is substantially faster(2-3% as I remember - for 
what my memory is worth these days).

And still I wonder about the value of having a max order 35 over 15... 
Did someone actually need 35? Accuracy degrades as order increases - 
even 15 is pushing it hard in the general case working with floats. If 
the >15 order polynomials are very sparse as presented to polysolve() or 
your working with some polynomials that happen to be stable - maybe 
things work acceptably beyond 15, but I have doubts... All very high 
order solvers I've looked at are using exact - or extendable accuracy - 
math (ie integers) for coefficient representation and calculation.

Bill P.


Post a reply to this message

From: clipka
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 17 Feb 2019 14:30:34
Message: <5c69b65a$1@news.povray.org>
Am 17.02.2019 um 14:07 schrieb William F Pokorny:

> /// @def MAX_ORDER
> /// Maximum supported polynomial order.
> ///
> /// @todo
> ///     This currently carries a large, fixed per polysolve() call
> ///     memory allocation on the stack. Size is on the order of
> ///     (MAX_ORDER+1)*int + PRECISE_FLOAT * (MAX_ORDER+1)^2
> ///     which impacts performance and performance stability especially
> ///     for threads > cores. Allocation based on current equation order
> ///     would be better. My C++ attempts at this all badly slower. C itself
> ///     supports a struct "flexible array member" feature, however this
> ///     not a C++11 standard though some C++ compilers support it.
> ///
> #define MAX_ORDER 35

BTW, that "flexible array member" feature won't help you speed up 
things, as it requires dynamic allocation on the heap, which is 
super-expensive; whatever you allocate on the stack must necessarily 
have a fixed size.

The feature is nothing more than syntactical sugar for accessing the 
memory right after the struct as a pointer to a particular type, i.e. 
allowing to rewrite code like this:

     typedef struct { /* ... */ } MyStructT;
     MyStructT* myStructPtr;
     ...
     unsigned char* startOfStruct = (unsigned char*)(myStructPtr);
     unsigned char* endOfStruct   = startOfStruct + sizeof(*myStructPtr);
     T* myArray = (T*)(endOfStruct);
     Foo = myArray[i];

as:

     typedef struct { /* ... */ T myArray[]; } MyStructT;
     ...
     Foo = myStructPtr->myArray[i];

but it doesn't magically solve the problem of how to allocate the proper 
amount of memory for the object.


Also, "threads > cores" is not a sane use case (unless you mean "threads 
 > physical cores"), as there's virtually no way this might increase 
performance by any considerable amount.


> However, a substantial part of the performance degrade looked to be 
> related to impacts to caching behavior - at least on my i3 machine. I've 
> tested hard coding to known orders. Doing this I see jumps in 
> performance improvement as the size drops which seem to be related to 
> how well the allocations fit to some memory alignment.
> 
> Said another way, it looks like sseg being 36 doubles + int in size for 
> each polynomial means multiple equations are not optimally folded into 
> some memory alignment. Allocating int + 5 doubles for the sseq size when 
> that's all you need is substantially faster(2-3% as I remember - for 
> what my memory is worth these days).

I can see two ways how to address these things:

(A) Use more C++ instead of less, turning the solver into a template 
with a maximum order parameter.

(B) Refactor the code to use a flat array, and place the data into the 
first N*N elements.


Also, constructing `sseq` in such a way that its `polynomial` elements 
end up aligned with cache line boundaries might already do the trick.

This might be achieved by declaring `polynomial` to be aligned to 
multiples of 64 or even 128 bytes, using e.g. `struct alignas(64) 
polynomial { ... };`.

Alternatively, we could reduce MAX_ORDER to the next lower value such 
that sizeof(polynomial) ends up a multiple of 64 or 128, and hope that 
the compiler is smart enough to align sseq on a cache line boundary. 
This means we would want MAX_ORDER+2 to be a multiple of 8 or 16 
(presuming sizeof(int)<=sizeof(double) and 
alignof(double)==sizeof(double)), which would be satisfied by MAX_ORDER=30.


> And still I wonder about the value of having a max order 35 over 15... 
> Did someone actually need 35? Accuracy degrades as order increases - 
> even 15 is pushing it hard in the general case working with floats. If 
> the >15 order polynomials are very sparse as presented to polysolve() or 
> your working with some polynomials that happen to be stable - maybe 
> things work acceptably beyond 15, but I have doubts... All very high 
> order solvers I've looked at are using exact - or extendable accuracy - 
> math (ie integers) for coefficient representation and calculation.

IIRC there was a use case for a `polynomial` primitive that needed an 
order of 16, and Jerome decided it would be nonsense to increase it by 1 
just to wait for the next use case to require an order of 17; so he went 
ahead and increased the limit to something that was not totally 
arbitrary. 35 was a natural choice because above this threshold the 
`binomials` array of constants would overflow the data type on some 
machines (32-bit `unsigned int`).

If we want to use a lower maximum order, 19 would be a logical choice, 
as it's the threshold at which we could change the datat type of the 
`binomials` array from `unsigned int` to `unsigned short` to save some 
memory.


Last not least, don't forget that a performance change of 2% in the root 
solver does not translate to a change of 2% in total render time. You 
probably have a clearer picture of how much time POV-Ray spends in the 
root solver, but remember that there is such a thing as over-optimization.


Post a reply to this message

From: William F Pokorny
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 17 Feb 2019 18:41:58
Message: <5c69f146$1@news.povray.org>
On 2/17/19 2:30 PM, clipka wrote:
> Am 17.02.2019 um 14:07 schrieb William F Pokorny:
> 
...
> 
> BTW, that "flexible array member" feature won't help you speed up 
> things, as it requires dynamic allocation on the heap, which is 
> super-expensive; whatever you allocate on the stack must necessarily 
> have a fixed size.

OK. Surprised the heap allocation is slower than one on the stack. I 
figured the allocation and free mechanisms were more or less the same - 
just being a difference of where in memory it happened. Guess not?

Also surprised allocations on the stack have to be of fixed size. Is it 
so on the function call just one allocation can happen on the stack for 
all the dynamic variables in the function?

> 
> 
> Also, "threads > cores" is not a sane use case (unless you mean "threads 
>  > physical cores"), as there's virtually no way this might increase 
> performance by any considerable amount.
> 

Yes, "threads > physical cores."

> 

I'm working with the solver branch today and was able to run the 
following without too much effort.

My i3 has two physical cores, 2 threads each. lemonSturm.pov scene long 
used for solver work. Below comparing today's MAX_ORDER 35 vs the 
necessary MAX_ORDER 4 allocation. Commands being of the form:

/usr/bin/time povray lemonSturm.pov -j +wt3 -cc -fn -d -p

+wt4 (what POV-Ray would use by default)
--------------------------------------------
22.99 -> 21.49 ---> -6.52%

+wt3
----
19.11 -> 18.11 ---> -5.23%

+wt2
----
14.32 -> 13.80 ---> -3.63%

+wt1
----
14.27 -> 13.74 ---> -3.71%


> 
> I can see two ways how to address these things:
> 
> (A) Use more C++ instead of less, turning the solver into a template 
> with a maximum order parameter.
> 
> (B) Refactor the code to use a flat array, and place the data into the 
> first N*N elements.
> 
> 
> Also, constructing `sseq` in such a way that its `polynomial` elements 
> end up aligned with cache line boundaries might already do the trick.
> 
> This might be achieved by declaring `polynomial` to be aligned to 
> multiples of 64 or even 128 bytes, using e.g. `struct alignas(64) 
> polynomial { ... };`.
> 
> Alternatively, we could reduce MAX_ORDER to the next lower value such 
> that sizeof(polynomial) ends up a multiple of 64 or 128, and hope that 
> the compiler is smart enough to align sseq on a cache line boundary. 
> This means we would want MAX_ORDER+2 to be a multiple of 8 or 16 
> (presuming sizeof(int)<=sizeof(double) and 
> alignof(double)==sizeof(double)), which would be satisfied by MAX_ORDER=30.
> 

Thanks for the ideas. I add them to my notes to try at some point.

> 
>> And still I wonder about the value of having a max order 35 over 15... 
...
> 
> IIRC there was a use case for a `polynomial` primitive that needed an 
> order of 16, and Jerome decided it would be nonsense to increase it by 1 
> just to wait for the next use case to require an order of 17; so he went 
> ahead and increased the limit to something that was not totally 
> arbitrary. 35 was a natural choice because above this threshold the 
> `binomials` array of constants would overflow the data type on some 
> machines (32-bit `unsigned int`).
> 
> If we want to use a lower maximum order, 19 would be a logical choice, 
> as it's the threshold at which we could change the datat type of the 
> `binomials` array from `unsigned int` to `unsigned short` to save some 
> memory.
> 

Thanks for the background. I'll put it on my list to look at what 19 
would give us performance wise.

> 
> Last not least, don't forget that a performance change of 2% in the root 
> solver does not translate to a change of 2% in total render time. You 
> probably have a clearer picture of how much time POV-Ray spends in the 
> root solver, but remember that there is such a thing as over-optimization.

Understand. My measurements are almost always in total render time.

I admit to chasing performance harder than some might. Comes from my 
belief you need to stay after it. Otherwise you get fooled, due degraded 
performance everywhere over time, that certain improvements don't matter 
enough to do them. A reason I think about that 3-6% being more than it 
is as measured is I know there are at least two other substantial 
performance issues in the current sturm solver. If those can be 
addressed, we'll be talking about a good bit more than 3-6% improvement 
for exact-order allocations.

Bill P.


Post a reply to this message

From: dick balaska
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 17 Feb 2019 21:05:09
Message: <5c6a12d5$1@news.povray.org>
On 2/17/19 6:41 PM, William F Pokorny wrote:

> 
> OK. Surprised the heap allocation is slower than one on the stack. I
> figured the allocation and free mechanisms were more or less the same -
> just being a difference of where in memory it happened. Guess not?

The compiler figures out how much stack space a function call needs, and
allocating that memory is simply a matter of moving the stack pointer
*once*, a single register operation, upon entry to a function.

Heap requires a separate function call for each object to be allocated.
That function examines its tables trying to find a "best fit" for the
size requested, mark those tables as allocated, and return a pointer.
For the reverse, free'ing tries to merge freed memory [1].  Much, much
more expensive than stack.

[1] Old MS-DOS systems you could kill it by allocating all of memory as
1KB blocks, free all those blocks, then try to allocate a 2KB block. Out
of memory.

> 
> Also surprised allocations on the stack have to be of fixed size. Is it
> so on the function call just one allocation can happen on the stack for
> all the dynamic variables in the function?

See above.

-- 
dik
Rendered 1024 of 921600 pixels (0%)


Post a reply to this message

From: clipka
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 17 Feb 2019 21:39:02
Message: <5c6a1ac6$1@news.povray.org>
Am 18.02.2019 um 00:41 schrieb William F Pokorny:

>> BTW, that "flexible array member" feature won't help you speed up 
>> things, as it requires dynamic allocation on the heap, which is 
>> super-expensive; whatever you allocate on the stack must necessarily 
>> have a fixed size.
> 
> OK. Surprised the heap allocation is slower than one on the stack. I 
> figured the allocation and free mechanisms were more or less the same - 
> just being a difference of where in memory it happened. Guess not?

Definitely not.

On the stack, allocation and release only ever happens at the top of the 
stack, and the free area is always a single contiguous block; that's 
trivial to keep track of: A single pointer to the current start of the 
free region (with most CPUs providing a dedicated register for that: the 
Stack Pointer), and some means to detect when you're running past the 
end of the memory section you have set aside for the stack (typically 
done by setting up a "guard page" at the end of the stack, and crying 
bloody murder if that guard page is ever accessed).

At least that's the classic stack implementation; new fancy stuff like 
split stacks change the picture, but that's why they're slower.

On the heap, release can happen anywhere anytime, and you may end up 
with a gazillion separate blocks of free memory you have to keep track 
of, test if you can merge any of them, and search through whenever you 
need to allocate memory again to find a region of suitable size. "Heap 
fragmentation" being the buzzword there.


> Also surprised allocations on the stack have to be of fixed size. Is it 
> so on the function call just one allocation can happen on the stack for 
> all the dynamic variables in the function?

It's essentially a matter of the programm language used. If you were to 
implement a function in assembler, there is nothing stopping you from 
advancing the stack pointer by a computed value (which is all that's 
necessary to allocate memory on the stack), provided you later make sure 
you revert that change to the stack pointer (and also follow some other 
conventions; e.g. some conventions require the stack pointer to be 
16-byte aligned when invoking some other function).

As a matter of fact, as I just discovered, C99 _does_ support such a 
thing; but it's not "flexible array member" but "variable length 
arrays"; and they can't be nested in structs, and C++ did not adopt them.

One reason why C++ did not adopt them is some entirely unrelated 
fundamental stumbling block rooted in the C++ type concept. But I guess 
they might also mess up - or at least complicate - exception handling; 
"stack unwinding" would be the buzzword there.


Post a reply to this message

From: William F Pokorny
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 18 Feb 2019 08:45:07
Message: <5c6ab6e3$1@news.povray.org>
On 2/17/19 9:39 PM, clipka wrote:
> Am 18.02.2019 um 00:41 schrieb William F Pokorny:
> 
...
>>
>> OK. Surprised the heap allocation is slower than one on the stack. I 
>> figured the allocation and free mechanisms were more or less the same 
>> - just being a difference of where in memory it happened. Guess not?
> 
> Definitely not.
> 

Dick, Christoph, Thanks both for the helpful tutorials on stack operation.

...
> 
> As a matter of fact, as I just discovered, C99 _does_ support such a 
> thing; but it's not "flexible array member" but "variable length 
> arrays"; and they can't be nested in structs, and C++ did not adopt them.
> 
> One reason why C++ did not adopt them is some entirely unrelated 
> fundamental stumbling block rooted in the C++ type concept. But I guess 
> they might also mess up - or at least complicate - exception handling; 
> "stack unwinding" would be the buzzword there.

This last I'd come across as I was thrashing around. I must have blended 
the two features together as one in my mind...

Bill P.


Post a reply to this message

From: William F Pokorny
Subject: Re: Fwd: Stack Size Testers Wanted
Date: 18 Feb 2019 09:08:58
Message: <5c6abc7a$1@news.povray.org>
On 2/17/19 6:41 PM, William F Pokorny wrote:
>>
>> IIRC there was a use case for a `polynomial` primitive that needed an 
>> order of 16, and Jerome decided it would be nonsense to increase it by 
>> 1 just to wait for the next use case to require an order of 17; so he 
>> went ahead and increased the limit to something that was not totally 
>> arbitrary. 35 was a natural choice because above this threshold the 
>> `binomials` array of constants would overflow the data type on some 
>> machines (32-bit `unsigned int`).
>>
>> If we want to use a lower maximum order, 19 would be a logical choice, 
>> as it's the threshold at which we could change the datat type of the 
>> `binomials` array from `unsigned int` to `unsigned short` to save some 
>> memory.
>>
> 
> Thanks for the background. I'll put it on my list to look at what 19 
> would give us performance wise.

Same test set up as before; setting MAX_ORDER to 19.

+wt1
14.25 -> 14.14 ---> -0.77%

+wt2
14.31 -> 14.16 ---> -1.05%

+wt3
19.11 -> 18.79 ---> -1.67%

+wt4
22.90 -> 22.35 ---> -2.40%

For my own day to day use I plan to hard code my working POV-Ray version 
to 10 (sphere_sweep) as that's the largest needed for other than user 
coded polynomials as far as I know.

Bill P.


Post a reply to this message

<<< Previous 2 Messages Goto Initial 10 Messages

Copyright 2003-2021 Persistence of Vision Raytracer Pty. Ltd.