|
|
|
|
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Forgot to cross-post to `povray.beta-test`:
-------- Weitergeleitete Nachricht --------
Betreff: Stack Size Testers Wanted
Datum: Sat, 16 Feb 2019 01:40:46 +0100
Von: clipka <ano### [at] anonymousorg>
Newsgruppen: povray.macintosh
Hi folks,
I'm looking for guinea pigs for a particular test.
Here's the background:
A while ago we (well, some of you) were having issues with crashes due
to insufficient thread stack size on Mac OS X, or in one case even on
Linux; we solved these issues with a workaround to override the
per-thread stack size (or, in the Linux case, increase that override).
In the meantime, some changes have been made to the code which should
have reduced POV-Ray's stack requirements, so my hope is that the
workaround is no longer necessary - which would be great news, because
we could get rid of the entire boost thread library.
Here's where you come in:
If you have ever experienced problems related to the thread stack size,
and still have a scene available that ran into these issues, then I'd
like to enlist your help.
@dick balaska:
I'm not sure if you're aware, but I know you were affected; see your
post on 2017-02-12 in povray.beta-test titled "crash in origin/master"
(http://news.povray.org/povray.beta-test/thread/%3C58a1e32f%241%40news.povray.org%3E/)
Here's what I'd like you to do:
(1) Reproduce the old problem and workaround [optional]
- Grab the source code of a sufficiently OLD v3.8.0-alpha (BEFORE
v3.8.0-alpha.9436902; anything built BEFORE December 2017 should do),
v3.7.1-alpha/beta or even the latest v3.7.0.
- In the file `source/backend/configbackend.h`, place the following
lines at the end of the file:
#undef POV_THREAD_STACK_SIZE
#define POV_THREAD_STACK_SIZE (512 * 1024) // 512 KiB
- Build POV-Ray.
- Run whatever scene you remember crashing on you.
- Verify that the scene does indeed crash. (If not, your scene does not
seem to be a suitable test candidate.)
- If you want to go the extra mile, increase POV_THREAD_STACK_SIZE to
see at which size the scene ceases to crash. (I recommend doubling the
value; you shouldn't have to go any further than 8*1024*1024.)
(2) Test whether the workaround is still required
- Grab the source code of a sufficiently NEW v3.8.0-alpha (AT LEAST
v3.8.0-alpha.9436902 or newer; anything built in 2018 or 2019 should do;
I'd recommend the newest tagged alpha though).
- Apply the same changes to `source/backend/configbackend.h` as
described above.
- Build POV-Ray.
- Run your test scene.
- Observe whether the scene crashes or not.
- If you want to go the extra mile, change POV_THREAD_STACK_SIZE to see
at which size the behaviour changes: If the scene crashes, increase the
value until it ceases to; if the scene does not crash, decrease the
value until it does. (I recommend doubling / halving the value.)
(3) Report your observations.
Your help is very much appreciated!
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 2/15/19 7:44 PM, clipka wrote:
> Forgot to cross-post to `povray.beta-test`:
>
> -------- Weitergeleitete Nachricht --------
> Betreff: Stack Size Testers Wanted
> Datum: Sat, 16 Feb 2019 01:40:46 +0100
> Von: clipka <ano### [at] anonymousorg>
> Newsgruppen: povray.macintosh
>
> Hi folks,
>
> I'm looking for guinea pigs for a particular test.
>
>
> Here's the background:
>
> A while ago we (well, some of you) were having issues with crashes due
> to insufficient thread stack size on Mac OS X, or in one case even on
> Linux; we solved these issues with a workaround to override the
> per-thread stack size (or, in the Linux case, increase that override).
>
...
I've attached an attractors.zip file to a comment to the github issue:
https://github.com/POV-Ray/povray/issues/239
which is a related open issue. One which should be closed if recent
updates have better addressed the problem.
The zip contains a set of test cases from ThH (Thorsten) at the bottom
should others want to test in their environment(s). Two or more of which
failed still after the updates in early 2017. Third down being one of
these.
These now all run cleanly for me. Ubuntu 18.04 at master (v38) at commit
054e75c.
And 3rd fails still for me going back to v3.71 release branch at commit
9808f53 (Jun 24 2017 with updates).
Christoph, Remember we at some point ran across some information
suggesting windows runs with stack monitoring and splitting/growing the
stack. Perhaps why windows could always run with a smaller stack and why
we never saw such fails on windows. Stack splitting as needed due growth
could be turned on with the gnu compiler - at a performance hit. This
suggested we might want to turn off such splitting in windows for a
performance gain. All two year old looks though, and nothing ever tried
as far as I know, so who knows how things stand today. If just a windows
compiler setting for you though, might be worth just trying a windows
compile with splitting off if we are in good shape thread-stack size
wise.
This leads to me close with being somewhat worried about the suggested
test size of 512KB. We still have some largish stack allocations - the
sturm solver when we increased the max order from 15 to 35(why ?) in
v3.7 now causes a 600KB plus (??? don't recall exactly) allocation on
the stack - at a considerable performance penalty I'll add(1) - no
matter the actual incoming equation order. This alone might cause your
suggested size of 512KB to fail. Well, except perhaps on windows where
presumably splitting still in place.
I did not run master with 512KB as you suggested or a clang compile with
the current larger default. I'm focused elsewhere at the moment and
changes to the more common headers are painfully slow for me to compile
and try.
Bill P.
(1) - My C++ ish attempts to address this have all been really slow or
not worked at all when I try to get fancy/fast with raw memory
allocations and pointers. C and some C++ compilers (IBM's XLC being one)
support the needed dynamic structure array-size allocation mechanism
with high performance. This a reason - among others - why I'm toying
some with taking the common solver code back to straight C. We will
soon, I think? - move to a mixed C++/C mode in picking up FreeType
which, as I remember, is actually a C library - so perhaps not so crazy.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Am 16.02.2019 um 16:17 schrieb William F Pokorny:
> This leads to me close with being somewhat worried about the suggested
> test size of 512KB.
The "target" size of 512 kiB is an invariant - that's the per-thread
stack size we would get on OS X by default.
> We still have some largish stack allocations - the
> sturm solver when we increased the max order from 15 to 35(why ?) in
> v3.7 now causes a 600KB plus (??? don't recall exactly) allocation on
> the stack - at a considerable performance penalty I'll add(1) - no
> matter the actual incoming equation order. This alone might cause your
> suggested size of 512KB to fail. Well, except perhaps on windows where
> presumably splitting still in place.
Where would those 600 kB be allocated?
The only out-of-the-ordinary local variable I see in all of the
polynomial solver code is `sseq` in `polysolve()`, which is 36 elements
of type `polynomial`, which in turn consists of 1 integer and 36
doubles. That's about 10 kB, not 600.
> (1) - My C++ ish attempts to address this have all been really slow or
> not worked at all when I try to get fancy/fast with raw memory
> allocations and pointers.
If the variable is local to a function that is guaranteed to never be
called recursively, the easiest solution to guarantee speedy allocation
and not hogging the stack is to change the variable declaration to
`thread_local`. This is essentially the same as `static`, but with each
thread having its own copy of the variable.
> C and some C++ compilers (IBM's XLC being one)
> support the needed dynamic structure array-size allocation mechanism
> with high performance. This a reason - among others - why I'm toying
> some with taking the common solver code back to straight C. We will
> soon, I think? - move to a mixed C++/C mode in picking up FreeType
> which, as I remember, is actually a C library - so perhaps not so crazy.
Veto to that: All of POV-Ray should be valid C++11 code, so that POV-Ray
can be compiled with only a single C++ compiler. 3rd party libraries are
another matter; we expect those to be present in binary form (except on
Windows, where we expect their source code to be compatible with MSVC's
dialect of C/C++), and the headers of modern C libs are almost
invariably designed as hybrid C/C++ headers, in the sense that they use
pre-processor features to behave as either C or C++ code, depending on
the compiler used.
There is currently only one genuine C source file ihn POV-Ray proper,
namely `povms.c`, and even that one is designed as a C/C++ hybrid and
compiled by including it from a `.cpp` file. (The rationale for this C
file's existence being that it - in theory - allows 3rd party genuine C
programs to drive the POV-Ray back-end via POVMS.)
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Just to clear up some confusion:
On Windows, stack size is _not_ "virtually unlimited" - nor does Windows
support split stacks. Rather, each thread's stack occupies a contiguous
fixed block of address space. The only thing that's dynamic about it is
that only a small portion is initially backed by physical (or even
virtual) memory.
The per-thread stack size is configured at link time, and just happens
to be set to a humongously large size in the POV-Ray project.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Am 16.02.2019 um 16:17 schrieb William F Pokorny:
> I've attached an attractors.zip file to a comment to the github issue:
>
> https://github.com/POV-Ray/povray/issues/239
Thanks - yes, that seems to be the right stuff for testing this.
To reduce parsing and rendering time, I'm testing a slightly modified
version with 100 instead of 1000 iterations, i.e.:
clifford (1.6,1.9, 0.5,-1.4, -0.4,-0.1, -1.6,1.8, 100, 0.0005, 0.00375,
cliff_mat, cliff_trans)
On my Windows machine and the x64 Debug version (which should be a bit
more memory-hungry than the others), I've just compared commit accc779d
(which did away with the `FixedSimpleVector`, back in 2017-11-10) with
the commit immediately before, with the following results:
- Before: Required a POV_THREAD_STACK_SIZE of 2 MiB.
- After: Requires a lousy 16 kiB.
I also ran the last test with 1000 iterations just for giggles, with the
same result: 16 kiB was still sufficient.
I also see no trouble with the Sturmian root solver at 16 kiB (tested
with `objects/torus1.pov` sample scene, with `sturm` added to torus
definitions).
So, bottom line: Judging from this first test, even OS X's 512 kiB
should be enough for everybody now ;)
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 2/16/19 4:39 PM, clipka wrote:
>> We still have some largish stack allocations - the sturm solver when
>> we increased the max order from 15 to 35(why ?) in v3.7 now causes a
>> 600KB plus (??? don't recall exactly) allocation on the stack - at a
>> considerable performance penalty I'll add(1) - no matter the actual
>> incoming equation order. This alone might cause your suggested size of
>> 512KB to fail. Well, except perhaps on windows where presumably
>> splitting still in place.
>
> Where would those 600 kB be allocated?
No clue. :-( My mind not being so good again of late is my bet... When I
look at the comments I've added to polynomialsolver.h they read:
/// @def MAX_ORDER
/// Maximum supported polynomial order.
///
/// @todo
/// This currently carries a large, fixed per polysolve() call
/// memory allocation on the stack. Size is on the order of
/// (MAX_ORDER+1)*int + PRECISE_FLOAT * (MAX_ORDER+1)^2
/// which impacts performance and performance stability especially
/// for threads > cores. Allocation based on current equation order
/// would be better. My C++ attempts at this all badly slower. C itself
/// supports a struct "flexible array member" feature, however this
/// not a C++11 standard though some C++ compilers support it.
///
#define MAX_ORDER 35
which aligns with what you state below so at some point last year I had
it more or less right...
>
> The only out-of-the-ordinary local variable I see in all of the
> polynomial solver code is `sseq` in `polysolve()`, which is 36 elements
> of type `polynomial`, which in turn consists of 1 integer and 36
> doubles. That's about 10 kB, not 600.
>
>> (1) - My C++ ish attempts to address this have all been really slow or
>> not worked at all when I try to get fancy/fast with raw memory
>> allocations and pointers.
I have in my head I tried thread_local... I'll try this again.
However, a substantial part of the performance degrade looked to be
related to impacts to caching behavior - at least on my i3 machine. I've
tested hard coding to known orders. Doing this I see jumps in
performance improvement as the size drops which seem to be related to
how well the allocations fit to some memory alignment.
Said another way, it looks like sseg being 36 doubles + int in size for
each polynomial means multiple equations are not optimally folded into
some memory alignment. Allocating int + 5 doubles for the sseq size when
that's all you need is substantially faster(2-3% as I remember - for
what my memory is worth these days).
And still I wonder about the value of having a max order 35 over 15...
Did someone actually need 35? Accuracy degrades as order increases -
even 15 is pushing it hard in the general case working with floats. If
the >15 order polynomials are very sparse as presented to polysolve() or
your working with some polynomials that happen to be stable - maybe
things work acceptably beyond 15, but I have doubts... All very high
order solvers I've looked at are using exact - or extendable accuracy -
math (ie integers) for coefficient representation and calculation.
Bill P.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Am 17.02.2019 um 14:07 schrieb William F Pokorny:
> /// @def MAX_ORDER
> /// Maximum supported polynomial order.
> ///
> /// @todo
> /// This currently carries a large, fixed per polysolve() call
> /// memory allocation on the stack. Size is on the order of
> /// (MAX_ORDER+1)*int + PRECISE_FLOAT * (MAX_ORDER+1)^2
> /// which impacts performance and performance stability especially
> /// for threads > cores. Allocation based on current equation order
> /// would be better. My C++ attempts at this all badly slower. C itself
> /// supports a struct "flexible array member" feature, however this
> /// not a C++11 standard though some C++ compilers support it.
> ///
> #define MAX_ORDER 35
BTW, that "flexible array member" feature won't help you speed up
things, as it requires dynamic allocation on the heap, which is
super-expensive; whatever you allocate on the stack must necessarily
have a fixed size.
The feature is nothing more than syntactical sugar for accessing the
memory right after the struct as a pointer to a particular type, i.e.
allowing to rewrite code like this:
typedef struct { /* ... */ } MyStructT;
MyStructT* myStructPtr;
...
unsigned char* startOfStruct = (unsigned char*)(myStructPtr);
unsigned char* endOfStruct = startOfStruct + sizeof(*myStructPtr);
T* myArray = (T*)(endOfStruct);
Foo = myArray[i];
as:
typedef struct { /* ... */ T myArray[]; } MyStructT;
...
Foo = myStructPtr->myArray[i];
but it doesn't magically solve the problem of how to allocate the proper
amount of memory for the object.
Also, "threads > cores" is not a sane use case (unless you mean "threads
> physical cores"), as there's virtually no way this might increase
performance by any considerable amount.
> However, a substantial part of the performance degrade looked to be
> related to impacts to caching behavior - at least on my i3 machine. I've
> tested hard coding to known orders. Doing this I see jumps in
> performance improvement as the size drops which seem to be related to
> how well the allocations fit to some memory alignment.
>
> Said another way, it looks like sseg being 36 doubles + int in size for
> each polynomial means multiple equations are not optimally folded into
> some memory alignment. Allocating int + 5 doubles for the sseq size when
> that's all you need is substantially faster(2-3% as I remember - for
> what my memory is worth these days).
I can see two ways how to address these things:
(A) Use more C++ instead of less, turning the solver into a template
with a maximum order parameter.
(B) Refactor the code to use a flat array, and place the data into the
first N*N elements.
Also, constructing `sseq` in such a way that its `polynomial` elements
end up aligned with cache line boundaries might already do the trick.
This might be achieved by declaring `polynomial` to be aligned to
multiples of 64 or even 128 bytes, using e.g. `struct alignas(64)
polynomial { ... };`.
Alternatively, we could reduce MAX_ORDER to the next lower value such
that sizeof(polynomial) ends up a multiple of 64 or 128, and hope that
the compiler is smart enough to align sseq on a cache line boundary.
This means we would want MAX_ORDER+2 to be a multiple of 8 or 16
(presuming sizeof(int)<=sizeof(double) and
alignof(double)==sizeof(double)), which would be satisfied by MAX_ORDER=30.
> And still I wonder about the value of having a max order 35 over 15...
> Did someone actually need 35? Accuracy degrades as order increases -
> even 15 is pushing it hard in the general case working with floats. If
> the >15 order polynomials are very sparse as presented to polysolve() or
> your working with some polynomials that happen to be stable - maybe
> things work acceptably beyond 15, but I have doubts... All very high
> order solvers I've looked at are using exact - or extendable accuracy -
> math (ie integers) for coefficient representation and calculation.
IIRC there was a use case for a `polynomial` primitive that needed an
order of 16, and Jerome decided it would be nonsense to increase it by 1
just to wait for the next use case to require an order of 17; so he went
ahead and increased the limit to something that was not totally
arbitrary. 35 was a natural choice because above this threshold the
`binomials` array of constants would overflow the data type on some
machines (32-bit `unsigned int`).
If we want to use a lower maximum order, 19 would be a logical choice,
as it's the threshold at which we could change the datat type of the
`binomials` array from `unsigned int` to `unsigned short` to save some
memory.
Last not least, don't forget that a performance change of 2% in the root
solver does not translate to a change of 2% in total render time. You
probably have a clearer picture of how much time POV-Ray spends in the
root solver, but remember that there is such a thing as over-optimization.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 2/17/19 2:30 PM, clipka wrote:
> Am 17.02.2019 um 14:07 schrieb William F Pokorny:
>
...
>
> BTW, that "flexible array member" feature won't help you speed up
> things, as it requires dynamic allocation on the heap, which is
> super-expensive; whatever you allocate on the stack must necessarily
> have a fixed size.
OK. Surprised the heap allocation is slower than one on the stack. I
figured the allocation and free mechanisms were more or less the same -
just being a difference of where in memory it happened. Guess not?
Also surprised allocations on the stack have to be of fixed size. Is it
so on the function call just one allocation can happen on the stack for
all the dynamic variables in the function?
>
>
> Also, "threads > cores" is not a sane use case (unless you mean "threads
> > physical cores"), as there's virtually no way this might increase
> performance by any considerable amount.
>
Yes, "threads > physical cores."
>
I'm working with the solver branch today and was able to run the
following without too much effort.
My i3 has two physical cores, 2 threads each. lemonSturm.pov scene long
used for solver work. Below comparing today's MAX_ORDER 35 vs the
necessary MAX_ORDER 4 allocation. Commands being of the form:
/usr/bin/time povray lemonSturm.pov -j +wt3 -cc -fn -d -p
+wt4 (what POV-Ray would use by default)
--------------------------------------------
22.99 -> 21.49 ---> -6.52%
+wt3
----
19.11 -> 18.11 ---> -5.23%
+wt2
----
14.32 -> 13.80 ---> -3.63%
+wt1
----
14.27 -> 13.74 ---> -3.71%
>
> I can see two ways how to address these things:
>
> (A) Use more C++ instead of less, turning the solver into a template
> with a maximum order parameter.
>
> (B) Refactor the code to use a flat array, and place the data into the
> first N*N elements.
>
>
> Also, constructing `sseq` in such a way that its `polynomial` elements
> end up aligned with cache line boundaries might already do the trick.
>
> This might be achieved by declaring `polynomial` to be aligned to
> multiples of 64 or even 128 bytes, using e.g. `struct alignas(64)
> polynomial { ... };`.
>
> Alternatively, we could reduce MAX_ORDER to the next lower value such
> that sizeof(polynomial) ends up a multiple of 64 or 128, and hope that
> the compiler is smart enough to align sseq on a cache line boundary.
> This means we would want MAX_ORDER+2 to be a multiple of 8 or 16
> (presuming sizeof(int)<=sizeof(double) and
> alignof(double)==sizeof(double)), which would be satisfied by MAX_ORDER=30.
>
Thanks for the ideas. I add them to my notes to try at some point.
>
>> And still I wonder about the value of having a max order 35 over 15...
...
>
> IIRC there was a use case for a `polynomial` primitive that needed an
> order of 16, and Jerome decided it would be nonsense to increase it by 1
> just to wait for the next use case to require an order of 17; so he went
> ahead and increased the limit to something that was not totally
> arbitrary. 35 was a natural choice because above this threshold the
> `binomials` array of constants would overflow the data type on some
> machines (32-bit `unsigned int`).
>
> If we want to use a lower maximum order, 19 would be a logical choice,
> as it's the threshold at which we could change the datat type of the
> `binomials` array from `unsigned int` to `unsigned short` to save some
> memory.
>
Thanks for the background. I'll put it on my list to look at what 19
would give us performance wise.
>
> Last not least, don't forget that a performance change of 2% in the root
> solver does not translate to a change of 2% in total render time. You
> probably have a clearer picture of how much time POV-Ray spends in the
> root solver, but remember that there is such a thing as over-optimization.
Understand. My measurements are almost always in total render time.
I admit to chasing performance harder than some might. Comes from my
belief you need to stay after it. Otherwise you get fooled, due degraded
performance everywhere over time, that certain improvements don't matter
enough to do them. A reason I think about that 3-6% being more than it
is as measured is I know there are at least two other substantial
performance issues in the current sturm solver. If those can be
addressed, we'll be talking about a good bit more than 3-6% improvement
for exact-order allocations.
Bill P.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 2/17/19 6:41 PM, William F Pokorny wrote:
>
> OK. Surprised the heap allocation is slower than one on the stack. I
> figured the allocation and free mechanisms were more or less the same -
> just being a difference of where in memory it happened. Guess not?
The compiler figures out how much stack space a function call needs, and
allocating that memory is simply a matter of moving the stack pointer
*once*, a single register operation, upon entry to a function.
Heap requires a separate function call for each object to be allocated.
That function examines its tables trying to find a "best fit" for the
size requested, mark those tables as allocated, and return a pointer.
For the reverse, free'ing tries to merge freed memory [1]. Much, much
more expensive than stack.
[1] Old MS-DOS systems you could kill it by allocating all of memory as
1KB blocks, free all those blocks, then try to allocate a 2KB block. Out
of memory.
>
> Also surprised allocations on the stack have to be of fixed size. Is it
> so on the function call just one allocation can happen on the stack for
> all the dynamic variables in the function?
See above.
--
dik
Rendered 1024 of 921600 pixels (0%)
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Am 18.02.2019 um 00:41 schrieb William F Pokorny:
>> BTW, that "flexible array member" feature won't help you speed up
>> things, as it requires dynamic allocation on the heap, which is
>> super-expensive; whatever you allocate on the stack must necessarily
>> have a fixed size.
>
> OK. Surprised the heap allocation is slower than one on the stack. I
> figured the allocation and free mechanisms were more or less the same -
> just being a difference of where in memory it happened. Guess not?
Definitely not.
On the stack, allocation and release only ever happens at the top of the
stack, and the free area is always a single contiguous block; that's
trivial to keep track of: A single pointer to the current start of the
free region (with most CPUs providing a dedicated register for that: the
Stack Pointer), and some means to detect when you're running past the
end of the memory section you have set aside for the stack (typically
done by setting up a "guard page" at the end of the stack, and crying
bloody murder if that guard page is ever accessed).
At least that's the classic stack implementation; new fancy stuff like
split stacks change the picture, but that's why they're slower.
On the heap, release can happen anywhere anytime, and you may end up
with a gazillion separate blocks of free memory you have to keep track
of, test if you can merge any of them, and search through whenever you
need to allocate memory again to find a region of suitable size. "Heap
fragmentation" being the buzzword there.
> Also surprised allocations on the stack have to be of fixed size. Is it
> so on the function call just one allocation can happen on the stack for
> all the dynamic variables in the function?
It's essentially a matter of the programm language used. If you were to
implement a function in assembler, there is nothing stopping you from
advancing the stack pointer by a computed value (which is all that's
necessary to allocate memory on the stack), provided you later make sure
you revert that change to the stack pointer (and also follow some other
conventions; e.g. some conventions require the stack pointer to be
16-byte aligned when invoking some other function).
As a matter of fact, as I just discovered, C99 _does_ support such a
thing; but it's not "flexible array member" but "variable length
arrays"; and they can't be nested in structs, and C++ did not adopt them.
One reason why C++ did not adopt them is some entirely unrelated
fundamental stumbling block rooted in the C++ type concept. But I guess
they might also mess up - or at least complicate - exception handling;
"stack unwinding" would be the buzzword there.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
|
|