Subject: Re: Fwd: Stack Size Testers Wanted
From: William F Pokorny
Date: 17 Feb 2019 18:41:58
Message: <5c69f146$1@news.povray.org>
On 2/17/19 2:30 PM, clipka wrote:
> On 17.02.2019 at 14:07, William F Pokorny wrote:
> 
...
> 
> BTW, that "flexible array member" feature won't help you speed up 
> things, as it requires dynamic allocation on the heap, which is 
> super-expensive; whatever you allocate on the stack must necessarily 
> have a fixed size.

OK. I'm surprised a heap allocation is slower than one on the stack. I 
figured the allocation and free mechanisms were more or less the same - 
the only difference being where in memory it happened. Guess not?

I'm also surprised allocations on the stack have to be of fixed size. 
Is that so that, on a function call, a single stack-pointer adjustment 
can reserve space for all the local variables in the function?
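
To check my mental model, a toy sketch (my own example, not solver 
code): a fixed-size local array is just part of the stack frame the 
compiler reserves with a single stack-pointer adjustment in the 
function prologue, while a runtime-sized buffer has to go through the 
heap allocator and be freed again.

#include <vector>

constexpr int MAX_ORDER = 35;

double sumCoefStack(int order, const double* coef)
{
    // Size known at compile time: this array is part of the function's
    // stack frame, reserved by one stack-pointer bump on entry.
    double work[MAX_ORDER + 1];
    double s = 0.0;
    for (int i = 0; i <= order; ++i)
        s += (work[i] = coef[i]);
    return s;
}

double sumCoefHeap(int order, const double* coef)
{
    // Size known only at run time: forces a trip through the heap
    // allocator (and a matching free), far more work than the bump.
    std::vector<double> work(order + 1);
    double s = 0.0;
    for (int i = 0; i <= order; ++i)
        s += (work[i] = coef[i]);
    return s;
}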

> 
> 
> Also, "threads > cores" is not a sane use case (unless you mean 
> "threads > physical cores"), as there's virtually no way this might 
> increase performance by any considerable amount.
> 

Yes, "threads > physical cores."

> 

I'm working with the solver branch today and was able to run the 
following without too much effort.

My i3 has two physical cores with 2 threads each. The lemonSturm.pov 
scene has long been used for solver work. Below I compare today's 
MAX_ORDER 35 allocation against the MAX_ORDER 4 allocation this scene 
actually needs. The commands were of the form:

/usr/bin/time povray lemonSturm.pov -j +wt3 -cc -fn -d -p

+wt4 (what POV-Ray would use by default)
--------------------------------------------
22.99 -> 21.49 ---> -6.52%

+wt3
----
19.11 -> 18.11 ---> -5.23%

+wt2
----
14.32 -> 13.80 ---> -3.63%

+wt1
----
14.27 -> 13.74 ---> -3.71%
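
(The percentages are just (new - old) / old; e.g. 
(21.49 - 22.99) / 22.99 = -6.52%.)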


> 
> I can see two ways how to address these things:
> 
> (A) Use more C++ instead of less, turning the solver into a template 
> with a maximum order parameter.
> 
> (B) Refactor the code to use a flat array, and place the data into the 
> first N*N elements.
> 
> 
> Also, constructing `sseq` in such a way that its `polynomial` elements 
> end up aligned with cache line boundaries might already do the trick.
> 
> This might be achieved by declaring `polynomial` to be aligned to 
> multiples of 64 or even 128 bytes, using e.g. `struct alignas(64) 
> polynomial { ... };`.
> 
> Alternatively, we could reduce MAX_ORDER to the next lower value such 
> that sizeof(polynomial) ends up a multiple of 64 or 128, and hope that 
> the compiler is smart enough to align sseq on a cache line boundary. 
> This means we would want MAX_ORDER+2 to be a multiple of 8 or 16 
> (presuming sizeof(int)<=sizeof(double) and 
> alignof(double)==sizeof(double)), which would be satisfied by MAX_ORDER=30.
> 

Thanks for the ideas. I'll add them to my notes to try at some point.
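
For those notes, here is roughly how I read option (A) combined with 
the alignas() suggestion; the names here are mine, not the actual 
polynomialsolver code:

// Sketch only: the solver becomes a template over the maximum order,
// so each caller gets an exact-size, cache-line-aligned stack
// allocation instead of today's fixed MAX_ORDER 35 layout.
template<int MaxOrder>
struct alignas(64) polynomial
{
    int    ord;
    double coef[MaxOrder + 1];
};

template<int MaxOrder>
int polysolve(const polynomial<MaxOrder>& eqn, double* roots)
{
    // Sturm sequence sized exactly for the order the caller needs; a
    // lemon (order 4) would instantiate polysolve<4> and no longer pay
    // for the MAX_ORDER 35 layout.
    polynomial<MaxOrder> sseq[MaxOrder + 1];
    sseq[0] = eqn;  // first sequence entry is the polynomial itself
    // ... building the rest of the sequence and bisecting for roots
    // is elided in this sketch ...
    (void)roots;
    return 0;       // would return the number of roots found
}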

> 
>> And still I wonder about the value of having a max order 35 over 15... 
...
> 
> IIRC there was a use case for a `polynomial` primitive that needed an 
> order of 16, and Jerome decided it would be nonsense to increase it by 1 
> just to wait for the next use case to require an order of 17; so he went 
> ahead and increased the limit to something that was not totally 
> arbitrary. 35 was a natural choice because above this threshold the 
> `binomials` array of constants would overflow the data type on some 
> machines (32-bit `unsigned int`).
> 
> If we want to use a lower maximum order, 19 would be a logical choice, 
> as it's the threshold at which we could change the data type of the 
> `binomials` array from `unsigned int` to `unsigned short` to save some 
> memory.
> 

Thanks for the background. I'll put it on my list to look at what a 
MAX_ORDER of 19 would give us performance-wise.
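
As a sanity check on those thresholds, a little compile-time test 
(assuming - my guess - that the binomials table needs C(n,k) for n up 
to MAX_ORDER-1):

#include <cstdint>

// Multiplicative formula; each intermediate division is exact.
constexpr std::uint64_t binom(unsigned n, unsigned k)
{
    std::uint64_t r = 1;
    for (unsigned i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

// MAX_ORDER 35 -> n up to 34 still fits a 32-bit unsigned int.
static_assert(binom(34, 17) <= UINT32_MAX, "fits unsigned int");
static_assert(binom(35, 17) >  UINT32_MAX, "MAX_ORDER 36 would overflow");
// MAX_ORDER 19 -> n up to 18 would fit an unsigned short.
static_assert(binom(18, 9)  <= UINT16_MAX, "fits unsigned short");
static_assert(binom(19, 9)  >  UINT16_MAX, "MAX_ORDER 20 would overflow");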

> 
> Last but not least, don't forget that a performance change of 2% in the root 
> solver does not translate to a change of 2% in total render time. You 
> probably have a clearer picture of how much time POV-Ray spends in the 
> root solver, but remember that there is such a thing as over-optimization.

Understood. My measurements are almost always of total render time.

I admit to chasing performance harder than some might. It comes from my 
belief that you need to stay after it; otherwise, as performance 
degrades everywhere over time, you get fooled into thinking certain 
improvements don't matter enough to make. One reason I consider that 
3-6% to be worth more than the measurement suggests is that I know of 
at least two other substantial performance issues in the current sturm 
solver. If those can be addressed, we'll be talking about a good bit 
more than a 3-6% improvement for exact-order allocations.

Bill P.


