POV-Ray: Newsgroups: povray.beta-test: v3.8+ crackle instability (facets?) with >1 uses per thread.

From: William F Pokorny
Subject: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 8 Jun 2024 15:12:53
Message: <6664ad35@news.povray.org>

I've been playing with 'facets' and 'crackle' of late. I've turned up a 
bug (or two) (*).

Documenting now - partly so I can think through what I'm seeing as I write.

The crackle pattern and facets perturbation maintain thread local 
storage so information can be cached in a thread safe way.

The issue, I think, is that that storage is set up to work with one 
crackle and/or facets use per thread and no more.

Once we run >1 of either in the same thread they share the thread local 
storage. This >1 usage per thread happens, for example, when we layer 
textures both based upon crackle.
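
A minimal sketch of one way a shared per-thread cache could go wrong, assuming the cache is keyed only by cube coordinates. This is not POV-Ray's actual cache code; CrackleUse, computeCubePoint, sharedCache and the other names are hypothetical, and the thread does not establish that this is the exact mechanism behind the artifacts.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical stand-in for one crackle use. 'seed' models any per-use
// setting that changes where the point inside each cube ends up.
struct CrackleUse { std::uint32_t seed; };

using CubeKey = std::tuple<int, int, int>;

// Placeholder for the per-cube work the cache is meant to avoid.
std::uint32_t computeCubePoint(const CrackleUse& use, const CubeKey& k)
{
    std::uint32_t h = use.seed;
    h = h * 31u + static_cast<std::uint32_t>(std::get<0>(k));
    h = h * 31u + static_cast<std::uint32_t>(std::get<1>(k));
    h = h * 31u + static_cast<std::uint32_t>(std::get<2>(k));
    return h;
}

// One cache per thread, keyed by cube only, so every crackle use
// evaluated on this thread shares it.
thread_local std::map<CubeKey, std::uint32_t> sharedCache;

std::uint32_t cachedCubePoint(const CrackleUse& use, const CubeKey& k)
{
    auto it = sharedCache.find(k);
    if (it != sharedCache.end())
        return it->second;   // counted as a hit even if another use stored it
    const std::uint32_t v = computeCubePoint(use, k);
    sharedCache.emplace(k, v);
    return v;
}

// Returns true when the second use is handed data cached by the first.
bool secondUseSeesStaleData()
{
    sharedCache.clear();
    const CrackleUse a{1u}, b{2u};
    const CubeKey cube{0, 0, 0};
    const std::uint32_t fromA = cachedCubePoint(a, cube);
    assert(fromA == computeCubePoint(a, cube));   // first use is served correctly
    const std::uint32_t fromB = cachedCubePoint(b, cube);
    return fromB == fromA && fromB != computeCubePoint(b, cube);
}
```

In this toy model the second use gets the first use's value for the same cube, which is the kind of cross-talk layered crackle textures could trigger.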

See the two attached scene files which result in images like those 
attached when things go wrong. (Using v3.8 beta 2 and the not yet 
released yuqk R15 for the renders)

1) Things don't always go wrong. The problem is flaky. Scenes that 
render with artifacts often later run cleanly and vice versa. The two 
scene files attached are good at having problems.

2) The v38 scene uses 'repeat <>' and the yuqk one 'ip_strength <>' - 
which twiddles with the strength of the noise used to push the point per 
cube around inside the cube for the pseudo random-ish point set. My 
guess at the moment is these features turn over the thread crackle cache 
more often.

3) I've only seen the buggy results when the text output reports less 
than 100% cache hits. For example:

   Crackle Cache Queries:          960000
   Crackle Cache Hits:             888124 ( 93 percent)

4) The problem seems slightly worse - with more of a render block 
signature - when multiple threads are used. This doesn't really line up
with my best guess as to the issue! At the moment I think this is 
probably a secondary bug where maybe the cache is supposed to be cleared 
at render block end, but it isn't, or similar. Might be a secondary 
thread safety issue too.

5) I've not looked at the v3.7 code as yet for these issue(s).

---
I know. The yuqk fork's tearing result looks kinda cool - wish it 
reflected intent... :-)

Bill P.

(*) - There are a couple other minor bugs too in crackle with offset and 
<=0 metric settings patched / fixed in the yuqk fork.


Attachments:
crackle2_v38.pov.txt (2 KB), crackle2_v38_00.jpg (224 KB), crackle2_yuqk.pov.txt (2 KB), crackle2_yuqk2.jpg (315 KB)


From: William F Pokorny
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 9 Jun 2024 00:21:13
Message: <66652db9$1@news.povray.org>

On 6/8/24 15:12, William F Pokorny wrote:
> The issue, I think, is that that storage is set up to work with one 
> crackle and/or facets use per thread and no more.

OK. I ran an experiment where I forced 100% cache misses in yuqk. I then 
re-ran a collection of scenes running multiple crackle patterns per 
thread. Everything looks OK.

This method of disabling the cache is not optimal as the cache set up 
mechanism is forced to run all the time, but the cached data is never 
used. Still, I ran some timings using the crackle2_v38.pov scene, both 
with no AA and with forced (+a0.0) heavy AA.

p380b2 -> yuqk (R15). Cache active. No AA. Shows yuqk 62% faster(a).
p380b2 -> yuqk (R15). Cache active. With AA. Shows yuqk 34% faster(a).

yuqk (with cache) -> yuqk (all misses). No AA. yuqk is 240% slower.
yuqk (with cache) -> yuqk (all misses). With AA. yuqk is 335% slower.

So... Forcing cache misses and getting no cache benefit is very costly. 
Of course, the results are correct, which matters more.

I suppose I need to attempt thread local storage which completely 
replaces the current cache mechanism to see where that performance comes 
in. :-(

Unsure if I'll do that work for R15 though. I might just force the cache 
misses for now. It would leave me a release where crackle is working and 
I've not further twiddled with how the code works.

Ah, and what about facets.

Bill P.

(a) - I think the current yuqk speed up over p380b2 is mostly due to:

https://news.povray.org/povray.beta-test/thread/%3C663eff9d%241%40news.povray.org%3E/

I'm unsure what else it might be if not.


From: William F Pokorny
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 10 Jun 2024 07:51:26
Message: <6666e8be$1@news.povray.org>

On 6/9/24 00:21, William F Pokorny wrote:
> Ah, and what about facets.

FWIW. The caching mechanism is simpler (older) for facets. As with 
crackle I experimented some with forcing 100% misses. The slow down in 
the heavy AA case is +195% as opposed to the +335% seen with crackle. 
The difference likely comes down to the overhead for the simpler facets 
cache being smaller. The facets cache comes close to what I wanted to 
try with the crackle cache.

Going to let ideas rattle around in my head for a while as to what to 
do. ( 1. Limit use to one crackle and one facets use in any given scene. 
2. A cache per crackle/facets use / per thread. 3. ...)
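
Option 2 might look roughly like the sketch below: one cache per pattern use, held in per-thread storage. All names here (PatternUse, pointForCube, cachesByUse, lookupPerUse) are hypothetical, and keying by object address assumes each pattern use stays at a fixed address for the whole render.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical stand-in for one crackle or facets use.
struct PatternUse { std::uint32_t seed; };

using Cube      = std::tuple<int, int, int>;
using CubeCache = std::map<Cube, std::uint32_t>;

// Placeholder for the per-cube work the cache is meant to avoid.
std::uint32_t pointForCube(const PatternUse& use, const Cube& c)
{
    std::uint32_t h = use.seed;
    h = h * 31u + static_cast<std::uint32_t>(std::get<0>(c));
    h = h * 31u + static_cast<std::uint32_t>(std::get<1>(c));
    h = h * 31u + static_cast<std::uint32_t>(std::get<2>(c));
    return h;
}

// Outer map: one cache per pattern use. thread_local: one outer map
// per thread, so no locking is needed.
thread_local std::map<const PatternUse*, CubeCache> cachesByUse;

std::uint32_t lookupPerUse(const PatternUse& use, const Cube& c)
{
    CubeCache& cache = cachesByUse[&use];   // created on first access
    auto it = cache.find(c);
    if (it != cache.end())
        return it->second;
    const std::uint32_t v = pointForCube(use, c);
    cache.emplace(c, v);
    return v;
}

// Two uses querying the same cube each get their own data.
bool perUseCachesStaySeparate()
{
    cachesByUse.clear();
    const PatternUse a{1u}, b{2u};
    const Cube cube{0, 0, 0};
    const std::uint32_t fromA = lookupPerUse(a, cube);
    const std::uint32_t fromB = lookupPerUse(b, cube);
    assert(cachesByUse.size() == 2);   // one cache per use
    return fromA == pointForCube(a, cube) && fromB == pointForCube(b, cube);
}
```

The cost is memory that grows with the number of pattern uses times the number of threads, plus one extra map lookup per query.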

Bill P.


From: Thorsten
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 11 Jun 2024 03:43:56
Message: <6668003c$1@news.povray.org>

On 10.06.2024 13:51, William F Pokorny wrote:
> On 6/9/24 00:21, William F Pokorny wrote:
>> Ah, and what about facets.
> 
> FWIW. The caching mechanism is simpler (older) for facets. As with 
> crackle I experimented some with forcing 100% misses. The slow down in 
> the heavy AA case is +195% as opposed to the +335% seen with crackle. 
> The difference likely comes down to the overhead for the simpler facets 
> cache being smaller. The facets cache comes close to what I wanted to 
> try with the crackle cache.
> 
> Going to let ideas to rattle around in my head for a while as to what to 
> do. ( 1. Limit use to one crackle and one facets use in any given scene. 
> 2. A cache per crackle/facets use / per thread. 3. ...)

Hi Bill,

the other issue to consider is that while there is no user interface for 
it, in theory multiple renders of the same scene can run in parallel. 
The actual solution to the whole problem is to keep the data needed not 
only thread-local but look carefully at what is actually cached and then 
ideally have it block local (also meaning, as with thread-local storage, 
that the pattern changes with render block size) or even better pixel 
local (no change with block size). To avoid the access to thread-local 
storage, the whole rendering actually could be overhauled (which would 
be good anyway) to move from a recursive to a stack based approach. That 
way the needed local data could be (more easily) passed as argument down 
to patterns ... but expect half a year full time to implement something 
like this.
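
A rough sketch of the "pass the local data down as an argument" idea, assuming a block-local context object. BlockContext, SimplePattern, evaluate and renderBlock are hypothetical names, and the arithmetic is only a placeholder for real pattern work; this is not how POV-Ray is structured today.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>

struct SimplePattern { double scale; };   // hypothetical pattern use

// Key: pattern address plus cube coordinates.
using CellKey = std::tuple<std::uintptr_t, int, int, int>;

// Created when a render block starts and discarded when it ends, so
// nothing carries over from one block to the next.
struct BlockContext {
    std::map<CellKey, double> cellCache;
    std::size_t queries = 0;
    std::size_t hits = 0;
};

// The context is an explicit argument instead of thread-local state.
double evaluate(const SimplePattern& pat, int x, int y, int z, BlockContext& ctx)
{
    ++ctx.queries;
    const CellKey key =
        std::make_tuple(reinterpret_cast<std::uintptr_t>(&pat), x, y, z);
    auto it = ctx.cellCache.find(key);
    if (it != ctx.cellCache.end()) {
        ++ctx.hits;
        return it->second;
    }
    const double v = pat.scale * (x + 2 * y + 3 * z);   // placeholder work
    ctx.cellCache.emplace(key, v);
    return v;
}

// Evaluates one block of pixels with its own context.
double renderBlock(const SimplePattern& pat, int width, int height)
{
    BlockContext ctx;   // block local
    double sum = 0.0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            sum += evaluate(pat, x / 4, y / 4, 0, ctx);
    assert(ctx.hits <= ctx.queries);
    return sum;
}
```

Because the context lives and dies with the block, parallel renders of the same scene would not share cached data in this arrangement.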

Thorsten


From: William F Pokorny
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 11 Jun 2024 11:39:12
Message: <66686fa0$1@news.povray.org>

On 6/11/24 03:43, Thorsten wrote:
> Hi Bill,
> 
> the other issue to consider is that while there is no user interface for 
> it, in theory multiple renders of the same scene can run in parallel. 
> The actual solution to the whole problem is to keep the data needed not 
> only thread-local but look carefully at what is actually cached and then 
> ideally have it block local (also meaning, as with thread-local storage, 
> that the pattern changes with render block size) or even better pixel 
> local (no change with block size). To avoid the access to thread-local 
> storage, the whole rendering actually could be overhauled (which would 
> be good anyway) to move from a recursive to a stack based approach. That 
> way the needed local data could be (more easily) passed as argument down 
> to patterns ... but expect half a year full time to implement something 
> like this.
> 
> Thorsten

Hi Thorsten,

Thank you for your thoughts about the situation.

One thing I've not done is think through thread caching across all 
patterns / perturbations / shapes with respect to overlapping in-thread 
storage use. In other words, what other problems like this might be 
sitting in the code today...

On the blocking, you got me thinking one nearer term option with crackle 
and facets might be to track the pattern / perturbation pointers 
themselves alongside the usual cube centers. In cases where we get a 
hit, but the pointers themselves don't match, we'd act like we missed 
and create a new cache entry. Rather than stick that 'overlapping hit' 
entry in the cache, we'd do the distance measures locally and discard 
the entry. Not optimal, more storage for the cache, but it would be 
better than just turning the cache off.
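
A minimal sketch of that pointer-tracking idea, assuming each cache entry records which pattern use created it. TaggedPattern, TaggedEntry, evaluateCube, taggedCache and lookupTagged are hypothetical names, not the crackle cache's real types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

struct TaggedPattern { std::uint32_t seed; };   // hypothetical pattern use

using CubeId = std::tuple<int, int, int>;

struct TaggedEntry {
    const TaggedPattern* owner;   // which pattern use created the entry
    std::uint32_t        value;   // cached per-cube data
};

// Placeholder for the per-cube work the cache is meant to avoid.
std::uint32_t evaluateCube(const TaggedPattern& p, const CubeId& c)
{
    std::uint32_t h = p.seed;
    h = h * 31u + static_cast<std::uint32_t>(std::get<0>(c));
    h = h * 31u + static_cast<std::uint32_t>(std::get<1>(c));
    h = h * 31u + static_cast<std::uint32_t>(std::get<2>(c));
    return h;
}

thread_local std::map<CubeId, TaggedEntry> taggedCache;

std::uint32_t lookupTagged(const TaggedPattern& p, const CubeId& c)
{
    auto it = taggedCache.find(c);
    if (it == taggedCache.end()) {
        const std::uint32_t v = evaluateCube(p, c);   // true miss: store
        taggedCache.emplace(c, TaggedEntry{&p, v});
        return v;
    }
    if (it->second.owner == &p)
        return it->second.value;                      // genuine hit
    // 'Overlapping hit': the cube matches but another pattern owns the
    // entry. Compute locally and leave the cache untouched.
    return evaluateCube(p, c);
}

// The second pattern gets its own data and does not evict the first.
bool overlappingHitIsTreatedAsMiss()
{
    taggedCache.clear();
    const TaggedPattern a{1u}, b{2u};
    const CubeId cube{0, 0, 0};
    lookupTagged(a, cube);
    const std::uint32_t fromB = lookupTagged(b, cube);
    assert(taggedCache.at(cube).owner == &a);   // entry still belongs to a
    return fromB == evaluateCube(b, cube);
}
```

Whichever pattern reaches a cube first keeps the cache slot; later patterns pay the full cost for that cube on every query.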

On storage block or pixel local/thread storage. Better I'd say, but so 
long as the patterns might share the storage, I think it still leaves us 
exposed given how the crackle / facets patterns work today.

Overhauling the rendering approach. Yeah, likely due and good, but not 
at all trivial as you say. I'm not myself sure how such a restructuring 
should look in total.

With the solver work I did now 5-6 years ago, I came to the conclusion a 
fused shape/solver approach would be far better given we are 
ray-tracing. See:

https://news.povray.org/povray.programming/thread/%3C5d0f64ff%241%40news.povray.org%3E/

When I think about really implementing that approach, I also start to 
think about how a different approach to parallelism than our block based 
approach could be good. One where we spin up the combined 
shape/solver(s) as processes to which we'd send batches of rays at a 
time and get back batches of intersections... Yeah, I'm practically 
dreaming, but pretty sure that sort of set up would be best for the 
merged uni-variate, polynomial solver/shape approach. How well that 
structure would work overall - I'm not at all sure. :-)

As a practical near term solution, one thing I want to try is similar to 
what I did with the four ripple/wave value-pattern/normal-perturbation 
re-writes. I dumped already calculated locations for ones always 
calculated on the fly. At the default source location count of 10, the 
hit for not storing the locations was 20% give or take - IIRC.

If I can figure out a way to re-write the crackle and facets at that 
sort of performance hit, I'll probably just dump all the caching / 
thread local storage in total for local stack based storage.
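
To illustrate what "no caching, local stack storage only" could mean, here is a sketch that recomputes every cube's point on the fly. It is a generic one-point-per-unit-cube nearest-distance search, not POV-Ray's crackle or facets code; hashCube, its constants, cubePoint and nearestPointDistance are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { double x, y, z; };

// Hypothetical integer hash standing in for whatever noise places the
// point inside each cube. The constants are arbitrary odd numbers.
inline std::uint32_t hashCube(int x, int y, int z, std::uint32_t seed)
{
    std::uint32_t h = seed;
    h ^= static_cast<std::uint32_t>(x) * 0x8da6b343u;
    h ^= static_cast<std::uint32_t>(y) * 0xd8163841u;
    h ^= static_cast<std::uint32_t>(z) * 0xcb1ab31fu;
    h ^= h >> 15;  h *= 0x2c1b3c6du;
    h ^= h >> 12;  h *= 0x297a2d39u;
    h ^= h >> 15;
    return h;
}

// Point for one unit cube, recomputed on every call instead of cached.
inline Vec3 cubePoint(int x, int y, int z, std::uint32_t seed)
{
    const double toUnit = 1.0 / 4294967296.0;   // 32-bit value to [0,1)
    return Vec3{ x + hashCube(x, y, z, seed) * toUnit,
                 y + hashCube(x, y, z, seed ^ 0x9e3779b9u) * toUnit,
                 z + hashCube(x, y, z, seed ^ 0x7f4a7c15u) * toUnit };
}

// Distance from p to the nearest cube point. The 5x5x5 scan is enough
// here: the point in p's own cube is at most sqrt(3) away, while any
// cube three or more steps away is more than 2 units distant.
double nearestPointDistance(const Vec3& p, std::uint32_t seed)
{
    const int cx = static_cast<int>(std::floor(p.x));
    const int cy = static_cast<int>(std::floor(p.y));
    const int cz = static_cast<int>(std::floor(p.z));
    double best = 1e300;   // squared distance; all state is on the stack
    for (int dz = -2; dz <= 2; ++dz)
        for (int dy = -2; dy <= 2; ++dy)
            for (int dx = -2; dx <= 2; ++dx) {
                const Vec3 q = cubePoint(cx + dx, cy + dy, cz + dz, seed);
                const double ex = q.x - p.x, ey = q.y - p.y, ez = q.z - p.z;
                const double d2 = ex * ex + ey * ey + ez * ez;
                if (d2 < best) best = d2;
            }
    assert(best <= 3.0);   // the own-cube point bounds the result
    return std::sqrt(best);
}
```

Nothing is shared between calls, so multiple pattern uses and threads cannot interfere; the price is 125 point evaluations per query in this naive form.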

Whether I can accomplish such a re-write - at a performance hit not too 
bad - is an open question at the moment. Not least because it's a chunk 
of work which might well not work out as a solution in the end - so I'm 
procrastinating.

Bill P.


From: Thorsten
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 11 Jun 2024 14:17:07
Message: <666894a3$1@news.povray.org>

On 11.06.2024 17:39, William F Pokorny wrote:
> When I think about really implementing that approach, I also start to 
> think about how a different approach to parallelism than our block based 
> approach could be good. One where we spin up the combined 
> shape/solver(s) as processes to which we'd send batches of rays at a 
> time and get back batches of intersections... Yeah, I'm practically 
> dreaming, but pretty sure that sort of set up would be best for the 
> merged uni-variate, polynomial solver/shape approach. How it well that 
> structure would work overall - I'm not at all sure. 😄

Well, yes, an intersection based approach would probably offer the most 
potential performance on a shared memory system. You would also end up 
with a stack-based approach automatically that way. However, you hit 
sort of a wall once you get to the really big multi-die systems like 
Epyc and newer Xeons because they only share the last level cache with 
all cores. So in the end the best performance probably hides somewhere 
in a hybrid of the two with blocks still offering some benefit for large 
multi-core systems. That is, of course, assuming the ray order doesn't 
disrupt first and second level caches too much. It is impossible to 
predict the complexity with modern CPU, I think.

The benefit would be that the "texturing" would become a completely 
separate task, and could actually be done (sans reflection and 
refraction) after tracing, which, if nothing else, would lead to a cool 
looking render preview. The other effect would be that at least bounding 
optimisations and mesh intersection testing could be done on a GPU.

Yet another benefit you get from separating the tracing and the 
texturing is that you end up with a sort of frame buffer that contains 
object data. An idea I never pursued to the end 20 or so years ago was 
that this gives rise to the ability to edit a ray-traced scene on the 
fly because you have access to the objects making up an individual pixel 
and can separate objects in and out of the scene as long as the camera 
doesn't move.

Thorsten


From: William F Pokorny
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 13 Jun 2024 19:26:02
Message: <666b800a$1@news.povray.org>

On 6/11/24 14:17, Thorsten wrote:
> It is impossible to predict the complexity with modern CPU, I think.

I agree. Today's hardware optimizations make performance tuning a tough 
trick - and make questionable a number of "rules of thumb" about which 
algorithms perform best.

> 
> The benefit would be that the "texturing" would become a completely 
> separate task, and could actually be done (sans reflection and 
> refraction) after tracing, which, if nothing else, would lead to a cool 
> looking render preview. The other effect would be that at least bounding 
> optimisations and mesh intersection testing could be done on a GPU.
> 
> Yet another benefit you get from separating the tracing and the 
> texturing is that you end up with a sort of frame buffer that contains 
> object data. An idea I never pursued to the end 20 or so years ago was 
> that this gives rise to the ability to edit a ray-traced scene on the 
> fly because you have access to the objects making up an individual pixel 
> and can separate objects in and out of the scene as long as the camera 
> doesn't move.

Cool ideas. :-) I can see how some parts might work, but far from all of 
it.

Our ray tracing and texturing are today tangled in places (adc bailout, 
filtering/transparency, media, object modifiers). There is also the 
question of how to handle anti-aliasing (AA) / camera focal blur.

Though our 'AA' approach today is expensive(a), it's a strength with 
respect to 'true result' that each sample ray considers the scene - 
including texturing - alongside all the ray tracing / branching in total.

Bill P.

(a) - With respect to performance, on my 'try it someday' list are 
cheaper AA / focal blur modes where the rays beyond some 'sampling 
depth/count' would terminate at a much shallower max_trace_level/sample 
count(*). Or maybe we gradually reduce the trace depth in opposition to 
the AA/blur sampling 'depth'. Results would be less true, but I 
'suspect' they'd often look good as a rule. (There is a tradeoff buried 
in the idea, as the less accurate results due to the shallower ray trace 
depth would sometimes themselves trigger additional sampling - and 
sometimes not, where we would otherwise have shot more rays.)

(*) - Yes! I made trying the idea harder by implementing the forced min 
sampling AA in yuqk.


From: William F Pokorny
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 14 Jun 2024 08:27:14
Message: <666c3722$1@news.povray.org>

On 6/8/24 15:12, William F Pokorny wrote:
> The crackle pattern and facets perturbation maintain thread local 
> storage so information can be cached in a thread safe way.

Note too:

https://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values

---
In working to clean up and commit my last updates, I ran across a TODO 
comment I'd added to friend std::size_t hash_value() in cracklecache.h 
about the initial seed value of 0 - which bothers me. I did a quick 
search this morning to look for rumblings about boost::hash_combine().

Other issues aside. It might be our crackle caching mechanism is less 
effective than it could be.
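
For context, a sketch of the combine step in the form commonly quoted in the linked discussion, showing where an initial seed of 0 enters. This is not the code from cracklecache.h; hashCombine, CubeCoord, hashCubeCoord and firstStepFromZero are hypothetical names, and actual Boost releases may use a different mixing step.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Combine step in the widely quoted Boost style.
inline void hashCombine(std::size_t& seed, std::size_t value)
{
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical cache key: integer cube coordinates.
struct CubeCoord { int x, y, z; };

// Hash of a cube coordinate, starting from the given seed (0 by default).
std::size_t hashCubeCoord(const CubeCoord& c, std::size_t seed = 0)
{
    hashCombine(seed, std::hash<int>{}(c.x));
    hashCombine(seed, std::hash<int>{}(c.y));
    hashCombine(seed, std::hash<int>{}(c.z));
    return seed;
}

// With a starting seed of 0 both shift terms vanish, so the first
// step reduces to value + constant.
std::size_t firstStepFromZero(std::size_t value)
{
    std::size_t seed = 0;
    hashCombine(seed, value);
    assert(seed == value + 0x9e3779b9);
    return seed;
}
```

How well such a hash spreads small integer coordinates depends on the bucket count and on the standard library's std::hash<int>, so measuring collisions on real scenes would be the way to judge cache effectiveness.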

Bill P.


From: Bald Eagle
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 14 Jun 2024 09:50:00
Message: <web.666c4a5b18b34e675a6710c25979125@news.povray.org>

Minimally hijacking this thread to just post an FYI which may be helpful in some
of your source-code optimizing work.

https://iquilezles.org/articles/noacos/

I haven't looked under the hood in a while to see how we're handling stuff like
this, but it seems like it could provide some performance increases, and for the
basis for some macros / include files.

- BW


From: William F Pokorny
Subject: Re: v3.8+ crackle instability (facets?) with >1 uses per thread.
Date: 15 Jun 2024 08:43:08
Message: <666d8c5c$1@news.povray.org>

On 6/14/24 09:49, Bald Eagle wrote:
> Minimally hijacking this thread to just post an FYI which may be helpful in some
> of your source-code optimizing work.
> 
> https://iquilezles.org/articles/noacos/

Thanks. Been some years, but I read that article at some point in the 
past! Good to be reminded of it.

Bill P.
