  Re: Radiosity: status & SMP idea  
From: clipka
Date: 21 Dec 2008 20:20:00
Message: <web.494eeb25b480f792cc645f00@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> The solution here is either to only pretrace (single threaded then), order
> sample insertion (only feasible in a multithreaded pretrace), or come up
> with some other division of the tree avoiding insertion data race conditions.

If 100% reproducibility of the output is a requirement, then other divisions of
the tree will not help either.

Presuming that having more samples than actually required is not an issue, how
about this one:

On each pretrace pass (and likewise in the main render), subdivide the image
into tiles of a standard size. Have each of these tiles processed separately,
with its own copy of the radiosity sample tree (or, rather, with read access to
the already existing tree, plus a private tree to store new samples in). At the
end of each pretrace pass, have a single task merge this forest of trees into
one new tree.

If having too many samples were an issue, the same approach could still work,
provided that the merging task checked the samples for redundancy again.
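
To illustrate, here is a minimal C++ sketch of what such a merge task might
look like. Sample, SampleTree and the distance-based redundancy test are
made-up placeholders for illustration only, not the actual POV-Ray octree code:

    // Minimal sketch of the per-tile "forest merge" idea. Sample,
    // SampleTree and the redundancy test are hypothetical stand-ins,
    // not actual POV-Ray classes.
    #include <cmath>
    #include <vector>

    struct Sample { double x, y, z; /* plus normal, irradiance, ... */ };

    class SampleTree {
    public:
        void insert(const Sample& s) { samples_.push_back(s); }

        // Crude stand-in for "is there already a sample close enough?"
        bool isRedundant(const Sample& s, double minDist = 0.01) const {
            for (const Sample& t : samples_) {
                double dx = s.x - t.x, dy = s.y - t.y, dz = s.z - t.z;
                if (std::sqrt(dx*dx + dy*dy + dz*dz) < minDist)
                    return true;
            }
            return false;
        }

        const std::vector<Sample>& samples() const { return samples_; }

    private:
        std::vector<Sample> samples_;
    };

    // After a pretrace pass, a single task folds the per-tile trees
    // into the shared tree; no locking is needed because nothing else
    // runs concurrently with the merge.
    void mergeForest(SampleTree& shared, const std::vector<SampleTree>& perTile)
    {
        for (const SampleTree& tile : perTile)
            for (const Sample& s : tile.samples())
                if (!shared.isRedundant(s))  // drop duplicates on merge
                    shared.insert(s);
    }

Since the merge itself runs single-threaded and visits the tiles in a fixed
order, its result would not depend on thread scheduling, which should also
help with the reproducibility concern above.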


I'm also thinking about totally different approaches, although I can't see yet
how they could be put to good use. One basic idea would be to have a single
main pretracer thread that would find out what samples are needed, and have
other threads actually collect them.

One way this could be done would be as follows:

- A single thread takes care of the pretrace, doing all the things needed -
except for actually taking samples.

- Instead, whenever it decides that a new sample is needed, it inserts a sample
block into the tree, marks it as "incomplete", and enqueues it in a special
queue. It then substitutes a placeholder value, and goes on minding its own
business. The picture will look crazy, but it's a pretrace after all, so no
need for beauty.

- Some other task polls the queue, and does the actual sample-collecting,
shooting rays as usual. It will probably need samples from a deeper "radiosity
bounce" level, too - in that case it will not collect those itself either, but
instead just build skeleton sample blocks and enqueue them for yet another
thread to work on; of course its own sample is incomplete now, so it marks it
accordingly (or, rather, leaves it marked that way). In addition, it will
remember which of its own samples it is currently working on and for which
direction it needed the new sample (maybe by adding this info to some linked
list in the new sample block skeleton), and then go on as if everything were
perfectly fine (it will even trace the current ray to full completion, just in
case it needs to "order" more samples). A sketch of this bookkeeping follows
below.

As the lower-level thread performs its sampling, it will fill in the sample
block it received to work on, and when it is finished, it will mark the block
as "completed". It will also insert the block into a second queue.

A thread that has "ordered" sample blocks will check that second queue
periodically (e.g. whenever it has finished work on a sample) and re-visit its
own "sample jobs" for which it originally ordered the deeper-bounce samples,
re-tracing the incomplete ray(s) once it has all the necessary samples for a
particular ray. If it finds that it has now completed all rays successfully,
it will report the sample as finished to its own "boss". The sketch below ties
the two queues together.
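
Again purely as an assumption of how this could be structured, the main loop
of a level-n worker might look like this; traceSample() and retracePending()
are placeholder stubs standing in for the actual gathering code:

    // Placeholder stubs standing in for the real gathering code:
    bool traceSample(SampleBlock*, std::vector<BlockQueue>&, int) { return true; }
    void retracePending(int, std::vector<BlockQueue>&) {}

    // Hypothetical main loop of the level-n worker. requestQueues[n]
    // holds blocks ordered from this level; completedQueues[n] holds
    // blocks this level has finished, for the level n-1 "boss" to poll.
    void workerLoop(int level,
                    std::vector<BlockQueue>& requestQueues,
                    std::vector<BlockQueue>& completedQueues)
    {
        for (;;) {
            // Pick up the next sample block ordered from this level.
            SampleBlock* block = requestQueues[level].pop();

            // Gather it. If deeper-bounce samples are missing, the real
            // traceSample() would enqueue skeleton blocks into
            // requestQueues[level + 1], record the affected rays in
            // their orderedBy lists, and return false, leaving this
            // block incomplete for now.
            if (traceSample(block, requestQueues, level)) {
                block->status = BlockStatus::Completed;
                completedQueues[level].push(block);  // report to the "boss"
            }

            // Between jobs, poll completedQueues[level + 1], re-trace
            // the rays that were waiting on the delivered samples, and
            // report any blocks repaired this way upward as well.
            retracePending(level, completedQueues);
        }
    }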


So we would have the threads running not so much in parallel as "layered",
with one thread per radiosity "bounce" depth.

In an environment with insufficient threads to fully realize this approach,
some threads would compute two or even more bounces, calculating the
intermediate-bounce samples themselves as they become needed.
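
One simple way to fold several bounce depths onto fewer workers might be a
contiguous assignment like the following; this is just an assumption for
illustration, not anything in the actual code:

    // Hypothetical mapping of bounce depths onto a smaller worker
    // pool: consecutive depths land on the same worker, which then
    // computes the intermediate bounces itself as they become needed.
    // E.g. 4 bounce depths on 2 workers: depths 0,1 -> worker 0,
    // depths 2,3 -> worker 1.
    int workerForDepth(int depth, int bounceDepths, int numWorkers)
    {
        return depth * numWorkers / bounceDepths;
    }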


A crucial question would be, of course, whether this approach would in fact
speed up anything. Obviously, we may trace some rays twice with this setup:
once to find that we need certain samples, and then a second time when they
have actually arrived.

If we look at a 1-bounce scenario, this is a non-issue: The top level tracer
does not need to re-do the pretrace picture. This would cost us the ability to
use the pretrace as a preview, but that's it. So we would have some overhead
here, but no complicated math.

The same goes for any multi-bounce scenario on a dual-core system (as I guess
we would not want to have more than one task running per core): however the
bounce levels are distributed among the two tasks, the "bottom-level" task
would never have to wait for deeper-bounce samples to be computed, while the
"top-level" task would not have to re-render.


In a 2-bounce triple-core scenario, the worst thing that could happen would be
that every sample of the first bounce needs a totally different set of
second-bounce samples. But in this case we'd be screwed anyway (unless we have
a very low sample count), as it would mean that for some reason the 2nd-level
bounce requires "count" times as many samples as the 1st-level bounce.

In reality, I'd expect a first-bounce sample to require a roughly random subset
of the second-bounce samples. So as the 1st-bounce worker traverses its sample
jobs, more and more results will come in from the 2nd-bounce worker (which in a
2-bounce scenario does not need any further samples, so it will deliver results
in sequential order), and therefore more and more rays will trace fine without
requiring a re-trace.

So if, on average, about half of the rays have to be traced a second time, this
can be expected to result in roughly a 1.5-fold workload for the 1st-bounce
worker (and maybe something like a 1.25-fold workload in total?), with the
benefit of having three or more tasks working on it simultaneously.


In a 3-bounce scenario, things probably get a good deal messier. If, for
example, we had 1000 2nd-level-bounce samples and each of them required all
1000 3rd-level-bounce samples, all the 1st-level-bounce samples would be
rendered twice.

So the workload would be about 1.5-fold again for the 2nd-bounce worker, but
probably close to double for the 1st-bounce worker.

So it seems that with this approach, going beyond three simultaneous tasks
during pretrace might not be worth the pain after all. On the other hand, with
enough spare computing power and a particularly deep bounce level it would
still reduce the total running time.


Anyway, I'll try to set up a simulation to get an estimate of how this
approach might scale.
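
Something along these lines, perhaps - a toy Monte-Carlo model of the 2-bounce
scenario above, with completely made-up parameters, just to gauge how often
rays would need a re-trace:

    // Toy model: the 1st-bounce worker walks through its sample jobs
    // while the 2nd-bounce worker delivers samples in sequential order
    // at some relative speed. A job must be re-traced if any of the
    // deeper-bounce samples it needs has not yet been delivered.
    // All parameters are invented; this is only a scaling estimate.
    #include <cstdio>
    #include <random>

    int main()
    {
        const int    jobs          = 1000; // 1st-bounce sample jobs
        const int    deepSamples   = 1000; // 2nd-bounce samples overall
        const int    neededPerJob  = 20;   // random subset each job needs
        const double relativeSpeed = 1.0;  // deep samples delivered per job

        std::mt19937 rng(42);
        std::uniform_int_distribution<int> pick(0, deepSamples - 1);

        int retraces = 0;
        for (int j = 0; j < jobs; ++j) {
            // Samples delivered so far by the 2nd-bounce worker.
            const int delivered = static_cast<int>(j * relativeSpeed);
            for (int k = 0; k < neededPerJob; ++k) {
                if (pick(rng) >= delivered) { // sample not yet available
                    ++retraces;               // job needs a second pass
                    break;
                }
            }
        }

        std::printf("re-traced %d of %d jobs -> %.2f-fold workload\n",
                    retraces, jobs, 1.0 + double(retraces) / jobs);
        return 0;
    }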

