POV-Ray : Newsgroups : povray.beta-test : Radiosity: status & SMP idea
  Radiosity: status & SMP idea (Message 1 to 10 of 74)
From: clipka
Subject: Radiosity: status & SMP idea
Date: 21 Dec 2008 14:50:00
Message: <web.494e9dd4478ce2e4cc645f00@news.povray.org>
(A)

For all of you out there waiting for radiosity to get fixed, a short status
report:

Let me put it this way - let's be happy that radiosity does *something* in the
beta at all :) There's a hell of a lot of work to do, as the code has been
"amputated" in many places, obviously to get the other code to compile at all.
So it's basically a re-integration job.

(B)

I think I have an idea how to use SMP to speed up radiosity pretrace as well,
without too much modification to the existing architecture.

Obviously, the problem with thread synchronization during radiosity pretrace is
that the sample octree needs to be accessed from multiple threads. The worst
case scenario is that multiple threads might find it necessary to build a new
root at the same time.

However, tasks traversing the tree in search of samples while another task is
actually updating the tree should not be a problem, provided tree modification
is performed carefully.

Competing writes to the same sample tree node can be avoided if, instead of
using a single linked list of sample blocks, each node contains as many such
lists as there are threads. The overhead for retrieving blocks should be
insignificant.

Competing writes to the tree hierarchy can be avoided - without all-too-frequent
messing with locks - as follows:

- Each thread keeps a small private list of samples it wasn't able to hook in
yet, because the corresponding nodes didn't exist yet and fast tree
modifications would have been too unsafe.

- Whenever a thread's private sample list reaches a certain size, it attempts to
lock the sample tree; however, if it doesn't get the lock immediately it just
continues to do its job, and retries when its private list of samples has
increased by another certain amount.

- If the thread does acquire a lock on the tree, it hooks in the samples it has
accumulated by now, then releases the lock and resumes tracing.

This way the threads will not block if they need to do a modification.
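The batched try-lock scheme above can be sketched roughly as follows. This is a
minimal illustration, not POV-Ray code: `Worker`, `kFlushThreshold` and the flat
`sampleTree` vector are all made up for the example, with the real sample octree
stood in for by a plain vector.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

std::mutex treeMutex;                  // guards the shared sample "tree"
std::vector<int> sampleTree;           // stand-in for the real sample octree
const std::size_t kFlushThreshold = 8; // flush-attempt size (arbitrary)

struct Worker {
    std::vector<int> pending;          // private list of not-yet-hooked-in samples

    void addSample(int s) {
        pending.push_back(s);
        if (pending.size() >= kFlushThreshold)
            tryFlush();                // attempt a flush, but never block
    }

    // Non-blocking flush: hooks samples in only if the lock is free right now.
    bool tryFlush() {
        std::unique_lock<std::mutex> lock(treeMutex, std::try_to_lock);
        if (!lock.owns_lock())
            return false;              // lock busy: keep tracing, retry later
        sampleTree.insert(sampleTree.end(), pending.begin(), pending.end());
        pending.clear();
        return true;
    }

    // Blocking flush for end-of-pass cleanup.
    void finalFlush() {
        std::lock_guard<std::mutex> lock(treeMutex);
        sampleTree.insert(sampleTree.end(), pending.begin(), pending.end());
        pending.clear();
    }
};
```

No thread ever waits on `treeMutex` during tracing; at worst a flush is deferred
until the private list has grown some more, which is exactly the trade-off
described above.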


A few more considerations regarding this idea:

- Deadlocks are not possible, because if threads can't acquire the lock they
will just do something else.

- Livelocks are unlikely, because the time each thread takes to accumulate a
certain number of samples for not-yet-existing nodes will have significant
jitter.

- Starvation of a single thread is unlikely for similar reasons; plus, it can be
further reduced by having a task try for a lock more frequently the longer its
list of un-hooked samples gets.

- The list of un-hooked samples could include a reference to the node below
which a new sub-tree is needed, to speed up the actual tree modification.

- If a thread cannot acquire a lock, it can still check its list of un-hooked
samples to see if another thread has already built the required nodes by now.

- Separate locks could be used for "root growth" and "leaf growth", as building
a new root will not interfere with the rest of the tree.



From: Thorsten Froehlich
Subject: Re: Radiosity: status & SMP idea
Date: 21 Dec 2008 15:47:52
Message: <494eab78$1@news.povray.org>
clipka wrote:
> Competing writes to the tree hierarchy can be avoided - without all-too-frequent
> messing with locks - as follows:

Please note that finding a locking algorithm is not the problem. The issue 
at hand is significantly more complex: The heuristic deciding to reuse a 
sample or take a new sample simply breaks with multiple threads because it 
depends on the order threads retrieve samples and decide to add samples. 
This order in turn will lead to similar, but yet different trees each time 
the same scene is rendered, because thread execution order is never 
predictable, let alone controllable. As such, the image will look different 
unless very high sample qualities are used.

The solution here is either to only pretrace (single threaded then), order 
sample insertion (only feasible in a multithreaded pretrace), or come up 
with some other division of the tree avoiding insertion data race conditions.

	Thorsten



From: Warp
Subject: Re: Radiosity: status & SMP idea
Date: 21 Dec 2008 16:37:19
Message: <494eb70f@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> The solution here is either to only pretrace (single threaded then), order 
> sample insertion (only feasible in a multithreaded pretrace), or come up 
> with some other division of the tree avoiding insertion data race conditions.

  Also note that since the radiosity data tree is probably accessed millions
of times per second (for both reading and writing), mutexing each access would
be prohibitively slow, as mutexes (especially those provided by Boost) are
very inefficient. Mutexes are only feasible for functions which are called
relatively infrequently.

  (If a function would otherwise be very fast, e.g. take just a few dozen
clock cycles, adding a mutex lock to it can make it hundreds of times
slower at worst. If the function in question is called very frequently,
e.g. millions of times per second, that can have a catastrophic effect
on performance.)

-- 
                                                          - Warp



From: clipka
Subject: Re: Radiosity: status & SMP idea
Date: 21 Dec 2008 18:40:01
Message: <web.494ed2a1b480f792cc645f00@news.povray.org>
Warp <war### [at] tagpovrayorg> wrote:
>   Also note that since the radiosity data tree is probably accessed millions
> of times per second (for both reading and writing), mutexing each access would
> be prohibitively slow, as mutexes (especially those provided by Boost) are
> very inefficient. Mutexes are only feasible for functions which are called
> relatively infrequently.

I was aware of this issue, which is why I suggested caching some intended tree
writes before paying the cost of locking.

If 100% reproducibility of shots is an issue though, another approach is indeed
needed.

(BTW, why is Boost used if it is so slow?)



From: Chambers
Subject: Re: Radiosity: status & SMP idea
Date: 21 Dec 2008 19:58:33
Message: <8B6B4030829F48BA8B6CC8229CE92246@HomePC>
> -----Original Message-----
> From: clipka [mailto:nomail@nomail]
> (BTW, why is Boost used if it is so slow?)

Boost isn't slow.  Mutexing is slow (relatively).  Using mutexes adds
a few dozen cycles to a function's run time, meaning that for rather
fast functions (which take only a few dozen cycles themselves, or even
less) it can reduce speed dramatically.

Of course, if you're looking at functions that typically take hundreds
or thousands of cycles to complete, the cost of using mutexes is
trivial.

...Ben Chambers
www.pacificwebguy.com



From: clipka
Subject: Re: Radiosity: status & SMP idea
Date: 21 Dec 2008 20:20:00
Message: <web.494eeb25b480f792cc645f00@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> The solution here is either to only pretrace (single threaded then), order
> sample insertion (only feasible in a multithreaded pretrace), or come up
> with some other division of the tree avoiding insertion data race conditions.

If 100% reproducibility of output is an issue, then other divisions of the tree
will not help either.

Presuming that having more samples than actually required is not an issue, how
about this one:

On each pretrace pass (and likewise in the main render), subdivide the image
into tiles of a standard size. Have each of these tiles processed
separately, with its own copy of the radiosity sample tree (or, rather, with
read access to the already existing tree, plus another tree to store new
samples in). At the end of each pretrace pass, have a single task merge the
forest together into one new tree.

If having too many samples were an issue, the same approach could still
work, provided that the merging task checked again for redundancy of
samples.
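A minimal sketch of that merge step, under the assumption that samples can be
keyed by a quantized position; `SampleTree` and `mergeForest` are illustrative
stand-ins for the example, not POV-Ray code:

```cpp
#include <map>
#include <vector>

// Each tile worker reads the existing shared tree but writes new samples
// only into its own private tree; after the pass, a single task merges the
// private trees, skipping samples that turned out to be redundant.

using SampleTree = std::map<int, double>;  // key: quantized position

void mergeForest(SampleTree& shared, const std::vector<SampleTree>& tiles) {
    for (const auto& tile : tiles)
        for (const auto& kv : tile)
            shared.insert(kv);  // insert() skips keys already present,
                                // which acts as the redundancy check
}
```

Because the workers never write to the shared tree, the pass itself needs no
locking at all; only the single-threaded merge touches it.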


I'm also thinking about totally different approaches, although I can't see yet
how they could be put to good use. One basic idea would be to have a single
main pretracer thread that would find out what samples are needed, and have
other threads actually collect them.

One way this could be done would be as follows:

- A single thread takes care of the pretrace, doing all the things needed -
except for actually taking samples.

- Instead, whenever it decides that a new sample is needed, it inserts a sample
block into the tree, marks it as "incomplete", and enqueues it in a special
queue. It then plugs in a made-up value instead, and goes on minding its own
business. The picture will look crazy, but it's a pretrace after all, so no
need for beauty.

- Some other task polls the queue and does the actual sample-collecting,
shooting rays as usual. It will probably need samples from a deeper "radiosity
bounce" level, too - in that case it will not collect those itself either, but
instead just build skeleton sample blocks and enqueue them for yet another
thread to work on; of course its own sample is incomplete now, so it marks it
accordingly (or, rather, leaves it marked that way). In addition, it will keep
track of which of its own samples it is currently working on and for which
direction it needed the new sample (maybe by adding this info to a linked list
in the new sample block skeleton), and then carry on as if everything were
perfectly fine (it will even trace the current ray to full completion, in case
it needs to "order" more samples).

As the lower-level thread performs its sampling, it will fill in the sampling
block it had received to work on, and when it is finished, it will mark the
block as "completed". It will also insert it into another queue.

A thread that has "ordered" sample blocks will check its queue periodically
(e.g. whenever it has finished work on a sample) and re-visit its own "sample
jobs" for which it originally ordered a deeper-bounce sample, re-tracing the
incomplete ray(s) if it now has all necessary samples for that particular ray.
If it finds that it has completed all rays successfully, it will report the
sample as finished to its own "boss".


So we would have the threads kind of not running in parallel but "layered", with
one thread per radiosity "bounce" depth.
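The "order a sample and move on" hand-off between bounce levels could be built
on per-level job queues. Here is a minimal, hypothetical sketch; `SampleBlock`
and `JobQueue` are made up for illustration and the "sampling" is just a
stand-in computation:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// A sample block ordered by an upper-bounce thread: starts out incomplete,
// gets filled in and marked complete by a deeper-bounce thread.
struct SampleBlock {
    int id;
    double value = 0.0;
    bool complete = false;
};

// A simple thread-safe FIFO for passing blocks between bounce levels.
template <typename T>
class JobQueue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T pop() {  // blocks until a job is available
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this]{ return !q.empty(); });
        T v = std::move(q.front());
        q.pop();
        return v;
    }
};
```

The upper-bounce thread pushes skeleton blocks into an "orders" queue and keeps
tracing with placeholder values; the deeper-bounce thread pops them, fills them
in, and pushes them onto a "done" queue for the re-trace pass.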

In an environment with insufficient threads to fully realize this approach,
some threads would compute two or even more bounces, calculating the
intermediate-bounce samples themselves as they become needed.


A crucial question would be, of course, whether this approach would in fact
speed up anything. Obviously, we may trace some rays twice with this
setup: Once to find that we need certain samples, and then another time when
they have actually arrived.

If we look at a 1-bounce scenario, this is a non-issue: The top level tracer
does not need to re-do the pretrace picture. This would cost us the ability to
use the pretrace as a preview, but that's it. So we would have some overhead
here, but no complicated math.

The same goes for any multi-bounce scenario on a dual-core system (as I guess we
would not want to have more than one task running per core): However the bounce
levels would be distributed among the tasks, the "bottom-level" task would not
have to wait for lower-bounce samples to be computed, while the "top-level"
task would not have to re-render.


In a 2-bounce triple-core scenario, the worst thing that could happen would be
that all samples of the first bounce need totally different second-bounce
samples. But in this case we'd be screwed anyway (unless we have a very low
sample count), as it would mean that for some reason the 2nd-level bounce
requires many times as many samples as the 1st-level bounce.

In reality, I'd expect a first-bounce sample to require a roughly random subset
of second-bounce samples. So as the 1st-bounce worker is traversing its sample
jobs, more and more results will come in from the 2nd-bounce worker (which in a
2-bounce scenario does not need any further samples, so will deliver results in
sequential order), and therefore more and more rays will trace fine without
requiring a re-trace.

So effectively this can be expected to result in roughly a 1.5-fold workload
for the 1st-bounce worker (maybe something like a 1.25-fold workload in
total?), with the benefit of having three or more tasks working on it
simultaneously.


In a 3-bounce scenario, things probably get a good deal messier. If for example
we had 1000 2nd-level-bounce samples and each required all 1000
3rd-level-bounce samples, all the 1st-level-bounce samples would be rendered
twice.

So the workload would be about 1.5-fold again for the 2nd-bounce worker, but
probably close to double for the 1st-bounce worker.

So it seems that with this approach, going beyond three simultaneous tasks
during pretrace might not be worth the pain after all. On the other hand, with
enough spare computing power and a particularly deep bounce level it would
still reduce the total running time.


Anyway, I'll try to set up a simulation to get an estimate of how this approach
might scale.



From: Warp
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 05:50:29
Message: <494f70f5@news.povray.org>
Chambers <ben### [at] pacificwebguycom> wrote:
> > -----Original Message-----
> > From: clipka [mailto:nomail@nomail]
> > (BTW, why is Boost used if it is so slow?)

> Boost isn't slow.  Mutexing is slow (relatively).

  Boost mutexes are actually the slowest I have tried, although not by a
very large margin.

  Using Pthreads mutexes directly is (at least in my system) slightly
faster, and using OpenMP mutexes still slightly faster.

  The reason why Boost is used is that it's portable to most systems.
Pthreads can only be used in POSIX systems which implement it (if I'm
not mistaken, most Windows compilers don't). OpenMP is a rather new
standard and few compilers support it.

  As a curiosity, I tested the actual slowdown caused by Boost mutexes.
Consider this function:

#include <boost/thread/mutex.hpp>

boost::mutex mutex;    // shared between threads
int someCounter = 0;   // the protected state

void incrementCounter()
{
    boost::mutex::scoped_lock lock(mutex);  // held for the function's duration
    ++someCounter;
}

  I tested calling this function (from a different compilation unit, so
that the compiler cannot inline it, in order to better give a true notion
of its speed) 500000000 times in a loop (compiler options -O3 -march=native)
and in my system it takes 47 seconds. If I comment out the locking, it
takes 1.8 seconds.

  Mutexing is a rather hard problem. Not only do you have to lock out
simultaneous access from more than one thread, you also have to prioritize
the accesses. In other words, assume you have this situation using a
naive mutex:

- Thread 1 accesses the mutexed resource and gets the lock to itself.
- Now thread 2 gets scheduled and tries to access the same resource, but
  the mutex blocks it, so for now it has to wait.
- Thread 1 gets done and frees the lock. However, it needs the resource
  immediately again, and locks it again.
- Thread 2 gets scheduled and tries again to access the resource, but
  it's locked, so it can't get it.
- Repeat.

  Thread 2 may be locked out of the resource by thread 1 if the former
never gets scheduled when the lock would be free.

  For this reason the mutex must prioritize: When thread 1 tries to
immediately lock the resource again, after freeing it, the mutex should
not give it, but instead it should prioritize thread 2, who requested it
earlier.
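One classic construction with exactly this first-come-first-served behaviour is
a ticket lock: each acquirer draws a ticket, and the lock serves tickets in draw
order, so a thread that immediately re-requests the lock queues up behind
earlier waiters. A minimal sketch (illustrative only - real mutexes block and
wake threads instead of spinning):

```cpp
#include <atomic>
#include <thread>

class TicketLock {
    std::atomic<unsigned> next{0};     // next ticket to hand out
    std::atomic<unsigned> serving{0};  // ticket currently allowed in
public:
    void lock() {
        unsigned ticket = next.fetch_add(1);  // take a number
        while (serving.load() != ticket)      // spin until it's called
            std::this_thread::yield();
    }
    void unlock() {
        serving.fetch_add(1);                 // call the next number
    }
};
```

In the scenario above, Thread 1 re-locking right after its unlock would draw a
ticket behind Thread 2's, so Thread 2 gets the resource next.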

  If a naive mutex would be enough, then this would be by far the most
efficient way of locking I know of:

#include <sched.h>  // sched_yield()

void incrementCounter()
{
    static int lockFlag = 0;
    // spin until we atomically flip the flag from 0 to 1 (GCC built-in)
    while(!__sync_bool_compare_and_swap(&lockFlag, 0, 1)) sched_yield();

    ++someCounter;

    // release with the required barrier; a plain "lockFlag = 0;" store
    // could be reordered on weakly-ordered CPUs
    __sync_lock_release(&lockFlag);
}

  This runs in 15 seconds on my system. OTOH it requires the GCC extension
__sync_bool_compare_and_swap as well as the POSIX <sched.h> header. So
it's not really a viable option for portable code.

  (And of course it has the problem that it doesn't prioritize requests.)

-- 
                                                          - Warp



From: Chambers
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 06:30:30
Message: <AE8980356CD843C5B3457F1A203917C5@HomePC>
> -----Original Message-----
> From: Warp [mailto:war### [at] tagpovrayorg]
> Chambers <ben### [at] pacificwebguycom> wrote:
> > Boost isn't slow.  Mutexing is slow (relatively).
> 
>   Boost mutexes are actually the slowest I have tried, although not by
> a
> very large margin.

> (interesting stuff about mutexing snipped)

Yes, I'm aware of different designs for mutexes.  In fact, the fastest
way to deal with them is to design your container such that they aren't
even needed.  However, the only paper I've seen that said such a thing
advocated using immutable state instead.

That is, once a variable is created it cannot be modified.  In the case
of a container, you don't actually change the container; you make a
copy, set the reference to point to the modified data, and dump the
original data.
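In more modern C++ terms, that copy-and-swap idea might be sketched like this
(illustrative only; `appendSample` and the flat vector stand in for the real
sample container, using the C++11 `std::atomic_load`/`atomic_compare_exchange`
free functions for `shared_ptr`):

```cpp
#include <atomic>
#include <memory>
#include <thread>
#include <vector>

// Readers grab the current snapshot without locking; a writer copies the
// container, modifies the copy, and atomically swaps in the new pointer.
using Samples = std::vector<double>;
std::shared_ptr<const Samples> g_samples = std::make_shared<const Samples>();

std::shared_ptr<const Samples> readSnapshot() {
    return std::atomic_load(&g_samples);  // lock-free read of the pointer
}

void appendSample(double s) {
    for (;;) {
        auto old = std::atomic_load(&g_samples);
        auto copy = std::make_shared<Samples>(*old);  // copy the container
        copy->push_back(s);                           // modify the copy
        std::shared_ptr<const Samples> desired = copy;
        // publish, unless another writer swapped in the meantime; then retry
        if (std::atomic_compare_exchange_strong(&g_samples, &old, desired))
            return;
    }
}
```

Readers never block, but every write copies the whole container - which is
exactly the "big slowdown if done wrong" trade-off mentioned above.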

I was so impressed that I implemented a few tests of the idea just to
get a feel for it.  It CAN be extremely fast, and threading CAN work
effectively with it, but generally it's a pain in the a** and, if done
wrong, you can screw up your data or get a big slowdown.  And you're
still faced with the race conditions when radiosity samples depend on
other radiosity samples (that is, the result of sampling depends on what
other samples are taken - but in a threaded environment, we don't know
what samples will be taken first).

Would it be faster if different threads rarely accessed the same locked
data?  We could split the sample tree into several smaller trees
(sub-trees, if you will) based on scene geography (grouping close
samples in the same tree).  Then, when performing sampling, a thread
would be most likely to only need access to a few of the sub-trees
(those nearest), and rarely the sub-trees further away.  If you only
have a few threads, then the chances of multiple threads needing the
same sub-tree at the same time are minimized.  Having more threads would
add more overhead, of course, because thread collision is more likely.

Or, depending on the implementation of the tree, you wouldn't need to
lock the whole tree, just one particular node and all its sub-nodes.

Of course, I don't even know if such a thing would speed things up at
all, or if it's a pointless optimization.

...Ben Chambers
www.pacificwebguy.com



From: Christian Froeschlin
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 06:54:57
Message: <494f8011$1@news.povray.org>
Warp wrote:

> - Thread 1 gets done and frees the lock. However, it needs the resource
>   immediately again, and locks it again.
> - Thread 2 gets scheduled and tries again to access the resource, but
>   it's locked, so it can't get it.

I don't know the pthread library well, but are you sure its
lock function works by polling? I'd have expected it to wait
for a signal sent by the unlock function, and Thread 2 to
get awakened before Thread 1 can lock again.

Also, why should Thread 2 have a higher priority anyway? It
may feel unfair for it to have to wait all the time, but if
both threads lock the resource too frequently one of them
is bound to be waiting all the time, so the throughput
is not helped by fairness.



From: Thorsten Froehlich
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 07:14:43
Message: <494f84b3$1@news.povray.org>
Warp wrote:
>   As a curiosity, I tested the actual slowdown caused by Boost mutexes.

Sorry, but this is a fairly baseless claim because there are no "Boost" 
mutexes. Boost provides nothing more than a very low-overhead (zero overhead 
if using an optimizing compiler) wrapper over Posix mutexes on Unix and 
Win32 mutexes on Windows. As such, "Boost mutexes" cannot be slower than the 
native operating system's mutexes because they *are* nothing more than the 
native operating system's mutexes!

	Thorsten



Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.