POV-Ray : Newsgroups : povray.beta-test : Radiosity: status & SMP idea (Messages 5 to 14 of 74)
From: Chambers
Subject: Re: Radiosity: status & SMP idea
Date: 21 Dec 2008 19:58:33
Message: <8B6B4030829F48BA8B6CC8229CE92246@HomePC>
> -----Original Message-----
> From: clipka [mailto:nomail@nomail]
> (BTW, why is Boost used if it is so slow?)

Boost isn't slow.  Mutexing is slow (relatively).  Using mutexes adds
a few dozen cycles to a function's run time, meaning that for rather
fast functions (which take only a few dozen cycles themselves, or even
less) it can reduce speed dramatically.

Of course, if you're looking at functions that typically take hundreds
or thousands of cycles to complete, the cost of using mutexes is
trivial.

...Ben Chambers
www.pacificwebguy.com



From: clipka
Subject: Re: Radiosity: status & SMP idea
Date: 21 Dec 2008 20:20:00
Message: <web.494eeb25b480f792cc645f00@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> The solution here is either to only pretrace (single threaded then), order
> sample insertion (only feasible in a multithreaded pretrace), or come up
> with some other division of the tree avoiding insertion data race conditions.

If 100% reproducibility of the output is an issue, then other divisions of the
tree will not help either.

Presuming that having more samples than actually required is not an issue,
how about this one:

On each pretrace pass (and likewise in the main render), subdivide the image
into tiles of some standard size. Have each of these tiles processed
separately, with its own copy of the radiosity sample tree (or, rather, with
read access to the already existing tree, plus another tree to store new
samples in). At the end of each pretrace pass, have a single task merge this
forest of trees into one new tree.
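
To illustrate, a very rough sketch of the per-tile "forest" (all names here
are invented; the real sample tree is an octree, not a flat vector):

#include <cstddef>
#include <vector>

struct RadiositySample { /* position, direction, irradiance, ... */ };
typedef std::vector<RadiositySample> SampleTree;  // stand-in for the real octree

struct TileJob
{
    const SampleTree* sharedTree;  // read-only: samples from earlier passes
    SampleTree newSamples;         // private: samples found while tracing this tile
};

// During the pass, each worker traces one tile at a time; lookups consult
// both *sharedTree and newSamples, insertions go into newSamples only,
// so no locking is needed while tracing.

// A single task at the end of each pass merges the forest into one new tree.
void mergePass(SampleTree& shared, std::vector<TileJob>& jobs)
{
    for (std::size_t i = 0; i < jobs.size(); ++i)
        for (std::size_t j = 0; j < jobs[i].newSamples.size(); ++j)
        {
            // this is where a redundancy check could be inserted
            shared.push_back(jobs[i].newSamples[j]);
        }
}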

If having too many samples were an issue, the same approach could still
work, provided that the merging task checked again for redundancy of
samples.


I'm also thinking about totally different approaches, although I can't see yet
how they could be put to good use. One basic idea would be to have a single
main pretracer thread that would find out what samples are needed, and have
other threads actually collect them.

One way this could be done would be as follows:

- A single thread takes care of the pretrace, doing all the things needed -
except for actually taking samples.

- Instead, whenever it decides that a new sample is needed, it inserts a sample
block into the tree, marks it as "incomplete", and enqueues it in a special
queue. It then uses a placeholder value instead, and goes on minding its own
business. The picture will look crazy, but it's a pretrace after all, so no
need for beauty.

- Some other task polls the queue and does the actual sample collecting,
shooting rays as usual. It will probably need samples from a deeper "radiosity
bounce" level, too - in that case it will not collect those itself either, but
instead just build skeleton sample blocks and enqueue them for yet another
thread to work on; of course its own sample is incomplete now, so it leaves it
marked that way. In addition, it will remember which of its own samples it is
currently working on and for which direction it needed the new sample (maybe
by adding this info to some linked list in the new sample block skeleton), and
then go on as if everything were perfectly fine (it will even trace the
current ray to full completion, just in case it needs to "order" more
samples).

As the lower-level thread performs its sampling, it will fill in the sampling
block it had received to work on, and when it is finished, it will mark the
block as "completed". It will also insert it into another queue.

A thread that has "ordered" sample blocks will check its queue periodically
(e.g. whenever it has finished work on a sample), and re-visit its own "sample
jobs" for which it originally ordered the deeper-bounce samples, re-tracing
the incomplete ray(s) once it has all necessary samples for a particular ray.
If it finds that it has now completed all rays successfully, it will report
the sample as finished to its own "boss".


So the threads would be running not so much in parallel as "layered", with
one thread per radiosity "bounce" depth.
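
As a very rough sketch of that plumbing (again, all names invented; the queues
would of course need their own locking or a lock-free design):

#include <deque>

enum BlockStatus { INCOMPLETE, COMPLETED };

struct SampleBlock
{
    BlockStatus status;       // skeletons start out INCOMPLETE
    SampleBlock* orderedBy;   // the higher-level sample waiting for this one
    // ... position, direction, gathered irradiance ...
};

struct BounceWorker
{
    std::deque<SampleBlock*> inbox;     // skeletons "ordered" from this level
    std::deque<SampleBlock*> finished;  // completed blocks for the level above
    BounceWorker* deeper;               // worker for the next bounce depth, or NULL

    void step()
    {
        if (inbox.empty()) return;
        SampleBlock* block = inbox.front();
        inbox.pop_front();
        // shoot the sample rays; whenever a deeper-bounce sample is missing,
        // push a skeleton into deeper->inbox instead of computing it here,
        // remember which ray needed it, and re-trace that ray once the
        // skeleton turns up COMPLETED in our own 'finished' queue
        block->status = COMPLETED;
        finished.push_back(block);
    }
};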

In an environment with insufficient threads to fully realize this approach,
some threads would compute two or even more bounces, calculating the
intermediate-bounce samples themselves as they become needed.


A crucial question is, of course, whether this approach would in fact speed up
anything. Obviously, we may trace some rays twice with this setup: once to
find that we need certain samples, and then a second time once they have
actually arrived.

If we look at a 1-bounce scenario, this is a non-issue: The top level tracer
does not need to re-do the pretrace picture. This would cost us the ability to
use the pretrace as a preview, but that's it. So we would have some overhead
here, but no complicated math.

The same goes for any multi-bounce scenario on a dual-core system (as I guess we
would not want to have more than one task running per core): However the bounce
levels would be distributed among the tasks, the "bottom-level" task would not
have to wait for lower-bounce samples to be computed, while the "top-level"
task would not have to re-render.


In a 2-bounce triple-core scenario, the worst thing that could happen would be
that every sample of the first bounce needs a totally different set of
second-bounce samples. But in this case we'd be screwed anyway (unless we have
a very low sample count), as it would mean that for some reason the 2nd-level
bounce requires N times as many samples as the 1st-level bounce, where N is
the 1st-level sample count.

In reality, I'd expect a first-bounce sample to require a roughly random cut of
second-bounce samples. So as the 1st-bounce worker is traversing its sample
jobs, more and more results will come in from the 2nd-bounce worker (which in a
2-bounce scenario does not need any further samples, so will deliver results in
sequential order), and therefore more and more rays will trace fine without
requiring a re-trace.

So effectively this can be expected to result in roughly a 1.5-fold workload
for the 1st-bounce worker (on average, about half of the "ordered" 2nd-bounce
samples will not have arrived yet when a ray is first traced, so about half of
the rays get traced twice) - maybe something like a 1.25-fold workload in
total? - at the benefit of having three or more tasks working on it
simultaneously.


In a 3-bounce scenario, things probably get a good deal messier. If for
example we had 1000 2nd-level-bounce samples and each required all 1000
3rd-level-bounce samples, all the 1st-level-bounce samples would be rendered
twice.

So the workload would be about 1.5-fold again for the 2nd-bounce worker, but
probably close to double for the 1st-bounce worker.

So it seems that with this approach, going beyond three simultaneous tasks
during pretrace might not be worth the pain after all. On the other hand, with
enough spare computing power and a particularly deep bounce level it would
still reduce the total running time.


Anyway, I'll try to set up a simulation to get an estimate of how this
approach might scale.



From: Warp
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 05:50:29
Message: <494f70f5@news.povray.org>
Chambers <ben### [at] pacificwebguycom> wrote:
> > -----Original Message-----
> > From: clipka [mailto:nomail@nomail]
> > (BTW, why is Boost used if it is so slow?)

> Boost isn't slow.  Mutexing is slow (relatively).

  Boost mutexes are actually the slowest I have tried, although not by a
very large margin.

  Using Pthreads mutexes directly is (at least on my system) slightly
faster, and OpenMP mutexes are slightly faster still.

  The reason why Boost is used is that it's portable to most systems.
Pthreads can only be used on POSIX systems which implement it (if I'm
not mistaken, most Windows compilers don't). OpenMP is a rather new
standard and few compilers support it.

  As a curiosity, I tested the actual slowdown caused by Boost mutexes.
Consider this function:

#include <boost/thread/mutex.hpp>

boost::mutex mutex;        // protects someCounter
long someCounter = 0;

void incrementCounter()
{
    boost::mutex::scoped_lock lock(mutex);  // unlocked automatically on return
    ++someCounter;
}

  I tested calling this function (from a different compilation unit, so
that the compiler cannot inline it, in order to better give a true notion
of its speed) 500000000 times in a loop (compiler options -O3 -march=native)
and in my system it takes 47 seconds. If I comment out the locking, it
takes 1.8 seconds.
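
  (Reconstructed roughly, the benchmark harness might look like this;
incrementCounter() lives in the other compilation unit, so the call cannot
be inlined:)

#include <cstdio>
#include <ctime>

void incrementCounter();   // defined in the other compilation unit

int main()
{
    std::clock_t start = std::clock();
    for (long i = 0; i < 500000000L; ++i)
        incrementCounter();
    std::printf("%.1f seconds\n",
                double(std::clock() - start) / CLOCKS_PER_SEC);
    return 0;
}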

  Mutexing is a rather hard problem. It's not enough to merely lock access
from more than one thread at the same time; you also have to prioritize
the accesses. In other words, assume you have this situation using a
naive mutex:

- Thread 1 accesses the mutexed resource and gets the lock to itself.
- Now thread 2 gets scheduled and tries to access the same resource, but
  the mutex blocks it, so for now it has to wait.
- Thread 1 gets done and frees the lock. However, it needs the resource
  immediately again, and locks it again.
- Thread 2 gets scheduled and tries again to access the resource, but
  it's locked, so it can't get it.
- Repeat.

  Thread 2 may be starved out of the resource by thread 1 if it never
happens to get scheduled at a moment when the lock is free.

  For this reason the mutex must prioritize: When thread 1 tries to
immediately lock the resource again after freeing it, the mutex should
not grant the lock, but should instead prioritize thread 2, which
requested it earlier.

  If a naive mutex were enough, then this would be by far the most
efficient way of locking I know of:

#include <sched.h>   // for sched_yield()

extern long someCounter;   // as above, defined elsewhere

void incrementCounter()
{
    static int lockFlag = 0;
    // spin until we atomically change the flag from 0 (free) to 1 (taken)
    while(!__sync_bool_compare_and_swap(&lockFlag, 0, 1)) sched_yield();

    ++someCounter;

    lockFlag = 0;   // release; note that on weakly-ordered CPUs this plain
                    // store would need a release barrier (e.g. __sync_lock_release)
}

  This runs in 15 seconds on my system. OTOH it requires the gcc extension
__sync_bool_compare_and_swap as well as the POSIX <sched.h> header. So
it's not really a viable option for portable code.

  (And of course it has the problem that it doesn't prioritize requests.)
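
  (For the record, a classic way to add that FIFO fairness at similar cost
is a "ticket lock". A minimal sketch, again relying on gcc extensions and
thus equally non-portable:)

#include <sched.h>

static volatile int nextTicket = 0;   // next ticket to hand out
static volatile int nowServing = 0;   // ticket currently allowed in

void fairLock()
{
    // atomically take a ticket; threads then enter in strict ticket order
    int myTicket = __sync_fetch_and_add(&nextTicket, 1);
    while (nowServing != myTicket) sched_yield();
}

void fairUnlock()
{
    __sync_synchronize();   // make our writes visible before handing over
    ++nowServing;           // admit the next ticket holder (FIFO)
}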

-- 
                                                          - Warp



From: Chambers
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 06:30:30
Message: <AE8980356CD843C5B3457F1A203917C5@HomePC>
> -----Original Message-----
> From: Warp [mailto:war### [at] tagpovrayorg]
> Chambers <ben### [at] pacificwebguycom> wrote:
> > Boost isn't slow.  Mutexing is slow (relatively).
> 
>   Boost mutexes are actually the slowest I have tried, although not by
> a very large margin.

> (interesting stuff about mutexing snipped)

Yes, I'm aware of different designs for mutexes.  In fact, the fastest
way to deal with them is to design your container such that they aren't
even needed.  However, the only paper I've seen that said such a thing
advocated using immutable state instead.

That is, once a variable is created it cannot be modified.  In the case
of a container, you don't actually change the container; you make a
copy, set the reference to point to the modified data, and dump the
original data.
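
A minimal sketch of that idea, with invented names, using the atomic
shared_ptr operations of newer C++ standard libraries (so nothing we could
actually use in POV-Ray today):

#include <atomic>
#include <memory>
#include <vector>

typedef std::vector<double> SampleData;   // stand-in for the real container

std::shared_ptr<const SampleData> current = std::make_shared<SampleData>();

// readers just grab the current immutable snapshot - they never block
std::shared_ptr<const SampleData> readSnapshot()
{
    return std::atomic_load(&current);
}

// a writer copies the container, modifies the copy, and atomically
// republishes the pointer; the old data dies once the last reader drops it
void addSample(double s)
{
    for (;;)
    {
        std::shared_ptr<const SampleData> oldData = std::atomic_load(&current);
        std::shared_ptr<SampleData> newData =
            std::make_shared<SampleData>(*oldData);   // the copy
        newData->push_back(s);                        // modify the copy only
        if (std::atomic_compare_exchange_strong(
                &current, &oldData,
                std::shared_ptr<const SampleData>(newData)))
            return;   // published
        // lost a race against another writer: retry with the fresh data
    }
}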

I was so impressed that I implemented a few tests of the idea just to
get a feel for it.  It CAN be extremely fast, and threading CAN work
effectively with it, but generally it's a pain in the a** and, if done
wrong, you can screw up your data or get a big slowdown.  And you're
still faced with race conditions when radiosity samples depend on
other radiosity samples (that is, the result of sampling depends on which
other samples are taken - but in a threaded environment, we don't know
which samples will be taken first).

Would it be faster if different threads rarely accessed the same locked
data?  We could split the sample tree into several smaller trees
(sub-trees, if you will) based on scene geography (grouping close
samples in the same tree).  Then, when performing sampling, a thread
would be most likely to only need access to a few of the sub-trees
(those nearest), and rarely the sub-trees further away.  If you only
have a few threads, then the chances of multiple threads needing the
same sub-tree at the same time are minimized.  Having more threads would
add more overhead, of course, because thread collision is more likely.

Or, depending on the implementation of the tree, you wouldn't need to
lock the whole tree, just one particular node and all its sub-nodes.
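
As a rough sketch of what that might look like, with invented names and a
crude spatial hash standing in for real geographic grouping:

#include <boost/thread/mutex.hpp>
#include <vector>

struct Vec3 { double x, y, z; };
struct RadiositySample { Vec3 pos; /* irradiance, ... */ };

const unsigned NUM_SUBTREES = 16;

struct SubTree
{
    boost::mutex lock;                      // guards only this sub-tree
    std::vector<RadiositySample> samples;   // stand-in for a real octree
};

SubTree subTrees[NUM_SUBTREES];

// route a position to "its" sub-tree; a real version would group by
// scene geography rather than by a hash
unsigned subTreeIndex(const Vec3& p)
{
    unsigned h = unsigned(int(p.x)) * 73856093u
               ^ unsigned(int(p.y)) * 19349663u
               ^ unsigned(int(p.z)) * 83492791u;
    return h % NUM_SUBTREES;
}

void insertSample(const RadiositySample& s)
{
    SubTree& t = subTrees[subTreeIndex(s.pos)];
    boost::mutex::scoped_lock guard(t.lock);   // threads only contend when
    t.samples.push_back(s);                    // they touch the same region
}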

Of course, I don't even know if such a thing would speed things up at
all, or if it's a pointless optimization.

...Ben Chambers
www.pacificwebguy.com



From: Christian Froeschlin
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 06:54:57
Message: <494f8011$1@news.povray.org>
Warp wrote:

> - Thread 1 gets done and frees the lock. However, it needs the resource
>   immediately again, and locks it again.
> - Thread 2 gets scheduled and tries again to access the resource, but
>   it's locked, so it can't get it.

I don't know the pthread library well, but are you sure its
lock function works by polling? I'd have expected it to wait
for a signal sent by the unlock function, and Thread 2 to
be awakened before Thread 1 can lock again.

Also, why should Thread 2 have a higher priority anyway? It
may feel unfair for it to have to wait all the time, but if
both threads lock the resource too frequently one of them
is bound to be waiting all the time, so the throughput
is not helped by fairness.



From: Thorsten Froehlich
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 07:14:43
Message: <494f84b3$1@news.povray.org>
Warp wrote:
>   As a curiosity, I tested the actual slowdown caused by Boost mutexes.

Sorry, but this is a fairly baseless claim because there are no "Boost" 
mutexes. Boost provides nothing more than a very low-overhead (zero overhead 
if using an optimizing compiler) wrapper over Posix mutexes on Unix and 
Win32 mutexes on Windows. As such, "Boost mutexes" cannot be slower than the 
native operating system's mutexes because they *are* nothing more than the 
native operating system's mutexes!

	Thorsten



From: Thorsten Froehlich
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 07:17:08
Message: <494f8544$1@news.povray.org>
Chambers wrote:
> Yes, I'm aware of different designs for mutexes.  In fact, the fastest
> way to deal with them is to design your container such that they aren't
> even needed.  However, the only paper I've seen that said such a thing
> advocated using immutable state instead.

Look for papers on lock-free containers.

	Thorsten



From: Warp
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 08:03:47
Message: <494f9033@news.povray.org>
Chambers <ben### [at] pacificwebguycom> wrote:
> That is, once a variable is created it cannot be modified.  In the case
> of a container, you don't actually change the container; you make a
> copy, set the reference to point to the modified data, and dump the
> original data.

  How exactly does that help? You still have to mutex the copying to
avoid two threads doing it simultaneously. You also have to mutex reading
the data container because you have to avoid reading it while another
thread is copying it (ie. effectively changing its contents, which would
mean that the first thread would get old data, even assuming it won't
access freed memory at any point).

> Would it be faster if different threads rarely accessed the same locked
> data?

  AFAIK the problem with POV-Ray's radiosity is that if all threads don't
have the exact same data, you may get different lighting in different
threads (effectively, you will get squares with slightly different
lighting). If one thread adds samples to the container, the other threads
must see them immediately.

  That, of course, introduces another problem: If two threads want to
sample the same location, only one of them should do it. The other one
should wait for it to finish (it wouldn't make much sense for two
threads to calculate the exact same thing), and then they can both use the
same samples. This adds an additional level of complexity.
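
  (A standard shape for that "compute once, everybody else waits" logic
would be something like the following sketch - invented names, Boost
primitives:)

#include <boost/thread/condition_variable.hpp>
#include <boost/thread/mutex.hpp>
#include <map>

struct Sample { double value; bool ready; };

double computeSample(long key);   // the expensive ray-shooting part

boost::mutex treeLock;
boost::condition_variable sampleReady;
std::map<long, Sample> samples;   // keyed by quantized sample location

double getSample(long key)
{
    boost::mutex::scoped_lock guard(treeLock);
    std::map<long, Sample>::iterator it = samples.find(key);
    if (it == samples.end())
    {
        Sample& s = samples[key];   // claim the slot while holding the lock
        s.ready = false;
        guard.unlock();                   // don't hold the lock while sampling
        double v = computeSample(key);
        guard.lock();
        s.value = v;
        s.ready = true;
        sampleReady.notify_all();         // wake anyone waiting for this sample
        return v;
    }
    while (!it->second.ready)             // another thread is computing it: wait
        sampleReady.wait(guard);
    return it->second.value;
}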

-- 
                                                          - Warp



From: Warp
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 08:14:03
Message: <494f929a@news.povray.org>
Christian Froeschlin <chr### [at] chrfrde> wrote:
> I don't know the pthread library well, but are you sure its
> lock function works by polling? I'd have expected it to wait
> for a signal sent by the unlock function and Thread 1 to
> get awakened before Thread 2 can lock again.

  You still need a list of threads waiting for the lock to become available.
You might have several threads waiting in line for the lock, and you can't
send a signal to all of them. You should send the signal to the one which
has been waiting the longest. IOW, you need a FIFO.

> Also, why should Thread 2 have a higher priority anyway? It
> may feel unfair for it to have to wait all the time, but if
> both threads lock the resource too frequently one of them
> is bound to be waiting all the time, so the throughput
> is not helped by fairness.

  Because if you don't distribute the lock evenly, you may get uneven load,
and the benefit of multithreading will suffer.

  For example, suppose you have a memory pool from which threads can
allocate and deallocate objects for whatever task they are performing.
If one thread is constantly obstructing another thread from allocating
objects from the pool (as might in theory happen in the way I described),
you will effectively have a single-threaded program rather than a
two-threaded one: one of the threads will be idle (or seriously
underutilized) because its access to the memory pool is constantly being
blocked by the other thread.

  If the second thread happens to get access to the memory pool by chance,
the situation may reverse itself (iow. now it's the first thread that can't
get to the memory pool), but the end result is the same.

  (Granted, in practice it would be quite unlikely for a situation like
this to happen even with a naive mutex, especially if the threads
allocate and deallocate objects from the memory pool relatively
infrequently. However, the more frequently they do it, the more likely
it is that they will start obstructing each other, decreasing the overall
performance of the program.)

-- 
                                                          - Warp



From: Warp
Subject: Re: Radiosity: status & SMP idea
Date: 22 Dec 2008 08:22:52
Message: <494f94ac@news.povray.org>
Thorsten Froehlich <tho### [at] trfde> wrote:
> Warp wrote:
> >   As a curiosity, I tested the actual slowdown caused by Boost mutexes.

> Sorry, but this is a fairly baseless claim because there are no "Boost" 
> mutexes. Boost provides nothing more than a very low-overhead (zero overhead 
> if using an optimizing compiler) wrapper over Posix mutexes on Unix and 
> Win32 mutexes on Windows. As such, "Boost mutexes" cannot be slower than the 
> native operating system's mutexes because they *are* nothing more than the 
> native operating system's mutexes!

  One would think so, but one would be wrong.

  If, in the example program I mentioned, I change the Boost mutex to a
Pthreads mutex, the running time of the program decreases from 47 seconds
to 41 seconds. I measured both versions several times, and the times were
completely consistent, with negligible variation (on the order of plus/minus
0.1 seconds).

  The difference is small, but it exists. The "thin" wrapper in Boost does
add its own small overhead.

  (I read somewhere that newer versions of Boost fix this small problem
by resorting to more inlining. It might be that I have a version of Boost
which does not inline as much.)

  OpenMP mutexes are slightly faster than Pthreads mutexes, at least here.

-- 
                                                          - Warp


