POV-Ray: Newsgroups: povray.general: State preservation for catastrophic failure ... (like black-out)

From: Marvin
Subject: State preservation for catastrophic failure ... (like black-out)
Date: 17 Nov 2011 10:35:00
Message: <web.4ec52950e8b5092173cc178a0@news.povray.org>

Hi,

I am working on a paper on resilient computing, using POV-Ray ray tracer
rendering as the example. We chose POV-Ray because it is popular on grid
platforms of the CERN type.

In particular, our implementation distributes the rendering of a clock
animation across the nodes of a rendering server farm, driven by a "smart"
client. The smart client polls the render nodes in such a way that it can
survive the crash of many nodes, restart some of them, and still finish the job
(unfortunately with a performance penalty) as long as one node survives.

In case of a catastrophic failure (such as a power black-out hitting an entire
part of town), we are able to reboot the cluster and restart the smart client
with the "-C" option. It then does not restart finished frames, because it can
reconstruct much of its state from the pool of unrendered frames saved on disk.
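
A minimal sketch of how such a frame pool could be kept on disk so that it survives a sudden power loss. This is an illustration only, not the actual grc-client code: the JSON file format and the names `save_pool` and `load_pool` are assumptions made for the example.

```python
import json
import os
from pathlib import Path


def save_pool(path: Path, unrendered: set[int]) -> None:
    """Atomically write the set of unrendered frame numbers to disk."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w") as f:
        json.dump(sorted(unrendered), f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before renaming
    os.replace(tmp, path)  # rename is atomic on the same filesystem


def load_pool(path: Path, total_frames: int) -> set[int]:
    """Load the pool; with no saved state, every frame is still unrendered."""
    if not path.exists():
        return set(range(1, total_frames + 1))
    with open(path) as f:
        return set(json.load(f))
```

Writing to a temporary file and renaming it means a black-out in the middle of a save leaves the previous pool file intact rather than a half-written one.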

However, for even greater resilience, an option to restart a long POV-Ray job is
required. We are capable of restarting the POV-Ray process after a power failure
or other outage that preserves the hard disk, yet we are not able to dive into
the POV-Ray code and make modifications. IMHO that requires deep knowledge of the
code.

Our architecture offers the power of our cluster through a web interface that
delivers CPU power more like a cloud, for specific tasks, rather than in the
all-out computing manner of the CERN grid. Additionally, we are now able to
recover from node failures, whereas at the time of our participation in
SEE-GRID2 all nodes had to finish for the job to complete, and restarting the
job for one node took as much time as restarting it on all nodes.

Thank you
iJC
MT

From: Christian Froeschlin
Subject: Re: State preservation for catastrophic failure ... (like black-out)
Date: 17 Nov 2011 15:14:41
Message: <4ec56b31@news.povray.org>

> I am working on a paper on resilient computing, using POV-Ray ray tracer
> rendering as the example. We chose POV-Ray because it is popular on grid
> platforms of the CERN type.

This sounds like an interesting system.

> However, for even greater resilience, an option to restart a long POV-Ray job is
> required. We are capable of restarting the POV-Ray process after a power failure
> or other outage that preserves the hard disk, yet we are not able to dive into
> the POV-Ray code and make modifications. IMHO that requires deep knowledge of the
> code.

I'm not quite sure if this was intended as a question. Given the nature
of your project I assume you already know that POV-Ray has a -C option
to resume an aborted render, just like your smart client, but it
wasn't quite clear from your post.

From: Marvin
Subject: Re: State preservation for catastrophic failure ... (like black-out)
Date: 17 Nov 2011 15:55:01
Message: <web.4ec57415ccaca39fdb2910cf0@news.povray.org>

Christian Froeschlin <chr### [at] chrfrde> wrote:
> > I am working on a paper on resilient computing, using POV-Ray ray tracer
> > rendering as the example. We chose POV-Ray because it is popular on grid
> > platforms of the CERN type.
>
> This sounds like an interesting system.
>
> > However, for even greater resilience, an option to restart a long POV-Ray job is
> > required. We are capable of restarting the POV-Ray process after a power failure
> > or other outage that preserves the hard disk, yet we are not able to dive into
> > the POV-Ray code and make modifications. IMHO that requires deep knowledge of the
> > code.
>
> I'm not quite sure if this was intended as a question. Given the nature
> of your project I assume you already know that POV-Ray has a -C option
> to resume an aborted render, just like your smart client, but it
> wasn't quite clear from your post.

Dear Christian,

Thank you for your interest in the system. To be honest, I wasn't aware of the
+C/-C options. Thank you for this pointer.

Still, it will require a great deal of work to revamp smart client grc-client to
continue rendering where it left off.

This project is GPL'ed open source, so we might publish the code once we clean
it up and make it tidier. Right now I am not satisfied: we have a lot of dead
code that no longer serves a purpose and belongs in the archives, not in the
main source tree.

Thank you very much for this information.

iJC
MT

From: Darren New
Subject: Re: State preservation for catastrophic failure ... (like black-out)
Date: 17 Nov 2011 22:19:29
Message: <4ec5cec1$1@news.povray.org>

On 11/17/2011 12:53, Marvin wrote:
> Still, it will require a great deal of work to revamp smart client grc-client to
> continue rendering where it left off.

It does not naively seem like a lot of work to figure this out. Your jobs 
are either not started yet, started on a particular client, or finished and 
reported back to the master. The only time recovery of a client or master is 
difficult is the middle case, where part of the rendering is finished. In 
that case, when the client recovers, you get the data off the disk and start 
rendering again with a -C option, or you copy the partially-complete files 
to a new machine and fire up povray there with a -C.
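
The three cases above can be sketched roughly as follows. This is only an illustration under stated assumptions: it treats the presence of an output file on disk as the sign of a partially finished frame, the names `JobState`, `classify` and `recovery_command` are made up for the example, and it uses `+C`, which per the POV-Ray 3.6 documentation linked later in this thread is the switch that continues an interrupted trace.

```python
from enum import Enum
from pathlib import Path


class JobState(Enum):
    NOT_STARTED = "not_started"
    PARTIAL = "partial"
    FINISHED = "finished"


def classify(frame_output: Path, reported_finished: bool) -> JobState:
    """Classify one frame job after a crash or reboot."""
    if reported_finished:
        return JobState.FINISHED
    if frame_output.exists():
        # Assumption: output on disk without a report means a partial render.
        return JobState.PARTIAL
    return JobState.NOT_STARTED


def recovery_command(scene: str, frame_output: Path, state: JobState):
    """Return the povray argument list for this job, or None if done."""
    if state is JobState.FINISHED:
        return None
    cmd = ["povray", f"+I{scene}", f"+O{frame_output}"]
    if state is JobState.PARTIAL:
        cmd.append("+C")  # continue the interrupted trace
    return cmd
```

The same argument list works whether the partial files stay on the recovered client or are copied to a new machine first; only the paths change.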

What exactly do you think are the problems with failed clients? Do you have 
individual frames spread over multiple clients, or individual clients 
rendering multiple frames in parallel, or something like that?

-- 
Darren New, San Diego CA, USA (PST)
   People tell me I am the counter-example.

From: Marvin
Subject: Re: State preservation for catastrophic failure ... (like black-out)
Date: 18 Nov 2011 00:50:22
Message: <web.4ec5f108ccaca39fdb2910cf0@news.povray.org>

Darren New <dne### [at] sanrrcom> wrote:
> On 11/17/2011 12:53, Marvin wrote:
> > Still, it will require a great deal of work to revamp smart client grc-client to
> > continue rendering where it left off.
>
> It does not naively seem like a lot of work to figure this out. Your jobs
> are either not started yet, started on a particular client, or finished and
> reported back to the master. The only time recovery of a client or master is
> difficult is the middle case, where part of the rendering is finished. In
> that case, when the client recovers, you get the data off the disk and start
> rendering again with a -C option, or you copy the partially-complete files
> to a new machine and fire up povray there with a -C.
>
> What exactly do you think are the problems with failed clients? Do you have
> individual frames spread over multiple clients, or individual clients
> rendering multiple frames in parallel, or something like that?

Hi, Darren,

The problem is in the implementation. The method you propose is obvious and logical.
The problem is that I implemented rendering in TEMP dir of Linux and those do
not survive reboots. But this is only a minor issue, "the weight is in the
mind", said Zen.
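
A minimal sketch of the fix, assuming the render files are moved to a per-job directory under a base path that survives reboots. The helper names `job_workdir` and `leftover_jobs` are hypothetical, and the base directory is left as a parameter because the right location depends on the node setup.

```python
from pathlib import Path


def job_workdir(base: Path, job_id: str) -> Path:
    """Create (or reuse) a per-job directory under a persistent base."""
    workdir = base / job_id
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def leftover_jobs(base: Path) -> list[str]:
    """List job ids whose directories still hold files after a restart."""
    if not base.is_dir():
        return []
    return sorted(
        p.name for p in base.iterdir() if p.is_dir() and any(p.iterdir())
    )
```

After a reboot, the node could call `leftover_jobs` on its base directory to find the frames that were in progress. A location such as `/var/tmp`, which the Filesystem Hierarchy Standard describes as preserved between reboots, or any dedicated directory on the local disk, would serve as the base.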

NOTE: the -C option actually clears recovery data; +C is the one that continues
a render after an abrupt stop.

See: http://www.povray.org/documentation/view/3.6.1/217/

However, I need to test thoroughly before continuing, as this is a new feature.
I will test it on abrupt shutdown and blackout.

iJC
MT

From: Darren New
Subject: Re: State preservation for catastrophic failure ... (like black-out)
Date: 19 Nov 2011 01:01:22
Message: <4ec74632$1@news.povray.org>

On 11/17/2011 21:45, Marvin wrote:
> The problem is that I implemented rendering in TEMP dir of Linux and those do
> not survive reboots.


Doctor, Doctor, it hurts when I do this! :-)


-- 
Darren New, San Diego CA, USA (PST)
   People tell me I am the counter-example.
