POV-Ray: Newsgroups: povray.off-topic: Dual Server Failure

From: Tom Austin
Subject: Dual Server Failure
Date: 13 Jan 2012 14:06:24
Message: <4f1080b0@news.povray.org>

OK, this morning - the one morning that I slept in and went to work more
'on-time' - we had a server failure.

Workstations could not access the file shares.

First step of action with Windows - reboot.

Rebooting the Windows server caused it to die - literally.
Nothing on the screen - just some cryptic beeps.
After some looking - memory module went dead - of course it had to be 
the 'big' one - so from 2.5 GB down to 1.5GB.....

Ok, so got that fixed.

Now bringing that back up - workstations still cannot access the file 
server.

The Linux file server uses Winbind with Samba to authenticate with 
domain credentials......  for some reason that is broken.

So I had the fun of reconfiguring Samba on the fly to just share files
without much security.

So now the hunt - why did winbind die......

From: Warp
Subject: Re: Dual Server Failure
Date: 13 Jan 2012 14:17:19
Message: <4f10833e@news.povray.org>

Tom Austin <voi### [at] voidnet> wrote:
> Workstations could not access the file shares.

> After some looking - memory module went dead - of course it had to be 
> the 'big' one - so from 2.5 GB down to 1.5GB.....

  One would think that if the server is mission-critical, it would have
redundant hardware. In other words, if for example a memory module dies,
the only consequence is that the amount of available RAM decreases and
a big-ass notification is logged somewhere, but otherwise the service
continues as usual.

  Of course this requires specialized server hardware, as well as
software support. (I don't even know if Windows supports this. I'm
assuming NT and its spawns ought to, but I have never heard either way.)

  And also I'm assuming this is not cheap, so management do not want.

-- 
                                                          - Warp

From: Darren New
Subject: Re: Dual Server Failure
Date: 14 Jan 2012 00:23:40
Message: <4f11115c$1@news.povray.org>

On 1/13/2012 11:17, Warp wrote:
>    One would think that if the server is mission-critical, it would have
> redundant hardware.

Or a trivial replacement system. Turn off the power, slide the disk drive 
out, slide it into the other case, turn the power on.

Otherwise you're in the Tandem Computing realm, where the system doesn't die 
for decades at a time, and occasionally a new board shows up via FedEx with 
instructions on which one to replace.  "Oh, and by the way, we upgraded your 
kernel without you noticing, while you were out."

>    And also I'm assuming this is not cheap, so management do not want.

Making something truly zero down-time is exceedingly expensive.

-- 
Darren New, San Diego CA, USA (PST)
   People tell me I am the counter-example.

From: Orchid XP v8
Subject: Re: Dual Server Failure
Date: 15 Jan 2012 06:03:37
Message: <4f12b289$1@news.povray.org>

On 13/01/2012 07:17 PM, Warp wrote:

>    One would think that if the server is mission-critical, it would have
> redundant hardware. In other words, if for example a memory module dies,
> the only consequence is that the amount of available RAM decreases and
> a big-ass notification is logged somewhere, but otherwise the service
> continues as usual.
>
>    Of course this requires specialized server hardware, as well as
> software support. (I don't even know if Windows supports this. I'm
> assuming NT and its spawns ought to, but I have never heard either way.)

For a time we had a HP ProLiant server with a memory "RAID" feature. 
(This is on top of all the memory being ECC RAM.) It's transparent to 
the OS.

For example, I might fit two 4GB RAM modules in a mirror configuration. 
The OS sees 4GB installed. If the ECC on one of them starts reporting 
uncorrectable errors, the system board will transparently fetch data 
from the other RAM module, as if nothing ever happened. In addition, an 
LED lights up on the front of the chassis, showing you exactly where on 
the motherboard the failed RAM module is, so you can replace it. (I'm 
unsure whether it was hot-swappable...)
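
The mirroring behaviour described above can be modelled roughly as follows. This is only a toy sketch in Python, not how the ProLiant firmware actually works: the class names, the "DIMM-A"/"DIMM-B" labels, and the `failed` list standing in for the front-panel LED are all made up for illustration.

```python
class UncorrectableError(Exception):
    """Raised when a module's ECC cannot repair the data at an address."""


class Module:
    """Hypothetical model of one RAM module."""

    def __init__(self, name, size):
        self.name = name
        self.cells = [0] * size
        self.bad = set()  # addresses that now return uncorrectable errors

    def read(self, addr):
        if addr in self.bad:
            raise UncorrectableError(f"{self.name} @ {addr}")
        return self.cells[addr]

    def write(self, addr, value):
        self.cells[addr] = value


class MirroredPair:
    """Two modules holding identical data; the OS sees the size of one."""

    def __init__(self, size):
        self.primary = Module("DIMM-A", size)
        self.mirror = Module("DIMM-B", size)
        self.failed = []  # stands in for the LED on the chassis

    def write(self, addr, value):
        # Every write goes to both modules so they stay in sync.
        self.primary.write(addr, value)
        self.mirror.write(addr, value)

    def read(self, addr):
        try:
            return self.primary.read(addr)
        except UncorrectableError:
            # Flag the failed module, then serve the read from the mirror.
            if self.primary.name not in self.failed:
                self.failed.append(self.primary.name)
            return self.mirror.read(addr)


pair = MirroredPair(size=8)
pair.write(3, 42)
pair.primary.bad.add(3)  # simulate a failure in the primary module
value = pair.read(3)     # still succeeds, served from the mirror
print(value, pair.failed)
```

The caller of `read` never sees the failure, which is the "transparent to the OS" part; the only visible trace is the entry in `failed`.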

Additionally, there were lights for EVERY INDIVIDUAL FAN (all 15 of 
them), both CPU sockets (so if one CPU dies, the server continues 
running - although I guess the OS is going to notice that one), and both 
of the redundant PSUs.

>    And also I'm assuming this is not cheap, so management do not want.

as you might imagine. (I mean, sure, that's more money than *I* will 
ever own. But for a professional business enterprise, it's potentially 
not a lot of money.)

-- 
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*

From: Orchid XP v8
Subject: Re: Dual Server Failure
Date: 15 Jan 2012 06:05:46
Message: <4f12b30a@news.povray.org>

>> And also I'm assuming this is not cheap, so management do not want.
>
> Making something truly zero down-time is exceedingly expensive.

Yeah, reducing down-time isn't usually too bad, but /zero/ down-time 
requires going to absurd lengths. We're talking about backup power 
generators, multiple telecom providers, multiple physical locations, 
continuous data replication, multiple redundant systems... it gets 
expensive rapidly. As with any buying decision, you need to look at the
cost of down-time versus the cost of preventing it.
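
That comparison can be sketched as a break-even calculation. A minimal sketch in Python, assuming you can estimate what the prevention costs per year and how many hours of down-time it would avoid; the function names and every figure below are hypothetical, not taken from anyone's real budget.

```python
def break_even_hourly_cost(prevention_cost_per_year, hours_avoided_per_year):
    """Hourly down-time cost above which the prevention pays for itself."""
    if hours_avoided_per_year <= 0:
        raise ValueError("hours_avoided_per_year must be positive")
    return prevention_cost_per_year / hours_avoided_per_year


def worth_it(prevention_cost_per_year, hours_avoided_per_year, downtime_cost_per_hour):
    """True if the down-time avoided costs more than preventing it."""
    avoided_loss = downtime_cost_per_hour * hours_avoided_per_year
    return avoided_loss > prevention_cost_per_year


# Hypothetical figures: 5000 per year spent to avoid 4 hours of down-time.
threshold = break_even_hourly_cost(5000.0, 4.0)
print(threshold)                      # hourly cost at which spending breaks even
print(worth_it(5000.0, 4.0, 800.0))   # an hour of down-time costs 800
print(worth_it(5000.0, 4.0, 2000.0))  # an hour of down-time costs 2000
```

With these made-up numbers the spending only makes sense once an hour of down-time costs more than 1250.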

-- 
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*

From: Tom Austin
Subject: Re: Dual Server Failure
Date: 16 Jan 2012 07:36:20
Message: <4f1419c4@news.povray.org>

On 1/13/2012 2:17 PM, Warp wrote:
> Tom Austin<voi### [at] voidnet>  wrote:
>> Workstations could not access the file shares.
>
>> After some looking - memory module went dead - of course it had to be
>> the 'big' one - so from 2.5 GB down to 1.5GB.....
>
>    One would think that if the server is mission-critical, it would have
> redundant hardware. In other words, if for example a memory module dies,
> the only consequence is that the amount of available RAM decreases and
> a big-ass notification is logged somewhere, but otherwise the service
> continues as usual.
>

Yes, I agree, but we are a very small business.  We are only 7 people 
and have doubled in size in the past year.

The Windows server is an old low end Dell server machine.
I'm glad it was the memory module and not something on the MB.

We are working on migrating off of it as time allows.

>    Of course this requires specialized server hardware, as well as
> software support. (I don't even know if Windows supports this. I'm
> assuming NT and its spawns ought to, but I have never heard either way.)
>
>    And also I'm assuming this is not cheap, so management do not want.
>

We do have some methods to getting back up relatively quickly - tho they 
are not instantaneous.  Management is now willing to spend the money 
where needed - tho I don't think we will go to fail-safe quite yet.

From: Invisible
Subject: Re: Dual Server Failure
Date: 16 Jan 2012 08:11:20
Message: <4f1421f8$1@news.povray.org>

On 16/01/2012 12:36 PM, Tom Austin wrote:

> We do have some methods to getting back up relatively quickly - tho they
> are not instantaneous. Management is now willing to spend the money
> where needed - tho I don't think we will go to fail-safe quite yet.

Yeah, it's funny... I've noticed this strange correlation between 
expensive down-time and management willingness to invest in 
fault-tolerant equipment. ;-)

From: Francois Labreque
Subject: Re: Dual Server Failure
Date: 16 Jan 2012 08:51:02
Message: <4f142b46$1@news.povray.org>

> On 16/01/2012 12:36 PM, Tom Austin wrote:
>
>> We do have some methods to getting back up relatively quickly - tho they
>> are not instantaneous. Management is now willing to spend the money
>> where needed - tho I don't think we will go to fail-safe quite yet.
>
> Yeah, it's funny... I've noticed this strange correlation between
> expensive down-time and management willingness to invest in
> fault-tolerant equipment. ;-)

One of my all-time favorite post-mortem meetings went something like this:

Background info: there was a major catastrophe the night before.  A 
repeat of previous incidents.  As a result, our customer made the news.

10:45am: we meet with our management to quickly explain to the big 
bosses what happened, what we did to resolve it, and how we wrote this 
nice proposal six months ago to permanently deal with the issue.  Big 
Boss says "we'll try to steer clear of We-Told-You-So, but we'll also 
remind them that this could have been avoided.  Let me do the talking 
and jump in if I mess up on my technical mumbo-jumbo"

11:00am: We walk in the customer's board room.  Before we're fully 
seated, the CIO opens up by saying "We know you told us so last time, 
but... is there something else we can do other than upgrade these 
Whatchamacallits?"

...

It took them two more major incidents - including one where they ended 
up on the cover of Time Magazine - before those old whatchamacallits 
were removed.

-- 
/*Francois Labreque*/#local a=x+y;#local b=x+a;#local c=a+b;#macro P(F//
/*    flabreque    */L)polygon{5,F,F+z,L+z,L,F pigment{rgb 9}}#end union
/*        @        */{P(0,a)P(a,b)P(b,c)P(2*a,2*b)P(2*b,b+c)P(b+c,<2,3>)
/*   gmail.com     */}camera{orthographic location<6,1.25,-6>look_at a }

From: Invisible
Subject: Re: Dual Server Failure
Date: 16 Jan 2012 09:05:42
Message: <4f142eb6$1@news.povray.org>

On 16/01/2012 01:51 PM, Francois Labreque wrote:

> "We know you told us so last time,
> but... is there something else we can do other than upgrade these
> Whatchamacallits?"
>
> ...
>
> It took them two more major incidents - including one where they ended
> up on the cover of Time Magazine - before those old whatchamacallits
> were removed.

One has to wonder why people are so resistant to fixing the problem. The 
solution is right there, and yet you want to go around the hard way. Why?

From: Tom Austin
Subject: Re: Dual Server Failure
Date: 16 Jan 2012 10:56:26
Message: <4f1448aa$1@news.povray.org>

On 1/16/2012 8:11 AM, Invisible wrote:
> On 16/01/2012 12:36 PM, Tom Austin wrote:
>
>> We do have some methods to getting back up relatively quickly - tho they
>> are not instantaneous. Management is now willing to spend the money
>> where needed - tho I don't think we will go to fail-safe quite yet.
>
> Yeah, it's funny... I've noticed this strange correlation between
> expensive down-time and management willingness to invest in
> fault-tolerant equipment. ;-)

As we get more people and process more data, down time will cost
more.  Eventually we may get to higher uptime requirements - but for now
2 hours of downtime 2x a year is not too bad.
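
To put a number on that: a minimal sketch in Python, assuming the servers are expected to be up around the clock (365 x 24 hours a year, which is an assumption; counting only office hours would give a lower figure).

```python
HOURS_PER_YEAR = 365 * 24  # assumption: service is expected 24x7


def availability(outages_per_year, hours_per_outage, hours_per_year=HOURS_PER_YEAR):
    """Fraction of the year the service is up."""
    downtime_hours = outages_per_year * hours_per_outage
    return 1.0 - downtime_hours / hours_per_year


# 2 hours of downtime, 2x a year
uptime = availability(2, 2.0)
print(f"{uptime:.4%}")
```

Four hours out of 8760 works out to roughly 99.95% availability.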
