OK, this morning - the one morning that I slept in and went to work more
'on-time' - we had a server failure.
Workstations could not access the file shares.
First course of action with Windows - reboot.
Rebooting the Windows server caused it to die - literally.
Nothing on the screen - just some cryptic beeps.
After some looking - a memory module had died - and of course it had to be
the 'big' one, so we went from 2.5 GB down to 1.5 GB.....
OK, so we got that fixed.
Now, with that back up, workstations still cannot access the file
server.
The Linux file server uses Winbind with Samba to authenticate against
domain credentials...... and for some reason that is broken.
So I had the fun of reconfiguring Samba on the fly to just share files
without much security.
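Roughly what I mean is something like this - the share name, path, and
domain below are made up, and the exact options depend on the Samba
version, so treat it as a sketch rather than my actual config:

  # /etc/samba/smb.conf - temporary fallback while winbind is broken
  [global]
     workgroup = OURDOMAIN
     security = user
     # map unknown usernames to the guest account so nobody gets prompted
     map to guest = Bad User
     guest account = nobody

  # stand-in share name and path
  [shared]
     path = /srv/shares/shared
     # no authentication required - wide open, hence 'not much security'
     guest ok = yes
     read only = no
     # (the directory itself also has to be writable by 'nobody'
     #  for guests to be able to write)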
So now the hunt - why did winbind die......
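The first things I'm going to check are the usual winbind sanity tests -
something like the following (the exact log location varies by distro):

  # is winbindd running and answering at all?
  wbinfo -p
  # is the machine's trust account with the domain still valid?
  wbinfo -t
  # can winbind still enumerate domain users and groups?
  wbinfo -u
  wbinfo -g
  # the winbind log usually says why it fell over
  less /var/log/samba/log.winbindd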
Tom Austin <voi### [at] voidnet> wrote:
> Workstations could not access the file shares.
> After some looking - memory module went dead - of course it had to be
> the 'big' one - so from 2.5 GB down to 1.5GB.....
One would think that if the server is mission-critical, it would have
redundant hardware. In other words, if for example a memory module dies,
the only consequence is that the amount of available RAM decreases and
a big-ass notification is logged somewhere, but otherwise the service
continues as usual.
Of course this requires specialized server hardware, as well as
software support. (I don't even know if Windows supports this. I'm
assuming NT and its spawns ought to, but I have never heard either way.)
And also I'm assuming this is not cheap, so management do not want.
--
- Warp
On 1/13/2012 11:17, Warp wrote:
> One would think that if the server is mission-critical, it would have
> redundant hardware.
Or a trivial replacement system. Turn off the power, slide the disk drive
out, slide it into the other case, turn the power on.
Otherwise you're in the Tandem Computers realm, where the system doesn't die
for decades at a time, and occasionally a new board shows up via FedEx with
instructions on which one to replace. "Oh, and by the way, we upgraded your
kernel without you noticing, while you were out."
> And also I'm assuming this is not cheap, so management do not want.
Making something truly zero down-time is exceedingly expensive.
--
Darren New, San Diego CA, USA (PST)
People tell me I am the counter-example.
On 13/01/2012 07:17 PM, Warp wrote:
> One would think that if the server is mission-critical, it would have
> redundant hardware. In other words, if for example a memory module dies,
> the only consequence is that the amount of available RAM decreases and
> a big-ass notification is logged somewhere, but otherwise the service
> continues as usual.
>
> Of course this requires specialized server hardware, as well as
> software support. (I don't even know if Windows supports this. I'm
> assuming NT and its spawns ought to, but I have never heard either way.)
For a time we had an HP ProLiant server with a memory "RAID" feature.
(This is on top of all the memory being ECC RAM.) It's transparent to
the OS.
For example, I might fit two 4GB RAM modules in a mirror configuration.
The OS sees 4GB installed. If the ECC on one of them starts reporting
uncorrectable errors, the system board will transparently fetch data
from the other RAM module, as if nothing ever happened. In addition, an
LED lights up on the front of the chassis, showing you exactly where on
the motherboard the failed RAM module is, so you can replace it. (I'm
unsure whether it was hot-swappable...)
Additionally, there were lights for EVERY INDIVIDUAL FAN (all 15 of
them), both CPU sockets (so if one CPU dies, the server continues
running - although I guess the OS is going to notice that one), and both
of the redundant PSUs.
> And also I'm assuming this is not cheap, so management do not want.
It is not cheap, as you might imagine. (I mean, sure, that's more money than *I* will
ever own. But for a professional business enterprise, it's potentially
not a lot of money.)
--
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*
>> And also I'm assuming this is not cheap, so management do not want.
>
> Making something truly zero down-time is exceedingly expensive.
Yeah, reducing down-time isn't usually too bad, but /zero/ down-time
requires going to absurd lengths. We're talking about backup power
generators, multiple telecom providers, multiple physical locations,
continuous data replication, multiple redundant systems... it gets
expensive rapidly. As with any buying decision, you need to look at the
cost of down-time versus the cost of preventing it.
--
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*
On 1/13/2012 2:17 PM, Warp wrote:
> Tom Austin<voi### [at] voidnet> wrote:
>> Workstations could not access the file shares.
>
>> After some looking - memory module went dead - of course it had to be
>> the 'big' one - so from 2.5 GB down to 1.5GB.....
>
> One would think that if the server is mission-critical, it would have
> redundant hardware. In other words, if for example a memory module dies,
> the only consequence is that the amount of available RAM decreases and
> a big-ass notification is logged somewhere, but otherwise the service
> continues as usual.
>
Yes, I agree, but we are a very small business. We are only 7 people,
and we have doubled in size in the past year.
The Windows server is an old, low-end Dell server machine.
I'm glad it was the memory module and not something on the motherboard.
We are working on migrating off of it as time allows.
> Of course this requires specialized server hardware, as well as
> software support. (I don't even know if Windows supports this. I'm
> assuming NT and its spawns ought to, but I have never heard either way.)
>
> And also I'm assuming this is not cheap, so management do not want.
>
We do have some methods for getting back up relatively quickly - tho they
are not instantaneous. Management is now willing to spend the money
where needed - tho I don't think we will go to fail-safe quite yet.
On 16/01/2012 12:36 PM, Tom Austin wrote:
> We do have some methods for getting back up relatively quickly - tho they
> are not instantaneous. Management is now willing to spend the money
> where needed - tho I don't think we will go to fail-safe quite yet.
Yeah, it's funny... I've noticed this strange correlation between
expensive down-time and management willingness to invest in
fault-tolerant equipment. ;-)
> On 16/01/2012 12:36 PM, Tom Austin wrote:
>
>> We do have some methods for getting back up relatively quickly - tho they
>> are not instantaneous. Management is now willing to spend the money
>> where needed - tho I don't think we will go to fail-safe quite yet.
>
> Yeah, it's funny... I've noticed this strange correlation between
> expensive down-time and management willingness to invest in
> fault-tolerant equipment. ;-)
One of my all-time favorite post-mortem meetings went something like this:
Background info: there was a major catastrophe the night before. A
repeat of previous incidents. As a result, our customer made the news.
10:45am: we meet with our management to quickly explain to the big
bosses what happened, what we did to resolve it, and how we wrote this
nice proposal six months ago to permanently deal with the issue. Big
Boss says "we'll try to steer clear of We-Told-You-So, but we'll also
remind them that this could have been avoided. Let me do the talking
and jump in if I mess up on my technical mumbo-jumbo."
11:00am: We walk into the customer's board room. Before we're fully
seated, the CIO opens up by saying "We know you told us so last time,
but... is there something else we can do other than upgrade these
Whatchamacallits?"
...
It took them two more major incidents - including one where they ended
up on the cover of Time Magazine - before those old whatchamacallits
were removed.
--
/*Francois Labreque*/#local a=x+y;#local b=x+a;#local c=a+b;#macro P(F//
/* flabreque */L)polygon{5,F,F+z,L+z,L,F pigment{rgb 9}}#end union
/* @ */{P(0,a)P(a,b)P(b,c)P(2*a,2*b)P(2*b,b+c)P(b+c,<2,3>)
/* gmail.com */}camera{orthographic location<6,1.25,-6>look_at a }
On 16/01/2012 01:51 PM, Francois Labreque wrote:
> "We know you told us so last time,
> but... is there something else we can do other than upgrade these
> Whatchamacallits?"
>
> ...
>
> It took them two more major incidents - including one where they ended
> up on the cover of Time Magazine - before those old whatchamacallits
> were removed.
One has to wonder why people are so resistant to fixing the problem. The
solution is right there, and yet they'd rather go the long way around. Why?
On 1/16/2012 8:11 AM, Invisible wrote:
> On 16/01/2012 12:36 PM, Tom Austin wrote:
>
>> We do have some methods for getting back up relatively quickly - tho they
>> are not instantaneous. Management is now willing to spend the money
>> where needed - tho I don't think we will go to fail-safe quite yet.
>
> Yeah, it's funny... I've noticed this strange correlation between
> expensive down-time and management willingness to invest in
> fault-tolerant equipment. ;-)
As we get more people and process more data, downtime will cost us more.
Eventually we may get to higher uptime requirements - but for now,
2 hours of downtime twice a year is not too bad.
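(Back of the envelope: 4 hours of downtime out of the roughly 8,760 hours
in a year is about 0.05%, i.e. somewhere around 99.95% availability.)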