OK, so as you guys know, I now write computer software for a living. The
installation DVD for this thing used to be hand-crafted, but thanks to
[mostly] my efforts it's now auto-generated by a script. Anyway,
obviously if you completely replace the installation system, you need to
go test it. So we're testing it.
So yesterday the tester is trying out the software on the various
platforms we support it on. And he starts complaining that on one
specific model of laptop, it's not working properly. He claims the
installer gets to 60% of the first step, and then skips the rest of that
step and skips the next step completely and just says "installation
successful".
So I burn a copy of the same DVD and try installing it on another laptop
of the same model. It works perfectly. But the tester is insistent; he
hands me the laptop with the DVD still in the drive. I run the
installer, and it appears to work fine. But then, around about 65%, it
suddenly "completes", just like the guy said. That's odd.
The install disk is a live Linux environment that runs a Bash script to
do the installation. It does some crazy multi-way piping and redirecting
to get the progress display to work. (Most Linux commands helpfully
provide absolutely no feedback whatsoever, which isn't very good for an
operation that takes 10 minutes to complete...) With all this piping
going on, it wouldn't surprise me if some command somewhere is emitting
an error message and it's simply been "lost" somewhere. A bit alarming
that the script still claims that "installation was successful" though...
At this point I'm wondering if maybe the disk has a scratch on it which
prevents it reading past 65% of the image file. So I boot the disk and
do an md5sum of both image files. They're both fine. Hmm. I manually run
the key command that actually does the installation. Initially I'm
getting 150 MB/sec - which is odd, given that the internal SSD device
maxes out at about 40 MB/sec. After a few minutes, the speed drops to
about 10 MB/sec - a more usual number. And then the command simply
/stops/, with no indication as to why. It claims to have completed, but
the amount of data copied is clearly too low.
So I take the DVD out and put in the one I've been testing with. I
notice the DVD I took out has a slightly older version of the image
file. But the installation script is identical, so that shouldn't matter
at this stage.
I run the installer again, and again it fails the same way. I run it
manually, and again it fails the same way. Now I'm wondering if I've
somehow burned out that sector on the SSD or something. (But surely
wear-levelling would... hmm, anyway.) So I look inside the harddrive bay
to see how old the drive is...
At this point, things get WEIRD! I look in the drive bay, and see...
daylight. It turns out there's NOTHING IN THERE. There *is* no harddrive!
This is not /that/ unusual; we do occasionally swap drives around or
take drives out of machines. So somehow our test guy ended up with a
laptop with no drive. That's not especially surprising.
But... so... if there's no drive, WHAT THE HELL IS THE INSTALLER
INSTALLING TO?!? O_O
Why does it sit there for 5 minutes apparently working perfectly when
there's NO DRIVE PRESENT?!
My colleague suggested that maybe someone had left an SD card in the
internal card reader. But no. Man, that would have been fun though!
So I open up the Bash script and start reading. This is one of the few
parts that I didn't write. (It was written early on, when I had hardly
learned how to do Bash, so my boss wrote it.) The programmers among
you will appreciate this: The script reports an error if the number of
devices connected is GREATER THAN ONE. But the script neglects to check
for the possibility that the number of devices is LESS THAN ONE. (!)
Because, hey, who the hell has a laptop with no harddrive in it?
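
A minimal sketch of the missing check, assuming the script already has a device count in hand. The function name check_device_count is hypothetical, and how the real installer counts devices is not shown in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when exactly one target disk was found.
# The caller passes in the device count it has already computed.
check_device_count() {
    local count="$1"
    # Refuse anything that is not a plain non-negative integer
    if ! [[ "$count" =~ ^[0-9]+$ ]]; then
        echo "error: device count '$count' is not a number" >&2
        return 2
    fi
    # The case the original script forgot: no disk at all
    if [ "$count" -eq 0 ]; then
        echo "error: no target disk found" >&2
        return 1
    fi
    # The case the original script did handle
    if [ "$count" -gt 1 ]; then
        echo "error: more than one target disk found" >&2
        return 1
    fi
    return 0
}
```

Testing for "exactly one" rather than "not more than one" covers both ends with the same amount of code.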
OK, well that's fine I guess, but next the script checks that the
[assumed to be] one device has sufficient space. Surely a non-existent
device has zero space on it, right? RIGHT?? How in the HELL is this test
passing?
Ah, but wait. You're thinking about this as if you're using a REAL
programming language. Bash is just text-munging. It doesn't compare the
contents of a variable to a number. It compares two bits of text to each
other. And one of those pieces of text is now "error: the device
/dev/sda does not exist", which as text sorts after "80000000000" (the
letter "e" comes after the digit "8"), and hence is reported as being
"greater than" the required disk size.
FACEPALM!
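
To illustrate the text-versus-number pitfall: inside Bash's [[ ]], the > operator compares strings in sort order, not by numeric value. The sketch below reuses the two strings from this post; the has_enough_space function is a hypothetical replacement, not the installer's actual code.

```shell
#!/usr/bin/env bash
required=80000000000
reported="error: the device /dev/sda does not exist"

# Inside [[ ]], ">" compares the operands as strings in sort order.
if [[ "$reported" > "$required" ]]; then
    echo "as text: the error message counts as 'big enough'"
fi

# Hypothetical safer check: insist on plain integers, then compare numerically.
has_enough_space() {
    local size="$1" needed="$2"
    if ! [[ "$size" =~ ^[0-9]+$ && "$needed" =~ ^[0-9]+$ ]]; then
        echo "error: cannot compare '$size' with '$needed' as numbers" >&2
        return 2
    fi
    # 10# forces base 10 so a leading zero is not read as octal
    (( 10#$size >= 10#$needed ))
}

if ! has_enough_space "$reported" "$required"; then
    echo "as numbers: rejected"
fi
```

Validating the input first means an error message can never be mistaken for a size, whichever comparison operator is used afterwards.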
So there's no error produced by the script. But how the HELL does it
image a device that DOESN'T SODDING EXIST?!
Ah, but wait. This is Linux, remember? Consider the following command:

    cat Image1.raw.gz | gzip -d | dd of=/dev/sda

Slowly it dawned on me what was happening here. If there *is* a block
device connected, then UDev will generate a special file named /dev/sda
which represents this device, and the above command will overwrite the
contents of that device. HOWEVER... if there is *no* block device
connected, then this file will not exist. In that case, rather than
produce some kind of error, the above command will *create* a regular
file named /dev/sda and try to decompress 20GB of data into it. (!)
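
One way to guard against this is to test that the target really is a block device before writing to it. A small sketch follows; the function name require_block_device is hypothetical, and the commented usage line just reuses the file names from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical guard: refuse to continue unless the target is a block device.
# "-b" is true only if the path exists AND is a block special file, so a
# missing path or a regular file that merely sits under /dev both fail.
require_block_device() {
    local target="$1"
    if [[ ! -b "$target" ]]; then
        echo "error: $target is not a block device; refusing to write" >&2
        return 1
    fi
    return 0
}

# Possible use in front of the imaging step (not executed here):
#   require_block_device /dev/sda && gzip -dc Image1.raw.gz | dd of=/dev/sda
```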
Once upon a time, that might not have worked. A DVD is a read-only
device, after all. But in this modern age, you have the DVD, which is
read-only, and you then overlay a read/write filesystem backed by RAM.
In summary, the installer is decompressing a 20GB disk image into RAM,
hence the 150MB/sec transfer speed. (Probably limited by DVD drive speed
and the rate at which the CPU can decompress the data.) After about 5GB,
free RAM has become so fragmented that the transfer rate drops to
10MB/sec as the OS desperately searches for empty pages. And eventually,
once ALL AVAILABLE RAM HAS BEEN EXHAUSTED, the process is summarily
terminated.
If this was written in a real programming language, some sort of
exception would have been thrown, which would have alerted me to the
problem (and prevented the "installation successful" message being
shown). But this is Bash. By default, it completely ignores all errors,
problems and malfunctions, and continues executing the next command as
if everything worked perfectly. So when DD gets terminated, Bash simply
executes the next line of the script - which says "installation successful"!
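
A sketch of how the success message could be tied to the outcome of the pipeline. It assumes nothing about the real script: run_and_report is a hypothetical name, and its two arguments stand in for the decompress and write stages.

```shell
#!/usr/bin/env bash
# Run a stand-in two-stage pipeline and report success only if every stage
# exited with status 0. Without pipefail, a pipeline's status is that of its
# last command only. The subshell keeps the option from leaking to the caller.
run_and_report() {
    local producer="$1" consumer="$2"
    if ( set -o pipefail; $producer | $consumer > /dev/null ); then
        echo "installation successful"
    else
        echo "installation FAILED" >&2
        return 1
    fi
}

run_and_report true cat      # both stages succeed
run_and_report false cat     # first stage fails; caught because of pipefail
run_and_report true false    # last stage fails (where dd sat in the command)
```

Only the first call prints "installation successful"; the other two report the failure on stderr.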
I told my boss, and he spent literally 15 minutes laughing
uncontrollably. There's always a lot of banter in our office, but it's
unusual for somebody to actually find something so funny that they're
actually unable to speak any more. And for 15 minutes?
Damn, this is probably the most amusing bug I've ever seen.
Fortunately, the fix is very simple. You just need to add a check for
the possibility of there being zero devices. But I love the way that
there are three separate stages where the problem *should* have been
caught, but wasn't. And all because we wrote the thing in Bash...
Also, somebody give that tester a medal. There's no way in hell we would
have thought to actually *test* for such an obscure condition. (Not that
the tester did so intentionally... It was simply a lucky accident. But
you *know* some customer somewhere is going to do this one day, and
saying something succeeded when it didn't is a pretty serious bug!)
On 23.02.2013 12:34, Orchid Win7 v1 wrote:
> Also, somebody give that tester a medal. There's no way in hell we would
> have thought to actually *test* for such an obscure condition. (Not that
> the tester did so intentionally... It was simply a lucky accident. But
> you *know* some customer somewhere is going to do this one day, and
> saying something succeeded when it didn't is a pretty serious bug!)
Heh - I once had a work colleague like that: Be it karma, a genetic
disposition, or - as he used to put it - having "shit on his fingers" -
he had a supernatural talent for having things go wrong on him. He
actually took quite some pride in this mysterious gift of his, because
yes, of course, he did work as a tester. I swear this guy was /born/ to
test the holy crap out of things.
On 23/02/2013 1:01 PM, clipka wrote:
> Heh - I once had a work colleague like that: Be it karma, a genetic
> disposition, or - as he used to put it - having "shit on his fingers" -
> he had a supernatural talent for having things go wrong on him. He
> actually took quite some pride in this mysterious gift of his, because
> yes, of course, he did work as a tester. I swear this guy was /born/ to
> test the holy crap out of things.
I can do that. If anything can be broken, I can break it. ;-)
--
Regards
Stephen
On 23.02.2013 14:40, Stephen wrote:
> On 23/02/2013 1:01 PM, clipka wrote:
>> Heh - I once had a work colleague like that: Be it karma, a genetic
>> disposition, or - as he used to put it - having "shit on his fingers" -
>> he had a supernatural talent for having things go wrong on him. He
>> actually took quite some pride in this mysterious gift of his, because
>> yes, of course, he did work as a tester. I swear this guy was /born/ to
>> test the holy crap out of things.
>
> I can do that. If anything can be broken, I can break it. ;-)
Well, he didn't have to break things. Things broke by themselves at his
merest presence ;-)
Orchid Win7 v1 <voi### [at] devnull> wrote:
>
> At this point, things get WEIRD! I look in the drive bay, and see...
> daylight. It turns out there's NOTHING IN THERE. There *is* no harddrive!
>
That is too funny! The best laugh of the day.
>> At this point, things get WEIRD! I look in the drive bay, and see...
>> daylight. It turns out there's NOTHING IN THERE. There *is* no harddrive!
>
> That is too funny! The best laugh of the day.
Like I said, it had us all rolling around on the floor...
fun indeed
I mean, to put a windoze guy in charge of Linux stuff :)
> The install disk is a live Linux environment that runs a Bash script to
> do the installation. It does some crazy multi-way piping and redirecting
> to get the progress display to work. (Most Linux commands helpfully
> provide absolutely no feedback whatsoever, which isn't very good for an
> operation that takes 10 minutes to complete...) With all this piping
> going on, it wouldn't surprise me if some command somewhere is emitting
> an error message and it's simply been "lost" somewhere. A bit alarming
> that the script still claims that "installation was successful" though...
1. Redirect stderr to a different file. And check for that file being
more than 0 bytes before claiming the install completed successfully.
2. Expect the unexpected. I am currently having issues with a "serious"
IT company over an install script that does no error checking and where
there are multiple error-prone steps between zeroing out the old config
and recreating a new one, which can leave the machine pretty much
brain-dead if Something-Bad(tm) happens in the middle of the upgrade.
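
A rough sketch of point 1, combined with an exit-status check. The wrapper name run_quietly_or_fail is hypothetical. One caveat: some tools write to stderr even when they succeed (dd prints its transfer statistics there), so for those the exit status is the more reliable signal.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command with stderr captured in a temp file,
# and treat a non-zero exit status OR any stderr output as failure.
run_quietly_or_fail() {
    local errlog status=0
    errlog="$(mktemp)" || return 1
    "$@" 2> "$errlog" || status=$?
    # "-s" is true if the file exists and is larger than 0 bytes
    if [ "$status" -ne 0 ] || [ -s "$errlog" ]; then
        echo "step failed (exit status $status):" >&2
        cat "$errlog" >&2
        rm -f "$errlog"
        return 1
    fi
    rm -f "$errlog"
    return 0
}
```

Each install step would then be run as "run_quietly_or_fail some_command args", with the success message printed only after all of them have returned 0.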
--
/*Francois Labreque*/#local a=x+y;#local b=x+a;#local c=a+b;#macro P(F//
/* flabreque */L)polygon{5,F,F+z,L+z,L,F pigment{rgb 9}}#end union
/* @ */{P(0,a)P(a,b)P(b,c)P(2*a,2*b)P(2*b,b+c)P(b+c,<2,3>)
/* gmail.com */}camera{orthographic location<6,1.25,-6>look_at a }
Francois Labreque <fla### [at] videotronca> wrote:
> 1. Redirect stderr to a different file. And check for that file being
> more than 0 bytes before claiming the install completed successfully.
All programs should return an error code if an error happens. If a program
ends in error but returns a success code, that program is broken (and can
subsequently break other programs, such as 'make', which relies on programs
returning non-success on error.)
--
- Warp
On 24/02/2013 12:06 AM, Warp wrote:
> All programs should return an error code if an error happens. If a program
> ends in error but returns a success code, that program is broken
I agree. However, unfortunately it seems that by default Bash ignores
all such errors and happily proceeds, unless you manually suffix every
single command with an explicit return-code check.
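
For what it's worth, Bash does have options that change this default, although they are off unless the script asks for them. The sketch below runs the same three stand-in steps twice in a fresh bash process, once as-is and once with "set -euo pipefail" prepended; the steps and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Three stand-in "install steps"; the middle one fails.
steps='echo "step 1 done"; false; echo "installation successful"'

# Default behaviour: the failure is ignored and the script carries on.
default_status=0
default_out="$(bash -c "$steps")" || default_status=$?

# With errexit (-e), the first failing command ends the script.
# -u treats unset variables as errors; pipefail covers pipelines.
strict_status=0
strict_out="$(bash -c "set -euo pipefail; $steps")" || strict_status=$?

echo "default: status $default_status"
echo "$default_out"
echo "strict:  status $strict_status"
echo "$strict_out"
```

The default run prints both messages and exits 0; the strict run stops after "step 1 done" and exits 1. Note that errexit is ignored for commands tested by if, while, && or ||, so it reduces the need for explicit checks rather than removing it.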