  Funniest bug ever  
From: Orchid Win7 v1
Date: 23 Feb 2013 06:34:07
Message: <5128a92f$1@news.povray.org>
OK, so as you guys know, I now write computer software for a living. The 
installation DVD for this thing used to be hand-crafted, but thanks to 
[mostly] my efforts it's now auto-generated by a script. Anyway, 
obviously if you completely replace the installation system, you need to 
go test it. So we're testing it.

So yesterday the tester is trying out the software on the various 
platforms we support it on. And he starts complaining that on one 
specific model of laptop, it's not working properly. He claims the 
installer gets to 60% of the first step, and then skips the rest of that 
step and skips the next step completely and just says "installation 
successful".

So I burn a copy of the same DVD and try installing it on another laptop 
of the same model. It works perfectly. But the tester is insistent; he 
hands me the laptop with the DVD still in the drive. I run the 
installer, and it appears to work fine. But then, around about 65%, it 
suddenly "completes", just like the guy said. That's odd.

The install disk is a live Linux environment that runs a Bash script to 
do the installation. It does some crazy multi-way piping and redirecting 
to get the progress display to work. (Most Linux commands helpfully 
provide absolutely no feedback whatsoever, which isn't very good for an 
operation that takes 10 minutes to complete...) With all this piping 
going on, it wouldn't surprise me if some command somewhere is emitting 
an error message and it's simply been "lost" somewhere. A bit alarming 
that the script still claims that "installation was successful" though...

At this point I'm wondering if maybe the disk has a scratch on it which 
prevents it reading past 65% of the image file. So I boot the disk and 
do an md5sum of both image files. They're both fine. Hmm. I manually run 
the key command that actually does the installation. Initially I'm 
getting 150 MB/sec - which is odd, given that the internal SSD device 
maxes out at about 40 MB/sec. After a few minutes, the speed drops to 
about 10 MB/sec - a more usual number. And then the command simply 
/stops/, with no indication as to why. It claims to have completed, but 
the amount of data copied is clearly too low.

So I take the DVD out and put in the one I've been testing with. I 
notice the DVD I took out has a slightly older version of the image 
file. But the installation script is identical, so that shouldn't matter 
at this stage.

I run the installer again, and again it fails the same way. I run it 
manually, and again it fails the same way. Now I'm wondering if I've 
somehow burned out that sector on the SSD or something. (But surely 
wear-levelling would... hmm, anyway.) So I look inside the harddrive bay 
to see how old the drive is...



At this point, things get WEIRD! I look in the drive bay, and see... 
daylight. It turns out there's NOTHING IN THERE. There *is* no harddrive!

This is not /that/ unusual; we do occasionally swap drives around or 
take drives out of machines. So somehow our test guy ended up with a 
laptop with no drive. That's not especially surprising.

But... so... if there's no drive, WHAT THE HELL IS THE INSTALLER 
INSTALLING TO?!? O_O

Why does it sit there for 5 minutes apparently working perfectly when 
there's NO DRIVE PRESENT?!

My colleague suggested that maybe someone had left an SD card in the 
internal card reader. But no. Man, that would have been fun though!

So I open up the Bash script and start reading. This is one of the few 
parts that I didn't write. (It was written early on, when I had hardly 
learned how to do Bash, so my boss wrote it.) The programmers among 
you will appreciate this: The script reports an error if the number of 
devices connected is GREATER THAN ONE. But the script neglects to check 
for the possibility that the number of devices is LESS THAN ONE. (!) 
Because, hey, who the hell has a laptop with no harddrive in it?
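
A minimal sketch of the missing check, assuming the device list is available as a list of paths. The function name require_exactly_one_device is hypothetical and is not taken from the real installer; the point is only that testing for "exactly one" rejects both too many devices and none at all.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when exactly one target device was found.
# The devices are passed as arguments so the check can be tried without hardware.
require_exactly_one_device() {
    local count=$#
    if [ "$count" -ne 1 ]; then
        echo "error: expected exactly 1 target device, found $count" >&2
        return 1
    fi
    echo "$1"   # print the single accepted device
}

# One device: accepted.
require_exactly_one_device /dev/sda

# No devices at all: rejected, which is the case the script never considered.
require_exactly_one_device || echo "refusing to install"
```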

OK, well that's fine I guess, but next the script checks that the 
[assumed to be] one device has sufficient space. Surely a non-existent 
device has zero space on it, right? RIGHT?? How in the HELL is this test 
passing?

Ah, but wait. You're thinking about this as if you're using a REAL 
programming language. Bash is just text-munging. It doesn't compare the 
contents of a variable to a number. It compares two bits of text to each 
other. And one of those pieces of text is now "error: the device 
/dev/sda does not exist", which, compared as text, sorts after 
"80000000000" (letters sort after digits), and hence is reported as 
being "greater than" the required disk size. 
FACEPALM!
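
The real script's comparison is not shown here, so the following is only an illustration of the general trap. In Bash, ">" inside [[ ]] compares two strings by sort order (not by length and not as numbers), so an error message starting with a letter comes out "greater" than a number starting with a digit. The helper is_big_enough is a hypothetical stricter check that rejects anything that is not a plain integer before comparing numerically.

```bash
#!/usr/bin/env bash
required=80000000000
reported="error: the device /dev/sda does not exist"   # what the size query returned

# Inside [[ ]], ">" compares strings by sort order, not as numbers.
# "e" sorts after "8", so the error message counts as "greater".
if [[ "$reported" > "$required" ]]; then
    echo "text comparison: disk looks big enough"
fi

# Hypothetical stricter check: refuse anything that is not a plain integer,
# then compare numerically with -ge.
is_big_enough() {
    local size=$1 need=$2
    case $size in
        ''|*[!0-9]*) return 2 ;;   # empty, or contains a non-digit
    esac
    [ "$size" -ge "$need" ]
}

if is_big_enough "$reported" "$required"; then
    echo "numeric comparison: disk looks big enough"
else
    echo "numeric comparison: size is missing or too small"
fi
```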

So there's no error produced by the script. But how the HELL does it 
image a device that DOESN'T SODDING EXIST?!

Ah, but wait. This is Linux, remember? Consider the following command:

   cat Image1.raw.gz | gzip -d | dd of=/dev/sda

Slowly it dawned on me what was happening here. If there *is* a block 
device connected, then UDev will generate a special file named /dev/sda 
which represents this device, and the above command will overwrite the 
contents of that device. HOWEVER... if there is *no* block device 
connected, then this file will not exist. In that case, rather than 
produce some kind of error, the above command will *create* a regular 
file named /dev/sda and try to decompress 20GB of data into it. (!)
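
One way to guard against this, sketched under the assumption that the target path is known before writing: the shell's -b test is true only if the path exists and is a block device. The function name is_block_device is hypothetical. The sketch only inspects the path and writes nothing.

```bash
#!/usr/bin/env bash
# Hypothetical guard: true only if the path exists AND is a block device.
# A missing path, or a regular file that happens to be named /dev/sda, fails.
is_block_device() {
    [ -b "$1" ]
}

target=/dev/sda   # device name taken from the post
if is_block_device "$target"; then
    echo "$target is a block device"
else
    echo "error: $target is not a block device, refusing to write" >&2
fi
```

GNU dd also accepts conv=nocreat, which makes it fail rather than create the output file when it does not already exist.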

Once upon a time, that might not have worked. A DVD is a read-only 
device, after all. But in this modern age, you have the DVD, which is 
read-only, and you then overlay a read/write filesystem backed by RAM.

In summary, the installer is decompressing a 20GB disk image into RAM, 
hence the 150MB/sec transfer speed. (Probably limited by DVD drive speed 
and the rate at which the CPU can decompress the data.) After about 5GB, 
free RAM has become so fragmented that the transfer rate drops to 
10MB/sec as the OS desperately searches for empty pages. And eventually, 
once ALL AVAILABLE RAM HAS BEEN EXHAUSTED, the process is summarily 
terminated.

If this was written in a real programming language, some sort of 
exception would have been thrown, which would have alerted me to the 
problem (and prevented the "installation successful" message being 
shown). But this is Bash. By default, it completely ignores all errors, 
problems and malfunctions, and continues executing the next command as 
if everything worked perfectly. So when DD gets terminated, Bash simply 
executes the next line of the script - which says "installation successful"!
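
A small sketch of how a script can notice a failed pipeline instead of carrying on. The function fake_install is a hypothetical stand-in for the real cat | gzip | dd pipeline, with a middle stage that always fails. By default a pipeline's exit status is that of its last command only; "set -o pipefail" makes it non-zero if any stage fails.

```bash
#!/usr/bin/env bash
set -o pipefail   # a pipeline now fails if ANY stage fails, not just the last

# Hypothetical stand-in for "cat image | gzip -d | dd of=...".
# The middle stage (false) fails; the last stage (cat) succeeds.
fake_install() {
    echo "data" | false | cat >/dev/null
}

# Only report success if the pipeline actually succeeded.
if fake_install; then
    echo "installation successful"
else
    echo "installation FAILED" >&2
fi
```

In the pipeline from the post, dd is the last stage, so its non-zero status after being killed would already be the pipeline's status even without pipefail; the script simply never looked at it.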

I told my boss, and he spent literally 15 minutes laughing 
uncontrollably. There's always a lot of banter in our office, but it's 
unusual for somebody to actually find something so funny that they're 
actually unable to speak any more. And for 15 minutes?

Damn, this is probably the most amusing bug I've ever seen.

Fortunately, the fix is very simple. You just need to add a check for 
the possibility of there being zero devices. But I love the way that 
there are three separate stages where the problem *should* have been 
caught, but wasn't. And all because we wrote the thing in Bash...

Also, somebody give that tester a medal. There's no way in hell we would 
have thought to actually *test* for such an obscure condition. (Not that 
the tester did so intentionally... It was simply a lucky accident. But 
you *know* some customer somewhere is going to do this one day, and 
saying something succeeded when it didn't is a pretty serious bug!)


