  Tape-sort  
From: Invisible
Date: 12 Oct 2010 11:03:49
Message: <4cb478d5$1@news.povray.org>
I've been thinking about how you'd go about efficiently sorting data 
stored on magnetic tape.

The key thing about tape is that it's sequential. We'll also assume that 
it holds more data than you can actually fit into main memory, as was 
the case in the days of old.

You can of course wind tape forward and backwards, and presumably with a 
sufficiently expensive tape transport system you can do this quite 
accurately. You can't do it particularly fast, however. Fundamental 
limits such as friction prevent that. (If you wind the tape fast enough, 
it'll overheat and melt.)

Presumably the tape transport system isn't accurate enough to do an 
in-place sort on tape. (Even if it is, the seek times are probably going 
to be prohibitively high if we want to read a record, rewind one record, 
re-record that record, read the next record, rewind one record, 
re-record this record, etc.) This limits us to tape-to-tape operations.

So, we have this set of constraints:
- We have so much data that only tape is large enough to hold it.
- Tape provides efficient sequential access, but has very long seek times.
- Moving data around on a single tape is infeasible.
- Tape is presumably much slower than main memory for read/write access 
(in addition to the huge seek latency).
- Only a finite number of records will fit into main memory at once.
- We want to sort our data in the smallest amount of wall-clock time.

So what are our options here? Well, since in-place sorting is basically 
out of the question, we're going to need at least two tapes, an input 
and an output. We *could* do something like read one tape, bubble-sort 
it, and write the results onto the other tape. Then rewind both tapes, 
swap input and output, and start again. Repeat until the data is sorted.
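
(As a rough sketch of what one such pass might look like - Python, with
plain lists standing in for the tapes, made-up names, and only a single
record held in memory at a time:)

   def bubble_pass(input_tape, output_tape):
       # One sequential pass: hold one record back; whenever the next
       # record is smaller, write it out immediately, otherwise write
       # out the held record and hold the (larger) new one instead.
       held = None
       for record in input_tape:
           if held is None:
               held = record
           elif record < held:
               output_tape.append(record)
           else:
               output_tape.append(held)
               held = record
       if held is not None:
           output_tape.append(held)

   # e.g. with out = [], bubble_pass([5, 1, 4, 2, 3], out)
   # leaves out == [1, 4, 2, 3, 5]; repeat, swapping the tapes,
   # until nothing moves.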

Trouble is, this requires both tapes to be rewound on the order of N 
times. If we're sorting a few million records, we're going to have to 
read the tape end-to-end and rewind to the beginning a few million 
times. Since presumably even just rewinding a big reel of tape is quite 
a time-consuming operation (never mind winding it past the pick-up head slowly 
enough to read it), this is going to take a rather drastic amount of time.

At the opposite end of the spectrum, we could read as much data as 
possible into main memory, sort that chunk completely (which is just a 
normal in-memory sort, using any algorithm we like), and write the 
sorted block out to tape #1. We then read in the next chunk, sort, and 
write to tape #2. We continue in this fashion until there is no more 
data to read. Finally, we rewind all the tapes and do a multi-way merge 
operation.
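
The merge step itself is easy enough in principle - you just keep taking
the smallest record sitting at the front of any of the tapes. A quick
sketch in Python (lists standing in for the sorted tapes; heapq.merge
does the k-way merging lazily, so it only ever holds one record per tape
in memory):

   import heapq

   def multiway_merge(sorted_tapes, output_tape):
       # K-way merge: repeatedly emit the smallest head record among
       # all the sorted input tapes.
       for record in heapq.merge(*sorted_tapes):
           output_tape.append(record)

   runs = [[1, 5, 9], [2, 6, 7], [3, 4, 8]]   # three sorted "tapes"
   out = []
   multiway_merge(runs, out)                  # out == [1, 2, 3, ..., 9]

Memory isn't the problem here - one record per tape is nothing. The
catch is that it needs one tape drive per sorted run.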

Trouble is, if we have 1 million records to sort and only 1 thousand 
records fit in memory, we're going to need 1 thousand tapes and tape 
drives. (!) This, presumably, is prohibitively expensive.

So far, we have a 2-tape bubble-sort, which is absurdly slow, and an 
unlimited-tape merge-sort, which requires a ridiculous number of tape 
drives.

Let's back up for a moment here. What fundamental operations can we do?

- We can read a chunk of unsorted data into memory and completely sort 
it. Since main memory is fast and has random access, we can use any 
algorithm we like for this. And presumably this part of the task is so 
much faster than tape access that we can regard it as almost 
instantaneous by comparison.

- We can read data from one or more input tapes and write it out to one 
or more output tapes.

Looking at the second item, this basically amounts to allowing splits 
and merges:

- We can take unsorted input data and split it into several output 
ranges, as per one iteration of a quick-sort.

- We can take several tapes of sorted data and merge them together onto a 
single output tape, as per a single iteration of merge-sort.

Trouble is, both quick-sort and merge-sort are recursive algorithms, and 
to implement them on tape would require an unbounded number of tapes, 
which isn't really practical. Even if you can somehow arrange the data 
onto tapes so that you can fit lots of different tapes through only a 
limited number of tape drives, you're still going to spend more time 
rewinding tapes than actually doing useful processing.

This is where my musings on the min-heap come in. I was thinking, you 
could take an input tape, read data off of it into a min-heap, keep 
reading until main memory is full, and then remove the top item (the 
minimum element) and put it onto the output tape. Something like this:

   REPEAT
     READ NEXT RECORD FROM INPUT TAPE
     INSERT INTO MIN HEAP
     IF HEAP IS FULL
       REMOVE MINIMUM RECORD
       WRITE ONTO OUTPUT TAPE
   UNTIL END-OF-INPUT-TAPE
   REPEAT
     REMOVE MINIMUM RECORD
     WRITE ONTO OUTPUT TAPE
   UNTIL HEAP-IS-EMPTY
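
In Python it might look roughly like this (heapq for the min-heap, lists
standing in for the tapes, names made up):

   import heapq

   def heap_bubble_pass(input_tape, output_tape, heap_size):
       # One pass: keep up to heap_size records in a min-heap, and once
       # the heap is full, emit the smallest record for each new one read.
       heap = []
       for record in input_tape:
           heapq.heappush(heap, record)
           if len(heap) >= heap_size:            # heap is full
               output_tape.append(heapq.heappop(heap))
       while heap:                               # end of input: drain it
           output_tape.append(heapq.heappop(heap))

With heap_size = 2 this is exactly the bubble pass from earlier; repeat
the pass, swapping input and output tapes, until a pass comes out
already sorted.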

If you think about it, it's sort of like a bubble-sort, but instead of 
comparing just two items, you compare as many as you can hold in memory 
at once. Like a bubble-sort, it has the property that the "stones" fall 
to the bottom very quickly, but the "bubbles" rise to the top quite 
slowly. (If the heap holds H items, each bubble can only rise a maximum 
of H-1 slots further up the tape on each pass.)

So it's *like* a bubble-sort, but better. H times better, actually 
(where H is the maximum size of the heap). Indeed, for a 2-element 
min-heap, it degenerates into a real bubble-sort, of the kind we all 
know and hate.

Even so, a bubble-sort over N elements requires up to N passes, whereas 
this new hybrid sort requires something like N/H passes, where H is the 
maximum heap size. (Each pass can only move a bubble H-1 slots closer to 
home.) For a million items and a thousand-element heap, that's still 
around a thousand end-to-end passes of the tape. A huge improvement, but 
still a painfully large number.

What you can do is split the input data into many smaller datasets, and 
thus reduce the work required to sort them. You could read the input 
tape and partition it onto, say, 10 output tapes, and then repeatedly 
heap-bubble the 10 tapes in parallel until they're all sorted, and 
finally scan them all to merge them back together. That would result in 
roughly a 100-fold speed-up in the sorting process (each tape is a tenth 
the length, and needs only a tenth as many passes).
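
The split itself is simple enough once you know the pivot values.
Something like this sketch (bisect just finds which range a record falls
into; names made up, lists standing in for the tapes as before):

   import bisect

   def partition_pass(input_tape, output_tapes, pivots):
       # Distribute records onto len(pivots)+1 output tapes by value
       # range: records below pivots[0] go to tape 0, and so on.
       for record in input_tape:
           bucket = bisect.bisect_right(pivots, record)
           output_tapes[bucket].append(record)

Nine pivots split the data across ten output tapes; once each tape has
been sorted on its own, the final "merge" is just reading the tapes back
in pivot order.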

That got me thinking about how you determine the optimum pivot values. 
Ideally you'd want to look at the histogram of the data and move the 
pivot values until there's roughly the same amount of data between each 
pair. So first you'd need to scan the input tape to find the min and max 
values, and then maybe split the range between those into, say, 1000 
buckets, rewind the input tape and count how many items fall into each 
range.

Trouble is, the data might be clumped very unevenly. So maybe what we 
need to do is

1. Scan for the min and max values.
2. Linearly divide up the space and scan for a 1000-point histogram.
3. Partition the resulting histogram into 10 equal-looking ranges, split 
each new super-range into 100 sub-ranges, and scan the tape a third time 
for another 1000-point histogram (but now with unequal bucket sizes to 
hopefully better sample the data).
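
As a rough sketch of that pivot-picking step (given the bucket edges and
counts from one histogram pass, walk the running total and drop a pivot
each time roughly another tenth of the data has gone past - names and
interface made up):

   def pick_pivots(bucket_edges, bucket_counts, num_partitions):
       # Choose num_partitions-1 pivots so that, judging by the
       # histogram, each partition gets roughly equal record counts.
       total = sum(bucket_counts)
       target = total / num_partitions
       pivots, seen, next_cut = [], 0, target
       for edge, count in zip(bucket_edges[1:], bucket_counts):
           seen += count
           while seen >= next_cut and len(pivots) < num_partitions - 1:
               pivots.append(edge)     # cut at this bucket's upper edge
               next_cut += target
       return pivots

   # e.g. pick_pivots([0, 10, 20, 30, 40], [5, 50, 40, 5], 2)  ->  [20]

Note that a single very heavy bucket can produce duplicate pivots (and
hence empty partitions) - which is exactly the uneven-clumping problem
that the extra, finer histogram passes are meant to smooth out.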

Maybe 2 histogram passes isn't enough? Maybe the process could be made 
iterative, such that it repeats until the bucket edges stop changing? 
Perhaps there should be special handling for particularly sparse 
buckets? Come to think of it, does each of the final computed buckets 
even need to be a contiguous range? The sorting algorithm doesn't 
require that - but maybe the storage requirements of keeping track of 
all the buckets make it necessary...

Then again, all this assumes that we're going to carefully split the 
data and then just concat-merge it. You could of course do it the other 
way around - and *that* requires no such special processing. Just take 
the input data, copy it onto multiple tapes in any old fashion that you 
fancy, and do the careful bit when you come to merge the sorted results 
back together.

Ideally we want to do either a quick-sort or a merge-sort. But these 
both require recursive splitting or merging. We don't have enough tapes 
for that. We could put all the fragments onto the same tape, but now 
we're in danger of requiring a lot of seeking to find the data we want, 
and that'll be slow.

I did think to myself that you could begin by reading unsorted chunks, 
sorting them in-memory and writing them to tape. You now have a tape of 
C-item sorted chunks (C being however many records fit in memory). You 
could then read C/2 records from the first chunk, seek forward C/2 
records, read C/2 records from the second chunk, and write the C merged 
results to the output. Repeat until you 
get to the end of the input tape, and then rewind it to the beginning 
and go about merging the second halves of all the chunks.

Trouble is, you're going to have to end up seeking in bigger and bigger 
steps with each subsequent pass. Eventually it's going to get just too slow.

Then I thought to myself, why not have the chunks on two input tapes? 
That way, if the chunks are already sorted, you don't even need to hold 
the whole chunk in memory at once. You can just merge them as they're read 
off the two input tapes, and write the results onto the single output tape.

Single output tape. Yeah, that means that after the first merge cycle, 
you now have all your data on one tape, and you're going to have to 
start seeking around again. Except that, wait, you could have two output 
tapes, and just alternate between them as you're writing. So all the odd 
chunks go on one tape, and all the even chunks go on the other. And then 
you can rewind everything and start again in the opposite direction, 
with bigger chunks. Repeat until you're done.
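
I believe this is essentially the classic "balanced merge". A rough
simulation in Python - lists of runs standing in for the four tapes,
made-up names, and chunk_size being the largest chunk we can sort in
memory:

   import heapq

   def balanced_merge_sort(input_tape, chunk_size):
       # Balanced two-way merge on four simulated tapes.
       # Phase 1: write sorted chunks alternately onto tapes a and b.
       # Phase 2: merge run-pairs onto c and d, swap roles, repeat with
       #          runs of double the length, until one run remains.
       if not input_tape:
           return []
       a, b, c, d = [], [], [], []
       for i in range(0, len(input_tape), chunk_size):
           run = sorted(input_tape[i:i + chunk_size])   # in-memory sort
           (a if (i // chunk_size) % 2 == 0 else b).append(run)

       src, dst = (a, b), (c, d)
       while len(src[0]) + len(src[1]) > 1:
           out_idx = 0
           for j in range(max(len(src[0]), len(src[1]))):
               run_a = src[0][j] if j < len(src[0]) else []
               run_b = src[1][j] if j < len(src[1]) else []
               dst[out_idx].append(list(heapq.merge(run_a, run_b)))
               out_idx ^= 1                 # alternate output tapes
           src, dst = dst, src              # "rewind" all tapes, swap roles
           dst[0].clear()
           dst[1].clear()
       return src[0][0]

   # e.g. balanced_merge_sort(list(range(1000, 0, -1)), chunk_size=100)
   # sorts ten runs of 100 records in about log2(10), i.e. 4, merge passes.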

This approach requires you to have 4 tapes / tape drives. It also 
requires the tapes to be read / rewound / written / rewound 
approximately log2(N/C) times (where C is the initial chunk size - 
presumably the largest chunk you can sort in memory). For a million 
records and thousand-record chunks, that's only about ten passes. That 
doesn't sound too bad. Better still, there's now no seeking at all. Just 
sequential reads and writes.

Still, when a pass finishes, all four tapes are going to have to rewind 
to the beginning before the next pass can start. The computer itself is 
going to be doing nothing while we wait for the tapes to rewind. Is 
there some way that we can add a modest number of additional tape drives 
and keep the computer busy?

You could have two sets of 4 tapes, but they're all going to tend to end 
up rewinding at the same time. That doesn't seem to buy much.

You could have 4 "big" tapes and 4 "small" tapes, where you basically 
put half as much data onto the small tapes. At any time you're reading 
from two tapes and writing to two other tapes. When you're reading from 
two big tapes, you write half the data to two small tapes, rewind them 
and write to the other two small tapes while the first two small tapes 
are rewinding. By the time you've finished reading the two big tapes, 
the first two small tapes should have finished rewinding, so while the 
other tapes are rewinding, the two small tapes act as input and the two 
hitherto unused big tapes act as output. By the time you get to the end 
of the small tapes, the other two small tapes should have finished 
rewinding.

By this arrangement, I think you can keep the machine running 
constantly. There's always some tape rewound and ready for reading, and 
some other tapes rewound and ready for writing.

It seems that by adding more tape drives, it ought to be possible to get 
the job done even faster. I need to go meditate on this...

