I've been thinking about how you'd go about efficiently sorting data
stored on magnetic tape.
The key thing about tape is that it's sequential. We'll also assume that
it holds more data than you can actually fit into main memory, as was
the case in the days of old.
You can of course wind tape forwards and backwards, and presumably with a
sufficiently expensive tape transport system you can do this quite
accurately. You can't do it particularly fast, however. Fundamental
limits such as friction prevent that. (If you wind the tape fast enough,
it'll overheat and melt.)
Presumably the tape transport system isn't accurate enough to do an
in-place sort on tape. (Even if it is, the seek times are probably going
to be prohibitively high if we want to read a record, rewind one record,
re-record that record, read the next record, rewind one record,
re-record this record, etc.) This limits us to tape-to-tape operations.
So, we have this set of constraints:
- We have so much data that only tape is large enough to hold it.
- Tape provides efficient sequential access, but has very long seek times.
- Moving data around on a single tape is infeasible.
- Tape is presumably much slower than main memory for read/write access
(in addition to the huge seek latency).
- Only a finite number of records will fit into main memory at once.
- We want to sort our data in the smallest amount of wall-clock time.
So what are our options here? Well, since in-place sorting is basically
out of the question, we're going to need at least two tapes, an input
and an output. We *could* do something like read one tape, bubble-sort
it, and write the results onto the other tape. Then rewind both tapes,
swap input and output, and start again. Repeat until the data is sorted.
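To make that concrete, here's a rough Python sketch of the scheme, with a plain list standing in for each tape. The tape handling is obviously hand-waved and the function names are just mine; it's only meant to show the shape of the idea.

    def bubble_pass(input_tape, output_tape):
        # One tape-to-tape bubble pass: hold a single record in memory and
        # always write out the smaller of (held record, next record read).
        # Each pass carries the largest remaining record to the end.
        records = iter(input_tape)
        try:
            held = next(records)
        except StopIteration:
            return
        for record in records:
            if record < held:
                output_tape.append(record)
            else:
                output_tape.append(held)
                held = record
        output_tape.append(held)

    def two_tape_bubble_sort(tape_a, tape_b):
        # Swap the input and output roles after every pass until sorted.
        while any(a > b for a, b in zip(tape_a, tape_a[1:])):
            tape_b.clear()
            bubble_pass(tape_a, tape_b)
            tape_a, tape_b = tape_b, tape_a
        return tape_a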
Trouble is, this requires both tapes to be rewound on the order of N
times, where N is the number of records. If we're sorting a few million
records, we're going to have to read the tape end-to-end and rewind to
the beginning a few million times. Since presumably even just rewinding
a big reel of tape is quite a time-consuming operation (never mind
winding it past the pick-up head slowly enough to read it), this is
going to take a rather drastic amount of time.
At the opposite end of the spectrum, we could read as much data as
possible into main memory, sort that chunk completely (which is just a
normal in-memory sort, using any algorithm we like), and write the
sorted block out to tape #1. We then read in the next chunk, sort, and
write to tape #2. We continue in this fashion until there is no more
data to read. Finally, we rewind all the tapes and do a multi-way merge
operation.
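In Python terms the whole scheme looks something like the sketch below, with lists standing in for tapes and heapq.merge doing the multi-way merge that the bank of tape drives would be doing for real. The name memory_capacity is just an illustrative parameter, not anything from a real tape system.

    import heapq

    def make_sorted_runs(input_tape, memory_capacity):
        # Read a memory-sized chunk, sort it entirely in RAM, and write it
        # out as its own "tape"; repeat until the input is exhausted.
        run_tapes = []
        for start in range(0, len(input_tape), memory_capacity):
            run_tapes.append(sorted(input_tape[start:start + memory_capacity]))
        return run_tapes

    def multiway_merge(run_tapes, output_tape):
        # One sequential pass over every run tape at once: heapq.merge only
        # ever needs the front record of each run, which is exactly what a
        # bank of tape drives could feed it.
        for record in heapq.merge(*run_tapes):
            output_tape.append(record)

For a million records and room for a thousand in memory, make_sorted_runs hands back a thousand runs, which is where the next objection comes from.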
Trouble is, if we have 1 million records to sort and only 1 thousand
records fit in memory, we're going to need 1 thousand tapes and tape
drives. (!) This, presumably, is prohibitively expensive.
So far, we have a 2-tape bubble-sort, which is absurdly slow, and an
unlimited-tape merge-sort, which requires a ridiculous number of tape
drives.
Let's back up for a moment here. What fundamental operations can we do?
- We can read a chunk of unsorted data into memory and completely sort
it. Since main memory is fast and has random access, we can use any
algorithm we like for this. And presumably this part of the task is so
much faster than tape access that we can regard it as almost
instantaneous by comparison.
- We can read data from one or more input tapes and write it out to one
or more output tapes.
Looking at the second item, this basically amounts to allowing splits
and merges:
- We can take unsorted input data and split it into several output
ranges, as per one iteration of a quick-sort.
- We can take several tapes of sorted data and merge them together onto
a single output tape, as per a single iteration of merge-sort.
Trouble is, both quick-sort and merge-sort are recursive algorithms, and
to implement them on tape would require an unbounded number of tapes,
which isn't really practical. Even if you can somehow arrange the data
onto tapes so that you can fit lots of different tapes through only a
limited number of tape drives, you're still going to spend more time
rewinding tapes than actually doing useful processing.
This is where my musings on the min-heap come in. I was thinking, you
could take an input tape, read data off of it into a min-heap, keep
reading until main memory is full, and then remove the top item (the
minimum element) and put it onto the output tape. Something like this:
    REPEAT
        READ NEXT RECORD FROM INPUT TAPE
        INSERT INTO MIN HEAP
        IF HEAP IS FULL
            REMOVE MINIMUM RECORD
            WRITE ONTO OUTPUT TAPE
    UNTIL END-OF-INPUT-TAPE

    REPEAT
        REMOVE MINIMUM RECORD
        WRITE ONTO OUTPUT TAPE
    UNTIL HEAP-IS-EMPTY
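In Python terms (heapq standing in for the min-heap, lists for the tapes), a single pass of that looks something like this sketch:

    import heapq

    def heap_bubble_pass(input_tape, output_tape, heap_capacity):
        # Keep up to heap_capacity records in a min-heap; whenever the heap
        # fills up, emit its smallest record to the output tape.
        heap = []
        for record in input_tape:
            heapq.heappush(heap, record)
            if len(heap) == heap_capacity:
                output_tape.append(heapq.heappop(heap))
        # End of input tape: drain whatever is still in the heap, in order.
        while heap:
            output_tape.append(heapq.heappop(heap))

Repeated passes, rewinding and swapping the tapes in between exactly as in the bubble-sort scheme above, eventually leave the tape fully sorted.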
If you think about it, it's sort of like a bubble-sort, but instead of
comparing just two items, you compare as many as you can hold in memory
at once. Like a bubble-sort, it has the property that the "stones" fall
to the bottom very quickly, but the "bubbles" rise to the top quite
slowly. (If the heap holds H records, each bubble can only rise a maximum
of H-1 slots closer to the start of the tape on each pass.)
So it's *like* a bubble-sort, but better. Roughly H times better,
actually (where H is the maximum size of the heap). Indeed, for a
2-element min-heap, it degenerates into a real bubble-sort, of the kind
we all know and hate.
A bubble-sort over N records needs up to N passes, whereas this hybrid
sort needs only about N/H passes, because each pass lets a low record
jump H-1 slots towards the front instead of just one. For a million
items and a thousand-element heap, that's roughly a thousand end-to-end
passes instead of a million. A huge improvement, but a thousand complete
read-and-rewind cycles over a full reel of tape is still a colossal
amount of tape time.
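A quick back-of-the-envelope check of those pass counts, taking reverse-ordered input as the worst case (the lowest record starts at the far end of the tape and can only move H-1 slots closer to the front on each pass):

    import math

    N = 1_000_000   # records on the tape
    H = 1_000       # records that fit in the in-memory heap

    bubble_passes = N - 1                              # plain bubble-sort (2-record heap)
    heap_bubble_passes = math.ceil((N - 1) / (H - 1))  # H-record heap

    print(bubble_passes, heap_bubble_passes)           # 999999 versus 1001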
What you can do is split the input data into several smaller datasets,
and thus reduce the work required to sort each one. You could read the
input tape and partition it onto, say, 10 output tapes, and then
repeatedly heap-bubble the 10 tapes in parallel until they're all
sorted, and finally scan them all to merge them back together. Each tape
is a tenth as long, so each pass is ten times quicker; with 10 pairs of
drives running in parallel that's about a 10-fold speed-up in wall-clock
time, and if each tape could also have its own full-sized heap (rather
than ten heaps sharing one machine's memory) the pass counts drop by a
factor of ten too, for something closer to 100-fold.
That got me thinking about how you determine the optimum pivot values.
Ideally you'd want to look at the histogram of the data and move the
pivot values until there's roughly the same amount of data between each
pair. So first you'd need to scan the input tape to find the min and max
values, and then maybe split the range between those into, say, 1000
buckets, rewind the input tape and count how many items fall into each
range.
Trouble is, the data might be clumped very unevenly. So maybe what we
need to do is
1. Scan for the min and max values.
2. Linearly divide up the space and scan for a 1000-point histogram.
3. Partition the resulting histogram into 10 ranges holding roughly
equal record counts, split each new super-range into 100 sub-ranges, and
scan the tape a third time for another 1000-point histogram (now with
unequal bucket widths that hopefully sample the data better). A sketch
of passes 2 and 3 follows below.
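A rough sketch of passes 2 and 3, with lists standing in for tape reads again; num_buckets, num_ranges and the helper names are purely illustrative, and bisect just answers "which bucket does this record fall into".

    import bisect

    def equal_width_edges(lo, hi, num_buckets):
        # Pass 2 set-up: divide [lo, hi] linearly into num_buckets ranges.
        width = (hi - lo) / num_buckets
        return [lo + i * width for i in range(1, num_buckets)]

    def histogram(tape, edges):
        # One sequential pass: count how many records land in each bucket.
        counts = [0] * (len(edges) + 1)
        for record in tape:
            counts[bisect.bisect_right(edges, record)] += 1
        return counts

    def rebalance(edges, counts, num_ranges):
        # Pass 3 set-up: walk the histogram and cut it into num_ranges
        # pieces holding roughly equal numbers of records; the resulting
        # (unequal-width) boundaries are the pivot values for partitioning.
        target = sum(counts) / num_ranges
        boundaries, running = [], 0
        for edge, count in zip(edges, counts):
            running += count
            if running >= target and len(boundaries) < num_ranges - 1:
                boundaries.append(edge)
                running = 0
        return boundaries

    # e.g. after pass 1 has found lo and hi:
    #   edges  = equal_width_edges(lo, hi, 1000)
    #   counts = histogram(input_tape, edges)     # pass 2
    #   pivots = rebalance(edges, counts, 10)     # boundaries of the 10 ranges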
Maybe 2 histogram passes aren't enough? Maybe the process could be made
iterative, such that it repeats until the bucket edges stop changing?
Perhaps there should be special handling for particularly sparse
buckets? Come to think of it, does each of the final computed buckets
even need to be a contiguous range? The sorting algorithm doesn't
require that - but maybe the storage requirements of keeping track of
all the buckets make it necessary...
Then again, all this assumes that we're going to carefully split the
data and then just concat-merge it. You could of course do it the other
way around - and *that* requires no such special processing. Just take
the input data, copy it onto multiple tapes in any old fashion that you
fancy, and do the careful bit when you come to merge the sorted results
back together.
Ideally we want to do either a quick-sort or a merge-sort. But these
both require recursive splitting or merging. We don't have enough tapes
for that. We could put all the fragments onto the same tape, but now
we're in danger of requiring a lot of seeking to find the data we want,
and that'll be slow.
I did think to myself that you could begin by reading unsorted chunks,
sorting them in-memory and writing them to tape. You now have a tape of
sorted chunks of C records each (C being however many records fit in
memory). You could then read C/2 records from the first chunk, seek
forward C/2 records, read C/2 records from the second chunk, and write
the C merged results to the output. Repeat until you get to the end of
the input tape, and then rewind it to the beginning and go about merging
the second halves of all the chunks.
Trouble is, you're going to have to end up seeking in bigger and bigger
steps with each subsequent pass. Eventually it's going to get just too slow.
Then I thought to myself, why not have the chunks on two input tapes?
That way, if the chunks are already sorted, you don't even need to hold
the whole chunk in memory at once. You can just merge them as they're
read off the two input tapes, and write the results onto the single
output tape.
Single output tape. Yeah, that means that after the first merge cycle,
you now have all your data on one tape, and you're going to have to
start seeking around again. Except that, wait, you could have two output
tapes, and just alternate between them as you're writing. So all the odd
chunks go on one tape, and all the even chunks go on the other. And then
you can rewind everything and start again in the opposite direction,
with bigger chunks. Repeat until you're done.
This approach requires you to have 4 tapes / tape drives. It also
requires the tapes to be read / rewound / written / rewound
approximately log2 N - log2 C times (where N is the total number of
records and C is the initial chunk size - presumably the largest chunk
size you can sort in-memory). That doesn't
sound too bad. Better still, there's now no seeking at all. Just
sequential reads and writes.
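Here's a sketch of that balanced four-tape merge in Python, with each tape represented as a list of sorted chunks. A real tape program would track chunk boundaries by counting records rather than keeping separate lists, and the function names are mine.

    import heapq

    def merge_pass(in_a, in_b, out_a, out_b):
        # Merge the i-th chunk of each input tape into one double-length
        # chunk, and alternate which output tape receives it.
        outputs = [out_a, out_b]
        for i in range(max(len(in_a), len(in_b))):
            chunk_a = in_a[i] if i < len(in_a) else []
            chunk_b = in_b[i] if i < len(in_b) else []
            outputs[i % 2].append(list(heapq.merge(chunk_a, chunk_b)))

    def four_tape_merge_sort(runs):
        # runs: the initial in-memory-sorted chunks, distributed alternately
        # across the two input tapes. Keep merging, swapping the roles of
        # the input and output pairs, until a single chunk remains.
        in_a, in_b = runs[0::2], runs[1::2]
        while len(in_a) + len(in_b) > 1:
            out_a, out_b = [], []
            merge_pass(in_a, in_b, out_a, out_b)
            in_a, in_b = out_a, out_b
        return in_a[0] if in_a else []

The while loop runs the log2 N - log2 C times estimated above: each pass doubles the chunk length until only one chunk is left.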
Still, when a pass finishes, all four tapes are going to have to rewind
to the beginning before the next pass can start. The computer itself is
going to be doing nothing while we wait for the tapes to rewind. Is
there some way that we can add a modest number of additional tape drives
and keep the computer busy?
You could have two sets of 4 tapes, but they're all going to tend to end
up rewinding at the same time. That doesn't seem to buy much.
You could have 4 "big" tapes and 4 "small" tapes, where you basically
put half as much data onto the small tapes. At any time you're reading
from two tapes and writing to two other tapes. When you're reading from
two big tapes, you write half the data to two small tapes, rewind them
and write to the other two small tapes while the first two small tapes
are rewinding. By the time you've finished reading the two big tapes,
the first two small tapes should have finished rewinding, so while the
other tapes are rewinding, the two small tapes act as input and the two
hitherto unused big tapes act as output. By the time you get to the end
of the small tapes, the other two small tapes should have finished
rewinding.
By this arrangement, I think you can keep the machine running
constantly. There's always some tape rewound and ready for reading, and
some other tapes rewound and ready for writing.
It seems that by adding more tape drives, it ought to be possible to get
the job done even faster. I need to go meditate on this...