Insert-sort is an O(N^2) algorithm. It takes O(N) time to work out where
the next item needs to go, and it takes O(N) time to move all the data
down one slot in order to make room to insert the new item. And you have
to do this sequence of operations exactly N times. Hence, O(N^2).
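To make those two O(N) costs concrete, here's a rough C sketch of the classic version (function name and all details are mine):

    #include <stddef.h>

    /* Classic insert-sort: for each item, scan left to find its slot
       (O(N) worst case) and shift everything down one place as we go
       (also O(N)). Done roughly N times: O(N^2) overall. */
    void insert_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int item = a[i];
            size_t j = i;
            /* Search and shift in a single backwards pass. */
            while (j > 0 && a[j - 1] > item) {
                a[j] = a[j - 1];   /* move one slot down */
                j--;
            }
            a[j] = item;
        }
    }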
An obvious optimisation is to do a binary search over the already-sorted
part of the array. Then you can find the correct insert-point in
O(log N) time. But it'll still take you O(N) operations to shift
everything down one slot to make room.
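In code, only the search gets cheaper; the shift is still a memmove()
over up to N slots (again just a sketch of my own):

    #include <stddef.h>
    #include <string.h>

    /* Binary insert-sort: the search drops to O(log N), but the
       shift is still O(N) per insertion, so the whole sort stays
       O(N^2). */
    void binary_insert_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int item = a[i];
            size_t lo = 0, hi = i;
            while (lo < hi) {                 /* O(log N) search */
                size_t mid = lo + (hi - lo) / 2;
                if (a[mid] <= item)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            /* O(N) shift down one slot -- the expensive part. */
            memmove(&a[lo + 1], &a[lo], (i - lo) * sizeof a[0]);
            a[lo] = item;
        }
    }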
Now, suppose I built a machine containing special RAM chips, where
there's a special signal I can send that copies the contents of every
memory register in a specified range into the next register along.
Assuming all these shifts happen concurrently, I just made shifting an
O(1) operation. (!!)
Depending on how wide your array slots are, you might need to perform
several hardware-level shifts. But since the slot size is by definition
fixed, that's still O(1).
With this hoopy trick, insert-sort suddenly becomes an O(N log N)
algorithm. (!) It takes O(log N) operations to find the insert-point,
O(1) operations to move stuff, and O(1) to actually insert the item. We
need to do this N times. O(N log N) time complexity.
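If the hardware existed, the code would barely change. Here's the same
sketch with the memmove() swapped for an imagined primitive;
hw_shift_up() is pure invention on my part, faked here with an ordinary
memmove() so the thing actually compiles:

    #include <stddef.h>
    #include <string.h>

    /* Imagined O(1) primitive: copy a[lo..hi-1] into a[lo+1..hi],
       every register moving concurrently. No such hardware exists,
       so this stand-in fakes it with an ordinary O(N) memmove(). */
    static void hw_shift_up(int *a, size_t lo, size_t hi)
    {
        memmove(&a[lo + 1], &a[lo], (hi - lo) * sizeof a[0]);
    }

    /* On the special RAM: O(log N) search + O(1) shift + O(1)
       store, done N times over => O(N log N). */
    void hw_insert_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int item = a[i];
            size_t lo = 0, hi = i;
            while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                if (a[mid] <= item)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            hw_shift_up(a, lo, i);   /* O(1) on the imagined hardware */
            a[lo] = item;
        }
    }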
Better yet, this whole shifting operation doesn't need to go through the
CPU's cache hierarchy. We are completely avoiding the von Neumann
bottleneck here, and getting performance in a different complexity class
as a result.
It would of course be somewhat tricky to design a system where we can
affect an arbitrary subrange of the memory registers and have the time
taken be independent of the size of that subrange.
The most obvious thing to do would be to specify two memory addresses
and affect everything between them. Have the address decoder send
signals to each endpoint register, and have those signals propagate in
opposite directions until they meet. But the time required to do that
would be proportional to the size of the subrange.
A better idea is for the signals to propagate to several neighboring
registers at once: one 1 unit away, one 2 units away, one 4 units away,
and so forth. The covered distance then doubles with each round, so
spanning a subrange of length L takes only O(log L) rounds; and since
any given chip has a finite, fixed capacity, that's bounded by a
constant. Constant-time (and fast).
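To put a number on it, here's a toy model (entirely my own) counting
the rounds such doubling hops need to span a given subrange:

    #include <stdio.h>

    /* Toy model of the fan-out: each round, the signal doubles the
       distance it has covered (hops of 1, 2, 4, 8, ... units). The
       round count is ~log2(len), bounded by log2(chip capacity),
       i.e. a small constant for any fixed chip. */
    static unsigned rounds_to_cover(unsigned long len)
    {
        unsigned rounds = 0;
        unsigned long reach = 1;
        while (reach < len) {
            reach <<= 1;
            rounds++;
        }
        return rounds;
    }

    int main(void)
    {
        /* Even a billion-register subrange needs only 30 rounds. */
        printf("%u\n", rounds_to_cover(1000000000UL));
        return 0;
    }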
Of course, I don't seriously expect anybody to rush out and implement
this cool new optimisation. Why make insert-sort faster when you can
just use a better algorithm? It's just interesting that you can change
the number of operations something requires simply by redefining what
counts as an "operation".
Trouble is, this would require adding lots and lots of new circuitry to
the RAM chips, which would sit there doing nothing 99.9% of the time,
until somebody asks it to do an insert-sort. (I can't think of many
other potential uses for it.)
Still, while there seem to be rather stern limits on the bandwidth you
can have *between* chips, inside a single chip there's in principle very
much more bandwidth available. You could potentially move data between
millions of registers simultaneously.
(The tricky part is the fact that RAM almost always consists of more
than one discrete chip. But even if the CPU has to finish off the
transfer across the chip boundaries itself, we're still talking about a
breathtaking speedup.)
Now I'm wondering what other functions you could hypothetically get your
RAM chips to perform in parallel...