POV-Ray : Newsgroups : povray.off-topic : An idle observation

An idle observation
From: Invisible
Date: 2 Aug 2010 11:46:58
Message: <4c56e872$1@news.povray.org>
Insert-sort is an O(N^2) algorithm. It takes O(N) time to work out where 
the next item needs to go, and it takes O(N) time to move all the data 
down one slot in order to make room to insert the new item. And you have 
to do this sequence of operations exactly N times. Hence, O(N^2).
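That quadratic shape is easy to see in code. A quick sketch in Python (my illustration, not part of the original post):

```python
def insertion_sort(a):
    """Plain insert-sort: O(N^2) overall."""
    for i in range(1, len(a)):
        item = a[i]
        j = i - 1
        # O(N) worst case: walk left through the sorted prefix to find
        # the insert point, shifting each larger element down one slot
        # as we go -- the search and the shift happen together here.
        while j >= 0 and a[j] > item:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item
    return a
```

Both the search and the shift are folded into that inner loop, which is why separating them (as below) only helps the search half.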

An obvious optimisation is to do a binary search over the array. Then 
you can find the correct insert-point in O(log N) time. But still it'll 
take you O(N) operations to shift everything down one slot to make room.
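A sketch of that optimisation, using Python's standard `bisect` module for the binary search (again my illustration, not the original post's):

```python
from bisect import bisect_right

def binary_insertion_sort(a):
    for i in range(1, len(a)):
        item = a[i]
        # O(log N): binary search over the already-sorted prefix a[:i].
        pos = bisect_right(a, item, 0, i)
        # Still O(N): shift everything down one slot to make room.
        a[pos + 1 : i + 1] = a[pos:i]
        a[pos] = item
    return a
```

Fewer comparisons, but the slice assignment is still a linear-time data move, so the whole thing stays O(N^2).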

Now, suppose I built a machine containing special RAM chips, where 
there's a special signal I can send that copies the contents of every 
memory register in a specified range into the next register along. 
Assuming all these shifts happen concurrently, I just made shifting an 
O(1) operation. (!!)

Depending on how wide your array slots are, you might need to perform 
several hardware-level shifts. But since the slot size is by definition 
fixed, that's still O(1).

With this hoopy trick, insert-sort suddenly becomes an O(N log N) 
algorithm. (!) It takes O(log N) operations to find the insert-point, 
O(1) operations to move stuff, and O(1) to actually insert the item. We 
need to do this N times. O(N log N) time complexity.
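A toy sketch of how that would look, with the hypothetical hardware primitive stubbed out as an ordinary function (`shift_right` is a made-up name; in this simulation it's an O(N) slice copy, but on the imagined hardware it would be a single O(1) signal):

```python
from bisect import bisect_right

def shift_right(mem, lo, hi):
    """Stand-in for the hypothetical RAM primitive: copy every
    register in mem[lo:hi] into the next register along.
    Simulated here in O(N); O(1) on the imagined hardware."""
    mem[lo + 1 : hi + 1] = mem[lo:hi]

def hardware_insertion_sort(a):
    for i in range(1, len(a)):
        item = a[i]                        # save before the shift clobbers a[i]
        pos = bisect_right(a, item, 0, i)  # O(log N) search
        shift_right(a, pos, i)             # O(1) on the imagined hardware
        a[pos] = item                      # O(1) insert
    return a
```

Counting only the search as real work, the loop does O(log N) per insertion, N times: O(N log N).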

Better yet, this whole shifting operation doesn't need to go through the 
CPU's cache hierarchy. We are completely avoiding the von Neumann 
bottleneck here, and getting performance in a different complexity class 
as a result.

It would of course be somewhat tricky to design a system where we can 
affect a subset of the memory registers and have the time taken to do 
this be independent of the size of that subrange.

The most obvious thing to do would be to specify two memory addresses 
and affect everything between them. Have the address decoder send 
signals to each endpoint register, and have those signals propagate in 
opposite directions until they meet. But the time required to do that 
would be proportional to the size of the subrange.

A better idea is for the signals to propagate to several neighboring 
registers at once: one 1 unit away, one 2 units away, one 4 units away, 
and so forth. That reaches every register in the range within O(log N) 
steps; and since any given chip has a finite, fixed capacity, the number 
of steps is bounded by a small constant, so the whole thing is 
constant-time (and fast).
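A little simulation (my own sketch) of that fan-out scheme: each signalled register forwards the signal to registers 1, 2, 4, 8, ... units away per step, so covering a range of n registers from one endpoint takes at most about log2(n) steps:

```python
def broadcast_steps(n):
    """Simulate the doubling fan-out: starting from register 0,
    every signalled register forwards the signal to registers
    1, 2, 4, 8, ... units away each step.  Count the steps until
    all n registers in the range are signalled."""
    signalled = {0}
    steps = 0
    while len(signalled) < n:
        new = set()
        for r in signalled:
            d = 1
            while r + d < n:
                new.add(r + d)
                d *= 2
        signalled |= new
        steps += 1
    return steps
```

After t steps the signal has reached every register whose offset is a sum of at most t powers of two, so a range of 1024 registers is fully covered in 10 steps.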

Of course, I don't seriously expect anybody to rush out and implement 
this cool new optimisation. Why make insert-sort faster when you can 
just use a better algorithm? It's just interesting that you can change 
the number of operations something requires simply by redefining what 
counts as an "operation".

Trouble is, this would require adding lots and lots of new circuitry to 
the RAM chips, which would sit there doing nothing 99.9% of the time, 
until somebody asks it to do an insert-sort. (I can't think of many 
other potential uses for it.)

Still, while there seem to be rather stern limits on the bandwidth you 
can have *between* chips, inside a single chip there's in principle very 
much more bandwidth available. You could potentially move data between 
millions of registers simultaneously.

(The tricky part is the fact that RAM almost always consists of more 
than one discrete chip. But still, even if the CPU has to finish off the 
transfer between the chip edges, we're still talking about a 
breathtaking speedup.)

Now I'm wondering what other functions you could hypothetically get your 
RAM chips to perform in parallel...

