http://lwn.net/Articles/250967/
An interesting read, I thought.
[See what I did there?]
Certainly much of this document is pretty much news to me. Last time I
looked into this stuff, the bus from the CPU to RAM ran at the same
speed as the rest of the system, and if you had a 16-bit address space,
you had 16 physical wires.
Also, last time I looked into this, CPUs ran at about 1 MHz. ;-)
I was aware that these days the RAM is about 20x slower than the CPU,
and that's why the CPU has to have a cache the size of a small planet to
achieve any performance whatsoever. But it's certainly news to me that
these days the communication involves a complex *protocol* instead of
just a direct passive electronic connection. Or that the refresh signal
has to come from outside the memory subsystem. Now I see why a "memory
controller" needs to exist.
And now they tell me that the latest designs use a serial link... this
seems entirely counter-intuitive. Sure, a serial link can be driven at a
higher clock rate more easily, but it would have to be *orders of
magnitude* higher to overcome the inherent slowness of a serial link in
the first place! Hmm, well I guess these guys must know what they're
doing, even if it doesn't make any sense. ;-)
I must confess to being confused about "fully associative" and "set
associative".
The info on TLBs was quite interesting too. And all that NUMA stuff too.
But you begin to wonder what us programmers can actually *do* about all
this stuff.
Most especially, when the document talks about laying out data in memory
in a certain way, and performing loops in a certain order to optimise
cache usage, I can't help thinking that although that might be quite
easy in C or C++, it's going to be 100% impossible in any moderately
high-level language. (Can you spell "garbage collection"?)
Obviously, the idea of high-level languages is that you work at a higher
level of abstraction. And that means, for example, that you might
implement some O(log n) algorithm instead of an O(n^3) algorithm,
[thereby making your code *millions* of times faster], rather than
fussing about cache hits and maybe gaining a few *hundred* percent speed
increase. But even so...
Also, this explains some of the weird benchmark results I've been
seeing. I wrote a merge-sort and made it run in parallel, and was rather
perplexed to see that this is *slower* than a purely sequential
algorithm. Now I begin to see why: sorting integers involves grabbing a
pair of data values, performing a simple numerical comparison, and then
moving them to somewhere else. I would imagine it's bounded by RAM
bandwidth rather than CPU power. Maybe trying to run two sorts in
parallel just saturates the FSB bandwidth twice as fast, so each CPU
spends longer waiting around?
--
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*