On 02/06/2011 03:06 PM, Mike Raiford wrote:
> So, I've been sort of reading this:
>
> http://download.intel.com/design/intarch/manuals/27320401.pdf
Woooo boy, that's one big can of complexity, right there! ;-)
> I've had a pretty good idea of how the 8086 and 8088 deal with the
> system bus. But, I wanted to understand more about the later generation
> processors, so I started with the Pentium.
I'm going to sound really old now, but... I remember back in the days
when the RAM was faster than the CPU. I remember in-order execution
without pipelines. When I first sat down to read about how modern IA32
processors actually work... I was pretty shocked. The circuitry seems to
spend more time deciding how to execute the code than actually, you
know, *executing* it! o_O
> I'm on the section dealing with bus arbitration and cache coherency when
> there are 2 processors in the system.
> It occurs to me that handling the cache when there are 2 parts vying for
> the same resource can get rather messy.
> I can definitely see some potential for bottle-necks in a
> multi-processor system when dealing with the bus, since electrically,
> only one device can place data on the bus at one time. The nice thing
> is, in system design, you can design your bus the same for single or
> dual processors. Provided you've wired the proper signals together, and
> initialized the processors properly with software and certain pin
> levels, it's totally transparent to the rest of the system.
There are two main problems with trying to connect two CPUs to one RAM
block:
1. Bandwidth.
2. Cache coherence.
Currently, RAM is way, way slower than the CPU. Adding more CPU cores
simply makes things worse. Unless you're doing a specific type of
compute-heavy process that doesn't require much RAM access, all you're
doing is taking the system's main bottleneck and making it twice as bad.
Instead of having /one/ super-fast CPU sitting around waiting for RAM to
respond, now you have /two/ super-fast CPUs fighting over who gets to
access RAM next.
(I've seen plenty of documentation from the Haskell developers where
they benchmark something, and then find that it goes faster if you turn
*off* parallelism, because otherwise the multiple cores fight for
bandwidth or constantly invalidate each other's caches.)
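That cache-invalidation fight is easy to see in a toy model. This is a sketch I made up (nothing to do with Haskell's actual runtime): two "cores" take turns writing variables, and any write to a cache line the other core last wrote forces an invalidation. Put the two variables 8 bytes apart and they share a 64-byte line, so every single write ping-pongs the line; pad them 64 bytes apart and the fighting vanishes.

```python
# Toy model of cache-line "ping-pong": two cores repeatedly writing
# variables that happen to share one 64-byte cache line. Each write by
# one core invalidates the other core's copy, forcing a reload.
# (Deliberately simplified -- not a real coherence protocol.)

LINE_SIZE = 64  # bytes per cache line (typical, but an assumption here)

def line_of(addr):
    return addr // LINE_SIZE

def count_invalidations(writes):
    """writes: list of (core_id, address) in program order.
    Returns how many writes forced the *other* core's cached
    copy of that line to be invalidated."""
    owner = {}          # cache line -> core that last wrote it
    invalidations = 0
    for core, addr in writes:
        line = line_of(addr)
        if line in owner and owner[line] != core:
            invalidations += 1   # snoop hit: other core must invalidate
        owner[line] = core
    return invalidations

# Two counters 8 bytes apart: same cache line -> constant ping-pong.
shared = [(i % 2, (i % 2) * 8) for i in range(10)]
# Two counters 64 bytes apart: different lines -> no ping-pong.
padded = [(i % 2, (i % 2) * 64) for i in range(10)]

print(count_invalidations(shared))  # 9
print(count_invalidations(padded))  # 0
```

This is exactly why "false sharing" is a thing: logically independent variables that happen to sit on the same line behave, coherence-wise, as if they were shared.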
It seems inevitable that the more memory you have, the slower it is.
Even if we completely ignore issues of cost, bigger memory = more
address decode logic + longer traces. Longer traces mean more power,
more latency, more capacitance, more radio interference. So a really
huge memory is necessarily going to be fairly slow.
Similarly, if you have huge numbers of processing elements all trying to
access the same chunk of RAM through the same bus, you're splitting the
available bandwidth many, many ways. The result isn't going to be high
performance.
My personal belief is that the future is NUMA. We already have cache
hierarchies three or four levels deep, backed by demand-paged virtual
memory. Let's just stop pretending that "memory" is all uniform with
identical latency and start explicitly treating it as what it is.
Indeed, sometimes I start thinking about what some kind of hyper-Harvard
architecture would look like. For example, what happens if the machine
stack is on-chip? (I.e., it exists in its own unique address space, and
the circuitry for it is entirely on-chip.) What about if there were
several on-chip memories optimised for different types of operation?
Unfortunately, the answer to all these questions generally ends up being
"good luck implementing multitasking".
The other model I've looked at is having not a computer with one giant
memory connected to one giant CPU, but zillions of smallish memories
connected to zillions of processing elements. The problem with *that*
model tends to be "how do I get my data to where I need it?"
As someone else once said, "a supercomputer is a device for turning a
compute-bound problem into an I/O-bound problem".
> What I understand so far is: One processor has the bus, and either reads
> or writes from/to the bus. The other processor watches the activity and,
> if it sees an address it has modified it tells the other processor,
> which passes bus control to the other, puts the data out on the bus,
> then returns control to the first processor.
http://en.wikipedia.org/wiki/Cache_coherence
http://en.wikipedia.org/wiki/MESI_protocol
Looks like what you're describing is a "snooping" implementation, where
every cache watches the shared bus for addresses it holds. The main
alternative is a directory-based protocol, where a central directory
tracks which caches hold which lines -- that scales better once you have
more caches than can usefully sit snooping on one bus.
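For concreteness, here's the MESI state machine from that second link, written out as a transition table for one cache line in one cache. (This is a simplified sketch -- e.g. a real cache loads into Exclusive rather than Shared when no other cache holds the line; I've collapsed that case.)

```python
# Minimal sketch of MESI state transitions for ONE cache line in ONE
# cache, reacting to local accesses and to bus traffic it snoops.
# States: M(odified), E(xclusive), S(hared), I(nvalid).

# (state, event) -> next state; events are:
#   "read"       local read           "write"      local write
#   "bus_read"   another cache reads  "bus_write"  another cache writes
MESI = {
    ("I", "read"):      "S",  # fill from memory (simplified: assume shared)
    ("I", "write"):     "M",  # read-for-ownership, then modify
    ("S", "read"):      "S",
    ("S", "write"):     "M",  # broadcast invalidate to other sharers
    ("S", "bus_read"):  "S",
    ("S", "bus_write"): "I",  # someone else wrote: our copy is stale
    ("E", "read"):      "E",
    ("E", "write"):     "M",  # silent upgrade: no bus traffic needed
    ("E", "bus_read"):  "S",
    ("E", "bus_write"): "I",
    ("M", "read"):      "M",
    ("M", "write"):     "M",
    ("M", "bus_read"):  "S",  # supply dirty data, drop to Shared
    ("M", "bus_write"): "I",  # supply dirty data, then invalidate
}

def run(events, state="I"):
    """Fold a sequence of events through the table."""
    for ev in events:
        state = MESI[(state, ev)]
    return state

print(run(["read", "write", "bus_read"]))  # I -> S -> M -> S: prints S
```

The "puts the data out on the bus" step you described is the M + bus_read case: the owning cache intervenes, supplies the dirty line, and drops back to Shared.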
> Apparently, the bus can also be pipelined. I'm not exactly sure how this
> works, but the processors then have to agree on whether the operation
> actually can be put in a pipeline.
Again, when I learned this stuff, RAM was the fastest thing in the
system. You just send the address you want on the address bus and read
back (or write) the data on the data bus.
These days, the CPU is the fastest thing. You can "pipeline" read
requests by requesting a contiguous block of data. That way, you
eliminate some of the latency of sending a memory request for each
datum. (And, let's face it, you only talk to main memory to fill or
empty cache lines, which are contiguous anyway.)
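The arithmetic behind that amortisation is simple. The cycle counts below are made-up illustrative numbers, not any real chip's timings -- the point is just that the fixed per-request latency gets paid once per burst instead of once per word:

```python
# Back-of-envelope model of why burst ("pipelined") reads help: pay the
# address/setup latency once, then stream one word per bus clock.
# SETUP_CYCLES and CYCLES_PER_WORD are invented numbers for illustration.

SETUP_CYCLES = 10    # latency to present the address and set up the access
CYCLES_PER_WORD = 1  # each subsequent word in the burst

def single_reads(n_words):
    """Each word requested individually: pay setup every time."""
    return n_words * (SETUP_CYCLES + CYCLES_PER_WORD)

def burst_read(n_words):
    """One request for a contiguous block (e.g. a cache line fill)."""
    return SETUP_CYCLES + n_words * CYCLES_PER_WORD

# A 64-byte cache line over a 64-bit bus = 8 words:
print(single_reads(8))  # 88 cycles
print(burst_read(8))    # 18 cycles
```

So with these (invented) numbers, filling a cache line by burst is almost 5x faster than fetching each word separately -- and the bigger the gap between CPU and RAM speed, the bigger SETUP_CYCLES gets relative to the streaming rate.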
I understand this also interacts with the way the RAM chips internally
do two-dimensional addressing: the flat address is split into a row
number and a column number, strobed in separately (RAS, then CAS), and
accesses within an already-open row are much cheaper...
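A sketch of that address split -- the bit widths here are arbitrary example values, not any specific part's geometry:

```python
# Sketch of how a DRAM chip splits a flat word address into a row
# number and a column number (the "two-dimensional" addressing).
# ROW_BITS/COL_BITS are example values, not a real chip's geometry.

ROW_BITS = 13   # 8192 rows
COL_BITS = 10   # 1024 columns per row

def split_address(addr):
    """Return (row, col) as the DRAM sees them: the row is strobed
    first (RAS), then the column (CAS)."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    return row, col

# Consecutive addresses share a row, so after the row is opened once,
# further accesses in that row only pay the (cheaper) column latency.
print(split_address(0))      # (0, 0)
print(split_address(1023))   # (0, 1023)  -- last column of row 0
print(split_address(1024))   # (1, 0)     -- row crossing: reopen
```

Which is exactly why bursts of contiguous addresses are cheap: they mostly stay inside one open row.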