On 02/06/2011 03:06 PM, Mike Raiford wrote:
> So, I've been sort of reading this:
>
> http://download.intel.com/design/intarch/manuals/27320401.pdf
Woooo boy, that's one big can of complexity, right there! ;-)
> I've had a pretty good idea of how the 8086 and 8088 deal with the
> system bus. But, I wanted to understand more about the later generation
> processors, so I started with the Pentium.
I'm going to sound really old now, but... I remember back in the days
when the RAM was faster than the CPU. I remember in-order execution
without pipelines. When I first sat down to read about how modern IA32
processors actually work... I was pretty shocked. The circuitry seems to
spend more time deciding how to execute the code than actually, you
know, *executing* it! o_O
> I'm on the section dealing with bus arbitration and cache coherency when
> there are 2 processors in the system.
> It occurs to me that handling the cache when there are 2 parts vying for
> the same resource can get rather messy.
> I can definitely see some potential for bottle-necks in a
> multi-processor system when dealing with the bus, since electrically,
> only one device can place data on the bus at one time. The nice thing
> is, in system design, you can design your bus the same for single or
> dual processors. Provided you've wired the proper signals together, and
> initialized the processors properly with software and certain pin
> levels, it's totally transparent to the rest of the system.
There are two main problems with trying to connect two CPUs to one RAM
block:
1. Bandwidth.
2. Cache coherence.
Currently, RAM is way, way slower than the CPU. Adding more CPU cores
simply makes things worse. Unless you're doing a specific type of
compute-heavy process that doesn't require much RAM access, all you're
doing is taking the system's main bottleneck and making it twice as bad.
Instead of having /one/ super-fast CPU sitting around waiting for RAM to
respond, now you have /two/ super-fast CPUs fighting over who gets to
access RAM next.
(I've seen plenty of documentation from the Haskell developers where
they benchmark something, and then find that it goes faster if you turn
*off* parallelism, because otherwise the multiple cores fight for
bandwidth or constantly invalidate each other's caches.)
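That cache-invalidation fight is easy to see in a toy model. This is a sketch I made up (nothing to do with Haskell's actual runtime): two "cores" take turns writing variables, and any write to a cache line the other core last wrote forces an invalidation. Put the two variables 8 bytes apart and they share a 64-byte line, so every single write ping-pongs the line; pad them 64 bytes apart and the fighting vanishes.

```python
# Toy model of cache-line "ping-pong": two cores repeatedly writing
# variables that happen to share one 64-byte cache line. Each write by
# one core invalidates the other core's copy, forcing a reload.
# (Deliberately simplified -- not a real coherence protocol.)

LINE_SIZE = 64  # bytes per cache line (typical, but an assumption here)

def line_of(addr):
    return addr // LINE_SIZE

def count_invalidations(writes):
    """writes: list of (core_id, address) in program order.
    Returns how many writes forced the *other* core's cached
    copy of that line to be invalidated."""
    owner = {}          # cache line -> core that last wrote it
    invalidations = 0
    for core, addr in writes:
        line = line_of(addr)
        if line in owner and owner[line] != core:
            invalidations += 1   # snoop hit: other core must invalidate
        owner[line] = core
    return invalidations

# Two counters 8 bytes apart: same cache line -> constant ping-pong.
shared = [(i % 2, (i % 2) * 8) for i in range(10)]
# Two counters 64 bytes apart: different lines -> no ping-pong.
padded = [(i % 2, (i % 2) * 64) for i in range(10)]

print(count_invalidations(shared))  # 9
print(count_invalidations(padded))  # 0
```

This is exactly why "false sharing" is a thing: logically independent variables that happen to sit on the same line behave, coherence-wise, as if they were shared.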
It seems inevitable that the more memory you have, the slower it is.
Even if we completely ignore issues of cost, bigger memory = more
address decode logic + longer traces. Longer traces mean more power,
more latency, more capacitance, more radio interference. So a really
huge memory is necessarily going to be fairly slow.
Similarly, if you have huge numbers of processing elements all trying to
access the same chunk of RAM through the same bus, you're splitting the
available bandwidth many, many ways. The result isn't going to be high
performance.
My personal belief is that the future is NUMA. We already have cache
hierarchies three or four levels deep, backed by demand-paged virtual
memory. Let's just stop pretending that "memory" is all uniform with
identical latency and start explicitly treating it as what it is.
Indeed, sometimes I start thinking about what some kind of hyper-Harvard
architecture would look like. For example, what happens if the machine
stack is on-chip? (I.e., it exists in its own unique address space, and
the circuitry for it is entirely on-chip.) What about if there were
several on-chip memories optimised for different types of operation?
Unfortunately, the answer to all these questions generally ends up being
"good luck implementing multitasking".
The other model I've looked at is having not a computer with one giant
memory connected to one giant CPU, but zillions of smallish memories
connected to zillions of processing elements. The problem with *that*
model tends to be "how do I get my data to where I need it?"
As someone else once said, "a supercomputer is a device for turning a
compute-bound problem into an I/O-bound problem".
> What I understand so far is: One processor has the bus, and either reads
> or writes from/to the bus. The other processor watches the activity and,
> if it sees an address it has modified it tells the other processor,
> which passes bus control to the other, puts the data out on the bus,
> then returns control to the first processor.
http://en.wikipedia.org/wiki/Cache_coherence
http://en.wikipedia.org/wiki/MESI_protocol
Looks like what you're describing is a "snooping" implementation, where
every cache watches the shared bus for addresses it holds. The main
alternative is a directory-based protocol, where a central directory
tracks which caches hold which lines -- that scales better once you have
more caches than can usefully sit snooping on one bus.
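For concreteness, here's the MESI state machine from that second link, written out as a transition table for one cache line in one cache. (This is a simplified sketch -- e.g. a real cache loads into Exclusive rather than Shared when no other cache holds the line; I've collapsed that case.)

```python
# Minimal sketch of MESI state transitions for ONE cache line in ONE
# cache, reacting to local accesses and to bus traffic it snoops.
# States: M(odified), E(xclusive), S(hared), I(nvalid).

# (state, event) -> next state; events are:
#   "read"       local read           "write"      local write
#   "bus_read"   another cache reads  "bus_write"  another cache writes
MESI = {
    ("I", "read"):      "S",  # fill from memory (simplified: assume shared)
    ("I", "write"):     "M",  # read-for-ownership, then modify
    ("S", "read"):      "S",
    ("S", "write"):     "M",  # broadcast invalidate to other sharers
    ("S", "bus_read"):  "S",
    ("S", "bus_write"): "I",  # someone else wrote: our copy is stale
    ("E", "read"):      "E",
    ("E", "write"):     "M",  # silent upgrade: no bus traffic needed
    ("E", "bus_read"):  "S",
    ("E", "bus_write"): "I",
    ("M", "read"):      "M",
    ("M", "write"):     "M",
    ("M", "bus_read"):  "S",  # supply dirty data, drop to Shared
    ("M", "bus_write"): "I",  # supply dirty data, then invalidate
}

def run(events, state="I"):
    """Fold a sequence of events through the table."""
    for ev in events:
        state = MESI[(state, ev)]
    return state

print(run(["read", "write", "bus_read"]))  # I -> S -> M -> S: prints S
```

The "puts the data out on the bus" step you described is the M + bus_read case: the owning cache intervenes, supplies the dirty line, and drops back to Shared.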
> Apparently, the bus can also be pipelined. I'm not exactly sure how this
> works, but the processors then have to agree on whether the operation
> actually can be put in a pipeline.
Again, when I learned this stuff, RAM was the fastest thing in the
system. You just send the address you want on the address bus and read
back (or write) the data on the data bus.
These days, the CPU is the fastest thing. You can "pipeline" read
requests by requesting a contiguous block of data. That way, you
eliminate some of the latency of sending a memory request for each
datum. (And, let's face it, you only talk to main memory to fill or
empty cache lines, which are contiguous anyway.)
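The arithmetic behind that amortisation is simple. The cycle counts below are made-up illustrative numbers, not any real chip's timings -- the point is just that the fixed per-request latency gets paid once per burst instead of once per word:

```python
# Back-of-envelope model of why burst ("pipelined") reads help: pay the
# address/setup latency once, then stream one word per bus clock.
# SETUP_CYCLES and CYCLES_PER_WORD are invented numbers for illustration.

SETUP_CYCLES = 10    # latency to present the address and set up the access
CYCLES_PER_WORD = 1  # each subsequent word in the burst

def single_reads(n_words):
    """Each word requested individually: pay setup every time."""
    return n_words * (SETUP_CYCLES + CYCLES_PER_WORD)

def burst_read(n_words):
    """One request for a contiguous block (e.g. a cache line fill)."""
    return SETUP_CYCLES + n_words * CYCLES_PER_WORD

# A 64-byte cache line over a 64-bit bus = 8 words:
print(single_reads(8))  # 88 cycles
print(burst_read(8))    # 18 cycles
```

So with these (invented) numbers, filling a cache line by burst is almost 5x faster than fetching each word separately -- and the bigger the gap between CPU and RAM speed, the bigger SETUP_CYCLES gets relative to the streaming rate.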
I understand this also interacts with the way the RAM chips internally
do two-dimensional addressing: the flat address is split into a row
number and a column number, strobed in separately (RAS, then CAS), and
accesses within an already-open row are much cheaper...
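A sketch of that address split -- the bit widths here are arbitrary example values, not any specific part's geometry:

```python
# Sketch of how a DRAM chip splits a flat word address into a row
# number and a column number (the "two-dimensional" addressing).
# ROW_BITS/COL_BITS are example values, not a real chip's geometry.

ROW_BITS = 13   # 8192 rows
COL_BITS = 10   # 1024 columns per row

def split_address(addr):
    """Return (row, col) as the DRAM sees them: the row is strobed
    first (RAS), then the column (CAS)."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    return row, col

# Consecutive addresses share a row, so after the row is opened once,
# further accesses in that row only pay the (cheaper) column latency.
print(split_address(0))      # (0, 0)
print(split_address(1023))   # (0, 1023)  -- last column of row 0
print(split_address(1024))   # (1, 0)     -- row crossing: reopen
```

Which is exactly why bursts of contiguous addresses are cheap: they mostly stay inside one open row.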