  Re: Complicated  
From: Mike Raiford
Date: 3 Jun 2011 10:50:02
Message: <4de8f49a$1@news.povray.org>
On 6/3/2011 8:48 AM, Invisible wrote:
> On 02/06/2011 03:06 PM, Mike Raiford wrote:
>> So, I've been sort of reading this:
>>
>> http://download.intel.com/design/intarch/manuals/27320401.pdf
>
> Woooo boy, that's one big can of complexity, right there! ;-)
>
>> I've had a pretty good idea of how the 8086 and 8088 deal with the
>> system bus. But, I wanted to understand more about the later generation
>> processors, so I started with the Pentium.
>
> I'm going to sound really old now, but... I remember back in the days
> when the RAM was faster than the CPU. I remember in-order execution
> without pipelines. When I first sat down to read about how modern IA32
> processors actually work... I was pretty shocked. The circuitry seems to
> spend more time deciding how to execute the code than actually, you
> know, *executing* it! o_O
>

Me, too. I remember (back in the 386 days, I think) when processors 
started to outpace some memory: you had to add wait states to keep the 
processor from spamming the memory with more requests than it could 
serve. Now the core operates at many times the speed of the bus, and 
everything on a modern multi-core CPU is I/O-bound.
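
Just to put rough numbers on it (these figures are illustrative, pulled 
from memory rather than any datasheet): a wait state is just an extra 
bus clock inserted so a slow memory part has time to respond.

#include <stdio.h>

/* Ballpark figures only -- not from a datasheet. A wait state is one
 * extra bus clock inserted so a slow memory part has time to respond. */
static int wait_states_needed(double bus_mhz, double mem_ns, int cycles)
{
    double clock_ns = 1000.0 / bus_mhz;   /* one bus clock, in ns */
    double budget   = cycles * clock_ns;  /* zero-wait-state bus cycle */
    int waits = 0;
    while (budget < mem_ns) {             /* pad until memory keeps up */
        waits++;
        budget += clock_ns;
    }
    return waits;
}

int main(void)
{
    /* A hypothetical 16 MHz 386 (2-clock bus cycle) with 150 ns DRAM:
     * 2 x 62.5 ns = 125 ns < 150 ns, so one wait state is needed. */
    printf("wait states: %d\n", wait_states_needed(16.0, 150.0, 2));
    return 0;
}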

>> I'm on the section dealing with bus arbitration and cache coherency when
>> there are 2 processors in the system.
>
>> It occurs to me that handling the cache when there are 2 parts vying for
>> the same resource can get rather messy.
>
>> I can definitely see some potential for bottlenecks in a
>> multi-processor system when dealing with the bus, since electrically,
>> only one device can place data on the bus at one time. The nice thing
>> is, in system design, you can design your bus the same for single or
>> dual processors. Provided you've wired the proper signals together, and
>> initialized the processors properly with software and certain pin
>> levels, it's totally transparent to the rest of the system.
>
> There are two main problems with trying to connect two CPUs to one RAM
> block:
>
> 1. Bandwidth.
>
> 2. Cache coherence.
>
> Currently, RAM is way, way slower than the CPU. Adding more CPU cores
> simply makes things worse. Unless you're doing a specific type of
> compute-heavy process that doesn't require much RAM access, all you're
> doing is taking the system's main bottleneck and making it twice as bad.
> Instead of having /one/ super-fast CPU sitting around waiting for RAM to
> respond, now you have /two/ super-fast CPUs fighting over who gets to
> access RAM next.
>
> (I've seen plenty of documentation from the Haskell developers where
> they benchmark something, and then find that it goes faster if you turn
> *off* parallelism, because otherwise the multiple cores fight for
> bandwidth or constantly invalidate each other's caches.)

The solution, of course, would be to break the task down into small 
blocks that fit within the CPUs' caches (good locality) and work on 
units in areas of memory where cache lines will not overlap. Which is 
all well and good, but... not really feasible, because you are not in 
charge of where your task's pages physically reside in memory. The OS 
is in charge of that, and it could change the physical layout at any 
moment.
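
For data you lay out yourself, padding to line boundaries at least 
stops two threads' hot variables from landing on the same line. A 
minimal sketch with pthreads (compile with -pthread; the 64-byte line 
size is an assumption, though it holds on most current x86 parts):

#include <pthread.h>
#include <stdio.h>

/* Two per-thread counters. Packed together they'd share a cache line
 * and the cores would invalidate each other constantly ("false
 * sharing"); padding each one out to a full line avoids that. */
#define LINE 64

struct padded { long n; char pad[LINE - sizeof(long)]; };
static struct padded counter[2];

static void *work(void *arg)
{
    struct padded *c = arg;
    for (long i = 0; i < 100000000L; i++)
        c->n++;                     /* stays in this core's cache */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, work, &counter[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counter[0].n, counter[1].n);
    return 0;
}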

> It seems inevitable that the more memory you have, the slower it is.
> Even if we completely ignore issues of cost, bigger memory = more
> address decode logic + longer traces. Longer traces mean more power,
> more latency, more capacitance, more radio interference. It seems
> inevitable that if you have a really huge memory, it will necessarily be
> very slow.

Right, because all of those elements work against you. It still amazes 
me that a CPU can operate at the speeds it does and still run robustly 
enough to produce good results without completely distorting the 
signals it's handling. Each FET has a capacitance at its gate, and yet 
they've managed to get the total capacitance (traces, gates, etc.) so 
tiny that it doesn't kill the signal.

> Similarly, if you have huge numbers of processing elements all trying to
> access the same chunk of RAM through the same bus, you're splitting the
> available bandwidth many, many ways. The result isn't going to be high
> performance.
>
> My personal belief is that the future is NUMA. We already have cache
> hierarchies three or four levels deep, backed by demand-paged virtual
> memory. Let's just stop pretending that "memory" is all uniform with
> identical latency and start explicitly treating it as what it is.
>
> Indeed, sometimes I start thinking about what some kind of hyper-Harvard
> architecture would look like. For example, what happens if the machine
> stack is on-chip? (I.e., it exists in its own unique address space, and
> the circuitry for it is entirely on-chip.) What about if there were
> several on-chip memories optimised for different types of operation?

Don't some RISC processors sort of do this?
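
On the NUMA point, Linux already half-admits that memory isn't 
uniform: you can ask for pages on a specific node explicitly. A little 
sketch with libnuma (link with -lnuma; the 1 MB size and node 0 are 
just example values):

#include <numa.h>     /* libnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this box\n");
        return 1;
    }
    /* Ask for 1 MB that physically lives on node 0, so a thread
     * pinned there sees local latency instead of a trip across the
     * interconnect. Node 0 and the size are example values. */
    size_t len = 1 << 20;
    void *buf = numa_alloc_onnode(len, 0);
    if (buf) {
        printf("got %zu bytes on node 0 (nodes: %d)\n",
               len, numa_max_node() + 1);
        numa_free(buf, len);
    }
    return 0;
}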

> Unfortunately, the answer to all these questions generally ends up being
> "good luck implementing multitasking".
>
> The other model I've looked at is having not a computer with one giant
> memory connected to one giant CPU, but zillions of smallish memories
> connected to zillions of processing elements. The problem with *that*
> model tends to be "how do I get my data to where I need it?"

Right... There would need to be some sort of controller that could 
transfer data from Block A to Block B. Addressing would be quite... 
interesting in this scheme.
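
As a toy model (every name here is invented for illustration): make a 
(node, offset) pair the only global address, give each node a private 
block, and let the controller be the sole path between blocks.

#include <string.h>
#include <stdio.h>

#define NODES     4
#define BLOCK_LEN 4096

static unsigned char block[NODES][BLOCK_LEN];  /* one memory per node */

struct gaddr { int node; int off; };  /* the "interesting" addressing */

/* The controller: the only way data moves between blocks. */
static void transfer(struct gaddr dst, struct gaddr src, int len)
{
    memcpy(&block[dst.node][dst.off], &block[src.node][src.off], len);
}

int main(void)
{
    strcpy((char *)&block[0][0], "hello from node 0");
    struct gaddr src = { 0, 0 }, dst = { 3, 128 };
    transfer(dst, src, 18);       /* 17 chars plus the terminator */
    printf("node 3 sees: %s\n", (char *)&block[3][128]);
    return 0;
}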

> As someone else once said, "a supercomputer is a device for turning a
> compute-bound problem into an I/O-bound problem".
>

Pretty much. Unless you're working on something that can fit entirely 
inside the processing units' caches, you'll have exactly that problem.
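
You can even put a rough number on "fits in the caches". 
Back-of-envelope, with invented figures:

#include <stdio.h>

/* If a core can do 10 Gflop/s but the bus delivers 10 GB/s, each
 * 8-byte double streamed in has to be reused in about 8 flops, or the
 * ALUs sit idle waiting on memory. All figures invented. */
int main(void)
{
    double flops_per_s = 10e9;
    double bytes_per_s = 10e9;
    double flops_per_double = flops_per_s / (bytes_per_s / 8.0);
    printf("need >= %.0f flops per double loaded to stay "
           "compute-bound\n", flops_per_double);
    return 0;
}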

>> What I understand so far is: one processor has the bus, and either
>> reads or writes from/to it. The other processor watches the activity
>> and, if it sees an address it has modified, it tells the first
>> processor, which hands over bus control; the second processor puts
>> the data out on the bus, then returns control to the first processor.
>
> http://en.wikipedia.org/wiki/Cache_coherence
> http://en.wikipedia.org/wiki/MESI_protocol
>

Wow. I actually recognized MESI right off: Modified, Exclusive, Shared, 
Invalid. This is exactly what the Pentium uses.
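
The nice thing about MESI is that the state machine is small enough to 
just write down. A protocol-level sketch of what a cache does on its 
own processor's accesses (the abstract protocol, not any particular 
chip's circuitry):

#include <stdio.h>

enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

static const char *name[] = { "Modified", "Exclusive", "Shared",
                              "Invalid" };

static enum mesi local_access(enum mesi s, int is_write,
                              int others_have_it)
{
    if (is_write)
        return MODIFIED;     /* any write ends with us owning the line */
    if (s != INVALID)
        return s;            /* read hit: no state change */
    return others_have_it ? SHARED : EXCLUSIVE;  /* read miss */
}

int main(void)
{
    /* Read miss with no other holders, then a write to the same line. */
    enum mesi s = local_access(INVALID, 0, 0);
    printf("after read miss:   %s\n", name[s]);   /* Exclusive */
    s = local_access(s, 1, 0);
    printf("after local write: %s\n", name[s]);   /* Modified */
    return 0;
}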

> Looks like what you're describing is a "snooping" implementation. There
> are also other ways to implement this.

Yep, exactly: the LRM (Least Recent Master) snoops the MRM's (Most 
Recent Master) activity. If it sees that the MRM is reading or writing 
a line that the LRM has marked as modified or exclusive, it signals 
the MRM that it needs the bus to write out what it has.
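
That write-back is the Modified branch of the snoop side of the state 
machine. Same caveat as before, this is the abstract protocol:

#include <stdio.h>

enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };  /* as above */

static void writeback_line(void) { puts("LRM: flushing dirty line"); }

/* What the snooping LRM does when it observes the MRM's bus traffic.
 * A snoop hit on a Modified line is exactly the case above: the LRM
 * takes the bus, writes the dirty line back, then lets the MRM's
 * access finish against now-consistent memory. */
static enum mesi snoop(enum mesi s, int mrm_is_writing)
{
    if (s == MODIFIED)
        writeback_line();
    if (mrm_is_writing)
        return INVALID;       /* their write kills our copy */
    return (s == INVALID) ? INVALID : SHARED;  /* their read demotes us */
}

int main(void)
{
    enum mesi s = snoop(MODIFIED, 0);
    printf("after snooped read of a Modified line: %s\n",
           s == SHARED ? "Shared" : "other");
    return 0;
}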

>
> These days, the CPU is the fastest thing. You can "pipeline" read
> requests by requesting a contiguous block of data. That way, you
> eliminate some of the latency of sending a memory request for each
> datum. (And, let's face it, you only talk to main memory to fill or
> empty cache lines, which are contiguous anyway.)
>

OK, so... essentially, this is burst mode.
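
Which is also why sequential access beats scattered access so badly: 
one burst fill serves many consecutive reads. A quick-and-dirty way to 
see it (the 64-byte line size is an assumption, and the timings vary 
wildly between machines):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Touch the same number of bytes sequentially and then with a
 * line-sized stride. Sequentially, one burst fill serves 64 accesses;
 * strided, every access hits a fresh line, and since the array is far
 * bigger than any cache, that means a fresh burst every time. */
#define N (64L * 1024 * 1024)

int main(void)
{
    unsigned char *a = malloc(N);
    if (!a) return 1;
    memset(a, 1, N);
    volatile unsigned long sum = 0;
    clock_t t;

    t = clock();
    for (long i = 0; i < N; i++) sum += a[i];          /* sequential */
    printf("sequential: %.2fs\n",
           (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (long off = 0; off < 64; off++)                /* line-strided */
        for (long i = off; i < N; i += 64) sum += a[i];
    printf("strided:    %.2fs\n",
           (double)(clock() - t) / CLOCKS_PER_SEC);

    free(a);
    return 0;
}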

> I understand this also has something to do with the internal way that
> the RAM chips do two-dimensional addressing...

I can't remember the specifics, at the moment.

-- 
~Mike

