So, I've been sort of reading this:
http://download.intel.com/design/intarch/manuals/27320401.pdf
I've had a pretty good idea of how the 8086 and 8088 deal with the
system bus. But, I wanted to understand more about the later generation
processors, so I started with the Pentium.
I'm on the section dealing with bus arbitration and cache coherency when
there are 2 processors in the system. (This is the embedded version, I'm
not sure if the full version is all that different)
It occurs to me that handling the cache when there are 2 parts vying for
the same resource can get rather messy.
What I understand so far is: one processor has the bus, and either reads
or writes from/to it. The other processor watches the activity and, if
it sees an address whose contents it has modified in its own cache, it
tells the bus owner; the bus owner hands over control, the snooping
processor puts its data out on the bus, and then control returns to the
first processor.
Apparently, the bus can also be pipelined. I'm not exactly sure how this
works, but the processors then have to agree on whether the operation
actually can be put in a pipeline.
I can definitely see some potential for bottle-necks in a
multi-processor system when dealing with the bus, since electrically,
only one device can place data on the bus at one time. The nice thing
is, in system design, you can design your bus the same for single or
dual processors. Provided you've wired the proper signals together, and
initialized the processors properly with software and certain pin
levels, it's totally transparent to the rest of the system.
--
~Mike
On 02/06/2011 03:06 PM, Mike Raiford wrote:
> So, I've been sort of reading this:
>
> http://download.intel.com/design/intarch/manuals/27320401.pdf
Woooo boy, that's one big can of complexity, right there! ;-)
> I've had a pretty good idea of how the 8086 and 8088 deal with the
> system bus. But, I wanted to understand more about the later generation
> processors, so I started with the Pentium.
I'm going to sound really old now, but... I remember back in the days
when the RAM was faster than the CPU. I remember in-order execution
without pipelines. When I first sat down to read about how modern IA32
processors actually work... I was pretty shocked. The circuitry seems to
spend more time deciding how to execute the code than actually, you
know, *executing* it! o_O
> I'm on the section dealing with bus arbitration and cache coherency when
> there are 2 processors in the system.
> It occurs to me that handling the cache when there are 2 parts vying for
> the same resource can get rather messy.
> I can definitely see some potential for bottle-necks in a
> multi-processor system when dealing with the bus, since electrically,
> only one device can place data on the bus at one time. The nice thing
> is, in system design, you can design your bus the same for single or
> dual processors. Provided you've wired the proper signals together, and
> initialized the processors properly with software and certain pin
> levels, it's totally transparent to the rest of the system.
There are two main problems with trying to connect two CPUs to one RAM
block:
1. Bandwidth.
2. Cache coherence.
Currently, RAM is way, way slower than the CPU. Adding more CPU cores
simply makes things worse. Unless you're doing a specific type of
compute-heavy process that doesn't require much RAM access, all you're
doing is taking the system's main bottleneck and making it twice as bad.
Instead of having /one/ super-fast CPU sitting around waiting for RAM to
respond, now you have /two/ super-fast CPUs fighting over who gets to
access RAM next.
(I've seen plenty of documentation from the Haskell developers where
they benchmark something, and then find that it goes faster if you turn
*off* parallelism, because otherwise the multiple cores fight for
bandwidth or constantly invalidate each other's caches.)
It seems inevitable that the more memory you have, the slower it is.
Even if we completely ignore issues of cost, bigger memory = more
address decode logic + longer traces. Longer traces mean more power,
more latency, more capacitance, more radio interference. It seems
inevitable that if you have a really huge memory, it will necessarily be
very slow.
Similarly, if you have huge numbers of processing elements all trying to
access the same chunk of RAM through the same bus, you're splitting the
available bandwidth many, many ways. The result isn't going to be high
performance.
My personal belief is that the future is NUMA. We already have cache
hierarchies three or four levels deep, backed by demand-paged virtual
memory. Let's just stop pretending that "memory" is all uniform with
identical latency and start explicitly treating it as what it is.
Indeed, sometimes I start thinking about what some kind of hyper-Harvard
architecture would look like. For example, what happens if the machine
stack is on-chip? (I.e., it exists in its own unique address space, and
the circuitry for it is entirely on-chip.) What about if there were
several on-chip memories optimised for different types of operation?
Unfortunately, the answer to all these questions generally ends up being
"good luck implementing multitasking".
The other model I've looked at is having not a computer with one giant
memory connected to one giant CPU, but zillions of smallish memories
connected to zillions of processing elements. The problem with *that*
model tends to be "how do I get my data to where I need it?"
As someone else once said, "a supercomputer is a device for turning a
compute-bound problem into an I/O-bound problem".
> What I understand so far is: One processor has the bus, and either reads
> or writes from/to the bus. The other processor watches the activity and,
> if it sees an address it has modified it tells the other processor,
> which passes bus control to the other, puts the data out on the bus,
> then returns control to the first processor.
http://en.wikipedia.org/wiki/Cache_coherence
http://en.wikipedia.org/wiki/MESI_protocol
Looks like what you're describing is a "snooping" implementation. There
are also other ways to implement this.
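For a feel of how MESI works, here's a toy state machine (my own simplification; real implementations also generate the bus transactions and write-backs that this table only hints at in comments):

```python
# Toy MESI cache-line state machine: Modified, Exclusive, Shared, Invalid.
# One table for the cache performing an access, one for a cache snooping
# someone else's bus traffic. Simplified sketch, not the Pentium's exact
# implementation.

LOCAL = {
    # (state, event) -> next state for the cache doing the access
    ("I", "read"):  "E",   # assuming no other cache holds the line (else "S")
    ("I", "write"): "M",
    ("E", "read"):  "E",
    ("E", "write"): "M",   # silent upgrade: no bus traffic needed
    ("S", "read"):  "S",
    ("S", "write"): "M",   # must first invalidate the other copies
    ("M", "read"):  "M",
    ("M", "write"): "M",
}

SNOOP = {
    # (state, observed bus event) -> next state for the snooping cache
    ("M", "bus_read"):  "S",   # write dirty data back first, then share
    ("M", "bus_write"): "I",   # write dirty data back, then invalidate
    ("E", "bus_read"):  "S",
    ("E", "bus_write"): "I",
    ("S", "bus_read"):  "S",
    ("S", "bus_write"): "I",
    ("I", "bus_read"):  "I",
    ("I", "bus_write"): "I",
}

def step(state, event):
    table = SNOOP if event.startswith("bus_") else LOCAL
    return table[(state, event)]

# A dirty line gets demoted to Shared when the other CPU reads it:
assert step("M", "bus_read") == "S"
```

That last transition is exactly the scenario described above: the snooping processor holds modified data, sees the other one read that address, and has to intervene.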
> Apparently, the bus can also be pipelined. I'm not exactly sure how this
> works, but the processors then have to agree on whether the operation
> actually can be put in a pipeline.
Again, when I learned this stuff, RAM was the fastest thing in the
system. You just send the address you want on the address bus and read
back (or write) the data on the data bus.
These days, the CPU is the fastest thing. You can "pipeline" read
requests by requesting a contiguous block of data. That way, you
eliminate some of the latency of sending a memory request for each
datum. (And, let's face it, you only talk to main memory to fill or
empty cache lines, which are contiguous anyway.)
I understand this also has something to do with the internal way that
the RAM chips do two-dimensional addressing...
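The win from bursting is easy to see with some made-up numbers (purely illustrative, not real timings for any part):

```python
# Back-of-envelope: amortising request latency over a burst.
# SETUP and PER_WORD are invented numbers for illustration only.

SETUP = 40      # cycles to send an address and wait for the first data
PER_WORD = 2    # cycles per data word once the transfer is streaming

def cycles(words, burst):
    """Total cycles to read `words` data words.
    burst=False: a separate request (and full setup cost) per word.
    burst=True:  one request, then the words stream back-to-back."""
    if burst:
        return SETUP + words * PER_WORD
    return words * (SETUP + PER_WORD)

# Filling an 8-word cache line:
assert cycles(8, burst=False) == 336
assert cycles(8, burst=True) == 56    # setup cost paid once, not 8 times
```

Since cache-line fills are contiguous anyway, nearly every main-memory access gets this amortisation for free.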
OK... something I don't get. Presumably, it has some weird thing to do
with alignment.
A bit of background first: the Pentium has a 64-bit data bus. What this
means is that the lower three bits of the address bus have been dropped;
the processor really only addresses memory on 8-byte boundaries. But it
has a way of getting around that: 8 pins (BE0# through BE7#) that act as
a mask. If it only needs the first four bytes, it can pull low just the
first four BE# pins. This tells the system's logic which byte lanes it
wants, and makes accessing smaller chunks of memory much more efficient
if, say, the RAM is composed of modules with an 8-bit data bus. Oh, by
the way... this is why it was critically important to make sure your
memory was installed in pairs. ;) You see, back in the day DRAM was sold
in 32-bit modules. With only one module installed, the data bus was only
32 bits wide (*IF* the motherboard actually supported that
configuration) but, with both....
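The BE# masking can be sketched like this (my own simplification; it ignores accesses that spill past the 8-byte line, which need a second bus cycle):

```python
# Sketch: which BE# pins a CPU with a 64-bit data bus would assert.
# BE# pins are active low; here, bit i set in the result means BEi# is
# pulled low, i.e. byte lane i participates in the transfer.

def byte_enables(address, size):
    offset = address & 7            # A0-A2 never leave the chip
    lanes = ((1 << size) - 1) << offset
    return lanes & 0xFF             # lanes beyond the line: second cycle

# 4-byte access at an 8-byte boundary: BE0#-BE3# active
assert byte_enables(0x1000, 4) == 0b0000_1111
# 2-byte access at offset 6: BE6#-BE7# active
assert byte_enables(0x1006, 2) == 0b1100_0000
```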
But here's the part that isn't making sense to me:
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | Length of Transfer       | 1 Byte |                       2 Bytes                         |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | Low Order Address        |  xxx   | 000  | 001  | 010  | 011  | 100  | 101  | 110  | 111  |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | 1st transfer             |   b    |  w   |  w   |  w   |  hb  |  w   |  w   |  w   |  hb  |
> | Value driven on A3       |        |  0   |  0   |  0   |  0   |  0   |  0   |  0   |  1   |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | 2nd transfer (if needed) |        |      |      |      |  lb  |      |      |      |  lb  |
> | Byte enables driven      |        |      |      |      | BE3# |      |      |      | BE7# |
> | Value driven on A3       |        |      |      |      |  0   |      |      |      |  0   |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
>
> +--------------------------+------+------+------+------+------+------+------+------+
> | Length of Transfer       |                      4 Bytes                          |
> +--------------------------+------+------+------+------+------+------+------+------+
> | Low Order Address        | 000  | 001  | 010  | 011  | 100  | 101  | 110  | 111  |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 1st transfer             |  d   |  hb  |  hw  |  h3  |  d   |  hb  |  hw  |  h3  |
> | Value driven on A3       |  0   |  0   |  0   |  0   |  0   |  1   |  1   |  1   |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 2nd transfer (if needed) |      |  l3  |  lw  |  lb  |      |  l3  |  lw  |  lb  |
> | Value driven on A3       |      |  0   |  0   |  0   |      |  0   |  0   |  0   |
> +--------------------------+------+------+------+------+------+------+------+------+
>
> +--------------------------+------+------+------+------+------+------+------+------+
> | Length of Transfer       |                      8 Bytes                          |
> +--------------------------+------+------+------+------+------+------+------+------+
> | Low Order Address        | 000  | 001  | 010  | 011  | 100  | 101  | 110  | 111  |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 1st transfer             |  q   |  hb  |  hw  |  h3  |  hd  |  h5  |  h6  |  h7  |
> | Value driven on A3       |  0   |  1   |  1   |  1   |  1   |  1   |  1   |  1   |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 2nd transfer (if needed) |      |  l7  |  l6  |  l5  |  ld  |  l3  |  lw  |  lb  |
> | Value driven on A3       |      |  0   |  0   |  0   |  0   |  0   |  0   |  0   |
> +--------------------------+------+------+------+------+------+------+------+------+
>
> Key:
>
> b = byte transfer    w = 2-byte transfer   3 = 3-byte transfer   d = 4-byte transfer
> 5 = 5-byte transfer  6 = 6-byte transfer   7 = 7-byte transfer   q = 8-byte transfer
> h = high order       l = low order
>
> 8-byte operand:
>
> +-----------------+--------+--------+--------+--------+--------+--------+----------------+
> | high order byte | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | low order byte |
> +-----------------+--------+--------+--------+--------+--------+--------+----------------+
OK, so... in this table you can see when the operand is a word it
requires a second transfer if the low order bits of the address are 3 or 7.
7 is understandable; after all, you're crossing an 8-byte boundary
there. But 3 is utterly baffling.
From this chart, when dealing with word-length operands, I can surmise:
Address = # transfers for word
0x12345678 = 1 transfer
0x12345679 = 1 transfer
0x1234567A = 1 transfer
0x1234567B = 2 transfers
0x1234567C = 1 transfer
0x1234567D = 1 transfer
0x1234567E = 1 transfer
0x1234567F = 2 transfers
To me it makes no sense. xxxxB should be able to transfer in one go. Why
wouldn't it?
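For what it's worth, one reading that does reproduce the chart: the split happens at every 4-byte (dword) boundary, not just at the 8-byte bus boundary; the same pattern shows up in the 4-byte rows of the table, too. A sketch under that assumption:

```python
# Hypothesis (mine, from the table): the Pentium splits any transfer
# that crosses a 4-byte boundary, not only an 8-byte one.

def word_transfers(address):
    offset = address & 7
    # a word occupies bytes `offset` and `offset + 1`; they straddle a
    # multiple of 4 exactly when offset % 4 == 3
    return 2 if offset % 4 == 3 else 1

pattern = [word_transfers(0x12345678 + i) for i in range(8)]
assert pattern == [1, 1, 1, 2, 1, 1, 1, 2]   # matches the chart above
```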
--
~Mike
On 03/06/2011 02:48 PM, Invisible wrote:
> Woooo boy, that's one big can of complexity, right there! ;-)
I've said it before, and I'll say it again: The IA32 platform is one
huge stack of backwards compatibility kludges.
The story begins (arguably) with the Intel 8008, released in 1972 or so.
(!!) It consisted of 3,500 transistors, and was manufactured on a 10 μm
PMOS process. It ran at 500 kHz. (That's 0.5 MHz or 0.0005 GHz.) It had
a grand total of 18 pins (despite the 14-bit address space). It featured
seven registers, A through E plus H and L, all 8 bits wide.
Then came the 8080 (around 1974), with a 2 MHz maximum clock speed. This
had seven registers, named A, B, C, D, E, H, and L, all 8 bits wide.
Certain pairs (BC, DE, HL) could be used together by certain
instructions to perform 16-bit operations.
(The world-famous Z80 processor is an enhanced version of the 8080.)
Now, finally, in 1978 (!) we arrive at the 8086. The A, B, C and D
registers are renamed AL, BL, CL and DL, and new registers called AH,
BH, CH and DH were added. These are all 8-bit, and can be combined in
pairs to form the 16-bit AX, BX, CX and DX registers.
Also new was the infamous memory segmentation model. Under this bizarre
scheme, there are four "segment pointer" registers which select which
"segment" you access data from. But because the segment offsets are
16-bit, the segments actually overlap, so there are multiple ways to
refer to the same physical address.
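The overlap is easy to demonstrate; real-mode translation is just physical = segment * 16 + offset:

```python
# Real-mode (8086) address translation: the 16-bit segment is shifted
# left 4 bits and added to the 16-bit offset, giving a 20-bit address.

def physical(segment, offset):
    return (segment << 4) + offset

# Two different segment:offset pairs, one physical address:
assert physical(0x1234, 0x0005) == 0x12345
assert physical(0x1000, 0x2345) == 0x12345
# In fact, a given physical address can have up to 4096 aliases.
```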
Basically, this is a huge kludge. Rather than implementing real 32-bit
addressing, they kludged in 20-bit addressing. While not /completely/
without merit (e.g., this whole segment malarkey makes relocatable code
quite a bit easier), it's really a bad solution.
Not content with that, Intel developed the 8087, the FPU to go with the
8086 CPU. Unlike any sane design, this FPU has 8 registers, but you
cannot access them directly. Instead, they function as a "stack". Math
operations "pop" their operands from the top and "push" the result back
on. If you want to access something lower down, you have to FXCH
instructions to swap the top register's contents with one of the
registers lower down.
In later generations of chip, the registers are mapped in hardware with
pointers, and two parallel instruction pipelines allow you to optimise
FXCH down to zero clock cycles (effectively). But still, WTF?
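A toy model of the stack discipline (not real x87 semantics in every detail, but it shows why FXCH is needed to reach anything below the top):

```python
# Toy model of the x87 register stack: arithmetic only touches the top
# of stack; FXCH swaps ST(0) with a deeper register to make it reachable.

class X87Stack:
    def __init__(self):
        self.st = []                  # st[0] models ST(0), the top

    def fld(self, value):             # push a value onto the stack
        self.st.insert(0, value)

    def faddp(self):                  # pop two operands, push their sum
        a, b = self.st.pop(0), self.st.pop(0)
        self.st.insert(0, a + b)

    def fxch(self, i):                # swap ST(0) with ST(i)
        self.st[0], self.st[i] = self.st[i], self.st[0]

s = X87Stack()
s.fld(1.0); s.fld(2.0); s.fld(4.0)    # stack (top first): 4.0, 2.0, 1.0
s.fxch(2)                             # bring the buried 1.0 to the top
s.faddp()                             # adds 1.0 + 2.0
assert s.st == [3.0, 4.0]
```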
In 1982 (this is the first year in the list so far when I was actually
*alive*!) the 80286 (or "286") appeared. This was the first CPU in the
family with memory protection.
In 1985, the 80386 ("386") came along. This was the first 32-bit x86
processor. (Which is why IA32 is sometimes referred to as "i386", and
why Linux generally refuses to work with anything older.) This was also
the first processor in the family where the relationship between the
addresses a program uses and physical memory addresses is programmable
rather than hard-wired. In other words, this is where x86 memory paging
got invented.
The 386 inherits all of the registers from the 286 (i.e., AL, AH, AX,
BL, BH, BX, etc.) But AX is a 16-bit register. So the 386 adds a new
register, EAX, which is 32 bits. AX is the bottom 16 bits of EAX.
Similarly for B, C and D.
(By contrast, a *real* 32-bit chip like the Motorola 68000 has registers
A0 through A7 and D0 through D7, and when you do an operation, you
specify how many bits to use, e.g., move.l d3,d7. None of this stupidity
with multiple names for the same register.)
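The aliasing scheme amounts to treating the named registers as bit slices of one value. A sketch (modelling only the AX and AH slices; note that real AMD64 additionally zero-extends 32-bit writes into the full 64-bit register, which this ignores):

```python
# AL/AH/AX/EAX/RAX as bit slices of a single 64-bit value.
# Writing AX changes only the low 16 bits; the rest is untouched.

def write_ax(rax, value):
    return (rax & ~0xFFFF) | (value & 0xFFFF)

def read_eax(rax):
    return rax & 0xFFFF_FFFF          # EAX is the low 32 bits

def read_ah(rax):
    return (rax >> 8) & 0xFF          # AH is bits 8-15

rax = 0x1122_3344_5566_7788
rax = write_ax(rax, 0xABCD)
assert rax == 0x1122_3344_5566_ABCD   # upper 48 bits preserved
assert read_eax(rax) == 0x5566_ABCD
assert read_ah(rax) == 0xAB
```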
When AMD64 eventually came along, these became 64-bit registers RAX,
RBX, etc., of which EAX, EBX, etc. are the lower 32-bits. (AMD64 also
adds completely new 64-bit registers R8 through R15, just for good
measure. You would expect this to result in an utterly massive speed
increase, but apparently people have measured it at less than 2%.)
If I say 80486, you probably think of some ancient old thing. But it was
the first chip in the family to include an on-chip L1 cache, and its
tightly pipelined design got close to sustaining one instruction per
clock cycle. (True superscalar execution, issuing more than one
instruction per clock, arrived with the Pentium.) In the form of the
486DX, it was also the first one with an on-chip FPU.
In 1996, the Pentium MMX arrived. MMX stands for "multimedia
extensions". (Remember, in the 1990s, "multimedia" was the wave of the
future that was going to take over the world...) What this actually
*does* is it adds SIMD (single-instruction, multiple-data) instructions.
Basically, there are 8 new registers, MM0 through MM7, each 64 bits
wide. Using the new MMX instructions, you can treat a given MMX register
as an array of values (e.g., 4 elements of 16 bits each) and do
element-wise operations over them.
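The element-wise idea, sketched in plain code (this is what a packed-add instruction like PADDW does in a single operation):

```python
# Packed 16-bit addition on a 64-bit register: 4 lanes, each wrapping
# independently at 16 bits. Plain-Python sketch of the MMX PADDW idea.

def paddw(a, b):
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & 0xFFFF
        y = (b >> shift) & 0xFFFF
        result |= ((x + y) & 0xFFFF) << shift   # wrap within the lane
    return result

a = 0x0001_0002_0003_FFFF
b = 0x0001_0001_0001_0001
assert paddw(a, b) == 0x0002_0003_0004_0000   # low lane wraps to 0
```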
Nothing wrong with that.
Oh yeah, and the MMX registers are the FPU registers.
KLUDGE! >_<
Yes, rather than add 8 *new* registers, MMX just adds new names for the
existing FPU registers. (But now you have proper random access to them.)
The reason for this is simple: it means that the OS doesn't have to
support MMX for context switches to work properly. (I.e., a context
switch under an OS that doesn't know about MMX won't clobber the MMX
registers, because the MMX registers *are* the FPU registers, which any
FPU-aware OS will already be preserving.)
This horrifying kludge still haunts us to this day. Yes, it meant that
developers could start using MMX without having to wait for Microsoft to
release an updated version of Windows. On the other hand, it means you
can't use MMX (which is integer-only) and normal FPU operations at the
same time, because one will clobber the other. FAIL!
In 1998, AMD released the 3DNow! technology that almost nobody now
remembers. This basically adds new MMX operations, using the same "MMX
registers that are really the FPU registers" kludge, for the same reason.
Apparently 3DNow! was never that popular, and is being phased out now.
Instead, Intel came up with SSE, which adds new registers named XMM.
(Get it?) Yes, that's right, *finally* they actually added new registers
rather than kludging old ones. These new XMM registers are 128 bits
wide, and can be operated as 4 x single-precision floats.
Then SSE2 came along, and added versions of all the MMX instructions
that work on the XMM registers instead. So now you never need to use the
old MMX instructions ever again! And now the XMM registers can be
treated not just as 4 x single-precision floats, but also as (say) 2 x
double-precision floats, 8 x 16-bit integers, and so on.
Of course, since the OS has to know to include the new XMM registers in
context switches, SSE is disabled by default. The OS has to explicitly
enable it before it will work. A bit like the way the processor starts
up in "real mode" (i.e., 8086 emulation mode), and the OS has to
manually switch it into "protected mode" (i.e., the normal operating
mode that all modern software actually freaking uses) during the boot
sequence.
Then we have AMD64, which runs in 32-bit mode by default until the OS
switches it to 64-bit mode. (The PC I am using right now is *still*
running in 32-bit mode, despite possessing a 64-bit processor.) In that
mode, an extra 8 XMM registers appear (XMM8 to XMM15).
Did you follow all that?
Don't even get me started on all the different memory paging and
segmentation schemes...
In short, they kept kludging more and more stuff in. Having a
stack-based FPU register file is a stupid, stupid idea. But now all our
software depends on this arrangement, so we're stuck with it forever.
Aliasing the MMX registers to the FPU registers was stupid, but
fortunately we don't have to live with that one. Memory segmentation was
stupid, but now we're basically stuck with it. The list goes on...
On 6/3/2011 8:48 AM, Invisible wrote:
> On 02/06/2011 03:06 PM, Mike Raiford wrote:
>> So, I've been sort of reading this:
>>
>> http://download.intel.com/design/intarch/manuals/27320401.pdf
>
> Woooo boy, that's one big can of complexity, right there! ;-)
>
>> I've had a pretty good idea of how the 8086 and 8088 deal with the
>> system bus. But, I wanted to understand more about the later generation
>> processors, so I started with the Pentium.
>
> I'm going to sound really old now, but... I remember back in the days
> when the RAM was faster than the CPU. I remember in-order execution
> without pipelines. When I first sat down to read about how modern IA32
> processors actually work... I was pretty shocked. The circuitry seems to
> spend more time deciding how to execute the code than actually, you
> know, *executing* it! o_O
>
Me, too. I remember (I think it was back in the 386 days) that when
processors started to outpace some memory, you had to start adding wait
states to keep the processor from swamping the memory with more requests
than it could service. Now the core operates at many times the speed of
the bus. Everything with a modern multi-core CPU is I/O bound.
>> I'm on the section dealing with bus arbitration and cache coherency when
>> there are 2 processors in the system.
>
>> It occurs to me that handling the cache when there are 2 parts vying for
>> the same resource can get rather messy.
>
>> I can definitely see some potential for bottle-necks in a
>> multi-processor system when dealing with the bus, since electrically,
>> only one device can place data on the bus at one time. The nice thing
>> is, in system design, you can design your bus the same for single or
>> dual processors. Provided you've wired the proper signals together, and
>> initialized the processors properly with software and certain pin
>> levels, it's totally transparent to the rest of the system.
>
> There are two main problems with trying to connect two CPUs to one RAM
> block:
>
> 1. Bandwidth.
>
> 2. Cache coherence.
>
> Currently, RAM is way, way slower than the CPU. Adding more CPU cores
> simply makes things worse. Unless you're doing a specific type of
> compute-heavy process that doesn't require much RAM access, all you're
> doing is taking the system's main bottleneck and making it twice as bad.
> Instead of having /one/ super-fast CPU sitting around waiting for RAM to
> respond, now you have /two/ super-fast CPUs fighting over who gets to
> access RAM next.
>
> (I've seen plenty of documentation from the Haskell developers where
> they benchmark something, and then find that it goes faster if you turn
> *off* parallelism, because otherwise the multiple cores fight for
> bandwidth or constantly invalidate each other's caches.)
The solution, of course would be to break the task down into small
blocks that can fit within the CPUs' caches (good locality) and work on
units that are in areas of memory where cache lines will not overlap.
Which is all fine and good, but... not really feasible, because you are
not in charge of where your task's pages physically reside in memory.
The OS is in charge of that, and it could change the physical layout at
any moment.
> It seems inevitable that the more memory you have, the slower it is.
> Even if we completely ignore issues of cost, bigger memory = more
> address decode logic + longer traces. Longer traces mean more power,
> more latency, more capacitance, more radio interference. It seems
> inevitable that if you have a really huge memory, it will necessarily be
> very slow.
Right, because all of those elements work against you. It still amazes
me that a CPU can operate at the speeds it does and still run robustly
enough to produce good results without completely distorting the signals
it's handling. Each FET has a capacitance at the gate. They've managed
to get the total capacitance (traces, gates, etc) so tiny that it
doesn't kill the signal.
> Similarly, if you have huge numbers of processing elements all trying to
> access the same chunk of RAM through the same bus, you're splitting the
> available bandwidth many, many ways. The result isn't going to be high
> performance.
>
> My personal belief is that the future is NUMA. We already have cache
> hierarchies three or four levels deep, backed by demand-paged virtual
> memory. Let's just stop pretending that "memory" is all uniform with
> identical latency and start explicitly treating it as what it is.
>
> Indeed, sometimes I start thinking about what some kind of hyper-Harvard
> architecture would look like. For example, what happens if the machine
> stack is on-chip? (I.e., it exists in its own unique address space, and
> the circuitry for it is entirely on-chip.) What about if there were
> several on-chip memories optimised for different types of operation?
Don't some RISC processors sort of do this?
> Unfortunately, the answer to all these questions generally ends up being
> "good luck implementing multitasking".
>
> The other model I've looked at is having not a computer with one giant
> memory connected to one giant CPU, but zillions of smallish memories
> connected to zillions of processing elements. The problem with *that*
> model tends to be "how do I get my data to where I need it?"
Right.. There would need to be some sort of controller that could
transfer data from Block A to Block B. Addressing would be quite ...
interesting in this scheme.
> As someone else once said, "a supercomputer is a device for turning a
> compute-bound problem into an I/O-bound problem".
>
Pretty much. Unless you're working on something that can fit entirely
inside the processing units' caches, you'll have exactly that problem.
>> What I understand so far is: One processor has the bus, and either reads
>> or writes from/to the bus. The other processor watches the activity and,
>> if it sees an address it has modified it tells the other processor,
>> which passes bus control to the other, puts the data out on the bus,
>> then returns control to the first processor.
>
> http://en.wikipedia.org/wiki/Cache_coherence
> http://en.wikipedia.org/wiki/MESI_protocol
>
Wow. I actually recognized MESI right off: Modified, Exclusive, Shared,
Invalid. This is exactly what the Pentium uses.
> Looks like what you're describing is a "snooping" implementation. There
> are also other ways to implement this.
Yep, exactly: the LRM snoops the MRM's activity. If it sees that the MRM
is writing or reading a line that the LRM has marked as modified or
exclusive, then it signals the MRM that it needs the bus to write out
what it has.
>
> These days, the CPU is the fastest thing. You can "pipeline" read
> requests by requesting a contiguous block of data. That way, you
> eliminate some of the latency of sending a memory request for each
> datum. (And, let's face it, you only talk to main memory to fill or
> empty cache lines, which are contiguous anyway.)
>
OK, so... essentially, this is burst mode.
> I understand this also has something to do with the internal way that
> the RAM chips do two-dimensional addressing...
I can't remember the specifics, at the moment.
--
~Mike
>> I'm going to sound really old now, but... I remember back in the days
>> when the RAM was faster than the CPU. I remember in-order execution
>> without pipelines. When I first sat down to read about how modern IA32
>> processors actually work... I was pretty shocked. The circuitry seems to
>> spend more time deciding how to execute the code than actually, you
>> know, *executing* it! o_O
>
> Me, too. I remember (I think it was back in the 386 days) when the
> processors started to outpace some memory, you had to start adding wait
> states in order to keep the processor from spamming the memory with too
> many requests. Now the core operates many times the speed of the bus.
> Everything with a modern multi-core CPU is IO bound.
Well, I suppose if you want to be picky, it's not bound by I/O
operations; it's bound by RAM bandwidth and/or latency.
Current trends seem to be towards increasing RAM bandwidth to greater
and greater amounts, at the expense of also increasing latency. If your
caches and pipelines and prefetch and branch prediction manage to hide
the latency, that's fine. If they don't... SLOOOOOW!
There was a time when the CPU core was only a dozen times faster than
RAM. Those days are gone. Last I heard, if the CPU needs to access main
memory, you're talking about a 400 clock cycle stall.
At this point, increasing clock speed or adding more cores simply
increases the amount of time the CPU spends waiting. Even the faster
memory connections only increase the bandwidth /of the interface/. The
actual RAM cells aren't getting any faster.
>> (I've seen plenty of documentation from the Haskell developers where
>> they benchmark something, and then find that it goes faster if you turn
>> *off* parallelism, because otherwise the multiple cores fight for
>> bandwidth or constantly invalidate each other's caches.)
>
> The solution, of course would be to break the task down into small
> blocks that can fit within the CPUs' caches (good locality) and work on
> units that are in areas of memory where cache lines will not overlap.
Sometimes this is possible. For example, the latest releases of the
Haskell run-time system have an independent heap per CPU core, and each
core can run a garbage collection cycle of its own local heap
independently of the other cores. Since the heaps never overlap, there's
no problem here.
The *problem* of course happens when data migrates from the generation-1
area into the shared generation-2 heap area. Then suddenly you have to
start worrying about cores invalidating each other's caches and so
forth. Again, the GC engine uses a block-based system to try to minimise
this.
Another example: A Haskell thread can generate "sparks" which are tasks
which can usefully be run in the background. But the developers found
that if you let /any/ core do so, it tends to slow down rather than
speed up. Basically, it's best to keep sparks local to the core that
created them, unless there are cores actually sitting idle.
> Which is all fine and good, but ... not really feasible because you are
> not in charge of where your task's pages physically reside in memory.
> The OS is in charge of that, and it could change the physical layout at
> any moment.
Almost every OS known to Man exposes RAM to the application as a linear
range of virtual addresses. No two [writeable] pages in your
application's virtual address space will ever map to the same physical
address.
>> It seems inevitable that the more memory you have, the slower it is.
>> Even if we completely ignore issues of cost, bigger memory = more
>> address decode logic + longer traces. Longer traces mean more power,
>> more latency, more capacitance, more radio interference. It seems
>> inevitable that if you have a really huge memory, it will necessarily be
>> very slow.
>
> Right, because all of those elements work against you. It still amazes
> me that a CPU can operate at the speeds it does and still run robustly
> enough to produce good results without completely distorting the signals
> it's handling. Each FET has a capacitance at the gate. They've managed
> to get the total capacitance (traces, gates, etc) so tiny that it
> doesn't kill the signal.
Over the distances inside a chip, it's not too bad. If the signal has to
leave the chip, suddenly the distances, the capacitance, the
interference, the power requirements increase by an order of magnitude.
That's why you can have a 2.2 GHz L1 cache, but you have a piffling 0.1
GHz RAM bus.
>> Indeed, sometimes I start thinking about what some kind of hyper-Harvard
>> architecture would look like. For example, what happens if the machine
>> stack is on-chip? (I.e., it exists in its own unique address space, and
>> the circuitry for it is entirely on-chip.) What about if there were
>> several on-chip memories optimised for different types of operation?
>
> Don't some RISC processors sort of do this?
Certainly there are chips that have used some of these ideas. DSP chips
often have separate address spaces for code and data, and sometimes
multiple buses too. It tends not to be used for desktop processors though.
(Hell, anything that isn't 8086-compatible tends not to be used for
desktop processors! Which is a shame, because 8086 isn't that good...)
>> The other model I've looked at is having not a computer with one giant
>> memory connected to one giant CPU, but zillions of smallish memories
>> connected to zillions of processing elements. The problem with *that*
>> model tends to be "how do I get my data to where I need it?"
>
> Right.. There would need to be some sort of controller that could
> transfer data from Block A to Block B. Addressing would be quite ...
> interesting in this scheme.
It's quite feasible that the processing elements themselves would
organise forwarding messages to where they need to go. You don't
necessarily need dedicated switching and routing circuitry.
The trouble, of course, is that as the number of processing elements
increases, either the number of buses increases geometrically, or the
communications latency increases geometrically. One or the other.
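To put rough numbers on that trade-off, here's a back-of-the-envelope sketch (the two topologies and the node counts are illustrative assumptions, not any real interconnect): a full point-to-point mesh needs n*(n-1)/2 links, while a shared ring keeps one link per node but the worst-case hop count grows with n.

```python
# Illustrative link-count vs latency trade-off for n processing elements.

def full_mesh_links(n):
    # Every pair of nodes gets a dedicated link: n choose 2.
    return n * (n - 1) // 2

def ring_worst_hops(n):
    # Farthest node on a bidirectional ring is halfway around.
    return n // 2

for n in (2, 8, 64):
    print(n, full_mesh_links(n), ring_worst_hops(n))
# 2 nodes: 1 link / 1 hop; 8: 28 links / 4 hops; 64: 2016 links / 32 hops
```

So at 64 elements you're choosing between ~2000 buses or 32-hop worst-case latency, which is the "one or the other" squeeze.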
I gather that several supercomputers make use of AMD Opteron chips.
http://en.wikipedia.org/wiki/Jaguar_%28computer%29
The Opteron is a bit unusual:
http://en.wikipedia.org/wiki/Opteron#Multi-processor_features
http://www.amd.com/us/products/technologies/direct-connect-architecture/Pages/direct-connect-architecture.aspx
Basically, each time you add a new chip, you're adding new buses too,
increasing bandwidth. The Opteron has a limit of 8 CPUs per motherboard;
Wikipedia claims you can buy "expensive routing chips" to extend this.
>>> What I understand so far is: One processor has the bus, and either reads
>>> or writes from/to the bus. The other processor watches the activity and,
>>> if it sees an address it has modified it tells the other processor,
>>> which passes bus control to the other, puts the data out on the bus,
>>> then returns control to the first processor.
>>
>> http://en.wikipedia.org/wiki/Cache_coherence
>> http://en.wikipedia.org/wiki/MESI_protocol
>>
>
> Wow. I actually recognized MESI right off: Modified, Exclusive, Shared,
> Invalid. This is exactly what the Pentium uses.
This is exactly what almost all systems of this kind do.
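As a sketch of how those four states interact, here's a toy two-cache, single-line MESI simulator (a deliberate simplification; real implementations track many lines and handle more bus transactions than this):

```python
# Toy MESI simulator: two snooping caches, one cache line.

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Cache:
    def __init__(self, name):
        self.name, self.state = name, I

    def read(self, other):
        # On a read miss, snoop the other cache.
        if self.state == I:
            if other.state in (M, E, S):
                # The other cache holds the line; both drop to Shared
                # (a Modified line would be written back first).
                other.state = S
                self.state = S
            else:
                # Nobody else has it: we get it Exclusive.
                self.state = E
        return self.state

    def write(self, other):
        # A write needs exclusive ownership: invalidate the other copy.
        if other.state != I:
            other.state = I
        self.state = M
        return self.state

a, b = Cache("CPU0"), Cache("CPU1")
print(a.read(b))   # CPU0 reads first: Exclusive
print(b.read(a))   # CPU1 reads too: both Shared
print(b.write(a))  # CPU1 writes: Modified, CPU0's copy Invalid
print(a.state)     # Invalid
```

The invalidation on write is exactly the "tells the other processor" step described above.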
Post a reply to this message
On 03.06.2011 16:45, Invisible wrote:
> In 1982 (this is the first year in the list so far when I was actually
> *live*!) the 80286 (or "286") appeared. This was the first CPU with
> memory protection.
>
> In 1985, the 80386 ("386") came along. This was the first 32-bit
> processor. (Which is why IA32 is sometimes referred to as "i386", and
> why Linux generally refuses to work with anything older.) This was the
> first processor where the relationship between segment numbers and
> physical memory addresses is programmable rather than hard-wired. In
> other words, this is where memory pages got invented.
Not really; IIRC the first x86 processor to introduce programmable
mapping of logical addresses to physical memory was the 80286; it could
only do this on a per-segment basis though.
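As a sketch of that per-segment mapping: each segment's descriptor holds a programmable base, and a logical offset becomes physical-base-plus-offset, bounds-checked against a limit. (The selectors, bases, and limits below are made-up illustration values; real 286 descriptors also carry access-rights bits.)

```python
# Toy 286-style per-segment translation: base + offset, limit-checked.

descriptors = {
    0x08: {"base": 0x10000, "limit": 0xFFFF},  # hypothetical code segment
    0x10: {"base": 0x40000, "limit": 0x7FFF},  # hypothetical data segment
}

def translate(selector, offset):
    d = descriptors[selector]
    if offset > d["limit"]:
        # Running past the segment limit faults instead of wrapping.
        raise MemoryError("protection fault: offset past segment limit")
    return d["base"] + offset

print(hex(translate(0x10, 0x1234)))  # 0x41234
```

The 386's paging then adds a second, page-granular translation underneath this, which is what the grandparent post was describing.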
On 6/3/2011 7:45, Invisible wrote:
> (By contrast, a *real* 32-bit chip like the Motorola 68000 has registers A0
> through A7 and D0 through D7, and when you do an operation,
Which is fine if you're not trying to be assembly-language compatible.
> In short, they kept kludging more and more stuff in. Having a stack-based
> FPU register file is a stupid, stupid idea.
Not when your FPU is a separate chip from your CPU.
> But now all our software depends on this arrangement,
Not any longer. For example, I believe gcc has a command-line switch to say
"use x87 instructions" instead of doing floating-point math via the SSE
instructions.
> Aliasing the MMX registers to the FPU registers was stupid,
No, it saved chip space.
> The list goes on...
It would be nice if it was practical to throw out all software and start
over every time we had a new idea, wouldn't it? But then, everything would
be as successful as Haskell. ;-)
--
Darren New, San Diego CA, USA (PST)
"Coding without comments is like
driving without turn signals."
On 6/3/2011 7:50, Mike Raiford wrote:
> Don't some RISC processors sort of do this?
Actually, all you need to do is accept that maybe C-like/Algol-like
languages aren't the best approach to programming these types of machines.
FORTH, for example, is very happy doing bunches of work on the local stack
before going explicitly out to external memory, and I've seen designs for
very high-speed FORTH-based chips with 32 or 64 cores per chip that have
relatively little memory contention because of it.
Go to a language model that's either primitive enough (like FORTH) yet close
enough to the chip architecture that you can force the programmer to
structure the algorithm in a way that's efficient, or something high-level
enough (like Hermes or Haskell perhaps) that a Sufficiently Smart Compiler
(TM) can restructure the code appropriately.
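As a sketch of the idea, here's a toy FORTH-style evaluator (written in Python, not real FORTH): all intermediate results live on a small local stack, so a computation like (3 + 4) * 2 only touches "external memory" to fetch the program itself.

```python
# Minimal stack machine: words operate on a local stack, FORTH-style.

def run(program):
    stack = []
    for word in program:
        if word == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif word == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif word == "dup":
            stack.append(stack[-1])   # duplicate top of stack
        else:
            stack.append(int(word))   # push a literal
    return stack

print(run("3 4 + 2 *".split()))  # [14]
```

On a chip with a hardware stack per core, everything here stays on-core, which is why such designs see so little memory contention.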
> Right.. There would need to be some sort of controller that could transfer
> data from Block A to Block B. Addressing would be quite ... interesting in
> this scheme.
This is a well-known problem with lots of good solutions, depending on how
you implement the interconnections. The real problem is that lots of
algorithms are inherently sequential.
--
Darren New, San Diego CA, USA (PST)
"Coding without comments is like
driving without turn signals."
On 6/3/2011 6:48, Invisible wrote:
> I'm going to sound really old now, but... I remember back in the days when
> the RAM was faster than the CPU.
The first mainframe I worked on had memory clocked at 8x the CPU speed. 7
DMA channels for each CPU instruction executed. You could be swapping out
two processes while bringing in two processes without noticing the load.
Even the Atari 400 had RAM clocked at twice the CPU speed.
--
Darren New, San Diego CA, USA (PST)
"Coding without comments is like
driving without turn signals."