So, I've been sort of reading this:
http://download.intel.com/design/intarch/manuals/27320401.pdf
I've had a pretty good idea of how the 8086 and 8088 deal with the
system bus. But, I wanted to understand more about the later generation
processors, so I started with the Pentium.
I'm on the section dealing with bus arbitration and cache coherency when
there are 2 processors in the system. (This is the embedded version, I'm
not sure if the full version is all that different)
It occurs to me that handling the cache when there are 2 parts vying for
the same resource can get rather messy.
What I understand so far is: one processor has the bus, and either reads
or writes from/to it. The other processor watches the activity and, if
it sees an address whose contents it has modified in its own cache, it
tells the bus owner; the bus owner hands over control, the snooping
processor puts its data out on the bus, and then control returns to the
first processor.
Apparently, the bus can also be pipelined. I'm not exactly sure how this
works, but the processors then have to agree on whether the operation
actually can be put in a pipeline.
I can definitely see some potential for bottle-necks in a
multi-processor system when dealing with the bus, since electrically,
only one device can place data on the bus at one time. The nice thing
is, in system design, you can design your bus the same for single or
dual processors. Provided you've wired the proper signals together, and
initialized the processors properly with software and certain pin
levels, it's totally transparent to the rest of the system.
--
~Mike
On 02/06/2011 03:06 PM, Mike Raiford wrote:
> So, I've been sort of reading this:
>
> http://download.intel.com/design/intarch/manuals/27320401.pdf
Woooo boy, that's one big can of complexity, right there! ;-)
> I've had a pretty good idea of how the 8086 and 8088 deal with the
> system bus. But, I wanted to understand more about the later generation
> processors, so I started with the Pentium.
I'm going to sound really old now, but... I remember back in the days
when the RAM was faster than the CPU. I remember in-order execution
without pipelines. When I first sat down to read about how modern IA32
processors actually work... I was pretty shocked. The circuitry seems to
spend more time deciding how to execute the code than actually, you
know, *executing* it! o_O
> I'm on the section dealing with bus arbitration and cache coherency when
> there are 2 processors in the system.
> It occurs to me that handling the cache when there are 2 parts vying for
> the same resource can get rather messy.
> I can definitely see some potential for bottle-necks in a
> multi-processor system when dealing with the bus, since electrically,
> only one device can place data on the bus at one time. The nice thing
> is, in system design, you can design your bus the same for single or
> dual processors. Provided you've wired the proper signals together, and
> initialized the processors properly with software and certain pin
> levels, it's totally transparent to the rest of the system.
There are two main problems with trying to connect two CPUs to one RAM
block:
1. Bandwidth.
2. Cache coherence.
Currently, RAM is way, way slower than the CPU. Adding more CPU cores
simply makes things worse. Unless you're doing a specific type of
compute-heavy process that doesn't require much RAM access, all you're
doing is taking the system's main bottleneck and making it twice as bad.
Instead of having /one/ super-fast CPU sitting around waiting for RAM to
respond, now you have /two/ super-fast CPUs fighting over who gets to
access RAM next.
(I've seen plenty of documentation from the Haskell developers where
they benchmark something, and then find that it goes faster if you turn
*off* parallelism, because otherwise the multiple cores fight for
bandwidth or constantly invalidate each other's caches.)
It seems inevitable that the more memory you have, the slower it is.
Even if we completely ignore issues of cost, bigger memory = more
address decode logic + longer traces. Longer traces mean more power,
more latency, more capacitance, more radio interference. It seems
inevitable that if you have a really huge memory, it will necessarily be
very slow.
Similarly, if you have huge numbers of processing elements all trying to
access the same chunk of RAM through the same bus, you're splitting the
available bandwidth many, many ways. The result isn't going to be high
performance.
My personal belief is that the future is NUMA. We already have cache
hierarchies three or four levels deep, backed by demand-paged virtual
memory. Let's just stop pretending that "memory" is all uniform with
identical latency and start explicitly treating it as what it is.
Indeed, sometimes I start thinking about what some kind of hyper-Harvard
architecture would look like. For example, what happens if the machine
stack is on-chip? (I.e., it exists in its own unique address space, and
the circuitry for it is entirely on-chip.) What about if there were
several on-chip memories optimised for different types of operation?
Unfortunately, the answer to all these questions generally ends up being
"good luck implementing multitasking".
The other model I've looked at is having not a computer with one giant
memory connected to one giant CPU, but zillions of smallish memories
connected to zillions of processing elements. The problem with *that*
model tends to be "how do I get my data to where I need it?"
As someone else once said, "a supercomputer is a device for turning a
compute-bound problem into an I/O-bound problem".
> What I understand so far is: One processor has the bus, and either reads
> or writes from/to the bus. The other processor watches the activity and,
> if it sees an address it has modified it tells the other processor,
> which passes bus control to the other, puts the data out on the bus,
> then returns control to the first processor.
http://en.wikipedia.org/wiki/Cache_coherence
http://en.wikipedia.org/wiki/MESI_protocol
Looks like what you're describing is a "snooping" implementation. There
are also other ways to implement this.
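For a feel of how MESI works, here's a toy state machine (my own simplification; real implementations also generate the bus transactions and write-backs that this table only hints at in comments):

```python
# Toy MESI cache-line state machine: Modified, Exclusive, Shared, Invalid.
# One table for the cache performing an access, one for a cache snooping
# someone else's bus traffic. Simplified sketch, not the Pentium's exact
# implementation.

LOCAL = {
    # (state, event) -> next state for the cache doing the access
    ("I", "read"):  "E",   # assuming no other cache holds the line (else "S")
    ("I", "write"): "M",
    ("E", "read"):  "E",
    ("E", "write"): "M",   # silent upgrade: no bus traffic needed
    ("S", "read"):  "S",
    ("S", "write"): "M",   # must first invalidate the other copies
    ("M", "read"):  "M",
    ("M", "write"): "M",
}

SNOOP = {
    # (state, observed bus event) -> next state for the snooping cache
    ("M", "bus_read"):  "S",   # write dirty data back first, then share
    ("M", "bus_write"): "I",   # write dirty data back, then invalidate
    ("E", "bus_read"):  "S",
    ("E", "bus_write"): "I",
    ("S", "bus_read"):  "S",
    ("S", "bus_write"): "I",
    ("I", "bus_read"):  "I",
    ("I", "bus_write"): "I",
}

def step(state, event):
    table = SNOOP if event.startswith("bus_") else LOCAL
    return table[(state, event)]

# A dirty line gets demoted to Shared when the other CPU reads it:
assert step("M", "bus_read") == "S"
```

That last transition is exactly the scenario described above: the snooping processor holds modified data, sees the other one read that address, and has to intervene.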
> Apparently, the bus can also be pipelined. I'm not exactly sure how this
> works, but the processors then have to agree on whether the operation
> actually can be put in a pipeline.
Again, when I learned this stuff, RAM was the fastest thing in the
system. You just send the address you want on the address bus and read
back (or write) the data on the data bus.
These days, the CPU is the fastest thing. You can "pipeline" read
requests by requesting a contiguous block of data. That way, you
eliminate some of the latency of sending a memory request for each
datum. (And, let's face it, you only talk to main memory to fill or
empty cache lines, which are contiguous anyway.)
I understand this also has something to do with the internal way that
the RAM chips do two-dimensional addressing...
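The win from bursting is easy to see with some made-up numbers (purely illustrative, not real timings for any part):

```python
# Back-of-envelope: amortising request latency over a burst.
# SETUP and PER_WORD are invented numbers for illustration only.

SETUP = 40      # cycles to send an address and wait for the first data
PER_WORD = 2    # cycles per data word once the transfer is streaming

def cycles(words, burst):
    """Total cycles to read `words` data words.
    burst=False: a separate request (and full setup cost) per word.
    burst=True:  one request, then the words stream back-to-back."""
    if burst:
        return SETUP + words * PER_WORD
    return words * (SETUP + PER_WORD)

# Filling an 8-word cache line:
assert cycles(8, burst=False) == 336
assert cycles(8, burst=True) == 56    # setup cost paid once, not 8 times
```

Since cache-line fills are contiguous anyway, nearly every main-memory access gets this amortisation for free.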
OK... something I don't get. Presumably, it has some weird thing to do
with alignment.
A bit of background first: the Pentium has a 64-bit data bus. What this
means is that the lower three bits of the address bus have been dropped;
the processor really only addresses memory on 8-byte boundaries. But it
has a way of getting around that: 8 pins (BE0# through BE7#) that act as
a mask. If it only needs the first four bytes, it can pull low just the
first four BE# pins. This tells the system's logic which byte lanes it
wants, and makes accessing smaller chunks of memory much more efficient
if, say, the RAM is composed of modules with an 8-bit data bus. Oh, by
the way... this is why it was critically important to make sure your
memory was installed in pairs. ;) You see, back in the day DRAM was sold
in 32-bit modules. With only one module installed, the data bus was only
32 bits wide (*IF* the motherboard actually supported that
configuration) but, with both....
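The BE# masking can be sketched like this (my own simplification; it ignores accesses that spill past the 8-byte line, which need a second bus cycle):

```python
# Sketch: which BE# pins a CPU with a 64-bit data bus would assert.
# BE# pins are active low; here, bit i set in the result means BEi# is
# pulled low, i.e. byte lane i participates in the transfer.

def byte_enables(address, size):
    offset = address & 7            # A0-A2 never leave the chip
    lanes = ((1 << size) - 1) << offset
    return lanes & 0xFF             # lanes beyond the line: second cycle

# 4-byte access at an 8-byte boundary: BE0#-BE3# active
assert byte_enables(0x1000, 4) == 0b0000_1111
# 2-byte access at offset 6: BE6#-BE7# active
assert byte_enables(0x1006, 2) == 0b1100_0000
```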
But here's the part that isn't making sense to me:
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | Length of Transfer       | 1 Byte |                       2 Bytes                         |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | Low Order Address        |  xxx   | 000  | 001  | 010  | 011  | 100  | 101  | 110  | 111  |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | 1st transfer             |   b    |  w   |  w   |  w   |  hb  |  w   |  w   |  w   |  hb  |
> | Value driven on A3       |        |  0   |  0   |  0   |  0   |  0   |  0   |  0   |  1   |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
> | 2nd transfer (if needed) |        |      |      |      |  lb  |      |      |      |  lb  |
> | Byte enables driven      |        |      |      |      | BE3# |      |      |      | BE7# |
> | Value driven on A3       |        |      |      |      |  0   |      |      |      |  0   |
> +--------------------------+--------+------+------+------+------+------+------+------+------+
>
> +--------------------------+------+------+------+------+------+------+------+------+
> | Length of Transfer       |                      4 Bytes                          |
> +--------------------------+------+------+------+------+------+------+------+------+
> | Low Order Address        | 000  | 001  | 010  | 011  | 100  | 101  | 110  | 111  |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 1st transfer             |  d   |  hb  |  hw  |  h3  |  d   |  hb  |  hw  |  h3  |
> | Value driven on A3       |  0   |  0   |  0   |  0   |  0   |  1   |  1   |  1   |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 2nd transfer (if needed) |      |  l3  |  lw  |  lb  |      |  l3  |  lw  |  lb  |
> | Value driven on A3       |      |  0   |  0   |  0   |      |  0   |  0   |  0   |
> +--------------------------+------+------+------+------+------+------+------+------+
>
> +--------------------------+------+------+------+------+------+------+------+------+
> | Length of Transfer       |                      8 Bytes                          |
> +--------------------------+------+------+------+------+------+------+------+------+
> | Low Order Address        | 000  | 001  | 010  | 011  | 100  | 101  | 110  | 111  |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 1st transfer             |  q   |  hb  |  hw  |  h3  |  hd  |  h5  |  h6  |  h7  |
> | Value driven on A3       |  0   |  1   |  1   |  1   |  1   |  1   |  1   |  1   |
> +--------------------------+------+------+------+------+------+------+------+------+
> | 2nd transfer (if needed) |      |  l7  |  l6  |  l5  |  ld  |  l3  |  lw  |  lb  |
> | Value driven on A3       |      |  0   |  0   |  0   |  0   |  0   |  0   |  0   |
> +--------------------------+------+------+------+------+------+------+------+------+
>
> Key:
>
> b = byte transfer    w = 2-byte transfer   3 = 3-byte transfer   d = 4-byte transfer
> 5 = 5-byte transfer  6 = 6-byte transfer   7 = 7-byte transfer   q = 8-byte transfer
> h = high order       l = low order
>
> 8-byte operand:
>
> +-----------------+--------+--------+--------+--------+--------+--------+----------------+
> | high order byte | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | low order byte |
> +-----------------+--------+--------+--------+--------+--------+--------+----------------+
OK, so... in this table you can see when the operand is a word it
requires a second transfer if the low order bits of the address are 3 or 7.
7 is understandable; after all, you're crossing an 8-byte boundary
there. But 3 is utterly baffling.
From this chart, when dealing with word-length operands, I can surmise:
Address = # transfers for word
0x12345678 = 1 transfer
0x12345679 = 1 transfer
0x1234567A = 1 transfer
0x1234567B = 2 transfers
0x1234567C = 1 transfer
0x1234567D = 1 transfer
0x1234567E = 1 transfer
0x1234567F = 2 transfers
To me it makes no sense. xxxxB should be able to transfer in one go. Why
wouldn't it?
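For what it's worth, one reading that does reproduce the chart: the split happens at every 4-byte (dword) boundary, not just at the 8-byte bus boundary; the same pattern shows up in the 4-byte rows of the table, too. A sketch under that assumption:

```python
# Hypothesis (mine, from the table): the Pentium splits any transfer
# that crosses a 4-byte boundary, not only an 8-byte one.

def word_transfers(address):
    offset = address & 7
    # a word occupies bytes `offset` and `offset + 1`; they straddle a
    # multiple of 4 exactly when offset % 4 == 3
    return 2 if offset % 4 == 3 else 1

pattern = [word_transfers(0x12345678 + i) for i in range(8)]
assert pattern == [1, 1, 1, 2, 1, 1, 1, 2]   # matches the chart above
```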
--
~Mike
On 03/06/2011 02:48 PM, Invisible wrote:
> Woooo boy, that's one big can of complexity, right there! ;-)
I've said it before, and I'll say it again: The IA32 platform is one
huge stack of backwards compatibility kludges.
The story begins (arguably) with the Intel 8008, released in 1972 or so.
(!!) It consisted of 3,500 transistors, and was manufactured on a 10 μm
PMOS process. It ran at 500 kHz. (That's 0.5 MHz or 0.0005 GHz.) It had
a grand total of 18 pins (despite the 14-bit address space). It featured
seven registers, A through E plus H and L, all 8 bits wide.
Then came the 8080 (around 1974), with a 2 MHz maximum clock speed. This
had seven registers, named A, B, C, D, E, H, and L, all 8 bits wide.
Certain pairs (BC, DE, HL) could be used together by certain
instructions to perform 16-bit operations.
(The world-famous Z80 processor is an enhanced version of the 8080.)
Now, finally, in 1978 (!) we arrive at the 8086. The A, B, C and D
registers are renamed AL, BL, CL and DL, and new registers called AH,
BH, CH and DH were added. These are all 8-bit, and can be combined in
pairs to form the 16-bit AX, BX, CX and DX registers.
Also new was the infamous memory segmentation model. Under this bizarre
scheme, there are four "segment pointer" registers which select which
"segment" you access data from. But because the segment offsets are
16-bit, the segments actually overlap, so there are multiple ways to
refer to the same physical address.
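The overlap is easy to demonstrate; real-mode translation is just physical = segment * 16 + offset:

```python
# Real-mode (8086) address translation: the 16-bit segment is shifted
# left 4 bits and added to the 16-bit offset, giving a 20-bit address.

def physical(segment, offset):
    return (segment << 4) + offset

# Two different segment:offset pairs, one physical address:
assert physical(0x1234, 0x0005) == 0x12345
assert physical(0x1000, 0x2345) == 0x12345
# In fact, a given physical address can have up to 4096 aliases.
```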
Basically, this is a huge kludge. Rather than implementing real 32-bit
addressing, they kludged in 20-bit addressing. While not /completely/
without merit (e.g., this whole segment malarkey makes relocatable code
quite a bit easier), it's really a bad solution.
Not content with that, Intel developed the 8087, the FPU to go with the
8086 CPU. Unlike any sane design, this FPU has 8 registers, but you
cannot access them directly. Instead, they function as a "stack". Math
operations "pop" their operands from the top and "push" the result back
on. If you want to access something lower down, you have to FXCH
instructions to swap the top register's contents with one of the
registers lower down.
In later generations of chip, the registers are mapped in hardware with
pointers, and two parallel instruction pipelines allow you to optimise
FXCH down to zero clock cycles (effectively). But still, WTF?
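A toy model of the stack discipline (not real x87 semantics in every detail, but it shows why FXCH is needed to reach anything below the top):

```python
# Toy model of the x87 register stack: arithmetic only touches the top
# of stack; FXCH swaps ST(0) with a deeper register to make it reachable.

class X87Stack:
    def __init__(self):
        self.st = []                  # st[0] models ST(0), the top

    def fld(self, value):             # push a value onto the stack
        self.st.insert(0, value)

    def faddp(self):                  # pop two operands, push their sum
        a, b = self.st.pop(0), self.st.pop(0)
        self.st.insert(0, a + b)

    def fxch(self, i):                # swap ST(0) with ST(i)
        self.st[0], self.st[i] = self.st[i], self.st[0]

s = X87Stack()
s.fld(1.0); s.fld(2.0); s.fld(4.0)    # stack (top first): 4.0, 2.0, 1.0
s.fxch(2)                             # bring the buried 1.0 to the top
s.faddp()                             # adds 1.0 + 2.0
assert s.st == [3.0, 4.0]
```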
In 1982 (this is the first year in the list so far when I was actually
*alive*!) the 80286 (or "286") appeared. This was the first CPU in the
family with memory protection.
In 1985, the 80386 ("386") came along. This was the first 32-bit x86
processor. (Which is why IA32 is sometimes referred to as "i386", and
why Linux generally refuses to work with anything older.) This was also
the first processor in the family where the relationship between the
addresses a program uses and physical memory addresses is programmable
rather than hard-wired. In other words, this is where x86 memory paging
got invented.
The 386 inherits all of the registers from the 286 (i.e., AL, AH, AX,
BL, BH, BX, etc.) But AX is a 16-bit register. So the 386 adds a new
register, EAX, which is 32 bits. AX is the bottom 16 bits of EAX.
Similarly for B, C and D.
(By contrast, a *real* 32-bit chip like the Motorola 68000 has registers
A0 through A7 and D0 through D7, and when you do an operation, you
specify how many bits to use, e.g., move.l d3,d7. None of this stupidity
with multiple names for the same register.)
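The aliasing scheme amounts to treating the named registers as bit slices of one value. A sketch (modelling only the AX and AH slices; note that real AMD64 additionally zero-extends 32-bit writes into the full 64-bit register, which this ignores):

```python
# AL/AH/AX/EAX/RAX as bit slices of a single 64-bit value.
# Writing AX changes only the low 16 bits; the rest is untouched.

def write_ax(rax, value):
    return (rax & ~0xFFFF) | (value & 0xFFFF)

def read_eax(rax):
    return rax & 0xFFFF_FFFF          # EAX is the low 32 bits

def read_ah(rax):
    return (rax >> 8) & 0xFF          # AH is bits 8-15

rax = 0x1122_3344_5566_7788
rax = write_ax(rax, 0xABCD)
assert rax == 0x1122_3344_5566_ABCD   # upper 48 bits preserved
assert read_eax(rax) == 0x5566_ABCD
assert read_ah(rax) == 0xAB
```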
When AMD64 eventually came along, these became 64-bit registers RAX,
RBX, etc., of which EAX, EBX, etc. are the lower 32-bits. (AMD64 also
adds completely new 64-bit registers R8 through R15, just for good
measure. You would expect this to result in an utterly massive speed
increase, but apparently people have measured it at less than 2%.)
If I say 80486, you probably think of some ancient old thing. But it was
the first chip in the family to include an on-chip L1 cache, and its
tightly pipelined design got close to sustaining one instruction per
clock cycle. (True superscalar execution, issuing more than one
instruction per clock, arrived with the Pentium.) In the form of the
486DX, it was also the first one with an on-chip FPU.
In 1996, the Pentium MMX arrived. MMX stands for "multimedia
extensions". (Remember, in the 1990s, "multimedia" was the wave of the
future that was going to take over the world...) What this actually
*does* is it adds SIMD (single-instruction, multiple-data) instructions.
Basically, there are 8 new registers, MM0 through MM7, each 64 bits
wide. Using the new MMX instructions, you can treat a given MMX register
as an array of values (e.g., 4 elements of 16 bits each) and do
element-wise operations over them.
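The element-wise idea, sketched in plain code (this is what a packed-add instruction like PADDW does in a single operation):

```python
# Packed 16-bit addition on a 64-bit register: 4 lanes, each wrapping
# independently at 16 bits. Plain-Python sketch of the MMX PADDW idea.

def paddw(a, b):
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & 0xFFFF
        y = (b >> shift) & 0xFFFF
        result |= ((x + y) & 0xFFFF) << shift   # wrap within the lane
    return result

a = 0x0001_0002_0003_FFFF
b = 0x0001_0001_0001_0001
assert paddw(a, b) == 0x0002_0003_0004_0000   # low lane wraps to 0
```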
Nothing wrong with that.
Oh yeah, and the MMX registers are the FPU registers.
KLUDGE! >_<
Yes, rather than add 8 *new* registers, MMX just adds new names for the
existing FPU registers. (But now you have proper random access to them.)
The reason for this is simple: it means that the OS doesn't have to
support MMX for context switches to work properly. (I.e., a context
switch under an OS that doesn't know about MMX won't clobber the MMX
registers, because the MMX registers *are* the FPU registers, which any
FPU-aware OS will already be preserving.)
This horrifying kludge still haunts us to this day. Yes, it meant that
developers could start using MMX without having to wait for Microsoft to
release an updated version of Windows. On the other hand, it means you
can't use MMX (which is integer-only) and normal FPU operations at the
same time, because one will clobber the other. FAIL!
In 1998, AMD released the 3DNow! technology that almost nobody now
remembers. This basically adds new MMX operations, using the same "MMX
registers that are really the FPU registers" kludge, for the same reason.
Apparently 3DNow! was never that popular, and is being phased out now.
Instead, Intel came up with SSE, which adds new registers named XMM.
(Get it?) Yes, that's right, *finally* they actually added new registers
rather than kludging old ones. These new XMM registers are 128 bits
wide, and can be operated as 4 x single-precision floats.
Then SSE2 came along, and added versions of all the MMX instructions
that work on the XMM registers instead. So now you never need to use the
old MMX instructions ever again! And now the XMM registers can be
treated not just as 4 x single-precision floats, but also as (say) 2 x
double-precision floats, 8 x 16-bit integers, and so on.
Of course, since the OS has to know to include the new XMM registers in
context switches, SSE is disabled by default. The OS has to explicitly
enable it before it will work. A bit like the way the processor starts
up in "real mode" (i.e., 8086 emulation mode), and the OS has to
manually switch it into "protected mode" (i.e., the normal operating
mode that all modern software actually freaking uses) during the boot
sequence.
Then we have AMD64, which runs in 32-bit mode by default until the OS
switches it to 64-bit mode. (The PC I am using right now is *still*
running in 32-bit mode, despite possessing a 64-bit processor.) In that
mode, an extra 8 XMM registers appear (XMM8 to XMM15).
Did you follow all that?
Don't even get me started on all the different memory paging and
segmentation schemes...
In short, they kept kludging more and more stuff in. Having a
stack-based FPU register file is a stupid, stupid idea. But now all our
software depends on this arrangement, so we're stuck with it forever.
Aliasing the MMX registers to the FPU registers was stupid, but
fortunately we don't have to live with that one. Memory segmentation was
stupid, but now we're basically stuck with it. The list goes on...
On 6/3/2011 8:48 AM, Invisible wrote:
> On 02/06/2011 03:06 PM, Mike Raiford wrote:
>> So, I've been sort of reading this:
>>
>> http://download.intel.com/design/intarch/manuals/27320401.pdf
>
> Woooo boy, that's one big can of complexity, right there! ;-)
>
>> I've had a pretty good idea of how the 8086 and 8088 deal with the
>> system bus. But, I wanted to understand more about the later generation
>> processors, so I started with the Pentium.
>
> I'm going to sound really old now, but... I remember back in the days
> when the RAM was faster than the CPU. I remember in-order execution
> without pipelines. When I first sat down to read about how modern IA32
> processors actually work... I was pretty shocked. The circuitry seems to
> spend more time deciding how to execute the code than actually, you
> know, *executing* it! o_O
>
Me, too. I remember (I think it was back in the 386 days) that when
processors started to outpace some memory, you had to start adding wait
states to keep the processor from swamping the memory with more requests
than it could service. Now the core operates at many times the speed of
the bus. Everything with a modern multi-core CPU is I/O bound.
>> I'm on the section dealing with bus arbitration and cache coherency when
>> there are 2 processors in the system.
>
>> It occurs to me that handling the cache when there are 2 parts vying for
>> the same resource can get rather messy.
>
>> I can definitely see some potential for bottle-necks in a
>> multi-processor system when dealing with the bus, since electrically,
>> only one device can place data on the bus at one time. The nice thing
>> is, in system design, you can design your bus the same for single or
>> dual processors. Provided you've wired the proper signals together, and
>> initialized the processors properly with software and certain pin
>> levels, it's totally transparent to the rest of the system.
>
> There are two main problems with trying to connect two CPUs to one RAM
> block:
>
> 1. Bandwidth.
>
> 2. Cache coherence.
>
> Currently, RAM is way, way slower than the CPU. Adding more CPU cores
> simply makes things worse. Unless you're doing a specific type of
> compute-heavy process that doesn't require much RAM access, all you're
> doing is taking the system's main bottleneck and making it twice as bad.
> Instead of having /one/ super-fast CPU sitting around waiting for RAM to
> respond, now you have /two/ super-fast CPUs fighting over who gets to
> access RAM next.
>
> (I've seen plenty of documentation from the Haskell developers where
> they benchmark something, and then find that it goes faster if you turn
> *off* parallelism, because otherwise the multiple cores fight for
> bandwidth or constantly invalidate each other's caches.)
The solution, of course would be to break the task down into small
blocks that can fit within the CPUs' caches (good locality) and work on
units that are in areas of memory where cache lines will not overlap.
Which is all fine and good, but... not really feasible, because you are
not in charge of where your task's pages physically reside in memory.
The OS is in charge of that, and it could change the physical layout at
any moment.
> It seems inevitable that the more memory you have, the slower it is.
> Even if we completely ignore issues of cost, bigger memory = more
> address decode logic + longer traces. Longer traces mean more power,
> more latency, more capacitance, more radio interference. It seems
> inevitable that if you have a really huge memory, it will necessarily be
> very slow.
Right, because all of those elements work against you. It still amazes
me that a CPU can operate at the speeds it does and still run robustly
enough to produce good results without completely distorting the signals
it's handling. Each FET has a capacitance at the gate. They've managed
to get the total capacitance (traces, gates, etc) so tiny that it
doesn't kill the signal.
> Similarly, if you have huge numbers of processing elements all trying to
> access the same chunk of RAM through the same bus, you're splitting the
> available bandwidth many, many ways. The result isn't going to be high
> performance.
>
> My personal belief is that the future is NUMA. We already have cache
> hierarchies three or four levels deep, backed by demand-paged virtual
> memory. Let's just stop pretending that "memory" is all uniform with
> identical latency and start explicitly treating it as what it is.
>
> Indeed, sometimes I start thinking about what some kind of hyper-Harvard
> architecture would look like. For example, what happens if the machine
> stack is on-chip? (I.e., it exists in its own unique address space, and
> the circuitry for it is entirely on-chip.) What about if there were
> several on-chip memories optimised for different types of operation?
Don't some RISC processors sort of do this?
> Unfortunately, the answer to all these questions generally ends up being
> "good luck implementing multitasking".
>
> The other model I've looked at is having not a computer with one giant
> memory connected to one giant CPU, but zillions of smallish memories
> connected to zillions of processing elements. The problem with *that*
> model tends to be "how do I get my data to where I need it?"
Right.. There would need to be some sort of controller that could
transfer data from Block A to Block B. Addressing would be quite ...
interesting in this scheme.
> As someone else once said, "a supercomputer is a device for turning a
> compute-bound problem into an I/O-bound problem".
>
Pretty much. Unless you're working on something that can fit entirely
inside the processing units' caches, you'll have exactly that problem.
>> What I understand so far is: One processor has the bus, and either reads
>> or writes from/to the bus. The other processor watches the activity and,
>> if it sees an address it has modified it tells the other processor,
>> which passes bus control to the other, puts the data out on the bus,
>> then returns control to the first processor.
>
> http://en.wikipedia.org/wiki/Cache_coherence
> http://en.wikipedia.org/wiki/MESI_protocol
>
Wow. I actually recognized MESI right off: Modified, Exclusive, Shared,
Invalid. This is exactly what the Pentium uses.
> Looks like what you're describing is a "snooping" implementation. There
> are also other ways to implement this.
Yep, exactly: the LRM snoops the MRM's activity. If it sees that the MRM
is writing or reading a line that the LRM has marked as modified or
exclusive, then it signals the MRM that it needs the bus to write out
what it has.
>
> These days, the CPU is the fastest thing. You can "pipeline" read
> requests by requesting a contiguous block of data. That way, you
> eliminate some of the latency of sending a memory request for each
> datum. (And, let's face it, you only talk to main memory to fill or
> empty cache lines, which are contiguous anyway.)
>
OK, so... essentially, this is burst mode.
> I understand this also has something to do with the internal way that
> the RAM chips do two-dimensional addressing...
I can't remember the specifics, at the moment.
--
~Mike
>> I'm going to sound really old now, but... I remember back in the days
>> when the RAM was faster than the CPU. I remember in-order execution
>> without pipelines. When I first sat down to read about how modern IA32
>> processors actually work... I was pretty shocked. The circuitry seems to
>> spend more time deciding how to execute the code than actually, you
>> know, *executing* it! o_O
>
> Me, too. I remember (I think it was back in the 386 days) when the
> processors started to outpace some memory, you had to start adding wait
> states in order to keep the processor from spamming the memory with too
> many requests. Now the core operates many times the speed of the bus.
> Everything with a modern multi-core CPU is IO bound.
Well, I suppose if you want to be picky, it's not bound by I/O
operations; it's bound by RAM bandwidth and/or latency.
Current trends seem to be towards increasing RAM bandwidth to greater
and greater amounts, at the expense of also increasing latency. If your
caches and pipelines and prefetch and branch prediction manage to hide
the latency, that's fine. If they don't... SLOOOOOW!
There was a time when the CPU core was only a dozen times faster than
RAM. Those days are gone. Last I heard, if the CPU needs to access main
memory, you're talking about a 400 clock cycle stall.
At this point, increasing clock speed or adding more cores simply
increases the amount of time the CPU spends waiting. Even the faster
memory connections only increase the bandwidth /of the interface/. The
actual RAM cells aren't getting any faster.
>> (I've seen plenty of documentation from the Haskell developers where
>> they benchmark something, and then find that it goes faster if you turn
>> *off* parallelism, because otherwise the multiple cores fight for
>> bandwidth or constantly invalidate each other's caches.)
>
> The solution, of course would be to break the task down into small
> blocks that can fit within the CPUs' caches (good locality) and work on
> units that are in areas of memory where cache lines will not overlap.
Sometimes this is possible. For example, the latest releases of the
Haskell run-time system have an independent heap per CPU core, and each
core can run a garbage collection cycle of its own local heap
independently of the other cores. Since the heaps never overlap, there's
no problem here.
The *problem* of course happens when data migrates from the generation-1
area into the shared generation-2 heap area. Then suddenly you have to
start worrying about cores invalidating each other's caches and so
forth. Again, the GC engine uses a block-based system to try to minimise
this.
Another example: A Haskell thread can generate "sparks" which are tasks
which can usefully be run in the background. But the developers found
that if you let /any/ core do so, it tends to slow down rather than
speed up. Basically, it's best to keep sparks local to the core that
created them, unless there are cores actually sitting idle.
> Which is all fine and good, but ... not really feasible because you are
> not in charge of where your task's pages physically reside in memory.
> The OS is in charge of that, and it could change the physical layout at
> any moment.
Almost every OS known to Man exposes RAM to the application as a linear
range of virtual addresses. No two [writeable] pages in your
application's virtual address space will ever map to the same physical
address.
>> It seems inevitable that the more memory you have, the slower it is.
>> Even if we completely ignore issues of cost, bigger memory = more
>> address decode logic + longer traces. Longer traces mean more power,
>> more latency, more capacitance, more radio interference. It seems
>> inevitable that if you have a really huge memory, it will necessarily be
>> very slow.
>
> Right, because all of those elements work against you. It still amazes
> me that a CPU can operate at the speeds it does and still run robustly
> enough to produce good results without completely distorting the signals
> it's handling. Each FET has a capacitance at the gate. They've managed
> to get the total capacitance (traces, gates, etc) so tiny that it
> doesn't kill the signal.
Over the distances inside a chip, it's not too bad. If the signal has to
leave the chip, suddenly the distances, the capacitance, the
interference, the power requirements increase by an order of magnitude.
That's why you can have a 2.2 GHz L1 cache, but you have a piffling 0.1
GHz RAM bus.
>> Indeed, sometimes I start thinking about what some kind of hyper-Harvard
>> architecture would look like. For example, what happens if the machine
>> stack is on-chip? (I.e., it exists in its own unique address space, and
>> the circuitry for it is entirely on-chip.) What about if there were
>> several on-chip memories optimised for different types of operation?
>
> Don't some RISC processors sort of do this?
Certainly there are chips that have used some of these ideas. DSP chips
often have separate address spaces for code and data, and sometimes
multiple buses too. It tends not to be used for desktop processors though.
(Hell, anything that isn't 8086-compatible tends not to be used for
desktop processors! Which is a shame, because 8086 isn't that good...)
>> The other model I've looked at is having not a computer with one giant
>> memory connected to one giant CPU, but zillions of smallish memories
>> connected to zillions of processing elements. The problem with *that*
>> model tends to be "how do I get my data to where I need it?"
>
> Right.. There would need to be some sort of controller that could
> transfer data from Block A to Block B. Addressing would be quite ...
> interesting in this scheme.
It's quite feasible that the processing elements themselves would
organise forwarding messages to where they need to go. You don't
necessarily need dedicated switching and routing circuitry.
The trouble, of course, is that as the number of processing elements
increases, either the number of buses increases geometrically, or the
communications latency increases geometrically. One or the other.
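To put rough numbers on that trade-off, here's a back-of-the-envelope sketch (the two topologies and the node counts are illustrative assumptions, not any real interconnect): a full point-to-point mesh needs n*(n-1)/2 links, while a shared ring keeps one link per node but the worst-case hop count grows with n.

```python
# Illustrative link-count vs latency trade-off for n processing elements.

def full_mesh_links(n):
    # Every pair of nodes gets a dedicated link: n choose 2.
    return n * (n - 1) // 2

def ring_worst_hops(n):
    # Farthest node on a bidirectional ring is halfway around.
    return n // 2

for n in (2, 8, 64):
    print(n, full_mesh_links(n), ring_worst_hops(n))
# 2 nodes: 1 link / 1 hop; 8: 28 links / 4 hops; 64: 2016 links / 32 hops
```

So at 64 elements you're choosing between ~2000 buses or 32-hop worst-case latency, which is the "one or the other" squeeze.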
I gather that several supercomputers make use of AMD Opteron chips.
http://en.wikipedia.org/wiki/Jaguar_%28computer%29
The Opteron is a bit unusual:
http://en.wikipedia.org/wiki/Opteron#Multi-processor_features
http://www.amd.com/us/products/technologies/direct-connect-architecture/Pages/direct-connect-architecture.aspx
Basically, each time you add a new chip, you're adding new buses too,
increasing bandwidth. The Opteron has a limit of 8 CPUs per motherboard;
Wikipedia claims you can buy "expensive routing chips" to extend this.
>>> What I understand so far is: One processor has the bus, and either reads
>>> or writes from/to the bus. The other processor watches the activity and,
>>> if it sees an address it has modified it tells the other processor,
>>> which passes bus control to the other, puts the data out on the bus,
>>> then returns control to the first processor.
>>
>> http://en.wikipedia.org/wiki/Cache_coherence
>> http://en.wikipedia.org/wiki/MESI_protocol
>>
>
> Wow. I actually recognized MESI right off: Modified, Exclusive, Shared,
> Invalid. This is exactly what the Pentium uses.
This is exactly what almost all systems of this kind do.
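As a sketch of how those four states interact, here's a toy two-cache, single-line MESI simulator (a deliberate simplification; real implementations track many lines and handle more bus transactions than this):

```python
# Toy MESI simulator: two snooping caches, one cache line.

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Cache:
    def __init__(self, name):
        self.name, self.state = name, I

    def read(self, other):
        # On a read miss, snoop the other cache.
        if self.state == I:
            if other.state in (M, E, S):
                # The other cache holds the line; both drop to Shared
                # (a Modified line would be written back first).
                other.state = S
                self.state = S
            else:
                # Nobody else has it: we get it Exclusive.
                self.state = E
        return self.state

    def write(self, other):
        # A write needs exclusive ownership: invalidate the other copy.
        if other.state != I:
            other.state = I
        self.state = M
        return self.state

a, b = Cache("CPU0"), Cache("CPU1")
print(a.read(b))   # CPU0 reads first: Exclusive
print(b.read(a))   # CPU1 reads too: both Shared
print(b.write(a))  # CPU1 writes: Modified, CPU0's copy Invalid
print(a.state)     # Invalid
```

The invalidation on write is exactly the "tells the other processor" step described above.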
Post a reply to this message
On 03.06.2011 16:45, Invisible wrote:
> In 1982 (this is the first year in the list so far when I was actually
> *live*!) the 80286 (or "286") appeared. This was the first CPU with
> memory protection.
>
> In 1985, the 80386 ("386") came along. This was the first 32-bit
> processor. (Which is why IA32 is sometimes referred to as "i386", and
> why Linux generally refuses to work with anything older.) This was the
> first processor where the relationship between segment numbers and
> physical memory addresses is programmable rather than hard-wired. In
> other words, this is where memory pages got invented.
Not really; IIRC the first x86 processor to introduce programmable
mapping of logical addresses to physical memory was the 80286; it could
only do this on a per-segment basis though.
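As a sketch of that per-segment mapping: each segment's descriptor holds a programmable base, and a logical offset becomes physical-base-plus-offset, bounds-checked against a limit. (The selectors, bases, and limits below are made-up illustration values; real 286 descriptors also carry access-rights bits.)

```python
# Toy 286-style per-segment translation: base + offset, limit-checked.

descriptors = {
    0x08: {"base": 0x10000, "limit": 0xFFFF},  # hypothetical code segment
    0x10: {"base": 0x40000, "limit": 0x7FFF},  # hypothetical data segment
}

def translate(selector, offset):
    d = descriptors[selector]
    if offset > d["limit"]:
        # Running past the segment limit faults instead of wrapping.
        raise MemoryError("protection fault: offset past segment limit")
    return d["base"] + offset

print(hex(translate(0x10, 0x1234)))  # 0x41234
```

The 386's paging then adds a second, page-granular translation underneath this, which is what the grandparent post was describing.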
On 6/3/2011 7:45, Invisible wrote:
> (By contrast, a *real* 32-bit chip like the Motorola 68000 has registers A0
> through A7 and D0 through D7, and when you do an operation,
Which is fine if you're not trying to be assembly-language compatible.
> In short, they kept kludging more and more stuff in. Having a stack-based
> FPU register file is a stupid, stupid idea.
Not when your FPU is a separate chip from your CPU.
> But now all our software depends on this arrangement,
Not any longer. For example, I believe gcc has a command-line switch to say
"use x87 instructions" instead of doing floating-point math via the SSE
instructions.
> Aliasing the MMX registers to the FPU registers was stupid,
No, it saved chip space.
> The list goes on...
It would be nice if it was practical to throw out all software and start
over every time we had a new idea, wouldn't it? But then, everything would
be as successful as Haskell. ;-)
--
Darren New, San Diego CA, USA (PST)
"Coding without comments is like
driving without turn signals."
On 6/3/2011 7:50, Mike Raiford wrote:
> Don't some RISC processors sort of do this?
Actually, all you need to do is accept that maybe C-like/Algol-like
languages aren't the best approach to programming these types of machines.
FORTH, for example, is very happy doing bunches of work on the local stack
before going explicitly out to external memory, and I've seen designs for
very high-speed FORTH-based chips with 32 or 64 cores per chip that have
relatively little memory contention because of it.
Go to a language model that's either primitive enough (like FORTH) yet close
enough to the chip architecture that you can force the programmer to
structure the algorithm in a way that's efficient, or something high-level
enough (like Hermes or Haskell perhaps) that a Sufficiently Smart Compiler
(TM) can restructure the code appropriately.
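As a sketch of the idea, here's a toy FORTH-style evaluator (written in Python, not real FORTH): all intermediate results live on a small local stack, so a computation like (3 + 4) * 2 only touches "external memory" to fetch the program itself.

```python
# Minimal stack machine: words operate on a local stack, FORTH-style.

def run(program):
    stack = []
    for word in program:
        if word == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif word == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif word == "dup":
            stack.append(stack[-1])   # duplicate top of stack
        else:
            stack.append(int(word))   # push a literal
    return stack

print(run("3 4 + 2 *".split()))  # [14]
```

On a chip with a hardware stack per core, everything here stays on-core, which is why such designs see so little memory contention.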
> Right.. There would need to be some sort of controller that could transfer
> data from Block A to Block B. Addressing would be quite ... interesting in
> this scheme.
This is a well-known problem with lots of good solutions, depending on how
you implement the interconnections. The real problem is that lots of
algorithms are inherently sequential.
--
Darren New, San Diego CA, USA (PST)
"Coding without comments is like
driving without turn signals."
On 6/3/2011 6:48, Invisible wrote:
> I'm going to sound really old now, but... I remember back in the days when
> the RAM was faster than the CPU.
The first mainframe I worked on had memory clocked at 8x the CPU speed. 7
DMA channels for each CPU instruction executed. You could be swapping out
two processes while bringing in two processes without noticing the load.
Even the Atari 400 had RAM clocked at twice the CPU speed.
--
Darren New, San Diego CA, USA (PST)
"Coding without comments is like
driving without turn signals."