  Re: Complicated  
From: Invisible
Date: 3 Jun 2011 11:19:03
Message: <4de8fb67@news.povray.org>
>> I'm going to sound really old now, but... I remember back in the days
>> when the RAM was faster than the CPU. I remember in-order execution
>> without pipelines. When I first sat down to read about how modern IA32
>> processors actually work... I was pretty shocked. The circuitry seems to
>> spend more time deciding how to execute the code than actually, you
>> know, *executing* it! o_O
>
> Me, too. I remember (I think it was back in the 386 days) when the
> processors started to outpace some memory: you had to start adding wait
> states in order to keep the processor from spamming the memory with too
> many requests. Now the core operates at many times the speed of the bus.
> Everything with a modern multi-core CPU is IO bound.

Well, I suppose if you want to be picky, it's not bound by I/O 
operations; it's bound by RAM bandwidth and/or latency.

Current trends seem to be towards ever-greater RAM bandwidth, at the 
expense of also increasing latency. If your caches and pipelines and 
prefetch and branch prediction manage to hide the latency, that's fine. 
If they don't... SLOOOOOW!

There was a time when the CPU core was only a dozen times faster than 
RAM. Those days are gone. Last I heard, if the CPU needs to access main 
memory, you're talking about a 400 clock cycle stall.

At this point, increasing clock speed or adding more cores simply 
increases the amount of time the CPU spends waiting. Even the faster 
memory connections only increase the bandwidth /of the interface/. The 
actual RAM cells aren't getting any faster.
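
Just to make the locality point concrete, here's a toy Haskell sketch 
(using the vector package; the names and the row-major layout are mine, 
purely for illustration). Both functions compute exactly the same sum; 
the only difference is the order in which they touch memory, and on a 
large matrix the second one is dramatically slower because the 
prefetcher can't hide the latency:

import qualified Data.Vector.Unboxed as VU

-- An n*n matrix stored row-major in one flat unboxed vector.
-- Walking it row by row touches memory sequentially, so the
-- hardware prefetcher can hide most of the RAM latency.
sumRowMajor :: Int -> VU.Vector Double -> Double
sumRowMajor n m =
  sum [ m VU.! (i * n + j) | i <- [0 .. n - 1], j <- [0 .. n - 1] ]

-- Walking the same data column by column jumps n doubles at a
-- time, so for large n nearly every access is a cache miss and
-- the core just sits there stalled.
sumColMajor :: Int -> VU.Vector Double -> Double
sumColMajor n m =
  sum [ m VU.! (i * n + j) | j <- [0 .. n - 1], i <- [0 .. n - 1] ]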

>> (I've seen plenty of documentation from the Haskell developers where
>> they benchmark something, and then find that it goes faster if you turn
>> *off* parallelism, because otherwise the multiple cores fight for
>> bandwidth or constantly invalidate each other's caches.)
>
> The solution, of course, would be to break the task down into small
> blocks that can fit within the CPUs' caches (good locality) and work on
> units that are in areas of memory where cache lines will not overlap.

Sometimes this is possible. For example, the latest releases of the 
Haskell run-time system have an independent heap per CPU core, and each 
core can run a garbage collection cycle of its own local heap 
independently of the other cores. Since the heaps never overlap, there's 
no problem here.

The *problem*, of course, happens when data migrates from the 
generation-1 area into the shared generation-2 heap area. Then suddenly 
you have to 
start worrying about cores invalidating each other's caches and so 
forth. Again, the GC engine uses a block-based system to try to minimise 
this.
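
In ordinary user code, the usual way to get that kind of blocking in 
Haskell is to chunk the work yourself before handing it to the parallel 
runtime. A minimal sketch, using Control.Parallel.Strategies; the chunk 
size is an arbitrary number I've picked for illustration:

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Square a big list of numbers in parallel, giving each spark a
-- contiguous chunk of 4096 elements rather than a single element,
-- so each core mostly stays within its own block of data.
squaresPar :: [Double] -> [Double]
squaresPar xs = map (^ 2) xs `using` parListChunk 4096 rdeepseq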

Another example: A Haskell thread can generate "sparks" which are tasks 
which can usefully be run in the background. But the developers found 
that if you let /any/ core do so, it tends to slow down rather than 
speed up. Basically, it's best to keep sparks local to the core that 
created them, unless there are cores actually sitting idle.
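
For context, this is roughly what creating sparks looks like from the 
programmer's side. It's only a sketch, and the cut-off of 25 is an 
arbitrary value I've chosen; the point is that each call to par merely 
records a hint, and the runtime decides which core, if any, actually 
picks it up:

import Control.Parallel (par, pseq)

-- Each 'par' creates a spark: a hint that 'a' could usefully be
-- evaluated by some other core. If every core is busy, the spark
-- simply gets evaluated by whoever demands the result later.
parFib :: Int -> Integer
parFib n
  | n < 25    = fib n   -- too small to be worth the overhead of a spark
  | otherwise = a `par` (b `pseq` (a + b))
  where
    a = parFib (n - 1)
    b = parFib (n - 2)
    fib m | m < 2     = fromIntegral m
          | otherwise = fib (m - 1) + fib (m - 2)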

> Which is all fine and good, but ... not really feasible because you are
> not in charge of where your task's pages physically reside in memory.
> The OS is in charge of that, and it could change the physical layout at
> any moment.

Almost every OS known to Man exposes RAM to the application as a linear 
range of virtual addresses. No two [writeable] pages in your 
application's virtual address space will ever map to the same physical 
address.

>> It seems inevitable that the more memory you have, the slower it is.
>> Even if we completely ignore issues of cost, bigger memory = more
>> address decode logic + longer traces. Longer traces mean more power,
>> more latency, more capacitance, more radio interference. It seems
>> inevitable that if you have a really huge memory, it will necessarily be
>> very slow.
>
> Right, because all of those elements work against you. It still amazes
> me that a CPU can operate at the speeds it does and still run robustly
> enough to produce good results without completely distorting the signals
> it's handling. Each FET has a capacitance at the gate. They've managed
> to get the total capacitance (traces, gates, etc) so tiny that it
> doesn't kill the signal.

Over the distances inside a chip, it's not too bad. If the signal has to 
leave the chip, suddenly the distances, the capacitance, the 
interference and the power requirements all increase by an order of 
magnitude. 
That's why you can have a 2.2 GHz L1 cache, but you have a piffling 0.1 
GHz RAM bus.

>> Indeed, sometimes I start thinking about what some kind of hyper-Harvard
>> architecture would look like. For example, what happens if the machine
>> stack is on-chip? (I.e., it exists in its own unique address space, and
>> the circuitry for it is entirely on-chip.) What about if there were
>> several on-chip memories optimised for different types of operation?
>
> Don't some RISC processors sort of do this?

Certainly there are chips that have used some of these ideas. DSP chips 
often have separate address spaces for code and data, and sometimes 
multiple buses too. These ideas tend not to show up in desktop 
processors, though.

(Hell, anything that isn't 8086-compatible tends not to be used for 
desktop processors! Which is a shame, because 8086 isn't that good...)

>> The other model I've looked at is having not a computer with one giant
>> memory connected to one giant CPU, but zillions of smallish memories
>> connected to zillions of processing elements. The problem with *that*
>> model tends to be "how do I get my data to where I need it?"
>
> Right.. There would need to be some sort of controller that could
> transfer data from Block A to Block B. Addressing would be quite ...
> interesting in this scheme.

It's quite feasible that the processing elements themselves could 
handle forwarding messages to where they need to go. You don't 
necessarily need dedicated switching and routing circuitry.

The trouble, of course, is that as the number of processing elements 
increases, either the number of point-to-point links grows roughly 
quadratically (if you connect everything to everything), or the 
communication latency grows with the distance a message has to travel 
across the network. One or the other.
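
To put rough numbers on that trade-off, here's a back-of-the-envelope 
sketch. The two topologies are just the obvious extremes, not how any 
real machine is necessarily wired:

-- Full crossbar: every element wired directly to every other one.
-- Any message is one hop, but the number of links grows quadratically.
crossbarLinks :: Int -> Int
crossbarLinks n = n * (n - 1) `div` 2

-- 2-D mesh: each element wired only to its nearest neighbours.
-- The link count grows linearly, but the worst-case hop count
-- grows with the square root of the number of elements.
meshWorstCaseHops :: Int -> Int
meshWorstCaseHops n = 2 * (side - 1)
  where side = ceiling (sqrt (fromIntegral n :: Double))

With a million processing elements, the crossbar needs about half a 
trillion links, while the mesh gets away with a few million links but 
pays roughly 2000 hops in the worst case.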

I gather that several supercomputers make use of AMD Opteron chips.

http://en.wikipedia.org/wiki/Jaguar_%28computer%29

The Opteron is a bit unusual:

http://en.wikipedia.org/wiki/Opteron#Multi-processor_features

http://www.amd.com/us/products/technologies/direct-connect-architecture/Pages/direct-connect-architecture.aspx

Basically, each time you add a new chip, you're adding new buses too, 
increasing bandwidth. The Opteron has a limit of 8 CPUs per motherboard; 
Wikipedia claims you can buy "expensive routing chips" to extend this.

>>> What I understand so far is: One processor has the bus, and either reads
>>> or writes from/to the bus. The other processor watches the activity and,
>>> if it sees an address it has modified, it tells the first processor,
>>> which passes bus control to the other, puts the data out on the bus,
>>> then returns control to the first processor.
>>
>> http://en.wikipedia.org/wiki/Cache_coherence
>> http://en.wikipedia.org/wiki/MESI_protocol
>>
>
> Wow. I actually recognized MESI right off: Modified, Exclusive, Shared,
> Invalid. This is exactly what the Pentium uses.

This is exactly what almost all systems of this kind do.
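
For what it's worth, the snooping half of MESI is small enough to write 
down in a few lines. This is a toy model of the state transitions as I 
understand them from the pages above, not anyone's actual 
implementation:

data MesiState = Modified | Exclusive | Shared | Invalid
  deriving (Show, Eq)

data BusEvent = RemoteRead | RemoteWrite
  deriving (Show, Eq)

-- What happens to *this* cache's copy of a line when it snoops
-- another core's read or write to the same address on the bus.
snoop :: MesiState -> BusEvent -> MesiState
snoop Modified  RemoteRead  = Shared    -- write the dirty data back, then share it
snoop Modified  RemoteWrite = Invalid   -- write back and give up the line
snoop Exclusive RemoteRead  = Shared
snoop Exclusive RemoteWrite = Invalid
snoop Shared    RemoteRead  = Shared
snoop Shared    RemoteWrite = Invalid
snoop Invalid   _           = Invalid   -- nothing to do; we don't have the line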

