Orchid XP v8 <voi### [at] dev null> wrote:
> Well, no. You can use large arrays and access them in a specific order
> in an attempt to improve cache coherency. But beyond that, in a GC
> language where the runtime randomly allocates and rearranges data in RAM
> from time to time, you really have very little control.
If the GC'd language doesn't reorder elements in memory to be more
cache-friendly, that can have a rather large penalty in speed.
The speed difference between cache-friendly and cache-unfriendly code
can be surprisingly large, even if both do the exact same thing otherwise.
We are talking about something like an order of magnitude slower at worst.
Cache misses are *very* expensive.
--
- Warp
Warp wrote:
> Intel processors have reordered instructions automatically since the
> Pentium Pro (AFAIR).
I think getting the compiler to do this was one of the basic ideas
behind RISC, until they figured out that it meant you had to recompile
your source code for every minor rev of your CPU. :-) I even remember
that the compiler was supposed to know how long instructions took, so
you didn't need any hardware to wait for an instruction to complete
before you started using its results.
A lot of the RISC stuff that sounded real cool actually turned out to be
a bad idea if you wanted to run the same code on multiple compatible CPUs.
> If the GC'd language doesn't reorder elements in memory to be more
> cache-friendly, that can have a rather large penalty in speed.
Hmmmm... I've never heard of that. I wonder how easy it is to figure out
how to arrange things (based on compile-time or run-time information
rather than intuitive understanding of the algorithm) to make cache
misses less common.
--
Darren New / San Diego, CA, USA (PST)
Darren New <dne### [at] san rr com> wrote:
> > If the GC'd language doesn't reorder elements in memory to be more
> > cache-friendly, that can have a rather large penalty in speed.
> Hmmmm... I've never heard of that. I wonder how easy it is to figure out
> how to arrange things (based on compile-time or run-time information
> rather than intuitive understanding of the algorithm) to make cache
> misses less common.
I would say that in the general case it's extremely difficult. It could
be rather easy in simple cases, though.
For example, if you are traversing a list from the beginning to the
end, and doing that a lot of times, then having the list elements in
sequential order in memory (one right after the other) will be much
more cache-friendly than having the elements located randomly in
memory. Theoretically a compiler could perhaps optimize such lists
(even if that means reordering memory blocks to make it possible, as
memory fragmentation can naturally cause list elements to be scattered
randomly in memory).
The absolute worst case scenario would be that the list is so large that
it does not fit entirely in the cache, the elements are scattered randomly
in memory, far apart from each other (they don't even have to be very
far apart; I don't remember how large a typical cache block is, but it
was something like 32 bytes or so), and the list is traversed several
times. In that case each single list element access causes a cache miss,
which makes it really slow.
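Something like this rough, untested sketch (the function names are just
made up) is the difference I mean: summing values stored one right
after the other in an array, versus summing the same values reached by
chasing pointers to nodes scattered around the heap:

#include <stdlib.h>

struct node { int value; struct node *next; };

/* Cache-friendly: consecutive elements share cache lines, and a
   hardware prefetcher can spot the sequential access pattern. */
long sum_array(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Cache-unfriendly: each node was allocated separately, so following
   ->next can jump anywhere in memory and miss the cache every time. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

Both functions compute the same result; only the memory layout of the
data differs.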
Sometimes you could make your program faster by simply doing things a
bit differently. For example, if you are running an algorithm on a list
several times, and the algorithm doesn't actually require you to traverse
the list completely in order to get partial results, you could run the
algorithm on only a part of the list the necessary number of times, after
which you run it on the next part of the list, etc. If the size of the
partial list is small enough, all of its elements will fit in the cache
and the subsequent passes will become a lot faster. This should be so
even if all the list elements are scattered randomly in memory (it causes
the first traversal of the partial list to be slow, but subsequent
traversals should be fast because all the elements will be in the cache).
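In code the idea looks something like this rough, untested sketch (the
function and the per-element work are made up just for illustration):

#include <stddef.h>

#define BLOCK 4096   /* elements per chunk; assumed to fit in the cache */

void process(int *data, size_t n, int passes)
{
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t end = (start + BLOCK < n) ? start + BLOCK : n;

        /* Make all the passes over this chunk while it is still in
           the cache, instead of sweeping the whole list each pass. */
        for (int pass = 0; pass < passes; pass++)
            for (size_t i = start; i < end; i++)
                data[i] += 1;   /* stand-in for the real per-element work */
    }
}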
--
- Warp
Warp wrote:
> The absolute worst case scenario would be that the list is so large that
> it does not fit entirely in the cache, the elements are scattered randomly
> in memory, far apart from each other (they don't even have to be very
> far apart; I don't remember how large a typical cache block is, but it
> was something like 32 bytes or so), and the list is traversed several
> times. In that case each single list element access causes a cache miss,
> which makes it really slow.
I'm told there are special processor instructions for accessing memory
locations where you "know" that the data will only be read once, so
there's no point attempting to cache it. This way you can save the cache
for stuff that *will* be accessed multiple times.
Of course, unless you're programming in assembly, this isn't much help.
> Sometimes you could make your program faster by simply doing things a
> bit differently. For example, if you are running an algorithm on a list
> several times, and the algorithm doesn't actually require you to traverse
> the list completely in order to get partial results, you could run the
> algorithm on only a part of the list the necessary number of times, after
> which you run it on the next part of the list, etc. If the size of the
> partial list is small enough, all of its elements will fit in the cache
> and the subsequent passes will become a lot faster. This should be so
> even if all the list elements are scattered randomly in memory (it causes
> the first traversal of the partial list to be slow, but subsequent
> traversals should be fast because all the elements will be in the cache).
This is the kind of thing that Haskell does to you automatically. Of
course, Haskell "lists" are linked lists, scattered over all of the heap
(and periodically moving around). I don't know much about how CPU caches
work, but I imagine having the data scattered over several memory pages
isn't very cache-friendly.
Invisible wrote:
> Of course, unless you're programming in assembly, this isn't much help.
Or if your compiler is smart enough. For example, C's memset(), which
just assigns a specific value to a range of locations (say, to zero out
an array), can often be coded to use instructions that say "I'm going to
clobber this whole cache line, so don't read it." I guess that counts
as writing in assembly if you consider the authors of the routine, but
you don't need to do that to use it.
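On x86 the closest thing you can reach from C, without writing the
assembly yourself, is probably the SSE2 "non-temporal" store intrinsics,
which write around the cache. A rough, untested sketch of zeroing a
buffer that way (just an illustration of the idea, not how any
particular memset() is actually written):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdlib.h>
#include <stddef.h>

/* Assumes buf is 16-byte aligned and bytes is a multiple of 16. */
static void zero_nontemporal(void *buf, size_t bytes)
{
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)buf;

    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&p[i], zero);   /* movntdq: bypass the cache */

    _mm_sfence();   /* make the streaming stores visible to other cores */
}

int main(void)
{
    size_t n = 1 << 20;                /* 1 MB scratch buffer */
    void *buf = aligned_alloc(16, n);  /* C11; 16-byte alignment for SSE */
    if (buf) {
        zero_nontemporal(buf, n);
        free(buf);
    }
    return 0;
}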
> I don't know much about how CPU caches
> work, but I imagine having the data scattered over several memory pages
> isn't very cache-friendly.
I don't think it's a matter of being spread over several pages, but a
matter of having small pieces (much smaller than a cache line) being
randomly accessed.
I.e., when you read xyz[37] as an integer in C (that's the 38th element
of the array of integers), the cache may pull in
xyz[35],xyz[36],...,xyz[39] in one memory cycle. If you're walking
sequentially through the array, this is helpful. If you're jumping all
over, this isn't helpful (altho maybe not harmful, depending on what
your access just replaces). If you're using linked lists, even going
sequentially will cause the cache accesses to be unhelpful, because
xyz[37] isn't next to xyz[38].
--
Darren New / San Diego, CA, USA (PST)
Darren New wrote:
>> I don't know much about how CPU caches work, but I imagine having the
>> data scattered over several memory pages isn't very cache-friendly.
>
> I don't think it's a matter of being spread over several pages, but a
> matter of having small pieces (much smaller than a cache line) being
> randomly accessed.
How wide is a typical cache line?
IIRC, a Haskell linked list node is 3 pointers wide.
> If you're using linked lists, even going
> sequentially will cause the cache accesses to be unhelpful, because
> xyz[37] isn't next to xyz[38].
Although, if they happen to be allocated sequentially, they might
possibly be...
Invisible wrote:
> How wide is a typical cache line?
Warp said 32 bytes. I thought that was way smaller than I remembered,
but I went and variously tracked it down, and indeed, most of the modern
machines seem to have 32-byte cache lines. I think they used to be
wider, like 128 bytes, but maybe I'm misremembering, or thinking of
mainframes, or something.
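If you want to check on a particular machine rather than take my word
for it, glibc on Linux can report it (this is a glibc extension, not
portable C, and sysconf() may return 0 or -1 if the value is unknown):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        printf("L1 data cache line size: %ld bytes\n", line);
    else
        printf("Cache line size not reported by sysconf()\n");
    return 0;
}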
--
Darren New / San Diego, CA, USA (PST)
>> How wide is a typical cache line?
>
>
> Warp said 32 bytes. I thought that was way smaller than I remembered,
> but I went and variously tracked it down, and indeed, most of the modern
> machines seem to have 32-byte cache lines. I think they used to be
> wider, like 128 bytes, but maybe I'm misremembering, or thinking of
> mainframes, or something.
Right. So just to make sure I understand this... you're saying the cache
is managed in chunks of 32 bytes at a time? Like, if you fetch 1 byte of
data from memory, 32 bytes will be loaded into the least recently used
cache line? (Or whatever the cache policy is.)
Are the cache lines aligned in any sort of way? (I'm wondering if it
grabs 32 bytes at the nearest 32-byte boundary or not.)
--
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*
Orchid XP v8 wrote:
> Right. So just to make sure I understand this... you're saying the cache
> is managed in chunks of 32 bytes at a time? Like, if you fetch 1 byte of
> data from memory, 32 bytes will be loaded into the least recently used
> cache line? (Or whatever the cache policy is.)
That's how I understand it. I believe it's more accurate to say that the
bus between the memory and the registers is actually 32 bytes wide. I
may be mistaken in that, tho.
> Are the cache lines aligned in any sort of way? (I'm wondering if it
> grabs 32 bytes at the nearest 32-byte boundary or not.)
I'm also pretty certain it is on 32-byte boundaries. Otherwise, the
addressing logic winds up being slower than the time you save caching stuff.
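In other words, the line an address falls into is found by just masking
off the low address bits. A tiny, untested sketch (assuming the 32-byte
line size from above):

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 32u   /* assumed cache line size */

int main(void)
{
    int xyz[100];
    uintptr_t addr = (uintptr_t)&xyz[37];
    uintptr_t line_base = addr & ~(uintptr_t)(LINE_SIZE - 1);

    /* Everything from line_base up to line_base + 31 is fetched as
       one cache line. */
    printf("&xyz[37] = %p, its cache line starts at %p\n",
           (void *)addr, (void *)line_base);
    return 0;
}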
--
Darren New / San Diego, CA, USA (PST)
Orchid XP v8 <voi### [at] dev null> wrote:
> Right. So just to make sure I understand this... you're saying the cache
> is managed in chunks of 32 bytes at a time? Like, if you fetch 1 byte of
> data from memory, 32 bytes will be loaded into the least recently used
> cache line? (Or whatever the cache policy is.)
> Are the cache lines aligned in any sort of way? (I'm wondering if it
> grabs 32 bytes at the nearest 32-byte boundary or not.)
You should read this article series if you want to know everything there
is to know about memory and cache:
http://lwn.net/Articles/250967/
--
- Warp