Orchid XP v8 wrote:
> Kevin Wampler wrote:
>> Orchid XP v8 wrote:
>>> OK, cool. So basically there is no way I can tell whether
>>> implementing an algorithm one way or the other will yield the best
>>> speed. Yay, me. :-/
>>
>> It won't tell you everything about the speed of the final program, but
>> I can't imagine it'd be hard to write some test programs and then time
>> them to determine some of these answers for yourself.
>
> I guess I'm still used to the Old Days of computing, when the speed of
> the CPU was the primary bottleneck. Of course, these days the memory
> subsystem is the primary bottleneck - to the point where algorithms
> which are "less efficient" on paper can actually run faster in reality
> if they have superior cache behaviour.
>
> Obviously, cache behaviour is something I have absolutely no control
> over, so there's no point worrying about it.
>
Depends on what language you are using.
Even today, some programs are written to use only internal registers,
touching external memory only when absolutely necessary.
Also, depending on the processing unit you are using, it can be faster
to count down to zero than up to a given number, based on the machine
instructions available and the time cost of each one.
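To give a concrete (and purely hypothetical) C illustration of the
count-down point - the two loops below do the same work, but on some
processors the second compiles to a tighter decrement-and-branch-if-not-zero
sequence, while the first needs an extra compare against n each time round:

#include <stddef.h>

/* made-up example functions; whether this matters at all depends on the
   compiler and the target instruction set */
void scale_up(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; ++i)   /* compares i against n on every iteration */
        a[i] *= k;
}

void scale_down(double *a, size_t n, double k)
{
    while (n--)                      /* the decrement itself sets the zero flag, */
        a[n] *= k;                   /* so the test against zero comes for free  */
}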
When you get into high-level languages you are going to lose some
execution efficiency, because the focus shifts towards ease of
programming rather than execution speed.
You get to the point where the program just runs and the compiler gives
little care to optimizing execution speed. It's easier to program,
but at a cost.
Tom
Mike Raiford <"m[raiford]!at"@gmail.com> wrote:
> Not only that, but modern processors can get a significant speed boost
> based on the order of instructions presented. e.g. if you have 2
> instructions, one dependent on the previous, it's entirely possible to
> move an instruction between those two to take advantage of the
> processor's pipelining.
Intel processors have reordered instructions automatically since the
Pentium Pro (AFAIR).
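(A rough C sketch of the idea in the quoted post, with a made-up function:
splitting one running sum into two independent accumulators removes the
serial dependency between consecutive additions, giving the pipeline two
chains it can overlap.)

#include <stddef.h>

double sum2(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];       /* these two additions don't depend on each other, */
        s1 += a[i + 1];   /* so they can be in flight at the same time       */
    }
    if (i < n)
        s0 += a[i];       /* leftover element when n is odd */
    return s0 + s1;
}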
--
- Warp
Orchid XP v8 <voi### [at] dev null> wrote:
> Interesting. I have always heard that floating-point arithmetic is much
> slower than integer arithmetic.
An FPU addition has taken 1 clock cycle since probably the 486 (the
version which had the integrated FPU), or at most the Pentium.
  The first Intel processor to have a 1-clock-cycle FPU multiplication
was, if I remember correctly, the Pentium. (As you can imagine, this
requires a rather large percentage of the entire area of the chip.)
Of course the Pentium was not the first processor to have a 1-clock-cycle
floating point multiplication. Many RISC processors had that probably 10
years earlier. It's old technology.
  Division is much more complicated than multiplication, which is why
it has always taken (and will most probably always take) quite a few
clock cycles to compute.
Of course the actual throughput of the FPU in most programs is slower
than that because of all the data which has to be transferred between the
memory and the FPU. You can write specialized code (usually in ASM) which
takes full advantage of the FPU by calculating as much as possible without
having to constantly load and store the values from/to memory, but even
today compilers are not very good at making this kind of optimization.
If you examine the assembler output of a compiler, you will usually notice
that it loads and stores FPU registers *a lot*, often more than would be
necessary.
  This may be one of the reasons why XMM, SSE and other SIMD extensions
have been developed, i.e. so that with a new set of registers and
instructions it is possible to write code which uses the hardware better.
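As a small, hypothetical example of what keeping values in the FPU means
in practice: in a loop like the one below the compiler can hold the running
sum in a register for the whole loop, instead of storing it to memory and
reloading it on every iteration.

#include <stddef.h>

double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;              /* can stay in an FPU/SSE register throughout */
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];        /* two loads per iteration, no store until the
                                      result is returned */
    return sum;
}

How well this works out depends on the compiler; as said above, the
generated code often still loads and stores more than it needs to.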
> So you're saying they're actually roughly the same speed now?
When counting in clock cycles they have been about the same speed since
the 486. The FPU is just a bit hindered by a more complicated data transfer
between the memory and the FPU.
--
- Warp
Orchid XP v8 <voi### [at] dev null> wrote:
> Well, no. You can use large arrays and access them in a specific order
> in an attempt to improve cache coherency. But beyond that, in a GC
> language where the runtime randomly allocates and rearranges data in RAM
> from time to time, you really have very little control.
If the GC'd language doesn't reorder elements in memory to be more
cache-friendly, that can have a rather large penalty in speed.
The speed difference between cache-friendly and cache-unfriendly code
can be surprisingly large, even if both do the exact same thing otherwise.
We are talking about as much as an order of magnitude slower at worst.
Cache misses are *very* expensive.
--
- Warp
Warp wrote:
> Intel processors have reordered instructions automatically since the
> Pentium Pro (AFAIR).
I think getting the compiler to do this was one of the basic ideas
behind RISC, until they figured out that it meant you had to recompile
your source code for every minor rev of your CPU. :-) I even remember
that the compiler was supposed to know how long instructions took, so
you didn't need any hardware to wait for an instruction to complete
before you started using its results.
A lot of the RISC stuff that sounded really cool actually turned out to be
a bad idea if you wanted to run the same code on multiple compatible CPUs.
> If the GC'd language doesn't reorder elements in memory to be more
> cache-friendly, that can have a rather large penalty in speed.
Hmmmm... I've never heard of that. I wonder how easy it is to figure out
how to arrange things (based on compile-time or run-time information
rather than an intuitive understanding of the algorithm) to make cache
misses less common.
--
Darren New / San Diego, CA, USA (PST)
Darren New <dne### [at] san rr com> wrote:
> > If the GC'd language doesn't reorder elements in memory to be more
> > cache-friendly, that can have a rather large penalty in speed.
> Hmmmm... I've never heard of that. I wonder how easy it is to figure out
> how to arrange things (based on compile-time or run-time information
> rather than an intuitive understanding of the algorithm) to make cache
> misses less common.
I would say that in the general case it's extremely difficult. It could
be rather easy in simple cases, though.
  For example, if you are, say, traversing a list from beginning to end,
and doing that a lot of times, then if the list elements are in
sequential order in memory (one right after the other), it will be
much more cache-friendly than if the elements are located randomly in
memory. Theoretically a compiler or runtime might be able to optimize
such lists (even if it means reordering memory blocks for that to be
possible, as memory fragmentation can naturally cause list elements to
be scattered randomly in memory).
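To make the layout difference concrete (the types and functions below are
invented for the example), compare walking a contiguous array with walking
a linked list whose nodes may be scattered around the heap:

#include <stddef.h>

struct node { double value; struct node *next; };

double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)   /* sequential access: every byte of each */
        s += a[i];                   /* fetched cache line gets used          */
    return s;
}

double sum_list(const struct node *p)
{
    double s = 0.0;
    for (; p != NULL; p = p->next)   /* pointer chasing: if the nodes are scattered, */
        s += p->value;               /* each hop can be a fresh cache miss           */
    return s;
}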
  The absolute worst-case scenario would be that the list is so large that
it does not fit entirely in the cache, the elements are scattered randomly
in memory, far apart from each other (they don't even have to be very far
apart; I don't remember how large a typical cache block is, but it was
something like 32 bytes or so), and the list is traversed several times.
Then every single list element access causes a cache miss, which makes
it really slow.
Sometimes you could make your program faster by simply doing things a
bit differently. For example, if you are running an algorithm on a list
several times, and the algorithm doesn't actually require you to traverse
the list completely in order to get partial results, you could run the
algorithm on only a part of the list the necessary number of times, after
which you run it on the next part of the list, etc. If the size of the
partial list is small enough, all of its elements will fit in the cache
and the subsequent passes will become a lot faster. This should be so
even if all the list elements are scattered randomly in memory (it causes
the first traversal of the partial list to be slow, but subsequent
traversals should be fast because all the elements will be in the cache).
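A sketch of that blocking idea, with made-up names and an assumed chunk
size that is small enough to fit in the cache:

#include <stddef.h>

#define CHUNK 4096   /* elements per block; assumed to fit in the cache */

void process_blocked(double *data, size_t n, int passes)
{
    for (size_t start = 0; start < n; start += CHUNK) {
        size_t end = (start + CHUNK < n) ? start + CHUNK : n;
        for (int p = 0; p < passes; ++p)           /* run every pass while this   */
            for (size_t i = start; i < end; ++i)   /* chunk is still in the cache */
                data[i] = data[i] * 0.5 + 1.0;     /* stand-in for the real work  */
    }
}

The first pass over each chunk still pays for the misses; only the
remaining passes get the benefit.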
--
- Warp
Warp wrote:
> The absolute worst-case scenario would be that the list is so large that
> it does not fit entirely in the cache, the elements are scattered randomly
> in memory, far apart from each other (they don't even have to be very far
> apart; I don't remember how large a typical cache block is, but it was
> something like 32 bytes or so), and the list is traversed several times.
> Then every single list element access causes a cache miss, which makes
> it really slow.
I'm told there are special processor instructions for accessing memory
locations where you "know" that the data will only be read once, so
there's no point attempting to cache it. This way you can save the cache
for stuff that *will* be accessed multiple times.
Of course, unless you're programming in assembly, this isn't much help.
> Sometimes you could make your program faster by simply doing things a
> bit differently. For example, if you are running an algorithm on a list
> several times, and the algorithm doesn't actually require you to traverse
> the list completely in order to get partial results, you could run the
> algorithm on only a part of the list the necessary number of times, after
> which you run it on the next part of the list, etc. If the size of the
> partial list is small enough, all of its elements will fit in the cache
> and the subsequent passes will become a lot faster. This should be so
> even if all the list elements are scattered randomly in memory (it causes
> the first traversal of the partial list to be slow, but subsequent
> traversals should be fast because all the elements will be in the cache).
This is the kind of thing that Haskell does to you automatically. Of
course, Haskell "lists" are linked lists, scattered over all of the heap
(and periodically moving around). I don't know much about how CPU caches
work, but I imagine having the data scattered over several memory pages
isn't very cache-friendly.
Invisible wrote:
> Of course, unless you're programming in assembly, this isn't much help.
Or if your compiler is smart enough. For example, C's memset(), which
just assigns a specific value to a range of locations (say, to zero out
an array) can often be coded to use instructions that say "I'm going to
clobber this whole cache line, so don't read it." I guess that counts
as writing in assembly if you consider the authors of the routine, but
you don't need to do that to use it.
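(For what it's worth, SSE intrinsics expose this sort of instruction to C
code too. A hedged sketch, assuming dst is 16-byte aligned and n is a
multiple of 4:)

#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* non-temporal stores: write 'value' over the buffer while hinting the CPU
   not to keep the written lines in the cache */
void fill_streaming(float *dst, size_t n, float value)
{
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);   /* store that bypasses the cache */
    _mm_sfence();                    /* make the streaming stores visible */
}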
> I don't know much about how CPU caches
> work, but I imagine having the data scattered over several memory pages
> isn't very cache-friendly.
I don't think it's a matter of being spread over several pages, but a
matter of having small pieces (much smaller than a cache line) being
randomly accessed.
I.e., when you read xyz[37] as an integer in C (that's the 38th element
of the array of integers), the cache may pull in
xyz[35],xyz[36],...,xyz[39] in one memory cycle. If you're walking
sequentially through the array, this is helpful. If you're jumping all
over, this isn't helpful (although maybe not harmful, depending on what
your access just replaces). If you're using linked lists, even going
sequentially will cause the cache accesses to be unhelpful, because
xyz[37] isn't next to xyz[38].
--
Darren New / San Diego, CA, USA (PST)
Darren New wrote:
>> I don't know much about how CPU caches work, but I imagine having the
>> data scattered over several memory pages isn't very cache-friendly.
>
> I don't think it's a matter of being spread over several pages, but a
> matter of having small pieces (much smaller than a cache line) being
> randomly accessed.
How wide is a typical cache line?
IIRC, a Haskell linked list node is 3 pointers wide.
> If you're using linked lists, even going
> sequentially will cause the cache accesses to be unhelpful, because
> xyz[37] isn't next to xyz[38].
Although, if they happen to be allocated sequentially, they might
possibly be...
Invisible wrote:
> How wide is a typical cache line?
Warp said 32 bytes. I thought that was way smaller than I remembered,
but I went and tracked it down in various places, and indeed, most of the modern
machines seem to have 32-byte cache lines. I think they used to be
wider, like 128 bytes, but maybe I'm misremembering, or thinking of
mainframes, or something.
--
Darren New / San Diego, CA, USA (PST)