(Yes, another long post that nobody is going to actually read...)
So a while back, I mentioned that Haskell has a veritable zoo of array
libraries, and it's all a bit of a mess. Some of you may be wondering
what the problem is. After all, BASIC has arrays. Pascal has arrays. C
has arrays. Java has arrays. And they're all pretty simple. So why the
drama?
First of all, in any of the above languages, if you want to store
several items of data, an array is usually the first data structure you
reach for. It's simple, it's built-in, it's always there, it's
moderately efficient, and it's "good enough" to start with. Maybe at
some point you'll decide that what you "really" want is actually a
hashtable, or a balanced tree, or maybe something even more fancy. But,
initially, an array is an easy way to go.
In Haskell, the default container of choice is of course a list. (When a
Haskeller says "list", they generally mean "singly-linked list".) It's
built-in, it's always there, and there's a huge library of list
processing functions. The Haskell Standard Prelude (a library which is
always imported by default unless you explicitly ask not to) contains 53
list-related functions, and the standard List library contains an
additional 51 functions for a total of 104 list-processing functions.
By contrast, the standard Haskell 98 arrays library defines a piffling
12 functions for dealing with arrays. And that's *it*.
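For the curious, here's roughly what using it looks like. This is a
minimal sketch of my own with Data.Array; the functions are the
standard ones, but the example array and names are invented.
import Data.Array

-- Build an immutable array from a list, with bounds 0..4.
squares :: Array Int Int
squares = listArray (0, 4) [x * x | x <- [0 .. 4]]

-- A few of those 12 functions in action:
third  = squares ! 2       -- indexing:       4
limits = bounds squares    -- the bounds:     (0,4)
asList = elems squares     -- back to a list: [0,1,4,9,16]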
Lists are simple Algebraic Data Types (ADTs). They are hard-wired into
the compiler, but only to allow syntax sugar. If you forgo the sugar,
it's quite trivial to implement your own list type:
data List x = Node x (List x) | End
It is then a simple beginner's exercise to define all of the standard
list processing functions from first principles:
head (Node x xs) = x
tail (Node x xs) = xs
length End = 0
length (Node x xs) = 1 + length xs
sum End = 0
sum (Node x xs) = x + sum xs
map fn End = End
map fn (Node x xs) = Node (fn x) (map fn xs)
...etc...
You will find Haskell tutorials which paint this picture. It is true
that defining the list processing functions in a maximally efficient way
is a touch tricky - but defining them such that they merely *work* is
child's play. It's a common beginner's exercise in understanding Haskell.
Lists have another nice property. Haskell is a "lazy" language. If you
call a function, the function call isn't really executed until you try
to "examine" the result. And even then, the call might not be executed
completely, but only far enough to generate the part of the data you're
looking at. In Haskell, data doesn't literally "exist" until you
"inspect" it - much like Schroder's famous undead cat.
What all of this means is that you can construct "infinite" lists. These
lists are notionally of infinite length, but so long as you only
"examine" a finite portion of the list, only that finite portion will be
constructed (which uses finite time and space). In this way, you can
separate the code for generating the data from the code that decides how
much data to generate - and we programmers all know that separating
things is usually a good idea.
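For example (a little sketch of my own - the names are made up, the
functions are all standard Prelude):
-- A notionally infinite list: all the powers of two.
powersOfTwo :: [Integer]
powersOfTwo = iterate (* 2) 1

-- Only the consumers decide how much of it actually gets built.
firstTen = take 10 powersOfTwo                        -- [1,2,4,...,512]
firstBig = head (dropWhile (<= 1000000) powersOfTwo)  -- 1048576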
Lists are simple recursive ADTs. And we can make infinite lists. Trees
are simple recursive ADTs too. And they can be infinite. (Although in
practice this isn't as useful.)
Arrays are not recursive, and they're not ADTs either. Support for
arrays must therefore be hard-wired into the compiler. (Unlike lists and
trees which can be defined in plain Haskell.) But this is no different
to any other programming language.
More irritatingly, arrays can't "grow" or "shrink". Their size is preset
when allocated. (You can, of course, construct a slightly more
sophisticated container that allocates a large array and keeps track of
how much of it is "occupied". But there isn't a standard library for
this.) If you want to process an array to construct a new array, you
potentially have to do extra work to determine how large the result is,
or waste space allocating an array that's larger than necessary.
This, again, is a problem for any programming language. But Haskell has
some particular problems of its own.
Most particularly, the Haskell language is fundamentally based upon the
idea that data never, ever changes. If you want to "change" the contents
of a list, what you actually do is make a new, altered *copy* of that
list. The original list still exists and is unaltered. In the common
case that the old list isn't being used anywhere, the garbage collector
(GC) will remove it.
For a list, that means that altering the Nth element of the list causes
all N-1 elements before it to be copied. Any elements *after* the Nth
one can be shared between the new list and the old list.
Similarly, for a tree, modifying a leaf node means making a new leaf
node plus the roughly log(N) parent nodes of that node. All other leaves
and branches can be shared between the two trees. Which is usually
fairly efficient.
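For instance, here's a toy sketch of my own (a real balanced-tree
library does essentially the same thing with its own node type):
data Tree x = Leaf | Branch (Tree x) x (Tree x)

-- "Modify" the leftmost value in a tree. Only the nodes along the
-- left spine get rebuilt; every right subtree hanging off that spine
-- is shared, untouched, with the original tree.
setLeftmost v Leaf              = Leaf
setLeftmost v (Branch Leaf x r) = Branch Leaf v r
setLeftmost v (Branch l    x r) = Branch (setLeftmost v l) x r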
By contrast, to alter one element of an array, you must copy THE ENTIRE
ARRAY. If the array is large, this is obviously absurdly inefficient in
both time and space.
In summary:
       | random read | random write |
-------+-------------+--------------+
 List  |      X      |      X       |
 Tree  |    log X    |    log X     |
 Array |      1      |      Y       |
(Approximate number of operations to access element X in a container
holding Y elements.)
Copying a large array is probably a lot faster than copying a large
linked list (better cache performance, possibly CPU-specific fast copy
instructions), but even so, arrays look horribly slow for writing. While
such arrays can still be used as lookup tables, all the algorithms that
depend on O(1) array updates are suddenly infeasibly slow.
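In code terms (a tiny sketch; the array here is just an invented
example):
import Data.Array

counts :: Array Int Int
counts = listArray (0, 9) (replicate 10 0)

-- "Updating" one element builds a whole new 10-element array.
-- The original counts is still there, unchanged.
counts' :: Array Int Int
counts' = counts // [(3, 42)]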
Haskell has the ability to read and write files on disk. And what is a
file? Why, it's an array of bytes. GHC, everybody's favourite Haskell
compiler, provides the ability to use *mutable* arrays, using the same
facility used for I/O - that is, monads. Using these arrays, we get O(1)
read *and* O(1) write access.
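For example, something like this (a minimal sketch using GHC's
Data.Array.IO; the names are mine):
import Data.Array.IO

main :: IO ()
main = do
  -- Allocate a mutable array of ten Ints, all initially zero.
  arr <- newArray (0, 9) 0 :: IO (IOArray Int Int)
  -- O(1) write and O(1) read - but only inside the IO monad.
  writeArray arr 3 42
  x <- readArray arr 3
  print x   -- prints 42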
This, you would think, would be the end of the story. But not so.
Both Haskell 98 immutable arrays and GHC's mutable arrays try to be too
fancy. Instead of using plain numbers as indices like any normal programming
language, they allow arbitrary objects to be used (so long as the
programmer supplies a way to convert them to integers). This has two
results:
- Arrays need not be zero-origin. They can be one-origin, or even
twelve-origin.
- Multidimensional arrays are implemented by using tuples as indices.
Array indices are always checked at runtime. There is no way to disable
this check. There is also no way to directly use the underlying "real"
indices.
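Concretely, the index flexibility looks something like this (toy
examples of my own, using the standard Data.Array functions):
import Data.Array

-- A one-origin array: the indices run from 1 to 5, not 0 to 4.
oneOrigin :: Array Int Char
oneOrigin = listArray (1, 5) "hello"
  -- oneOrigin ! 1 is 'h'; oneOrigin ! 0 is a runtime bounds error

-- A "two-dimensional" array: the indices are (row, column) tuples.
grid :: Array (Int, Int) Int
grid = listArray ((1, 1), (3, 3)) [1 .. 9]
  -- grid ! (2, 2) is 5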
Amusingly, somebody discovered that by getting the "convert object to
index" code wrong, you could make your Haskell programs segfault. Even
though they contain only pure Haskell code. Obviously, this should
never, ever happen, for any reason. The library was quickly changed. Now
instead of checking the index object against the index bounds, the index
object is converted to a real index and then checked. This means that
now the original index object can be out of bounds, causing
multidimensional arrays to "wrap around" at the ends in an
implementation-dependent way.
You'd think they could get this stuff right, eh?
The obvious course of action would be to expose a lower-level interface
to arrays, using plain integers for indices (and offering a way to
bypass runtime bounds checks), and then build higher-level libraries on
top of that. But noooo...
I should also point out that the mutable and immutable arrays above are
both "boxed" arrays. That is, an array of *pointers* to data, rather
than literally an array of data itself. (Rather like a Java array of any
object type.)
GHC also offers "unboxed" arrays, which work more like C arrays.
However, these are only offered for certain element types. (Int, Double,
etc.) Theoretically you can add more - but it's not terribly easy (or
documented).
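Usage-wise they look just like the other arrays. A quick sketch of my
own, with Data.Array.Unboxed:
import Data.Array.Unboxed

-- Same API as before, but a UArray stores the Doubles directly,
-- with no pointers and no unevaluated thunks.
temps :: UArray Int Double
temps = listArray (0, 3) [12.5, 14.0, 13.2, 11.8]

warmest :: Double
warmest = maximum (elems temps)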
Still, using unboxed arrays brings several advantages:
- Uses less RAM.
- The GC can treat the whole array as "one object", so GC workload is
reduced.
- Probably gives much better cache behaviour than an array of pointers
pointing all over the place.
One nice touch is that an unboxed array of Bool actually uses 1 bit per
element. The library automatically handles all the bit twiddling
operations required to pack 8 Bools into one byte, and unpack them again
later. (I wonder if the C++ STL can do that?)
A non-obvious side effect of this is that the elements of such an array
*cannot* be as-yet unexecuted function calls. Unboxed arrays are
therefore *strict*. That is, "examining" any element of the array causes
the *entire* array to be computed, not just that element (as would be
the case for a normal boxed array or a list).
So, GHC gives you arrays which are immutable, mutable, boxed or unboxed.
Use whatever you fancy. That, surely, should be the end of the story.
Right??
Well... actually no. For a couple of reasons:
- The silly array indices (and the inability to get round them).
- The array libraries provide only very basic, simplistic
functionality, compared to the sophisticated list and tree libraries.
- Using a monad is typically not very convenient. Thus, we would like to
have arrays that "look like" lists or something, but have the efficiency
of arrays.
Hence the zoo of other array libraries out there. Allow me to survey a
few of them...
First came the Bytestring library. Now Haskell 98 defines a type called
"String", but this is really just an alias for a linked list of (32-bit
Unicode) characters. That means that the entire list-processing library
can be brought to bear on all your string-processing problems. But it's
also not especially efficient.
[One paper claims that each character in the list takes up 12 *bytes* of
RAM. Not to mention that, say, a 1,000 character string gives the GC
1,001 tiny objects to process. And, obviously, you can forget about
cache locality. And then there's the O(n) random access time.]
To summarise: Flexible, not very fast.
Enter the Bytestring library. It provides a type called "Bytestring",
and a bunch of functions for working with them, which are mostly carbon
copies of the standard list-processing functions. However, a Bytestring
is actually a linked list of array "chunks", and each such chunk can
contain up to 256 bytes. [These bytes can be interpreted either as
8-bit unsigned integers, or as 8-bit character codes.]
This representation has the following effects:
- Memory usage goes from 12 bytes per character to (approximately) 1
byte per character.
- Random access is now 256 times faster.
- The GC now has 256 times fewer "objects" to consider.
- When doing I/O, the Bytestring chunk can be directly used as the I/O
"buffer" in the low-level OS calls. There is no need to "pack" or
"unpack" the byte array that the OS uses to/from Haskell list nodes.
On top of that, Bytestrings can be "sliced". That is, two Bytestring
objects can refer to different portions of the same underlying array
chunk(s). And *that* means that the library can offer a function that
takes a Bytestring and returns a new Bytestring with the initial
character removed - in O(1) time and space.
Since Bytestrings are linked lists, they also support very fast
concatenation operations. In short, Bytestring is able to provide a
list-like API with efficiency equal or superior to lists.
Bottom line: Take your string-processing program, import Bytestring,
change a few type signatures and function names, and watch your program
go several orders of magnitude faster (and use less RAM). On top of
that, you now have reasonably fast random access over strings too
(assuming they aren't really huge).
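The "before and after" really is about that mechanical. For example
(my own sketch, using the strict Char8 interface for simplicity;
"input.txt" is obviously just a made-up file name):
import qualified Data.ByteString.Char8 as B

-- List version: String is just [Char], processed with list functions.
countAs :: String -> Int
countAs = length . filter (== 'a')

-- Bytestring version: same shape, different functions and types.
countAsBS :: B.ByteString -> Int
countAsBS = B.length . B.filter (== 'a')

main :: IO ()
main = do
  text <- B.readFile "input.txt"   -- reads straight into byte buffers
  print (countAsBS text)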
Bytestring is great. Bytestring is good. Bytestring makes everything
warm and fuzzy. However, Bytestring only works for... well... bytes.
It's great for fast I/O. It's great for processing raw binary or raw
text. But it's not a general-purpose array; it can only store bytes.
And then along came GHC 6.10.1. This comes with a technology preview of
the Data Parallel Haskell (DPH) library which has been in development
for the last few years.
DPH provides a new kind of array: "parallel arrays". A parallel array is
immutable, unboxed, and therefore strict. That is, "examining" any
element of a parallel array causes the entire array to be computed.
In parallel.
The DPH framework runs off the back of a small stack of brick-thick
research papers describing how to take the Haskell source code and
radically rearrange both the physical data layout in RAM and the order
of operations in the program to expose maximum opportunity for parallel
execution. The work is automatically sliced up into appropriate-size
chunks, and fired off to separate CPU cores.
For example, in plain Haskell you can represent a vector as a list. You
can then take the dot product of two such vectors using
dot xs ys = sum [ x * y | x <- xs | y <- ys ]
If you now trivially modify this code to say
dotp xs ys = sumP [: x * y | x <- xs | y <- ys :]
then the "vectors" are now parallel arrays, the multiplications will be
performed in parallel, and the sum will also be parallelised as much as
possible.
One of the Big Features is that the stuff you do in parallel can itself
be parallel - they call it Nested Data Parallelism (NDP). And it's very
hard to implement efficiently. The engine does things like storing an
array of pairs as a pair of arrays, and storing an array of arrays as a
single flat array with an array of subarray indices. (This allows the
flat array to be coarsely split into chunks for processing on each core,
even if some of the logical subarrays are tiny while others are huge.)
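You can get a feel for those two layout tricks with plain lists (this
is just my own illustration of the idea, not DPH's actual machinery):
-- An "array of pairs" stored as a pair of arrays:
pairs :: [(Int, Double)]
pairs = [(1, 1.5), (2, 2.5), (3, 3.5)]

asTwoArrays :: ([Int], [Double])
asTwoArrays = unzip pairs               -- ([1,2,3], [1.5,2.5,3.5])

-- An "array of arrays" stored as a flat array plus subarray lengths:
nested :: [[Int]]
nested = [[1, 2, 3], [], [4, 5]]

flattened :: ([Int], [Int])
flattened = (map length nested, concat nested)
                                        -- ([3,0,2], [1,2,3,4,5])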
Most if not all of this would be impossible in a "normal" programming
language, because you could (trivially) write code where the order of
execution matters. This kind of thing works so well in Haskell because
you *cannot* write such code. (Although, clearly, you could design a
library like this for another language and say "and if the order
matters, your program will break". Only Haskell gives you a guarantee
though.)
It's very exciting stuff, and it's brand-new, hot off the press... but
it isn't finished yet. Half the code is missing. Documentation? What
documentation? You also have to seriously contort your program to get it
to work. (The compiler support isn't finished yet, so you can't mix
parallel and sequential code in the same module, for example.)
It'll be awesome when it's finished, but for now it's not very useful.
And then there are more libraries - vector, uvector, storablevector, etc.
These address perceived weaknesses of the other array libraries. So we
have various array libraries, each with a different overlapping set of
features and deficiencies.
To give you a flavour, let's take a look at the "vector" library. This
provides "vectors" which are really just arrays again. (They're called
something different to prevent name clashes, basically.) They're boxed,
and come in mutable and immutable flavours. But, unlike what GHC itself
provides, there's no silly indices here.
Of course, they had to go one better and try to make their immutable
arrays as fast as mutable ones. Basically, using a bunch of clever
trickery, they get the compiler to do in-place updates wherever
possible, even though the programmer thinks they're using plain old
immutable arrays. I can't say how well it works, but the technology is
pretty interesting...
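In day-to-day use it looks something like this (a quick sketch of my
own with Data.Vector; note the plain, zero-based Int indices):
import qualified Data.Vector as V

v :: V.Vector Int
v = V.fromList [10, 20, 30, 40]

first   = v V.! 0              -- O(1) indexing, plain Int indices
doubled = V.map (* 2) v        -- the usual list-like combinators
updated = v V.// [(2, 99)]     -- "update" still returns a new vector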
To begin at the beginning, in Haskell it is very common to write
"pipelines" of list-processing functions such as
filter odd . map head . group . sort
[That reads from right to left. So, sort some stuff, group it, map over
it, and finally filter it.]
On paper, each of those functions consumes a list and constructs a new
list. Which isn't terribly efficient. Hence various compilers and
libraries attempt to perform "fusion". That is, we want these separate
functions to "fuse" together into one big loop that traverses the input
list once, directly building the output list, with no intermediate
temporary lists and no repeated traversals.
Of course, this isn't always possible (e.g., how do you sort a list in
one pass?), but ideally that's what we'd like. We want the compiler to
take our pipeline of separate functions and build one giant,
custom-tuned loop out of it. (We could, of course, code that loop
ourselves directly. But if the compiler can do it automatically, that
would be much better!)
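Just to pin down what "one giant loop" means, here's a simpler
pipeline than the one above, together with the hand-written loop we'd
like fusion to produce (my sketch, not the compiler's actual output):
-- What we write: conceptually three passes and two intermediate lists.
pipeline :: [Int] -> Int
pipeline = sum . map (* 2) . filter even

-- What we'd like it to become: one loop, no intermediate lists.
fused :: [Int] -> Int
fused = go 0
  where
    go acc []     = acc
    go acc (x:xs)
      | even x    = go (acc + 2 * x) xs
      | otherwise = go acc xs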
One of the most recent fusion techniques is "stream fusion". It involves
a "stream" type which can be processed non-recursively. So our pipeline
above becomes something like
stream_to_list . filter_stream odd . list_to_stream .
stream_to_list . map_stream head . list_to_stream .
stream_to_list . group_stream . list_to_stream .
stream_to_list . sort_stream . list_to_stream
So, instead of processing a list, you turn that list into a stream,
process the stream, and turn the stream back into a list. Which,
obviously, is more work. But it turns out that only stream_to_list is
recursive; the rest is completely non-recursive, with no looping at all.
Also, it should be perfectly obvious that
list_to_stream . stream_to_list
does nothing. [Not 100% true, but close enough.] So we can delete the
repeated conversions, leaving
stream_to_list . filter_stream odd .
                 map_stream head .
                 group_stream .
                 sort_stream . list_to_stream
In other words, convert the list to a stream, perform multiple
(non-recursive) stream-processing operations on the stream, and then
convert the stream back into a list. stream_to_list contains the only
actual loop here, so when all the definitions are inlined, we get one
huge loop at the end of it all, as required. (Although I suspect in
reality, you can't actually implement the sort_stream function.)
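For the curious, the stream type itself is surprisingly small. Here's
a simplified sketch, loosely following the stream fusion papers (the
real library has more machinery, and the actual fusion is driven by
rewrite rules that I'm not showing):
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a traversal: stop, skip ahead, or yield an element.
data Step s a = Done | Skip s | Yield a s

-- A stream is a stepper function plus some hidden traversal state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Non-recursive: the list itself becomes the stream's state.
list_to_stream xs = Stream step xs
  where
    step []     = Done
    step (y:ys) = Yield y ys

-- The only recursive function: the loop that drives a stream to the end.
stream_to_list (Stream step s0) = go s0
  where
    go s = case step s of
             Done       -> []
             Skip s'    -> go s'
             Yield x s' -> x : go s'

-- Non-recursive stream versions of map and filter.
map_stream f (Stream step s0) = Stream step' s0
  where
    step' s = case step s of
                Done       -> Done
                Skip s'    -> Skip s'
                Yield x s' -> Yield (f x) s'

filter_stream p (Stream step s0) = Stream step' s0
  where
    step' s = case step s of
                Done                   -> Done
                Skip s'                -> Skip s'
                Yield x s' | p x       -> Yield x s'
                           | otherwise -> Skip s'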
Stream fusion is great for lists, but it works for any data structure. A
"stream" is really just a traversal of a data structure. xxx_to_stream
sucks data out of a container, and stream_to_xxx squirts that data into
a new container. So you can, for example, apply stream fusion to arrays.
And the vector library does this.
However, if you just want map, filter, sum, etc., why are you using
arrays? Lists can already do that! No, the only reason to use an array
is if you want *random access*. And stream fusion can't help you with that.
Some clever soul has come up with "array recycling" though. This is
basically a flashy term for a system that performs as many updates as
possible in-place rather than making copies. It works together with
stream fusion, and uses a similar approach:
Functions that update arrays are split into several smaller, more
primitive functions. There's a function that allocates new space, a
function that copies the existing data into the new space, and a
function that updates the new array. By using clever rewrite rules, you
can get it so that if you update an array, and then immediately update
it again, there's no need to copy the array. You can just perform all
updates bar the initial one in-place.
The result? Immutable, referentially-transparent arrays with almost the
same performance as mutable arrays. (All these clever transformations
happen at compile-time, not runtime.)
...and now, the C programmers must be chuckling to themselves that
Haskell is doing all this work where a C programmer would just trivially
use a mutable array in the first place. :-P But while I doubt it will
ever be possible to write Haskell code that runs faster than C, being
able to write high-level, maintainable code that still performs close to
C would be a big win, IMHO. C can *always* be made slightly faster
somehow; the question, to me, is how much effort it takes to make fast
code in C vs Haskell.
(If DPH ever gets into a working state, being able to trivially write
highly parallel code in Haskell should become a C-killer feature. But
let's wait until that one actually happens...)
TO BE CONTINUED...