I wanted to revisit writing code in Haskell, and in doing so try to understand the I/O model a bit better. I spent quite some time reading texts and tutorials on monads, and discovered that most of them are utterly inaccessible unless you already understand monads.
Fortunately for me, I came across this blog post which gave a very good introduction to the concepts involved, while being down to earth enough that it didn’t require prior comprehension of monads.
Armed with this new level of understanding, I thought the best way to apply it would be to attempt an implementation of the UNIX sort program.
Now, there are plenty of noddy implementations of sort around on the web. Most of them look something like this:
    import Data.List (sort)

    main :: IO ()
    main = interact (unlines . sort . lines)
This is all well and good, but it's not much use for understanding I/O. It just takes a single input stream, sorts it and outputs it. The magic is all wrapped up in interact, so it's all invisible and rather boring.
Now, real UNIX sort takes a bunch of filenames as arguments, and sorts the contents of all of them into one giant output. That sounds more interesting. So let's now have a play with Haskell's do notation.
    import Data.List (sort)
    import System.Environment (getArgs)

    main :: IO ()
    main = do
      x <- getArgs
      y <- mapM readFile x
      (putStr . unlines . sort . lines . concat) y
That's more like it. Let's take a closer look.
Essentially, each line in the do block takes a value which is wrapped in a monad and passes it on to the following lines nicely unwrapped. So, getArgs returns type IO [String], and readFile takes a FilePath, which is just a synonym for String. The do block does the job of extracting the [String] from IO [String], so x can be handed straight to mapM readFile.
mapM is just like map, but deals with functions returning monadic types: instead of a list of IO String actions, it gives back a single IO [String]. By the time we get to use y, it has type [String], where each string is the contents of one file.
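To convince myself of the difference between map and mapM, a small sketch like the following helps. The names pending and readAll are made up for illustration; only map, mapM, readFile and getArgs come from the standard libraries.

```haskell
import System.Environment (getArgs)

-- map alone builds a list of actions; none of them has been run yet.
pending :: [FilePath] -> [IO String]
pending = map readFile

-- mapM runs each action in turn and collects the results inside one IO.
readAll :: [FilePath] -> IO [String]
readAll = mapM readFile

main :: IO ()
main = do
  names <- getArgs             -- names :: [String]
  contents <- readAll names    -- contents :: [String], one entry per file
  print (map length contents)  -- character count of each file
```

Note that pending never touches the disk by itself: it only describes the reads. It is mapM (or sequence applied to the result of map) that turns the list of actions into one action which can be run.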
Then we compose a function to apply to the list. I could have written the last line as putStr (unlines (sort (lines (concat y)))), but I find the brackets ugly. putStr has type String -> IO (): it takes the string we want to print and returns the IO () that main needs.
I find this code fairly straightforward to understand (and Haskell's powerful type checker made sure that by the time it compiled, the types at least all fitted together), but I wanted to check my understanding of the magic of the do block by rewriting it as a one-liner.
    main :: IO ()
    main = getArgs >>= mapM readFile >>= putStr . unlines . sort . lines . concat
This is a bit harder to understand than the do block at first sight, but not immensely so once you realise the parallel. The do construct is just syntactic sugar, and this is more or less what it is translated to. For details of why, and of what >>= means, refer to the tutorial I linked to above.
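To spell out the parallel, here is a sketch of the same program at three stages of the translation. The names render, mainDo, mainLambda and mainPointFree are mine, chosen only so that the three versions can sit side by side in one file.

```haskell
import Data.List (sort)
import System.Environment (getArgs)

-- The pure part of the pipeline, named so the three versions read alike.
render :: [String] -> String
render = unlines . sort . lines . concat

-- 1. The do block.
mainDo :: IO ()
mainDo = do
  x <- getArgs
  y <- mapM readFile x
  putStr (render y)

-- 2. Each "name <- action" line becomes "action >>= \name -> rest".
mainLambda :: IO ()
mainLambda =
  getArgs >>= \x ->
    mapM readFile x >>= \y ->
      putStr (render y)

-- 3. Dropping the lambdas, since each one only passes its argument along.
mainPointFree :: IO ()
mainPointFree = getArgs >>= mapM readFile >>= putStr . render

main :: IO ()
main = mainLambda
```

Strictly speaking, version 3 groups the binds differently from version 2 (it binds left to right rather than nesting), but the monad laws say the two groupings behave the same.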
It can also be written the other way round:
    main :: IO ()
    main = putStr . unlines . sort . lines . concat =<< mapM readFile =<< getArgs
I'm not sure which one is easier to read.
It's also possible to nicely separate out the monadic nastiness from the pure function dealing with strings:
    import Data.List (sort)
    import System.Environment (getArgs)

    main :: IO ()
    main = forallfiles transform

    forallfiles :: (String -> String) -> IO ()
    forallfiles f = getArgs >>= mapM readFile >>= putStr . f . concat

    transform :: String -> String
    transform = unlines . sort . lines
While forallfiles might not be the best name in the world for this, it does at least wrap up the nastiness into a single function, leaving transform as a pure function operating on a String. Pure functions are the staple of functional programming, and it is nice to keep as much of the program pure as possible. If transform were more substantial, this would likely make the program easier to understand and to maintain.
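One payoff of the split is that the pure half can be swapped or tested without touching any files. As a sketch, here is a second transform that sorts in descending order, roughly in the spirit of sort -r. The name transformReversed is hypothetical and not part of the original program; sortOn and Down come from Data.List and Data.Ord.

```haskell
import Data.List (sort, sortOn)
import Data.Ord (Down (..))
import System.Environment (getArgs)

-- The I/O wrapper, unchanged: read every file named on the command line,
-- join the contents, apply a pure function and print the result.
forallfiles :: (String -> String) -> IO ()
forallfiles f = getArgs >>= mapM readFile >>= putStr . f . concat

-- The original pure transform.
transform :: String -> String
transform = unlines . sort . lines

-- A hypothetical variation: the same lines, largest first.
transformReversed :: String -> String
transformReversed = unlines . sortOn Down . lines

main :: IO ()
main = forallfiles transformReversed
```

Because transform is pure, it can be poked at directly in GHCi with string literals. Doing that also shows up a wrinkle in the concat step: if a file does not end with a newline, its last line gets glued to the first line of the next file, so transform (concat ["b\na", "c\n"]) gives "ac\nb\n" rather than three sorted lines.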
So what about this IO () type? Well, the easiest way to think about it is as an opaque type just representing some kind of I/O. And what is I/O? Well, I/O is actually a change in the state of the outside world. It's quite easy to imagine that with putStr. The world (let's say your screen) before putStr is as it is; after putStr the screen has an extra string printed at the bottom. The change in the state of the world is the outputting of the string, and it is that change that is returned from the function.
But what about input? Well, the answer is that input is also returned in an IO type. For example, getArgs has type IO [String]. There is no input to the function. The [String] part gets passed on to something more useful (in this case, mapM readFile) and the IO part is taken care of by the monad. It is this IO part that represents the change of state of the outside world. But what actually is this change of state? Well, here it's just the fact that the arguments have been got. That helps with the ordering of the function evaluation by providing a data dependency that the monad can use to sequence the I/O. If we were reading a line from a file though, it could represent something more concrete, like the read pointer in the file.
Ultimately though, it's not important exactly what gets wrapped up in IO. That's why it's opaque. But every function that does any I/O at all will return an IO type, and the return values (i.e. changes in the state of the outside world) will be combined by the IO monad in the right order. What this ensures is that all the I/O (and hence any computation necessary to support that I/O) is done in the right order.
Of course, it's entirely likely that there isn't any real data being passed around in IO. The compiler is probably being clever and doing the ordering that the monad does in such a way that it's safe for functions which return an IO type to have side effects, so that there is no way in which a function can have side effects without being ordered safely by the compiler. I think that's a pretty neat way of wrapping up side effects in return values, even if it's hard to understand at first.
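The "change in the state of the world" picture can be played with in a toy model. The sketch below is emphatically not the real IO type: World, Toy, toyGetLine, toyPutStr and echoSwapped are all invented for illustration. Each action is a function from the world to a result plus a new world, and the bind operator threads the world through, which is exactly the data dependency that forces the ordering.

```haskell
-- A toy outside world: input lines still to be read, and output so far.
data World = World { inputs :: [String], outputs :: [String] }
  deriving (Eq, Show)

-- A toy action: takes the world, returns a result and the changed world.
newtype Toy a = Toy { runToy :: World -> (a, World) }

instance Functor Toy where
  fmap f (Toy g) = Toy $ \w -> let (a, w1) = g w in (f a, w1)

instance Applicative Toy where
  pure a = Toy $ \w -> (a, w)
  Toy f <*> Toy g = Toy $ \w ->
    let (h, w1) = f w
        (a, w2) = g w1
    in (h a, w2)

instance Monad Toy where
  Toy g >>= k = Toy $ \w ->
    let (a, w1) = g w     -- run the first action on the incoming world
    in runToy (k a) w1    -- the next action needs w1, so it has to come second

-- Read one line of toy input (an empty string if none is left).
toyGetLine :: Toy String
toyGetLine = Toy $ \w -> case inputs w of
  []       -> ("", w)
  (l : ls) -> (l, w { inputs = ls })

-- Append a string to the toy output.
toyPutStr :: String -> Toy ()
toyPutStr s = Toy $ \w -> ((), w { outputs = outputs w ++ [s] })

-- Read two lines, then print them joined in the opposite order.
echoSwapped :: Toy ()
echoSwapped = do
  a <- toyGetLine
  b <- toyGetLine
  toyPutStr (b ++ a)

main :: IO ()
main = print (snd (runToy echoSwapped (World ["x", "y"] [])))
```

As far as I understand it, GHC's own IO is built on a similar state-passing idea internally, with a token standing in for the world rather than any real data, which fits the guess above that nothing substantial is actually being passed around.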
This is by no means a formal exploration of monads or the full extent of Haskell's I/O system. There are plenty of texts covering that already. But writing this has helped me to understand it a bit better, so hopefully it will help you too.