First month with Haddock

Posted on July 30, 2013 by Fūzetsu

As per the original proposal, over the last couple of weeks I have been hacking on Haddock. I will talk a bit about where we currently stand, what has been achieved, what hasn’t and what’s next. While the proposal wasn’t originally my idea, I will do my best to explain the reasoning behind it.

Note: This is written in the form of me talking about my personal experience with Haddock. If all you want are some hard facts on what’s implemented, what isn’t and what’s planned, I recommend you skip to the overview. The Tests subsection might also interest you. This first post is long because it covers a large chunk of time. Future posts should be considerably shorter and about a specific area of the project rather than about the project as a whole.

The first couple of weeks were spent on reading the source. Unfortunately, Haddock is quite tightly bound to GHC, which means that I had to do a lot of GHC source diving as well just to get to grips with the situation. I recommend reading the GHC commentary if you haven’t already, as it’s fairly informative. I think the main surprise to me was that Haddock gets comment strings straight from GHC. I guess I had always thought that Haddock was simply ripping these right out of the source files.

Parser change

The very first point of the proposal was to replace the existing parser with a new, more powerful one. Alex and Happy are currently used to generate the lexer and the parser at compile time. There are the following problems:

With these two main points in mind, Attoparsec was chosen in the hope that parser combinators will provide a cleaner, more maintainable and more extensible implementation. As a downside, it depends on Data.Text, which is currently not in the GHC boot libraries. Fortunately, it also has a ByteString version, which lets us avoid the problem.
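To give a feel for what the parser combinator style over ByteString looks like, here is a minimal sketch. It is not Haddock’s actual parser: the ‘Doc’ type is a much-reduced, hypothetical stand-in for the real nested structure, the names ‘emphasis’, ‘plain’, ‘loneSlash’ and ‘parseDoc’ are made up for illustration, and only the /emphasis/ markup is handled.

```haskell
import Control.Applicative (many, (<|>))
import Data.Attoparsec.ByteString.Char8
  (Parser, char, endOfInput, parseOnly, takeWhile1)
import qualified Data.ByteString.Char8 as B

-- Hypothetical, much-reduced stand-in for Haddock's nested 'Doc' structure.
data Doc
  = DocString String
  | DocEmphasis String
  deriving (Eq, Show)

-- /emphasised/ text: a non-empty run of characters between two slashes.
emphasis :: Parser Doc
emphasis =
  DocEmphasis . B.unpack <$> (char '/' *> takeWhile1 (/= '/') <* char '/')

-- A non-empty run of plain characters up to the next slash.
plain :: Parser Doc
plain = DocString . B.unpack <$> takeWhile1 (/= '/')

-- A slash that does not open a well-formed emphasis span.
loneSlash :: Parser Doc
loneSlash = DocString "/" <$ char '/'

-- Attoparsec backtracks when an alternative fails, so a failed 'emphasis'
-- falls through to 'plain' and then 'loneSlash' without extra machinery.
parseDoc :: B.ByteString -> Either String [Doc]
parseDoc = parseOnly (many (emphasis <|> plain <|> loneSlash) <* endOfInput)
```

In GHCi, ‘parseDoc (B.pack "a /b/ c")’ gives ‘Right [DocString "a ",DocEmphasis "b",DocString " c"]’. Note that lexing and parsing happen in the same pass here, with no generated code involved.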

A word of advice: if you ever move modules around in an existing project, or even add some new ones, make sure to update your cabal file. Otherwise all you’re going to get are weird linker errors at the end of the build with no indication of what’s wrong.

Tests

When the time to code actually came and I got GHC HEAD going (this is unfortunately a must for Haddock hacking; on the upside, you get all the cool HEAD stuff like TypeHoles), I very quickly realised that running tests was a major pain. There are two test suites: HTML tests and Hspec tests. HTML tests are just short Haskell modules that we compile, generate documentation for and diff against known-good reference HTML files. This means going to the shell, compiling Haddock, running ‘cabal test’, waiting for the other test-suite to run first and examining the output. Even worse, if anything is actually wrong, you don’t even get told which file it’s in and you have to grep for the elements of the output.

Hspec tests are the nice kind. The kind that you can load into GHCi and whenever you’re ready, simply ‘:r’ and ‘main’ to get a coloured output and clear indication of what’s failing. The downside is that there were only about 5 cases there which means that you often passed these only to fail on HTML tests later. If at all possible, it’d be nice to fail early, before going out to the shell and doing all the tedious stuff. My first remedy to this was less-than-pretty.

I added trace statements to the part of the program that takes in strings and outputs our internal format (a nested ‘Doc’ structure). With this, and a couple of minutes with emacs macros, I soon had hundreds of (rather redundant) Hspec tests without going out to the shell. I later removed the redundancies and split the tests up into nice categories. This way we actually had some tests. I should note that there’s one more type of testing I like to do: download existing, large libraries that compile with both HEAD and stable, generate docs for them with each, and then diff the two. If this passes then I feel fairly confident that nothing major got broken. HXT was pretty good for this. Any problems discovered this way would go right into the Hspec tests, which is how the test-suite has naturally grown to ~75 reasonable test cases.
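The trace trick can be sketched roughly as follows. This is only an illustration of the idea, not the code I actually used: ‘toyParse’ is a hypothetical, hand-rolled stand-in for the real string-to-‘Doc’ function, the ‘Doc’ type is heavily reduced, and ‘tracedParse’ is a made-up name. Only ‘trace’ from Debug.Trace and the Hspec functions (‘hspec’, ‘describe’, ‘it’, ‘shouldBe’) are real library calls.

```haskell
import Debug.Trace (trace)
import Test.Hspec

-- Hypothetical, much-reduced stand-in for the nested 'Doc' structure.
data Doc
  = DocString String
  | DocEmphasis String
  deriving (Eq, Show)

-- Toy parser: recognises /emphasis/ and treats everything else as text.
toyParse :: String -> [Doc]
toyParse "" = []
toyParse ('/' : rest)
  | (inner@(_ : _), '/' : rest') <- break (== '/') rest
  = DocEmphasis inner : toyParse rest'
  | otherwise = DocString "/" : toyParse rest
toyParse s = DocString text : toyParse rest
  where
    (text, rest) = break (== '/') s

-- Same result as 'toyParse', but logs a ready-to-paste Hspec case to
-- stderr for every input it sees.
tracedParse :: String -> [Doc]
tracedParse input = trace line result
  where
    result = toyParse input
    line =
      "it " ++ show ("parses " ++ input) ++ " $ toyParse " ++ show input
        ++ " `shouldBe` " ++ show result

-- The kind of cases that end up in the suite once the output is cleaned up.
spec :: Spec
spec = describe "toyParse" $ do
  it "parses plain text" $
    toyParse "hello" `shouldBe` [DocString "hello"]
  it "parses emphasis" $
    toyParse "a /b/ c"
      `shouldBe` [DocString "a ", DocEmphasis "b", DocString " c"]
  it "keeps an unmatched slash as text" $
    toyParse "a/b" `shouldBe` [DocString "a", DocString "/", DocString "b"]

main :: IO ()
main = hspec spec
```

Loading this into GHCi and running ‘main’ gives the coloured pass/fail output. The cases logged by ‘tracedParse’ only record what the parser does right now, so they are regression tests rather than a specification, and they need pruning by hand.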

Implementation

My first one-pass attempt failed horribly. I could not reproduce the behaviour of the original program. A lot of the grammar seems to have been written for Alex and Happy: that is, the syntax appears to have been designed to be easy to parse with those tools, rather than being worked out first, independently of the tool used.

Defeated and already a fair amount of time into the coding stage, I changed my approach. The initial attempt at this resulted in what was very much a direct port of the grammar to Attoparsec, with some custom combinators for state transitions (for the lexing) and backtracking with lists (for the parser). As you can imagine, this doesn’t gain us much over the original program, but the hope was that it would be easier to transition from this to a single pass than to do so straight away. While I was implementing a few new features with this parser and writing a few more tests for them, the time came when a single-pass parser became necessary. I met the same problems as initially (trying to combine two grammars into a single-pass parser combinator implementation) but have recently received guidance which made a large number of these issues go away and made the implementation far more idiomatic with respect to the parser combinator approach. This parser is nearly finished (and probably would have been if I hadn’t chosen to write this post instead). It’s worth noting that this parser will bring a couple of changes and will therefore no longer be 100% compatible with the old one. More syntax will be allowed and no old syntax will be disallowed, which means that old documentation should be safe to generate with the new parser without fear of breaking it.

While I have spent the vast majority of my time in the parser part of the program, I suspect that features such as GADTs and type families, which are scheduled to be added in the future, will take me well outside this area, potentially into the GHC source again. You can expect blog posts about those if they turn out to be interesting enough to post about.

Overview

Note that none of what is described below is in the Haddock tree yet. An announcement will be made when the changes are pushed upstream.

Plans

Any syntax extensions are unlikely to break any existing documentation.