Documentation Generator

Generating documentation from comments embedded in code

Terms defined: accumulator, block comment, doc comment, line comment, slug

Many programmers believe they're more likely to write documentation and keep it up to date if it is close to the code. Tools that extract specially-formatted comments from code and turn them into documentation have been around since at least the 1980s; many are used for JavaScript, including JSDoc and ESDoc. This chapter will use what we learned earlier about parsing source code to build a simple documentation generator of our own.

How can we extract documentation comments?

We will use Acorn once again to parse our source files. This time we will use the parser's onComment option, giving it an array to fill in. For the moment we won't bother to assign the AST produced by parsing to a variable because we are just interested in the comments:

import fs from 'fs'
import acorn from 'acorn'

const text = fs.readFileSync(process.argv[2], 'utf-8')
const options = {
  sourceType: 'module',
  locations: true,
  onComment: []
}
acorn.parse(text, options)
console.log(JSON.stringify(options.onComment, null, 2))

// double-slash comment
/* slash-star comment */

[
  {
    "type": "Line",
    "value": " double-slash comment",
    "start": 0,
    "end": 23,
    "loc": {
      "start": {
        "line": 1,
        "column": 0
      },
      "end": {
        "line": 1,
        "column": 23
      }
    }
  },
  {
    "type": "Block",
    "value": " slash-star comment ",
    "start": 24,
    "end": 48,
    "loc": {
      "start": {
        "line": 2,
        "column": 0
      },
      "end": {
        "line": 2,
        "column": 24
      }
    }
  }
]

There is more information here than we need, so let's slim down the JSON that we extract:

import fs from 'fs'
import acorn from 'acorn'

const text = fs.readFileSync(process.argv[2], 'utf-8')
const options = {
  sourceType: 'module',
  locations: true,
  onComment: []
}
acorn.parse(text, options)
const subset = options.onComment.map(entry => {
  return {
    type: entry.type,
    value: entry.value,
    start: entry.loc.start.line,
    end: entry.loc.end.line
  }
})
console.log(JSON.stringify(subset, null, 2))

node extract-comments-subset.js two-kinds-of-comment.js

[
  {
    "type": "Line",
    "value": " double-slash comment",
    "start": 1,
    "end": 1
  },
  {
    "type": "Block",
    "value": " slash-star comment ",
    "start": 2,
    "end": 2
  }
]

Figure: Line and block comments. How line comments and block comments are distinguished and represented.

Acorn distinguishes two kinds of comments. Line comments cannot span multiple lines; if one line comment occurs immediately after another, Acorn reports two comments:

//
// multi-line double-slash comment
//

node extract-comments-subset.js multi-line-double-slash-comment.js

[
  {
    "type": "Line",
    "value": "",
    "start": 1,
    "end": 1
  },
  {
    "type": "Line",
    "value": " multi-line double-slash comment",
    "start": 2,
    "end": 2
  },
  {
    "type": "Line",
    "value": "",
    "start": 3,
    "end": 3
  }
]

Block comments, on the other hand, can span any number of lines. We don't need to prefix each line with * but most people do for readability:

/*
 * multi-line slash-star comment
 */

node extract-comments-subset.js multi-line-slash-star-comment.js

[
  {
    "type": "Block",
    "value": "\n * multi-line slash-star comment\n ",
    "start": 1,
    "end": 3
  }
]

By convention, we use block comments that start with /** for documentation. The first two characters are recognized by the parser as "start of comment", so the first character in the extracted text is *:

/**
 * doc comment
 */

[
  {
    "type": "Block",
    "value": "*\n * doc comment\n ",
    "start": 1,
    "end": 3
  }
]
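
Because the parser strips the opening /*, a doc comment can be recognized by checking that a block comment's value begins with *. The following is a minimal sketch, assuming comment records shaped like the Acorn output above; the helper name isDocComment is our own and is not part of Acorn or of this chapter's code.

```javascript
// Hypothetical helper: true only for block comments written as /** ... */.
const isDocComment = (comment) =>
  comment.type === 'Block' && comment.value.startsWith('*')

// Records shaped like the entries Acorn adds to the onComment array.
const comments = [
  { type: 'Line', value: ' double-slash comment' },
  { type: 'Block', value: ' slash-star comment ' },
  { type: 'Block', value: '*\n * doc comment\n ' }
]

const docComments = comments.filter(isDocComment)
console.log(docComments.length) // 1
console.log(JSON.stringify(docComments[0].value)) // "*\n * doc comment\n "
```

The code later in this chapter keeps every block comment; a filter like this would be one way to skip ordinary /* ... */ comments.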

What input will we try to handle?

We will use Markdown for formatting our documentation. The doc comments for function definitions look like this:

/**
 * # Demonstrate documentation generator.
 */

import util from './util-plain'

/**
 * ## `main`: Main driver.
 */
const main = () => { 
  // Parse arguments.
  // Process input stream.
}

/**
 * ## `parseArgs`: Parse command line.
 * - `args` (`string[]`): arguments to parse.
 * - `defaults` (`Object`): default values.
 *
 * Returns: program configuration object.
 */
const parseArgs = (args, defaults) => { 
  // body would go here
}

/**
 * ## `process`: Transform data.
 * - `input` (`stream`): where to read.
 * - `output` (`stream`): where to write.
 * - `op` (`class`): what to do.
 *    Use @BaseProcessor unless told otherwise.
 */
const process = (input, output, op = util.BaseProcessor) => { 
  // body would go here
}

while the ones for class definitions look like this:

/**
 * # Utilities to demonstrate doc generator.
 */

/**
 * ## `BaseProcessor`: General outline.
 */
class BaseProcessor {
  /**
   * ### `constructor`: Build processor.
   */
  constructor () { 
    // body would go here
  }

  /**
   * ### `run`: Pass input to output.
   * - `input` (`stream`): where to read.
   * - `output` (`stream`): where to write.
   */
  run (input, output) {
    // body would go here
  }
}

export default BaseProcessor

The doc comments are unpleasant at the moment: they repeat the function and method names from the code, we have to create titles ourselves, and we have to remember the back-quotes for formatting code. We will fix some of these problems once we have a basic tool up and running.

The next step in doing that is to translate Markdown into HTML. There are many Markdown parsers in JavaScript; after experimenting with a few, we decided to use markdown-it along with the markdown-it-anchor extension that creates HTML anchors for headings. The main program gets all the doc comments from all of the input files, converts the Markdown to HTML, and displays that:


const HEAD = '<html><body style="font-size: 100%; margin-left: 0.5em">'
const FOOT = '</body></html>'

const main = () => {
  const allComments = getAllComments(process.argv.slice(2))
  const md = new MarkdownIt({ html: true })
    .use(MarkdownAnchor, { level: 1, slugify: slugify })
  const html = md.render(allComments)
  console.log(HEAD)
  console.log(html)
  console.log(FOOT)
}

To get all the comments we extract comments from all the files, remove the leading * characters (which aren't part of the documentation), and then join the results after stripping off extraneous blanks:


const getAllComments = (allFilenames) => {
  return allFilenames
    .map(filename => {
      const comments = extractComments(filename)
      return { filename, comments }
    })
    .map(({ filename, comments }) => {
      comments = comments.map(comment => removePrefix(comment))
      return { filename, comments }
    })
    .map(({ filename, comments }) => {
      const combined = comments
        .map(comment => comment.stripped)
        .join('\n\n')
      return `# ${filename}\n\n${combined}`
    })
    .join('\n\n')
}

Extracting the comments from a single file is done as before:


const extractComments = (filename) => {
  const text = fs.readFileSync(filename, 'utf-8')
  const options = {
    sourceType: 'module',
    locations: true,
    onComment: []
  }
  acorn.parse(text, options)
  const subset = options.onComment
    .filter(entry => entry.type === 'Block')
    .map(entry => {
      return {
        type: entry.type,
        value: entry.value,
        start: entry.start,
        end: entry.end
      }
    })
  return subset
}

and removing the prefix * characters is a matter of splitting the text into lines, removing the leading spaces and asterisks, and putting the lines back together:


const removePrefix = (comment) => {
  comment.stripped = comment.value
    .split('\n')
    .slice(0, -1)
    .map(line => line.replace(/^ *\/?\* */, ''))
    .map(line => line.replace('*/', ''))
    .join('\n')
    .trim()
  return comment
}
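
To see what this does to real input, the sketch below repeats removePrefix so that it runs on its own and applies it to two hand-written comment values. The sample values are illustrative assumptions rather than output captured from the chapter's files.

```javascript
// Same logic as removePrefix above, repeated so the example is self-contained.
const removePrefix = (comment) => {
  comment.stripped = comment.value
    .split('\n')
    .slice(0, -1) // drop the final line, which preceded the closing */
    .map(line => line.replace(/^ *\/?\* */, '')) // leading spaces and asterisk
    .map(line => line.replace('*/', ''))
    .join('\n')
    .trim()
  return comment
}

// A multi-line doc comment as the parser reports it (without /* and */).
const multi = removePrefix({
  value: '*\n * ## `main`: Main driver.\n * More text.\n '
})
console.log(JSON.stringify(multi.stripped))
// "## `main`: Main driver.\nMore text."

// A one-line doc comment such as /** short */ has only one line,
// so slice(0, -1) discards all of its text.
const single = removePrefix({ value: '* short ' })
console.log(JSON.stringify(single.stripped)) // ""
```

The second case suggests that this version expects doc comments to be written across several lines, with the closing */ on a line of its own.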

One thing that isn't in this file (because we're going to use it in later versions) is the function slugify. A slug is a short string that identifies a header or a web page; the name comes from the era of newspapers, where a slug was a short name used to identify an article while it was in production. Our slugify function strips unnecessary characters out of a title, adds hyphens, and generally makes it something you might see in a URL:

const slugify = (text) => {
  return encodeURIComponent(
    text.split(' ')[0]
      .replace(/\.js$/, '')
      .trim()
      .toLowerCase()
      .replace(/[^ \w]/g, '')
      .replace(/\s+/g, '-')
  )
}

export default slugify
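
A few sample calls make the behavior concrete. This sketch repeats the function (with the dot in the suffix pattern escaped so that only a literal .js ending is removed) and feeds it strings modeled on the headings in our example files; the inputs are our own choices for illustration.

```javascript
// Same steps as slugify above, repeated so the example runs on its own.
const slugify = (text) => {
  return encodeURIComponent(
    text.split(' ')[0] // keep only the first word of the title
      .replace(/\.js$/, '') // drop a trailing .js from filenames
      .trim()
      .toLowerCase()
      .replace(/[^ \w]/g, '') // remove punctuation such as hyphens and back-quotes
      .replace(/\s+/g, '-')
  )
}

console.log(slugify('example-plain.js')) // exampleplain
console.log(slugify('`parseArgs`: Parse command line.')) // parseargs
console.log(slugify('Demonstrate documentation generator.')) // demonstrate
```

Since only the first word of the title is kept, the hyphen in a filename is removed along with the other punctuation, which is why the slug for example-plain.js is exampleplain.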

Let's run this generator and see what it produces:

node process-plain.js example-plain.js util-plain.js

<html><body style="font-size: 100%; margin-left: 0.5em">
<h1 id="exampleplain">example-plain.js</h1>
<h1 id="demonstrate">Demonstrate documentation generator.</h1>
<h2 id="main"><code>main</code>: Main driver.</h2>
<h2 id="parseargs"><code>parseArgs</code>: Parse command line.</h2>
<ul>
<li><code>args</code> (<code>string[]</code>): arguments to parse.</li>
<li><code>defaults</code> (<code>Object</code>): default values.</li>
</ul>
<p>Returns: program configuration object.</p>
<h2 id="process"><code>process</code>: Transform data.</h2>
<ul>
<li><code>input</code> (<code>stream</code>): where to read.</li>
<li><code>output</code> (<code>stream</code>): where to write.</li>
<li><code>op</code> (<code>class</code>): what to do.
Use @BaseProcessor unless told otherwise.</li>
</ul>
<h1 id="utilplain">util-plain.js</h1>
<h1 id="utilities">Utilities to demonstrate doc generator.</h1>
<h2 id="baseprocessor"><code>BaseProcessor</code>: General outline.</h2>
<h3 id="constructor"><code>constructor</code>: Build processor.</h3>
<h3 id="run"><code>run</code>: Pass input to output.</h3>
<ul>
<li><code>input</code> (<code>stream</code>): where to read.</li>
<li><code>output</code> (<code>stream</code>): where to write.</li>
</ul>

</body></html>

Figure: Output of documentation generator. The page produced by the documentation generator.
Figure: Mapping comments to documentation. How comments in code map to documentation in HTML.

It works, but there is a double h1 header for each file (the filename and the title comment), the anchor IDs are hard to read, there are no cross-references, and so on. Some of the visual issues can be resolved with CSS, and we can change our input format to make processing easier as long as it also makes authoring easier. However, anything that is written twice will eventually be wrong in one place or another, so our first priority is to remove duplication.

How can we avoid duplicating names?

If a comment is the first thing in a file, we want to use it as title text; this will save us having to write an explicit level-1 title in a comment. For each other comment, we can extract the name of the function or method from the node on the line immediately following the doc comment. This allows us to write much tidier comments:

/**
 * Overall file header.
 */

/**
 * Double the input.
 */
const double = (x) => 2 * x 

/**
 * Triple the input.
 */
function triple (x) { 
  return 3 * x
}

/**
 * Define a class.
 */
class Example { 
  /**
   * Method description.
   */
  someMethod () {
  }
}

To extract and display information from nodes immediately following doc comments we must find all the block comments, record the last line of each, and then search the AST to find nodes that are on lines immediately following any of those trailing comment lines. (We will assume for now that there are no blank lines between the comment and the start of the class or function.) The main program finds the comments as usual, creates a set containing the line numbers we are looking for, then searches for the nodes we want:


const main = () => {
  const options = {
    sourceType: 'module',
    locations: true,
    onComment: []
  }
  const text = fs.readFileSync(process.argv[2], 'utf-8')
  const ast = acorn.parse(text, options)
  const comments = options.onComment
    .filter(entry => entry.type === 'Block')
    .map(entry => {
      return {
        value: entry.value,
        start: entry.loc.start.line,
        end: entry.loc.end.line
      }
    })
  const targets = new Set(comments.map(comment => comment.end + 1))
  const nodes = []
  findFollowing(ast, targets, nodes)
  console.log(nodes.map(node => condense(node)))
}
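
As a worked example of the target set, the sketch below hard-codes the line ranges of the five doc comments in the tidier example file above instead of parsing it; the abbreviated value strings are placeholders we made up for readability.

```javascript
// Line ranges of the doc comments in the example file (values abbreviated).
const comments = [
  { value: 'Overall file header.', start: 1, end: 3 },
  { value: 'Double the input.', start: 5, end: 7 },
  { value: 'Triple the input.', start: 10, end: 12 },
  { value: 'Define a class.', start: 17, end: 19 },
  { value: 'Method description.', start: 21, end: 23 }
]

// Each target is the line right after the last line of a comment.
const targets = new Set(comments.map(comment => comment.end + 1))
console.log([...targets]) // [ 4, 8, 13, 20, 24 ]
```

Line 4 of that file is blank, so no node starts there and the file header is not paired with any definition.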

The recursive search is straightforward as well: we delete line numbers from the target set and add nodes to the accumulator as we find matches:


const findFollowing = (node, targets, accum) => {
  if ((!node) || (typeof node !== 'object') || (!('type' in node))) {
    return
  }

  if (targets.has(node.loc.start.line)) {
    accum.push(node)
    targets.delete(node.loc.start.line)
  }

  for (const key in node) {
    if (Array.isArray(node[key])) {
      node[key].forEach(child => findFollowing(child, targets, accum))
    } else if (typeof node[key] === 'object') {
      findFollowing(node[key], targets, accum)
    }
  }
}
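
Deleting a line number as soon as it matches is what stops nested nodes on the same line from being recorded as well. The sketch below repeats findFollowing and runs it on a hand-built, heavily abbreviated tree; real Acorn nodes carry many more fields, so treat the tree as an assumption made for illustration.

```javascript
// Same search as above, repeated so the example runs on its own.
const findFollowing = (node, targets, accum) => {
  if ((!node) || (typeof node !== 'object') || (!('type' in node))) {
    return
  }
  if (targets.has(node.loc.start.line)) {
    accum.push(node)
    targets.delete(node.loc.start.line)
  }
  for (const key in node) {
    if (Array.isArray(node[key])) {
      node[key].forEach(child => findFollowing(child, targets, accum))
    } else if (typeof node[key] === 'object') {
      findFollowing(node[key], targets, accum)
    }
  }
}

// Abbreviated tree for `const double = (x) => 2 * x` written on line 8.
const ast = {
  type: 'Program',
  loc: { start: { line: 1 } },
  body: [{
    type: 'VariableDeclaration',
    loc: { start: { line: 8 } },
    declarations: [{
      type: 'VariableDeclarator',
      loc: { start: { line: 8 } },
      id: { type: 'Identifier', name: 'double', loc: { start: { line: 8 } } }
    }]
  }]
}

const targets = new Set([4, 8])
const nodes = []
findFollowing(ast, targets, nodes)
console.log(nodes.map(node => node.type)) // [ 'VariableDeclaration' ]
console.log([...targets]) // [ 4 ]
```

Three nodes start on line 8, but only the outermost one is kept because the line number has already been removed by the time the search reaches its children.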

Finally, we use a function called condense to get the name we want out of the AST we have:


const condense = (node) => {
  const result = {
    type: node.type,
    start: node.loc.start.line
  }
  switch (node.type) {
    case 'VariableDeclaration':
      result.name = node.declarations[0].id.name
      break
    case 'FunctionDeclaration':
      result.name = node.id.name
      break
    case 'ClassDeclaration':
      result.name = node.id.name
      break
    case 'MethodDefinition':
      result.name = node.key.name
      break
    default:
      assert.fail(`Unknown node type ${node.type}`)
      break
  }
  return result
}

We need this because we get a different structure with:

const name = () => {
}

than we get with:

function name() {
}
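
The difference is easier to see side by side. The objects below are abbreviated ESTree-style nodes written by hand (actual parser output has more fields), and nameOf is a hypothetical helper that covers only these two cases.

```javascript
// Abbreviated node for: const name = () => {}
const asVariable = {
  type: 'VariableDeclaration',
  declarations: [{
    type: 'VariableDeclarator',
    id: { type: 'Identifier', name: 'name' },
    init: { type: 'ArrowFunctionExpression' }
  }]
}

// Abbreviated node for: function name () {}
const asFunction = {
  type: 'FunctionDeclaration',
  id: { type: 'Identifier', name: 'name' }
}

// Hypothetical helper: the name lives one level deeper in a declaration.
const nameOf = (node) =>
  node.type === 'VariableDeclaration'
    ? node.declarations[0].id.name
    : node.id.name

console.log(nameOf(asVariable)) // name
console.log(nameOf(asFunction)) // name
```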

When we run the complete program on our test case we get:

[
  { type: 'VariableDeclaration', start: 8, name: 'double' },
  { type: 'FunctionDeclaration', start: 13, name: 'triple' },
  { type: 'ClassDeclaration', start: 20, name: 'Example' },
  { type: 'MethodDefinition', start: 24, name: 'someMethod' }
]

We can use this to create better output:

import MarkdownIt from 'markdown-it'
import MarkdownAnchor from 'markdown-it-anchor'

import getComments from './get-comments.js'
import getDefinitions from './get-definitions.js'
import fillIn from './fill-in.js'
import slugify from './slugify.js'

const HEAD = '<html><body style="font-size: 100%; margin-left: 0.5em">'
const FOOT = '</body></html>'

const main = () => {
  const filenames = process.argv.slice(2)
  const allComments = getComments(filenames)
  const allDefinitions = getDefinitions(filenames)
  const combined = []
  for (const [filename, comments] of allComments) {
    const definitions = allDefinitions.get(filename)
    const text = fillIn(filename, comments, definitions)
    combined.push(text)
  }
  const md = new MarkdownIt({ html: true })
    .use(MarkdownAnchor, { level: 1, slugify: slugify })
  const html = md.render(combined.join('\n\n'))
  console.log(HEAD)
  console.log(html)
  console.log(FOOT)
}

main()

<html><body style="font-size: 100%; margin-left: 0.5em">
<h1 id="fillinheadersinput">fill-in-headers-input.js</h1>
<p>Demonstrate documentation generator.</p>
<h2 id="main">main</h2>
<p>Main driver.</p>
<h2 id="parseargs">parseArgs</h2>
<p>Parse command-line arguments.</p>
<ul>
<li><code>args</code> (<code>string[]</code>): arguments to parse.</li>
<li><code>defaults</code> (<code>Object</code>): default values.</li>
</ul>
<blockquote>
<p>Program configuration object.</p>
</blockquote>
<h2 id="baseprocessor">BaseProcessor</h2>
<p>Default processing class.</p>
<h3 id="constructor">constructor</h3>
<p>Build base processor.</p>
<h3 id="run">run</h3>
<p>Pass input to output.</p>
<ul>
<li><code>input</code> (<code>stream</code>): where to read.</li>
<li><code>output</code> (<code>stream</code>): where to write.</li>
</ul>

</body></html>

Figure: Filling in headers. Filling in headers when generating documentation.
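
The fillIn function is not shown in this chapter, so the following is only a guess at the idea behind it: pair each comment with the definition that starts on the next line and generate the heading from the definition's name. The name fillInSketch, the record shapes, and the choice of heading levels are all assumptions made for illustration.

```javascript
// Hypothetical stand-in for fillIn; not the chapter's implementation.
// comments: [{ stripped, end }], definitions: [{ type, name, start }]
const fillInSketch = (filename, comments, definitions) => {
  // Index definitions by the line on which they start.
  const byLine = new Map(definitions.map(def => [def.start, def]))
  const sections = comments.map(comment => {
    const def = byLine.get(comment.end + 1)
    if (!def) {
      // No definition follows (e.g. the file header): keep the text only.
      return comment.stripped
    }
    // Assumed convention: methods sit one level below functions and classes.
    const level = def.type === 'MethodDefinition' ? '###' : '##'
    return `${level} ${def.name}\n\n${comment.stripped}`
  })
  return [`# ${filename}`, ...sections].join('\n\n')
}

const markdown = fillInSketch(
  'demo.js',
  [
    { stripped: 'Overall file header.', end: 3 },
    { stripped: 'Double the input.', end: 7 },
    { stripped: 'Method description.', end: 23 }
  ],
  [
    { type: 'VariableDeclaration', start: 8, name: 'double' },
    { type: 'MethodDefinition', start: 24, name: 'someMethod' }
  ]
)
console.log(markdown)
```

With this approach the name appears only once, in the code, so renaming a function cannot leave a stale heading behind.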

Code is data

We haven't made this point explicitly in a while, so we will repeat it here: code is just another kind of data, and we can process it just like we would process any other data. Parsing code to produce an AST is no different from parsing HTML to produce DOM; in both cases we are simply transforming a textual representation that's easy for people to author into a data structure that's easy for a program to manipulate. Pulling things out of that data to create a report is no different from pulling numbers out of a hospital database to report monthly vaccination rates.

Treating code as data enables us to do routine programming tasks with a single command, which in turn gives us more time to think about the tasks that we can't (yet) automate. Doing this is the foundation of a tool-based approach to software engineering; as the mathematician Alfred North Whitehead once wrote, "Civilization advances by extending the number of important operations which we can perform without thinking about them."

Exercises

Building an index

Modify the documentation generator to produce an alphabetical index of all classes and methods found. Index entries should be hyperlinks to the documentation for the corresponding item.

Documenting exceptions

Extend the documentation generator to allow people to document the exceptions that a function throws.

Deprecation warning

Add a feature to the documentation generator to allow authors to mark functions and methods as deprecated (i.e., to indicate that while they still exist, they should not be used because they are being phased out).

Usage examples

Enhance the documentation generator so that if a horizontal rule --- appears in a documentation comment, the text following is typeset as a usage example. (A doc comment may contain several usage examples.)

Unit testing

Write unit tests for the documentation generator using Mocha.

Summarizing functions

Modify the documentation generator so that line comments inside a function that use //* are formatted as a bullet list in the documentation for that function.

Cross referencing

Modify the documentation generator so that the documentation for one class or function can include Markdown links to other classes or functions.

Data types

Modify the documentation generator to allow authors to define new data types in the same way as JSDoc.

Inline parameter documentation

Some documentation generators put the documentation for a parameter on the same line as the parameter:

/**
 * Transform data.
 */
function process(
  input,  /*- {stream} where to read */
  output, /*- {stream} where to write */
  op      /*- {Operation} what to do */
){
  // body would go here
}

Modify the documentation generator to handle this.

Tests as documentation

The doctest library for Python allows programmers to embed unit tests as documentation in their programs. Write a tool that:

  1. Finds functions that start with a block comment.

  2. Extracts the code and output from those block comments and turns them into assertions.

For example, given this input:

const findIncreasing = (values) => {
  /**
   * > findIncreasing([])
   * []
   * > findIncreasing([1])
   * [1]
   * > findIncreasing([1, 2])
   * [1, 2]
   * > findIncreasing([2, 1])
   * [2]
   */
}

the tool would produce:

assert.deepStrictEqual(findIncreasing([]), [])
assert.deepStrictEqual(findIncreasing([1]), [1])
assert.deepStrictEqual(findIncreasing([1, 2]), [1, 2])
assert.deepStrictEqual(findIncreasing([2, 1]), [2])