Systems Programming

Using callbacks to manipulate files and directories

Terms defined: Boolean, anonymous function, asynchronous, callback function, cognitive load, command-line argument, console, current working directory, destructuring assignment, edge case, filesystem, filter, globbing, idiomatic, log message, path (in filesystem), protocol, scope, single-threaded, string interpolation

The biggest difference between JavaScript and most other programming languages is that many operations in JavaScript are asynchronous. Its designers didn't want browsers to freeze while waiting for data to arrive or for users to click on things, so operations that might be slow are implemented by describing now what to do later. And since anything that touches the hard drive is slow from a processor's point of view, Node implements filesystem operations the same way.

How slow is slow?

Gregg2020 used the analogy in the table below to show how long it takes a computer to do different things if we imagine that one CPU cycle is equivalent to one second.

Operation                             Actual Time    Would Be…
1 CPU cycle                           0.3 nsec       1 sec
Main memory access                    120 nsec       6 min
Solid-state disk I/O                  50-150 μsec    2-6 days
Rotational disk I/O                   1-10 msec      1-12 months
Internet: San Francisco to New York   40 msec        4 years
Internet: San Francisco to Australia  183 msec       19 years
Physical system reboot                5 min          32,000 years

Computer operation times at human scale.

Early JavaScript programs used callback functions to describe asynchronous operations, but as we're about to see, callbacks can be hard to understand even in small programs. In 2015, the language's developers standardized a higher-level tool called promises to make callbacks easier to manage, and more recently they have added new keywords called async and await to make it easier still. We need to understand all three layers in order to debug things when they go wrong, so this chapter explores callbacks, while a later chapter shows how promises and async/await work. This chapter also shows how to read and write files and directories with Node's standard libraries, because we're going to be doing that a lot.

How can we list a directory?

To start, let's try listing the contents of a directory the way we would in Python or Java:

import fs from 'fs'

const srcDir = process.argv[2]
const results = fs.readdir(srcDir)
for (const name of results) {
  console.log(name)
}

We use import module from 'source' to load the library source and assign its contents to module. After that, we can refer to things in the library using module.component just as we refer to things in any other object. We can use whatever name we want for the module, which allows us to give short nicknames to libraries with long names; we will take advantage of this in future chapters.

require versus import

In 2015, a new version of JavaScript called ES6 introduced the keyword import for importing modules. It improves on the older require function in several ways, but Node still uses require by default. To tell it to use import, we have added "type": "module" at the top level of our Node package.json file.

Our little program uses the fs library, which contains functions to create directories, read or delete files, etc. (Its name is short for "filesystem".) We tell the program what to list using command-line arguments, which Node automatically stores in an array called process.argv. process.argv[0] is the path to the program used to run our code (in this case node), while process.argv[1] is the path to our program (in this case list-dir-wrong.js); the rest of process.argv holds whatever arguments we gave at the command line when we ran the program, so process.argv[2] is the first argument after the name of our program:

Command-line arguments in `process.argv`
How Node stores command-line arguments in process.argv.

If we run this program with the name of a directory as its argument, we expect fs.readdir to return the names of the things in that directory as an array of strings. The program uses for (const name of results) to loop over the contents of that array. We could use let instead of const, but it's good practice to declare things as const wherever possible so that anyone reading the program knows the variable isn't actually going to vary; doing this reduces the cognitive load on people reading the program. Finally, console.log is JavaScript's equivalent of other languages' print command; its strange name comes from the fact that its original purpose was to create log messages in the browser console.
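To make the layout of process.argv concrete, here is a minimal sketch that splits an argv-style array into the parts described above. The helper name describeArgv and the example paths are made up for illustration; only process.argv itself comes from Node.

```javascript
// Hypothetical helper: split an argv-style array into the three parts
// described in the text.
const describeArgv = (argv) => {
  return {
    runtime: argv[0], // the program running our code (node)
    script: argv[1], // the program we asked it to run
    args: argv.slice(2) // everything typed after the script's name
  }
}

// A made-up array shaped like the one Node builds for
// `node list-dir-wrong.js .`; the paths are illustrative only.
const example = ['/usr/local/bin/node', '/u/stjs/list-dir-wrong.js', '.']
console.log(describeArgv(example).args) // [ '.' ]

// The same helper applied to the real arguments of this run.
console.log(describeArgv(process.argv).args)
```

Using slice(2) rather than indexing each argument separately means the helper keeps working no matter how many arguments follow the script's name.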

Unfortunately, our program doesn't work:

node list-dir-wrong.js .
internal/process/esm_loader.js:74
    internalBinding('errors').triggerUncaughtException(
                              ^

TypeError [ERR_INVALID_CALLBACK]: Callback must be a function. Received \
undefined
    at makeCallback (fs.js:168:11)
    at Object.readdir (fs.js:994:14)
    at /u/stjs/systems-programming/list-dir-wrong.js:4:20
    at ModuleJob.run (internal/modules/esm/module_job.js:152:23)
    at async Loader.import (internal/modules/esm/loader.js:166:24)
    at async Object.loadESM (internal/process/esm_loader.js:68:5) {
  code: 'ERR_INVALID_CALLBACK'
}

The error message comes from something we didn't write whose source we would struggle to read. If we look for the name of our file (list-dir-wrong.js) we see the error occurred on line 4; everything above that is inside fs.readdir, while everything below it is Node loading and running our program.

The problem is that fs.readdir doesn't return anything. Instead, its documentation says that it needs a callback function that tells it what to do when data is available, so we need to explore those in order to make our program work.

A theorem

  1. Every program contains at least one bug.
  2. Every program can be made one line shorter.
  3. Therefore, every program can be reduced to a single statement which is wrong.

— variously attributed

What is a callback function?

JavaScript uses a single-threaded programming model: as the introduction to this lesson said, it splits operations like file I/O into "please do this" and "do this when data is available". fs.readdir is the first part, but we need to write a function that specifies the second part.

JavaScript saves a reference to this function and calls it with a specific set of parameters when our data is ready. Those parameters define a standard protocol for connecting to libraries, just like the USB standard allows us to plug hardware devices together.

Running callbacks
How JavaScript runs callback functions.

This corrected program gives fs.readdir a callback function called listContents:

import fs from 'fs'

const listContents = (err, files) => {
  console.log('running callback')
  if (err) {
    console.error(err)
  } else {
    for (const name of files) {
      console.log(name)
    }
  }
}

const srcDir = process.argv[2]
fs.readdir(srcDir, listContents)
console.log('last line of program')

Node callbacks always get an error (if there is any) as their first argument and the result of a successful function call as their second. The function can tell the difference by checking to see if the error argument is null. If it is, the function lists the directory's contents with console.log; otherwise, it uses console.error to display the error message. Let's run the program with the current working directory (written as '.') as an argument:

node list-dir-function-defined.js .
last line of program
running callback
Makefile
copy-file-filtered.js
copy-file-unfiltered.js
copy-file-unfiltered.out
copy-file-unfiltered.sh
copy-file-unfiltered.txt
figures
glob-all-files.js
...
x-check-arguments
x-counting-lines
x-destructuring-assignment
x-glob-patterns
x-rename-files
x-significant-entries
x-string-interpolation
x-trace-anonymous
x-trace-callback
x-where-is-node
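The error-first convention is not special to fs: any function can follow it. The sketch below is one way to write such a function ourselves. The names halve and report are hypothetical, and setTimeout stands in for a slow operation so that the example runs without touching the filesystem.

```javascript
// Hypothetical function that follows Node's error-first callback convention.
// setTimeout defers the callback, standing in for a slow operation.
const halve = (n, callback) => {
  setTimeout(() => {
    if (n % 2 !== 0) {
      // Failure: an Error comes first and there is no result.
      callback(new Error('odd number: ' + n), undefined)
    } else {
      // Success: the error slot is null and the result comes second.
      callback(null, n / 2)
    }
  }, 0)
}

// One callback handles both outcomes by checking the error argument first.
const report = (err, result) => {
  if (err) {
    console.error('failed:', err.message)
  } else {
    console.log('result:', result)
  }
}

halve(10, report) // eventually prints "result: 5"
halve(7, report) // eventually prints "failed: odd number: 7"
```

Because both outcomes arrive through the same function, a caller only has to remember one rule: look at the first argument before using the second.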

Nothing that follows will make sense if we don't understand the order in which Node executes the statements in this program:

  1. Execute the first line to load the fs library.

  2. Define a function of two parameters and assign it to listContents. (Remember, a function is just another kind of data.)

  3. Get the name of the directory from the command-line arguments.

  4. Call fs.readdir to start a filesystem operation, telling it what directory we want to read and what function to call when data is available.

  5. Print a message to show we're at the end of the file.

  6. Wait until the filesystem operation finishes (this step is invisible).

  7. Run the callback function, which prints the directory listing.

Callback execution order
When JavaScript runs callback functions.
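The steps above can be checked with a small experiment. This sketch assumes that setTimeout with a delay of zero is an acceptable stand-in for fs.readdir: in both cases the callback cannot run until the main program has finished.

```javascript
// Record the order in which statements actually run.
const order = []

order.push('before scheduling')

// Ask for the callback to run "as soon as possible".
setTimeout(() => {
  order.push('inside callback')
  console.log(order.join(' -> '))
  // prints: before scheduling -> after scheduling -> inside callback
}, 0)

// This line runs before the callback, even though the delay is zero.
order.push('after scheduling')
```

Even a delay of zero does not let the callback jump ahead: Node finishes the current run of the program before it looks at any pending callbacks.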

What are anonymous functions?

Most JavaScript programmers wouldn't define the function listContents and then pass it as a callback. Instead, since the callback is only used in one place, it is more idiomatic to define it where it is needed as an anonymous function. This makes it easier to see what's going to happen when the operation completes, though it means the order of execution is quite different from the order of reading. Using an anonymous function gives us the final version of our program:

import fs from 'fs'

const srcDir = process.argv[2]
fs.readdir(srcDir, (err, files) => {
  if (err) {
    console.error(err)
  } else {
    for (const name of files) {
      console.log(name)
    }
  }
})

Anonymous functions as callbacks
How and when JavaScript creates and runs anonymous callback functions.

Functions are data

As we noted above, a function is just another kind of data. Instead of being made up of numbers, characters, or pixels, it is made up of instructions, but these are stored in memory like anything else. Defining a function on the fly is no different from defining an array in-place using [1, 3, 5], and passing a function as an argument to another function is no different from passing an array. We are going to rely on this insight over and over again in the coming lessons.
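As a small illustration of this idea, the sketch below stores functions in an array and passes that array to another function. All of the names (double, square, applyAll) are invented for this example.

```javascript
// Two functions defined and assigned to names, like any other values.
const double = (x) => 2 * x
const square = (x) => x * x

// Functions can be stored in an array just as numbers can.
const operations = [double, square]

// Hypothetical helper: call each function in turn and collect the results.
const applyAll = (funcs, value) => {
  const results = []
  for (const f of funcs) {
    results.push(f(value))
  }
  return results
}

console.log(applyAll(operations, 3)) // [ 6, 9 ]

// A function can also be created in place, exactly like an array literal.
console.log(applyAll([(x) => x + 1], 3)) // [ 4 ]
```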

How can we select a set of files?

Suppose we want to copy some files instead of listing a directory's contents. Depending on the situation we might want to copy only those files given on the command line or all files except some explicitly excluded. What we don't want to have to do is list the files one by one; instead, we want to be able to write patterns like *.js.

To find files that match patterns like that, we can use the glob module. (To glob (short for "global") is an old Unix term for matching a set of files by name.) The glob module provides a function that takes a pattern and a callback and does something with every filename that matched the pattern:

import glob from 'glob'

glob('**/*.*', (err, files) => {
  if (err) {
    console.log(err)
  } else {
    for (const filename of files) {
      console.log(filename)
    }
  }
})
copy-file-filtered.js
copy-file-unfiltered.js
copy-file-unfiltered.out
copy-file-unfiltered.sh
copy-file-unfiltered.txt
figures/anonymous-functions.pdf
figures/anonymous-functions.svg
figures/array-filter.pdf
figures/array-filter.svg
figures/callbacks.pdf
...
x-string-interpolation/problem.md
x-string-interpolation/solution.md
x-trace-anonymous/problem.md
x-trace-anonymous/solution.md
x-trace-anonymous/trace.js
x-trace-callback/problem.md
x-trace-callback/solution.md
x-trace-callback/trace.js
x-where-is-node/problem.md
x-where-is-node/solution.md

The leading ** means "recurse into subdirectories", while *.* means "any characters followed by '.' followed by any characters". Names that don't match *.* won't be included, and by default, neither are names that start with a '.' character. This is another old Unix convention: files and directories whose names have a leading '.' usually contain configuration information for various programs, so most commands will leave them alone unless told to do otherwise.

Matching filenames with `glob`
Using glob patterns to match filenames.

This program works, but we probably don't want to copy Emacs backup files whose names end with ~. We can get rid of them by filtering the list that glob returns:

import glob from 'glob'

glob('**/*.*', (err, files) => {
  if (err) {
    console.log(err)
  } else {
    files = files.filter((f) => { return !f.endsWith('~') })
    for (const filename of files) {
      console.log(filename)
    }
  }
})
copy-file-filtered.js
copy-file-unfiltered.js
copy-file-unfiltered.out
copy-file-unfiltered.sh
copy-file-unfiltered.txt
figures/anonymous-functions.pdf
figures/anonymous-functions.svg
figures/array-filter.pdf
figures/array-filter.svg
figures/callbacks.pdf
...
x-string-interpolation/problem.md
x-string-interpolation/solution.md
x-trace-anonymous/problem.md
x-trace-anonymous/solution.md
x-trace-anonymous/trace.js
x-trace-callback/problem.md
x-trace-callback/solution.md
x-trace-callback/trace.js
x-where-is-node/problem.md
x-where-is-node/solution.md

Array.filter creates a new array containing all the items of the original array that pass a test. The test is specified as a callback function that Array.filter calls once for each item. This function must return a Boolean that tells Array.filter whether to keep the item in the new array or not. Array.filter does not modify the original array, so we can filter our original list of filenames several times if we want to.

Using `Array.filter`
Selecting array elements using Array.filter.
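The sketch below applies two different tests to the same made-up list of filenames, which shows that filtering leaves the original array untouched.

```javascript
// A made-up list of filenames, including two Emacs-style backup files.
const names = ['notes.txt', 'notes.txt~', 'draft.md', 'draft.md~', 'Makefile']

// Keep everything that is not a backup file.
const noBackups = names.filter((f) => { return !f.endsWith('~') })

// Filter the same original list again with a different test.
const onlyText = names.filter((f) => { return f.endsWith('.txt') })

console.log(noBackups) // [ 'notes.txt', 'draft.md', 'Makefile' ]
console.log(onlyText) // [ 'notes.txt' ]

// The original array still has all five entries.
console.log(names.length) // 5
```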

We can make our globbing program more idiomatic by removing the parentheses around the single parameter and writing just the expression we want the function to return:

import glob from 'glob'

glob('**/*.*', (err, files) => {
  if (err) {
    console.log(err)
  } else {
    files = files.filter(f => !f.endsWith('~'))
    for (const filename of files) {
      console.log(filename)
    }
  }
})

However, it turns out that glob will filter for us. According to its documentation, the function takes an options object full of key-value settings that control its behavior. This is another common pattern in Node libraries: rather than accepting a large number of rarely-used parameters, a function can take a single object full of settings.

If we use this, our program becomes:

import glob from 'glob'

glob('**/*.*', { ignore: '*~' }, (err, files) => {
  if (err) {
    console.log(err)
  } else {
    for (const filename of files) {
      console.log(filename)
    }
  }
})

Notice that we don't quote the key in the options object. The keys in objects are almost always strings, and if a string is simple enough that it won't confuse the parser, we don't need to put quotes around it. Here, "simple enough" means "looks like it could be a variable name", or equivalently "contains only letters, digits, and the underscore".
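To see the options-object pattern from the other side, here is a minimal sketch of a function that accepts one. The function selectNames and its settings ignoreSuffix and limit are hypothetical and are not part of glob or any other library.

```javascript
// Hypothetical function in the options-object style: every setting is
// optional and falls back to a default when it is missing.
const selectNames = (names, options = {}) => {
  const suffix = options.ignoreSuffix // undefined means "ignore nothing"
  const limit = options.limit === undefined ? names.length : options.limit
  const kept = names.filter((n) => suffix === undefined || !n.endsWith(suffix))
  return kept.slice(0, limit)
}

const names = ['a.js', 'a.js~', 'b.js', 'c.js']

// No options at all: nothing is ignored and nothing is cut off.
console.log(selectNames(names)) // [ 'a.js', 'a.js~', 'b.js', 'c.js' ]

// Unquoted key, as in the glob example.
console.log(selectNames(names, { ignoreSuffix: '~' })) // [ 'a.js', 'b.js', 'c.js' ]

// Quoting the keys is allowed and means exactly the same thing.
console.log(selectNames(names, { 'ignoreSuffix': '~', 'limit': 2 })) // [ 'a.js', 'b.js' ]
```

Callers only mention the settings they care about, and new settings can be added later without changing the order of existing parameters.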

No one knows everything

We combined glob and Array.filter in our functions for more than a year before someone pointed out the ignore option for glob. This shows:

  1. Life is short, so most of us find a way to solve the problem in front of us and re-use it rather than looking for something better.

  2. Code reviews aren't just about finding bugs: they are also the most effective way to transfer knowledge between programmers. Even if someone is much more experienced than you, there's a good chance you might have stumbled over a better way to do something than the one they're using (see point #1 above).

To finish off our globbing program, let's specify a source directory on the command line and include that in the pattern:

import glob from 'glob'

const srcDir = process.argv[2]

glob(`${srcDir}/**/*.*`, { ignore: '*~' }, (err, files) => {
  if (err) {
    console.log(err)
  } else {
    for (const filename of files) {
      console.log(filename)
    }
  }
})

This program uses string interpolation to insert the value of srcDir into a string. The template string is written in back quotes, and JavaScript converts every expression written as ${expression} to text. We could create the pattern by concatenating strings using srcDir + '/**/*.*', but most programmers find interpolation easier to read.
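A short sketch makes the equivalence concrete; the directory name lessons is made up for the example.

```javascript
const srcDir = 'lessons' // made-up directory name

// The same pattern built two ways.
const interpolated = `${srcDir}/**/*.*`
const concatenated = srcDir + '/**/*.*'

console.log(interpolated) // lessons/**/*.*
console.log(interpolated === concatenated) // true

// Any expression can go inside ${...}, not just a variable name.
console.log(`2 + 3 = ${2 + 3}`) // 2 + 3 = 5
```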

How can we copy a set of files?

If we want to copy a set of files instead of just listing them, we need a way to construct the paths of the files we are going to create. If our program takes a second argument that specifies the desired output directory, we can construct the full output path by replacing the name of the source directory with that path:

import glob from 'glob'

const [srcDir, dstDir] = process.argv.slice(2)

glob(`${srcDir}/**/*.*`, { ignore: '*~' }, (err, files) => {
  if (err) {
    console.log(err)
  } else {
    for (const srcName of files) {
      const dstName = srcName.replace(srcDir, dstDir)
      console.log(srcName, dstName)
    }
  }
})

This program uses destructuring assignment to create two variables at once by unpacking the elements of an array. It only works if the array contains enough elements, i.e., if both a source and destination are given on the command line; we'll add a check for that in the exercises.

Matching values with destructuring assignment
Assigning many values at once by destructuring.
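The sketch below traces the two steps on made-up data: unpacking the arguments and rewriting one source path. The array argv is a hand-written stand-in for process.argv, and the file name is invented.

```javascript
// A hand-written stand-in for process.argv; the values are made up.
const argv = ['node', 'copy-file.js', 'src', '/tmp/out']

// slice(2) drops the first two entries; destructuring names the rest.
const [srcDir, dstDir] = argv.slice(2)
console.log(srcDir) // src
console.log(dstDir) // /tmp/out

// Given a string to search for, replace swaps only the first occurrence,
// which here is the leading directory name.
const srcName = 'src/figures/callbacks.svg'
const dstName = srcName.replace(srcDir, dstDir)
console.log(dstName) // /tmp/out/figures/callbacks.svg
```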

A more serious problem is that this program only works if the destination directory already exists: fs and equivalent libraries in other languages usually won't create directories for us automatically. The need to do this comes up so often that there is a function called ensureDir to do it:

import glob from 'glob'
import fs from 'fs-extra'
import path from 'path'

const [srcRoot, dstRoot] = process.argv.slice(2)

glob(`${srcRoot}/**/*.*`, { ignore: '*~' }, (err, files) => {
  if (err) {
    console.log(err)
  } else {
    for (const srcName of files) {
      const dstName = srcName.replace(srcRoot, dstRoot)
      const dstDir = path.dirname(dstName)
      fs.ensureDir(dstDir, (err) => {
        if (err) {
          console.error(err)
        }
      })
    }
  }
})

Notice that we import from fs-extra instead of fs; the fs-extra module provides some useful utilities on top of fs. We also use path to manipulate pathnames rather than concatenating or interpolating strings because there are a lot of tricky edge cases in pathnames that the authors of that module have figured out for us.
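A few calls show what the path module does for us. This sketch uses path.posix so that the results are the same on every operating system; the file name is made up.

```javascript
import path from 'path'

// path.posix forces Unix-style rules so the results are the same on
// every operating system; plain `path` follows the host's conventions.
const p = path.posix

const dstName = '/tmp/out/figures/callbacks.svg' // made-up example path

console.log(p.dirname(dstName)) // /tmp/out/figures
console.log(p.basename(dstName)) // callbacks.svg
console.log(p.extname(dstName)) // .svg

// join inserts separators and tidies up duplicates for us.
console.log(p.join('/tmp/out/', 'figures', 'callbacks.svg')) // /tmp/out/figures/callbacks.svg

// It also resolves '..' segments, one of the edge cases mentioned above.
console.log(p.join('/tmp/out', '../elsewhere')) // /tmp/elsewhere
```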

Using distinct names

We are now calling our command-line arguments srcRoot and dstRoot rather than srcDir and dstDir. As we were writing this example we used dstDir as both the name of the top-level destination directory (from the command line) and the name of the particular output directory to create. JavaScript didn't complain because every function creates a new scope for variable definitions, and it's perfectly legal to give a variable inside a function the same name as something outside it. However, "legal" isn't the same thing as "comprehensible"; giving the variables different names makes the program easier for humans to read.
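The sketch below reproduces the situation in miniature. The function show is invented for the example; the two variables called dstDir are deliberately given the same name.

```javascript
// Outer variable: the top-level destination directory.
const dstDir = '/tmp/out'

const show = () => {
  // Inner variable with the same name. This is legal: it hides the outer
  // one inside this function without changing it.
  const dstDir = '/tmp/out/figures'
  return dstDir
}

console.log(show()) // /tmp/out/figures
console.log(dstDir) // /tmp/out
```

Nothing goes wrong when this runs, but a reader has to work out which dstDir each line means, which is the cost that distinct names avoid.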

Our file copying program currently creates empty destination directories but doesn't actually copy any files. Let's use fs.copy to do that:

import glob from 'glob'
import fs from 'fs-extra'
import path from 'path'

const [srcRoot, dstRoot] = process.argv.slice(2)

glob(`${srcRoot}/**/*.*`, { ignore: '*~' }, (err, files) => {
  if (err) {
    console.log(err)
  } else {
    for (const srcName of files) {
      const dstName = srcName.replace(srcRoot, dstRoot)
      const dstDir = path.dirname(dstName)
      fs.ensureDir(dstDir, (err) => {
        if (err) {
          console.error(err)
        } else {
          fs.copy(srcName, dstName, (err) => {
            if (err) {
              console.error(err)
            }
          })
        }
      })
    }
  }
})

The program now has three levels of callback:

  1. When glob has data, do things and then call ensureDir.

  2. When ensureDir completes, copy a file.

  3. When copy finishes, check the error status.

Three levels of callback
Three levels of callback in the running example.
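One consequence of this nesting is easy to miss: the loop starts every ensureDir operation before any of them finishes. The sketch below imitates that structure without touching the filesystem. The function later is a hypothetical stand-in for fs.ensureDir and fs.copy, built on setTimeout.

```javascript
// Record events in the order they happen.
const events = []

// Hypothetical stand-in for an asynchronous filesystem call: it notes that
// the operation started, then reports success a moment later.
const later = (label, callback) => {
  events.push('start ' + label)
  setTimeout(() => {
    events.push('finish ' + label)
    callback(null)
  }, 0)
}

// Same shape as the copying program: a loop that nests one call in another.
for (const name of ['a', 'b']) {
  later('ensureDir ' + name, (err) => {
    if (err) {
      console.error(err)
    } else {
      later('copy ' + name, (err) => {
        if (err) {
          console.error(err)
        }
      })
    }
  })
}

// Give every callback time to run, then show what happened.
setTimeout(() => {
  for (const e of events) {
    console.log(e)
  }
}, 50)
```

The first two lines printed are "start ensureDir a" and "start ensureDir b": the loop runs to completion before any callback does, so operations on different files overlap rather than running one file at a time.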

Our program looks like it should work, but if we try to copy everything in the directory containing these lessons we get an error message:

rm -rf /tmp/out
mkdir /tmp/out
node copy-file-unfiltered.js ../node_modules /tmp/out 2>&1 | head -n 6
[Error: ENOENT: no such file or directory, chmod \
'/tmp/out/@nodelib/fs.stat/package.json'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'chmod',
  path: '/tmp/out/@nodelib/fs.stat/package.json'
}

The problem is that node_modules/@nodelib/fs.stat and node_modules/@nodelib/fs.walk match our globbing expression but are directories rather than files. To prevent our program from trying to use fs.copy on directories, we must use fs.stat to get the properties of the thing whose name glob has given us and then check if it's a file. The name "stat" is short for "status", and since the status of something in the filesystem can be very complex, fs.stat returns an object with methods that can answer common questions.
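Before adding fs.stat to the copying program, it may help to try it on its own. This sketch uses Node's built-in fs rather than fs-extra; the helper name describe and the second path are made up, and that path is assumed not to exist.

```javascript
import fs from 'fs'

// Hypothetical helper: report what kind of thing a path refers to.
const describe = (name) => {
  fs.stat(name, (err, stats) => {
    if (err) {
      // err.code is a short string identifying the failure.
      console.error(`${name}: ${err.code}`)
    } else if (stats.isFile()) {
      console.log(`${name}: file`)
    } else if (stats.isDirectory()) {
      console.log(`${name}: directory`)
    } else {
      console.log(`${name}: something else`)
    }
  })
}

describe('.') // the current working directory, so "directory"
describe('no-such-entry-for-this-sketch') // assumed missing, so an error code
```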

Here's the final version of our file copying program:

import glob from 'glob'
import fs from 'fs-extra'
import path from 'path'

const [srcRoot, dstRoot] = process.argv.slice(2)

glob(`${srcRoot}/**/*.*`, { ignore: '*~' }, (err, files) => {
  if (err) {
    console.log(err)
  } else {
    for (const srcName of files) {
      fs.stat(srcName, (err, stats) => {
        if (err) {
          console.error(err)
        } else if (stats.isFile()) {
          const dstName = srcName.replace(srcRoot, dstRoot)
          const dstDir = path.dirname(dstName)
          fs.ensureDir(dstDir, (err) => {
            if (err) {
              console.error(err)
            } else {
              fs.copy(srcName, dstName, (err) => {
                if (err) {
                  console.error(err)
                }
              })
            }
          })
        }
      })
    }
  }
})

It works, but four levels of asynchronous callbacks are hard for humans to understand. A later chapter will introduce a pair of tools that make code like this easier to read.

Exercises

Where is Node?

Write a program called wherenode.js that prints the full path to the version of Node it is run with.

Tracing callbacks

In what order does the program below print messages?

const red = () => {
  console.log('RED')
}

const green = (func) => {
  console.log('GREEN')
  func()
}

const blue = (left, right) => {
  console.log('BLUE')
  left(right)
}

blue(green, red)

Tracing anonymous callbacks

In what order does the program below print messages?

const blue = (left, right) => {
  console.log('BLUE')
  left(right)
}

blue(
  (callback) => {
    console.log('GREEN')
    callback()
  },
  () => console.log('RED')
)

Checking arguments

Modify the file copying program to check that it has been given the right number of command-line arguments and to print a sensible error message (including a usage statement) if it hasn't.

Significant entries

count-lines-histogram.js displays many zeroes and gives no visual sense of how large entries are. Modify it so that:

  1. When it is run with the --nonzero flag only non-zero values are shown.

  2. When it is run with the --graphical flag the numeric values are replaced with rows of asterisks.

  3. If both flags are given the program prints an error message instead of running.

Glob patterns

What filenames does each of the following glob patterns match?

Filtering arrays

Fill in the blank in the code below so that it runs correctly. Note: you can compare strings in JavaScript using <, >=, and other operators, so that (for example) person.personal > 'P' is true if someone's personal name starts with a letter that comes after 'P' in the alphabet.

const people = [
  { personal: 'Jean', family: 'Jennings' },
  { personal: 'Marlyn', family: 'Wescoff' },
  { personal: 'Ruth', family: 'Lichterman' },
  { personal: 'Betty', family: 'Snyder' },
  { personal: 'Frances', family: 'Bilas' },
  { personal: 'Kay', family: 'McNulty' }
]

const result = people.filter(____ => ____)

console.log(result)
[
  { personal: 'Jean', family: 'Jennings' },
  { personal: 'Ruth', family: 'Lichterman' },
  { personal: 'Frances', family: 'Bilas' }
]

String interpolation

Fill in the code below so that it prints the message shown.

const people = [
  { personal: 'Christine', family: 'Darden' },
  { personal: 'Mary', family: 'Jackson' },
  { personal: 'Katherine', family: 'Johnson' },
  { personal: 'Dorothy', family: 'Vaughan' }
]

for (const person of people) {
  console.log(`$____, $____`)
}
Darden, Christine
Jackson, Mary
Johnson, Katherine
Vaughan, Dorothy

Destructuring assignment

What is assigned to each named variable in each statement below?

  1. const first = [10, 20, 30]
  2. const [first, second] = [10, 20, 30]
  3. const [first, second, third] = [10, 20, 30]
  4. const [first, second, third, fourth] = [10, 20, 30]
  5. const {left, right} = {left: 10, right: 30}
  6. const {left, middle, right} = {left: 10, middle: 20, right: 30}

Counting lines

Write a program called lc that counts and reports the number of lines in one or more files and the total number of lines, so that lc a.txt b.txt displays something like:

a.txt 475
b.txt 31
total 506

Renaming files

Write a program called rename that takes three or more command-line arguments:

  1. A filename extension to match.
  2. An extension to replace it with.
  3. The names of one or more existing files.

When it runs, rename renames any files with the first extension to create files with the second extension, but will not overwrite an existing file. For example, suppose a directory contains a.txt, b.txt, and b.bck. The command:

rename .txt .bck a.txt b.txt

will rename a.txt to a.bck, but will not rename b.txt because b.bck already exists.