File Backup

Archiving files with directory structure

Terms defined: Application Programming Interface, Coordinated Universal Time, JavaScript Object Notation, SHA-1 hash, Time of check/time of use, collision, comma-separated values, cryptographic hash function, handler, hash code, hash function, mock object, pipe, race condition, stream, streaming API, timestamp, version control system

Now that we can test software we have something worth saving. A version control system like Git keeps track of changes to files so that we can recover old versions if we want to. Its heart is a way to archive files that:

  1. records which versions of which files existed at the same time (so that we can go back to a consistent previous state), and

  2. stores any particular version of a file only once, so that we don't waste disk space.

In this chapter we will build a tool for doing both tasks. It won't do everything Git does: in particular, it won't let us create and merge branches. If you would like to know how that works, please see Mary Rose Cook's excellent Gitlet project.

How can we uniquely identify files?

To avoid storing redundant copies of files, we need a way to tell when two files contain the same data. We can't rely on names because files can be renamed or moved over time; we could compare the files byte by byte, but a quicker way is to use a hash function that turns arbitrary data into a fixed-length string of bits ().

Hash functions
How hash functions speed up lookup.

A hash function always produces the same hash code for a given input. A cryptographic hash function has two extra properties:

  1. The output depends on the entire input: changing even a single byte results in a different hash code.

  2. The outputs look like random numbers: they are unpredictable and evenly distributed (i.e., the odds of getting any specific hash code are the same)

It's easy to write a bad hash function, but very hard to write one that qualifies as cryptographic. We will therefore use a library to calculate 160-bit SHA-1 hashes for our files. These are not random enough to keep data secret from a patient, well-funded attacker, but that's not what we're using them for: we just want hashes that are random to make collision extremely unlikely.

The Birthday Problem

The odds that two people share a birthday are 1/365 (ignoring February 29). The odds that they don't are therefore 364/365. When we add a third person, the odds that they don't share a birthday with either of the preceding two people are 363/365, so the overall odds that nobody shares a birthday are (365/365)×(364/365)×(363/365). If we keep calculating, there's a 50% chance of two people sharing a birthday in a group of just 23 people, and a 99.9% chance with 70 people.

We can use the same math to calculate how many files we need to hash before there's a 50% chance of a collision. Instead of 365 we use \(2^{160}\) (the number of values that are 160 bits long), and after checking Wikipedia and doing a few calculations with Wolfram Alpha, we calculate that we would need to have approximately \(10^{24}\) files in order to have a 50% chance of a collision. We're willing to take that risk…

Node's crypto module provides tools to create a SHA-1 hash. To use them, we create an object that keeps track of the current state of the hashing calculations, tell it how we want to encode (or represent) the hash value, and then feed it some bytes. When we are done, we call its .end method and then use its .read method to get the final result:

import crypto from 'crypto'

// create a SHA1 hasher
const hash = crypto.createHash('sha1')

// encode as hex (rather than binary)
hash.setEncoding('hex')

// send it some text
const text = process.argv[2]
hash.write(text)

// signal end of text
hash.end()

// display the result
const sha1sum = hash.read()
console.log(`SHA1 of "${text}" is ${sha1sum}`)

node hash-text.js something

SHA1 of "something" is 1af17e73721dbe0c40011b82ed4bb1a7dbe3ce29

Hashing a file instead of a fixed string is straightforward: we just read the file's contents and pass those characters to the hashing object:

import fs from 'fs'
import crypto from 'crypto'

const filename = process.argv[2]
const data = fs.readFileSync(filename, 'utf-8')

const hash = crypto.createHash('sha1').setEncoding('hex')
hash.write(data)
hash.end()
const sha1sum = hash.read()

console.log(`SHA1 of "${filename}" is ${sha1sum}`)

node hash-file.js hash-file.js

SHA1 of "hash-file.js" is c54c8ee3e576770d29ae2d0d73568e5a5c49eac0

However, it is more efficient to process the file as a stream:

import fs from 'fs'
import crypto from 'crypto'

const filename = process.argv[2]
const hash = crypto.createHash('sha1').setEncoding('hex')
fs.createReadStream(filename).pipe(hash)
hash.on('finish', () => {
  const final = hash.read()
  console.log('final', final)
})
console.log('program ends')

node hash-stream.js hash-stream.js

program ends
final dc9e6c231e243860dace2dbf52845b121062b60e

This kind of interface is called a streaming API because it is designed to process a stream of data one chunk at a time rather than requiring all of the data to be in memory at once. Many applications use streams so that programs don't have to read entire (possibly large) files into memory.

To start, this program asks the fs library to create a reading stream for a file and to pipe the data from that stream to the hashing object (). It then tells the hashing object what to do when there is no more data by providing a handler for the "finish" event. This is called asynchronously: as the output shows, the main program ends before the task handling the end of data is scheduled and run. Most programs also provide a handler for "data" events to do something with each block of data as it comes in; the hash object in our program does that for us.

Streaming file operations
Processing files as streams of chunks.

How can we back up files?

Many files only change occasionally after they're created, or not at all. It would be wasteful for a version control system to make copies each time the user wanted to save a snapshot of a project, so instead our tool will copy each unique file to something like abcd1234.bck, where abcd1234 is a hash of the file's contents. It will then store a data structure that records the filenames and hash keys for each snapshot. The hash keys tell it which unique files are part of the snapshot, while the filenames tell us what each file's contents were called when the snapshot was made (since files can be moved or renamed). To restore a particular snapshot, all we have to do is copy the saved .bck files back to where they were ().

Backup file storage
Organization of backup file storage.

We can build the tools we need to do this uses promises (). The main function creates a promise that uses the asynchronous version of glob to find files and then:

  1. checks that entries in the list are actually files;

  2. reads each file into memory; and

  3. calculates hashes for those files.


import fs from 'fs-extra-promise'
import glob from 'glob-promise'
import crypto from 'crypto'

const hashExisting = (rootDir) => {
  const pattern = `${rootDir}/**/*`
  return new Promise((resolve, reject) => {
    glob(pattern, {})
      .then(matches => Promise.all(
        matches.map(path => statPath(path))))
      .then(pairs => pairs.filter(
        ([path, stat]) => stat.isFile()))
      .then(pairs => Promise.all(
        pairs.map(([path, stat]) => readPath(path))))
      .then(pairs => Promise.all(
        pairs.map(([path, content]) => hashPath(path, content))))
      .then(pairs => resolve(pairs))
      .catch(err => reject(err))
  })
}

This function uses Promise.all to wait for the operations on all of the files in the list to complete before going on to the next step. A different design would combine stat, read, and hash into a single step so that each file would be handled independently and use one Promise.all at the end to bring them all together.

The first two helper functions that hashExisting relies on wrap asynchronous operation in promises:


const statPath = (path) => {
  return new Promise((resolve, reject) => {
    fs.statAsync(path)
      .then(stat => resolve([path, stat]))
      .catch(err => reject(err))
  })
}

const readPath = (path) => {
  return new Promise((resolve, reject) => {
    fs.readFileAsync(path, 'utf-8')
      .then(content => resolve([path, content]))
      .catch(err => reject(err))
  })
}

The final helper function calculates the hash synchronously, but we can use Promise.all to wait on those operations finishing anyway:


const hashPath = (path, content) => {
  const hasher = crypto.createHash('sha1').setEncoding('hex')
  hasher.write(content)
  hasher.end()
  return [path, hasher.read()]
}

Let's try running it:

import hashExisting from './hash-existing-promise.js'

const root = process.argv[2]
hashExisting(root).then(pairs => pairs.forEach(
  ([path, hash]) => console.log(path, hash)
))

node run-hash-existing-promise.js . | fgrep -v test/ | fgrep -v '~'

./backup.js 11422489e11be3d8ff76278503457665f6152ebe
./check-existing-files.js 66b933cf9e792e9a9204171d04e0f8b530ec3f4f
./figures/hash-function.pdf 0eb82de379a95ee2be3f00b38c0102e2f2f8170e
./figures/hash-function.svg 563996575d581f2a08e3e954d7faba4d189d0773
./figures/mock-fs.pdf 0b3bba44e69122ee53bcc9d777c186c84b7c2ff2
...
./x-from-to.md f0f63b3576042dfc0050029ddfcccc3c42fe275d
./x-io-streams.md 1fb4d8b7785c5e7b2f1e29588e2ba28d101ced1a
./x-json-manifests.md 223e0e4167acc6d4d81b76ba1287b90234c95e22
./x-mock-hashes.md 580edfc0cb8eaca4f3700307002ae10ee97af8d2
./x-pre-commit.md b7d945af4554fc0f64b708fe735417bee8b33eef

The code we have written is clearer than it would be with callbacks (try rewriting it if you don't believe this) but the layer of promises around everything still obscures its meaning. The same operations are easier to read when written using async and await:


const statPath = async (path) => {
  const stat = await fs.statAsync(path)
  return [path, stat]
}

const readPath = async (path) => {
  const content = await fs.readFileAsync(path, 'utf-8')
  return [path, content]
}

const hashPath = (path, content) => {
  const hasher = crypto.createHash('sha1').setEncoding('hex')
  hasher.write(content)
  hasher.end()
  return [path, hasher.read()]
}

const hashExisting = async (rootDir) => {
  const pattern = `${rootDir}/**/*`
  const options = {}
  const matches = await glob(pattern, options)
  const stats = await Promise.all(matches.map(path => statPath(path)))
  const files = stats.filter(([path, stat]) => stat.isFile())
  const contents = await Promise.all(
    files.map(([path, stat]) => readPath(path)))
  const hashes = contents.map(
    ([path, content]) => hashPath(path, content))
  return hashes
}

This version creates and resolves exactly the same promises as the previous one, but those promises are created for us automatically by Node. To check that it works, let's run it for the same input files:

import hashExisting from './hash-existing-async.js'

const root = process.argv[2]
hashExisting(root).then(
  pairs => pairs.forEach(([path, hash]) => console.log(path, hash)))

node run-hash-existing-async.js . | fgrep -v test/ | fgrep -v '~'

./backup.js 11422489e11be3d8ff76278503457665f6152ebe
./check-existing-files.js 66b933cf9e792e9a9204171d04e0f8b530ec3f4f
./figures/hash-function.pdf 0eb82de379a95ee2be3f00b38c0102e2f2f8170e
./figures/hash-function.svg 563996575d581f2a08e3e954d7faba4d189d0773
./figures/mock-fs.pdf 0b3bba44e69122ee53bcc9d777c186c84b7c2ff2
...
./x-from-to.md f0f63b3576042dfc0050029ddfcccc3c42fe275d
./x-io-streams.md 1fb4d8b7785c5e7b2f1e29588e2ba28d101ced1a
./x-json-manifests.md 223e0e4167acc6d4d81b76ba1287b90234c95e22
./x-mock-hashes.md 580edfc0cb8eaca4f3700307002ae10ee97af8d2
./x-pre-commit.md b7d945af4554fc0f64b708fe735417bee8b33eef

How can we track which files have already been backed up?

The second part of our backup tool keeps track of which files have and haven't been backed up already. It stores backups in a directory that contains backup files like abcd1234.bck and files describing the contents of particular snapshots. The latter are named ssssssssss.csv, where ssssssssss is the UTC timestamp of the backup's creation and the .csv extension indicates that the file is formatted as comma-separated values. (We could store these files as JSON, but CSV is easier for people to read.)

Time of check/time of use

Our naming convention for index files will fail if we try to create more than one backup per second. This might seem very unlikely, but many faults and security holes are the result of programmers assuming things weren't going to happen.

We could try to avoid this problem by using a two-part naming scheme ssssssss-a.csv, ssssssss-b.csv, and so on, but this leads to a race condition called time of check/time of use. If two users run the backup tool at the same time, they will both see that there isn't a file (yet) with the current timestamp, so they will both try to create the first one.

import glob from 'glob-promise'
import path from 'path'

const findNew = async (rootDir, pathHashPairs) => {
  const hashToPath = pathHashPairs.reduce((obj, [path, hash]) => {
    obj[hash] = path
    return obj
  }, {})

  const pattern = `${rootDir}/*.bck`
  const options = {}
  const existingFiles = await glob(pattern, options)

  existingFiles.forEach(filename => {
    const stripped = path.basename(filename).replace(/\.bck$/, '')
    delete hashToPath[stripped]
  })

  return hashToPath
}

export default findNew

To test our program, let's manually create testing directories with manufactured (shortened) hashes:

tree --charset unicode test

test
|-- bck-0-csv-0
|-- bck-1-csv-1
|   |-- 0001.csv
|   `-- abcd1234.bck
|-- bck-4-csv-2
|   |-- 0001.csv
|   |-- 3028.csv
|   |-- 3456cdef.bck
|   |-- abcd1234.bck
|   `-- bcde2345.bck
|-- test-backup.js
|-- test-find-mock.js
`-- test-find.js

3 directories, 10 files

We use Mocha to manage our tests. Every test is an async function; Mocha automatically waits for them all to complete before reporting results. To run them, we add the line:

"test": "mocha */test/test-*.js"

in the scripts section of our project's package.json file so that when we run npm run test, Mocha looks for files in test sub-directories of the directories holding our lessons.

Here are our first few tests:

import assert from 'assert'

import findNew from '../check-existing-files.js'

describe('pre-existing hashes and actual filesystem', () => {
  it('finds no pre-existing files when none given or exist', async () => {
    const expected = {}
    const actual = await findNew('file-backup/test/bck-0-csv-0', [])
    assert.deepStrictEqual(expected, actual,
      'Expected no files')
  })

  it('finds some files when one is given and none exist', async () => {
    const check = [['somefile.txt', '9876fedc']]
    const expected = { '9876fedc': 'somefile.txt' }
    const actual = await findNew('file-backup/test/bck-0-csv-0', check)
    assert.deepStrictEqual(expected, actual,
      'Expected one file')
  })

  it('finds nothing needs backup when there is a match', async () => {
    const check = [['alpha.js', 'abcd1234']]
    const expected = {}
    const actual = await findNew('file-backup/test/bck-1-csv-1', check)
    assert.deepStrictEqual(expected, actual,
      'Expected no files')
  })

  it('finds something needs backup when there is a mismatch', async () => {
    const check = [['alpha.js', 'a1b2c3d4']]
    const expected = { a1b2c3d4: 'alpha.js' }
    const actual = await findNew('file-backup/test/bck-1-csv-1', check)
    assert.deepStrictEqual(expected, actual,
      'Expected one file')
  })

  it('finds mixed matches', async () => {
    const check = [
      ['matches.js', '3456cdef'],
      ['matches.txt', 'abcd1234'],
      ['mismatch.txt', '12345678']
    ]
    const expected = { 12345678: 'mismatch.txt' }
    const actual = await findNew('file-backup/test/bck-4-csv-2', check)
    assert.deepStrictEqual(expected, actual,
      'Expected one file')
  })
})

and here is Mocha's report:


> stjs@1.0.0 test
> mocha */test/test-*.js "-g" "pre-existing hashes"

sh: mocha: command not found

How can we test code that modifies files?

The final thing our tool needs to do is copy the files that need copying and create a new index file. The code itself will be relatively simple, but testing will be complicated by the fact that our tests will need to create directories and files before they run and then delete them afterward (so that they don't contaminate subsequent tests).

A better approach is to use a mock object instead of the real filesystem. A mock object has the same interface as the function, object, class, or library that it replaces, but is designed to be used solely for testing. Node's mock-fs library provides the same functions as the fs library, but stores everything in memory (). This prevents our tests from accidentally disturbing the filesystem, and also makes tests much faster (since in-memory operations are thousands of times faster than operations that touch the disk).

Mock filesystem
Using a mock filesystem to simplify testing.

We can create a mock filesystem by giving the library a JSON description of the files and what they should contain:

import assert from 'assert'
import mock from 'mock-fs'

import findNew from '../check-existing-files.js'

describe('checks for pre-existing hashes using mock filesystem', () => {
  beforeEach(() => {
    mock({
      'bck-0-csv-0': {},
      'bck-1-csv-1': {
        '0001.csv': 'alpha.js,abcd1234',
        'abcd1234.bck': 'alpha.js content'
      },
      'bck-4-csv-2': {
        '0001.csv': ['alpha.js,abcd1234',
          'beta.txt,bcde2345'].join('\n'),
        '3024.csv': ['alpha.js,abcd1234',
          'gamma.png,3456cdef',
          'subdir/renamed.txt,bcde2345'].join('\n'),
        '3456cdef.bck': 'gamma.png content',
        'abcd1234.bck': 'alpha content',
        'bcde2345.bck': 'beta.txt became subdir/renamed.txt'
      }
    })
  })

  afterEach(() => {
    mock.restore()
  })

})

Mocha automatically calls beforeEach before running each tests, and afterEach after each tests completes (which is yet another protocol). All of the tests stay exactly the same, and since mock-fs replaces the functions in the standard fs library with its own, nothing in our application needs to change either.

We are finally ready to write the program that actually backs up files:

import fs from 'fs-extra-promise'

import hashExisting from './hash-existing-async.js'
import findNew from './check-existing-files.js'

const backup = async (src, dst, timestamp = null) => {
  if (timestamp === null) {
    timestamp = Math.round((new Date()).getTime() / 1000)
  }
  timestamp = String(timestamp).padStart(10, '0')

  const existing = await hashExisting(src)
  const needToCopy = await findNew(dst, existing)
  await copyFiles(dst, needToCopy)
  await saveManifest(dst, timestamp, existing)
}

const copyFiles = async (dst, needToCopy) => {
  const promises = Object.keys(needToCopy).map(hash => {
    const srcPath = needToCopy[hash]
    const dstPath = `${dst}/${hash}.bck`
    fs.copyFileAsync(srcPath, dstPath)
  })
  return Promise.all(promises)
}

const saveManifest = async (dst, timestamp, pathHash) => {
  pathHash = pathHash.sort()
  const content = pathHash.map(
    ([path, hash]) => `${path},${hash}`).join('\n')
  const manifest = `${dst}/${timestamp}.csv`
  fs.writeFileAsync(manifest, content, 'utf-8')
}

export default backup

The tests for this are more complicated than tests we have written previously because we want to check with actual file hashes. Let's set up some fixtures to run tests on:


import backup from '../backup.js'

const hashString = (data) => {
  const hasher = crypto.createHash('sha1').setEncoding('hex')
  hasher.write(data)
  hasher.end()
  return hasher.read()
}

const Contents = {
  aaa: 'AAA',
  bbb: 'BBB',
  ccc: 'CCC'
}

const Hashes = Object.keys(Contents).reduce((obj, key) => {
  obj[key] = hashString(Contents[key])
  return obj
}, {})

const Fixture = {
  source: {
    'alpha.txt': Contents.aaa,
    'beta.txt': Contents.bbb,
    gamma: {
      'delta.txt': Contents.ccc
    }
  },
  backup: {}
}

const InitialBackups = Object.keys(Hashes).reduce((set, filename) => {
  set.add(`backup/${Hashes[filename]}.bck`)
  return set
}, new Set())

and then run some tests:


describe('check entire backup process', () => {
  beforeEach(() => {
    mock(Fixture)
  })

  afterEach(() => {
    mock.restore()
  })

  it('creates an initial CSV manifest', async () => {
    await backup('source', 'backup', 0)

    assert.strictEqual((await glob('backup/*')).length, 4,
      'Expected 4 files')

    const actualBackups = new Set(await glob('backup/*.bck'))
    assert.deepStrictEqual(actualBackups, InitialBackups,
      'Expected 3 backup files')

    const actualManifests = await glob('backup/*.csv')
    assert.deepStrictEqual(actualManifests, ['backup/0000000000.csv'],
      'Expected one manifest')
  })

  it('does not duplicate files unnecessarily', async () => {
    await backup('source', 'backup', 0)
    assert.strictEqual((await glob('backup/*')).length, 4,
      'Expected 4 files after first backup')

    await backup('source', 'backup', 1)
    assert.strictEqual((await glob('backup/*')).length, 5,
      'Expected 5 files after second backup')
    const actualBackups = new Set(await glob('backup/*.bck'))
    assert.deepStrictEqual(actualBackups, InitialBackups,
      'Expected 3 backup files after second backup')

    const actualManifests = (await glob('backup/*.csv')).sort()
    assert.deepStrictEqual(actualManifests,
      ['backup/0000000000.csv', 'backup/0000000001.csv'],
      'Expected two manifests')
  })

  it('adds a file as needed', async () => {
    await backup('source', 'backup', 0)
    assert.strictEqual((await glob('backup/*')).length, 4,
      'Expected 4 files after first backup')

    await fs.writeFileAsync('source/newfile.txt', 'NNN')
    const hashOfNewFile = hashString('NNN')

    await backup('source', 'backup', 1)
    assert.strictEqual((await glob('backup/*')).length, 6,
      'Expected 6 files after second backup')
    const expected = new Set(InitialBackups)
      .add(`backup/${hashOfNewFile}.bck`)
    const actualBackups = new Set(await glob('backup/*.bck'))
    assert.deepStrictEqual(actualBackups, expected,
      'Expected 4 backup files after second backup')

    const actualManifests = (await glob('backup/*.csv')).sort()
    assert.deepStrictEqual(actualManifests,
      ['backup/0000000000.csv', 'backup/0000000001.csv'],
      'Expected two manifests')
  })
})

> stjs@1.0.0 test
> mocha */test/test-*.js "-g" "check entire backup process"



  check entire backup process
    ✓ creates an initial CSV manifest
    ✓ does not duplicate files unnecessarily
    ✓ adds a file as needed


  3 passing (18ms)

Design for test

One of the best ways—maybe the best way—to evaluate software design is by thinking about testability Feathers2004. We were able to use a mock filesystem instead of a real one because the filesystem has a well-defined API that is provided to us in a single library, so replacing it is a matter of changing one thing in one place. If you have to change several parts of your code in order to test it, the code is telling you to consolidate those parts into one component.

Exercises

Odds of collision

If hashes were only 2 bits long, then the chances of collision with each successive file assuming no previous collision are:

Number of Files Odds of Collision
1 0%
2 25%
3 50%
4 75%
5 100%

A colleague of yours says this means that if we hash four files, there's only a 75% chance of any collision occurring. What are the actual odds?

Streaming I/O

Write a small program using fs.createReadStream and fs.createWriteStream that copies a file piece by piece instead of reading it into memory and then writing it out again.

Sequencing backups

Modify the backup program so that manifests are numbered sequentially as 00000001.csv, 00000002.csv, and so on rather than being timestamped. Why doesn't this solve the time of check/time of use race condition mentioned earlier.

JSON manifests

  1. Modify backup.js so that it can save JSON manifests as well as CSV manifests based on a command-line flag.

  2. Write another program called migrate.js that converts a set of manifests from CSV to JSON. (The program's name comes from the term data migration.)

  3. Modify backup.js programs so that each manifest stores the user name of the person who created it along with file hashes, and then modify migrate.js to transform old files into the new format.

Testing line counting

Write tests for the line-counting functions of using Mocha and mock-fs. Did you find (at least) two bugs?

Mock hashes

  1. Modify the file backup program so that it uses a function called ourHash to hash files.
  2. Create a replacement that returns some predictable value, such as the first few characters of the data.
  3. Rewrite the tests to use this function.

How did you modify the main program so that the tests could control which hashing function is used?

Comparing manifests

Write a program compare-manifests.js that reads two manifest files and reports:

From one state to another

  1. Write a program called from-to.js that takes the name of a directory and the name of a manifest file as its command-line arguments, then adds, removes, and/or renames files in the directory to restore the state described in the manifest. The program should only perform file operations when it needs to, e.g., it should not delete a file and re-add it if the contents have not changed.

  2. Write some tests for from-to.js using Mocha and mock-fs.

File history

  1. Write a program called file-history.js that takes the name of a file as a command-line argument and displays the history of that file by tracing it back in time through the available manifests.

  2. Write tests for your program using Mocha and mock-fs.

Pre-commit hooks

Modify backup.js to load and run a function called preCommit from a file called pre-commit.js stored in the root directory of the files being backed up. If preCommit returns true, the backup proceeds; if it returns false or throws an exception, no backup is created.