File Backup

Archiving files with directory structure

Terms defined: collision, comma-separated values (CSV), Coordinated Universal Time (UTC), cryptographic hash function, handler, hash code, hash function, JavaScript Object Notation (JSON), mock object, pipe, race condition, stream, Time of check/time of use (ToCToU), timestamp, version control system

Now that we can test software, the next step is to be able to save it. A version control system like Git keeps track of changes to files so that we can recover old versions if we want to. At its core is a system for archiving files in a way that:

  1. records which versions of which files existed at the same time (so that we can go back to a consistent previous state), and

  2. stores any particular version of a file only once, so that we don't waste disk space.

In this chapter we will build a simple tool for doing both tasks.

How can we uniquely identify files?

To avoid storing redundant copies of files, we need a way to tell when two files contain the same data. We could compare the files byte by byte, but a quicker way is to use a hash function that turns arbitrary data into a fixed-length string of bits.

[Figure: Hash functions. How hash functions speed up lookup.]

A hash function always produces the same hash code for a given input. A cryptographic hash function has two extra properties:

  1. The hash codes it produces look like random numbers: they are evenly distributed (i.e., the odds of getting any specific output value are exactly the same).

  2. The hash code depends on the entire input: changing even a single byte results in a different hash code.

It's easy to write a bad hash function, but very hard to write one that qualifies as cryptographic. We will therefore use a library to calculate 160-bit SHA-1 hashes for our files. These are not random enough to keep data secret from a patient, well-funded attacker, but that's not what we're using them for: we just want hashes that are random enough to make collisions extremely unlikely.
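To see the second property in action, we can hash two strings that differ by a single character. This sketch is ours, not part of the tool we are building; it uses the hash object's update and digest methods, another part of Node's crypto API:

import crypto from 'crypto'

const digest = (text) => crypto.createHash('sha1').update(text).digest('hex')

// one changed character produces a completely different hash code
console.log(digest('hello'))  // aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
console.log(digest('hellp'))  // shares no obvious pattern with the line above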

The Birthday Problem

The odds that two people share a birthday are 1/365 (ignoring February 29). The odds that they don't are therefore 364/365. When we add a third person, the odds that they don't share a birthday with either of the preceding two people are 363/365, so the overall odds that nobody shares a birthday are (365/365)×(364/365)×(363/365). If we keep calculating, there's a 50% chance of two people sharing a birthday in a group of just 23 people, and a 99.9% chance with 70 people.

We can use the same math to calculate how many files we need to hash before there's a 50% chance of a collision. Instead of 365, we use 2^160 (the number of values that are 160 bits long), and quickly get into "if every atom in the universe was a file there still wouldn't be collisions" territory.
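The arithmetic is easy to check directly. This short sketch (ours, not part of the chapter's code) reproduces the numbers above and shows why 160-bit hashes are safe:

// probability that at least two of `n` samples collide
// when there are `space` equally likely values
const collisionOdds = (n, space) => {
  let noCollision = 1
  for (let i = 0; i < n; i += 1) {
    noCollision *= (space - i) / space
  }
  return 1 - noCollision
}

console.log(collisionOdds(23, 365))  // ≈ 0.507 for 23 people
console.log(collisionOdds(70, 365))  // ≈ 0.999 for 70 people

// For huge spaces the loop is impractical, so use the standard
// approximation 1 - exp(-n(n-1)/(2*space)) instead:
const approxOdds = (n, space) => 1 - Math.exp(-n * (n - 1) / (2 * space))
console.log(approxOdds(1e9, 2 ** 160))  // prints 0: far below double precision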

Node's crypto module provides the functions we need to create a SHA-1 hash. To do this, we create an object that keeps track of the current state of the hashing calculations, tell it how we want to encode (or represent) the hash value, and then feed it some bytes. When we are done, we call its .end method and then use its .read method to get the final result:

import crypto from 'crypto'

// create a SHA1 hasher
const hash = crypto.createHash('sha1')

// encode as hex (rather than binary)
hash.setEncoding('hex')

// send it some text
const text = process.argv[2]
hash.write(text)

// signal end of text
hash.end()

// display the result
const sha1sum = hash.read()
console.log(`SHA1 of "${text}" is ${sha1sum}`)
node hash-text.js something
SHA1 of "something" is 1af17e73721dbe0c40011b82ed4bb1a7dbe3ce29

Given this, hashing a file is straightforward: we just read the file and pass its contents to the hashing object:

import fs from 'fs'
import crypto from 'crypto'

const filename = process.argv[2]
const data = fs.readFileSync(filename, 'utf-8')

const hash = crypto.createHash('sha1').setEncoding('hex')
hash.write(data)
hash.end()
const sha1sum = hash.read()

console.log(`SHA1 of "${filename}" is ${sha1sum}`)
node hash-file.js hash-file.js
SHA1 of "hash-file.js" is c54c8ee3e576770d29ae2d0d73568e5a5c49eac0

However, it is more efficient to process the file as a stream:

import fs from 'fs'
import crypto from 'crypto'

const filename = process.argv[2]
const hash = crypto.createHash('sha1').setEncoding('hex')
fs.createReadStream(filename).pipe(hash)
hash.on('finish', () => {
  const final = hash.read()
  console.log('final', final)
})
console.log('program ends')
node hash-stream.js hash-stream.js
program ends
final dc9e6c231e243860dace2dbf52845b121062b60e

Many libraries rely on streams so that programs don't have to read entire (possibly large) files into memory. To start, this program asks the fs library to create a reading stream for a file and to pipe the data from that stream to the hashing object. It then tells the hashing object what to do when there is no more data by providing a handler for the "finish" event. This is called asynchronously: as the output shows, the main program ends before the task handling the end of data is scheduled and run.

[Figure: Streaming file operations. Processing files as streams of chunks.]
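If we want to handle the chunks ourselves rather than piping, we can subscribe to the stream's events directly. This sketch is ours; it uses the hash object's update and digest methods rather than the write/read style shown above:

import fs from 'fs'
import crypto from 'crypto'

const hash = crypto.createHash('sha1')
const stream = fs.createReadStream(process.argv[2])
stream.on('data', (chunk) => hash.update(chunk))  // runs once per chunk
stream.on('end', () => console.log(hash.digest('hex')))  // runs after the last chunk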

How can we back up files?

Many files don't change after they're created, or only change very slowly. It would be wasteful for a version control system to copy them all each time the user wanted to save a snapshot of a project, so instead our tool will copy each unique file to something like abcd1234.bck, where abcd1234 is a hash of the file's contents. It will then store a data structure that records the filenames and hash keys for each snapshot. The hash keys tell it which unique files are part of the snapshot, while the filenames tell us what each file's contents were called when the snapshot was made (since files can be moved or renamed). To restore a particular snapshot, all we have to do is copy the saved .bck files back to where they were.

[Figure: Backup file storage. Organization of backup file storage.]
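Restoring is not built in this chapter, but a sketch shows how little is involved. The restore function below is hypothetical, and it assumes that fs-extra-promise exposes promisified ensureDirAsync and copyAsync methods (following the same Async naming used elsewhere in this chapter):

import fs from 'fs-extra-promise'
import path from 'path'

const restore = async (backupDir, timestamp, targetDir) => {
  // read one manifest and copy each saved .bck file back into place
  const manifest = await fs.readFileAsync(`${backupDir}/${timestamp}.csv`, 'utf-8')
  const lines = manifest.split('\n').filter(line => line.length > 0)
  for (const line of lines) {
    const [fileName, hash] = line.split(',')
    const dest = path.join(targetDir, fileName)
    await fs.ensureDirAsync(path.dirname(dest))  // recreate sub-directories
    await fs.copyAsync(path.join(backupDir, `${hash}.bck`), dest)
  }
}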

We can build the tools we need to do this using promises. The main function creates a promise that uses the asynchronous version of glob to find files and then:

  1. checks that entries in the list are actually files;

  2. reads each file into memory; and

  3. calculates hashes for those files.

import fs from 'fs-extra-promise'
import glob from 'glob-promise'
import crypto from 'crypto'

const hashExisting = (rootDir) => {
  const pattern = `${rootDir}/**/*`
  return new Promise((resolve, reject) => {
    glob(pattern, {})
      .then(matches => Promise.all(
        matches.map(path => statPath(path))))
      .then(pairs => pairs.filter(
        ([path, stat]) => stat.isFile()))
      .then(pairs => Promise.all(
        pairs.map(([path, stat]) => readPath(path))))
      .then(pairs => Promise.all(
        pairs.map(([path, content]) => hashPath(path, content))))
      .then(pairs => resolve(pairs))
      .catch(err => reject(err))
  })
}
...
export default hashExisting

Notice that this function uses Promise.all to wait for the operations on all of the files in the list to complete before going on to the next step. A different design would combine stat, read, and hash into a single step so that each file would be handled independently and there would be only one Promise.all at the end to bring them all together.
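Here is a sketch of that alternative design (our own, under the name hashExistingPerFile; it reuses the statPath, readPath, and hashPath helpers shown below):

const hashExistingPerFile = (rootDir) => {
  const pattern = `${rootDir}/**/*`
  return glob(pattern, {})
    .then(matches => Promise.all(matches.map(
      path => statPath(path)
        .then(([path, stat]) => stat.isFile() ? readPath(path) : null)  // skip directories
        .then(pair => (pair === null) ? null : hashPath(...pair))
    )))
    .then(pairs => pairs.filter(pair => pair !== null))  // drop the skipped entries
}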

The first two helper functions that hashExisting relies on wrap asynchronous operations in promises:

const statPath = (path) => {
  return new Promise((resolve, reject) => {
    fs.statAsync(path)
      .then(stat => resolve([path, stat]))
      .catch(err => reject(err))
  })
}

const readPath = (path) => {
  return new Promise((resolve, reject) => {
    fs.readFileAsync(path, 'utf-8')
      .then(content => resolve([path, content]))
      .catch(err => reject(err))
  })
}

The final helper function calculates the hash synchronously, but we can use Promise.all to wait on those operations finishing anyway:

const hashPath = (path, content) => {
  const hasher = crypto.createHash('sha1').setEncoding('hex')
  hasher.write(content)
  hasher.end()
  return [path, hasher.read()]
}

Let's try running it:

import hashExisting from './hash-existing-promise.js'

const root = process.argv[2]
hashExisting(root).then(pairs => pairs.forEach(
  ([path, hash]) => console.log(path, hash)
))
node run-hash-existing-promise.js . | fgrep -v test/ | fgrep -v '~'
./backup.js 11422489e11be3d8ff76278503457665f6152ebe
./check-existing-files.js 66b933cf9e792e9a9204171d04e0f8b530ec3f4f
./hash-existing-async.js 38518660463a80367f9719a870916eeab2933a8d
./hash-existing-promise.js 2b022a50294697eef84689b63ab46f0b899d14db
./hash-file.js c54c8ee3e576770d29ae2d0d73568e5a5c49eac0
./hash-file.out 7d5014143474e733daf263a94d4d43df30ba91cf
./hash-file.sh ff2f263c71ce98338adf4a8a65c297c9361e3e16
./hash-stream.js dc9e6c231e243860dace2dbf52845b121062b60e
./hash-stream.out af7195be4585e588f301cbfbddd5a47966cb704c
./hash-stream.sh 0dd6064f3005e832fef580f3de7fa0980fd81dee
...
./x-from-to/problem.md 47bb0e1ef733bc2b946f23bb03742272b34f7bf9
./x-from-to/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-io-streams/problem.md 298897364632cd64c8a50c0a08aa3e235335c0d4
./x-io-streams/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-json-manifests/problem.md 83ad4f0d55f420dbe694b72a26a89b9b4d92bea6
./x-json-manifests/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-mock-hashes/problem.md c1e815e888846fddd0e3d28640c0e20a8c665b25
./x-mock-hashes/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-pre-commit/problem.md b7d945af4554fc0f64b708fe735417bee8b33eef
./x-pre-commit/solution.md 412b87759362bb14d417e02efe7c3607274e10ca

The code we have written is clearer than it would be with callbacks---if you don't believe this, try rewriting it---but the layer of promises around everything still obscures its meaning. Here are the same operations written using async and await:

import fs from 'fs-extra-promise'
import glob from 'glob-promise'
import crypto from 'crypto'

const statPath = async (path) => {
  const stat = await fs.statAsync(path)
  return [path, stat]
}

const readPath = async (path) => {
  const content = await fs.readFileAsync(path, 'utf-8')
  return [path, content]
}

const hashPath = (path, content) => {
  const hasher = crypto.createHash('sha1').setEncoding('hex')
  hasher.write(content)
  hasher.end()
  return [path, hasher.read()]
}

const hashExisting = async (rootDir) => {
  const pattern = `${rootDir}/**/*`
  const options = {}
  const matches = await glob(pattern, options)
  const stats = await Promise.all(matches.map(path => statPath(path)))
  const files = stats.filter(([path, stat]) => stat.isFile())
  const contents = await Promise.all(
    files.map(([path, stat]) => readPath(path)))
  const hashes = contents.map(
    ([path, content]) => hashPath(path, content))
  return hashes
}

export default hashExisting

This version creates and resolves exactly the same promises as the previous one, but those promises are created for us automatically by Node. To check that it works, let's run it for the same input files:

import hashExisting from './hash-existing-async.js'

const root = process.argv[2]
hashExisting(root).then(
  pairs => pairs.forEach(([path, hash]) => console.log(path, hash)))
node run-hash-existing-async.js . | fgrep -v test/ | fgrep -v '~'
./backup.js 11422489e11be3d8ff76278503457665f6152ebe
./check-existing-files.js 66b933cf9e792e9a9204171d04e0f8b530ec3f4f
./hash-existing-async.js 38518660463a80367f9719a870916eeab2933a8d
./hash-existing-promise.js 2b022a50294697eef84689b63ab46f0b899d14db
./hash-file.js c54c8ee3e576770d29ae2d0d73568e5a5c49eac0
./hash-file.out 7d5014143474e733daf263a94d4d43df30ba91cf
./hash-file.sh ff2f263c71ce98338adf4a8a65c297c9361e3e16
./hash-stream.js dc9e6c231e243860dace2dbf52845b121062b60e
./hash-stream.out af7195be4585e588f301cbfbddd5a47966cb704c
./hash-stream.sh 0dd6064f3005e832fef580f3de7fa0980fd81dee
...
./x-from-to/problem.md 47bb0e1ef733bc2b946f23bb03742272b34f7bf9
./x-from-to/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-io-streams/problem.md 298897364632cd64c8a50c0a08aa3e235335c0d4
./x-io-streams/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-json-manifests/problem.md 83ad4f0d55f420dbe694b72a26a89b9b4d92bea6
./x-json-manifests/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-mock-hashes/problem.md c1e815e888846fddd0e3d28640c0e20a8c665b25
./x-mock-hashes/solution.md 412b87759362bb14d417e02efe7c3607274e10ca
./x-pre-commit/problem.md b7d945af4554fc0f64b708fe735417bee8b33eef
./x-pre-commit/solution.md 412b87759362bb14d417e02efe7c3607274e10ca

How can we track which files have already been backed up?

The second part of our backup tool keeps track of which files have and haven't been backed up already. It stores backups in a directory that contains backup files like abcd1234.bck and files describing the contents of particular snapshots. The latter are named ssssssssss.csv, where ssssssssss is the UTC timestamp of the backup's creation and the .csv extension indicates that the file is formatted as comma-separated values. (We could store these files as JSON, but CSV is easier for people to read.)

[Figure: File backup directory structure. Storing backups and index files in the same directory.]
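For example, the manifest for a two-file snapshot might look like this (the names and shortened hashes are made up, in the same style as the test fixtures later in this chapter):

alpha.js,abcd1234
subdir/beta.txt,bcde2345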

Time of check/time of use

Our naming convention for index files will fail if we try to create more than one backup per second. This might seem very unlikely, but many faults and security holes are the result of programmers assuming things weren't going to happen.

We could try to avoid this problem by using a two-part naming scheme ssssssss-a.csv, ssssssss-b.csv, and so on, but this leads to a race condition called time of check/time of use. If two users run the backup tool at the same time, they will both see that there isn't a file (yet) with the current timestamp, so they will both try to create the first one.
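One standard way to close this kind of race is to make the check and the creation a single atomic operation by asking the operating system to create the file only if it does not already exist. A sketch (ours, not part of the tool built here) using Node's 'wx' file flag:

import fs from 'fs'

const tryCreate = (filename, content) => {
  try {
    // 'wx' means "create for writing, fail if the file already exists",
    // so the check and the creation happen as one atomic operation
    fs.writeFileSync(filename, content, { flag: 'wx' })
    return true
  } catch (err) {
    if (err.code === 'EEXIST') {
      return false  // someone else got there first: pick another name
    }
    throw err
  }
}

Returning to our tool: the next step is to figure out which files need to be backed up. findNewFiles builds a table mapping hashes to paths, then removes every hash that already has a .bck file in the backup directory: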

import glob from 'glob-promise'
import path from 'path'

const findNewFiles = async (rootDir, pathHashPairs) => {
  const hashToPath = pathHashPairs.reduce((obj, [path, hash]) => {
    obj[hash] = path
    return obj
  }, {})

  const pattern = `${rootDir}/*.bck`
  const options = {}
  const existingFiles = await glob(pattern, options)

  existingFiles.forEach(filename => {
    const stripped = path.basename(filename).replace(/\.bck$/, '')
    delete hashToPath[stripped]
  })

  return hashToPath
}

export default findNewFiles
[Figure: Time of check/time of use race condition. Race condition when time of check and time of use are not the same.]

To test our program, let's manually create testing directories with manufactured (shortened) hashes:

tree --charset unicode test
test
|-- bck-0-csv-0
|-- bck-1-csv-1
|   |-- 0001.csv
|   `-- abcd1234.bck
|-- bck-4-csv-2
|   |-- 0001.csv
|   |-- 3028.csv
|   |-- 3456cdef.bck
|   |-- abcd1234.bck
|   `-- bcde2345.bck
|-- test-backup.js
|-- test-find-mock.js
`-- test-find.js

3 directories, 10 files

We use Mocha to manage our tests. Every test is an async function; Mocha automatically waits for them all to complete before reporting results. To run them, we add the line:

"test": "mocha */test/test-*.js"

to the scripts section of our project's package.json file so that when we run npm run test, Mocha looks for files in test sub-directories of the directories holding our lessons.
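In context, the relevant part of package.json looks something like this (other fields omitted):

{
  "scripts": {
    "test": "mocha */test/test-*.js"
  }
}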

Here are our first few tests:

import assert from 'assert'

import findNewFiles from '../check-existing-files.js'

describe('pre-existing hashes and actual filesystem', () => {
  it('finds no pre-existing files when none given or exist', async () => {
    const expected = {}
    const actual = await findNewFiles('file-backup/test/bck-0-csv-0', [])
    assert.deepStrictEqual(expected, actual,
      'Expected no files')
  })

  it('finds some files when one file is given and none exist', async () => {
    const check = [['somefile.txt', '9876fedc']]
    const expected = { '9876fedc': 'somefile.txt' }
    const actual = await findNewFiles('file-backup/test/bck-0-csv-0', check)
    assert.deepStrictEqual(expected, actual,
      'Expected one file')
  })

  it('finds nothing needs backup when there is a match', async () => {
    const check = [['alpha.js', 'abcd1234']]
    const expected = {}
    const actual = await findNewFiles('file-backup/test/bck-1-csv-1', check)
    assert.deepStrictEqual(expected, actual,
      'Expected no files')
  })

  it('finds something needs backup when there is a mismatch', async () => {
    const check = [['alpha.js', 'a1b2c3d4']]
    const expected = { a1b2c3d4: 'alpha.js' }
    const actual = await findNewFiles('file-backup/test/bck-1-csv-1', check)
    assert.deepStrictEqual(expected, actual,
      'Expected one file')
  })

  it('finds mixed matches', async () => {
    const check = [
      ['matches.js', '3456cdef'],
      ['matches.txt', 'abcd1234'],
      ['mismatch.txt', '12345678']
    ]
    const expected = { 12345678: 'mismatch.txt' }
    const actual = await findNewFiles('file-backup/test/bck-4-csv-2', check)
    assert.deepStrictEqual(expected, actual,
      'Expected one file')
  })
})

and here is Mocha's report:


> stjs@1.0.0 test /u/stjs
> mocha */test/test-*.js "-g" "pre-existing hashes"



  checks for pre-existing hashes using mock filesystem
    ✓ finds no pre-existing files when none given or exist
    ✓ finds some files when one file is given and none exist
    ✓ finds nothing needs backup when there is a match
    ✓ finds something needs backup when there is a mismatch
    ✓ finds mixed matches

  pre-existing hashes and actual filesystem
    ✓ finds no pre-existing files when none given or exist
    ✓ finds some files when one file is given and none exist
    ✓ finds nothing needs backup when there is a match
    ✓ finds something needs backup when there is a mismatch
    ✓ finds mixed matches


  10 passing (28ms)

How can we test code that modifies files?

The final thing our tool needs to do is copy the files that need copying and create a new index file. The code itself will be relatively simple, but testing will be complicated by the fact that our tests will need to create directories and files before they run and then delete them afterward (so that they don't contaminate subsequent tests).
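For example, each test could build and tear down a scratch directory on the real filesystem. A sketch of what that overhead looks like (ours, using Node's temporary-directory helpers):

import fs from 'fs'
import os from 'os'
import path from 'path'

describe('backup on the real filesystem', () => {
  let workDir = null

  beforeEach(() => {
    // create a fresh scratch directory before every test
    workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'backup-test-'))
  })

  afterEach(() => {
    // remove the directory and everything in it afterward
    fs.rmSync(workDir, { recursive: true, force: true })
  })

  // ...tests that create and back up files under workDir...
})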

A better approach is to use a mock object instead of the real filesystem. A mock object has the same interface as the function, object, class, or library that it replaces, but is designed to be used solely for testing.

[Figure: Mock filesystem. Using a mock filesystem to simplify testing.]

The mock-fs library provides the same functions as Node's fs library, but stores everything in memory. This prevents our tests from accidentally disturbing the filesystem, and also makes tests much faster (since in-memory operations are thousands of times faster than operations that touch the actual filesystem).

We can create a mock filesystem by giving the library a JSON description of the files and what they should contain:

import assert from 'assert'
import mock from 'mock-fs'

import findNewFiles from '../check-existing-files.js'

describe('checks for pre-existing hashes using mock filesystem', () => {
  beforeEach(() => {
    mock({
      'bck-0-csv-0': {},
      'bck-1-csv-1': {
        '0001.csv': 'alpha.js,abcd1234',
        'abcd1234.bck': 'alpha.js content'
      },
      'bck-4-csv-2': {
        '0001.csv': ['alpha.js,abcd1234',
          'beta.txt,bcde2345'].join('\n'),
        '3024.csv': ['alpha.js,abcd1234',
          'gamma.png,3456cdef',
          'subdir/renamed.txt,bcde2345'].join('\n'),
        '3456cdef.bck': 'gamma.png content',
        'abcd1234.bck': 'alpha content',
        'bcde2345.bck': 'beta.txt became subdir/renamed.txt'
      }
    })
  })

  afterEach(() => {
    mock.restore()
  })
...
})

Mocha automatically calls beforeEach before running each test and afterEach after each test completes. All of the tests stay exactly the same, and since mock-fs replaces the functions in the standard fs library with its own, nothing in our application needs to change either.

We are finally ready to write the program that actually backs up files:

import fs from 'fs-extra-promise'

import hashExisting from './hash-existing-async.js'
import findNewFiles from './check-existing-files.js'

const backup = async (src, dst, timestamp = null) => {
  if (timestamp === null) {
    timestamp = Math.round((new Date()).getTime() / 1000)
  }
  timestamp = String(timestamp).padStart(10, '0')

  const existing = await hashExisting(src)
  const needToCopy = await findNewFiles(dst, existing)
  await copyFiles(dst, needToCopy)
  await saveManifest(dst, timestamp, existing)
}

const copyFiles = async (dst, needToCopy) => {
  const promises = Object.keys(needToCopy).map(hash => {
    const srcPath = needToCopy[hash]
    const dstPath = `${dst}/${hash}.bck`
    fs.copyFileAsync(srcPath, dstPath)
  })
  return Promise.all(promises)
}

const saveManifest = async (dst, timestamp, pathHash) => {
  pathHash = pathHash.sort()
  const content = pathHash.map(
    ([path, hash]) => `${path},${hash}`).join('\n')
  const manifest = `${dst}/${timestamp}.csv`
  fs.writeFileAsync(manifest, content, 'utf-8')
}

export default backup
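A minimal driver for this module might look like the following (our sketch; the chapter itself only exercises backup through its tests):

import backup from './backup.js'

const [src, dst] = process.argv.slice(2)
backup(src, dst)
  .then(() => console.log('backup complete'))
  .catch(err => console.error(err))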

The tests for this are more complicated than the tests we have written previously because we want to check results that contain actual file hashes. Let's set up some fixtures to run tests on:

import assert from 'assert'
import crypto from 'crypto'
import glob from 'glob-promise'
import fs from 'fs-extra-promise'
import mock from 'mock-fs'

import backup from '../backup.js'

const hashString = (data) => {
  const hasher = crypto.createHash('sha1').setEncoding('hex')
  hasher.write(data)
  hasher.end()
  return hasher.read()
}

const Contents = {
  aaa: 'AAA',
  bbb: 'BBB',
  ccc: 'CCC'
}

const Hashes = Object.keys(Contents).reduce((obj, key) => {
  obj[key] = hashString(Contents[key])
  return obj
}, {})

const Fixture = {
  source: {
    'alpha.txt': Contents.aaa,
    'beta.txt': Contents.bbb,
    gamma: {
      'delta.txt': Contents.ccc
    }
  },
  backup: {}
}

const InitialBackups = Object.keys(Hashes).reduce((set, filename) => {
  set.add(`backup/${Hashes[filename]}.bck`)
  return set
}, new Set())

and then run some tests:

describe('check entire backup process', () => {
  beforeEach(() => {
    mock(Fixture)
  })

  afterEach(() => {
    mock.restore()
  })

  it('creates an initial CSV manifest', async () => {
    await backup('source', 'backup', 0)

    assert.strictEqual((await glob('backup/*')).length, 4,
      'Expected 4 files')

    const actualBackups = new Set(await glob('backup/*.bck'))
    assert.deepStrictEqual(actualBackups, InitialBackups,
      'Expected 3 backup files')

    const actualManifests = await glob('backup/*.csv')
    assert.deepStrictEqual(actualManifests, ['backup/0000000000.csv'],
      'Expected one manifest')
  })

  it('does not duplicate files unnecessarily', async () => {
    await backup('source', 'backup', 0)
    assert.strictEqual((await glob('backup/*')).length, 4,
      'Expected 4 files after first backup')

    await backup('source', 'backup', 1)
    assert.strictEqual((await glob('backup/*')).length, 5,
      'Expected 5 files after second backup')
    const actualBackups = new Set(await glob('backup/*.bck'))
    assert.deepStrictEqual(actualBackups, InitialBackups,
      'Expected 3 backup files after second backup')

    const actualManifests = (await glob('backup/*.csv')).sort()
    assert.deepStrictEqual(actualManifests,
      ['backup/0000000000.csv', 'backup/0000000001.csv'],
      'Expected two manifests')
  })

  it('adds a file as needed', async () => {
    await backup('source', 'backup', 0)
    assert.strictEqual((await glob('backup/*')).length, 4,
      'Expected 4 files after first backup')

    await fs.writeFileAsync('source/newfile.txt', 'NNN')
    const hashOfNewFile = hashString('NNN')

    await backup('source', 'backup', 1)
    assert.strictEqual((await glob('backup/*')).length, 6,
      'Expected 6 files after second backup')
    const expected = new Set(InitialBackups)
      .add(`backup/${hashOfNewFile}.bck`)
    const actualBackups = new Set(await glob('backup/*.bck'))
    assert.deepStrictEqual(actualBackups, expected,
      'Expected 4 backup files after second backup')

    const actualManifests = (await glob('backup/*.csv')).sort()
    assert.deepStrictEqual(actualManifests,
      ['backup/0000000000.csv', 'backup/0000000001.csv'],
      'Expected two manifests')
  })
})

OK, what did I break?

> stjs@1.0.0 test /u/stjs
> mocha */test/test-*.js "-g" "check entire backup process"


  check entire backup process
    1) creates an initial CSV manifest
    2) does not duplicate files unnecessarily
    3) adds a file as needed


  0 passing (4s)
  3 failing

  1) check entire backup process
       creates an initial CSV manifest:
     Error: ENXIO: no such device or address, read
  

  2) check entire backup process
       does not duplicate files unnecessarily:
     Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/u/stjs/file-backup/test/test-backup.js)
      at listOnTimeout (internal/timers.js:554:17)
      at processTimers (internal/timers.js:497:7)

  3) check entire backup process
       adds a file as needed:
     Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/u/stjs/file-backup/test/test-backup.js)
      at listOnTimeout (internal/timers.js:554:17)
      at processTimers (internal/timers.js:497:7)

Exercises

Odds of collision

If hashes were only 2 bits long, then the chances of collision with each successive file (assuming no previous collision) are:

Number of Files    Odds of Collision
       1                   0%
       2                  25%
       3                  50%
       4                  75%
       5                 100%

A colleague of yours says this means that if we hash four files, there's only a 75% chance of any collision occurring. What are the actual odds?

Streaming I/O

Write a small program using fs.createReadStream and fs.createWriteStream that copies a file piece by piece instead of reading it into memory and then writing it out again.

Sequencing backups

Modify the backup program so that manifests are numbered sequentially as 00000001.csv, 00000002.csv, and so on rather than being timestamped. Why doesn't this solve the time of check/time of use race condition mentioned earlier?

JSON manifests

  1. Modify backup.js so that it can save JSON manifests as well as CSV manifests based on a command-line flag.

  2. Write another program called migrate.js that converts a set of manifests from CSV to JSON. (The program's name comes from the term data migration.)

  3. Modify backup.js so that each manifest stores the user name of the person who created it along with the file hashes, and then modify migrate.js to transform old files into the new format.

Testing line counting

Write tests for the line-counting functions using Mocha and mock-fs. Did you find (at least) two bugs?

Mock hashes

  1. Modify the file backup program so that it uses a function called ourHash to hash files.
  2. Create a replacement that returns some predictable value, such as the first few characters of the data.
  3. Rewrite the tests to use this function.

How did you modify the main program so that the tests could control which hashing function is used?

Comparing manifests

Write a program compare-manifests.js that reads two manifest files and reports:

  1. Which files have the same names but different hashes (i.e., their contents have changed).

  2. Which files have the same hashes but different names (i.e., they have been renamed).

  3. Which files are in the first manifest but neither their names nor their hashes are in the second (i.e., they have been deleted).

  4. Which files are in the second manifest but neither their names nor their hashes are in the first (i.e., they have been added).

From one state to another

  1. Write a program called from-to.js that takes the name of a directory and the name of a manifest file as its command-line arguments, then adds, removes, and/or renames files in the directory to restore the state described in the manifest. The program should only perform file operations when it needs to, e.g., it should not delete a file and re-add it if the contents have not changed.

  2. Write some tests for from-to.js using Mocha and mock-fs.

File history

  1. Write a program called file-history.js that takes the name of a file as a command-line argument and displays the history of that file by tracing it back in time through the available manifests.

  2. Write tests for your program using Mocha and mock-fs.

Pre-commit hooks

Modify backup.js to load and run a function called preCommit from a file called pre-commit.js stored in the root directory of the files being backed up. If preCommit returns true, the backup proceeds; if it returns false or throws an exception, no backup is created.