Data Tables

Loading, saving, and manipulating tables efficiently

Terms defined: column-major storage, data frame, garbage collection, heterogeneous, homogeneous, immutable, row-major storage, SQL, tagged data, test harness

In the previous chapter we said that operations in memory are thousands of times faster than operations that touch the filesystem, but how accurate is that claim? More generally, given two ways to implement something, how can we tell which is more efficient?

The simplest way is usually to conduct some experiments. To see how to do this, we will take a look at several ways to implement data tables (sometimes called data frames) like those in R's tidyverse, Python's Pandas library, or the DataForge library for JavaScript. A data table has one or more named columns and zero or more rows; each row has one value for each column, and all the values in a column have the same type (see the figure below).

Data table structure
The structure of a data table.

The key operations on data tables are the same as those in SQL: filter, select, summarize, and join. These can be implemented in about 1000 lines of code, but their performance depends on how the data table is stored.

How can we implement data tables?

One way to store a table is row-major order, in which the values in each row are stored together in memory. This is sometimes also called heterogeneous storage because each "unit" of storage can contain values of different types. In JavaScript, we implement this as an array of objects (see the figure below).

Another option is column-major or homogeneous order, in which all the values in a column are stored together. In JavaScript, this could be implemented using an object whose members are all arrays of the same length.

Row-major vs. column-major storage order
Row-major storage vs. column-major storage for data tables.
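
For example, here is how one small table might look in each layout (the column names and values below are made up purely for illustration):

// The same small table stored in row-major and column-major form.
const rowMajor = [
  {name: 'black', red: 0, green: 0, blue: 0},
  {name: 'white', red: 255, green: 255, blue: 255}
]

const colMajor = {
  name: ['black', 'white'],
  red: [0, 255],
  green: [0, 255],
  blue: [0, 255]
}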

To find out which is better we will construct one of each, try some operations, record their execution times and memory use, and then compare them. The answer will depend on both the implementations and on what mix of operations we measure: for example, if one strategy works better for filter and another for select, the ratio of filters to selects may determine which is "best".

Immutability

All of our implementations will treat each data table as immutable: once we have created it, we will not modify its contents. This makes the programming easier (and safer, since shared data structures are a rich source of bugs), and doesn't actually have much impact on performance. For example, if we filter a data table stored in column-major order, we can either move elements in memory to fill the holes created by deleted rows, or copy the values we want to keep to a new block of contiguous storage.
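
For example, here is a small standalone sketch (not part of our table code) contrasting those two options for a single column:

// Two ways to drop unwanted rows from one column (illustrative values only).
const keep = [true, false, true, false]
const column = [10, 20, 30, 40]

// (a) copy the survivors into a fresh array, leaving the original untouched
const copied = column.filter((value, i) => keep[i])

// (b) compact the array in place, mutating the original
let write = 0
for (let read = 0; read < column.length; read += 1) {
  if (keep[read]) {
    column[write] = column[read]
    write += 1
  }
}
column.length = write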

For our first experiment, let's build a row-major table with some number of columns. To keep it simple, we will fill every cell in a row with that row's index.

export const buildRows = (nRows, labels) => {
  const result = []
  for (let iR = 0; iR < nRows; iR += 1) {
    const row = {}
    labels.forEach(label => {
      row[label] = iR
    })
    result.push(row)
  }
  return result
}

Next, we write filter and select for tables laid out this way. filter needs a callback function to decide which rows to keep, just like the callback for Array.filter; for selecting columns, we provide a list of the keys that identify the columns we want to keep. We expect filtering to be relatively fast, since it recycles the existing row objects, while selecting should be relatively slow, since we have to construct a new object for every row (see the figure below).

const rowFilter = (table, func) => {
  return table.filter(row => func(row))
}

const rowSelect = (table, toKeep) => {
  return table.map(row => {
    const newRow = {}
    toKeep.forEach(label => {
      newRow[label] = row[label]
    })
    return newRow
  })
}
Row-major operations
Operations on row-major data tables.

Now let's do the same for column-major storage. Building the object that holds the columns is straightforward:

export const buildCols = (nRows, labels) => {
  const result = {}
  labels.forEach(label => {
    result[label] = []
    for (let iR = 0; iR < nRows; iR += 1) {
      result[label].push(iR)
    }
  })
  return result
}

Filtering is more complex, since the values in each row are scattered across several arrays, but selecting is just a matter of recycling the arrays we want in the new table. We expect selecting to be relatively fast, since only the references to the columns need to be copied, but filtering will be relatively slow, since we are constructing multiple new arrays (see the figure below).

const colFilter = (table, func) => {
  const result = {}
  const labels = Object.keys(table)
  labels.forEach(label => {
    result[label] = []
  })
  for (let iR = 0; iR < table[labels[0]].length; iR += 1) {
    if (func(table, iR)) {
      labels.forEach(label => {
        result[label].push(table[label][iR])
      })
    }
  }
  return result
}

const colSelect = (table, toKeep) => {
  const result = {}
  toKeep.forEach(label => {
    result[label] = table[label]
  })
  return result
}
Column-major operations
Operations on column-major data tables.

Not quite polymorphic

Our tests would be simpler to write if the two versions of filter and select took exactly the same parameters, but the row-testing functions for filter are different because of the differences in the ways the tables are stored. We could force them to be the same by (for example) packing the values for each row in the column-major implementation into a temporary object and passing that to the same filtering function we used for the row-major implementation, but that extra work would bias the performance comparison in row-major's favor.
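
For example, an adapter that let colFilter reuse a row-style predicate might look like the sketch below (we do not use it in our experiments); the temporary object built for every row is exactly the extra work that would skew the timings.

// Sketch only: adapt column-major filtering to row-style callbacks.
const colFilterAsRows = (table, rowFunc) => {
  const labels = Object.keys(table)
  const result = {}
  labels.forEach(label => {
    result[label] = []
  })
  for (let iR = 0; iR < table[labels[0]].length; iR += 1) {
    const tempRow = {}                    // pack this row into a temporary object
    labels.forEach(label => {
      tempRow[label] = table[label][iR]
    })
    if (rowFunc(tempRow)) {               // reuse the row-major predicate as-is
      labels.forEach(label => {
        result[label].push(table[label][iR])
      })
    }
  }
  return result
}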

How can we test the performance of our implementations?

Now that we have our tables and operations, we can build a test harness to run those operations on data tables of varying sizes. We arbitrarily decide to keep half of the columns and one-third of the rows; these ratios will affect our decision about which is better, so if we were doing this for a real application we would test what happens as these fractions vary. And as we said earlier, the relative performance will depend on the ratio of filters to selects; our balance should be based on data from whatever application we intend to support.

Our performance measurement program looks like this:

const RANGE = 3

const main = () => {
  const nRows = parseInt(process.argv[2])
  const nCols = parseInt(process.argv[3])
  const filterPerSelect = parseFloat(process.argv[4])

  const labels = [...Array(nCols).keys()].map(i => `label_${i + 1}`)
  const someLabels = labels.slice(0, Math.floor(labels.length / 2))
  assert(someLabels.length > 0,
    'Must have some labels for select (array too short)')

  const [rowTable, rowSize, rowHeap] = memory(buildRows, nRows, labels)
  const [colTable, colSize, colHeap] = memory(buildCols, nRows, labels)

  const rowFilterTime =
    time(rowFilter, rowTable,
      row => ((row.label_1 % RANGE) === 0))
  const rowSelectTime =
    time(rowSelect, rowTable, someLabels)
  const colFilterTime =
    time(colFilter, colTable,
      (table, iR) => ((table.label_1[iR] % RANGE) === 0))
  const colSelectTime =
    time(colSelect, colTable, someLabels)

  const ratio = calculateRatio(filterPerSelect,
    rowFilterTime, rowSelectTime,
    colFilterTime, colSelectTime)

  const result = {
    nRows,
    nCols,
    filterPerSelect,
    rowSize,
    rowHeap,
    colSize,
    colHeap,
    rowFilterTime,
    rowSelectTime,
    colFilterTime,
    colSelectTime,
    ratio
  }
  console.log(yaml.safeDump(result))
}

The functions that actually do the measurements use the microtime library to get microsecond-level timing, because JavaScript's Date only gives us millisecond-level resolution. We use object-sizeof to estimate how much memory our structures require; we also call process.memoryUsage() and look at the heapUsed value, but that value may be affected by garbage collection and a host of other factors.

const memory = (func, ...params) => {
  const before = process.memoryUsage()
  const result = func(...params)
  const after = process.memoryUsage()
  const heap = after.heapUsed - before.heapUsed
  const size = sizeof(result)
  return [result, size, heap]
}

const time = (func, ...params) => {
  const before = microtime.now()
  func(...params)
  const after = microtime.now()
  return after - before
}

const calculateRatio = (f2S, rFilterT, rSelectT, cFilterT, cSelectT) => {
  // weighted total time for row-major storage divided by the weighted total
  // for column-major storage: a ratio greater than 1 favors column-major
  return ((f2S * rFilterT) + rSelectT) / ((f2S * cFilterT) + cSelectT)
}

Let's run our program for a table with 100 rows and 3 columns and a 3:1 ratio of filter to select:

node table-performance.js 100 3 3
nRows: 100
nCols: 3
filterPerSelect: 3
rowSize: 6600
rowHeap: 26512
colSize: 2442
colHeap: 8536
rowFilterTime: 62
rowSelectTime: 88
colFilterTime: 110
colSelectTime: 39
ratio: 0.7425474254742548

What if we increase the table size to 10,000 rows by 30 columns with the same 3:1 filter/select ratio?

nRows: 10000
nCols: 30
filterPerSelect: 3
rowSize: 7020000
rowHeap: 18413464
colSize: 2400462
colHeap: -3198584
rowFilterTime: 1622
rowSelectTime: 11386
colFilterTime: 4680
colSelectTime: 88
ratio: 1.1503397508493771

And if we keep the table size the same but use a 10:1 filter/select ratio?

nRows: 10000
nCols: 30
filterPerSelect: 10
rowSize: 7020000
rowHeap: 18325136
colSize: 2400462
colHeap: -3375072
rowFilterTime: 2339
rowSelectTime: 11790
colFilterTime: 3498
colSelectTime: 65
ratio: 1.0038521900413753
Relative performance of operations on row-major and column-major data tables
(all times in microseconds).

nRows  nCols  filterPerSelect  rowFilterTime  rowSelectTime  colFilterTime  colSelectTime
  100      3                3             62             88            110             39
10000     30                3           1622          11386           4680             88
10000     30               10           2339          11790           3498             65

The results in the table above show that column-major storage is better. It uses less memory (presumably because labels aren't duplicated), and the time required to construct new objects when doing select with row-major storage outweighs the cost of appending to arrays when doing filter with column-major storage. Unfortunately, the code for column-major storage is a little more complicated to write, which is a cost that doesn't show up in experiments.

What is the most efficient way to save a table?

Our data tables are going to be stored in files of some kind. If one storage scheme is much more efficient than another and we are reading or writing frequently, that could change our mind about how to implement them. Two simple text-based schemes are row-oriented and column-oriented JSON, i.e., we just print the data structures we have.
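
We have not shown the measurement code, but a minimal sketch of it might look like this, reusing the time helper defined earlier and object-sizeof to measure the resulting strings (the function name measureStringSave is ours, and the program we actually ran may differ in detail):

// Sketch: time JSON serialization of each layout and measure the string sizes.
const measureStringSave = (rowTable, colTable) => {
  const rowStringTime = time(JSON.stringify, rowTable)
  const rowStringSize = sizeof(JSON.stringify(rowTable))
  const colStringTime = time(JSON.stringify, colTable)
  const colStringSize = sizeof(JSON.stringify(colTable))
  return {rowStringTime, rowStringSize, colStringTime, colStringSize}
}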

Let's run the 10,000×30 test:

nRows: 10000
nCols: 30
rowStringTime: 50909
rowStringSize: 9393402
colStringTime: 11607
colStringSize: 2934164

The time needed for the row-major version is more than four times greater than that needed for the column-major version; again, we assume that the redundant printing of the labels is at least partly to blame.

If that diagnosis is correct, then a packed version of row-major storage ought to be faster. We save the column headers once, then copy the data values into an array of arrays and save that:

const asPackedJson = (table) => {
  const temp = {}
  temp.keys = Object.keys(table[0])
  temp.values = table.map(row => temp.keys.map(k => row[k]))
  return JSON.stringify(temp)
}
nRows: 10000
nCols: 30
packedRowStringTime: 26858
packedRowStringSize: 2974084

These results show that packing the rows takes less time than turning the data structure we have into a string. Again, we assume this is because copying data takes less time than turning labels into strings over and over, but column-major storage is still the best approach.

Does binary storage improve performance?

Let's try one more strategy for storing our tables. JavaScript stores values in tagged data structures: some bits define the value's type, and other bits store the actual data (see the figure below).

JavaScript object storage
How JavaScript uses tagged data structures to store objects.

We can save space by keeping track of the types ourselves and just storing the bits that represent the values. JavaScript has an ArrayBuffer class for exactly this purpose. It stores any value we want as a set of bits; we then access those bits through a view that presents the data as a particular type, such as Boolean (one byte per value) or number (64 bits per number). As the figure below shows, we can mix different types of data in a single ArrayBuffer, but it's up to us to keep track of which bytes belong to which values.

Packing objects for storage
Storing object values as bits with lookup information.
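
For instance, a small standalone example (separate from our table code) of mixing two views over one buffer looks like this:

// One 16-byte buffer viewed two different ways.
const buffer = new ArrayBuffer(16)
const asFloat = new Float64Array(buffer, 0, 1) // first 8 bytes hold one number
const asBytes = new Uint8Array(buffer, 8, 8)   // last 8 bytes hold 8 small integers
asFloat[0] = 3.14
asBytes[0] = 255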

To store a column-major table, we will fill an ArrayBuffer with:

  1. Two integers that hold the table's dimensions.

  2. A string with the labels joined by newline characters. (We use newlines as a separator because we assume column labels can't contain them.)

  3. The numbers themselves.

const asBinary = (table) => {
  const labels = Object.keys(table)

  const nCols = labels.length
  const nRows = table[labels[0]].length
  const dimensions = new Uint32Array([nCols, nRows])

  const allLabels = labels.join('\n')
  const encoder = new TextEncoder()
  const encodedLabels = encoder.encode(allLabels)

  const dataSize = sizeof(0) * nCols * nRows
  const totalSize =
    dimensions.byteLength + encodedLabels.byteLength + dataSize

  const buffer = new ArrayBuffer(totalSize)
  const result = new Uint8Array(buffer)
  // copy the raw bytes of the dimensions rather than their (truncated) values
  result.set(new Uint8Array(dimensions.buffer), 0)
  result.set(encodedLabels, dimensions.byteLength)

  let current = dimensions.byteLength + encodedLabels.byteLength
  labels.forEach(label => {
    const temp = new Float64Array(table[label])
    // again, copy the underlying bytes rather than the converted values
    result.set(new Uint8Array(temp.buffer), current)
    current += temp.byteLength
  })

  return result
}
nRows: 10000
nCols: 30
packedColBinaryTime: 4740
packedColBinarySize: 2400268

Packing the data table saves time because copying bits is faster than turning numbers into characters, but it doesn't save as much space as expected. The reason is that double-precision numbers are 8 bytes long, but because we have chosen simple integer values for our tests, they can be represented by just 5 characters, which is 10 bytes because JavaScript strings store two bytes per character. If we had "real" numbers, the storage benefit would probably be more pronounced.

Exercises

Varying filter behavior

How does our decision about which storage format is better change if we keep 1% of rows when filtering instead of one third? What if we keep 90% of rows?

Filtering by strings

Modify the comparison of filter and select to work with tables that contain columns of strings instead of columns of numbers and see how that changes performance. For testing, create random 4-letter strings using the characters A-Z and then filter by:

Join performance

A join combines data from two tables based on matching keys. For example, if the two tables are:

Key Left
A a1
B b1
C c1

and:

Key Right
A a2
A a3
B b2

then the join is:

Key Left Right
A a1 a2
A a1 a3
B b1 b2

Write a test to compare the performance of row-wise vs. column-wise storage when joining two tables based on matching numeric keys. Does the answer depend on the fraction of keys that match?

Join optimization

The simplest way to join two tables is to look for matching keys using a double loop. An alternative is to build an index for each table and then use it to construct matches. For example, suppose the tables are:

Key Left
A a1
B b1
C c1

and:

Key Right
A a2
A a3
B b2

The first step is to create a Map showing where each key is found in the first table:

{A: [0], B: [1], C: [2]}

The second step is to create a similar Map for the second table:

{A: [0, 1], B: [2]}

We can then loop over the keys in one of the maps, look up values in the second map, and construct all of the matches.

Write a function that joins two tables this way. Is it faster or slower than using a double loop? How does the answer depend on the number of keys and the fraction that match?

Flipping storage

Our tests showed that storing row-oriented tables as JSON is much slower than storing column-oriented tables. Write a test to determine whether converting a row-oriented table to a column-oriented table and then saving the latter is faster than saving the row-oriented table directly.

Sparse storage

A sparse matrix is one in which most of the values are zero. Instead of storing them all, a program can use a map to store non-zero values and a lookup function to return zero for anything that isn't stored explicitly:

const sparseMatrixGet = (matrix, row, col) => {
  // assume 'matrix' is a Map keyed by 'row,col' strings
  const key = `${row},${col}`
  return matrix.has(key)
    ? matrix.get(key)
    : 0
}

The same technique can be used if most of the entries in a data table are missing. Write a function that creates a sparse table in which a random 5% of the values are non-zero and the other 95% are zero, then compare the memory requirements and performance of filter and select for this implementation versus those of row-wise and column-wise storage.

Loading time

Modify the programs in this section to measure the time required to convert a data table from JSON or binary form back to a data structure.

Saving fixed-width strings

To improve performance, databases often store fixed-width strings, i.e., they limit the length of the strings in a column to some fixed size and pad strings that are shorter than that.

  1. Write a function that takes an array of strings and an integer width and creates an ArrayBuffer containing the strings padded to that width. The function should throw an exception if any of the strings are longer than the specified width.

  2. Write another function that takes an ArrayBuffer as input and returns an array of strings. This function should remove the padding so that strings shorter than the fixed width are restored to their original form.

Saving variable-width strings

Fixed-width storage is inefficient for large blocks of text such as contracts, novels, and resumés, since padding every document to the length of the longest will probably waste a lot of space. An alternative way to store these in binary is to save each entry as a (length, text) pair.

Diagram of (length, text) pairs.

  1. Write a function that takes a list of strings as input and returns an ArrayBuffer containing (length, text) pairs.

  2. Write another function that takes such an ArrayBuffer and returns an array containing the original text.

  3. Write tests with Mocha to confirm that your functions work correctly.

ASCII storage

The original ASCII standard specified a 7-bit encoding for the characters commonly used in English, and many data files still only use characters whose numeric codes are in the range 0-127.

  1. Write a function that takes an array of single-letter strings and returns an ArrayBuffer that stores them using one byte per character if all of the characters will fit into 7 bits, and multiple bytes per character if any of the characters require more than 7 bits.

  2. Write another function that takes an ArrayBuffer generated by the first function and re-creates the array of characters. The function must only take the ArrayBuffer as an argument, so the first element of the ArrayBuffer should indicate how to interpret the rest of its contents.

  3. Write tests with Mocha to check that your functions work correctly.