Data Tables

Storing and manipulating tables efficiently

Terms defined: JavaScript Object Notation, SQL, column-major storage, data frame, garbage collection, heterogeneous, homogeneous, immutable, row-major storage, tagged data, test harness

We said earlier that operations in memory are thousands of times faster than operations that touch the filesystem, but what about different in-memory operations? How do they compare with each other? Putting it another way, how can we tell which of several designs is going to be the most efficient?

The best answer is to conduct some experiments. To see how to do this, we will take a look at several ways to implement data tables with one or more named columns and zero or more rows. Each row has one value for each column, and all the values in a column have the same type (see the figure below). Data tables appear over and over again in programming, from spreadsheets and databases to the data frames in R's tidyverse packages, Python's Pandas library, or the DataForge library for JavaScript Davis2018.

Figure: The structure of a data table.

The key operations on data tables are those provided by SQL: filter, select, summarize, and join. These can be implemented in about five hundred lines of code, but their performance depends on how the data table is stored.

How can we implement data tables?

One way to store a table is row-major order, in which the values in each row are stored together in memory. This is sometimes also called heterogeneous storage because each "unit" of storage can contain values of different types. We can implement this design in JavaScript using an array of objects, each of which has the same keys (see the figure below).

Another option is column-major or homogeneous order, in which all the values in a column are stored together. In JavaScript, this could be implemented using an object whose members are all arrays of the same length.

Figure: Row-major storage vs. column-major storage for data tables.

To find out which is better, we will construct one of each, try some operations, record their execution times and memory use, and then compare them. Crucially, the answer will depend both on the implementations themselves and on the mix of operations we measure. For example, if one strategy works better for filter and another for select, the ratio of filters to selects may determine which is "best".

Immutability

All of our implementations will treat each data table as immutable: once we have created it, we will not modify its contents. This doesn't actually have much impact on performance and makes the programming easier and safer, since shared data structures are a rich source of bugs.

For our first experiment, let's build a row-major table with some number of columns. To keep it simple, we will use each row's index as the value in every column of that row.


export const buildRows = (nRows, labels) => {
  const result = []
  for (let iR = 0; iR < nRows; iR += 1) {
    const row = {}
    labels.forEach(label => {
      row[label] = iR
    })
    result.push(row)
  }
  return result
}
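For example, calling buildRows with two rows and two labels produces:

buildRows(2, ['label_1', 'label_2'])
// => [ { label_1: 0, label_2: 0 }, { label_1: 1, label_2: 1 } ]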

Next, we write filter and select for tables laid out this way. We need to provide filter with a callback function that determines which rows to keep, like the callback for Array.filter; for selecting columns, we provide a list of the keys that identify the columns we want to keep. We expect filtering to be relatively fast, since it recycles existing rows, while selecting should be relatively slow, because we have to construct a new object for each row (see the figure below).


const rowFilter = (table, func) => {
  return table.filter(row => func(row))
}

const rowSelect = (table, toKeep) => {
  return table.map(row => {
    const newRow = {}
    toKeep.forEach(label => {
      newRow[label] = row[label]
    })
    return newRow
  })
}
Figure: Operations on row-major data tables.
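A quick check of how these behave, using buildRows from above (the modulo test mirrors the one used in the benchmark later):

const table = buildRows(6, ['label_1', 'label_2'])
const kept = rowFilter(table, row => (row.label_1 % 3) === 0)
// kept reuses the rows whose label_1 is 0 or 3
const thinner = rowSelect(table, ['label_1'])
// thinner has new objects for all six rows but only the label_1 column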

Now let's do the same for column-major storage. Building the object that holds the columns is straightforward:


export const buildCols = (nRows, labels) => {
  const result = {}
  labels.forEach(label => {
    result[label] = []
    for (let iR = 0; iR < nRows; iR += 1) {
      result[label].push(iR)
    }
  })
  return result
}
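As before, a small example shows the shape of the result:

buildCols(2, ['label_1', 'label_2'])
// => { label_1: [0, 1], label_2: [0, 1] }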

Filtering is more complex because the values in each row are scattered across several arrays, but selecting is just a matter of recycling the arrays we want in the new table. We expect selecting to be relatively fast, since only the references to the columns need to be copied, but filtering will be relatively slow, since we are constructing multiple new arrays (see the figure below).


const colFilter = (table, func) => {
  const result = {}
  const labels = Object.keys(table)
  labels.forEach(label => {
    result[label] = []
  })
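  // every column has the same length, so use label_1 (which the test
  // harness below always creates) to count rows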
  for (let iR = 0; iR < table.label_1.length; iR += 1) {
    if (func(table, iR)) {
      labels.forEach(label => {
        result[label].push(table[label][iR])
      })
    }
  }
  return result
}

const colSelect = (table, toKeep) => {
  const result = {}
  toKeep.forEach(label => {
    result[label] = table[label]
  })
  return result
}
Figure: Operations on column-major data tables.
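Checking these in the same way shows why immutability matters here: colSelect shares the original column arrays rather than copying them, which is only safe because we never modify tables after building them.

const table = buildCols(6, ['label_1', 'label_2'])
const kept = colFilter(table, (tbl, iR) => (tbl.label_1[iR] % 3) === 0)
// kept contains newly-built arrays holding rows 0 and 3
const thinner = colSelect(table, ['label_1'])
// thinner.label_1 is the very same array as table.label_1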

Not quite polymorphic

Our tests would be simpler to write if the two versions of filter and select took exactly the same parameters, but the row-testing functions for filter are different because of the differences in the ways the tables are stored. We could force them to be the same by (for example) packing the values for each row in the column-major implementation into a temporary object and passing that to the same filtering function we used for the row-major implementation, but that extra work would bias the performance comparison in row-major's favor.
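For the record, the adapter would look something like this (a sketch only; colFilterRowStyle is a made-up name, and the per-row packing is exactly the overhead we want to keep out of the measurements):

const colFilterRowStyle = (table, rowFunc) => {
  return colFilter(table, (tbl, iR) => {
    // pack one row's values into a temporary object...
    const row = {}
    Object.keys(tbl).forEach(label => {
      row[label] = tbl[label][iR]
    })
    // ...so the caller can use the same callback as the row-major version
    return rowFunc(row)
  })
}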

How can we test the performance of our implementations?

Now that we have our tables and operations, we can build a test harness to run those operations on data tables of varying sizes. We arbitrarily decide to keep half of the columns and one-third of the rows; these ratios will affect our decision about which implementation is better, so if we were doing this for a real application we would test what happens as these fractions vary. And as we said earlier, the relative performance will also depend on how many filters we do for each select; our balance should be based on data from whatever application we intend to support.

Our performance measurement program looks like this:


import assert from 'assert'
import microtime from 'microtime'
import sizeof from 'object-sizeof'
import yaml from 'js-yaml'
// buildRows, buildCols, and the row/column filter and select functions
// shown earlier are assumed to be in scope (e.g., imported from their files)

const RANGE = 3

const main = () => {
  const nRows = parseInt(process.argv[2])
  const nCols = parseInt(process.argv[3])
  const filterPerSelect = parseFloat(process.argv[4])

  const labels = [...Array(nCols).keys()].map(i => `label_${i + 1}`)
  const someLabels = labels.slice(0, Math.floor(labels.length / 2))
  assert(someLabels.length > 0,
    'Must have some labels for select (array too short)')

  const [rowTable, rowSize, rowHeap] = memory(buildRows, nRows, labels)
  const [colTable, colSize, colHeap] = memory(buildCols, nRows, labels)

  const rowFilterTime =
    time(rowFilter, rowTable,
      row => ((row.label_1 % RANGE) === 0))
  const rowSelectTime =
    time(rowSelect, rowTable, someLabels)
  const colFilterTime =
    time(colFilter, colTable,
      (table, iR) => ((table.label_1[iR] % RANGE) === 0))
  const colSelectTime =
    time(colSelect, colTable, someLabels)

  const ratio = calculateRatio(filterPerSelect,
    rowFilterTime, rowSelectTime,
    colFilterTime, colSelectTime)

  const result = {
    nRows,
    nCols,
    filterPerSelect,
    rowSize,
    rowHeap,
    colSize,
    colHeap,
    rowFilterTime,
    rowSelectTime,
    colFilterTime,
    colSelectTime,
    ratio
  }
  console.log(yaml.safeDump(result))
}

The functions that actually do the measurements use the microtime library to get microsecond-level timing, because JavaScript's Date only gives us millisecond-level resolution. We use object-sizeof to estimate how much memory our structures require; we also call process.memoryUsage() and look at the heapUsed value to see how much memory Node is using while the program runs, but that may be affected by garbage collection and a host of other factors outside our control.


const memory = (func, ...params) => {
  const before = process.memoryUsage()
  const result = func(...params)
  const after = process.memoryUsage()
  const heap = after.heapUsed - before.heapUsed
  const size = sizeof(result)
  return [result, size, heap]
}

const time = (func, ...params) => {
  const before = microtime.now()
  func(...params)
  const after = microtime.now()
  return after - before
}

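// Weighted row-major time divided by weighted column-major time, where f2S
// is the number of filter calls per select call; values greater than 1 mean
// column-major storage is faster for that mix of operations.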
const calculateRatio = (f2S, rFilterT, rSelectT, cFilterT, cSelectT) => {
  return ((f2S * rFilterT) + rSelectT) / ((f2S * cFilterT) + cSelectT)
}

main()

Let's run our program for a table with 100 rows and 3 columns and a 3:1 ratio of filter to select:

node table-performance.js 100 3 3
nRows: 100
nCols: 3
filterPerSelect: 3
rowSize: 6600
rowHeap: 26512
colSize: 2442
colHeap: 8536
rowFilterTime: 75
rowSelectTime: 111
colFilterTime: 137
colSelectTime: 48
ratio: 0.7320261437908496

What if we increase the table size to 10,000 rows by 30 columns with the same 3:1 filter/select ratio?

nRows: 10000
nCols: 30
filterPerSelect: 3
rowSize: 7020000
rowHeap: 18392064
colSize: 2400462
colHeap: -3473800
rowFilterTime: 2929
rowSelectTime: 15863
colFilterTime: 4529
colSelectTime: 104
ratio: 1.8004528522386969

And if we keep the table size the same but use a 10:1 filter/select ratio?

nRows: 10000
nCols: 30
filterPerSelect: 10
rowSize: 7020000
rowHeap: 18287160
colSize: 2400462
colHeap: -3645056
rowFilterTime: 2376
rowSelectTime: 15566
colFilterTime: 4380
colSelectTime: 90
ratio: 0.8960127591706539
Relative performance of operations on row-major and column-major data tables (times in microseconds).
nRows nCols filterPerSelect rowFilterTime rowSelectTime colFilterTime colSelectTime
100 3 3 75 111 137 48
10000 30 3 2929 15863 4529 104
10000 30 10 2376 15566 4380 90

The results in the table above show that column-major storage is better. It uses less memory (presumably because column labels aren't duplicated once per row), and the time required to construct new objects when doing select with row-major storage outweighs the cost of appending to arrays when doing filter with column-major storage. Unfortunately, the code for column-major storage is a little more complicated to write, which is a cost that doesn't show up in experiments.

What is the most efficient way to save a table?

Data is valuable, so we are going to store data tables in files of some kind. If one storage scheme is much more efficient than another and we are reading or writing frequently, that could change our mind about which implementation to pick.

Two simple text-based schemes are row-oriented and column-oriented JSON—basically, just printing the data structures we have. Let's run the 10,000×30 test:

nRows: 10000
nCols: 30
rowStringTime: 57342
rowStringSize: 9393402
colStringTime: 13267
colStringSize: 2934164
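The measurement code itself is not shown above, but with the time helper defined earlier it might look something like this (a sketch; the variable names are assumptions):

const rowString = JSON.stringify(rowTable)
const rowStringTime = time(JSON.stringify, rowTable)
const colString = JSON.stringify(colTable)
const colStringTime = time(JSON.stringify, colTable)
// report string lengths as a proxy for on-disk size
const rowStringSize = rowString.length
const colStringSize = colString.length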

The time needed for the row-major version is more than four times greater than that needed for the column-major version; we assume that the redundant printing of the labels is mostly to blame, just as redundant storage of the labels was to blame for row-major's greater memory requirements.

If that diagnosis is correct, then a packed version of row-major storage ought to be faster. We save the column headers once, then copy the data values into an array of arrays and save that:


const asPackedJson = (table) => {
  const temp = {}
  temp.keys = Object.keys(table[0])
  temp.values = table.map(row => temp.keys.map(k => row[k]))
  return JSON.stringify(temp)
}
nRows: 10000
nCols: 30
packedRowStringTime: 29659
packedRowStringSize: 2974084

These results show that changing the data's layout before saving it is faster than turning the row-major structure we already have directly into a string. Again, we assume this is because copying data takes less time than turning labels into strings over and over, but column-major storage is still the best approach.

Does binary storage improve performance?

Let's try one more strategy for storing our tables. JavaScript stores values in tagged data structures: some bits define the value's type, while other bits store the value itself in a type-dependent way (see the figure below).

Figure: How JavaScript uses tagged data structures to store objects.

We can save space by keeping track of the types ourselves and just storing the bits that represent the values. JavaScript has an ArrayBuffer class for exactly this purpose. It stores any value we want as a set of bits; we then access those bits through a view that presents the data as a particular type, such as Boolean (one byte per value) or number (64 bits per number). As the figure below shows, we can mix different types of data in a single ArrayBuffer, but it's up to us to keep track of which bytes belong to which values.

Figure: Storing object values as bits with lookup information.
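Here is a minimal example of the idea, separate from our table code, showing two views of the same eight bytes:

const buffer = new ArrayBuffer(8)          // eight raw bytes
const asNumber = new Float64Array(buffer)  // view them as one 64-bit number
asNumber[0] = 1.25
const asBytes = new Uint8Array(buffer)     // view the same bits as bytes
console.log(asBytes)                       // the IEEE 754 byte pattern of 1.25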

To store a column-major table we will fill an ArrayBuffer with:

  1. Two integers that hold the table's size (number of rows and number of columns).

  2. A string with the column labels joined by newline characters. (We use newlines as a separator because we assume column labels can't contain them.)

  3. The numbers themselves.


const asBinary = (table) => {
  const labels = Object.keys(table)

  const nCols = labels.length
  const nRows = table[labels[0]].length
  const dimensions = new Uint32Array([nCols, nRows])

  const allLabels = labels.join('\n')
  const encoder = new TextEncoder()
  const encodedLabels = encoder.encode(allLabels)

  const dataSize = sizeof(0) * nCols * nRows
  const totalSize =
    dimensions.byteLength + encodedLabels.byteLength + dataSize

  const buffer = new ArrayBuffer(totalSize)
  const result = new Uint8Array(buffer)
  // copy the dimensions' raw bytes: passing the Uint32Array directly to
  // set() would convert each value element by element instead
  result.set(new Uint8Array(dimensions.buffer), 0)
  result.set(encodedLabels, dimensions.byteLength)

  let current = dimensions.byteLength + encodedLabels.byteLength
  labels.forEach(label => {
    const temp = new Float64Array(table[label])
    // again, copy the underlying bytes rather than element-converted values
    result.set(new Uint8Array(temp.buffer), current)
    current += temp.byteLength
  })

  return result
}
nRows: 10000
nCols: 30
packedColBinaryTime: 6074
packedColBinarySize: 2400268

Packing the data table saves time because copying bits is faster than turning numbers into characters, but it doesn't save as much space as expected. The reason is that double-precision numbers are 8 bytes long, but because we have chosen simple integer values for our tests, each one can be represented by just 5 characters (which is 10 bytes in JavaScript's two-byte-per-character strings). If we had "real" numbers the storage benefit would probably be more pronounced; once again, the result of our experiment depends on the test cases we choose.

Engineering

If science is the use of the experimental method to investigate the world, engineering is the use of the experimental method to investigate and improve the things that people build. Good software designers collect and analyze data all the time to find out whether one website design works better than another Kohavi2020 or to improve the performance of CPUs Patterson2017; a few simple experiments like these can sometimes save weeks or months of effort.

Exercises

Varying filter behavior

How does our decision about which storage format is better change if we keep 1% of rows when filtering instead of one third? What if we keep 90% of rows?

Filtering by strings

Modify the comparison of filter and select to work with tables that contain columns of strings instead of columns of numbers and see how that changes performance. For testing, create random 4-letter strings using the characters A-Z and then filter by (for example) exact matches to a target string.

Join performance

A join combines data from two tables based on matching keys. For example, if the two tables are:

Key Left
A a1
B b1
C c1

and:

Key Right
A a2
A a3
B b2

then the join is:

Key Left Right
A a1 a2
A a1 a3
B b1 b2

Write a test to compare the performance of row-wise vs. column-wise storage when joining two tables based on matching numeric keys. Does the answer depend on the fraction of keys that match?

Join optimization

The simplest way to join two tables is to look for matching keys using a double loop. An alternative is to build an index for each table and then use it to construct matches. For example, suppose the tables are:

Key Left
A a1
B b1
C c1

and:

Key Right
A a2
A a3
B b2

The first step is to create a Map showing where each key is found in the first table:

{A: [0], B: [1], C: [2]}

The second step is to create a similar Map for the second table:

{A: [0, 1], B: [2]}

We can then loop over the keys in one of the maps, look up values in the second map, and construct all of the matches.

Write a function that joins two tables this way. Is it faster or slower than using a double loop? How does the answer depend on the number of keys and the fraction that match?

Flipping storage

Our tests showed that storing row-oriented tables as JSON is much slower than storing column-oriented tables. Write a test to determine whether converting a row-oriented table to a column-oriented table and then saving the latter is faster than saving the row-oriented table directly.

Sparse storage

A sparse matrix is one in which most of the values are zero. Instead of storing them all, a program can use a map to store non-zero values and a lookup function to return zero for anything that isn't stored explicitly:

const sparseMatrixGet = (matrix, row, col) => {
  // one possible scheme: matrix is a Map keyed by a "row,col" string
  const key = `${row},${col}`
  return matrix.has(key)
    ? matrix.get(key)
    : 0
}

The same technique can be used if most of the entries in a data table are missing. Write a function that creates a sparse table in which a random 5% of the values are non-zero and the other 95% are zero, then compare the memory requirements and performance of filter and select for this implementation versus those of row-wise and column-wise storage.

Loading time

Modify the programs in this section to measure the time required to convert a data table from JSON or binary form back to a data structure.

Saving fixed-width strings

To improve performance, databases often store fixed-width strings, i.e., they limit the length of the strings in a column to some fixed size and pad strings that are shorter than that.

  1. Write a function that takes an array of strings and an integer width and creates an ArrayBuffer containing the strings padded to that width. The function should throw an exception if any of the strings are longer than the specified width.

  2. Write another function that takes an ArrayBuffer as input and returns an array of strings. This function should remove the padding so that strings shorter than the fixed width are restored to their original form.

Saving variable-width strings

Fixed-width storage is inefficient for large blocks of text such as contracts, novels, and resumés, since padding every document to the length of the longest will probably waste a lot of space. An alternative way to store these in binary is to save each entry as a (length, text) pair.

  1. Write a function that takes a list of strings as input and returns an ArrayBuffer containing (length, text) pairs.

  2. Write another function that takes such an ArrayBuffer and returns an array containing the original text.

  3. Write tests with Mocha to confirm that your functions work correctly.

ASCII storage

The original ASCII standard specified a 7-bit character encoding for letters commonly used in English, and many data files still only use characters whose numeric codes are in the range 0–127.

  1. Write a function that takes an array of single-letter strings and returns an ArrayBuffer that stores them using one byte per character if all of the characters will fit into 7 bits, and multiple bytes per character if any of the characters require more than 7 bits.

  2. Write another function that takes an ArrayBuffer generated by the first function and re-creates the array of characters. The function must only take the ArrayBuffer as an argument, so the first element of the ArrayBuffer should indicate how to interpret the rest of its contents.

  3. Write tests with Mocha to check that your functions work correctly.