Virtual Machine

Assembling and running low-level code

Terms defined: Application Binary Interface, assembler, assembly code, bitwise operation, instruction pointer, instruction set, label (address in memory), op code, register, virtual machine, word (of memory)

Computers don't execute JavaScript directly. Instead, each processor has its own instruction set, and a compiler translates high-level languages into those instructions. Compilers often use an intermediate representation called assembly code that gives instructions human-readable names instead of numbers. To understand more about how JavaScript actually runs we will simulate a very simple processor with a little bit of memory. If you want to dive deeper, have a look at Bob Nystrom's excellent Crafting Interpreters. You may also enjoy Human Resource Machine, which asks you to solve puzzles of increasing difficulty using a processor almost as simple as ours.

What is the architecture of our virtual machine?

Our virtual machine has three parts, which are shown in for a program made up of 110 instructions:

  1. An instruction pointer (IP) that holds the memory address of the next instruction to execute. It is automatically initialized to point at address 0, which is where every program must start. This rule is part of the Application Binary Interface (ABI) for our virtual machine.

  2. Four registers named R0 to R4 that instructions can access directly. There are no memory-to-memory operations in our VM: everything happens in or through registers.

  3. 256 words of memory, each of which can store a single value. Both the program and its data live in this single block of memory; we chose the size 256 so that each address will fit in a single byte.

Virtual machine architecture
Architecture of the virtual machine.

The instructions for our VM are 3 bytes long. The op code fits into one byte, and each instruction may optionally include one or two single-byte operands. Each operand is a register identifier, a constant, or an address (which is just a constant that identifies a location in memory); since constants have to fit in one byte, the largest number we can represent directly is 256. uses the letters r, c, and a to indicate instruction format, where r indicates a register identifier, c indicates a constant, and a indicates an address.

Virtual machine op codes.
Instruction Code Format Action Example Equivalent
hlt 1 -- Halt program hlt process.exit(0)
ldc 2 rc Load immediate ldc R0 123 R0 := 123
ldr 3 rr Load register ldr R0 R1 R0 := RAM[R1]
cpy 4 rr Copy register cpy R0 R1 R0 := R1
str 5 rr Store register str R0 R1 RAM[R1] := R0
add 6 rr Add add R0 R1 R0 := R0 + R1
sub 7 rr Subtract sub R0 R1 R0 := R0 - R1
beq 8 ra Branch if equal beq R0 123 if (R0 === 0) PC := 123
bne 9 ra Branch if not equal bne R0 123 if (R0 !== 0) PC := 123
prr 10 r- Print register prr R0 console.log(R0)
prm 11 r- Print memory prm R0 console.log(RAM[R0])

We put our VM's architectural details in a file that can be shared by other components:

const OPS = {
  hlt: { code:  1, fmt: '--' }, // Halt program
  ldc: { code:  2, fmt: 'rv' }, // Load immediate
  ldr: { code:  3, fmt: 'rr' }, // Load register
  cpy: { code:  4, fmt: 'rr' }, // Copy register
  str: { code:  5, fmt: 'rr' }, // Store register
  add: { code:  6, fmt: 'rr' }, // Add
  sub: { code:  7, fmt: 'rr' }, // Subtract
  beq: { code:  8, fmt: 'rv' }, // Branch if equal
  bne: { code:  9, fmt: 'rv' }, // Branch if not equal
  prr: { code: 10, fmt: 'r-' }, // Print register
  prm: { code: 11, fmt: 'r-' }  // Print memory
}

const OP_MASK = 0xFF // select a single byte
const OP_SHIFT = 8   // shift up by one byte
const OP_WIDTH = 6   // op width in characters when printing

const NUM_REG = 4    // number of registers
const RAM_LEN = 256  // number of words in RAM

export {
  OPS,
  OP_MASK,
  OP_SHIFT,
  OP_WIDTH,
  NUM_REG,
  RAM_LEN
}

While there isn't a name for this design pattern, putting all the constants that define a system in one file instead of scattering them across multiple files makes them easier to find as well as ensuring consistency.

How can we execute these instructions?

As in previous chapters, we will split a class that would normally be written in one piece into several parts for exposition. We start by defining a class with an instruction pointer, some registers, and some memory along with a prompt for output:

import assert from 'assert'

import {
  OP_MASK,
  OP_SHIFT,
  NUM_REG,
  RAM_LEN
} from './architecture.js'

const COLUMNS = 4
const DIGITS = 8

class VirtualMachineBase {
  constructor () {
    this.ip = 0
    this.reg = Array(NUM_REG)
    this.ram = Array(RAM_LEN)
    this.prompt = '>>'
  }

}

export default VirtualMachineBase

A program is just an array of numbers representing instructions. To load one, we copy those numbers into memory and reset the instruction pointer and registers:


  initialize (program) {
    assert(program.length <= this.ram.length,
      'Program is too long for memory')
    for (let i = 0; i < this.ram.length; i += 1) {
      if (i < program.length) {
        this.ram[i] = program[i]
      } else {
        this.ram[i] = 0
      }
    }
    this.ip = 0
    this.reg.fill(0)
  }

In order to handle the next instruction, the VM gets the value in memory that the instruction pointer currently refers to and moves the instruction pointer on by one address. It then uses bitwise operations to extract the op code and operands from the instruction ():


  fetch () {
    assert((0 <= this.ip) && (this.ip < RAM_LEN),
      `Program counter ${this.ip} out of range 0..${RAM_LEN}`)
    let instruction = this.ram[this.ip]
    this.ip += 1
    const op = instruction & OP_MASK
    instruction >>= OP_SHIFT
    const arg0 = instruction & OP_MASK
    instruction >>= OP_SHIFT
    const arg1 = instruction & OP_MASK
    return [op, arg0, arg1]
  }
Unpacking instructions
Using bitwise operations to unpack instructions.

Semi-realistic

We always unpack two operands regardless of whether the instructions has them or not, since this is what a hardware implementation would be. We have also included assertions in our VM to simulate the way that real hardware includes logic to detect illegal instructions and out-of-bound memory addresses.

The next step is to extend our base class with one that has a run method. As its name suggests, this runs the program by fetching instructions and executing them until told to stop:

import assert from 'assert'

import {
  OPS
} from './architecture.js'

import VirtualMachineBase from './vm-base.js'

class VirtualMachine extends VirtualMachineBase {
  run () {
    let running = true
    while (running) {
      const [op, arg0, arg1] = this.fetch()
      switch (op) {
        case OPS.hlt.code:
          running = false
          break

        case OPS.ldc.code:
          this.assertIsRegister(arg0, op)
          this.reg[arg0] = arg1
          break


        default:
          assert(false, `Unknown op ${op}`)
          break
      }
    }
  }

  assertIsRegister (reg) {
    assert((0 <= reg) && (reg < this.reg.length),
      `Invalid register ${reg}`)
  }

  assertIsAddress (addr) {
    assert((0 <= addr) && (addr < this.ram.length),
      `Invalid register ${addr}`)
  }
}

export default VirtualMachine

Some instructions are very similar to others, so we will only look at three here. The first stores the value of one register in the address held by another register:


        case OPS.str.code:
          this.assertIsRegister(arg0, op)
          this.assertIsRegister(arg1, op)
          this.assertIsAddress(this.reg[arg1], op)
          this.ram[this.reg[arg1]] = this.reg[arg0]
          break

The first three lines check that the operation is legal; the fourth one uses the value in one register as an address, which is why it has nested array indexing.

Adding the value in one register to the value in another register is simpler:


        case OPS.add.code:
          this.assertIsRegister(arg0, op)
          this.assertIsRegister(arg1, op)
          this.reg[arg0] += this.reg[arg1]
          break

as is jumping to a fixed address if the value in a register is zero:


        case OPS.beq.code:
          this.assertIsRegister(arg0, op)
          this.assertIsAddress(arg1, op)
          if (this.reg[arg0] === 0) {
            this.ip = arg1
          }
          break

What do assembly programs look like?

We could figure out numerical op codes by hand, and in fact that's what the first programmers did. However, it is much easier to use an assembler, which is just a small compiler for a language that very closely represents actual machine instructions.

Each command in our assembly languages matches an instruction in the VM. Here's an assembly language program to print the value stored in R1 and then halt:

# Print initial contents of R1.
prr R1
hlt

Its numeric representation is:

00010a
000001

One thing the assembly language has that the instruction set doesn't is labels on addresses. The label loop doesn't take up any space; instead, it tells the assembler to give the address of the next instruction a name so that we can refer to that address as @loop in jump instructions. For example, this program prints the numbers from 0 to 2 ():

# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt
000002
030102
00000a
010202
020006
010204
000207
020209
000001
Counting from 0 to 2
Flowchart of assembly language program to count up from 0 to 2.

Let's trace this program's execution ():

  1. R0 holds the current loop index.
  2. R1 holds the loop's upper bound (in this case 3).
  3. The loop prints the value of R0 (one instruction).
  4. The program adds 1 to R0. This takes two instructions because we can only add register-to-register.
  5. It checks to see if we should loop again, which takes three instructions.
  6. If the program doesn't jump back, it halts.
Trace counting program
Tracing registers and memory values for a simple counting program.

The implementation of the assembler mirrors the simplicity of assembly language. The main method gets interesting lines, finds the addresses of labels, and turns each remaining line into an instruction:


  assemble (lines) {
    lines = this.cleanLines(lines)
    const labels = this.findLabels(lines)
    const instructions = lines.filter(line => !this.isLabel(line))
    const compiled = instructions.map(instr => this.compile(instr, labels))
    const program = this.instructionsToText(compiled)
    return program
  }

  cleanLines (lines) {
    return lines
      .map(line => line.trim())
      .filter(line => line.length > 0)
      .filter(line => !this.isComment(line))
  }

  isComment (line) {
    return line.startsWith('#')
  }

To find labels, we go through the lines one by one and either save the label or increment the current address (because labels don't take up space):


  findLabels (lines) {
    const result = {}
    let index = 0
    lines.forEach(line => {
      if (this.isLabel(line)) {
        const label = line.slice(0, -1)
        assert(!(label in result),
          `Duplicate label ${label}`)
        result[label] = index
      } else {
        index += 1
      }
    })
    return result
  }

  isLabel (line) {
    return line.endsWith(':')
  }

To compile a single instruction we break the line into tokens, look up the format for the operands, and pack them into a single value:


  compile (instruction, labels) {
    const [op, ...args] = instruction.split(/\s+/)
    assert(op in OPS,
      `Unknown operation "${op}"`)
    let result = 0
    switch (OPS[op].fmt) {
      case '--':
        result = this.combine(
          OPS[op].code
        )
        break
      case 'r-':
        result = this.combine(
          this.register(args[0]),
          OPS[op].code
        )
        break
      case 'rr':
        result = this.combine(
          this.register(args[1]),
          this.register(args[0]),
          OPS[op].code
        )
        break
      case 'rv':
        result = this.combine(
          this.value(args[1], labels),
          this.register(args[0]),
          OPS[op].code
        )
        break
      default:
        assert(false,
          `Unknown instruction format ${OPS[op].fmt}`)
    }
    return result
  }

Combining op codes and operands into a single value is the reverse of the unpacking done by the virtual machine:


  combine (...args) {
    assert(args.length > 0,
      'Cannot combine no arguments')
    let result = 0
    for (const a of args) {
      result <<= OP_SHIFT
      result |= a
    }
    return result
  }

Finally, we need few utility functions:


  instructionsToText (program) {
    return program.map(op => op.toString(16).padStart(OP_WIDTH, '0'))
  }

  register (token) {
    assert(token[0] === 'R',
      `Register "${token}" does not start with 'R'`)
    const r = parseInt(token.slice(1))
    assert((0 <= r) && (r < NUM_REG),
      `Illegal register ${token}`)
    return r
  }

  value (token, labels) {
    if (token[0] !== '@') {
      return parseInt(token)
    }
    const labelName = token.slice(1)
    assert(labelName in labels,
      `Unknown label "${token}"`)
    return labels[labelName]
  }

Let's try assembling a program and display its output, the registers, and the interesting contents of memory. As a test, this program counts up to three:

# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt
>> 0
>> 1
>> 2
R0 = 3
R1 = 3
R2 = 0
R3 = 0
0:   00000002  00030102  0000000a  00010202
4:   00020006  00010204  00000207  00020209
8:   00000001  00000000  00000000  00000000

How can we store data?

It is tedious to write interesting programs when each value needs a unique name. We can do a lot more once we have collections like arrays, so let's add those to our assembler. We don't have to make any changes to the virtual machine, which doesn't care if we think of a bunch of numbers as individuals or elements of an array, but we do need a way to create arrays and refer to them.

We will allocate storage for arrays at the end of the program by using .data on a line of its own to mark the start of the data section and then label: number to give a region a name and allocate some storage space ().

Storage allocation
Allocating storage for arrays in the virtual machine.

This enhancement only requires a few changes to the assembler. First, we need to split the lines into instructions and data allocations:


  assemble (lines) {
    lines = this.cleanLines(lines)
    const [toCompile, toAllocate] = this.splitAllocations(lines)
    const labels = this.findLabels(lines)
    const instructions = toCompile.filter(line => !this.isLabel(line))
    const baseOfData = instructions.length
    this.addAllocations(baseOfData, labels, toAllocate)
    const compiled = instructions.map(instr => this.compile(instr, labels))
    const program = this.instructionsToText(compiled)
    return program
  }

  splitAllocations (lines) {
    const split = lines.indexOf(DIVIDER)
    if (split === -1) {
      return [lines, []]
    } else {
      return [lines.slice(0, split), lines.slice(split + 1)]
    }
  }

Second, we need to figure out where each allocation lies and create a label accordingly:


  addAllocations (baseOfData, labels, toAllocate) {
    toAllocate.forEach(alloc => {
      const fields = alloc.split(':').map(a => a.trim())
      assert(fields.length === 2,
        `Invalid allocation directive "${alloc}"`)
      const [label, numWordsText] = fields
      assert(!(label in labels),
        `Duplicate label "${label}" in data allocation`)
      const numWords = parseInt(numWordsText)
      assert((baseOfData + numWords) < RAM_LEN,
        `Allocation "${label}" requires too much memory`)
      labels[label] = baseOfData
      baseOfData += numWords
    })
  }

And that's it: no other changes are needed to either compilation or execution. To test it, let's fill an array with the numbers from 0 to 3:

# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
# - R2: array index.
# - R3: temporary.
ldc R0 0
ldc R1 3
ldc R2 @array
loop:
str R0 R2
ldc R3 1
add R0 R3
add R2 R3
cpy R3 R1
sub R3 R0
bne R3 @loop
hlt
.data
array: 10
R0 = 3
R1 = 3
R2 = 14
R3 = 0
0:   00000002  00030102  000b0202  00020005
4:   00010302  00030006  00030206  00010304
8:   00000307  00030309  00000001  00000000
c:   00000001  00000002  00000000  00000000

How does it actually work?

Our VM is just another program. If you'd like to know what happens when instructions finally meet hardware, and how electrical circuits are able to do arithmetic, make decisions, and talk to the world, Patterson2017 has everything you want to know and more.

Exercises

Swapping values

Write an assembly language program that swaps the values in R1 and R2 without affecting the values in other registers.

Reversing an array

Write an assembly language program that starts with:

and reverses the array in place.

Increment and decrement

  1. Add instructions inc and dec that add one to the value of a register and subtract one from the value of a register respectively.

  2. Rewrite the examples to use these instructions. How much shorter do they make the programs? How much easier to read?

Using long addresses

  1. Modify the virtual machine so that the ldr and str instructions contain 16-bit addresses rather than 8-bit addresses and increase the virtual machine's memory to 64K words to match.

  2. How does this complicate instruction interpretation?

Operating on strings

The C programming language stored character strings as non-zero bytes terminated by a byte containing zero.

  1. Write a program that starts with the base address of a string in R1 and finishes with the length of the string (not including the terminator) in the same register.

  2. Write a program that starts with the base address of a string in R1 and the base address of some other block of memory in R2 and copies the string to that new location (including the terminator).

  3. What happens in each case if the terminator is missing?

Call and return

  1. Add another register to the virtual machine called SP (for "stack pointer") that is automatically initialized to the last address in memory.

  2. Add an instruction psh (short for "push") that copies a value from a register to the address stored in SP and then subtracts one from SP.

  3. Add an instruction pop (short for "pop") that adds one to SP and then copies a value from that address into a register.

  4. Using these instructions, write a subroutine that evaluates 2x+1 for every value in an array.

Disassembling instructions

A disassembler turns machine instructions into assembly code. Write a disassembler for the instruction set used by our virtual machine. (Since the labels for addresses are not stored in machine instructions, disassemblers typically generate labels like @L001 and @L002.)

Linking multiple files

  1. Modify the assembler to handle .include filename directives.

  2. What does your modified assembler do about duplicate label names? How does it prevent infinite includes (i.e., A.as includes B.as which includes A.as again)?

Providing system calls

Modify the virtual machine so that developers can add "system calls" to it.

  1. On startup, the virtual machine loads an array of functions defined in a file called syscalls.js.

  2. The sys instruction takes a one-byte constant argument. It looks up the corresponding function and calls it with the values of R0-R3 as parameters and places the result in R0.

Unit testing

  1. Write unit tests for the assembler.

  2. Once they are working, write unit tests for the virtual machine.