Virtual Machine
Assembling and running low-level code
Terms defined:
Computers don't execute JavaScript directly.
Each processor has its own
What is the architecture of our virtual machine?
Our
-
An
instruction pointer (IP) that holds the memory address of the next instruction to execute. It is automatically initialized to point at address 0, which is where every program must start. This rule is part of theApplication Binary Interface (ABI) for our virtual machine. -
Four
registers named R0 to R4 that instructions can access directly. There are no memory-to-memory operations in our VM: everything happens in or through registers. -
256
words of memory, each of which can store a single value. Both the program and its data live in this single block of memory; we chose the size 256 so that each address will fit in a single byte.
The instructions for our VM are 3 bytes long.
The r
, c
, and a
to indicate instruction format,
where r
indicates a register identifier,
c
indicates a constant,
and a
indicates an address.
Instruction | Code | Format | Action | Example | Equivalent |
---|---|---|---|---|---|
hlt |
1 | -- |
Halt program | hlt |
process.exit(0) |
ldc |
2 | rc |
Load immediate | ldc R0 123 |
R0 := 123 |
ldr |
3 | rr |
Load register | ldr R0 R1 |
R0 := RAM[R1] |
cpy |
4 | rr |
Copy register | cpy R0 R1 |
R0 := R1 |
str |
5 | rr |
Store register | str R0 R1 |
RAM[R1] := R0 |
add |
6 | rr |
Add | add R0 R1 |
R0 := R0 + R1 |
sub |
7 | rr |
Subtract | sub R0 R1 |
R0 := R0 - R1 |
beq |
8 | ra |
Branch if equal | beq R0 123 |
if (R0 === 0) PC := 123 |
bne |
9 | ra |
Branch if not equal | bne R0 123 |
if (R0 !== 0) PC := 123 |
prr |
10 | r- |
Print register | prr R0 |
console.log(R0) |
prm |
11 | r- |
Print memory | prm R0 |
console.log(RAM[R0]) |
We put our VM's architectural details in a file that can be shared by other components:
const OPS = {
hlt: { code: 1, fmt: '--' }, // Halt program
ldc: { code: 2, fmt: 'rv' }, // Load immediate
ldr: { code: 3, fmt: 'rr' }, // Load register
cpy: { code: 4, fmt: 'rr' }, // Copy register
str: { code: 5, fmt: 'rr' }, // Store register
add: { code: 6, fmt: 'rr' }, // Add
sub: { code: 7, fmt: 'rr' }, // Subtract
beq: { code: 8, fmt: 'rv' }, // Branch if equal
bne: { code: 9, fmt: 'rv' }, // Branch if not equal
prr: { code: 10, fmt: 'r-' }, // Print register
prm: { code: 11, fmt: 'r-' } // Print memory
}
const OP_MASK = 0xFF // select a single byte
const OP_SHIFT = 8 // shift up by one byte
const OP_WIDTH = 6 // op width in characters when printing
const NUM_REG = 4 // number of registers
const RAM_LEN = 256 // number of words in RAM
export {
OPS,
OP_MASK,
OP_SHIFT,
OP_WIDTH,
NUM_REG,
RAM_LEN
}
How can we execute these instructions?
As in previous chapters, we will split a class that would normally be written in one piece into several pieces for exposition. To start, we define a class with an instruction pointer, some registers, and some memory, along with a prompt for output:
import assert from 'assert'
import {
OP_MASK,
OP_SHIFT,
NUM_REG,
RAM_LEN
} from './architecture.js'
const COLUMNS = 4
const DIGITS = 8
class VirtualMachineBase {
constructor () {
this.ip = 0
this.reg = Array(NUM_REG)
this.ram = Array(RAM_LEN)
this.prompt = '>>'
}
...
}
export default VirtualMachineBase
A program is just an array of numbers. To load one, we copy those numbers into RAM and reset the instruction pointer and registers:
initialize (program) {
assert(program.length <= this.ram.length,
'Program is too long for memory')
for (let i = 0; i < this.ram.length; i += 1) {
if (i < program.length) {
this.ram[i] = program[i]
} else {
this.ram[i] = 0
}
}
this.ip = 0
this.reg.fill(0)
}
To handle the next instruction,
the VM gets the value in memory that the instruction pointer currently refers to
and moves the instruction pointer on by one address.
It then uses
fetch () {
assert((0 <= this.ip) && (this.ip < RAM_LEN),
`Program counter ${this.ip} out of range 0..${RAM_LEN}`)
let instruction = this.ram[this.ip]
this.ip += 1
const op = instruction & OP_MASK
instruction >>= OP_SHIFT
const arg0 = instruction & OP_MASK
instruction >>= OP_SHIFT
const arg1 = instruction & OP_MASK
return [op, arg0, arg1]
}
Semi-realistic
We always unpack two operands regardless of whether the instructions has them or not, since this is what a hardware implementation would be. We have also included assertions in our VM to simulate the way that real hardware includes logic to detect illegal instructions and out-of-bound memory addresses.
The next step is to extend our base class with one that has a run
method.
As its name suggests,
this runs the program by fetching instructions and taking action until told to stop:
import assert from 'assert'
import {
OPS
} from './architecture.js'
import VirtualMachineBase from './vm-base.js'
class VirtualMachine extends VirtualMachineBase {
run () {
let running = true
while (running) {
const [op, arg0, arg1] = this.fetch()
switch (op) {
case OPS.hlt.code:
running = false
break
case OPS.ldc.code:
this.assertIsRegister(arg0, op)
this.reg[arg0] = arg1
break
...
default:
assert(false, `Unknown op ${op}`)
break
}
}
}
assertIsRegister (reg) {
assert((0 <= reg) && (reg < this.reg.length),
`Invalid register ${reg}`)
}
assertIsAddress (addr) {
assert((0 <= addr) && (addr < this.ram.length),
`Invalid register ${addr}`)
}
}
export default VirtualMachine
Some instructions are very similar to others, so we will only look at three here. The first stores the value of one register in the address held by another register:
case OPS.str.code:
this.assertIsRegister(arg0, op)
this.assertIsRegister(arg1, op)
this.assertIsAddress(this.reg[arg1], op)
this.ram[this.reg[arg1]] = this.reg[arg0]
break
The first three lines check that the operation is legal; the fourth one uses the value in one register as an address, which is why it has nested array indexing.
Adding the value in one register to the value in another register is ever simpler:
case OPS.add.code:
this.assertIsRegister(arg0, op)
this.assertIsRegister(arg1, op)
this.reg[arg0] += this.reg[arg1]
break
as is jumping to a fixed address if the value in a register is zero:
case OPS.beq.code:
this.assertIsRegister(arg0, op)
this.assertIsAddress(arg1, op)
if (this.reg[arg0] === 0) {
this.ip = arg1
}
break
What do assembly programs look like?
We could figure out numerical op codes by hand,
and in fact that's what the first programmers did.
However,
it is much easier to turn a very simple language into those numbers
using an
# Print initial contents of R1.
prr R1
hlt
and this is its numeric representation:
00010a
000001
This program prints the numbers from 0 to 2
(
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt
000002
030102
00000a
010202
020006
010204
000207
020209
000001
The loop
doesn't take up any space,
but instead tells the assembler to give the address of the next instruction a name
so that we can refer to that address as @loop
.
Let's trace this program's execution
(
- R0 holds the current loop index.
- R1 holds the loop's upper bound (in this case 3).
- The loop prints the value of R0 (one instruction).
- The program adds 1 to R0. This takes two instructions because we can only add register-to-register.
- It checks to see if we should loop again, which takes three instructions.
- If the program doesn't jump back, it halts.
The implementation of the assembler mirrors the simplicity of assembly language. The main method gets interesting lines, finds the addresses of labels, and turns each remaining line into an instruction:
assemble (lines) {
lines = this.cleanLines(lines)
const labels = this.findLabels(lines)
const instructions = lines.filter(line => !this.isLabel(line))
const compiled = instructions.map(instr => this.compile(instr, labels))
const program = this.instructionsToText(compiled)
return program
}
cleanLines (lines) {
return lines
.map(line => line.trim())
.filter(line => line.length > 0)
.filter(line => !this.isComment(line))
}
isComment (line) {
return line.startsWith('#')
}
To find labels, we go through the lines one by one and either save the label or increment the current address (because labels don't take up space):
findLabels (lines) {
const result = {}
let index = 0
lines.forEach(line => {
if (this.isLabel(line)) {
const label = line.slice(0, -1)
assert(!(label in result),
`Duplicate label ${label}`)
result[label] = index
} else {
index += 1
}
})
return result
}
isLabel (line) {
return line.endsWith(':')
}
To compile a single instruction we break the line into tokens, look up the format for the operands, and then pack them into a single value:
compile (instruction, labels) {
const [op, ...args] = instruction.split(/\s+/)
assert(op in OPS,
`Unknown operation "${op}"`)
let result = 0
switch (OPS[op].fmt) {
case '--':
result = this.combine(
OPS[op].code
)
break
case 'r-':
result = this.combine(
this.register(args[0]),
OPS[op].code
)
break
case 'rr':
result = this.combine(
this.register(args[1]),
this.register(args[0]),
OPS[op].code
)
break
case 'rv':
result = this.combine(
this.value(args[1], labels),
this.register(args[0]),
OPS[op].code
)
break
default:
assert(false,
`Unknown instruction format ${OPS[op].fmt}`)
}
return result
}
Combining op codes and operands into a single value is the reverse of the unpacking done by the virtual machine:
combine (...args) {
assert(args.length > 0,
'Cannot combine no arguments')
let result = 0
for (const a of args) {
result <<= OP_SHIFT
result |= a
}
return result
}
Finally, we need few utility functions:
instructionsToText (program) {
return program.map(op => op.toString(16).padStart(OP_WIDTH, '0'))
}
register (token) {
assert(token[0] === 'R',
`Register "${token}" does not start with 'R'`)
const r = parseInt(token.slice(1))
assert((0 <= r) && (r < NUM_REG),
`Illegal register ${token}`)
return r
}
value (token, labels) {
if (token[0] !== '@') {
return parseInt(token)
}
const labelName = token.slice(1)
assert(labelName in labels,
`Unknown label "${token}"`)
return labels[labelName]
}
Let's try assembling a program and display its output, the registers, and the interesting contents of memory. This program counts up to three:
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt
>> 0
>> 1
>> 2
R0 = 3
R1 = 3
R2 = 0
R3 = 0
0: 00000002 00030102 0000000a 00010202
4: 00020006 00010204 00000207 00020209
8: 00000001 00000000 00000000 00000000
How can we store data?
It's hard to write interesting programs when each value needs a unique name.
We can do a lot more once we have collections like arrays,
so let's add those to our assembler.
(We don't have to make any changes to the virtual machine,
which doesn't care how we think about our data.)
We will allocate storage for arrays after the program
by using .data
on a line of its own to mark the start of the data section
and then label: number
to give a region a name and allocate some storage space
(
This enhancement only requires a few changes to the assembler. First, we need to split the lines into instructions and data allocations:
assemble (lines) {
lines = this.cleanLines(lines)
const [toCompile, toAllocate] = this.splitAllocations(lines)
const labels = this.findLabels(lines)
const instructions = toCompile.filter(line => !this.isLabel(line))
const baseOfData = instructions.length
this.addAllocations(baseOfData, labels, toAllocate)
const compiled = instructions.map(instr => this.compile(instr, labels))
const program = this.instructionsToText(compiled)
return program
}
splitAllocations (lines) {
const split = lines.indexOf(DIVIDER)
if (split === -1) {
return [lines, []]
} else {
return [lines.slice(0, split), lines.slice(split + 1)]
}
}
Second, we need to figure out where each allocation will lie and create a label accordingly:
addAllocations (baseOfData, labels, toAllocate) {
toAllocate.forEach(alloc => {
const fields = alloc.split(':').map(a => a.trim())
assert(fields.length === 2,
`Invalid allocation directive "${alloc}"`)
const [label, numWordsText] = fields
assert(!(label in labels),
`Duplicate label "${label}" in data allocation`)
const numWords = parseInt(numWordsText)
assert((baseOfData + numWords) < RAM_LEN,
`Allocation "${label}" requires too much memory`)
labels[label] = baseOfData
baseOfData += numWords
})
}
And that's it: no other changes needed to compilation or execution. To test it, let's fill an array with the numbers from 0 to 3:
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
# - R2: array index.
# - R3: temporary.
ldc R0 0
ldc R1 3
ldc R2 @array
loop:
str R0 R2
ldc R3 1
add R0 R3
add R2 R3
cpy R3 R1
sub R3 R0
bne R3 @loop
hlt
.data
array: 10
R0 = 3
R1 = 3
R2 = 14
R3 = 0
0: 00000002 00030102 000b0202 00020005
4: 00010302 00030006 00030206 00010304
8: 00000307 00030309 00000001 00000000
c: 00000001 00000002 00000000 00000000
Exercises
Swapping values
Write an assembly language program that swaps the values in R1 and R2 without affecting the values in other registers.
Reversing an array
Write an assembly language program that starts with:
- the base address of an array in one word
- the length of the array N in the next word
- N values immediately thereafter
and reverses the array in place.
diagram of array layout
Increment and decrement
-
Add instructions
inc
anddec
that add one to the value of a register and subtract one from the value of a register respectively. -
Rewrite the examples to use these instructions. How much shorter do they make the programs? How much easier to read?
Using long addresses
-
Modify the virtual machine so that the
ldr
andstr
instructions contain 16-bit addresses rather than 8-bit addresses and increase the virtual machine's memory to 64K words to match. -
How does this complicate instruction interpretation?
Operating on strings
The C programming language stored character strings as non-zero bytes terminated by a byte containing zero.
Diagram of C string storage.
-
Write a program that starts with the base address of a string in R1 and finishes with the length of the string (not including the terminator) in the same register.
-
Write a program that starts with the base address of a string in R1 and the base address of some other block of memory in R2 and copies the string to that new location (including the terminator).
-
What happens in each case if the terminator is missing?
Call and return
-
Add another register to the virtual machine called SP (for "stack pointer") that is automatically initialized to the last address in memory.
-
Add an instruction
psh
(short for "push") that copies a value from a register to the address stored in SP and then subtracts one from SP. -
Add an instruction
pop
(short for "pop") that adds one to SP and then copies a value from that address into a register. -
Using these instructions, write a subroutine that evaluates
2x+1
for every value in an array.
Disassembling instructions
A @L001
and @L002
.)
Linking multiple files
-
Modify the assembler to handle
.include filename
directives. -
What does your modified assembler do about duplicate label names? How does it prevent infinite includes (i.e., A.as includes B.as which includes A.as again)?
Providing system calls
Modify the virtual machine so that developers can add "system calls" to it.
-
On startup, the virtual machine loads an array of functions defined in a file called
syscalls.js
. -
The
sys
instruction takes a one-byte constant argument. It looks up the corresponding function and calls it with the values of R0-R3 as parameters and places the result in R0.
Unit testing
-
Write unit tests for the assembler.
-
Once they are working, write unit tests for the virtual machine.