Benchmarks¶

The numbers below are real measurements, reported honestly — including where rbgo is slower. They are wall-clock (process startup included), best-of-8 on an Apple M4 Max (darwin/arm64). The same bench/*.rb is run through every runtime and each program's stdout is checked byte-identical against MRI before it is timed.

Six-runtime comparison¶

rbgo is the pure-Go bytecode interpreter; rbgo+AOT is the native binary from rbgo build (the program's integer-bound methods lowered to Go). References: MRI 4.0.5, MRI+YJIT, JRuby 10.1 (OpenJDK 25), TruffleRuby 34.0.1 (GraalVM CE Native). Times in ms, best of 8.

program	rbgo	rbgo+AOT	MRI	MRI+YJIT	JRuby	TruffleRuby
strings	40	40	40	40	1040	120
wordcount	120	120	80	80	1090	200
hash	250	260	80	80	1120	100
array	440	440	90	60	1160	60
blocks	560	600	250	220	1200	80
proc	690	680	160	140	1160	70
dispatch	1180	1150	220	170	1190	50
alloc	1250	1280	230	190	1190	90
loop	1490	10	360	360	1240	90
fib	2470	20	480	100	2770	150
mandelbrot	2930	2920	780	760	1270	90

rbgo+AOT is the standout: loop 10 ms and fib 20 ms — 18–24× faster than MRI+YJIT, the only runtime here that beats YJIT, via closed-world native lowering of integer-bound methods. (mandelbrot's float kernel is not yet AOT-lowered, so its AOT column matches the interpreter.)
rbgo interpreter runs ~3–6× MRI on compute-bound code and at parity where startup / I-O dominates (strings, wordcount).
TruffleRuby (GraalVM JIT) is the compute ceiling — e.g. mandelbrot 90 ms vs MRI 780 ms.
JRuby is dominated by ~1.0–1.2 s JVM startup; for these single-shot micro-benchmarks the startup is the story.

Reproduce: AOT=1 RUNS=8 JRUBY=jruby TRUFFLE=<path> bash bench/run.sh 8. The full write-up (methodology, profiling, where the time goes) is in BENCHMARKS.md.

Earlier startup-focused snapshot¶

An earlier, single-run snapshot (rbgo vs MRI+YJIT vs JRuby) — kept for the startup story:

Workload	rbgo	MRI 4.0.5	JRuby 10.1
startup (empty program)	0.02 s	0.05 s	1.06 s
fib(30)	0.61 s	0.14 s	1.80 s
fib(34)	1.68 s	0.38 s	2.72 s
loop sum 10M	0.68 s	0.28 s	1.31 s
string build 300k	0.07 s	0.07 s	1.16 s
array map+sort 300k	0.33 s	0.07 s	1.15 s

Startup is rbgo's superpower¶

rbgo starts in ~0.02 s — a single static Go binary, no separate runtime, no JVM — against MRI's ~0.05 s and JRuby's ~1.06 s. For an embedded interpreter or a CLI tool that is invoked often and exits quickly, this is the decisive number: the process is up and running before the alternatives have finished initialising.

It also colours the rest of the table. To read the compute cost of a workload, subtract this fixed startup from each column.

Interpreted compute¶

On raw interpreted compute, MRI leads — its C interpreter with YJIT is the reference for fast Ruby, and rbgo, a pure-Go bytecode VM, is a few× slower on tight numeric and allocation-heavy loops (fib, loop sum, array map+sort). String building is already at parity (0.07 s vs 0.07 s).

The JRuby numbers here are not a steady-state JIT comparison

JRuby's JVM JIT pays a large warm-up cost, and every workload in this table is short and therefore startup-dominated. These runs do not let the JIT warm up, so they are not a fair picture of JRuby's steady-state performance — JRuby competes on long-running, warm workloads, which this table deliberately does not contain.

rbgo's compute answer: AOT compilation¶

The interpreter is for embedding, portability and instant startup. When you need raw compute speed, the answer is the AOT compiler, rbgo build: it lowers hot methods to native Go — unboxed int64 kernels with a deopt guard back to the interpreter on overflow or ÷0. AOT-compiled, fib(30) runs ~4× faster than MRI + YJIT while staying correct for every input. See the AOT compiler doc.

So the positioning is straightforward:

rbgo is embeddable Ruby with instant startup and portability; when you need raw compute speed, AOT-compile the hot path.

The scientific stack (NDArray / FFT / Image) gets a further lift from go-asmgen-generated SIMD kernels across the 64-bit arches, keeping the heavy numeric paths fast while staying CGO=0.

Methodology caveats¶

Read these numbers as indicative, not a rigorous benchmark suite:

wall-clock includes process startup — subtract the startup row to compare compute;
they are single-run on one machine (Apple-silicon arm64), so treat them as a rough order of magnitude;
a fair JIT comparison needs warm / long-running workloads, which these short runs are not;
performance is validated and benchmarked across all six 64-bit architectures (amd64, arm64, riscv64, loong64, ppc64le, s390x), not just the one reported here — and on real hardware, not only qemu: amd64/arm64 natively, riscv64/ppc64le/loong64 on the GCC Compile Farm (cfarm95 RVV, cfarm112/cfarm433 POWER8E/POWER9, cfarm401 LoongArch), and s390x on the IBM LinuxONE Community Cloud. qemu is the CI gate; real silicon is the perf oracle, so the SIMD-accelerated paths (go-simd base64/securerandom/hex) report measured numbers, not llvm-mca estimates.