Benchmarks
The numbers below are real measurements, reported honestly, including where
rbgo is slower. They are wall-clock (process startup included), best-of-8
on an Apple M4 Max (darwin/arm64). The same bench/*.rb programs are run through
every runtime, and each program's stdout is checked to be byte-identical to
MRI's before it is timed.
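The procedure described above (verify stdout first, then take the fastest of N wall-clock runs) can be sketched as follows. This is a minimal illustration, not the project's bench/run.sh: the helper names `run_once` and `best_of` are hypothetical, and the demo times the host Ruby (via `RbConfig.ruby`) on a stand-in one-liner so that it runs anywhere.

```ruby
require "open3"
require "rbconfig"

# Run a command once and return [stdout, elapsed_ms].
# The clock brackets the whole child process, so startup is included.
def run_once(cmd)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  out, status = Open3.capture2(*cmd)
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000.0
  raise "command failed: #{cmd.join(' ')}" unless status.success?
  [out, elapsed_ms]
end

# Check stdout byte-for-byte against a reference, then time `runs` runs
# and keep the fastest one (best-of-N).
def best_of(cmd, reference_stdout, runs: 8)
  out, = run_once(cmd)
  raise "stdout differs from reference" unless out.b == reference_stdout.b
  Array.new(runs) { run_once(cmd).last }.min
end

# Demo with a stand-in program; in the setup described above the
# reference output would come from MRI.
cmd = [RbConfig.ruby, "-e", "puts((1..10).sum)"]
reference, = run_once(cmd)
puts format("best of 3: %.1f ms", best_of(cmd, reference, runs: 3))
```

Taking the minimum rather than the mean filters out scheduling noise, at the price of reporting the most favourable run for every runtime.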
Six-runtime comparison
rbgo is the pure-Go bytecode interpreter; rbgo+AOT is the native binary from
rbgo build (the program's integer-bound methods lowered to Go). References:
MRI 4.0.5, MRI+YJIT, JRuby 10.1 (OpenJDK 25), TruffleRuby 34.0.1
(GraalVM CE Native). Times in ms, best of 8.
| program | rbgo | rbgo+AOT | MRI | MRI+YJIT | JRuby | TruffleRuby |
|---|---|---|---|---|---|---|
| strings | 40 | 40 | 40 | 40 | 1040 | 120 |
| wordcount | 120 | 120 | 80 | 80 | 1090 | 200 |
| hash | 250 | 260 | 80 | 80 | 1120 | 100 |
| array | 440 | 440 | 90 | 60 | 1160 | 60 |
| blocks | 560 | 600 | 250 | 220 | 1200 | 80 |
| proc | 690 | 680 | 160 | 140 | 1160 | 70 |
| dispatch | 1180 | 1150 | 220 | 170 | 1190 | 50 |
| alloc | 1250 | 1280 | 230 | 190 | 1190 | 90 |
| loop | 1490 | 10 | 360 | 360 | 1240 | 90 |
| fib | 2470 | 20 | 480 | 100 | 2770 | 150 |
| mandelbrot | 2930 | 2920 | 780 | 760 | 1270 | 90 |
- rbgo+AOT is the standout: loop 10 ms and fib 20 ms. By the table above that is roughly 36× faster than MRI+YJIT on loop and 5× on fib, making it the only runtime here that beats YJIT on both, via closed-world native lowering of integer-bound methods. (mandelbrot's float kernel is not yet AOT-lowered, so its AOT column matches the interpreter.)
- The rbgo interpreter is ~3–6× slower than MRI on compute-bound code and at or near parity where startup / I-O dominates (strings, wordcount).
- TruffleRuby (GraalVM JIT) is the compute ceiling, e.g. mandelbrot 90 ms vs MRI 780 ms.
- JRuby is dominated by ~1.0–1.2 s JVM startup; for these single-shot micro-benchmarks the startup is the story.
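The ratios behind these observations can be recomputed directly from the table. The snippet below is a small sketch over a subset of rows; the constant `TIMES_MS` and the helper `ratio` are names made up for this example. Every table entry is a multiple of 10 ms, so ratios against 10 to 20 ms entries are coarse.

```ruby
# Times in ms, copied from the six-runtime table (a subset of rows).
TIMES_MS = {
  "strings"    => { rbgo: 40,   aot: 40,   mri: 40,  yjit: 40 },
  "dispatch"   => { rbgo: 1180, aot: 1150, mri: 220, yjit: 170 },
  "loop"       => { rbgo: 1490, aot: 10,   mri: 360, yjit: 360 },
  "fib"        => { rbgo: 2470, aot: 20,   mri: 480, yjit: 100 },
  "mandelbrot" => { rbgo: 2930, aot: 2920, mri: 780, yjit: 760 },
}.freeze

# How many times longer `slow` takes than `fast` on one program.
def ratio(program, slow, fast)
  row = TIMES_MS.fetch(program)
  (row.fetch(slow).to_f / row.fetch(fast)).round(1)
end

TIMES_MS.each_key do |prog|
  puts format("%-10s rbgo/MRI %4.1fx   YJIT/AOT %4.1fx",
              prog, ratio(prog, :rbgo, :mri), ratio(prog, :yjit, :aot))
end
```

A YJIT/AOT value below 1 (dispatch, mandelbrot) means the AOT binary is slower than MRI+YJIT on that program.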
Reproduce: AOT=1 RUNS=8 JRUBY=jruby TRUFFLE=<path> bash bench/run.sh 8. The full
write-up (methodology, profiling, where the time goes) is in
BENCHMARKS.md.
Earlier startup-focused snapshot
An earlier, single-run snapshot (rbgo vs MRI+YJIT vs JRuby) — kept for the startup story:
| Workload | rbgo | MRI 4.0.5 | JRuby 10.1 |
|---|---|---|---|
| startup (empty program) | 0.02 s | 0.05 s | 1.06 s |
| fib(30) | 0.61 s | 0.14 s | 1.80 s |
| fib(34) | 1.68 s | 0.38 s | 2.72 s |
| loop sum 10M | 0.68 s | 0.28 s | 1.31 s |
| string build 300k | 0.07 s | 0.07 s | 1.16 s |
| array map+sort 300k | 0.33 s | 0.07 s | 1.15 s |
Startup is rbgo's superpower
rbgo starts in ~0.02 s — a single static Go binary, no separate runtime, no JVM — against MRI's ~0.05 s and JRuby's ~1.06 s. For an embedded interpreter or a CLI tool that is invoked often and exits quickly, this is the decisive number: the process is up and running before the alternatives have finished initialising.
It also colours the rest of the table. To read the compute cost of a workload, subtract each runtime's startup figure from the other entries in its column.
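As a worked version of that subtraction, the sketch below removes each runtime's empty-program startup from its wall-clock figures in the snapshot table above. The names `STARTUP_S`, `WALL_S` and `compute_only` are made up for this example, and it assumes startup is a fixed additive cost, which single-run numbers can only approximate.

```ruby
# Seconds, copied from the snapshot table above.
STARTUP_S = { rbgo: 0.02, mri: 0.05, jruby: 1.06 }.freeze
WALL_S = {
  "fib(30)"             => { rbgo: 0.61, mri: 0.14, jruby: 1.80 },
  "fib(34)"             => { rbgo: 1.68, mri: 0.38, jruby: 2.72 },
  "loop sum 10M"        => { rbgo: 0.68, mri: 0.28, jruby: 1.31 },
  "string build 300k"   => { rbgo: 0.07, mri: 0.07, jruby: 1.16 },
  "array map+sort 300k" => { rbgo: 0.33, mri: 0.07, jruby: 1.15 },
}.freeze

# Wall-clock time minus that runtime's empty-program startup.
def compute_only(workload, runtime)
  (WALL_S.fetch(workload).fetch(runtime) - STARTUP_S.fetch(runtime)).round(2)
end

WALL_S.each_key do |workload|
  cols = STARTUP_S.keys.map { |r| format("%s %.2f s", r, compute_only(workload, r)) }
  puts "#{workload}: #{cols.join(', ')}"
end
```

For fib(30), for example, this leaves 0.59 s for rbgo, 0.09 s for MRI and 0.74 s for JRuby.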
Interpreted compute
On raw interpreted compute, MRI leads — its C interpreter with YJIT is the
reference for fast Ruby, and rbgo, a pure-Go bytecode VM, is a few× slower on
tight numeric and allocation-heavy loops (fib, loop sum, array map+sort).
String building is already at parity (0.07 s vs 0.07 s).
The JRuby numbers here are not a steady-state JIT comparison
JRuby's JVM JIT pays a large warm-up cost, and every workload in this table is short and therefore startup-dominated. These runs do not let the JIT warm up, so they are not a fair picture of JRuby's steady-state performance — JRuby competes on long-running, warm workloads, which this table deliberately does not contain.
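For contrast, a steady-state measurement runs the workload repeatedly inside one process and discards the first iterations. The sketch below shows the idea; `steady_state_ms` is a hypothetical helper, and the warm-up and sample counts are arbitrary placeholders rather than tuned values (a JIT may need far more warm-up than this).

```ruby
# Time a block in-process: run `warmup` untimed iterations first, then
# return the median of `samples` timed iterations in milliseconds
# (the upper middle value when `samples` is even).
def steady_state_ms(warmup: 20, samples: 10)
  warmup.times { yield }
  times = Array.new(samples) do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000.0
  end
  times.sort[samples / 2]
end

def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

puts format("fib(20), warm: %.2f ms", steady_state_ms { fib(20) })
```

Because the process is already running when the clock starts, this number excludes startup entirely, which is the opposite trade-off from the tables on this page.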
rbgo's compute answer: AOT compilation
The interpreter is for embedding, portability and instant startup. When you need
raw compute speed, the answer is the AOT compiler, rbgo build: it lowers
hot methods to native Go: unboxed int64 kernels with a deopt guard back to
the interpreter on overflow or division by zero. AOT-compiled, fib(30) runs ~4× faster
than MRI + YJIT while staying correct for every input. See the
AOT compiler doc.
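The guard-and-fall-back idea can be illustrated in plain Ruby. This is a conceptual sketch only, not rbgo's generated Go code: the names `Deopt`, `guarded_add`, `fib_kernel` and `fib_generic` are invented here, the fib is iterative so the example stays fast, and only the overflow guard is shown (a division guard would follow the same pattern).

```ruby
INT64 = (-(2**63))..(2**63 - 1)

# Raised when the fast path can no longer guarantee an int64 result.
class Deopt < StandardError; end

# Stand-in for an unboxed int64 add with an overflow guard.
def guarded_add(a, b)
  sum = a + b
  raise Deopt unless INT64.cover?(sum)
  sum
end

# "Kernel" version: every addition is guarded.
def fib_kernel(n)
  return n if n < 2
  a, b = 0, 1
  (n - 1).times { a, b = b, guarded_add(a, b) }
  b
end

# Generic version: ordinary Ruby integers, which promote to bignums.
def fib_generic(n)
  return n if n < 2
  a, b = 0, 1
  (n - 1).times { a, b = b, a + b }
  b
end

# Try the kernel; if a guard fails, redo the call on the generic path.
def fib(n)
  [fib_kernel(n), :kernel]
rescue Deopt
  [fib_generic(n), :generic]
end

p fib(30) # => [832040, :kernel]
p fib(92) # => [7540113804746346429, :kernel]
p fib(93) # => [12200160415121876738, :generic]
```

fib(92) is the largest Fibonacci number that fits in a signed 64-bit integer, so fib(93) is the first input where the guard fires and the result comes from the generic path instead.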
So the positioning is straightforward:
rbgo is embeddable Ruby with instant startup and portability; when you need raw compute speed, AOT-compile the hot path.
The scientific stack (NDArray / FFT / Image) gets a further lift from
go-asmgen-generated SIMD kernels across the 64-bit arches, keeping the heavy
numeric paths fast while staying cgo-free (CGO_ENABLED=0).
Methodology caveats
Read these numbers as indicative, not a rigorous benchmark suite:
- wall-clock includes process startup — subtract the startup row to compare compute;
- the startup-focused snapshot is single-run on one machine (Apple-silicon arm64), so treat it as a rough order of magnitude; the six-runtime table is best-of-8, but still on a single machine;
- a fair JIT comparison needs warm / long-running workloads, which these short runs are not;
- performance is validated and benchmarked across all six 64-bit
architectures (amd64, arm64, riscv64, loong64, ppc64le, s390x), not just the
one reported here, and on real hardware rather than only qemu: amd64/arm64
natively; riscv64/ppc64le/loong64 on the GCC Compile Farm (cfarm95 RVV,
cfarm112/cfarm433 POWER8E/POWER9, cfarm401 LoongArch); and s390x on the IBM
LinuxONE Community Cloud. qemu is the CI gate and real silicon is the perf
oracle, so the SIMD-accelerated paths (go-simd base64/securerandom/hex) report
measured numbers, not llvm-mca estimates.