Benchmark
Performance comparison of all available backends on the
Memorize EEG dataset
(71 channels × 319 500 samples, 100 EM iterations, 1 model, use_min_dll=False,
use_grad_norm=False).
Results
Tested on:
CPU - AMD Ryzen 9 7950X (16 physical cores, 62 GB RAM)
GPU - NVIDIA GeForce RTX 5070 Ti (15.5 GB VRAM)
Timing (milliseconds per iteration phase)
| Backend | Compiled | Iter 1 (ms) | Grad (ms) | Newt (ms) | Total (ms) |
|---|---|---|---|---|---|
| Fortran | yes | 580.00 | 581.22 | 746.60 | 66390 |
| PyTorch/CPU | no | 1592.46 | 1605.44 | 1599.50 | 160234 |
| PyTorch/CPU | yes | 7131.91 | 768.25 | 612.27 | 75389 |
| PyTorch/CUDA | no | 180.36 | 159.38 | 159.44 | 15962 |
| PyTorch/CUDA | yes | 906.81 | 128.88 | 114.19 | 12931 |
Fit quality and speedup (relative to Fortran baseline)
| Backend | Compiled | Final LL | Spd/iter | Spd/total |
|---|---|---|---|---|
| Fortran | yes | -2.360829 | 1.00 | 1.00 |
| PyTorch/CPU | no | -2.360828 | 0.47 | 0.41 |
| PyTorch/CPU | yes | -2.360836 | 1.22 | 0.88 |
| PyTorch/CUDA | no | -2.360811 | 4.68 | 4.16 |
| PyTorch/CUDA | yes | -2.360819 | 6.54 | 5.13 |
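The two speedup columns can be reproduced from the timing table. The sketch below is not part of benchmarks/benchmark.py; the names `TIMINGS_MS`, `total_ms` and `speedups` are hypothetical, and the iteration counts (49 gradient, 50 Newton) are taken from the phase boundaries stated in the notes. Because the per-phase means in the table are rounded, the reconstructed totals only match the published ones to within about a millisecond.

```python
# Per-phase timings copied from the timing table (milliseconds).
# Tuple layout: (iter 1, mean of iterations 2-50, mean of iterations 51-100).
TIMINGS_MS = {
    ("Fortran", True): (580.00, 581.22, 746.60),
    ("PyTorch/CPU", False): (1592.46, 1605.44, 1599.50),
    ("PyTorch/CPU", True): (7131.91, 768.25, 612.27),
    ("PyTorch/CUDA", False): (180.36, 159.38, 159.44),
    ("PyTorch/CUDA", True): (906.81, 128.88, 114.19),
}

N_GRAD = 49  # gradient-phase iterations 2-50
N_NEWT = 50  # Newton-phase iterations 51-100


def total_ms(iter1, grad, newt):
    """Rebuild the total run time from iter 1 and the two phase means."""
    return iter1 + N_GRAD * grad + N_NEWT * newt


def speedups(timings, baseline=("Fortran", True)):
    """Return {backend: (spd_iter, spd_total)} relative to the baseline."""
    base_newt = timings[baseline][2]
    base_total = total_ms(*timings[baseline])
    result = {}
    for key, (iter1, grad, newt) in timings.items():
        spd_iter = base_newt / newt                             # Newton-phase mean only
        spd_total = base_total / total_ms(iter1, grad, newt)    # includes iter 1
        result[key] = (spd_iter, spd_total)
    return result


if __name__ == "__main__":
    for (backend, compiled), (spd_iter, spd_total) in speedups(TIMINGS_MS).items():
        label = "compiled" if compiled else "eager"
        print(f"{backend:13s} {label:8s} spd/iter={spd_iter:.2f} spd/total={spd_total:.2f}")
```

Keeping Spd/iter tied to the Newton-phase mean alone, and Spd/total tied to the full sum, makes the gap between the two columns a direct measure of how much the first iteration and the gradient phase cost each backend.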
Notes
Iter 1 (ms): wall time of the very first EM iteration. For compiled runs this includes torch.compile tracing overhead (graph capture + kernel compilation), which is a one-time cost amortised over all subsequent iterations.
Grad (ms): mean of iterations 2–50 (gradient phase, iter 1 excluded).
Newt (ms): mean of iterations 51–100 (Newton phase). The relative cost of Newton vs gradient steps varies by backend: Fortran is about 28% slower per iteration in the Newton phase, the uncompiled PyTorch backends are essentially unchanged (within 0.5%), and the compiled backends are 11–20% faster, possibly because kernel fusion is more effective on the Newton correction.
Total (ms): sum of all iteration times including iter 1.
Spd/iter: speedup vs Fortran based on Newton-phase mean ms/iter. Because compiled backends are faster in the Newton phase than in the gradient phase, their whole-run speedup improves with longer runs and approaches this value. Typical AMICA fits use 200–2000 iterations, where nearly all compute is Newton-phase, so Spd/iter is the more representative metric for real-world use.
Spd/total: speedup vs Fortran based on total time (iter 1 included); reflects the real-world cost of torch.compile tracing overhead in a 100-iteration run. Values above 1× mean faster than Fortran, values below 1× mean slower.
The Fortran binary writes per-iteration timings to amicaout/out.txt; per-phase means are derived from those values.
Final LL values are comparable across backends: same dataset, same number of iterations, same algorithm settings.
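To illustrate how the whole-run speedup moves toward Spd/iter on longer fits, the following sketch extrapolates the measured phase means to other iteration counts. It rests on two assumptions that the benchmark does not verify: the Newton phase still starts at iteration 51, and the per-iteration means stay constant beyond iteration 100. The function names are hypothetical, and the projections are estimates, not measurements.

```python
# Per-phase means copied from the timing table (milliseconds):
# (iter 1, gradient-phase mean, Newton-phase mean).
FORTRAN = (580.00, 581.22, 746.60)
CPU_COMPILED = (7131.91, 768.25, 612.27)
CUDA_COMPILED = (906.81, 128.88, 114.19)

# Assumption: the Newton phase starts at iteration 51, as in the benchmark run.
NEWTON_START = 51


def projected_total_ms(n_iters, timing, newton_start=NEWTON_START):
    """Estimate total wall time for a run of n_iters EM iterations."""
    if n_iters < 1:
        raise ValueError("n_iters must be at least 1")
    iter1, grad, newt = timing
    n_grad = min(n_iters, newton_start - 1) - 1    # gradient iterations after iter 1
    n_newt = max(0, n_iters - (newton_start - 1))  # Newton-phase iterations
    return iter1 + n_grad * grad + n_newt * newt


def projected_speedup(n_iters, timing, baseline=FORTRAN):
    """Estimated whole-run speedup relative to the baseline timings."""
    return projected_total_ms(n_iters, baseline) / projected_total_ms(n_iters, timing)


if __name__ == "__main__":
    for n in (100, 200, 500, 2000):
        cpu = projected_speedup(n, CPU_COMPILED)
        cuda = projected_speedup(n, CUDA_COMPILED)
        print(f"{n:5d} iterations: CPU compiled {cpu:.2f}x, CUDA compiled {cuda:.2f}x")
```

At 100 iterations the projection reduces to the Spd/total column; for larger counts the fixed iter 1 cost and the 49 gradient iterations become a shrinking share of the total, so the ratio drifts toward the Newton-phase ratio.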
Running the Benchmark
python benchmarks/benchmark.py
python benchmarks/benchmark.py --output /tmp/my_results.csv
The script auto-detects all available backends (Fortran, CPU, CUDA, MPS) and
saves results to benchmarks/results.csv by default. The Fortran baseline
requires data/amica15ub and data/Memorize.fdt to be present.
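Once a run has finished, the saved CSV can be inspected with the standard library. The column layout of results.csv is not documented here, so this sketch deliberately assumes nothing about column names and just returns each row as a dictionary keyed by whatever header the script wrote. `load_results` is a hypothetical helper, not part of the benchmark script.

```python
import csv
from pathlib import Path


def load_results(path="benchmarks/results.csv"):
    """Read the benchmark CSV into a list of dicts keyed by its header row."""
    with Path(path).open(newline="") as fh:
        return list(csv.DictReader(fh))


if __name__ == "__main__":
    results_path = Path("benchmarks/results.csv")
    if results_path.exists():
        for row in load_results(results_path):
            print(row)
    else:
        print(f"{results_path} not found; run benchmarks/benchmark.py first")
```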