Benchmark¶

Performance comparison of all available backends on the Memorize EEG dataset (71 channels × 319 500 samples, 100 EM iterations, 1 model, use_min_dll=False, use_grad_norm=False).

Results¶

Tested on:

CPU - AMD Ryzen 9 7950X (16 physical cores, 62 GB RAM)
GPU - NVIDIA GeForce RTX 5070 Ti (15.5 GB VRAM)

Timing (milliseconds per iteration phase)

Backend	Compiled	Iter 1 (ms)	Grad (ms)	Newt (ms)	Total (ms)
Fortran	yes	580.00	581.22	746.60	66390
PyTorch/CPU	no	1592.46	1605.44	1599.50	160234
PyTorch/CPU	yes	7131.91	768.25	612.27	75389
PyTorch/CUDA	no	180.36	159.38	159.44	15962
PyTorch/CUDA	yes	906.81	128.88	114.19	12931

Fit quality and speedup (relative to Fortran baseline)

Backend	Compiled	Final LL	Spd/iter	Spd/total
Fortran	yes	-2.360829	1.00	1.00
PyTorch/CPU	no	-2.360828	0.47	0.41
PyTorch/CPU	yes	-2.360836	1.22	0.88
PyTorch/CUDA	no	-2.360811	4.68	4.16
PyTorch/CUDA	yes	-2.360819	6.54	5.13

Notes¶

iter 1: wall time of the very first EM iteration. For compiled runs this includes torch.compile tracing overhead (graph capture + kernel compilation), which is a one-time cost amortised over all subsequent iterations.
grad ms: mean of iterations 2–50 (gradient phase, iter 1 excluded).
newt ms: mean of iterations 51–100 (Newton phase). The relative cost of Newton vs gradient steps varies by backend: Fortran and uncompiled CPU are slower in the Newton phase, while compiled backends seem to be faster (more effective kernel fusion on the Newton correction).
total (ms): sum of all iteration times including iter 1.
spd/iter: speedup vs Fortran based on Newton-phase mean ms/iter. Because compiled backends are faster in the Newton phase than the gradient phase, this speedup improves with longer runs: typical AMICA fits use 200–2000 iterations, where nearly all compute is Newton-phase, so spd/iter is the more representative metric for real-world use.
spd/total: speedup vs Fortran based on total time (iter 1 included); reflects the real-world cost of torch.compile tracing overhead. > 1× means faster than Fortran, < 1× means slower.
The Fortran binary writes per-iteration timings to amicaout/out.txt; per-phase means are derived from those values.
LL values are comparable across backends: same dataset, same number of iterations, same algorithm settings.

Running the Benchmark¶

python benchmarks/benchmark.py
python benchmarks/benchmark.py --output /tmp/my_results.csv

The script auto-detects all available backends (Fortran, CPU, CUDA, MPS) and saves results to benchmarks/results.csv by default. The Fortran baseline requires data/amica15ub and data/Memorize.fdt to be present.