Research Paper
Dense vs Sparse Pretraining at Tiny Scale: Active- vs Total-Parameter Matching
Abdalrahman Wael · March 2026
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. Under matched active parameters (and thus per-token compute), the sparse model wins; under matched total stored parameters, the dense model retains a small edge in this sub-25M-parameter regime.
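To make the two matching criteria concrete, here is a minimal sketch of how active and total parameter counts differ between a dense FFN block and a top-k-routed MoE block. The layer sizes, expert count, and function names are hypothetical and purely illustrative, not the paper's actual configuration; gating networks, biases, and attention parameters are omitted for simplicity.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # Parameters of one simplified FFN block: up- and down-projection
    # matrices only (gating and biases omitted).
    return 2 * d_model * d_ff

def dense_counts(d_model: int, d_ff: int) -> tuple[int, int]:
    # A dense block applies all of its FFN parameters to every token,
    # so active == total.
    p = ffn_params(d_model, d_ff)
    return p, p

def moe_counts(d_model: int, d_ff: int, n_experts: int, top_k: int) -> tuple[int, int]:
    # An MoE block stores n_experts copies of the FFN but routes each
    # token through only top_k of them, so active < total.
    active = top_k * ffn_params(d_model, d_ff)
    total = n_experts * ffn_params(d_model, d_ff)
    return active, total

# Hypothetical sizes chosen so the MoE's *active* FFN parameters match the dense FFN.
dense_active, dense_total = dense_counts(d_model=256, d_ff=1024)
moe_active, moe_total = moe_counts(d_model=256, d_ff=512, n_experts=8, top_k=2)
print(dense_active, dense_total)  # 524288 524288 (active == total for dense)
print(moe_active, moe_total)      # 524288 2097152 (active matches dense; total is 4x larger)
```

Under active-parameter matching the two models see the same per-token FFN compute but the MoE stores several times more weights; under total-parameter matching the stored capacity is equalized instead, and the MoE's per-token compute drops accordingly.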