Research Paper
Dense vs Sparse Pretraining at Tiny Scale: Active- vs Total-Parameter Matching
Abdalrahman Wael · March 2026
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. Under matched active parameters (and thus per-token compute), the sparse model wins; under matched total stored parameters, the dense model retains a small edge in this sub-25M-parameter regime.
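To make the two matching criteria concrete, here is a minimal sketch of how active and total parameter counts differ between a dense FFN block and a top-k-routed MoE block. The layer sizes, expert count, and function names are hypothetical and purely illustrative, not the paper's actual configuration; gating networks, biases, and attention parameters are omitted for simplicity.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # Parameters of one simplified FFN block: up- and down-projection
    # matrices only (gating and biases omitted).
    return 2 * d_model * d_ff

def dense_counts(d_model: int, d_ff: int) -> tuple[int, int]:
    # A dense block applies all of its FFN parameters to every token,
    # so active == total.
    p = ffn_params(d_model, d_ff)
    return p, p

def moe_counts(d_model: int, d_ff: int, n_experts: int, top_k: int) -> tuple[int, int]:
    # An MoE block stores n_experts copies of the FFN but routes each
    # token through only top_k of them, so active < total.
    active = top_k * ffn_params(d_model, d_ff)
    total = n_experts * ffn_params(d_model, d_ff)
    return active, total

# Hypothetical sizes chosen so the MoE's *active* FFN parameters match the dense FFN.
dense_active, dense_total = dense_counts(d_model=256, d_ff=1024)
moe_active, moe_total = moe_counts(d_model=256, d_ff=512, n_experts=8, top_k=2)
print(dense_active, dense_total)  # 524288 524288 (active == total for dense)
print(moe_active, moe_total)      # 524288 2097152 (active matches dense; total is 4x larger)
```

Under active-parameter matching the two models see the same per-token FFN compute but the MoE stores several times more weights; under total-parameter matching the stored capacity is equalized instead, and the MoE's per-token compute drops accordingly.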