Papers

Formal work, published.

Research papers, reports, and longer technical writeups that are worth keeping in one place.

01
Research Paper

Dense vs Sparse Pretraining at Tiny Scale: Active- vs Total-Parameter Matching

Abdalrahman Wael · March 2026

We study dense and mixture-of-experts transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. Matched on active parameters (per-token compute), the sparse model comes out ahead; matched on total parameters (stored capacity), the dense model retains a small edge in this sub-25M-parameter regime.
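
As a back-of-the-envelope illustration of why the two matching criteria diverge, the sketch below counts active versus total parameters for a hypothetical top-k MoE feed-forward layer. Every size in it is an assumption chosen for illustration, not the paper's actual configuration.

```python
# Illustrative arithmetic only: how "active" and "total" parameter counts
# diverge for a top-k mixture-of-experts FFN. All sizes below are
# hypothetical, not the paper's configuration.

d_model = 256        # hidden size (assumed)
d_ff = 1024          # per-expert FFN width (assumed)
n_experts = 8        # experts per MoE layer (assumed)
top_k = 2            # experts activated per token (assumed)

# A standard two-matrix FFN expert: d_model -> d_ff -> d_model.
params_per_expert = 2 * d_model * d_ff
router_params = d_model * n_experts  # linear gate over experts

total_params = n_experts * params_per_expert + router_params
active_params = top_k * params_per_expert + router_params

print(f"total:  {total_params:,}")   # capacity stored in memory
print(f"active: {active_params:,}")  # parameters touched per token

# Sizing a dense baseline to `active_params` equalizes per-token compute;
# sizing it to `total_params` equalizes stored capacity. The gap between
# the two is what the active- vs total-parameter comparison probes.
```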