Research Paper
Dense vs Sparse Pretraining at Tiny Scale: Active- vs Total-Parameter Matching
Abdalrahman Wael · March 2026
Abstract
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. Dense baselines are modestly width-resized so that they tightly match either the active- or the total-parameter budget of the MoE, while the tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. In this sub-25M-parameter regime, MoE improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
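As a rough illustration of the two matching schemes, the sketch below counts per-layer FFN parameters for a top-k MoE and width-resizes a dense FFN to hit either the active or the total budget. All values here are assumptions for illustration: the dimensions (d_model, d_ff, n_experts, top_k), the SwiGLU-style three-matrix FFN, and the linear router are not taken from the paper's actual configuration.

```python
# Hypothetical sketch of active- vs total-parameter matching.
# All dimensions below are illustrative, not the paper's settings.

def ffn_params(d_model: int, d_ff: int) -> int:
    # SwiGLU-style LLaMA FFN: gate, up, and down projections.
    return 3 * d_model * d_ff

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (active, total) FFN parameters per layer for a top-k MoE."""
    router = d_model * n_experts                      # assumed linear router
    total = n_experts * ffn_params(d_model, d_ff) + router
    active = top_k * ffn_params(d_model, d_ff) + router
    return active, total

def match_dense_width(target_params: int, d_model: int) -> int:
    # Resize the dense FFN width so its parameter count matches a target budget.
    return round(target_params / (3 * d_model))

d_model, d_ff, n_experts, top_k = 256, 1024, 8, 2     # assumed values
active, total = moe_ffn_params(d_model, d_ff, n_experts, top_k)

d_ff_active_match = match_dense_width(active, d_model)  # dense matched to active budget
d_ff_total_match = match_dense_width(total, d_model)    # dense matched to total budget
print(active, total, d_ff_active_match, d_ff_total_match)
```

Under these assumed values the total-matched dense FFN is roughly four times wider than the active-matched one, which is the capacity gap the two comparisons probe.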
Metadata
- Authors: Abdalrahman Wael
- Publication Date: March 2026
- Topics: mixture-of-experts, transformers, pretraining, LLaMA, TinyStories