Parameterized Synthetic Text Generation with SimpleStories

Lennart Finke, Chandan Sreedhara, Thomas Dooms, Mat Allen, Juan Rodriguez, Noa Nabeshima, Thomas Marshall, Dan Braun

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Datasets and Benchmarks Track

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained suite of tiny models show improved sample efficiency and model interpretability compared with the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier with regard to the fewest-parameter language model that outputs grammatical English.
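
To make the idea of prompt parameterization concrete, the following is a minimal sketch and not the authors' pipeline. The parameter names, value pools, and template wording are hypothetical, chosen only for illustration; the abstract does not specify them. The sketch samples one value per parameter and fills a template, so that each generated prompt requests a story with a different combination of characteristics.

```python
import random

# Hypothetical parameter pools (assumption): the actual parameters and
# values used to build SimpleStories are not stated in the abstract.
PARAMETER_POOLS = {
    "theme": ["friendship", "courage", "curiosity"],
    "setting": ["a forest", "a small town", "a spaceship"],
    "style": ["playful", "calm", "mysterious"],
    "grammar_feature": ["dialogue", "past tense", "a question"],
}

# Hypothetical template (assumption): one slot per parameter.
TEMPLATE = (
    "Write a short story in simple language. "
    "Theme: {theme}. Setting: {setting}. Style: {style}. "
    "Include {grammar_feature}."
)


def sample_parameters(rng):
    """Draw one value for every parameter using the given random generator."""
    return {name: rng.choice(values) for name, values in PARAMETER_POOLS.items()}


def build_prompt(params):
    """Fill the template with a concrete parameter assignment."""
    return TEMPLATE.format(**params)


def generate_prompts(n, seed=0):
    """Return n prompts; a fixed seed makes the sequence reproducible."""
    rng = random.Random(seed)
    return [build_prompt(sample_parameters(rng)) for _ in range(n)]


if __name__ == "__main__":
    for prompt in generate_prompts(3, seed=42):
        print(prompt)
```

Sampling each parameter independently means the number of distinct prompts grows multiplicatively with the pool sizes, which is one plausible way to obtain diversity at scale from a small set of hand-written options.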