Code Concepts: A Large-Scale Synthetic Dataset for Improving LLM Pretraining
NVIDIA has developed a large-scale synthetic dataset for improving pretraining of large language models (LLMs). The dataset, called Code Concepts, consists of approximately 15 million Python programming problems generated with a concept-driven data generation workflow. This workflow lets researchers generate data aligned with desired model capabilities, addressing the challenge of improving model quality through data quality and specificity.
Concept-Driven Data Generation Workflow
The workflow centers on a curated taxonomy of programming knowledge derived from large-scale annotation of the Nemotron-Pretraining-Code-{v1,v2} datasets. This taxonomy encodes thousands of programming concepts organized hierarchically, from fundamental constructs to advanced algorithmic and data-structure patterns. Using this taxonomy, developers can perform targeted data generation through the combination and distillation of selected concepts, enabling experimenters to control difficulty, diversity, and conceptual balance across generated data.
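To make the idea concrete, here is a minimal sketch of concept-combination sampling. The taxonomy, category names, and helper functions below are hypothetical stand-ins; the actual Nemotron taxonomy encodes thousands of hierarchically organized concepts, and the real pipeline distills combinations into full problems via an LLM.

```python
import random

# Hypothetical miniature taxonomy standing in for the real one,
# which encodes thousands of hierarchically organized concepts.
TAXONOMY = {
    "fundamentals": ["string slicing", "list comprehension", "dict lookup"],
    "algorithms": ["binary search", "two pointers", "dynamic programming"],
    "data structures": ["stack", "heap", "hash map"],
}

def sample_concept_combination(k=2, seed=None):
    """Pick k concepts from distinct categories to seed one synthetic problem."""
    rng = random.Random(seed)
    categories = rng.sample(list(TAXONOMY), k)
    return [rng.choice(TAXONOMY[c]) for c in categories]

def build_prompt(concepts):
    """Turn a concept combination into a generation prompt for an LLM."""
    return ("Write a self-contained Python problem and solution that "
            "exercises: " + ", ".join(concepts))

print(build_prompt(sample_concept_combination(seed=0)))
```

Varying `k` and the sampling distribution over categories is one plausible way to control the difficulty, diversity, and conceptual balance described above.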
Application of Code Concepts Dataset
NVIDIA applied the workflow to create the Code Concepts dataset, which was used to strengthen foundational Python programming skills during LLM pretraining. The dataset was generated by identifying the 91 core concepts most relevant to the HumanEval benchmark and combining them to create approximately 15 million synthetic Python programming problems. Each problem was validated for syntactic correctness using Python's ast.parse function.
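The syntactic check can be sketched in a few lines. Note that `ast.parse` only confirms that the generated code parses as valid Python; it does not execute the code or verify runtime behavior.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if source parses as syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# A well-formed generated sample passes the filter...
print(is_valid_python("def add(a, b):\n    return a + b"))  # True
# ...while malformed LLM output is discarded.
print(is_valid_python("def add(a, b) return a + b"))  # False
```

Applied over millions of generated samples, a filter like this cheaply removes malformed outputs before they reach the pretraining corpus.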
The inclusion of the Code Concepts dataset in the final 100 billion tokens of the Nemotron-Nano-v3 pretraining yields a six-point gain on the HumanEval benchmark, demonstrating the effectiveness of the concept-driven data generation workflow.
Conclusion
NVIDIA’s Code Concepts dataset and concept-driven data generation workflow provide a scalable solution for generating high-quality, targeted data for LLM pretraining. This approach enables researchers to improve model quality through data quality and specificity, addressing a significant challenge in LLM development.