Code Concepts: A Large-Scale Synthetic Dataset for Improving LLM Pretraining
NVIDIA has developed a large-scale synthetic dataset for improving pretraining of large language models (LLMs). The dataset, called Code Concepts, consists of approximately 15 million Python programming problems generated with a concept-driven data generation workflow. This workflow lets researchers generate data aligned with desired model capabilities, addressing the challenge of improving model quality through data quality and specificity.
Concept-Driven Data Generation Workflow
The workflow centers on a curated taxonomy of programming knowledge derived from large-scale annotation of the Nemotron-Pretraining-Code-{v1,v2} datasets. This taxonomy encodes thousands of programming concepts organized hierarchically, from fundamental constructs to advanced algorithmic and data-structure patterns. Using this taxonomy, developers can perform targeted data generation through the combination and distillation of selected concepts, enabling experimenters to control difficulty, diversity, and conceptual balance across generated data.
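To make the idea concrete, here is a minimal sketch of concept-combination sampling. The taxonomy, category names, and helper functions below are hypothetical stand-ins; the actual Nemotron taxonomy encodes thousands of hierarchically organized concepts, and the real pipeline distills combinations into full problems via an LLM.

```python
import random

# Hypothetical miniature taxonomy standing in for the real one,
# which encodes thousands of hierarchically organized concepts.
TAXONOMY = {
    "fundamentals": ["string slicing", "list comprehension", "dict lookup"],
    "algorithms": ["binary search", "two pointers", "dynamic programming"],
    "data structures": ["stack", "heap", "hash map"],
}

def sample_concept_combination(k=2, seed=None):
    """Pick k concepts from distinct categories to seed one synthetic problem."""
    rng = random.Random(seed)
    categories = rng.sample(list(TAXONOMY), k)
    return [rng.choice(TAXONOMY[c]) for c in categories]

def build_prompt(concepts):
    """Turn a concept combination into a generation prompt for an LLM."""
    return ("Write a self-contained Python problem and solution that "
            "exercises: " + ", ".join(concepts))

print(build_prompt(sample_concept_combination(seed=0)))
```

Varying `k` and the sampling distribution over categories is one plausible way to control the difficulty, diversity, and conceptual balance described above.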
Application of Code Concepts Dataset
NVIDIA applied the workflow to create the Code Concepts dataset, which was used to strengthen foundational Python programming skills during LLM pretraining. The dataset was generated by identifying the 91 core concepts most relevant to the HumanEval benchmark and combining them to create approximately 15 million synthetic Python programming problems. Each problem was validated for syntactic correctness using Python's ast.parse function.
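The syntactic check can be sketched in a few lines. Note that `ast.parse` only confirms that the generated code parses as valid Python; it does not execute the code or verify runtime behavior.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if source parses as syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# A well-formed generated sample passes the filter...
print(is_valid_python("def add(a, b):\n    return a + b"))  # True
# ...while malformed LLM output is discarded.
print(is_valid_python("def add(a, b) return a + b"))  # False
```

Applied over millions of generated samples, a filter like this cheaply removes malformed outputs before they reach the pretraining corpus.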
The inclusion of the Code Concepts dataset in the final 100 billion tokens of the Nemotron-Nano-v3 pretraining yields a six-point gain on the HumanEval benchmark, demonstrating the effectiveness of the concept-driven data generation workflow.
Conclusion
NVIDIA’s Code Concepts dataset and concept-driven data generation workflow provide a scalable solution for generating high-quality, targeted data for LLM pretraining. This approach enables researchers to improve model quality through data quality and specificity, addressing a significant challenge in LLM development.