NVIDIA Releases Nemotron-Cascade 2 LLM
NVIDIA has announced the release of Nemotron-Cascade 2, an open-weight LLM built for specialized reasoning. The model achieves Gold Medal-level performance on the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals. Nemotron-Cascade 2 is a 30B-parameter Mixture-of-Experts (MoE) model with 3B activated parameters, designed to maximize ‘intelligence density’.
The primary value proposition of Nemotron-Cascade 2 is its specialized performance in mathematical reasoning, coding, alignment, and instruction following. In these targeted categories it outperforms both the recently released Qwen3.5-35B-A3B (February 2026) and the larger Nemotron-3-Super-120B-A12B.
Nemotron-Cascade 2 Performance and Training
Nemotron-Cascade 2 outperforms Qwen3.5-35B-A3B on AIME 2025 and HMMT Feb25, and also leads on LiveCodeBench v6 and IOI 2025. It scores significantly higher than Qwen3.5-35B-A3B on ArenaHard v2 and IFBench as well. These reasoning capabilities stem from the post-training pipeline, which starts from the Nemotron-3-Nano-30B-A3B-Base model. During SFT, NVIDIA’s research team used a meticulously curated dataset in which samples were packed into sequences of up to 256K tokens.
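To make the 256K-token packing concrete, here is a minimal sketch of greedy sequence packing, the common way to fill long training sequences with many variable-length samples. The packing strategy shown is an assumption for illustration; NVIDIA describes only the 256K-token sequence limit, not the algorithm used.

```python
def pack_samples(sample_lengths, max_len=256_000):
    """Greedily pack variable-length samples into sequences of at most max_len tokens.

    Returns a list of packs, each a list of sample indices. Illustrative only:
    the actual packing used for Nemotron-Cascade 2's SFT data is not published.
    """
    packs, current, current_len = [], [], 0
    for i, n in enumerate(sample_lengths):
        if n > max_len:
            raise ValueError(f"sample {i} ({n} tokens) exceeds max_len")
        if current_len + n > max_len:  # sample would overflow: start a new pack
            packs.append(current)
            current, current_len = [], 0
        current.append(i)
        current_len += n
    if current:  # flush the final, partially filled pack
        packs.append(current)
    return packs
```

Packing keeps GPU utilization high during SFT, since short samples no longer waste padding slots in a long sequence.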
Following SFT, the model underwent Cascade RL, which applies sequential, domain-wise training. This approach mitigates catastrophic forgetting by allowing hyperparameters to be tailored to each domain without destabilizing the others. The pipeline includes stages for instruction following (IF-RL), multi-domain RL, RLHF, long-context RL, and specialized Code and SWE RL. Integrating MOPD throughout the Cascade RL process was crucial.
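The sequential, domain-wise structure can be sketched as a simple stage scheduler. The stage names below follow the article; the hyperparameter values and the `train_stage` interface are invented placeholders, not NVIDIA's actual configuration.

```python
# Hypothetical sketch of sequential, domain-wise RL scheduling ("Cascade RL").
# Hyperparameter values are illustrative placeholders only.
STAGES = [
    {"name": "IF-RL",           "lr": 1e-6, "kl_coef": 0.05},
    {"name": "multi-domain RL", "lr": 5e-7, "kl_coef": 0.02},
    {"name": "RLHF",            "lr": 5e-7, "kl_coef": 0.10},
    {"name": "long-context RL", "lr": 2e-7, "kl_coef": 0.02},
    {"name": "Code/SWE RL",     "lr": 2e-7, "kl_coef": 0.01},
]

def run_cascade(policy, train_stage):
    """Run each RL stage in order; the output of one stage initializes the next.

    Per-stage hyperparameters can be tuned for one domain without touching the
    others, which is the mechanism the article credits for avoiding forgetting.
    """
    completed = []
    for stage in STAGES:
        policy = train_stage(policy, **stage)  # domain-specific settings
        completed.append(stage["name"])
    return policy, completed
```

The key design choice is that stages run one after another rather than mixed into a single objective, so each domain gets its own learning rate and KL budget.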
Dataset and Advanced Techniques
The training dataset for Nemotron-Cascade 2 included 1.9M Python reasoning traces and 1.3M Python tool-calling samples for competitive coding. It also incorporated 816K samples of mathematical natural-language proofs and a specialized Software Engineering (SWE) blend of 125K agentic and 389K agentless samples. MOPD uses the best-performing intermediate ‘teacher’ models, themselves derived from the same SFT initialization, to provide a dense, token-level distillation signal.
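A dense token-level signal means the student receives a loss at every output position, rather than one scalar reward per completed response. A minimal sketch of such a signal is a per-token KL divergence between teacher and student distributions; this is an illustrative stand-in, since NVIDIA has not published MOPD's exact loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_distill_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over a sequence.

    Each position contributes its own gradient signal, unlike sequence-level
    reward methods (e.g. GRPO), which score the whole response at once.
    Illustrative sketch only; not NVIDIA's published MOPD objective.
    """
    total = 0.0
    for t_tok, s_tok in zip(teacher_logits, student_logits):
        p, q = softmax(t_tok), softmax(s_tok)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)
```

The density of this signal is what the next paragraph's sample-efficiency claim rests on: every token carries feedback, so fewer rollouts are needed per unit of learning.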
The NVIDIA research team stated, “MOPD is substantially more sample-efficient than sequence-level reward algorithms like Group Relative Policy Optimization (GRPO).” Nemotron-Cascade 2 supports two primary operating modes through its chat template: Thinking Mode and Non-Thinking Mode. For agentic tasks, the model utilizes a structured tool-calling protocol within the system prompt, with available tools listed within tags for verifiable execution feedback.
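The dual-mode template and tagged tool list can be sketched as a small prompt builder. The tag names (`<tools>`, `/think`, `/no_think`) and the overall layout here are assumptions for illustration; NVIDIA's actual Nemotron chat template may differ.

```python
import json

def build_prompt(user_msg, tools=None, thinking=True):
    """Sketch of a dual-mode chat prompt with a tagged tool list.

    Hypothetical template: tag names and mode markers are invented for this
    example and are not taken from NVIDIA's published chat template.
    """
    system = "You are a helpful assistant."
    if tools:
        # Advertise tools inside a tagged block so the model's tool calls can
        # be parsed and executed, yielding verifiable execution feedback.
        tool_lines = "\n".join(json.dumps(t) for t in tools)
        system += f"\n<tools>\n{tool_lines}\n</tools>"
    mode = "/think" if thinking else "/no_think"
    return f"<system>{system}</system>\n<user>{user_msg} {mode}</user>\n<assistant>"
```

In Thinking Mode the model emits an explicit reasoning trace before its answer; Non-Thinking Mode trades that trace for lower latency.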