
NVIDIA Nemotron 3 Super is a 120B open MoE model activating just 12B parameters to deliver maximum compute efficiency and accuracy for complex multi-agent applications.

Tags: tools · thinking · cloud · 120b
ollama run nemotron-3-super:120b

Details


95acc78b3ffd · 87GB · nemotron_h_moe · 124B · Q4_K_M

License: NVIDIA Software and Model Evaluation License (IMPORTANT NOTICE – PLEASE READ AND AGREE BEFORE USING)

Default parameters: { "temperature": 1, "top_p": 0.95 }
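The default sampling parameters above can also be set per request through Ollama's REST API. A minimal sketch, assuming a local Ollama server on the default port 11434 (the prompt text is illustrative only):

```python
import json

# Recommended sampling settings from the model card.
options = {"temperature": 1, "top_p": 0.95}

# Payload for Ollama's /api/generate endpoint.
payload = {
    "model": "nemotron-3-super:120b",
    "prompt": "Summarize this IT ticket: printer offline on floor 3.",
    "options": options,
    "stream": False,
}

body = json.dumps(payload)

# To send it, POST the body to http://localhost:11434/api/generate, e.g.:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Per-request `options` override the values baked into the Modelfile, which is useful when a workload calls for lower-temperature, more deterministic output.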

Readme

NVIDIA Nemotron™ is a family of open models with open weights, training data, and recipes, delivering leading efficiency and accuracy for building specialized AI agents.

Nemotron-3-Super is a large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Like other models in the family, it responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model’s reasoning capabilities can be configured through a flag in the chat template.
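For thinking-capable models, Ollama exposes a generic `think` field on chat requests; a sketch assuming that field maps to this model's chat-template reasoning flag:

```python
import json

def chat_payload(think: bool) -> str:
    """Build a request body for Ollama's /api/chat endpoint with
    the reasoning trace toggled on or off via the `think` field."""
    return json.dumps({
        "model": "nemotron-3-super:120b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "think": think,
        "stream": False,
    })

with_reasoning = chat_payload(True)      # response carries a separate thinking trace
without_reasoning = chat_payload(False)  # final answer only
```

In the interactive `ollama run` session, the same toggle is available via the `/set think` and `/set nothink` commands.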

The model has 12B active parameters and 120B parameters in total.

The supported languages include English, French, German, Italian, Japanese, Spanish, and Chinese.

This model is ready for commercial use.


Benchmarks

| Benchmark | Nemotron-3-Super | Nemotron-3-Super FP8 | Nemotron-3-Super NVFP4 |
|---|---|---|---|
| **General Knowledge** | | | |
| MMLU-Pro | 83.73 | 83.63 | 83.33 |
| **Reasoning** | | | |
| HMMT Feb25 (with tools) | 94.73 | 94.38 | 95.36 |
| GPQA (no tools) | 79.23 | 79.36 | 79.42 |
| LiveCodeBench (v6, 2024-08 – 2025-05) | 78.69 | 78.44 | 78.44 |
| LiveCodeBench (v5, 2024-07 – 2024-12) | 81.19 | 80.99 | 80.56 |
| SciCode (subtask) | 42.05 | 41.38 | 40.83 |
| HLE (no tools) | 18.26 | 17.42 | 17.42 |
| **Agentic** | | | |
| Terminal Bench (hard subset) | 25.78 | 26.04 | 24.48 |
| TauBench V2: Airline | 56.25 | 56.25 | 54.75 |
| TauBench V2: Retail | 62.83 | 63.05 | 63.38 |
| TauBench V2: Telecom | 64.36 | 63.93 | 63.27 |
| TauBench V2: Average | 61.15 | 61.07 | 60.46 |
| **Chat & Instruction Following** | | | |
| IFBench (prompt) | 72.58 | 72.32 | 73.30 |
| Scale AI Multi-Challenge | 55.23 | 54.35 | 52.80 |
| Arena-Hard-V2 (Hard Prompt) | 73.88 | 76.06 | 76.00 |
| **Long Context** | | | |
| AA-LCR | 58.31 | 57.69 | 58.06 |
| RULER-500 @ 128k (500 samples per task) | 96.79 | 96.85 | 95.99 |
| RULER-500 @ 256k (500 samples per task) | 96.60 | 96.33 | 96.52 |
| RULER-500 @ 512k (500 samples per task) | 96.09 | 95.66 | 96.23 |
| **Multilingual** | | | |
| MMLU-ProX (avg over languages) | 79.35 | 79.21 | 79.37 |