Tokeniser Fertility

| Model | Fertility (tokens per word; lower is better) |
|---|---|
| Sarvam-2 | 2.27 |
| Gemma 3 | 3.07 |
| GPT-OSS Series | 3.91 |
| Nemotron-3-Nano | 4.88 |
| Mistral-Small | 5.55 |
| Qwen3 Series | 7.80 |
| GLM-4.7 | 8.32 |
| GLM-4.5 | 8.89 |
| OLMo-3.1 | 9.38 |
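Tokeniser fertility is conventionally the average number of tokens a tokeniser emits per word, so lower values mean fewer tokens (and less compute) for the same text. A minimal sketch of the computation, assuming a Hugging Face tokeniser and simple whitespace word-splitting; the corpus and segmentation rules behind the table above are not specified here:

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts: list[str]) -> float:
    """Average tokens emitted per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Hypothetical usage on a small Hindi sample; any tokenizer id works here.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
sample = ["भारत एक विशाल देश है", "हिंदी भारत की एक प्रमुख भाषा है"]
print(f"fertility = {fertility(tok, sample):.2f}")
```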
Table 1a: Sarvam 105B vs OSS Models — Chat Overall Win Rate

| Competitor | Fluency | Language / Script | Usefulness | Verbosity | Average |
|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B | 1981/2200 | 1876/2200 | 2011/2200 | 1811/2200 | 1920/2200 |
| GLM-4.5-Air | 2104/2200 | 1980/2200 | 2097/2200 | 1887/2200 | 2017/2200 |
| GPT-OSS-120B | 2105/2200 | 2042/2200 | 2084/2200 | 1795/2200 | 2006/2200 |
| Average vs All | 2063/2200 | 1966/2200 | 2064/2200 | 1831/2200 | 1981/2200 |

Table 1b: Sarvam 105B vs OSS Models — STEM Overall Win Rate

| Competitor | Fluency | Language / Script | Usefulness | Verbosity | Average |
|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B | 2382/2640 | 2202/2640 | 2304/2640 | 2040/2640 | 2232/2640 |
| GLM-4.5-Air | 2458/2640 | 2277/2640 | 2377/2640 | 2118/2640 | 2308/2640 |
| GPT-OSS-120B | 2468/2640 | 2421/2640 | 2158/2640 | 1541/2640 | 2147/2640 |
| Average vs All | 2436/2640 | 2300/2640 | 2280/2640 | 1900/2640 | 2229/2640 |

Table 2a: Sarvam 30B vs OSS Models — Chat Overall Win Rate

| Competitor | Fluency | Language / Script | Usefulness | Verbosity | Average |
|---|---|---|---|---|---|
| GPT-OSS-20B | 2055/2200 | 2009/2200 | 2063/2200 | 1849/2200 | 1994/2200 |
| Nemotron-3-Nano-30B | 2160/2200 | 2159/2200 | 2162/2200 | 2068/2200 | 2137/2200 |
| Qwen3-30B-A3B | 1828/2200 | 1780/2200 | 1864/2200 | 1709/2200 | 1795/2200 |
| Gemma-3-27B-IT | 1549/2200 | 1522/2200 | 1615/2200 | 1474/2200 | 1540/2200 |
| Mistral-3.2-24B | 2159/2200 | 2155/2200 | 2163/2200 | 2126/2200 | 2151/2200 |
| GLM-4.7-Flash | 2069/2200 | 1961/2200 | 2062/2200 | 1955/2200 | 2012/2200 |
| OLMo-3.1-32B-Think | 2081/2200 | 2101/2200 | 2097/2200 | 1939/2200 | 2054/2200 |
| Average vs All | 1986/2200 | 1955/2200 | 2004/2200 | 1874/2200 | 1955/2200 |

Table 2b: Sarvam 30B vs OSS Models — STEM Overall Win Rate

| Competitor | Fluency | Language / Script | Usefulness | Verbosity | Average |
|---|---|---|---|---|---|
| GPT-OSS-20B | 2430/2640 | 2331/2640 | 2082/2640 | 1691/2640 | 2134/2640 |
| Nemotron-3-Nano-30B | 2562/2640 | 2564/2640 | 2392/2640 | 2086/2640 | 2401/2640 |
| Qwen3-30B-A3B | 2372/2640 | 2259/2640 | 2219/2640 | 1966/2640 | 2204/2640 |
| Gemma-3-27B-IT | 2258/2640 | 1806/2640 | 2292/2640 | 2126/2640 | 2120/2640 |
| Mistral-3.2-24B | 2565/2640 | 2542/2640 | 2473/2640 | 2325/2640 | 2476/2640 |
| GLM-4.7-Flash | 2418/2640 | 2193/2640 | 2320/2640 | 2105/2640 | 2259/2640 |
| OLMo-3.1-32B-Think | 2556/2640 | 2549/2640 | 2454/2640 | 2308/2640 | 2467/2640 |
| Average vs All | 2452/2640 | 2321/2640 | 2319/2640 | 2087/2640 | 2294/2640 |
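Each cell in the four tables above reads as wins out of total pairwise comparisons (our reading of the n/2200 and n/2640 format; the judging protocol itself is not restated here). A throwaway helper for turning a cell into a percentage win rate:

```python
def win_rate(cell: str) -> float:
    """Convert a 'wins/total' cell into a percentage win rate."""
    wins, total = (int(x) for x in cell.split("/"))
    return 100.0 * wins / total

# e.g. the Average-vs-All Average cell in Table 1a:
print(f"{win_rate('1981/2200'):.1f}%")  # -> 90.0%
```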
General Benchmarks

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| LiveCodeBench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| MILU Overall Avg | 82.8 | 82.4 | 80.8 | 85.7 |
| MILU Indic Avg | 82.2 | 82.0 | 80.1 | 85.3 |
| WritingBench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IFEval | 84.8 | 83.5 | 85.4 | 88.9 |
Reasoning Heavy Benchmarks

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 | 88.3 | 96.7 | 83.3 | 90.0 |
| AIME 25 (+ tools) | 87.8 | | | |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |
Agentic Benchmarks

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| BrowseComp | 49.5 (pass@16) | 21.3 (pass@64) | - | 38.0 |
| SWE-Bench Verified | 44.1 | 57.6 | 62.4 | 60.9 |
| Tau2 (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |
| Terminal Bench | TODO | 30.0 | 18.7 | - |
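Note the BrowseComp row mixes sampling budgets (pass@16 vs pass@64), so those two numbers are not directly comparable. For reference, the standard unbiased pass@k estimator from Chen et al. (2021) is sketched below; whether this exact estimator was used for these runs is not stated here.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    solves the task. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:  # every size-k draw must contain a correct attempt
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Illustrative numbers only: 64 attempts, 12 of them correct.
print(f"pass@16 = {pass_at_k(64, 12, 16):.3f}")
```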
Comparison with Larger Models

| Benchmark | Sarvam-105B | DeepSeek R1 0528 | Gemini-2.5-Flash | o4-mini | Claude 4 Sonnet |
|---|---|---|---|---|---|
| AIME 25 | 88.3 | 87.5 | 72.0 | 92.7 | 70.5 |
| HMMT (Feb 25) | 85.8 | 79.4 | 64.2 | 83.3 | 75.6 |
| GPQA Diamond | 78.7 | 81.0 | 82.8 | 81.4 | 75.4 |
| LiveCodeBench v6 | 71.7 | 73.3 | 61.9 | 80.2 | 55.9 |
| MMLU Pro | 81.7 | 85.0 | 82.0 | 81.9 | 83.7 |
| BrowseComp | 49.5 | 3.2 | 20.0 | 28.3 | 14.7 |
| SWE-Bench Verified | 44.0 | 57.6 | 48.9 | 68.1 | 66.6 |
| Tau2 Bench | 68.3 | 62.0 | 49.7 | 65.9 | 64.0 |
| HLE | 11.2 | 8.5 | 12.1 | 14.28 | 9.6 |
General Benchmarks (30B)

| Benchmark | Sarvam-30B | Gemma-3-27B-IT | Mistral-3.2-24B | OLMo-3.1-32B-Think | Nemotron-3-Nano-30B-A3B | Qwen3-30B-Thinking-2507 | GLM-4.7-Flash | GPT-OSS-20B |
|---|---|---|---|---|---|---|---|---|
| Math500 | 0.970 | 0.874 | 0.694 | 0.962 | 0.980 | 0.976 | 0.970 | 0.942 |
| HumanEval | | 0.884 | 0.929 | 0.951 | 0.976 | 0.957 | 0.963 | 0.957 |
| MBPP | | 0.818 | 0.783 | 0.587 | 0.919 | 0.943 | 0.918 | 0.953 |
| LiveCodeBench v6 | | 0.280 | 0.260 | 0.730 | 0.683 | 0.660 | 0.640 | 0.610 |
| MMLU | | 0.812 | 0.805 | 0.864 | 0.840 | 0.884 | 0.869 | 0.853 |
| MMLU Pro | | 0.681 | 0.691 | 0.720 | 0.783 | 0.809 | 0.736 | 0.750 |
| MILU | | | | | | | | |
| Arena Hard v2 | | 0.501 | 0.431 | 0.420 | 0.677 | 0.721 | 0.581 | 0.629 |
| WritingBench | | 0.714 | 0.703 | 0.757 | 0.837 | 0.850 | 0.792 | 0.791 |
Reasoning Heavy Benchmarks (30B)

| Benchmark | Sarvam-30B | OLMo-3.1-32B | Nemotron-3-Nano-30B | Qwen3-30B-Thinking-2507 | GLM-4.7-Flash | GPT-OSS-20B |
|---|---|---|---|---|---|---|
| GPQA Diamond | | 0.575 | 0.606 | 0.730 | 0.750 | 0.734 |
| GPQA Diamond (+ tools) | | - | 0.752 | - | 0.715 | 0.742 |
| AIME 25 | | 0.781 | 0.817 | 0.891 | 0.992 | 0.850 |
| AIME 25 (+ tools) | | 0.767 | 0.916 | 0.883 | 0.917 | 0.987 |
| HMMT (Feb 25) | | 0.517 | 0.850 | 0.714 | 0.850 | 0.767 |
| Beyond AIME | | 0.485 | 0.640 | 0.610 | 0.600 | 0.460 |
Agentic Benchmarks (30B)

| Benchmark | Sarvam-30B | Nemotron-3-Nano-30B | Qwen3-30B-Thinking-2507 | GLM-4.7-Flash | GPT-OSS-20B |
|---|---|---|---|---|---|
| BrowseComp | | 0.238 | 0.029 | 0.428 | 0.283 |
| SWE-Bench Verified | | 0.388 | 0.220 | 0.592 | 0.340 |
| Tau2 (avg.) | | 0.490 | 0.477 | 0.795 | 0.477 |
Curious what else we're building? Explore our APIs and start creating.