# Speech-to-Text API for Indian Languages

Sarvam Speech-to-Text (Saaras v3) delivers accurate transcription across 22 Indian languages. It handles background noise, multiple accents, code-mixed speech, and mid-sentence language switches -- built for the complexity of real Indian audio.

> Trust every transcript, even with background noise, multiple accents, or mid-sentence language switches.

---

## Model

**Saaras v3** is the latest speech-to-text model from Sarvam. It auto-detects the spoken language, transcribes all 22 supported Indian languages, handles code-mixed audio, and is optimized for both real-time and batch processing.

---

## Key Capabilities

### Seamless Code-Mixing

Understands when speakers switch between Hindi, English, and regional languages mid-sentence. Indian conversations naturally blend languages -- Saaras v3 transcribes them without forcing a single-language constraint.

### Telephony-Optimized

Handles real call center audio: 8 kHz sample rates, background noise, and multiple speakers. Purpose-built for the audio quality found in actual phone conversations.

### Noisy Audio Handling

Maintains accuracy despite background noise, cross-talk, and poor connections: the challenging acoustic conditions common in field recordings, street interviews, and crowded environments.

### Speaker Diarization

Identifies and labels different speakers in audio. Ideal for meeting transcriptions, interviews, and call center analytics where you need to know who said what. Available via the Batch API.

### Smart Punctuation

Automatic punctuation and formatting for readable transcripts. Converts raw speech into properly punctuated, well-formatted text.

### Real-Time Transcription

Stream audio and get transcriptions as they happen. Sub-250ms median latency for live voice agent scenarios.
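
Streaming clients usually send audio in small fixed-duration chunks rather than as one large buffer. The sketch below splits raw PCM audio into such chunks. The 16 kHz rate is the one stated for the Streaming API; 16-bit mono samples, the 100 ms default chunk size, and the name `pcm_chunks` are assumptions made for illustration, not documented Sarvam parameters.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # sample rate stated for the Streaming API
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM samples
CHANNELS = 1             # assumption: mono audio


def pcm_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive slices of raw PCM, each covering chunk_ms of audio."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    # Bytes needed to hold chunk_ms milliseconds of audio.
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        # The final chunk may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]
```

Each chunk would then be sent as one message over the WebSocket connection. The send call is left out here because the message format is not described in this overview.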

### Transcription and Translation

Transcribe and translate in a single API call. Go from spoken Indian language audio to English text in one step.

### Domain Prompting

Boost accuracy for specialized vocabulary. Provide domain-specific context to improve transcription of technical terms, product names, and industry jargon.

### 22 Indian Languages with Automatic Detection

Comprehensive coverage across all major Indian languages with seamless code-mixing support. No need to specify the language upfront -- the model detects it automatically.

---

## Output Modes

Sarvam STT supports four output modes for different use cases:

| Mode | Description | Example Output |
|---|---|---|
| Transcribe | With formatting and number normalization | mera phone number hai 9840950950 |
| Translate | From Indic languages to English | My phone number is 9840950950 |
| Transliteration | Indian languages written in English letters | mera phone number hai 9840950950 |
| Verbatim | Preserves fillers and spoken numbers | mera phone number hai nau aath chaar zero nau paanch zero nau paanch zero |
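
The Verbatim and Transcribe rows differ mainly in number normalization: digits spoken one at a time are collapsed into a single numeral string. The toy function below reproduces that transformation for the example in the table. It only illustrates the concept; the service performs normalization itself, and the word list and the name `normalize_spoken_digits` are assumptions.

```python
# Toy illustration only: real normalization happens inside the service.
# Romanized Hindi digit words (plus "zero", as used in the table example).
# Note that some of these (e.g. "do") are also ordinary words, which a real
# normalizer would have to disambiguate from context.
DIGIT_WORDS = {
    "zero": "0", "ek": "1", "do": "2", "teen": "3", "chaar": "4",
    "paanch": "5", "chhah": "6", "saat": "7", "aath": "8", "nau": "9",
}


def normalize_spoken_digits(verbatim: str) -> str:
    """Collapse each run of spoken digit words into one numeral string."""
    output = []
    run = []  # digits collected from the current run of digit words
    for token in verbatim.split():
        digit = DIGIT_WORDS.get(token.lower())
        if digit is not None:
            run.append(digit)
            continue
        if run:  # a non-digit word ends the current run
            output.append("".join(run))
            run = []
        output.append(token)
    if run:  # flush a run that ends the sentence
        output.append("".join(run))
    return " ".join(output)


verbatim = "mera phone number hai nau aath chaar zero nau paanch zero nau paanch zero"
print(normalize_spoken_digits(verbatim))  # mera phone number hai 9840950950
```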

---

## Performance Metrics

| Metric | Value |
|---|---|
| Median latency | < 250ms |
| Minutes transcribed | 100M+ |
| Uptime | > 99.5% |

---

## API Types

Three API variants are available:

### REST API
- Synchronous processing for audio files shorter than 30 seconds.

### Batch API
- Asynchronous processing for audio files up to 1 hour long.
- Supports speaker diarization and timestamps.

### Streaming API
- Real-time transcription via WebSocket.
- Designed for live voice agent interactions.
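
The limits above suggest a simple routing rule on the client side. The following is a minimal sketch of that rule, assuming the duration of recorded audio is known in advance; `ApiType` and `choose_api` are hypothetical names for illustration and are not part of any Sarvam SDK.

```python
from enum import Enum
from typing import Optional


class ApiType(Enum):
    REST = "rest"
    BATCH = "batch"
    STREAMING = "streaming"


REST_LIMIT_SECONDS = 30        # REST: files under 30 seconds
BATCH_LIMIT_SECONDS = 60 * 60  # Batch: files up to 1 hour


def choose_api(
    duration_seconds: Optional[float] = None,
    *,
    live: bool = False,
    needs_diarization: bool = False,
) -> ApiType:
    """Pick an API variant from the limits listed in this section."""
    if live:
        return ApiType.STREAMING
    if duration_seconds is None:
        raise ValueError("duration_seconds is required for recorded audio")
    if duration_seconds > BATCH_LIMIT_SECONDS:
        raise ValueError("audio exceeds the 1-hour Batch API limit")
    # Diarization is listed only under the Batch API, so it forces Batch.
    if duration_seconds < REST_LIMIT_SECONDS and not needs_diarization:
        return ApiType.REST
    return ApiType.BATCH
```

For example, `choose_api(12.5)` selects REST, while `choose_api(12.5, needs_diarization=True)` selects Batch even though the clip is short.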

---

## Supported Audio Formats

More than 10 audio formats are supported, including:

- MP3
- WAV
- AAC
- OGG
- Opus
- FLAC
- M4A
- AMR
- WMA
- WebM

The Streaming API supports WAV and raw PCM audio at a 16 kHz sample rate.
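
A client can check a file against these requirements before uploading or streaming it. The sketch below uses only the Python standard library. The extension set is an assumption derived from the format names listed above (the list is not exhaustive, and some formats can appear under other extensions), and the function names are hypothetical.

```python
import io
import wave
from pathlib import Path

# Assumption: one common file extension per format named in this section.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".aac", ".ogg", ".opus",
    ".flac", ".m4a", ".amr", ".wma", ".webm",
}
STREAMING_SAMPLE_RATE_HZ = 16_000  # rate stated for the Streaming API


def is_supported_file(filename: str) -> bool:
    """Check the file extension against the listed formats (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate from a WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def is_streaming_ready(wav_bytes: bytes) -> bool:
    """True if the WAV data already uses the 16 kHz rate the Streaming API expects."""
    return wav_sample_rate(wav_bytes) == STREAMING_SAMPLE_RATE_HZ
```

Telephony recordings at 8 kHz would fail `is_streaming_ready` and need resampling before streaming; resampling itself is outside the scope of this sketch.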

---

## Integrations

### Plug and Play

Deploy a voice agent in under 10 minutes with LiveKit or Pipecat integrations.

---

## Use Cases

### Voice Agents

Real-time transcription for live voice agents and customer interactions.

- Customer support
- Sales and lead qualification
- Edtech tutors
- Social and companion bots

### Post-Call Analytics

Analyze call transcripts to extract insights, sentiment, and actionable intelligence.

- Sentiment analysis
- Quality assurance and training
- Compliance and record keeping
- Performance metrics and insights

### Subtitling and Captioning

Generate accurate subtitles for video content across languages.

- OTT platforms
- YouTube creators
- Corporate training
- Live broadcasts

---

## Supported Languages

All 22 languages listed in the Eighth Schedule of the Constitution of India are supported, with native-script transcription and code-mixing:

| Language | Native Script | Code |
|---|---|---|
| Hindi | हिन्दी | hi-IN |
| Bengali | বাংলা | bn-IN |
| Tamil | தமிழ் | ta-IN |
| Telugu | తెలుగు | te-IN |
| Marathi | मराठी | mr-IN |
| Gujarati | ગુજરાતી | gu-IN |
| Kannada | ಕನ್ನಡ | kn-IN |
| Malayalam | മലയാളം | ml-IN |
| Assamese | অসমীয়া | as-IN |
| Urdu | اردو | ur-IN |
| Sanskrit | संस्कृतम् | sa-IN |
| Nepali | नेपाली | ne-IN |
| Dogri | डोगरी | doi-IN |
| Bodo | बड़ो | brx-IN |
| Punjabi | ਪੰਜਾਬੀ | pa-IN |
| Odia | ଓଡ଼ିଆ | or-IN |
| Konkani | कोंकणी | kok-IN |
| Maithili | मैथिली | mai-IN |
| Sindhi | سنڌي | sd-IN |
| Kashmiri | कॉशुर | ks-IN |
| Manipuri | মৈতৈলোন্ | mni-IN |
| Santali | ᱥᱟᱱᱛᱟᱲᱤ | sat-IN |

The API also handles English with Indian accents and code-mixed speech (e.g., Hinglish, Tanglish, Benglish).
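
When an application needs to validate or display a language code, the table above can be mirrored in a small lookup. This is a sketch assuming the codes are used exactly as written in the table; `LANGUAGE_CODES` and `language_name` are hypothetical names, not part of a Sarvam SDK.

```python
# Codes and names copied from the Supported Languages table.
LANGUAGE_CODES = {
    "hi-IN": "Hindi", "bn-IN": "Bengali", "ta-IN": "Tamil",
    "te-IN": "Telugu", "mr-IN": "Marathi", "gu-IN": "Gujarati",
    "kn-IN": "Kannada", "ml-IN": "Malayalam", "as-IN": "Assamese",
    "ur-IN": "Urdu", "sa-IN": "Sanskrit", "ne-IN": "Nepali",
    "doi-IN": "Dogri", "brx-IN": "Bodo", "pa-IN": "Punjabi",
    "or-IN": "Odia", "kok-IN": "Konkani", "mai-IN": "Maithili",
    "sd-IN": "Sindhi", "ks-IN": "Kashmiri", "mni-IN": "Manipuri",
    "sat-IN": "Santali",
}


def language_name(code: str) -> str:
    """Return the language name for a code from the table, e.g. 'ta-IN'."""
    try:
        return LANGUAGE_CODES[code]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None
```

Because the model detects the language automatically, a lookup like this is mainly useful for labeling results or validating user input rather than for configuring the request.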

---

## Getting Started

1. Sign up at the Sarvam Dashboard: https://dashboard.sarvam.ai
2. Get your API key.
3. Start transcribing with REST, Batch, or Streaming API.

---

## FAQ

**What languages are supported?**
Sarvam Speech-to-Text supports 22 Indian languages, including Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, Punjabi, and Odia, as well as English spoken with Indian accents. The API detects the language automatically and handles code-mixed audio.

**What is Saaras v3?**
Saaras v3 is the latest speech-to-text model that auto-detects the spoken language and provides accurate transcription across all 22 supported Indian languages. It handles code-mixed audio seamlessly and is optimized for both real-time and batch processing.

**What API types are available?**
Three variants: REST API for files under 30 seconds with synchronous processing, Batch API for files up to 1 hour with speaker diarization and timestamps, and Streaming API for real-time transcription via WebSocket.

**What audio formats are supported?**
More than 10 formats, including MP3, WAV, AAC, OGG, Opus, FLAC, M4A, AMR, WMA, and WebM. The Streaming API supports WAV and raw PCM audio at a 16 kHz sample rate.

**Does it support speaker diarization?**
Yes, the Batch API supports speaker diarization. It identifies and labels different speakers in audio. This is ideal for meeting transcriptions, interviews, and call center analytics where you need to know who said what.