Domestic AI startup Sarvam AI recently announced Sarvam Edge, a suite of on-device AI models that puts it in direct competition with offerings from Google and OpenAI in the Indian-language AI space. Unlike cloud-based models from these global players, Sarvam Edge runs entirely on consumer devices, covering speech recognition, speech synthesis, and translation without requiring an internet connection. The company's pitch is straightforward: AI that works anywhere, costs nothing per query, and keeps user data on the device. Here’s everything we know about Sarvam AI’s Edge models:
What is Sarvam Edge and how do the AI models work?
In a blog post, the Bengaluru-based company explains that Sarvam Edge is a collection of compact AI models built to run directly on consumer hardware rather than on remote servers. The initiative aims to bring AI functionality to users in India, including those in areas with unreliable internet connectivity. The company says it is developing Edge in collaboration with global device manufacturers.
The speech recognition model supports 10 Indian languages within a single 74-million-parameter model that occupies approximately 294MB on a device. It can automatically identify the language being spoken, without requiring the user to select it.
The model processes speech at about 8.5x real-time and provides a time-to-first-token of less than 300 milliseconds on a Qualcomm Snapdragon 8 Gen 3 chip.
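The "8.5x real-time" figure is a real-time factor: seconds of audio processed per second of compute. A minimal sketch of how such a number is derived (the durations below are illustrative, not Sarvam's measurements):

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio the model transcribes per second of compute."""
    return audio_seconds / processing_seconds

# Illustrative: a 60-second clip transcribed in about 7 seconds
print(realtime_factor(60.0, 7.0))  # ~8.57, i.e. roughly 8.5x real-time
```

A factor above 1.0 means transcription finishes faster than the audio plays back, which is what makes live, on-device dictation feasible.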
In benchmarks on the Vistaar dataset, which spans 59 test environments across domains such as news and education, the Edge model outperformed Google Cloud STT in several languages, including Hindi, Gujarati, Kannada, Punjabi, and Telugu.
The speech synthesis model has 24 million parameters and a device footprint of about 60MB. Eight speakers and ten languages are supported in a single model, and each speaker's voice identity remains constant across languages. On a Samsung Galaxy S25 Ultra, the model produces its first audio output in 260 milliseconds and synthesises speech at roughly 5.2 times real time.
The model achieves a mean character error rate of 0.0173 on a standard benchmark, indicating that synthesised speech closely matches the intended text across languages. Custom voice cloning is also supported — a new voice can be added using about one hour of audio data and deployed within the same 60MB model file.
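Character error rate (CER) is the character-level edit distance between the synthesised (or transcribed) text and the reference, divided by the reference length; a CER of 0.0173 means roughly 1.7 characters wrong per 100. A minimal sketch of the metric, using illustrative strings rather than Sarvam's benchmark data:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Edit distance normalised by reference length."""
    return levenshtein(hypothesis, reference) / len(reference)

# One substituted character in a 10-character reference gives a CER of 0.1
print(cer("namaste ji", "namaste jI"))
```

Lower is better; a mean CER is typically averaged over many utterances per language.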
The translation model has 150 million parameters and an on-device footprint of around 334MB. It handles bidirectional translation across 110 language pairs, including 10 Indian languages and English, without routing through an intermediate language.
On a Snapdragon 8 Gen 3 processor, it produces a first token in roughly 200 milliseconds and streams at around 30 tokens per second. On the FLORES benchmark, the model outperforms Meta's NLLB-600M, a model four times its size, across all tested Indian languages.
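The two latency figures combine into a rough end-to-end estimate: total time is the time-to-first-token plus the remaining tokens divided by the streaming rate. A small sketch under the reported numbers (the token count is illustrative):

```python
def generation_time(num_tokens: int, ttft_s: float = 0.2,
                    tokens_per_s: float = 30.0) -> float:
    """Approximate wall-clock time to stream num_tokens: the first token
    arrives after ttft_s, and the rest arrive at a steady tokens_per_s."""
    return ttft_s + max(num_tokens - 1, 0) / tokens_per_s

# Illustrative: a ~40-token translated sentence
print(round(generation_time(40), 2))  # first token in 0.2 s, full output in ~1.5 s
```

This is why time-to-first-token matters for perceived responsiveness: the user sees output almost immediately even though the full translation takes longer.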
Because all processing occurs on the device, no user data is sent to external servers. There is also no per-query cost, which Sarvam says makes AI tools viable for education, small businesses, and assistive applications where cloud pricing would otherwise be a barrier.