Sarvam AI was founded with the vision of empowering India’s population by building full-stack generative AI solutions, changing the way more than a billion people across the country interact with technology. The company used Llama to develop enterprise voice AI agents with improved reasoning capabilities that are proficient in 10 Indian languages.
Sarvam leveraged Llama to develop Shuka v1, India’s first open source audio language model. Llama serves as the decoder in Shuka, processing audio tokens generated by Sarvam’s audio encoder. The tokens capture phonetic and linguistic nuances from audio inputs, which Llama decodes into text-based responses. The setup allows Shuka to interpret and respond to voice queries in Indian languages accurately and efficiently.
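In outline, the flow is simple: the encoder turns speech into feature frames, a projector maps those frames into the decoder's embedding space, and Llama generates the textual answer. The following is a minimal sketch of that inference path; `encoder`, `projector`, `llama`, and `tokenizer` are placeholders standing in for Saaras v1, Shuka's trained projector, Llama 3, and its tokenizer, not Sarvam's actual API.

```python
import torch

@torch.no_grad()
def answer_voice_query(audio_waveform: torch.Tensor, prompt_ids: torch.Tensor) -> str:
    feats = encoder(audio_waveform)                   # speech -> feature frames
    audio_embeds = projector(feats)                   # frames -> "audio tokens" in Llama's space
    prompt_embeds = llama.get_input_embeddings()(prompt_ids)
    inputs = torch.cat([audio_embeds, prompt_embeds], dim=1)
    out_ids = llama.generate(inputs_embeds=inputs, max_new_tokens=256)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```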
“Llama is pivotal in ensuring that Shuka’s responses are contextually relevant and linguistically accurate, even in languages like Gujarati, Hindi, Kannada, and Marathi, where voice models are limited,” says Dr. Pratyush Kumar, co-founder of Sarvam AI. “Developing voice-first applications is critical in countries like India, where users prefer to interact via voice rather than text for certain applications.”
Shuka offers a workable approach to voice-first AI in regional languages and marks a breakthrough in multilingual audio comprehension. Businesses can more easily communicate with customers in Gujarati, Hindi, Kannada, Marathi, and other Indic languages through accessible voice-based interactions.
“Because the model can natively decode audio in various languages, it opens up new possibilities for conversational AI applications such as education and customer support,” Kumar says. “And because Shuka is open source, government departments and regulated industries can use it by deploying it on their own premises, without worrying about sensitive data being shared with any third party.”
The Sarvam team chose the 8B-Instruct version of Llama 3 for the v1 model because of its balanced trade-off between computational efficiency and accuracy, making it ideal for decoding Indic languages in a low-resource setting.
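For reference, that checkpoint is Meta’s publicly released meta-llama/Meta-Llama-3-8B-Instruct; loading it with Hugging Face Transformers looks like the snippet below (whether Sarvam used Transformers specifically is an assumption).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
llama = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision keeps the 8B model within a single GPU
    device_map="auto",
)
```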
The team’s initial interest in Llama was sparked by its performance on text-based tasks. They explored adapting the model to decode audio inputs when combined with Sarvam’s custom audio encoder for Indic languages, on which Llama was not extensively trained. The goal was to extend Llama from a text-only model into a multimodal solution that could interpret speech in Indic languages.
When Llama’s potential for audio applications became apparent, the team executed their plan quickly. By combining Llama with Sarvam’s Saaras v1 encoder and a custom 60M-parameter projector layer, the team extended Llama’s utility to handle audio inputs.
To adapt Llama to work effectively with audio inputs, the team trained a projector layer with around 60 million parameters to bridge the gap between the audio representations generated by Sarvam’s audio encoder and Llama’s text embeddings. The projector layer enables the seamless transformation of audio data into a format that Llama can interpret as text.
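A minimal PyTorch sketch of such a projector follows. The encoder width, downsampling factor, and layer layout here are illustrative assumptions; only the roughly 60-million-parameter budget and the module’s role as a bridge come from the article.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the decoder's text-embedding space."""

    def __init__(self, encoder_dim: int = 1280, llama_dim: int = 4096, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        # Stack consecutive frames to shorten the audio sequence, then
        # project into Llama's embedding dimension with a small MLP.
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim * downsample, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, encoder_dim)
        b, t, d = audio_feats.shape
        t = t - t % self.downsample  # drop trailing frames that don't fill a group
        x = audio_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(x)          # (batch, time // downsample, llama_dim)
```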
Since training resources were limited, the team took a frugal approach, fine-tuning only the projector layer and leaving the rest of Llama and Saaras frozen—a strategy that minimized resource use.
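In code, the frugal setup reduces to freezing both pretrained models and optimizing the projector alone. Below is a sketch under the same assumptions as above; the learning rate, loss masking, and batch fields are placeholders, not Sarvam’s published recipe.

```python
# Freeze the pretrained models; only the projector receives gradients.
for p in encoder.parameters():
    p.requires_grad = False
for p in llama.parameters():
    p.requires_grad = False

projector = AudioProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def training_step(batch):
    with torch.no_grad():
        feats = encoder(batch["audio"])                  # frozen encoder
    audio_embeds = projector(feats)                      # trainable bridge
    answer_embeds = llama.get_input_embeddings()(batch["answer_ids"])
    inputs = torch.cat([audio_embeds, answer_embeds], dim=1)
    # Compute loss only on the answer tokens; audio positions are ignored (-100).
    ignore = torch.full(audio_embeds.shape[:2], -100, device=inputs.device)
    labels = torch.cat([ignore, batch["answer_ids"]], dim=1)
    loss = llama(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```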
“It would have taken a great deal to produce Shuka if Llama hadn’t been made available as open source software,” Kumar says. “We were able to focus on innovation in the audio encoder and projector layer and effectively construct a state-of-the-art audio-text model.”
Fine-tuning involved training the projector on a dataset spanning Indic languages, focusing on producing audio tokens compatible with Llama’s embedding space. This required generating high-quality question-answer pairs from Sarvam’s QA datasets, with the questions subsequently processed through Llama 3 to produce gold-standard answers.
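A hedged sketch of that data-generation step: each question is run through the instruct model with greedy decoding to produce a reference answer. The prompt format and decoding settings are assumptions rather than Sarvam’s published recipe.

```python
def make_gold_answer(question: str) -> str:
    # Wrap the question in Llama 3's chat template and generate deterministically.
    messages = [{"role": "user", "content": question}]
    ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(llama.device)
    out = llama.generate(ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```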
Careful fine-tuning of the projector allowed Shuka v1 to strike a balance between accuracy and efficiency, maintaining linguistically accurate responses without requiring extensive retraining of the entire Llama model.
As Llama continues to evolve, Sarvam plans to leverage newer versions to expand Shuka’s capabilities, potentially supporting a broader set of languages and larger training datasets.