Large Language Model

Scaling an AI coding assistant built with Llama to hundreds of thousands of users

January 22, 2025
7 minute read

Codeium’s AI-enabled coding assistant is already beloved by hundreds of thousands of daily active users, and the company is on a mission to help developers and organizations dream bigger with Codeium. While Codeium has many modalities, its Chat and Command features in its Integrated Development Environment (IDE) extensions leverage Llama for its free and premium offerings.

Codeium’s IDE plug-ins include these features so that developers can converse with a code-aware AI that addresses a wide range of use cases, including documentation, explanations, and unit tests. Codeium Chat can generate whole functions and applications, and for developers diving into an unfamiliar codebase, it can explain everything needed at the push of a button. Developers can also chat with the assistant to fix a bug, add a new feature, refactor, translate existing code, or enhance visuals. Codeium Command can modify code directly, leveraging Llama to quickly take in context, write code, and display line-by-line diffs against the original.

“In every organization, writing code is a major bottleneck—we’re aiming to help reduce bottlenecks in both personal and business use cases with developers,” says Jeff Wang, Head of Business at Codeium. “We’ve scaled Llama models to hundreds of thousands of users. In addition to coding efficiency, another big benefit is decreased employee onboarding time as a result of having these tools. We have had customers go from three to six months to three to six weeks to onboard new engineers.”

While Codeium had trained its own autocomplete and chat models for customers, the company found that a fine-tuned Llama 3.1 gave it an open source family of large, general-purpose models that it could control. It employs multiple Llama 3.1 Instruct variants (70B and 405B), finding that they performed better on the zero-shot tasks the team cared about and showed a slight advantage as a foundation for fine-tuning, as sketched below.
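Codeium has not published its fine-tuning recipe, but a common starting point for adapting a Llama 3.1 Instruct checkpoint is parameter-efficient fine-tuning with LoRA. The sketch below uses Hugging Face Transformers and PEFT; the hyperparameters and target modules are illustrative assumptions, not Codeium's configuration.

```python
# Minimal LoRA fine-tuning setup for a Llama 3.1 Instruct base.
# Hyperparameters and target modules are illustrative assumptions,
# not Codeium's published recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-70B-Instruct"  # gated on Hugging Face; requires access
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# Low-rank adapters on the attention projections keep the trainable
# parameter count to a small fraction of the 70B base weights.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base model
```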

Integrating Llama

Codeium Chat and Command are integrated into the user's IDE to deliver these models. Codeium indexes the complete codebase to deliver context-aware responses through a reasoning pipeline that combines retrieval-augmented generation with reranking. Later, Codeium would release “Riptide,” which uses an even more sophisticated retrieval mechanism.
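Codeium has not detailed its retrieval pipeline, but the retrieve-then-rerank pattern it describes generally works in two stages: a fast bi-encoder recalls candidate code chunks from the index, and a slower cross-encoder reranks them before they go into the prompt. The sketch below illustrates the pattern; the model names and chunk granularity are assumptions, not Codeium's stack.

```python
# Two-stage retrieval sketch: fast vector recall, then cross-encoder rerank.
# Not Codeium's actual stack; model names and chunking are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # cheap recall
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # precise rerank

def index_codebase(chunks: list[str]) -> np.ndarray:
    """Embed each code chunk once; normalized vectors enable cosine search."""
    return embedder.encode(chunks, normalize_embeddings=True)

def retrieve_context(query: str, chunks: list[str], vectors: np.ndarray,
                     recall_k: int = 50, final_k: int = 5) -> list[str]:
    """Stage 1: recall top candidates by cosine similarity.
    Stage 2: rerank those candidates with the cross-encoder."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    candidates = np.argsort(vectors @ q)[::-1][:recall_k]
    scores = reranker.predict([(query, chunks[i]) for i in candidates])
    keep = np.argsort(scores)[::-1][:final_k]
    return [chunks[candidates[i]] for i in keep]
```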

For free users, the Codeium Chat base model is Llama 3.1 70B; paid tiers add chat options such as unlimited Llama 3.1 405B. While Codeium deploys its own models for enterprise, customers who self-host Codeium Chat can choose these models or plug into other ecosystems. For SaaS users, Llama models offer a far lower cost of ownership than calling closed source models through a paid API.

“We have successfully served Llama models to thousands of engineers using a single GPU, even alongside other models,” Wang says. “In our Enterprise deployments, one H100 can support up to a thousand engineers, including our own autocomplete models in the same instance.”
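Codeium has not published its serving configuration, but one way to co-locate a Llama chat model with other models on a single GPU is to cap the inference engine's memory budget and leave headroom for the rest of the card. Below is a minimal sketch using vLLM; the model choice, memory fraction, and sampling settings are assumptions for illustration.

```python
# Sketch: serving a Llama model with a capped GPU memory budget so other
# models (e.g., an autocomplete model) can share the same card.
# Model choice and memory fraction are assumptions, not Codeium's config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # a 70B model would need quantization or more GPUs
    gpu_memory_utilization=0.70,               # leave ~30% of the card for other models
)
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Explain what this does: def f(x): return x * x"], params)
print(outputs[0].outputs[0].text)
```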

Discussions with partners indicate that Codeium’s ability to serve Llama at scale is a competitive advantage, thanks to optimizations the team has implemented at the hardware level and in context handling and inference across models. For Codeium’s editor tasks, Llama models are 90% more cost effective, have 3x lower latency, and are more accurate than any comparable model.

Codeium Chat would later evolve into Cascade within the company's new IDE.

Windsurf and Cascade

In late November 2024, Codeium released Windsurf, the first truly agentic IDE and one of the first generally accessible agentic products. Cascade, the agent within Windsurf, can perform multi-step reasoning, make multi-file edits, and generally take action on behalf of the developer. “By also leveraging Codeium’s existing deep contextual awareness capabilities, Windsurf can not only create applications from zero to one, but also make complicated multi-file edits in production codebases, all quickly enough to keep the human in the loop and in flow state,” says Anshul Ramachandran, a member of Codeium's founding team. “To achieve the latency and quality we desired, we fine-tuned multiple Llama-based models for various tasks. That turned into a magical experience that everyone, from those with zero coding background to seasoned developers, could benefit from.”
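Cascade's internals are not public, but the agentic pattern described here, where a model proposes actions and a harness executes them and feeds the results back, can be sketched as a simple loop. The `llm_step` function and the action schema below are hypothetical placeholders, not Cascade's API.

```python
# Generic agent-loop sketch: the model proposes an action, the harness
# executes it and appends the result, until the model signals completion.
# `llm_step` and the action schema are hypothetical, not Cascade's API.
import pathlib

def llm_step(transcript: list[dict]) -> dict:
    """Placeholder for a call to a fine-tuned Llama model. Expected to return
    e.g. {"tool": "edit_file", "path": ..., "content": ...} or {"tool": "done"}."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> None:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm_step(transcript)
        if action["tool"] == "done":
            break
        if action["tool"] == "edit_file":  # multi-file edits happen one step at a time
            pathlib.Path(action["path"]).write_text(action["content"])
            result = f"wrote {action['path']}"
        else:
            result = f"unknown tool: {action['tool']}"
        transcript.append({"role": "tool", "content": result})
```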

Windsurf has grown to hundreds of thousands of daily active users within just the first couple of months, helping to usher in the agentic age of AI.

Looking to the future

Using an open source model has proved crucial for Codeium. Many tasks demand both high quality and low latency, making it essential to control the complete model fine-tuning and serving stack. The open source community has contributed to many parts of that stack, and the standardization of Llama-architecture models has accelerated this progress.

“Llama offers a great code generation model right out of the box, with lots of potential and flexibility for further fine-tuning,” Ramachandran says. “We’re always trying to move to where the industry moves, and we hope that Llama can continue to close gaps as the Llama ecosystem continues to grow and we try to build new products.”

