Untukmu.AI, an Indonesian platform that offers personalized gift recommendations for birthdays, anniversaries, and corporate holidays, has developed a distinctive approach to protecting the privacy of its customers’ data by running part of Llama on edge devices. The team designed a semi-decentralized personal assistant that ensures the company won’t have access to customer data. According to Puja Romulus, the company’s Senior ML Engineer, data is stored only on customers’ edge devices and accounts.
“My colleagues challenged me to fully deploy the model on edge devices,” Romulus says. “However, even after quantization, the model’s compute and memory needs were still too high, and the output quality degraded. I proposed split inference—processing a small part of inference on the edge and the bulk on our servers. This approach addresses privacy concerns while keeping things computationally feasible, allowing us to use the original model without quantization.”
Large language models (LLMs) excel at extracting valuable insights from unstructured text, but this process requires feeding data into the model, which is a challenge for users who want to protect sensitive information. While deploying open weight models on-premises can mitigate privacy concerns, the high resource demands often make this impractical for edge applications. Split inference is an attractive engineering solution because it ensures service providers and third-party partners aren’t able to access customer data while still enabling insightful analysis.
When the Untukmu.AI team started to explore open weight models for privacy-preserving applications, they evaluated several options for entity extraction and product recommendation tasks. Llama 3.1 8B emerged as the best choice for its balance between output quality and resource efficiency. The model performed well for the team’s needs and, best of all, didn’t require fine-tuning. The fact that the Llama roadmap included potential multimodal versions cemented the decision.
For split inference, the model’s 32 transformer layers are divided between two checkpoints. The first checkpoint covers the model up to and including the first transformer layer (layer 0), while the second checkpoint holds the remaining 31 layers and the output layer. These checkpoints run separately, the first on the edge and the second in the cloud.
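The split described above can be sketched as a partition of checkpoint keys. This is a minimal illustration, not Untukmu.AI's actual code: the function name `split_state_dict` is hypothetical, the key names (`tok_embeddings`, `layers.N.…`, `norm`, `output`) are assumed to follow the layout of the Llama reference checkpoints, and placing the token embeddings in the edge checkpoint is an assumption. Placeholder strings stand in for real weight tensors.

```python
def split_state_dict(state_dict, edge_layers=1):
    """Partition checkpoint entries into an edge part and a cloud part.

    Assumption: keys look like 'layers.<index>.<...>' for transformer
    blocks, plus 'tok_embeddings.*', 'norm.*' and 'output.*'.
    """
    edge, cloud = {}, {}
    for key, value in state_dict.items():
        parts = key.split(".")
        if parts[0] == "layers":
            # Transformer blocks below the split index stay on the edge.
            target = edge if int(parts[1]) < edge_layers else cloud
        elif parts[0] == "tok_embeddings":
            # Assumed: embeddings are needed before layer 0, so they go to the edge.
            target = edge
        else:
            # Final norm and output projection follow the last layer.
            target = cloud
        target[key] = value
    return edge, cloud


# Toy checkpoint: one entry per layer, with strings in place of tensors.
toy_checkpoint = {
    "tok_embeddings.weight": "E",
    "norm.weight": "N",
    "output.weight": "O",
}
for i in range(32):
    toy_checkpoint[f"layers.{i}.attention.wq.weight"] = f"W{i}"

edge_ckpt, cloud_ckpt = split_state_dict(toy_checkpoint)
print(sorted(edge_ckpt))   # embeddings plus layer 0
print(len(cloud_ckpt))     # 31 layers plus norm and output
```

With real weights, each part would then be saved as its own checkpoint file so that the edge device never needs to download the 31 cloud-side layers.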
Before loading the checkpoints, it’s crucial to ensure the model architecture aligns with each split. This means the original model.py should be divided into model_edge.py for the first layer and model_cloud.py for the remaining layers.
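One way to picture the division into an edge module and a cloud module is the toy sketch below. It is not the real Llama architecture: each "layer" is a cheap deterministic function standing in for a transformer block, the embedding and output head are omitted, and the class names `FullModel`, `EdgeModel`, and `CloudModel` are hypothetical. The point it illustrates is that running the edge part followed by the cloud part should reproduce the unsplit forward pass.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(index: int) -> Layer:
    """Stand-in for a transformer block (not real attention or MLP)."""
    def layer(hidden: Vector) -> Vector:
        return [0.5 * x + index for x in hidden]
    return layer


def run_layers(layers: List[Layer], hidden: Vector) -> Vector:
    for layer in layers:
        hidden = layer(hidden)
    return hidden


class FullModel:
    """Corresponds to the unsplit model: all layers in one place."""
    def __init__(self, n_layers: int = 32):
        self.layers = [make_toy_layer(i) for i in range(n_layers)]

    def forward(self, hidden: Vector) -> Vector:
        return run_layers(self.layers, hidden)


class EdgeModel:
    """Hypothetical edge-side module: only the layers before the split."""
    def __init__(self, split_at: int = 1):
        self.layers = [make_toy_layer(i) for i in range(split_at)]

    def forward(self, hidden: Vector) -> Vector:
        return run_layers(self.layers, hidden)


class CloudModel:
    """Hypothetical cloud-side module: the layers after the split."""
    def __init__(self, n_layers: int = 32, split_at: int = 1):
        self.layers = [make_toy_layer(i) for i in range(split_at, n_layers)]

    def forward(self, hidden: Vector) -> Vector:
        return run_layers(self.layers, hidden)


x = [1.0, -2.0, 3.0]
full_out = FullModel().forward(x)
split_out = CloudModel().forward(EdgeModel().forward(x))
print(full_out == split_out)
```

A check like this, comparing split and unsplit outputs on the same input, is a reasonable sanity test after dividing a model definition in two.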
The process begins with a prompt from either Untukmu.AI or a third party, typically consisting of predefined questions. These prompts are sent to the user’s device, where they are merged with the user’s data and sanitized before processing.
The merged prompt is then run at the edge to generate the first-layer tensor. This tensor is sent to the server, where the remaining computations predict the next token, which is then sent back to the edge. This iterative process continues, requiring ongoing communication between the edge and server, until a stop token is detected. Once the stop token is found, the server converts the list of tokens into full text, which is then sent back to both the user and the third party as the final response to the original prompt.
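The edge-server round trip described above can be sketched as a simple loop. Everything here is a toy under stated assumptions: `edge_first_layer` and `cloud_next_token` are hypothetical stand-ins for the two model halves, the vocabulary and the token-selection rule are invented so the loop terminates, network transport is replaced by direct function calls, and the full sequence is reprocessed on every step (no key-value caching).

```python
from typing import List

STOP_TOKEN = 0
VOCAB = ["<stop>", "gift", "a", "book", "recommend"]  # toy vocabulary


def edge_first_layer(token_ids: List[int]) -> List[float]:
    """Runs on the device: turns token ids into a first-layer 'tensor'."""
    return [0.5 * t for t in token_ids]


def cloud_next_token(hidden: List[float]) -> int:
    """Runs on the server: sees only the hidden values, returns a token id.

    Toy rule: count down through the vocabulary as the sequence grows,
    so the stop token is eventually produced.
    """
    return max(0, len(VOCAB) - 1 - len(hidden))


def split_generate(prompt_ids: List[int], max_new_tokens: int = 16) -> str:
    edge_ids = list(prompt_ids)   # merged prompt, kept on the device
    generated: List[int] = []     # tokens collected on the server side
    for _ in range(max_new_tokens):
        hidden = edge_first_layer(edge_ids)   # edge -> server
        token = cloud_next_token(hidden)      # server -> edge
        if token == STOP_TOKEN:
            break
        generated.append(token)
        edge_ids.append(token)                # edge extends its sequence
    # The server converts the collected tokens into text.
    return " ".join(VOCAB[t] for t in generated)


print(split_generate([4, 3]))
```

Because each generated token requires one exchange between device and server, latency in a real deployment would depend heavily on the network, which is why a cap such as `max_new_tokens` is worth keeping alongside the stop-token check.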
The image below captures the end of the process, at which point the user receives full text output. The right panel shows the last iteration of split inference on the server. Processing begins at the second transformer layer (layer 1, since layer 0 is processed at the edge) and continues through layer 31 and the output layer.
The left panel of the user interface displays three message summaries: the system prompt, the customer’s profile, and product recommendations. The system prompt is the initial prompt from Untukmu.AI, followed by personal information stored on the customer’s device, and finally, three product recommendations generated through split inference. This summary provides transparency into the process, ensuring that the user has full visibility over their data.
Untukmu.AI then uses the assistant’s response to search its partner database, matching the customer with suitable gift recommendations. Untukmu.AI’s system can be adapted by other third-party vendors, such as insurance companies or advertisers, to offer personalized products and services to the same person.
The company’s data visibility policy states that users have full access to all of their information, allowing them to monitor how their data is used. Untukmu.AI, as the service provider, has access to everything except the customer’s personal data. Third-party providers can’t see customers’ personal data or the Untukmu.AI prompt, but they can view their own prompts and the resulting output.
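The visibility policy above can be written down as a small lookup table. This is only an illustrative encoding of the rules as stated in the text: the role names, item names, and the `can_view` helper are hypothetical and do not come from Untukmu.AI's system.

```python
# Which items each role may see, following the policy as described.
VISIBILITY = {
    "user": {"personal_data", "provider_prompt", "third_party_prompt", "output"},
    # The service provider sees everything except personal data.
    "service_provider": {"provider_prompt", "third_party_prompt", "output"},
    # A third party sees only its own prompt and the resulting output.
    "third_party": {"third_party_prompt", "output"},
}


def can_view(role: str, item: str) -> bool:
    """Return True if the role is allowed to see the item; unknown roles see nothing."""
    return item in VISIBILITY.get(role, set())


for role in VISIBILITY:
    print(role, "-> personal_data visible:", can_view(role, "personal_data"))
```

Defaulting unknown roles to an empty set keeps the check deny-by-default, which suits a policy whose purpose is limiting exposure.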
Industries that manage large volumes of unorganized data while needing to protect sensitive user information can greatly benefit from deploying Llama for split inference processing. With the end of the cookie era and growing concerns about privacy and protecting Personally Identifiable Information (PII), edge deployment is shifting from a niche solution to a necessity.
Untukmu.AI’s current priority is implementing split inference, particularly with larger Llama models, to ensure high-quality output while preventing customer data from being fully exposed to service providers or third-party systems. While split inference is their primary focus, the company says they’re always exploring innovative ways to protect their customers’ data.