Multimodal generative AI systems

Last updated Dec 12, 2023

Some generative artificial intelligence (AI) systems use only one type of input, such as text, and produce only one type of output, such as text. Other AI systems accept multiple types of inputs, such as text and images, and can produce various forms of output. These are called multimodal AI systems. The information here focuses on these more complex systems, which function in products such as Ray-Ban Meta smart glasses.

How it works

Multimodal generative AI systems typically rely on models that combine types of inputs, such as images, videos, audio, and words provided as a prompt. The models then convert these inputs into an output, which may include text-based responses, images, videos and/or audio. These models are trained by analyzing large amounts of text and many images, videos or audio recordings. The models learn patterns and the associations between text descriptions and the corresponding images, videos or audio recordings.

This process involves multiple steps, which are described below.



Input

The first step is to provide an input to the system, which may consist of written or verbal prompts, images, video and/or audio.

In the case of Ray-Ban Meta smart glasses, the device’s AI system must be invoked, for example by saying “Hey Meta,” followed by a prompt describing the question or topic you are interested in. For example, while looking at a tree you could say, “Hey Meta, look and tell me what kind of tree this is.” This prompts the glasses to take a photo, and speech-recognition software converts your spoken words into text that can be sent to the model.
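The input step can be pictured with a short sketch. This is an illustration only, not how the glasses are actually implemented: the names MultimodalRequest, build_request and capture_photo are hypothetical, and the rule that a photo is taken only when the prompt begins with “look” is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

WAKE_PHRASE = "hey meta"  # wake phrase quoted in the text above


@dataclass
class MultimodalRequest:
    prompt: str              # text transcribed from the spoken prompt
    image: Optional[bytes]   # photo captured by the device, if any


def build_request(
    transcript: str, capture_photo: Callable[[], bytes]
) -> Optional[MultimodalRequest]:
    """Return a request if the transcript starts with the wake phrase, else None."""
    normalized = transcript.strip()
    if not normalized.lower().startswith(WAKE_PHRASE):
        return None  # the AI system was not invoked
    # Drop the wake phrase and any punctuation that follows it.
    prompt = normalized[len(WAKE_PHRASE):].lstrip(" ,.!").strip()
    # Assumption for this sketch: take a photo only when the prompt starts with "look".
    image = capture_photo() if prompt.lower().startswith("look") else None
    return MultimodalRequest(prompt=prompt, image=image)


# Example: a stand-in camera function that returns placeholder bytes.
request = build_request(
    "Hey Meta, look and tell me what kind of tree this is.",
    capture_photo=lambda: b"placeholder-photo",
)
```

In this sketch, speech without the wake phrase produces no request at all, which mirrors the idea that the AI system must be invoked before anything is sent to the model.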


Safety mechanisms

Safety mechanisms analyze all inputs to detect harmful, offensive or inappropriate content that could produce problematic responses. For all inputs, our existing safety and responsibility guidelines apply.


Model processing

Next, the prompt and any image, video and/or audio are passed to the AI model for interpretation and output generation. In the case of Ray-Ban Meta smart glasses, the captured image and the text transcribed from your spoken words are passed to the AI model.

During this step, the model uses the knowledge it gained during training, where it learned patterns and language from a vast amount of data and images, to generate a coherent and relevant output.

How does the model generate an output?

  • Each type of input (the prompt plus an image, video and/or audio) is processed and then combined to incorporate information from all input types.

  • For text output:

    • A language model predicts the word that is most likely to come next based on the combined information from the input.

    • Typically, the second word of the response is generated by analyzing the input along with the predicted first word of the response. The model then analyzes the new sequence to predict the next word. This process is repeated until the complete response is formed.
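The word-by-word process described above can be sketched with a toy example. Everything here is hypothetical: the score table stands in for what a real model learns during training, the image labels stand in for information extracted from a photo, and picking the single highest-scoring word each time is a simplification.

```python
# Hypothetical next-word scores keyed by the previous word of the response.
BASE_SCORES = {
    "<start>": {"that": 0.9, "the": 0.1},
    "that": {"is": 0.8, "looks": 0.2},
    "is": {"an": 0.7, "a": 0.3},
    "an": {"oak": 0.6, "elm": 0.4},
    "oak": {"tree": 0.9, "<end>": 0.1},
    "elm": {"tree": 0.9, "<end>": 0.1},
    "tree": {"<end>": 1.0},
}


def combine_inputs(prompt, image_labels):
    """Merge prompt words and image-derived labels into one context list."""
    return prompt.lower().split() + [f"<img:{label}>" for label in image_labels]


def next_word_scores(context, response):
    """Score candidate next words using the context and the response so far."""
    last = response[-1] if response else "<start>"
    scores = dict(BASE_SCORES.get(last, {"<end>": 1.0}))
    for word in scores:
        # Toy rule: favor words that match something seen in the image.
        if f"<img:{word}>" in context:
            scores[word] += 1.0
    return scores


def generate(context, max_words=10):
    """Build a response one word at a time until an end marker is predicted."""
    response = []
    for _ in range(max_words):
        scores = next_word_scores(context, response)
        word = max(scores, key=scores.get)  # most likely next word
        if word == "<end>":
            break
        response.append(word)  # the new word becomes part of the next prediction
    return " ".join(response)


context = combine_inputs("What kind of tree is this?", ["elm"])
print(generate(context))
```

Changing the image label from "elm" to "oak" changes the response, which illustrates how the combined information from all input types shapes each predicted word.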


Output processing

The output that the model generates might undergo processing for refinement and enhancement. For example, the model might select the most relevant and appropriate text-based response to improve quality. It also might apply additional safety measures to help prevent the generation of harmful or offensive outputs.
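As a rough illustration of output processing, the sketch below filters candidate responses through a simple safety check and then picks the most relevant one. The blocklist, the word-overlap relevance score and the fallback message are all assumptions made for this example; production systems use far more sophisticated methods.

```python
BLOCKED_TERMS = {"badword"}  # hypothetical placeholder blocklist


def _words(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,!?").lower() for w in text.split()}


def is_safe(text):
    """Toy safety check: reject text containing any blocked term."""
    return _words(text).isdisjoint(BLOCKED_TERMS)


def relevance(prompt, candidate):
    """Toy relevance score: share of prompt words that reappear in the candidate."""
    prompt_words = _words(prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & _words(candidate)) / len(prompt_words)


def select_response(prompt, candidates, fallback="Sorry, I can't help with that."):
    """Drop unsafe candidates, then return the most relevant remaining one."""
    safe = [c for c in candidates if is_safe(c)]
    if not safe:
        return fallback
    return max(safe, key=lambda c: relevance(prompt, c))


best = select_response(
    "What kind of tree is this?",
    ["This tree is an oak.", "I like turtles.", "This badword tree is of some kind."],
)
print(best)
```

Applying the safety check before ranking means an unsafe candidate can never be selected, no matter how relevant it appears.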


Output delivery

Finally, the model provides an output.

When using Ray-Ban Meta smart glasses, text-based responses are spoken through the speakers in the glasses. These responses, along with any images, are also provided in the accompanying app.

Note that the output may vary even if the same inputs are used. This may be due to the intentionally dynamic nature of the model or because of the output processing step described above.
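One common reason for this kind of variation in generative models is that the next word is sampled at random according to its probability rather than always being the single most likely choice. The sketch below illustrates that general idea with made-up probabilities. It is an assumption for illustration and does not describe the specific behavior of any Meta model.

```python
import random

# Hypothetical probabilities for the next word of a response.
NEXT_WORD_PROBS = {"oak": 0.6, "elm": 0.3, "maple": 0.1}


def pick_greedy(probs):
    """Always choose the most likely word, so the result never changes."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Choose a word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random()  # unseeded, so results can differ from run to run
print("greedy: ", [pick_greedy(NEXT_WORD_PROBS) for _ in range(5)])
print("sampled:", [pick_sampled(NEXT_WORD_PROBS, rng) for _ in range(5)])
```

The greedy list repeats the same word every time, while the sampled list can differ between runs even though the input probabilities are identical.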

Also note that some words in the prompt or parts of the image, video or audio may be more important for output generation than others.

Usage tips

You can experience the type of generative AI system that uses multiple types of inputs and outputs by using Ray-Ban Meta smart glasses. When using that product, you have multiple options to control, customize and enhance your experience. Here are some tips. Options shown here may not be available to everyone.

Compose your prompt thoughtfully

Your prompt is the most important control you have over the output you will receive from the system. Try different prompts to see how they change the system’s output.

Use clear and specific prompts

For example, you might say, “Hey Meta, look and tell me what I can make for dinner with these ingredients.”

View your stored information

You can view and download information you’ve provided to the AI system.

Manage your privacy

You have options to view and manage your privacy settings. You can also choose to delete media from your gallery, which will also delete it from Meta’s cloud.

Customize location services

You can adjust your location permissions to match your preferences.

Use best practices

While using Ray-Ban Meta smart glasses, it’s important to be mindful of others around you and to respect their comfort and privacy. The capture LED light lets people know whenever you’re using the camera.

Choose whether to use Meta AI

By turning voice controls on or off, you can decide whether or not to use Meta AI. You can change this setting whenever you choose.

Refine the AI system’s responses

If the AI system’s response is not ideal, provide instructions in one or more steps for how you want the response to change. For example, “Make that shorter and use a friendlier tone,” followed by, “Make it even shorter.”

Data usage

A large amount of data is required to teach effective generative AI models, so multiple sources are used for training. These sources include information that is publicly available online and licensed information, as well as information from Meta’s products and services. More details on how we use information from Meta’s products and services are available in our Privacy Policy.

When we collect public information from the internet or license data from other providers to train our models, it may include personal information. For example, a public blog post may include the author’s name and contact information. When we do get personal information as part of the public and licensed data that we use to train our models, we don’t specifically link it to any Meta account. To learn more, visit the Privacy Center.

What to be aware of

Meta’s multimodal AI technology is still advancing, and there are important limits you should understand about how it works. For example, the text generated by AI may not be relevant, accurate or appropriate. Some of the reasons for this are:

  • Models are capable of generating human-like content through predictions based on patterns they learned during development. However, they lack the ability to verify the accuracy or reliability of the outputs they produce. Carefully review responses for accuracy. Remember that AIs aren’t human, even though they may respond in ways that seem like real people.

  • Models may generate responses that include fabricated or entirely fictional information. In other words, the model “hallucinates” content that does not originate from the data used to train it. Some examples of this may include:

    • Creating fictional events, people, places or sources of information

    • Providing details or explanations that are not based on facts

    • Claiming to be a real person

  • Models may produce output that is offensive due to limitations of the data on which they were trained, as well as the hallucinations mentioned above. If you see anything that concerns you, provide feedback in the app you’re using.

  • The output that a model generates may not be up to date. Learn more in the Privacy Center.

  • Our language models were trained primarily on data in English, so performance may vary when using other languages to interact with our generative AI features.