Meta introduces Chameleon, a state-of-the-art multimodal model

As competition in the generative AI field shifts toward multimodal models, Meta has previewed what could be its answer to the models coming out of frontier labs. Chameleon, its new family of models, has been designed to be natively multimodal from the start rather than stitched together from components trained on different modalities.

While Meta has not released the models yet, their reported experiments show that Chameleon achieves state-of-the-art performance in various tasks, including image captioning and visual question answering (VQA), while remaining competitive in text-only tasks.

Early-Fusion Multimodal Models

The popular way to create multimodal foundation models is to patch together models that have been trained for different modalities. This approach is called “late fusion,” in which the AI system receives different modalities, encodes them with separate models, and then fuses the encodings for inference. While late fusion works well, it limits the ability of the models to integrate information across modalities and generate sequences of interleaved images and text.
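The late-fusion pipeline described above can be sketched in a few lines. This is a toy illustration, not Meta's or any lab's actual implementation: the encoders are stand-ins that just produce fixed-size vectors, and the fusion step is simple concatenation.

```python
# Toy sketch of "late fusion": each modality is encoded by its own
# model, and the embeddings are combined only afterwards.
# Both encoders below are hypothetical stand-ins.

def text_encoder(text: str) -> list:
    # Toy text encoder: fold character codes into a 4-dim embedding.
    vec = [0.0] * 4
    for i, ch in enumerate(text):
        vec[i % 4] += ord(ch) / 1000.0
    return vec

def image_encoder(pixels: list) -> list:
    # Toy image encoder: fold pixel intensities into a 4-dim embedding.
    vec = [0.0] * 4
    for i, p in enumerate(pixels):
        vec[i % 4] += p / 255.0
    return vec

def late_fusion(text: str, pixels: list) -> list:
    # Fusion happens only after separate per-modality encoding,
    # here by concatenating the two embeddings for a downstream head.
    return text_encoder(text) + image_encoder(pixels)

fused = late_fusion("a cat", [12, 200, 33, 90])
print(len(fused))  # 8: two 4-dim embeddings concatenated
```

The key limitation is visible in the structure: because the modalities never share a representation during encoding, the model cannot freely mix image and text information token by token, which is exactly what early fusion changes.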

Chameleon's Innovative Architecture

Chameleon uses an “early-fusion token-based mixed-modal” architecture, which means it has been designed from the ground up to learn from an interleaved mixture of images, text, code, and other modalities. Chameleon transforms images into discrete tokens, as language models do with words. It also uses a unified vocabulary that consists of text, code, and image tokens. This makes it possible to apply the same transformer architecture to sequences that contain both image and text tokens.
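The unified-vocabulary idea can be sketched as follows. Everything here is an assumption for illustration: the vocabulary sizes, the character-level text tokenizer, and the pixel "quantizer" standing in for a real discrete image tokenizer are toy choices, not Chameleon's actual components.

```python
# Hypothetical sketch of an early-fusion unified vocabulary: text tokens
# and discrete image tokens share one id space, so a single transformer
# can consume (and emit) interleaved sequences end to end.

TEXT_VOCAB_SIZE = 1000       # ids 0..999 reserved for text tokens (assumed)
IMAGE_CODEBOOK_SIZE = 8192   # ids 1000..9191 for image tokens (assumed)

def tokenize_text(text: str) -> list:
    # Toy text tokenizer: one token per character, hashed into the text range.
    return [ord(ch) % TEXT_VOCAB_SIZE for ch in text]

def tokenize_image(pixels: list) -> list:
    # Toy stand-in for a vector-quantizing image tokenizer: map each pixel
    # into the image-token range, offset past the text vocabulary.
    return [TEXT_VOCAB_SIZE + (p % IMAGE_CODEBOOK_SIZE) for p in pixels]

def build_sequence(segments) -> list:
    # Interleave text and image segments into one flat token sequence,
    # the kind of input an early-fusion transformer trains on.
    seq = []
    for kind, data in segments:
        seq += tokenize_text(data) if kind == "text" else tokenize_image(data)
    return seq

seq = build_sequence([("text", "cat:"), ("image", [12, 200, 33])])
# Every id, whether text or image, lives in the same unified vocabulary.
assert all(0 <= t < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE for t in seq)
```

Because generation in this scheme is just next-token prediction over the shared id space, the same model can emit text tokens, image tokens, or any interleaving of the two without a separate image decoder.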

According to the researchers, the most similar model to Chameleon is Google Gemini, which also uses an early-fusion token-based approach. However, Gemini uses separate image decoders in the generation phase, whereas Chameleon is an end-to-end model that both processes and generates tokens.

“Chameleon’s unified token space allows it to seamlessly reason over and generate interleaved image and text sequences, without the need for modality-specific components,” the researchers write.

Challenges and Training Techniques

While early fusion is very appealing, it presents significant challenges when training and scaling the model. To overcome these challenges, the researchers employed a series of architectural modifications and training techniques. In their paper, they share the details about the different experiments and their effects on the model.

The training of Chameleon takes place in two stages, using a dataset of 4.4 trillion tokens of text, image-text pairs, and interleaved sequences of text and images. The researchers trained 7-billion- and 34-billion-parameter versions of Chameleon, using more than 5 million hours of Nvidia A100 80GB GPU time.

Chameleon in Action

According to the experiments reported in the paper, Chameleon can perform a diverse set of text-only and multimodal tasks. On visual question answering (VQA) and image captioning benchmarks, Chameleon-34B achieves state-of-the-art performance, outperforming models like Flamingo, IDEFICS, and Llava-1.5.

According to the researchers, Chameleon matches the performance of other models with “much fewer in-context training examples and with smaller model sizes, in both pre-trained and fine-tuned model evaluations.”

One of the tradeoffs of multimodality is a performance drop in single-modality requests. For example, vision-language models tend to have lower performance on text-only prompts. But Chameleon remains competitive on text-only benchmarks, matching models like Mixtral 8x7B and Gemini-Pro on commonsense reasoning and reading comprehension tasks.

Interestingly, Chameleon can unlock new capabilities for mixed-modal reasoning and generation, especially when the prompts expect mixed-modal responses with text and images interleaved. Experiments with human-evaluated responses show that overall, users preferred the multimodal documents generated by Chameleon.

The Future of Multimodal AI

In the past week, both OpenAI and Google revealed new models that provide rich multimodal experiences. However, they have not released much detail on the models. If Meta continues to follow its playbook and release the weights for Chameleon, it could become an open alternative to private models.

Early fusion can also inspire new directions for research on more advanced models, especially as more modalities are added to the mix. For example, robotics startups are already experimenting with integrating language models into robotics control systems. It will be interesting to see whether early fusion can improve robotics foundation models as well.

“Chameleon represents a significant step towards realizing the vision of unified foundation models capable of flexibly reasoning over and generating multimodal content,” the researchers write.