
Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built on CLIP Embeddings and Flow Matching for Image Understanding and Generation

Multimodal modeling focuses on building systems that can understand and generate content across visual and textual formats. These models are designed to interpret visual scenes and produce new images from natural language prompts. With growing interest in bridging vision and language, researchers are working to integrate image understanding and image generation capabilities into a single unified system. This approach eliminates the need for separate pipelines and opens the way for more coherent and intelligent interactions across modalities.

The main challenge in this field is to develop architectures that handle both understanding and generation without compromising the quality of either. Models must grasp complex visual concepts and produce high-quality images that match user prompts. The difficulty lies in identifying suitable image representations and training procedures that support both tasks. The problem becomes even more apparent when the same model is expected to interpret detailed text descriptions and then generate visually accurate outputs from them, which demands consistency between semantic understanding and pixel-level synthesis.

Previous methods have typically used variational autoencoders (VAEs) or CLIP-based encoders to represent images. VAEs are effective for reconstruction, but because they encode lower-level features, they often yield less informative representations. CLIP-based encoders provide rich semantic embeddings by learning from large-scale image-text pairs; however, CLIP was not designed for image reconstruction, which makes it difficult to use for generation unless it is paired with a model such as a diffusion decoder. On the training side, mean squared error (MSE) is widely used for its simplicity but tends to produce deterministic outputs. To improve the diversity and quality of generated images, researchers have turned to flow matching, which introduces controlled randomness and better models the continuous nature of image features.
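To make the contrast with MSE concrete, here is a minimal sketch of a flow-matching training objective over continuous image features. The velocity network, feature dimension, and random data are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal flow-matching objective over continuous feature vectors.
# `velocity_net` is a stand-in for a diffusion transformer; here it is
# a small MLP so the example runs on its own.
feature_dim = 64
velocity_net = nn.Sequential(
    nn.Linear(feature_dim + 1, 256), nn.SiLU(), nn.Linear(256, feature_dim)
)

def flow_matching_loss(clean_features: torch.Tensor) -> torch.Tensor:
    """Regress the velocity that transports noise toward the clean features."""
    noise = torch.randn_like(clean_features)
    t = torch.rand(clean_features.size(0), 1)        # random timestep in [0, 1]
    x_t = (1 - t) * noise + t * clean_features        # point on the noise-to-data path
    target_velocity = clean_features - noise          # constant velocity along that path
    pred_velocity = velocity_net(torch.cat([x_t, t], dim=-1))
    return ((pred_velocity - target_velocity) ** 2).mean()

# Usage: one training step on a batch of CLIP-like feature vectors.
features = torch.randn(8, feature_dim)
loss = flow_matching_loss(features)
loss.backward()
```

Unlike plain MSE regression to a single target image, the randomly sampled noise and timestep give the model a distribution of valid trajectories to learn, which is where the added diversity comes from.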

Researchers at Salesforce Research, in collaboration with the University of Maryland and several other academic institutions, introduced BLIP3-o, a family of unified multimodal models. The model adopts a sequential two-stage training strategy: image understanding first, then image generation. The proposed system uses CLIP embeddings to represent images and integrates them with a diffusion transformer to synthesize new visual outputs. Unlike earlier joint-training approaches, the sequential method preserves the strength of each task independently: the diffusion module is trained while the autoregressive backbone remains frozen, avoiding task interference. To improve alignment and visual fidelity, the team also curated BLIP3o-60k, a high-quality instruction-tuning dataset created by prompting GPT-4o across varied visual categories, including scenes, objects, gestures, and text. They developed two model versions: an 8-billion-parameter model trained on a mix of proprietary and public data, and a 4-billion-parameter version trained only on open-source data.
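As a rough illustration of this sequential strategy, the snippet below freezes a placeholder autoregressive backbone and optimizes only a placeholder diffusion module. The module definitions and sizes are assumptions for the sketch, not the actual BLIP3-o components.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the autoregressive backbone and the
# diffusion module; the real model uses far larger components.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
diffusion_module = nn.Sequential(nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 64))

# Stage 2 (image generation): freeze the backbone so understanding is untouched...
for p in backbone.parameters():
    p.requires_grad = False

# ...and update only the diffusion module's parameters.
optimizer = torch.optim.AdamW(diffusion_module.parameters(), lr=1e-4)
```

Keeping the backbone's weights out of the optimizer is what prevents the generation stage from degrading the understanding abilities learned in stage one.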

BLIP3-o’s image generation pipeline is built on the Qwen2.5-VL large language model. Prompts are processed into visual features, which are then refined by a flow-matching diffusion transformer. The transformer is based on the Lumina-Next architecture and is optimized for speed and quality with 3D rotary position embeddings and grouped-query attention. The model encodes each image into 64 fixed-length semantic vectors, which supports compact storage and efficient decoding regardless of resolution. The research team trained the model on a large-scale dataset of 25 million images from sources such as CC12M, SA-1B, and JourneyDB, expanded with 30 million proprietary samples for the 8B model. The data also includes the 60k instruction-tuning samples generated with GPT-4o, covering challenging prompts such as complex gestures and landmarks.
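The sketch below illustrates this inference flow at a high level: prompt-conditioned features guide a flow-matching model that integrates from noise to 64 fixed-length semantic vectors, which would be decoded to pixels downstream. All modules, dimensions, and step counts are simplified stand-ins, not the actual Qwen2.5-VL or Lumina-Next components.

```python
import torch
import torch.nn as nn

# Illustrative generation loop: condition on prompt features, then integrate
# the learned velocity field from Gaussian noise to image features.
num_tokens, dim = 64, 512                      # 64 fixed-length semantic vectors

condition_proj = nn.Linear(dim, dim)           # stands in for the LLM's prompt features
velocity_model = nn.Sequential(
    nn.Linear(2 * dim + 1, 1024), nn.SiLU(), nn.Linear(1024, dim)
)

@torch.no_grad()
def generate_semantic_vectors(prompt_features: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Euler-integrate from noise to CLIP-like semantic vectors."""
    cond = condition_proj(prompt_features)      # (num_tokens, dim) conditioning
    x = torch.randn(num_tokens, dim)            # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((num_tokens, 1), i * dt)
        v = velocity_model(torch.cat([x, cond, t], dim=-1))  # predicted velocity
        x = x + v * dt                          # simple Euler step
    return x                                    # 64 semantic vectors for the image decoder

prompt_features = torch.randn(num_tokens, dim)
vectors = generate_semantic_vectors(prompt_features)
print(vectors.shape)  # torch.Size([64, 512])
```

Working in this compact 64-vector semantic space, rather than on pixels directly, is what keeps storage and decoding cost independent of image resolution.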

In terms of performance, BLIP3-o posts top scores across multiple benchmarks. For image generation, the 8B model reached 0.84 on GenEval, and its WISE score for reasoning ability was 0.62. For image understanding, it scored 1682.6 on MME Perception, 647.1 on MME Cognition, 50.6 on MMMU, and 83.1 on the VQAv2 and TextVQA datasets. Human evaluations comparing BLIP3-o 8B with Janus Pro 7B showed that BLIP3-o's visual quality was preferred 50.4% of the time and its prompt alignment 51.5% of the time. These results were supported by statistically significant p-values (5.05e-06 and 1.16e-05), indicating BLIP3-o's advantage in subjective quality assessment.

This work outlines a clear solution to the dual challenge of image understanding and image generation. CLIP embeddings, flow matching, and a sequential training strategy show how the problem can be tackled methodically. The BLIP3-o model delivers state-of-the-art results and introduces an efficient, open approach to unified multimodal modeling.


Check out the paper, GitHub page, and models. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
