My takeaway series follows a Q&A format to explain AI concepts at three levels:
- For anyone with general knowledge who wants to understand the concept.
- For anyone who wants to dive into the code implementation details of the concept.
- For anyone who wants to understand the mathematics behind the technique.
Vision-Language Models (VLMs) are a class of AI models designed to understand and generate content involving both visual and textual information. They process and integrate data from images (or videos) and text simultaneously, enabling tasks that require a combination of visual and linguistic understanding, such as:
- Visual Question Answering (VQA): Answering questions about the content of an image.
- Image Captioning: Generating descriptive text for an image.
The flow of processing visual and textual data together in VLMs typically involves (see the code sketch after this list):
- Processing visual data and text data separately.
- Combining the processed information in a joint representation space.
- Further processing the combined information for the final task.
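As a rough illustration, the sketch below shows how these three steps typically map onto code. It is a minimal toy model in PyTorch; the module names (`ToyVLM`, the stand-in encoders, the fusion transformer) are placeholder assumptions, not any real model's architecture.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Minimal sketch of the generic VLM flow: encode each modality,
    project into a joint space, fuse, then decode for the target task."""

    def __init__(self, vision_dim=768, text_dim=512, joint_dim=512, vocab_size=32000):
        super().__init__()
        # Step 1: separate encoders for each modality (toy placeholders here).
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(vision_dim))
        self.text_encoder = nn.Embedding(vocab_size, text_dim)
        # Step 2: projections into a shared (joint) representation space.
        self.vision_proj = nn.Linear(vision_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Step 3: joint processing of the combined sequence for the final task.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=joint_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(joint_dim, vocab_size)  # e.g., next-token logits

    def forward(self, image, text_ids):
        img_feats = self.vision_proj(self.vision_encoder(image)).unsqueeze(1)  # (B, 1, D)
        txt_feats = self.text_proj(self.text_encoder(text_ids))               # (B, T, D)
        joint = torch.cat([img_feats, txt_feats], dim=1)                      # (B, 1+T, D)
        return self.head(self.fusion(joint))                                  # task-specific output


# Toy usage: two 224x224 RGB images and 5-token text prompts.
model = ToyVLM()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 32000])
```

Real VLMs differ in where and how the combination happens, which is exactly what distinguishes the paradigms below.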
Throughout the years, various paradigms have been proposed to achieve this, including:
- Dual-Encoder: These models use separate encoders for images and text and align their outputs in a shared representation space, typically with a contrastive objective (see the contrastive-loss sketch after this list). Examples are CLIP, ALIGN, SigLIP, EVA-CLIP.
- Cross-Modal Fusion: These models bridge a vision encoder and a language model with cross-attention layers or a lightweight querying module, so that text generation, and tasks such as cross-modal retrieval (finding the text that matches an image, or vice versa), can attend to visual features. Examples are BLIP, BLIP-2, Flamingo, PaLI.
- Align-then-Decode: Images are first encoded into embeddings, which are then projected into the language model's token embedding space. A language model (like GPT) then decodes the aligned visual tokens together with the text prompt into coherent text (see the projector sketch after this list). Examples are LLaVA, MiniGPT-4, InstructBLIP, Qwen-VL.
- Multimodal Transformers (End-to-end VLM): These models extend the transformer architecture to handle multiple modalities (e.g., images and text) simultaneously. They typically use attention mechanisms to learn relationships between the different types of data. Examples are GPT-4o, Gemini 1.5, Qwen2-VL, InternVL2, Kosmos-2.
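To make the dual-encoder idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss. It assumes the two separate encoders already produce fixed-size feature vectors; the function name and shapes are illustrative, not CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs.

    image_feats, text_feats: (B, D) embeddings from the two separate encoders.
    Matched pairs share the same row index; all other rows act as negatives.
    """
    # Normalize both modalities so the dot product equals cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (B, B) similarity matrix: entry [i, j] compares image i with text j.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random features standing in for encoder outputs.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```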
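Similarly, the align-then-decode recipe can be sketched as a small "connector" that projects frozen vision features into the language model's embedding space, where they are prepended to the text token embeddings. The dimensions and module names below are illustrative assumptions, not any specific model's code.

```python
import torch
import torch.nn as nn

class VisionToTokenProjector(nn.Module):
    """Maps patch features from a (typically frozen) vision encoder into the
    LLM's embedding space, so they can be consumed as if they were text tokens."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # A simple MLP connector; real systems vary (linear layer, MLP, Q-Former, ...).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):        # (B, num_patches, vision_dim)
        return self.proj(patch_feats)      # (B, num_patches, llm_dim)


# Toy usage: prepend projected "visual tokens" to the text token embeddings;
# a decoder-only LM would then attend over the combined sequence.
projector = VisionToTokenProjector()
visual_tokens = projector(torch.randn(1, 256, 1024))   # stand-in for ViT patch features
text_embeds = torch.randn(1, 32, 4096)                  # stand-in for LLM token embeddings
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```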