Vision Language Models: Introducing BLIVA

In recent years, Large Language Models (LLMs) have become pivotal in natural language understanding, excelling across a wide range of tasks. Vision Language Models (VLMs), such as OpenAI's GPT-4 (2023), have made significant strides on visual question-answering tasks by combining language understanding with visual comprehension. However, while existing methods connect LLMs to visual encoders for image-based tasks, they struggle to interpret text embedded within images, a common occurrence in daily life.

To overcome this limitation, the researchers introduce BLIVA, a multimodal LLM whose language model receives both learned query embeddings and directly projected, image-encoded patch embeddings for a more comprehensive understanding of text-rich images. This integration aims to improve the model's ability to capture textual details within images, a critical aspect of human visual perception.

Vision Language Models (VLMs) merge language understanding with visual capabilities but face challenges when it comes to interpreting images with embedded text. To overcome this limitation, the study introduces BLIVA, an enhanced model derived from InstructBLIP with a Visual Assistant.

BLIVA improves text-rich Visual Question-Answering (VQA) benchmarks substantially, achieving up to a 17.76% enhancement in the OCR-VQA benchmark and up to 7.9% in general VQA benchmarks like Visual Spatial Reasoning. The model achieves an impressive 17.72% overall improvement in a comprehensive multimodal Large Language Model (LLM) benchmark (MME) when compared to the baseline InstructBLIP.

What sets BLIVA apart is its ability to capture intricate details by incorporating query embeddings from InstructBLIP and directly projecting encoded patch embeddings into the LLM. This innovative approach, inspired by LLaVA, addresses the limitations of traditional image information extraction methods that rely on fixed query embeddings.

Empirical results show that BLIVA decodes real-world images well whether or not they contain text. This not only improves its performance in text-rich scenarios but also makes BLIVA a versatile solution with broad industry applications.

Multimodal Instruction Tuning-

Key Points About Multimodal Instruction Tuning:
  1. Improves Generalization: Instruction tuning helps language models perform better on tasks they haven't seen before. It's particularly useful in the field of Natural Language Processing (NLP).
  2. Data Collection Methods:
  • Conversion of Existing Datasets: Some researchers convert existing NLP datasets into an instruction format.
  • Generation by LLMs: Others use large language models (LLMs) to create new instruction data.
  3. Application in Multimodal Settings (image-based instruction tuning examples):
  • MiniGPT-4: Uses human-curated instruction data during the fine-tuning stage.
  • LLaVA: Creates multimodal instruction data by prompting GPT-4 with image captions and bounding box coordinates.
  • mPLUG-Owl: Utilizes a mix of text-only and multimodal instruction data for fine-tuning.
  • These methods are designed to enhance the model's ability to understand and process both text and visual information.
  4. Enhancing Existing Models:
  • MultimodalGPT and OpenFlamingo: These models use various instruction templates that include both vision and language data.
  • OFA Fine-Tuning: A benchmark dataset with 62 diverse multimodal tasks is used for fine-tuning the OFA model.

In summary, multimodal instruction tuning is a method of training language models to be more effective in tasks that require understanding both text and images. This is achieved by using specially designed datasets and training methods that combine instructions with multimodal data. The goal is to make these models more versatile and capable in a variety of tasks, particularly those involving complex interactions of text and visual content.
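
To make this concrete, here is a minimal, hypothetical example of what a single multimodal instruction-tuning sample could look like once a plain VQA record has been converted into instruction format. The field names and prompt wording are illustrative assumptions, not taken from any specific dataset:

```python
# Hypothetical conversion of a plain VQA record into an instruction-style
# training sample (field names and the prompt wording are illustrative).
vqa_record = {
    "image_path": "images/000123.jpg",
    "question": "What does the sign in the background say?",
    "answer": "No parking",
}

instruction_sample = {
    "image": vqa_record["image_path"],
    # The instruction wraps the raw question in a natural-language task description.
    "instruction": (
        "Look at the image and answer the question briefly. "
        f"Question: {vqa_record['question']}"
    ),
    "response": vqa_record["answer"],
}
```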

Method Architecture Overview-

The paper describes BLIVA, an advanced multimodal large language model (LLM) architecture that combines the two prevailing designs used by end-to-end multimodal LLMs:

  1. Models Utilizing Learned Query Embeddings for the LLM:
  • Example Models: MiniGPT-4 and Flamingo.
  • Working Principle: MiniGPT-4 reuses BLIP-2's Q-Former to extract image features as a fixed set of learned queries, while Flamingo uses a Perceiver Resampler to reduce image features to a fixed number of visual outputs for the LLM.
  • Limitations: The learned queries are well aligned with the LLM, but because they compress the image into a fixed number of tokens, they may miss critical information present in the full set of encoded patch embeddings.
  2. Models Employing Image-Encoded Patch Embeddings:
  • Example Model: LLaVA.
  • Working Principle: Connects its vision encoder to the LLM using an MLP (Multi-Layer Perceptron) projection (a minimal sketch contrasting the two pathways follows this list).
  • Limitations: The projection layer alone may have limited capability in capturing all the information the LLM needs from the image.
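
To make the contrast between the two designs concrete, here is a minimal PyTorch sketch of the two image-embedding pathways, assuming illustrative module names and dimensions; it is not the actual MiniGPT-4, Flamingo, or LLaVA code:

```python
# Sketch of the two image-embedding pathways described above.
# Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

VISION_DIM, LLM_DIM, NUM_QUERIES = 1024, 4096, 32

class LearnedQueryPathway(nn.Module):
    """Q-Former-style pathway: compress all patch embeddings into a fixed
    number of learned query embeddings via cross-attention."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, VISION_DIM))
        self.cross_attn = nn.MultiheadAttention(VISION_DIM, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(VISION_DIM, LLM_DIM)

    def forward(self, patch_embeds):                      # (B, N_patches, VISION_DIM)
        q = self.queries.expand(patch_embeds.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_embeds, patch_embeds)
        return self.to_llm(out)                           # (B, NUM_QUERIES, LLM_DIM)

class PatchProjectionPathway(nn.Module):
    """LLaVA-style pathway: keep one token per image patch and project it
    directly into the LLM's embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VISION_DIM, LLM_DIM)

    def forward(self, patch_embeds):
        return self.proj(patch_embeds)                    # (B, N_patches, LLM_DIM)
```

The first pathway yields a short, LLM-aligned visual sequence; the second preserves per-patch detail at the cost of a much longer sequence.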

BLIVA's Unique Approach:

  • Integration of Both Techniques: BLIVA incorporates both learned query embeddings (aligned with LLM) and image-encoded patch embeddings (carrying richer image information).
  • Mechanism: A vision tower encodes the image into patch embeddings; these are passed both through a Q-Former (producing refined query embeddings) and through a separate projection layer (producing projected patch embeddings).
  • Input to LLM: The model combines these two types of embeddings and appends them to the question text embedding as the final input to the LLM (see the sketch after this list).
  • Inference Techniques: For generating outputs, BLIVA uses beam search, and for classification and multi-choice VQA (Visual Question Answering) benchmarks, it uses a vocabulary ranking method.
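
Putting the two streams together, the following hedged sketch shows how the query embeddings, projected patch embeddings, and question text embeddings could be concatenated into a single LLM input, reusing the illustrative pathway modules sketched earlier; this is an illustration of the described mechanism, not the authors' implementation:

```python
import torch

def build_llm_inputs(patch_embeds, question_token_embeds,
                     query_pathway, patch_pathway):
    """Combine both visual embedding streams with the question embeddings.

    patch_embeds:          (B, N_patches, VISION_DIM) from the vision tower
    question_token_embeds: (B, T_text, LLM_DIM) from the LLM's embedding table
    """
    query_embeds = query_pathway(patch_embeds)        # (B, NUM_QUERIES, LLM_DIM)
    projected_patches = patch_pathway(patch_embeds)   # (B, N_patches, LLM_DIM)
    # Visual tokens first, with the question text embeddings appended last.
    return torch.cat([query_embeds, projected_patches, question_token_embeds], dim=1)
```

At inference time, such a combined sequence would be handed to the LLM's decoding loop; with a Hugging Face-style interface, beam-search generation might look like `model.generate(inputs_embeds=llm_inputs, num_beams=5, max_new_tokens=64)`, though the exact call depends on the backbone.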

Two-Stage Training Scheme:

  1. Pre-Training Stage: Aligns the LLM with visual information using image-text pairs from image captioning datasets for global image descriptions.
  2. Post-Pre-Training Stage: Focuses on enhancing the model's ability to discern finer image details and respond to human queries, using instruction tuning data to better align the model with human intent and with the visual embeddings (a configuration-style summary follows this list).
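
A configuration-style summary of the two stages is sketched below. The exact split of trainable modules is an assumption for illustration; the vision encoder and the LLM are treated as frozen, as is common in this line of work:

```python
# Illustrative summary of the two-stage training scheme (module and
# dataset choices are assumptions, not the authors' exact configuration).
training_stages = [
    {
        "name": "pre-training",
        "data": "image-text pairs from image captioning datasets",
        "trainable_modules": ["patch_embedding_projection"],
        "objective": "language-model (next-token) loss on the caption",
    },
    {
        "name": "instruction tuning",
        "data": "multimodal instruction-tuning data",
        "trainable_modules": ["q_former", "patch_embedding_projection"],
        "objective": "language-model loss on the response tokens",
    },
]
```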

Notable Aspects:

  • BLIVA, in contrast to BLIP-2, uses a more compact pre-training dataset for its visual assistant branch.
  • It adopts a language-model loss as its training objective, teaching the model to generate subsequent tokens based on the given context (sketched below).
  • Additionally, for commercial use, BLIVA integrates with FlanT5 XXL (named BLIVA (FlanT5XXL) in the paper).
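
As a rough illustration of that objective, here is a minimal sketch of a next-token language-model loss in which visual and prompt positions are masked out so that only the response tokens contribute; it is illustrative, not the authors' training code:

```python
import torch.nn.functional as F

def lm_loss(logits, labels):
    """Standard next-token prediction loss.

    logits: (B, T, vocab_size) from the LLM.
    labels: (B, T) token ids, with -100 at positions (visual tokens, prompt)
            that should not contribute to the loss.
    """
    # Shift so that position t predicts the token at position t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```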