Researchers at the University of Wisconsin-Madison have trained a language model together with an image encoder to obtain a language model (an artificial intelligence capable of understanding and generating language) that can recognize not only text but also images. This advance brings us a step closer to a general AI model that understands the same context we do and supports our decision-making with the same information. The model demonstrated that it can follow instructions with an image as part of its context, achieving results similar to OpenAI's GPT-4 with a fraction of the parameters.
Humans interact with the world through two core channels: language and vision. On the language side, the recently popularized Large Language Models (LLMs) have taken the field by storm with steadily improving performance. LLMs such as GPT-3, T5, and PaLM have learned to read, summarize, and generate textual data in increasingly human-like ways.
Researchers in Artificial Intelligence have been working toward a general-purpose assistant that can effectively follow multimodal vision-and-language instructions aligned with human intent to complete real-world tasks. To this end, language-augmented foundation vision models for open-world visual understanding are being developed to perform tasks such as classification, detection, segmentation, captioning, and visual generation and editing. With the release of GPT-4 by OpenAI, the transformer model behind the famous chatbot ChatGPT, multimodal capability has proved to be a valuable addition to the LLM landscape.
In a recent research paper, the authors present the first attempt to use GPT-4 to generate multimodal language-image instruction-following data. The team introduces LLaVA, a Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder with Vicuna for general-purpose visual and language understanding. Vicuna is an open-source chatbot with 13B parameters, trained by fine-tuning LLaMA on user-shared conversations.
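As a rough intuition for how a vision encoder can be connected to a language model like Vicuna, the sketch below projects image features into the language model's embedding space with a single linear layer and treats the result as extra "visual tokens". This is a simplified illustration, not the LLaVA implementation: the class name, the dimensions, and the toy usage are assumptions chosen for clarity; the actual code lives in the repository linked below.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Minimal sketch: map image features from a vision encoder (e.g. CLIP
    ViT-L/14, feature size 1024) into the token-embedding space of a language
    model (e.g. a 13B LLaMA/Vicuna model, hidden size 5120).

    Hypothetical example only; consult https://github.com/haotian-liu/LLaVA
    for the real architecture.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # A learned linear projection bridges the two embedding spaces.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        # Returns visual "tokens" that can be concatenated with the embedded
        # text prompt before being fed to the language model.
        return self.projection(image_features)

# Toy usage: random features stand in for the output of a vision encoder.
connector = VisionLanguageConnector()
fake_image_features = torch.randn(1, 256, 1024)
visual_tokens = connector(fake_image_features)  # shape: (1, 256, 5120)
```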
LLaVA is an attempt to extend instruction tuning to the multimodal space. The main objective is to let users complete real-world tasks with the help of a visual assistant that can effectively follow multimodal vision-and-language instructions aligned with human intent. The team's significant contributions are as follows:
- Multimodal instruction-following data – The team has presented a data reformation perspective and pipeline to convert image-text pairs into the instruction-following format with the help of the GPT-4 model (a sketch of this conversion step appears after this list).
- Large multimodal models – The team has developed a large multimodal model by connecting the open-set visual encoder of CLIP with the language decoder LLaMA and fine-tuning them end-to-end on the generated instructional vision-language data.
- The empirical study validates the effectiveness of the generated data for LMM instruction tuning and offers practical tips for building a general-purpose instruction-following visual agent.
- SOTA performance has been achieved on the Science QA multimodal reasoning dataset when the approach is combined with GPT-4.
- Open-source nature – The project is open source: the generated multimodal instruction data, the codebase for data generation and model training, the model checkpoint, and a visual chat demo are publicly available at https://github.com/haotian-liu/LLaVA.
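As referenced in the first bullet above, here is a minimal sketch of how an image-text pair could be reformatted into instruction-following data by prompting GPT-4 with the image's caption. The prompt wording, the helper name `caption_to_instructions`, and the use of the official `openai` Python client are assumptions for illustration; the authors' actual pipeline (see the repository above) is richer than this.

```python
# Minimal sketch (not the authors' pipeline): ask GPT-4 to rewrite an
# image-text pair as an instruction-following conversation, using only the
# caption as a textual stand-in for the image.
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are given the caption of an image. Generate a short conversation "
    "between a user asking questions about the image and an assistant "
    "answering them, as if the assistant could see the image."
)

def caption_to_instructions(caption: str) -> str:
    """Turn one image caption into instruction-following conversation text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Caption: {caption}"},
        ],
    )
    return response.choices[0].message.content

# Example: a single caption from an image-text pair becomes training dialogue.
print(caption_to_instructions("A group of people flying kites on a beach."))
```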
LLaVA has demonstrated impressive multimodal chat abilities, achieving an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieved a new SOTA accuracy of 92.53%. These results make LLaVA a promising approach and a notable contribution among recently released language models.
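To give some intuition for the "relative score" metric mentioned above, the snippet below shows one simple way such a score could be computed if a judge assigns a 1-10 quality rating to each model's answer per question. The judging setup and function name are assumptions for illustration, not the paper's exact evaluation protocol.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Sketch: average per-question ratio of candidate vs. reference ratings.

    Both inputs are lists of 1-10 quality ratings (e.g. assigned by a judge
    model) for the same set of questions. Returns a percentage.
    """
    assert len(candidate_ratings) == len(reference_ratings)
    ratios = [c / r for c, r in zip(candidate_ratings, reference_ratings)]
    return 100.0 * sum(ratios) / len(ratios)

# Toy example: a candidate rated slightly below the reference on each question.
print(relative_score([7, 8, 9], [9, 9, 10]))  # ≈ 85.6
```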
Source: https://www.marktechpost.com