Google Research Introduces ScreenAI for UI and Infographic Understanding
Google Research has announced ScreenAI, a novel multimodal model designed to understand and interact with screen user interfaces (UIs) and infographics. Both play central roles in human communication and human-machine interaction, and they share common design principles and visual languages, creating an opportunity to build a unified model that can comprehend, reason about, and act on these interfaces. At the same time, their inherent complexity and diverse presentation formats make them a distinct modeling challenge.
ScreenAI builds on the PaLI architecture and incorporates the flexible patching strategy from pix2struct, which lets the model process images of widely varying aspect ratios while preserving their native proportions. The model is trained on a diverse mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify the UI elements on a given screen, including their type, location, and description. As part of this work, three new datasets (Screen Annotation, ScreenQA Short, and Complex ScreenQA) are also being released to evaluate layout understanding and question-answering (QA) capabilities.
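To make the patching idea concrete, here is a minimal Python sketch of aspect-ratio-aware patching in the spirit of pix2struct. The function name and the patch budget are illustrative assumptions, not ScreenAI's actual values or the exact pix2struct algorithm.

```python
import math

def flexible_patch_grid(width: int, height: int,
                        max_patches: int = 1024) -> tuple[int, int]:
    """Pick a (rows, cols) patch grid that preserves the image's aspect
    ratio while staying within a fixed patch budget, in the spirit of
    pix2struct's variable-resolution patching."""
    aspect = width / height
    # Solve rows * cols <= max_patches with cols / rows close to aspect.
    rows = max(1, math.floor(math.sqrt(max_patches / aspect)))
    cols = max(1, math.floor(rows * aspect))
    while rows * cols > max_patches:  # trim if rounding overshoots
        rows, cols = (rows, cols - 1) if cols > rows else (rows - 1, cols)
    return rows, cols

# A wide desktop screenshot gets a wide grid, a tall phone screen a tall one:
print(flexible_patch_grid(1920, 1080))  # -> (24, 42)
print(flexible_patch_grid(1080, 1920))  # -> (42, 23)
```

Because the grid follows the image's proportions rather than forcing a square resize, text and UI elements are not distorted before being fed to the vision encoder.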
ScreenAI Architecture and Training Methodology
ScreenAI's architecture is fundamentally based on PaLI: a vision transformer (ViT) generates image embeddings, a multimodal encoder processes the concatenation of the image and text embeddings, and an autoregressive decoder produces the textual output. Training proceeds in two stages, pre-training followed by fine-tuning. During pre-training, self-supervised learning is employed to automatically generate data labels; during fine-tuning, the ViT remains frozen.
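As a rough illustration of this layout, below is a toy PyTorch sketch of a PaLI-style encoder-decoder. All module sizes, the vocabulary size, and the simplified patch projection standing in for a full ViT are assumptions made for readability; they do not reflect ScreenAI's actual configuration.

```python
import torch
import torch.nn as nn

class ScreenAILikeModel(nn.Module):
    """Toy PaLI-style encoder-decoder: a (stand-in) ViT produces image
    embeddings, which are concatenated with text embeddings, fused by a
    multimodal encoder, and decoded autoregressively into output tokens."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for a full vision transformer: projects flattened
        # 16x16 RGB patches straight to the model dimension.
        self.vit = nn.Linear(16 * 16 * 3, d_model)
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids, target_ids):
        img_emb = self.vit(patches)             # (B, n_patches, d_model)
        txt_emb = self.text_embed(text_ids)     # (B, n_text, d_model)
        fused = self.encoder(torch.cat([img_emb, txt_emb], dim=1))
        tgt = self.text_embed(target_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, fused, tgt_mask=mask)
        return self.lm_head(out)                # next-token logits
```

Concatenating the two embedding sequences lets the encoder attend jointly across the screen content and the textual prompt before decoding begins.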
Training data is gathered from publicly accessible web pages, leveraging a programmatic exploration approach like the one previously used to build the RICO dataset of mobile app UIs. A layout annotator built on the DETR detection model identifies UI elements, and an icon classifier distinguishes 77 different icon types. Descriptive captions for images are generated with the PaLI image captioning model, while an optical character recognition (OCR) engine extracts and annotates on-screen text. Finally, PaLM 2 is used to turn these annotations into input-output pairs, increasing the diversity of the pre-training data.
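The annotation stage of this pipeline can be pictured as the sketch below. The helper callables (detect_layout, classify_icon, run_ocr, caption_image) are hypothetical stand-ins for the DETR-based layout annotator, the 77-way icon classifier, the OCR engine, and the PaLI captioner; their names and interfaces are assumptions, not the published pipeline.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str         # e.g. "BUTTON", "ICON", "TEXT", "IMAGE"
    bbox: tuple       # (x0, y0, x1, y1) screen coordinates
    description: str  # icon class, OCR text, or generated caption

def annotate_screen(screenshot, detect_layout, classify_icon, run_ocr,
                    caption_image) -> list:
    """Produce a screen schema by routing each detected region to the
    appropriate annotator. A schema like this is what an LLM (PaLM 2 in
    the paper) consumes to generate diverse pre-training examples."""
    elements = []
    for region in detect_layout(screenshot):   # DETR-style detector
        if region.kind == "ICON":
            desc = classify_icon(region.crop)  # one of 77 icon types
        elif region.kind == "TEXT":
            desc = run_ocr(region.crop)        # OCR engine
        elif region.kind == "IMAGE":
            desc = caption_image(region.crop)  # PaLI-style captioner
        else:
            desc = ""
        elements.append(UIElement(region.kind, region.bbox, desc))
    return elements
```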
Extensive Fine-Tuning and Benchmark Performance
Following pre-training, ScreenAI is fine-tuned on a comprehensive set of publicly available QA, summarization, and navigation datasets. The QA mixture includes ChartQA, DocVQA, Multipage DocVQA, InfographicVQA, OCR-VQA, WebSRC, and ScreenQA. For navigation, the model is fine-tuned on Referring Expressions, MoTIF, MUG, and Android in the Wild. For summarization, it uses Screen2Words for whole-screen summaries and Widget Captioning for describing individual UI elements.
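A minimal sketch of this mixture fine-tuning follows, reusing the toy ScreenAILikeModel above and assuming each loader yields (patches, text_ids, target_ids) batches. The uniform task sampling and hyperparameters are illustrative assumptions; the frozen ViT follows the fine-tuning setup described earlier.

```python
import itertools
import random
import torch

def finetune(model, task_loaders: dict, steps: int = 1000, lr: float = 1e-4):
    """Jointly fine-tune on a mixture of QA / summarization / navigation
    tasks, keeping the vision tower frozen."""
    for p in model.vit.parameters():
        p.requires_grad = False                 # freeze the ViT
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    iters = {name: itertools.cycle(dl) for name, dl in task_loaders.items()}
    names = list(iters)
    for _ in range(steps):
        # Sample one task per step so all datasets are trained together.
        patches, text_ids, target_ids = next(iters[random.choice(names)])
        logits = model(patches, text_ids, target_ids[:, :-1])
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       target_ids[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
```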
ScreenAI achieves strong results across these benchmarks: best-in-class performance on ChartQA, DocVQA, and InfographicVQA, state-of-the-art results on the UI-centric WebSRC and MoTIF tasks, and competitive performance on Screen2Words and OCR-VQA. The model has also been evaluated on the newly released Screen Annotation, ScreenQA Short, and Complex ScreenQA datasets. Scaling experiments show that performance improves consistently with model size and has not yet saturated at the largest size tested, 5B parameters.