
OmniParser for pure vision-based GUI agent

Posted on: October 26, 2024 at 04:41 AM

OmniParser

OmniParser is a compact screen parsing module that can convert UI screenshots into structured elements.


Introduction

Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UIs).

Limiting factor in screen parsing

When relying on vision input alone to act as a general agent across diverse operating systems, one of the primary limiting factors is the absence of a robust screen parsing technique. Such a technique should be capable of:

  1. Reliably identifying interactable icons within the user interface.
  2. Understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen.
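
To make the target output concrete, here is a minimal sketch of what such structured elements could look like. The field names are illustrative assumptions, not OmniParser’s released schema.

```python
from dataclasses import dataclass

@dataclass
class ParsedElement:
    """One element recovered from a screenshot.

    Field names are illustrative; this is not OmniParser's actual schema.
    """
    element_id: int                  # stable index an agent can refer to
    bbox: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    interactable: bool               # detected as clickable/actionable
    description: str                 # functional semantics, e.g. "search button"

# What a parser satisfying both requirements might emit for one screenshot:
elements = [
    ParsedElement(0, (1820, 10, 1900, 60), True, "close window button"),
    ParsedElement(1, (200, 300, 900, 360), True, "search input field"),
]
```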

OmniParser

OmniParser is a compact screen parsing module released by Microsoft. Its main features are:

  1. It converts UI screenshots into structured, semantically annotated elements.
  2. When used with GPT-4V, it significantly improves the agent’s capability to generate precisely grounded actions for interface regions.
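
The improvement comes from giving the model symbolic handles for screen regions instead of raw pixels. A minimal sketch of the idea, reusing the `ParsedElement` type from above (the prompt wording is an assumption, not the paper’s exact format):

```python
def build_grounding_prompt(task: str, elements: list[ParsedElement]) -> str:
    """Serialize parsed elements so a VLM can answer with an element ID
    instead of guessing raw pixel coordinates. Wording is illustrative.
    """
    lines = [f"Task: {task}", "Interactable screen elements:"]
    for el in elements:
        lines.append(f"  [{el.element_id}] {el.description} at bbox {el.bbox}")
    lines.append("Reply with the ID of the element to act on next.")
    return "\n".join(lines)
```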

Creation of OmniParser

Curation of Specialised Datasets

The development of OmniParser began with the creation of two datasets:

Interactable icon detection dataset

This dataset is curated from popular web pages and annotated to highlight clickable and actionable regions.

Icon description dataset

This dataset is designed to associate each UI element with its corresponding function. This dataset serves as a key component for training models to understand the semantics of detected elements.
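
To make the two dataset formats concrete, here is a hedged sketch of what individual training examples might look like; the field names and file names are hypothetical, not the released schema.

```python
# Interactable icon detection: a full screenshot plus annotated actionable boxes.
detection_example = {
    "image": "news_site_home.png",
    "boxes": [[1520, 12, 1580, 60], [40, 200, 320, 260]],  # pixel coordinates
    "labels": ["interactable", "interactable"],
}

# Icon description: a cropped element paired with its functional meaning.
description_example = {
    "icon": "cart_icon.png",
    "description": "opens the shopping cart",
}
```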

Fine-Tuning Detection and Captioning Models

OmniParser leverages two complementary models:

Detection model

The detection model is fine-tuned on the interactable icon detection dataset and reliably identifies actionable regions within a screenshot.

Captioning models

The captioning models are trained on the icon description dataset; they extract the functional semantics of the detected elements, generating contextually accurate descriptions of their intended actions.
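
A hedged sketch of how the two models might compose into a single parsing pipeline; the `predict` and `describe` interfaces are assumptions for illustration, and `ParsedElement` is the type sketched earlier.

```python
from PIL import Image

def parse_screenshot(screenshot: Image.Image, detector, captioner) -> list[ParsedElement]:
    """Two-stage parse: detect actionable regions, then caption each crop."""
    boxes = detector.predict(screenshot)  # -> [(x_min, y_min, x_max, y_max), ...]
    parsed = []
    for i, box in enumerate(boxes):
        crop = screenshot.crop(box)       # isolate the detected element
        text = captioner.describe(crop)   # functional description of the crop
        parsed.append(ParsedElement(i, box, True, text))
    return parsed
```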

Benchmark Performance

Average Performance Across Benchmarks

The performance measured across different benchmarks is promising: on most benchmarks, including ScreenSpot, Mind2Web, AITW, and Windows Agent Arena, OmniParser + GPT-4V outperforms a GPT-4V agent that uses extra information extracted from HTML.


OmniParser with Other Vision-Language Models

OmniParser is plugin-ready: beyond GPT-4V, it can be combined with a variety of vision-language models to create agents capable of taking action on UIs.

Use Cases

Because OmniParser can identify screen regions containing actionable items, it opens up possibilities for accurate and fast automation across a range of use cases.

Automatic test case extraction

OmniParser can be used to extract test cases from UI screenshots and generate test scripts for automation testing.

Automate UI flows for repetitive tasks faster

OmniParser can automate repetitive tasks by identifying the relevant regions in the UI and generating scripts that act on them, as sketched below.
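
Once an element has been grounded, dispatching the action reduces to simple coordinate math. A minimal sketch, assuming pyautogui as the input backend and the `ParsedElement` type from earlier:

```python
import pyautogui  # assumed automation backend; any input-injection library would do

def click_element(el: ParsedElement) -> None:
    """Click the center of a parsed element's bounding box (pixel coordinates)."""
    x_min, y_min, x_max, y_max = el.bbox
    pyautogui.click((x_min + x_max) // 2, (y_min + y_max) // 2)
```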

Identify optimal user flows for navigation

OmniParser can be used to identify optimal user flows for UI navigation and suggest improvements. It can also identify the regions where users spend the most time and flag them for improvement.

Conclusion

OmniParser is a compact screen parsing module that converts UI screenshots into structured elements. It can be used with a variety of models to create agents capable of taking action on UIs, and when paired with GPT-4V it significantly improves the agent’s capability to generate precisely grounded actions for interface regions. Performance across benchmarks is promising: on most of them, including ScreenSpot, Mind2Web, AITW, and Windows Agent Arena, OmniParser + GPT-4V outperforms a GPT-4V agent that uses extra information extracted from HTML. Its use cases range from automatic test case extraction to faster automation of repetitive UI flows and identification of optimal navigation flows, opening up possibilities for accurate and fast automation.