Olympus: A new framework for managing multiple computer vision tasks

DPhil student Yuanze Lin and Professor Ronald Clark have collaborated with Microsoft to introduce Olympus, a new framework designed to handle multiple computer vision tasks efficiently within a unified system. 

Computer vision has advanced significantly in handling specific tasks like object detection, segmentation, and classification. However, applying these models in complex real-world scenarios, such as autonomous vehicles, healthcare, and security, presents challenges. Each task usually requires its own model, making the management of diverse tasks inefficient and resource-intensive.   

The new framework was designed by Yuanze Lin, Professor Ronald Clark, and Professor Philip Torr from the Department of Engineering Science, together with Microsoft researchers Yunsheng Li, Dongdong Chen, and Weijian Xu.  

Olympus aims to overcome the limitations of existing approaches, such as multitask learning models that often struggle with resource allocation, task balancing, and performance on complex or niche tasks. Central to the method is a multimodal control mechanism that leverages a Multimodal Large Language Model (MLLM) to interpret user instructions and efficiently route tasks to specialised models. This adaptive routing dynamically delegates tasks based on user input, optimising both computational efficiency and accuracy.  
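The routing idea described above can be sketched in a few lines of Python. This is an illustrative toy, not the framework's real API: the task names, the `mock_mllm_route` keyword matcher standing in for the MLLM controller, and the specialist functions are all hypothetical.

```python
# Toy sketch of instruction-based task routing. In Olympus, a Multimodal
# Large Language Model interprets the instruction; here a keyword matcher
# stands in for it. All names and specialists are illustrative.

SPECIALISTS = {
    "detection": lambda image, prompt: f"bounding boxes for '{prompt}'",
    "segmentation": lambda image, prompt: f"masks for '{prompt}'",
    "captioning": lambda image, prompt: "a generated caption",
}

def mock_mllm_route(instruction):
    """Stand-in for the MLLM controller: map an instruction to a task."""
    text = instruction.lower()
    if "detect" in text or "find" in text:
        return "detection"
    if "segment" in text or "mask" in text:
        return "segmentation"
    return "captioning"  # fall back to a generic task

def dispatch(image, instruction):
    """Route the instruction to the matching specialist and run it."""
    task = mock_mllm_route(instruction)
    return task, SPECIALISTS[task](image, instruction)
```

The key design point is that only the lightweight controller sees every instruction; each specialist model is invoked only when its task is actually requested, which is where the efficiency gain over running all models comes from.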

The system is capable of managing up to 20 simultaneous tasks. Additionally, the method is able to perform a ‘chain of actions’ with up to five sequential steps, making it useful for real-world problems such as image editing and multi-step decision making that require multiple actions. 
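A 'chain of actions' of this kind can be pictured as a bounded pipeline in which each routed step consumes the previous step's output. The sketch below is a hypothetical illustration, assuming string-labelled intermediate results; the task names and the five-step cap mirror the description above but the code is not the framework's implementation.

```python
# Hypothetical sketch of a bounded 'chain of actions'. Specialists here just
# tag a string so the sequencing is visible; real specialists would operate
# on image data.

SPECIALISTS = {
    "deblur": lambda img: img + " -> deblurred",
    "upscale": lambda img: img + " -> upscaled",
    "stylize": lambda img: img + " -> stylized",
}

def run_chain(image, plan, max_steps=5):
    """Apply each specialist in sequence, capped at max_steps actions."""
    if len(plan) > max_steps:
        raise ValueError(f"chain exceeds {max_steps} steps")
    result = image
    for task in plan:
        result = SPECIALISTS[task](result)
    return result
```

Capping the chain length keeps latency predictable, since each extra step adds a full specialist-model invocation.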

"With Olympus, we've tried to move away from computer vision models with narrow capabilities to ones that can solve complex vision tasks with step-by-step reasoning," says Professor Ronald Clark.

While further testing is required to address edge cases and latency, Olympus offers a promising solution for simplifying the management of diverse vision tasks. By bridging gaps in traditional models and multitask learning systems, it paves the way for more integrated and flexible computer vision solutions across various fields.