Overview: Mano Leads a Breakthrough in GUI Agent AI
In 2025, the AI community has sharpened its focus on agents that can operate real-world interfaces. MiningLamp Technology, a leading Chinese company in enterprise-level large models and data intelligence, has announced a watershed result: its specialized GUI large model, Mano, has achieved state-of-the-art (SOTA) performance on two major benchmarks, Mind2Web and OSWorld. The breakthrough rests on two core innovations, online reinforcement learning and automated data acquisition, which together establish a scalable, self-evolving paradigm for GUI agent development.
Mano’s performance marks a milestone in the ability of AI systems to interact with human-made interfaces, moving beyond static task execution toward adaptive, real-time GUI manipulation. The company highlights Mano’s superior results in precise element identification and robust multi-step task execution, essential capabilities for truly useful GUI agents that can operate on phones, desktops, and web apps.
Benchmark Spotlight: Mind2Web and OSWorld
Mind2Web evaluates Mano across 137 websites and more than 2,350 real-world tasks, testing its ability to locate target elements within dynamic DOMs and execute complete action sequences. Mano excels in Element Accuracy (Ele.Acc), Step Success Rate (Step SR), and Operation F1 (Op.F1), demonstrating that it can translate perception into actionable operations with high fidelity. These metrics signal a meaningful leap toward reliable GUI automation across diverse web environments.
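As a simplified illustration of how these three metrics relate (the real Mind2Web scoring is more involved; the step data below is hypothetical):

```python
def element_accuracy(steps):
    """Fraction of steps where the predicted element matches the gold element."""
    return sum(p_el == g_el for p_el, g_el, _, _ in steps) / len(steps)

def op_f1(pred_op, gold_op):
    """Token-level F1 between predicted and gold operation strings."""
    pred, gold = pred_op.lower().split(), gold_op.lower().split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

def step_success_rate(steps, f1_threshold=1.0):
    """A step succeeds only when the element matches AND the operation F1
    meets the threshold, so Step SR is the strictest of the three metrics."""
    ok = sum(p_el == g_el and op_f1(p_op, g_op) >= f1_threshold
             for p_el, g_el, p_op, g_op in steps)
    return ok / len(steps)

# Hypothetical two-step trajectory: (pred_element, gold_element, pred_op, gold_op)
steps = [
    ("button#submit", "button#submit", "CLICK", "CLICK"),
    ("input#email", "input#search", "TYPE hello", "TYPE hello"),
]
```

On this toy trajectory the second step fails Step SR despite a perfect operation, because the wrong element was selected.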
On the desktop side, OSWorld-Verified presents 369 cross-application tasks across browsers, office software, and other desktop apps. In the Foundation E2E GUI & Specialized Model evaluation, Mano raises the overall success rate to 41.6% (±0.7%), outperforming contemporary models such as Qwen, GUI-Owl, and OpenCUA. The OSWorld results underscore Mano's robustness in real-world desktop interactions, a domain traditionally more challenging than browser-based tasks.
Technical Innovations Driving Mano’s Success
1) Online Reinforcement Learning: A New Paradigm for GUI Interaction
Mano pioneers “online reinforcement learning” in the GUI domain, departing from the offline-first training paradigm. The model trains in real time through an automated explorer that gathers up-to-date interaction data and learns from fresh trajectories. This approach maintains a dynamic balance between exploration and exploitation, allowing Mano to adapt to evolving interfaces and layouts.
MiningLamp builds a simulation-environment pool, including Browser Use Agent (BUA) and Computer Use Agent (CUA) environments, to expose Mano to a broad range of real-world scenarios. The framework uses an online sampling + offline filtering strategy: collect trajectories, then filter out noisy data to keep the training signal strong and the task distribution well-calibrated. Ablation studies show a significant boost in OSWorld-Verified performance after introducing online RL, a gain of 7.9 points that lifts the success rate to 41.6%.
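The online sampling + offline filtering strategy can be sketched as follows. The environment and policy APIs here (`env.rollout`, `policy.update`) are hypothetical stand-ins for illustration, not MiningLamp's actual interfaces:

```python
import random

def collect_trajectories(policy, env_pool, n=32):
    """Online sampling: roll out the current policy in randomly drawn
    BUA/CUA environments and record (trajectory, reward) pairs."""
    batch = []
    for _ in range(n):
        env = random.choice(env_pool)
        traj, reward = env.rollout(policy)   # hypothetical environment API
        batch.append((traj, reward))
    return batch

def filter_batch(batch, min_reward=0.5, max_steps=40):
    """Offline filtering: discard noisy or degenerate trajectories so the
    training signal stays strong and the task distribution stays calibrated."""
    return [(t, r) for t, r in batch
            if r >= min_reward and len(t) <= max_steps]

def online_rl_loop(policy, env_pool, iterations=100):
    """Alternate fresh collection with filtered updates, keeping the
    exploration/exploitation balance dynamic rather than offline-first."""
    for _ in range(iterations):
        batch = collect_trajectories(policy, env_pool)
        policy.update(filter_batch(batch))   # hypothetical RL update step
    return policy
```

The key design point is that filtering happens after collection but before the update, so the policy never trains on trajectories that failed quality checks.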
2) Intelligent Exploration: Automated, Real-World Trajectory Data
A core obstacle in GUI automation has been the costly collection of high-quality interactive trajectories. Mano addresses this with an automated data-capture pipeline and a scalable virtual environment cluster that simulates diverse interactive scenarios. For each target app, Mano generates prioritized target lists and filters out rarely used functions, guiding efficient exploration with clear contextual cues for subsequent actions.
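A minimal sketch of how such a prioritized target list might be built, assuming per-function usage frequencies are available (all names and numbers below are invented for illustration):

```python
def build_target_list(functions, min_usage=0.01, top_k=20):
    """Rank an app's functions by estimated usage frequency and drop
    rarely used ones, yielding a prioritized exploration target list.
    `functions` maps function name -> usage frequency (hypothetical data)."""
    frequent = {name: freq for name, freq in functions.items() if freq >= min_usage}
    ranked = sorted(frequent, key=frequent.get, reverse=True)
    return ranked[:top_k]

# Hypothetical usage statistics for a mail client
functions = {"compose": 0.40, "search": 0.25, "archive": 0.10,
             "export_mbox": 0.002, "debug_console": 0.0001}
targets = build_target_list(functions)
# "compose" leads the list; the two rarely used functions are filtered out
```

Exploration effort is then spent where users actually spend theirs, rather than uniformly across every menu item.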
In the web domain, a customized Chrome plugin called Mano-C extracts interactive elements, capturing spatial coordinates and semantic attributes. For desktop environments, Chrome-like element data is complemented by A11y Tree parsing and OmniParseV2 filtering to expand coverage. Large models generate semantic labels, functional descriptions, and interaction categories to produce structured supervision data for training.
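The structured supervision record such a pipeline might emit can be sketched as a simple dataclass; the field names below are assumptions for illustration, not Mano-C's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ElementRecord:
    """One interactive element with spatial coordinates, semantic
    attributes, and model-generated annotations (hypothetical fields)."""
    selector: str      # DOM selector or accessibility path
    bbox: tuple        # (x, y, width, height) spatial coordinates
    role: str          # semantic attribute, e.g. an ARIA role
    label: str         # model-generated semantic label
    description: str   # model-generated functional description
    category: str      # interaction category, e.g. "click" or "type"

record = ElementRecord(
    selector="button#checkout",
    bbox=(812, 640, 160, 44),
    role="button",
    label="Checkout button",
    description="Submits the cart and opens the payment page",
    category="click",
)
# asdict(record) yields a JSON-serializable training example
```

Whether the element comes from the Chrome plugin, the A11y Tree, or OmniParseV2, normalizing it into one record shape keeps the downstream supervision data uniform.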
The data-collection loop features a prompt-based exploration module that intelligently selects elements while enforcing constraints to prevent loops or redundant branches. A DFS exploration strategy coupled with screenshot capture and trajectory annotation ensures a continuous loop of data generation and refinement until exploration depth limits are reached.
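The DFS loop with loop prevention and a depth limit can be sketched as follows; the dict-based environment is a toy stand-in for a real GUI state graph:

```python
def explore(env, state, max_depth, visited=None, path=()):
    """DFS exploration with loop prevention: each state is visited at most
    once, and a trajectory is emitted when the depth limit or a dead end
    is reached. `env` maps state -> {action: next_state} (toy model)."""
    if visited is None:
        visited = set()
    visited.add(state)
    actions = env.get(state, {})
    unexplored = {a: s for a, s in actions.items() if s not in visited}
    if len(path) >= max_depth or not unexplored:
        yield list(path)   # finished trajectory, ready for annotation/storage
        return
    for action, nxt in unexplored.items():
        # (in the real pipeline, a screenshot would be captured at each step)
        yield from explore(env, nxt, max_depth, visited, path + (action,))

# Toy app: "home" links to "settings" and "mail"; "settings" links back (a loop)
env = {"home": {"open_settings": "settings", "open_mail": "mail"},
       "settings": {"back": "home"},
       "mail": {}}
trajectories = list(explore(env, "home", max_depth=3))
```

The shared `visited` set is what enforces the no-loop constraint: the `back` action from "settings" is pruned because "home" has already been seen.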
Implications: A More Practical GUI AI Era
Mano’s breakthroughs suggest a future where GUI agents can learn directly from real-world interactions and continuously improve without reliance on exhaustive curated datasets. The combination of online RL, automated data capture, and a robust simulation ecosystem positions Mano as a practical platform for building GUI-aware agents across web and desktop environments. As the field embraces these methods, expect faster, more reliable automation across software suites, browsers, and enterprise tools.
What’s Next
MiningLamp’s ongoing work will likely expand the Mano ecosystem, refine online training loops, and broaden the scope of gradient-free exploration to handle even more complex interface scenarios. For organizations seeking to deploy GUI agents that can navigate, interpret, and manipulate GUI-based workflows, Mano represents a compelling benchmark and a roadmap toward scalable, self-improving AI assistants.