Microsoft’s Magma AI Controls Software and Robots

Microsoft Research has unveiled Magma, an advanced AI foundation model integrating visual and language processing to control software interfaces and robotic systems. Unlike previous multimodal AI models, Magma can both interpret data and take action, making it a step toward agentic AI capable of executing multistep tasks autonomously.
Developed in collaboration with KAIST, the University of Maryland, and other institutions, Magma merges perception and control into a single AI system. This differentiates it from previous models like Google’s PaLM-E and OpenAI’s Operator, which rely on separate systems for perception and action.
Magma introduces two key innovations: Set-of-Mark, which assigns labels to interactive objects in digital and physical environments, and Trace-of-Mark, which learns movement patterns from video data. These features enable it to navigate UIs and direct robotic arms effectively.
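The Set-of-Mark idea can be illustrated with a minimal sketch. This is a hypothetical helper, not Microsoft's implementation: it assumes an upstream detector has already produced bounding boxes for interactive elements, and simply assigns each one a numeric mark that an action model can reference instead of raw pixel coordinates.

```python
def assign_marks(elements):
    """Assign sequential numeric marks to detected interactive UI elements.

    `elements` is a list of dicts, each with a 'bbox' (x, y, w, h) and a
    'role' (e.g. 'button', 'textbox'). Returns a mark -> element mapping,
    so the model can output actions like "click mark 2" rather than
    predicting screen coordinates directly.
    """
    marked = {}
    for i, elem in enumerate(elements, start=1):
        marked[i] = elem
    return marked

# Example: two elements detected in a login dialog (illustrative data)
ui = [
    {"bbox": (40, 20, 200, 30), "role": "textbox"},
    {"bbox": (40, 70, 80, 30), "role": "button"},
]
marks = assign_marks(ui)
# An action model can now ground "press the login button" to mark 2.
```

Trace-of-Mark extends the same idea over time: rather than labeling a single frame, it tracks how marked points move across video frames, giving the model movement trajectories to learn from.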
Microsoft claims Magma outperforms existing AI models in key benchmarks, including an 80.0 score on VQAv2 (higher than GPT-4V’s 77.2) and leading performance in robot manipulation tasks. However, challenges remain in complex step-by-step decision-making.
Microsoft plans to release Magma’s training and inference code on GitHub, allowing researchers to refine and expand its capabilities. If successful, Magma could push AI beyond text-based interactions, enabling AI agents to autonomously operate software and robotics—a significant leap toward more advanced AI automation.
As AI evolves, agentic models like Magma are moving into the mainstream, reflecting the field's shift from early skepticism about autonomous agents to active research and development in AI-driven automation.
Related News: https://airguide.info/category/air-travel-business/artificial-intelligence/