OpenAI introduces Operator & agents

PLUS: Humanity’s Last Exam: The one test AI couldn’t beat

Happy Monday, AI family, and welcome back to our newsletter.

In today’s edition:

  • OpenAI introduces Operator & agents

  • Humanity’s Last Exam: The one test AI couldn’t beat

  • Plus trending AI tools, posts, and resources

Ready, set, go…

OpenAI introduces Operator & agents

OpenAI has introduced Operator, its first AI agent designed to automate tasks by independently navigating web browsers and interacting with websites, marking a significant step toward realizing CEO Sam Altman’s vision of AI agents entering the workforce in 2025.

Here's what you need to know:

  • This model mimics human interaction with web interfaces, such as filling forms and clicking buttons, and requires user supervision for sensitive actions like banking or entering credit card information.

  • Operator runs on a new AI model called Computer-Using Agent (CUA), which combines GPT-4o’s vision skills with advanced reasoning to "see" websites through screenshots and interact with them by clicking, scrolling, and typing, all without needing special integrations (a rough sketch of this loop follows the list).

  • OpenAI demoed the feature during a live stream, demonstrating its capabilities in performing tasks like booking reservations, ordering groceries, and buying tickets to sporting events.

  • It scores 58.1% on the WebArena benchmark, which covers tasks like online shopping and content management on simulated websites, but performs better on real-world sites, hitting an 87% success rate on platforms like Amazon and Google Maps. On more complex tasks in the OSWorld benchmark, however, like combining PDFs from emails, its success rate drops to 38.1%.

  • It’s available as a research preview to Pro users in the U.S. at operator.chatgpt.com.
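
OpenAI hasn’t published Operator’s internals beyond the description above, but the screenshot-in, action-out loop can be sketched roughly like this. Everything below (the class, the browser and model interfaces) is a hypothetical illustration, not OpenAI’s API:

```python
# Hypothetical sketch of a CUA-style loop: the model only sees pixels and
# emits primitive actions. None of these names are OpenAI's real API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click", "scroll", "type", or "done"
    x: int = 0       # screen coordinates for clicks
    y: int = 0
    text: str = ""   # text to type, if any

def run_agent(goal: str, browser, model, max_steps: int = 30) -> None:
    """Drive a browser toward `goal` using only screenshots in, actions out."""
    for _ in range(max_steps):
        screenshot = browser.capture()                # raw pixels, no DOM access
        action = model.next_action(goal, screenshot)  # vision model picks next step
        if action.kind == "done":
            break
        elif action.kind == "click":
            browser.click(action.x, action.y)
        elif action.kind == "scroll":
            browser.scroll(action.y)
        elif action.kind == "type":
            browser.type(action.text)
        # A production agent would pause here for user confirmation on
        # sensitive steps like logins or payments.
```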

Why it matters:

It’s like having a super cheap virtual assistant that works without any training. I’m sure we’ll see AI agents built on this kind of setup by the end of the year, chaining workflows together to tackle bigger problems, and good prompt engineering will be key to making them work well.

SIDE UPDATES

A leaked memo reveals Apple’s top AI priorities for the year: transforming Siri into "LLM Siri" by spring 2026 and improving its AI models. The revamped Siri is expected to debut in iOS 19.4, marking a significant leap in its capabilities. Meanwhile, Apple’s AI features have drawn intense scrutiny for inaccuracies; in response, the company has temporarily paused notification summaries for certain apps in iOS 18.3 while it works to bring the underlying technology up to user expectations.

DeepSeek’s latest reasoning model, R1, has caught the attention of the tech world. R1 rivals OpenAI’s o1 model on key benchmarks, yet it was reportedly trained for just $5.6 million (a fraction of the hundreds of millions spent by leading U.S. firms). What’s even more impressive? This achievement comes despite U.S. sanctions limiting Chinese companies’ access to advanced chips. DeepSeek’s AI assistant has already climbed to the top of the Apple App Store’s free apps chart, a sign of its broad appeal.

By relying on reinforcement learning rather than supervised fine-tuning, DeepSeek achieved competitive quality at remarkable cost efficiency. R1 excels at coding and mathematics, hinting at potential for broader applications. While it’s still unclear whether its strength in these areas will translate to others, R1’s innovations are poised to reshape AI reasoning on a global scale.
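
DeepSeek’s full recipe is in the R1 paper, but the flavor of the rule-based rewards it describes is easy to illustrate. Here’s a toy accuracy reward for math problems, a sketch in the spirit of the paper’s description rather than DeepSeek’s actual code:

```python
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: 1.0 if the final \\boxed{...} answer matches
    the reference answer, else 0.0. Illustrative only, not DeepSeek's code."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# The RL loop samples several completions per prompt, scores each with
# rewards like this, and nudges the policy toward higher-reward outputs.
print(accuracy_reward(r"...so the answer is \boxed{42}", "42"))  # 1.0
```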

Hugging Face has launched SmolVLM-256M and SmolVLM-500M, compact visual AI models designed for devices with less than 1GB of RAM. These models excel at complex tasks across various media types, including diagram analysis and document comprehension. In benchmarks like AI2D, which tests grade-school science diagram understanding, they outperformed much larger models. This breakthrough brings high-quality AI performance to everyday devices, making advanced visual AI more accessible than ever before.
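
If you want to try one, the models are on the Hugging Face Hub and run through the standard transformers API. A minimal sketch, assuming the SmolVLM-256M-Instruct model ID and the chat format from its model card (check the Hub page if either has changed):

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed Hub ID, verify on the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.float32)

image = Image.open("diagram.png")  # any local image, e.g. a science diagram
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What does this diagram show?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```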

AI BENCHMARK

Humanity’s Last Exam: The one test AI couldn’t beat

An international research team has created a benchmark named "Humanity's Last Exam" to probe the limits of large language models (LLMs); even the most advanced AI systems currently fail roughly 90% of its questions.

Here's what you need to know:

  • The benchmark features 3,000 questions across 100+ specialized fields, with 42% of the questions focused on mathematics.

  • Nearly 1,000 experts from 500 institutions in 50 countries—including professors and PhD holders—collaborated to develop this rigorous assessment.

  • Beyond mathematics, the benchmark spans humanities, natural sciences, and more. To increase complexity, the questions incorporate diagrams, images, and multimedia elements, moving beyond traditional text-based challenges.

Initial results:

  • In early trials, top AI models like GPT-4o, Claude 3.5 Sonnet, and DeepSeek-R1 all scored below 10% on the benchmark.

  • A notable finding was the models' extreme overconfidence: despite expressing high certainty in their answers, they were wrong over 80% of the time. (A minimal sketch of this kind of calibration check follows the list.)
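
That gap between stated confidence and actual accuracy is simple to measure. A minimal sketch of the comparison, using made-up numbers purely for illustration:

```python
# Compare average stated confidence with actual accuracy.
# The (confidence, correct?) pairs below are made up for illustration.
results = [
    (0.95, False), (0.90, False), (0.92, True), (0.88, False), (0.97, False),
]

avg_confidence = sum(c for c, _ in results) / len(results)
accuracy = sum(ok for _, ok in results) / len(results)
print(f"average stated confidence: {avg_confidence:.0%}")  # 92%
print(f"actual accuracy:           {accuracy:.0%}")        # 20%
# A gap this large is the overconfidence the benchmark authors reported.
```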

Why it matters:

"Humanity’s Last Exam" represents a major advancement in AI evaluation. Unlike previous benchmarks, it rigorously tests AI systems across a wide range of disciplines and formats, pushing them to their absolute limits. This provides a more thorough and nuanced understanding of their capabilities and shortcomings.

THINK PIECES / RESOURCES

Thanks for reading.