Software developers working on complex, multi-file projects now have a new tool to evaluate after Microsoft released MAI-Code ...
Google confirmed that Gemini 3.5 Pro, the most powerful model in its Gemini lineup, is already running inside the company and ...
MiniMax M3 launched June 1, 2026 with a 1-million-token context window and company-reported SWE-Bench Pro scores that edge ...
OpenAI scientists have designed MLE-bench — a compilation of 75 extremely difficult tests that can assess whether a future advanced AI agent is capable of modifying its own code and improving itself.