AI Benchmarks for Code

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Kimi K2.7-Code claims 30% fewer thinking tokens and a drop-in API swap path, but independent benchmarks show kernel ...

Xiaomi's new open source, agentic AI coding harness MiMo Code beats Claude Code at ultra-long, 200+ step tasks

The persistent memory system addresses a real and widely felt pain point in agentic development workflows — one that ...

AI Coding Agents Write 180% More Code But Ship Only 30% More Software

AI coding agents boost code output by 180% but shipping rises only 30%, MIT finds. Why private data access beats benchmark ...

12d

Gomboc AI Publishes First Open Benchmark for AI Code Remediation

15 cloud scenarios. 43 merge-ready fixes. 100% loop closure. 12 minutes and $17 to author once; seconds and zero-cost ...

Morning Overview on MSN

The newest Anthropic model just took the top spot on the Super-Agent benchmark — the only AI to finish every test case end-to-end and beat OpenAI’s GPT-5.5

Anthropic’s latest AI model has reportedly reached the top of the Super-Agent benchmark, a grueling test of whether an AI ...

Morning Overview on MSN

Microsoft’s new MAI-Code tool turns plain-English descriptions into working app code

Microsoft has introduced MAI-Code, a tool designed to convert plain-English descriptions into functional application code.

NewsBytes

Xiaomi says its new AI coding model beats Claude Code

Xiaomi has launched MiMo Code v0.1.0, an open-source AI coding tool that reportedly outperforms Anthropic's Claude Code on ...

Xiaomi’s latest AI coding tool claims to outperform Claude Code on complex tasks

The open-source AI coding assistant is designed for long-running software projects and, according to Xiaomi's own benchmarks ...

SD Times

Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

Value stream management has people in the organization examine workflows and other processes to ensure they are deriving the maximum value from their efforts while eliminating waste, of ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

Crypto Briefing

Claude now authors over 80% of code merged into its own codebase

Anthropic reveals Claude Code now writes over 80% of merged production code, up from low single digits in early 2025, reshaping AI development and engineer ...

1mon

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing single-model systems from Anthropic and OpenAI by using more than 100 specialized AI ...
