Open EVMBench — Leaderboard

Phase 1 Detect — Ranked operators

Default view: the judge group comparable to the OpenAI paper (gpt-5 judge, reasoning_effort=high, pinned prompt). Each row is one accepted submission. Rank Δ is the change in rank over the last 7 days.

Reference targets (paper results, no Open EVMBench submission):
Target: Claude Opus 4.6, Detect SOTA reference, 45.6% (paper)
Target: GPT-5.3-Codex, Detect reference, 39.2% (paper)

| Rank | Δ | Operator | Official score | Model | Harness kind | Scaffold | Reasoning | Prompt | Judge | Created | Promoted | Submission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | new | @antfleet-ops | 61.5% (72/117) | gpt-5.5 | single-shot | frontier-models-fleet-single-shot | medium | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #2 | new | @antfleet-ops | 43.6% (51/117) | claude-opus-4-8+gpt-5.5 consensus | single-shot | antfleet-two-model-consensus | off | sha256:d151f4f | gpt-5 (high) | 2026-06-18 | 2026-06-18 | record |
| #3 | new | @antfleet-ops | 27.4% (32/117) | zai-org-glm-5-2 | single-shot | oss-fleet-single-shot | low | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #4 | new | @antfleet-ops | 18.8% (22/117) | qwen/qwen3.7-max | single-shot | oss-fleet-single-shot | api-def | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #5 | new | @antfleet-ops | 16.2% (19/117) | deepseek-v4-pro | single-shot | oss-fleet-single-shot | on | sha256:ce3e260 | gpt-5 (high) | 2026-06-18 | 2026-06-18 | record |
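
The Official score column pairs a percentage with a count such as 72/117. The sketch below shows how the two relate. It assumes the percentage is the count ratio rounded to one decimal place and that rows are ranked by that ratio. The page does not state either rule, and the function names `score_percent` and `ranked` are hypothetical, not part of any Open EVMBench tooling.

```python
# Assumption: official score = hits / total, displayed to one decimal place.
# The (model, hits, total) values are copied from the leaderboard rows above.
ROWS = [
    ("deepseek-v4-pro", 19, 117),
    ("gpt-5.5", 72, 117),
    ("qwen/qwen3.7-max", 22, 117),
    ("claude-opus-4-8+gpt-5.5 consensus", 51, 117),
    ("zai-org-glm-5-2", 32, 117),
]


def score_percent(hits: int, total: int) -> str:
    """Format a hit count as a percentage string with one decimal place."""
    return f"{100 * hits / total:.1f}%"


def ranked(rows):
    """Sort rows from highest to lowest hit ratio (assumed ranking rule)."""
    return sorted(rows, key=lambda row: row[1] / row[2], reverse=True)


if __name__ == "__main__":
    for rank, (model, hits, total) in enumerate(ranked(ROWS), start=1):
        print(f"#{rank} {model}: {score_percent(hits, total)} ({hits}/{total})")
```

Under these assumptions the computed percentages and ordering match the five listed rows, which is consistent with, but does not confirm, the scoring rule.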