Open EVMBench — Leaderboard

Phase 1 Detect — Ranked operators

Default view: submissions scored by the judge group comparable to the OpenAI paper (gpt-5, reasoning_effort=high, pinned prompt). One row per accepted submission. Rank Δ is the change in rank over the last 7 days.

Rank	Δ	Operator	Official score	Model	Harness kind	Scaffold	Reasoning	Prompt	Judge	Created	Promoted	Submission
Target	—	Claude Opus 4.6 — Detect SOTA reference	45.6% paper	—	—	—	—	—	—	—	—	no Open EVMBench submission
Target	—	GPT-5.3-Codex — Detect reference	39.2% paper	—	—	—	—	—	—	—	—	no Open EVMBench submission
#1	new	@antfleet-ops	61.5% 72/117	gpt-5.5	single-shot	frontier-models-fleet-single-shot	medium	sha256:ce3e260	gpt-5 (high)	2026-06-19	2026-06-19	record
#2	new	@antfleet-ops	43.6% 51/117	claude-opus-4-8+gpt-5.5 consensus	single-shot	antfleet-two-model-consensus	off	sha256:d151f4f	gpt-5 (high)	2026-06-18	2026-06-18	record
#3	new	@antfleet-ops	27.4% 32/117	zai-org-glm-5-2	single-shot	oss-fleet-single-shot	low	sha256:ce3e260	gpt-5 (high)	2026-06-19	2026-06-19	record
#4	new	@antfleet-ops	18.8% 22/117	qwen/qwen3.7-max	single-shot	oss-fleet-single-shot	api-def	sha256:ce3e260	gpt-5 (high)	2026-06-19	2026-06-19	record
#5	new	@antfleet-ops	16.2% 19/117	deepseek-v4-pro	single-shot	oss-fleet-single-shot	on	sha256:ce3e260	gpt-5 (high)	2026-06-18	2026-06-18	record
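
The percentages in the Official score column appear to be the adjacent count divided by 117, rounded to one decimal place. The following is a minimal sketch that checks this reading against the rows above. The helper names `score_percent` and `ranked` are hypothetical and not part of Open EVMBench, and the rounding rule is an assumption inferred from the table.

```python
# Sketch only: recompute the Detect percentages from the counts in the table.
# Assumption: score = solved / total, shown as a percentage with one decimal.

# (model, solved, total) copied from the Detect table above.
ROWS = [
    ("deepseek-v4-pro", 19, 117),
    ("gpt-5.5", 72, 117),
    ("qwen/qwen3.7-max", 22, 117),
    ("claude-opus-4-8+gpt-5.5 consensus", 51, 117),
    ("zai-org-glm-5-2", 32, 117),
]


def score_percent(solved: int, total: int) -> float:
    """Return solved/total as a percentage rounded to one decimal place."""
    return round(100 * solved / total, 1)


def ranked(rows):
    """Return rows sorted from highest to lowest solved/total ratio."""
    return sorted(rows, key=lambda row: row[1] / row[2], reverse=True)


for rank, (model, solved, total) in enumerate(ranked(ROWS), start=1):
    print(f"#{rank} {model}: {score_percent(solved, total):.1f}% {solved}/{total}")
```

Under these assumptions the recomputed values match the published column (61.5%, 43.6%, 27.4%, 18.8%, 16.2%), and sorting by ratio reproduces the listed order.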

Phase 2 Patch — Ranked operators

Patch mode: deterministic diff grading over 44 tasks. No LLM judge is involved, so scores are directly comparable. Rank Δ is the change in rank over the last 7 days. Submit via docs/SUBMITTING_PATCH.md.

Rank	Δ	Operator	Official score	Model	Harness kind	Scaffold	Reasoning	Prompt	Created	Promoted	Submission
Target	—	GPT-5.3-Codex — Patch SOTA reference	41.5% paper	—	—	—	—	—	—	—	no Open EVMBench submission

No promoted submissions yet.