Default view: submissions graded by the judge group comparable to the OpenAI paper (gpt-5, reasoning_effort=high, pinned prompt). One row per accepted submission. Rank Δ is the change in rank over the last 7 days.

| Rank | Δ | Operator | Official score | Model | Harness kind | Scaffold | Reasoning | Prompt | Judge | Created | Promoted | Submission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Target | — | Claude Opus 4.6 — Detect SOTA reference | 45.6% paper | — | — | — | — | — | — | — | — | no Open EVMBench submission |
| Target | — | GPT-5.3-Codex — Detect reference | 39.2% paper | — | — | — | — | — | — | — | — | no Open EVMBench submission |
| #1 | new | @antfleet-ops | 61.5% 72/117 | gpt-5.5 | single-shot | frontier-models-fleet-single-shot | medium | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #2 | new | @antfleet-ops | 43.6% 51/117 | claude-opus-4-8+gpt-5.5 consensus | single-shot | antfleet-two-model-consensus | off | sha256:d151f4f | gpt-5 (high) | 2026-06-18 | 2026-06-18 | record |
| #3 | new | @antfleet-ops | 27.4% 32/117 | zai-org-glm-5-2 | single-shot | oss-fleet-single-shot | low | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #4 | new | @antfleet-ops | 18.8% 22/117 | qwen/qwen3.7-max | single-shot | oss-fleet-single-shot | api-def | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #5 | new | @antfleet-ops | 16.2% 19/117 | deepseek-v4-pro | single-shot | oss-fleet-single-shot | on | sha256:ce3e260 | gpt-5 (high) | 2026-06-18 | 2026-06-18 | record |
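
The Official score cells above pair a percentage with a solved/total count. As a minimal sketch of that arithmetic, the snippet below recomputes each percentage from its count. The function name `official_score` is hypothetical, and one-decimal rounding is an assumption inferred from the table rather than a documented rule.

```python
def official_score(solved: int, total: int) -> str:
    """Format a score cell as 'PCT% solved/total' (hypothetical helper)."""
    if total <= 0 or not 0 <= solved <= total:
        raise ValueError("need total > 0 and 0 <= solved <= total")
    pct = 100 * solved / total
    # Assumption: the leaderboard rounds to one decimal place.
    return f"{pct:.1f}% {solved}/{total}"


# Solved counts copied from the table rows #1 to #5, each out of 117.
solved_counts = [72, 51, 32, 22, 19]
cells = [official_score(n, 117) for n in solved_counts]

for cell in cells:
    print(cell)
```

Under that rounding assumption, the recomputed cells match the published ones, which makes this a quick consistency check when reading or submitting a row.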

Patch mode: deterministic diff grading over 44 tasks. No LLM judge is involved, so scores are directly comparable. Rank Δ is the change in rank over the last 7 days. Submit via docs/SUBMITTING_PATCH.md.

| Rank | Δ | Operator | Official score | Model | Harness kind | Scaffold | Reasoning | Prompt | Created | Promoted | Submission |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Target | — | GPT-5.3-Codex — Patch SOTA reference | 41.5% paper | — | — | — | — | — | — | — | no Open EVMBench submission |

No promoted submissions yet.