Phase 1 Detect — All judge groups

Full board: every accepted judge group. Detect scores are comparable only within a single judge group, that is, the same judge model and judge parameters (SPEC §3).

| Rank | Δ | Operator | Official score | Model | Harness kind | Scaffold | Reasoning | Prompt | Judge | Created | Promoted | Submission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Target | | Claude Opus 4.6 - Detect SOTA reference | 45.6% (paper) | | | | | | | | | no Open EVMBench submission |
| Target | | GPT-5.3-Codex - Detect reference | 39.2% (paper) | | | | | | | | | no Open EVMBench submission |
| #1 | new | @antfleet-ops | 71.8% (84/117) | gpt-5.5 | single-shot | frontier-models-fleet-single-shot | xhigh | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #2 | new | @antfleet-ops | 61.5% (72/117) | gpt-5.5 | single-shot | frontier-models-fleet-single-shot | medium | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #3 | new | @antfleet-ops | 59.0% (69/117) | gpt-5.5 | single-shot | frontier-models-fleet-slither-augmented-single-shot | medium | sha256:f415edd | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #4 | new | @antfleet-ops | 54.7% (64/117) | claude-opus-4-8+gpt-5.5 consensus | single-shot | antfleet-two-model-multishot-v3p1-cli | off | sha256:d151f4f | gpt-5.5 (high) | 2026-06-21 | 2026-06-21 | record |
| #5 | new | @antfleet-ops | 43.6% (51/117) | claude-opus-4-8+gpt-5.5 consensus | single-shot | antfleet-two-model-consensus | off | sha256:d151f4f | gpt-5 (high) | 2026-06-18 | 2026-06-18 | record |
| #6 | new | @antfleet-ops | 43.6% (51/117) | gpt-5.4 | single-shot | frontier-models-fleet-single-shot | medium | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #7 | new | @antfleet-ops | 35.9% (42/117) | claude-opus-4-8 | single-shot | frontier-models-fleet-single-shot-opus-tailored | max | sha256:2fbf232 | gpt-5.5 (high) | 2026-06-21 | 2026-06-21 | record |
| #8 | new | @antfleet-ops | 35.0% (41/117) | composer-2.5 | single-shot | cursor-fleet-single-shot | off | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-21 | 2026-06-21 | record |
| #9 | new | @antfleet-ops | 28.2% (33/117) | claude-opus-4-8 | single-shot | frontier-models-fleet-single-shot | off | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #10 | new | @antfleet-ops | 27.4% (32/117) | zai-org-glm-5-2 | single-shot | oss-fleet-single-shot | low | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #11 | new | @antfleet-ops | 26.5% (31/117) | moonshotai-kimi-k2-7-code | single-shot | oss-fleet-single-shot | low | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #12 | new | @antfleet-ops | 26.5% (31/117) | google-gemini-3-5-flash | single-shot | frontier-models-fleet-single-shot | on | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #13 | new | @antfleet-ops | 18.8% (22/117) | qwen/qwen3.7-max | single-shot | oss-fleet-single-shot | api-def | sha256:ce3e260 | gpt-5 (high) | 2026-06-19 | 2026-06-19 | record |
| #14 | new | @antfleet-ops | 17.9% (21/117) | minimax-m3 | single-shot | oss-fleet-single-shot | on | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-19 | 2026-06-19 | record |
| #15 | new | @antfleet-ops | 16.2% (19/117) | deepseek-v4-pro | single-shot | oss-fleet-single-shot | on | sha256:ce3e260 | gpt-5 (high) | 2026-06-18 | 2026-06-18 | record |
| #16 | new | @antfleet-ops | 13.7% (16/117) | google-gemini-3-1-pro-preview | single-shot | frontier-models-fleet-single-shot | off | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #17 | new | @antfleet-ops | 9.4% (11/117) | x-ai-grok-4-3 | single-shot | frontier-models-fleet-single-shot | on | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
| #18 | new | @antfleet-ops | 4.3% (5/117) | llama-3-3-70b | single-shot | oss-fleet-single-shot | off | sha256:ce3e260 | gpt-5.5 (high) | 2026-06-20 | 2026-06-20 | record |
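
Because the board mixes two judge groups, `gpt-5.5 (high)` and `gpt-5 (high)`, the overall rank is not a like-for-like ordering. The sketch below is one way to re-rank a few rows from the board within each judge group. It assumes that a score such as `84/117` means a passed count over a total count, which is consistent with the percentages shown. The tuple layout and the function names `score_pct` and `rank_within_judge_groups` are hypothetical and are not part of Open EVMBench or its SPEC.

```python
from collections import defaultdict

# A few rows copied from the board:
# (model, reasoning, passed, total, judge group)
ROWS = [
    ("gpt-5.5", "xhigh", 84, 117, "gpt-5.5 (high)"),
    ("gpt-5.5", "medium", 72, 117, "gpt-5 (high)"),
    ("gpt-5.4", "medium", 51, 117, "gpt-5.5 (high)"),
    ("zai-org-glm-5-2", "low", 32, 117, "gpt-5 (high)"),
    ("deepseek-v4-pro", "on", 19, 117, "gpt-5 (high)"),
]


def score_pct(passed, total):
    """Percentage score rounded to one decimal, as displayed on the board."""
    return round(100.0 * passed / total, 1)


def rank_within_judge_groups(rows):
    """Group rows by judge and sort each group by score, best first."""
    groups = defaultdict(list)
    for model, reasoning, passed, total, judge in rows:
        groups[judge].append((model, reasoning, score_pct(passed, total)))
    return {
        judge: sorted(entries, key=lambda entry: entry[2], reverse=True)
        for judge, entries in groups.items()
    }


if __name__ == "__main__":
    for judge, entries in rank_within_judge_groups(ROWS).items():
        print(judge)
        for position, (model, reasoning, pct) in enumerate(entries, start=1):
            print(f"  {position}. {model} ({reasoning}): {pct}%")
```

Read this way, the overall #2 entry (61.5%) leads the `gpt-5 (high)` group and is not directly comparable with the #3 entry (59.0%), which was scored by `gpt-5.5 (high)`.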