This note explains how to review the GPT-5.5 / Codex parity program as four merge units without losing the original six-contract architecture.

Merge units

PR A: strict-agentic execution

Owns:
  • executionContract
  • GPT-5-first same-turn follow-through
  • update_plan as non-terminal progress tracking
  • explicit blocked states instead of plan-only silent stops
Does not own:
  • auth/runtime failure classification
  • permission truthfulness
  • replay/continuation redesign
  • parity benchmarking

PR B: runtime truthfulness

Owns:
  • Codex OAuth scope correctness
  • typed provider/runtime failure classification
  • truthful /elevated full availability and blocked reasons
Does not own:
  • tool schema normalization
  • replay/liveness state
  • benchmark gating

PR C: execution correctness

Owns:
  • provider-owned OpenAI/Codex tool compatibility
  • parameter-free strict schema handling
  • replay-invalid surfacing
  • paused, blocked, and abandoned long-task state visibility
Does not own:
  • self-elected continuation
  • generic Codex dialect behavior outside provider hooks
  • benchmark gating

PR D: parity harness

Owns:
  • first-wave GPT-5.5 vs Opus 4.6 scenario pack
  • parity documentation
  • parity report and release-gate mechanics
Does not own:
  • runtime behavior changes outside QA-lab
  • auth/proxy/DNS simulation inside the harness

Mapping back to the original six contracts

Original contract and its owning merge unit:
  • Provider transport/auth correctness: PR B
  • Tool contract/schema compatibility: PR C
  • Same-turn execution: PR A
  • Permission truthfulness: PR B
  • Replay/continuation/liveness correctness: PR C
  • Benchmark/release gate: PR D

Review order

  1. PR A
  2. PR B
  3. PR C
  4. PR D
PR D is the proof layer. It should not be the reason runtime-correctness PRs are delayed.

What to look for

PR A

  • GPT-5 runs act or fail closed instead of stopping at commentary
  • update_plan no longer looks like progress by itself
  • behavior stays GPT-5-first and embedded-Pi scoped
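The PR A review rule can be sketched as a small classifier, assuming a simplified turn model: a turn that only calls update_plan is not progress by itself and must either act or surface an explicit blocked state. `ToolCall` and `TurnOutcome` are illustrative names, not the real runtime API.

```typescript
// Illustrative turn model for the PR A review rule; not the shipped types.
type ToolCall = { name: string };

type TurnOutcome =
  | { kind: "acted" }                    // at least one non-plan tool ran
  | { kind: "blocked"; reason: string }  // explicit fail-closed stop
  | { kind: "plan-only-stall" };         // the failure mode PR A removes

function classifyTurn(calls: ToolCall[], blockedReason?: string): TurnOutcome {
  // Any call other than update_plan counts as real same-turn follow-through.
  if (calls.some((c) => c.name !== "update_plan")) return { kind: "acted" };
  // A stated blocked reason is an acceptable, visible stop.
  if (blockedReason) return { kind: "blocked", reason: blockedReason };
  // update_plan alone is exactly the plan-only silent stop PR A eliminates.
  return { kind: "plan-only-stall" };
}
```

When reviewing PR A, look for the runtime equivalent of the third branch: plan-only turns should be impossible to mistake for progress.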

PR B

  • auth/proxy/runtime failures stop collapsing into generic “model failed” handling
  • /elevated full is only described as available when it is actually available
  • blocked reasons are visible to both the model and the user-facing runtime
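The typed-classification check can be pictured as a discriminated union; the categories follow this note (auth, proxy, DNS, model), but the type and field names below are assumptions, not the shipped code.

```typescript
// Illustrative typed failure classification for the PR B review.
type RuntimeFailure =
  | { kind: "auth"; scope: string; detail: string }
  | { kind: "proxy"; detail: string }
  | { kind: "dns"; host: string }
  | { kind: "model"; detail: string };

// The review signal: a failure only reads as "model failed" when it truly
// is a model error; auth/proxy/DNS keep their own typed descriptions.
function describeFailure(f: RuntimeFailure): string {
  switch (f.kind) {
    case "auth":
      return `auth failure (scope ${f.scope}): ${f.detail}`;
    case "proxy":
      return `proxy failure: ${f.detail}`;
    case "dns":
      return `DNS failure resolving ${f.host}`;
    case "model":
      return `model failed: ${f.detail}`;
  }
}
```

The exhaustive switch is the point: adding a new failure category forces every consumer to handle it instead of silently falling into a generic bucket.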

PR C

  • strict OpenAI/Codex tool registration behaves predictably
  • parameter-free tools do not fail strict schema checks
  • replay and compaction outcomes preserve truthful liveness state
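For the parameter-free strict-schema check, the registration shape to look for follows OpenAI's strict function-calling conventions: an empty `properties` object, an explicit empty `required` array, and `additionalProperties: false`. The tool name below is hypothetical.

```typescript
// Sketch of a strict-mode registration for a parameter-free tool.
const parameterFreeTool = {
  type: "function",
  function: {
    name: "list_open_tasks", // hypothetical parameter-free tool
    strict: true,
    parameters: {
      type: "object",
      properties: {},           // no parameters at all
      required: [] as string[], // still present so strict validation passes
      additionalProperties: false,
    },
  },
};
```

A common failure mode is omitting `required` or `additionalProperties` entirely for zero-argument tools, which strict validators reject; PR C review should confirm these fields survive schema normalization.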

PR D

  • the scenario pack is understandable and reproducible
  • the pack includes a mutating replay-safety lane, not only read-only flows
  • reports are readable by humans and automation
  • parity claims are evidence-backed, not anecdotal
Expected artifacts from PR D:
  • qa-suite-report.md / qa-suite-summary.json for each model run
  • qa-agentic-parity-report.md with aggregate and scenario-level comparison
  • qa-agentic-parity-summary.json with a machine-readable verdict
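A plausible shape for the machine-readable verdict is sketched below; the field names are assumptions chosen to mirror the release-gate checks in this note, not the real qa-agentic-parity-summary.json schema.

```typescript
// Hypothetical summary shape a release gate could consume directly,
// without parsing the markdown report.
interface ParitySummary {
  baselineModel: string;           // e.g. "opus-4.6"
  candidateModel: string;          // e.g. "gpt-5.5"
  scenariosRun: number;
  fakeSuccessCount: number;        // gated to 0 before any parity claim
  stopBehaviorRegressions: number; // likewise gated to 0
  verdict: "parity" | "superior" | "regression";
}

const exampleSummary: ParitySummary = {
  baselineModel: "opus-4.6",
  candidateModel: "gpt-5.5",
  scenariosRun: 12, // illustrative count, not the real pack size
  fakeSuccessCount: 0,
  stopBehaviorRegressions: 0,
  verdict: "parity",
};
```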

Release gate

Do not claim GPT-5.5 parity or superiority over Opus 4.6 until:
  • PR A, PR B, and PR C are merged
  • PR D runs the first-wave parity pack cleanly
  • runtime-truthfulness regression suites remain green
  • the parity report shows no fake-success cases and no regression in stop behavior
The parity harness is not the only evidence source. Keep this split explicit in review:
  • PR D owns the scenario-based GPT-5.5 vs Opus 4.6 comparison
  • PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence
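The gate conditions above reduce to a single predicate, sketched here with assumed input field names rather than a real report schema:

```typescript
// Minimal release-gate predicate mirroring the bullet list above.
interface GateInputs {
  prAMerged: boolean;
  prBMerged: boolean;
  prCMerged: boolean;
  parityPackClean: boolean;         // PR D first-wave pack ran cleanly
  truthfulnessSuitesGreen: boolean; // PR B deterministic regression suites
  fakeSuccessCount: number;         // from the parity report
  stopBehaviorRegressions: number;  // from the parity report
}

function mayClaimParity(g: GateInputs): boolean {
  return (
    g.prAMerged &&
    g.prBMerged &&
    g.prCMerged &&
    g.parityPackClean &&
    g.truthfulnessSuitesGreen &&
    g.fakeSuccessCount === 0 &&
    g.stopBehaviorRegressions === 0
  );
}
```

All conditions are conjunctive: a clean parity run with a red truthfulness suite still fails the gate.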

Quick maintainer merge workflow

Use this when you are ready to land a parity PR and want a repeatable, low-risk sequence.
  1. Confirm the evidence bar is met before merge:
    • reproducible symptom or failing test
    • verified root cause in touched code
    • fix in the implicated path
    • regression test or explicit manual verification note
  2. Triage/label before merge:
    • apply any r:* auto-close labels when the PR should not land
    • keep merge candidates free of unresolved blocker threads
  3. Validate locally on the touched surface:
    • pnpm check:changed
    • pnpm test:changed when tests changed or bug-fix confidence depends on test coverage
  4. Land with the standard maintainer flow (/landpr process), then verify:
    • linked issues auto-close behavior
    • CI and post-merge status on main
  5. After landing, run duplicate search for related open PRs/issues and close only with a canonical reference.
If any one of the evidence bar items is missing, request changes instead of merging.
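The step 1 evidence bar maps directly onto that rule; a minimal sketch, with field names invented for illustration:

```typescript
// The four evidence-bar items from step 1 as a pre-merge check.
interface EvidenceBar {
  reproducibleSymptomOrFailingTest: boolean;
  rootCauseVerifiedInTouchedCode: boolean;
  fixInImplicatedPath: boolean;
  regressionTestOrManualNote: boolean;
}

// Missing any single item means "request changes", never merge.
function mergeAction(bar: EvidenceBar): "merge" | "request-changes" {
  return Object.values(bar).every(Boolean) ? "merge" : "request-changes";
}
```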

Goal-to-evidence map

Completion gate item, its primary owner, and the review artifact:
  • No plan-only stalls (owner: PR A): strict-agentic runtime tests and approval-turn-tool-followthrough
  • No fake progress or fake tool completion (owners: PR A + PR D): parity fake-success count plus scenario-level report details
  • No false /elevated full guidance (owner: PR B): deterministic runtime-truthfulness suites
  • Replay/liveness failures remain explicit (owners: PR C + PR D): lifecycle/replay suites plus compaction-retry-mutating-tool
  • GPT-5.5 matches or beats Opus 4.6 (owner: PR D): qa-agentic-parity-report.md and qa-agentic-parity-summary.json

Reviewer shorthand: before vs after

User-visible problem before, and the review signal after:
  • Before: GPT-5.5 stopped after planning. After: PR A shows act-or-block behavior instead of commentary-only completion.
  • Before: tool use felt brittle with strict OpenAI/Codex schemas. After: PR C keeps tool registration and parameter-free invocation predictable.
  • Before: /elevated full hints were sometimes misleading. After: PR B ties guidance to actual runtime capability and blocked reasons.
  • Before: long tasks could disappear into replay/compaction ambiguity. After: PR C emits explicit paused, blocked, abandoned, and replay-invalid states.
  • Before: parity claims were anecdotal. After: PR D produces a report plus a JSON verdict with the same scenario coverage on both models.