We tested workflow, not just answer quality
The most useful coding models are not necessarily the ones with the flashiest single response. They are the ones that maintain structure over longer debugging sessions, ask useful clarifying questions, and stay aligned while patching across multiple files.
That is why this review weighted repo navigation, change planning, refactor coherence, and recovery from wrong assumptions.
Where GPT-5.x pulled ahead
GPT-5.x performed strongest in architecture framing, ambiguity handling, and full-thread synthesis. It was more likely to preserve the overall objective when the task involved multiple subsystems or delayed decisions.
It also recovered more gracefully when the prompt changed halfway through a task.
Where Claude 4.x stayed competitive
Claude 4.x remained excellent at calm editing, readable patches, and straightforward transformations that benefited from predictable formatting discipline.
The practical conclusion is not that one model wins everything. It is that teams should choose according to the shape of work: longer reasoning chains versus tighter editorial control.