Document understanding is a memory and structure problem
Strong multimodal performance requires more than OCR or chart recognition. The model has to preserve relationships between sections, footnotes, tables, and visuals without flattening everything into disconnected facts.
That is why our tests emphasized report synthesis, chart comparison, appendix recall, and summary accuracy over long contexts.
Where Gemini 2.5 felt strongest
Gemini 2.5 was particularly effective at extracting visually grounded details from mixed layouts and at keeping its attention on the document regions that mattered. On chart-heavy slides, it answered quickly and confidently.
Its biggest weakness showed up when the answer required deeper document-level synthesis rather than localized extraction.
Where GPT-5.x pulled ahead
GPT-5.x performed better than Gemini 2.5 when multiple pages had to be combined into one coherent argument. It was more reliable at preserving narrative structure and cross-referencing evidence before drawing a conclusion.
For Lab, that means the right pick depends on whether the job is extraction-first or synthesis-first.
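One way to make that choice explicit is a small routing rule. The sketch below is only an illustration of the extraction-first versus synthesis-first split described above. The model labels, the `DocumentTask` fields, and the `pick_model` function are all hypothetical names introduced here, not real API identifiers or anything taken from our test setup.

```python
from dataclasses import dataclass

# Hypothetical labels for illustration; these are not real API model identifiers.
EXTRACTION_MODEL = "gemini-2.5"
SYNTHESIS_MODEL = "gpt-5.x"


@dataclass
class DocumentTask:
    """A document question plus the caller's own estimate of its scope."""
    question: str
    pages_needed: int            # how many pages the answer must draw on
    needs_cross_reference: bool  # must footnotes, tables, or appendices be linked?


def pick_model(task: DocumentTask) -> str:
    """Route synthesis-first jobs to one model and extraction-first jobs to the other.

    Assumption: a job counts as synthesis-first when it spans more than one page
    or requires cross-referencing; everything else is treated as localized extraction.
    """
    if task.pages_needed > 1 or task.needs_cross_reference:
        return SYNTHESIS_MODEL
    return EXTRACTION_MODEL


if __name__ == "__main__":
    chart_lookup = DocumentTask("What is the Q3 value in the bar chart?", 1, False)
    report_summary = DocumentTask("Summarize the argument across all sections.", 12, True)
    print(pick_model(chart_lookup))
    print(pick_model(report_summary))
```

The threshold here is deliberately crude. In practice the split would likely depend on the document type and on how much cross-page evidence the question needs, so treat the rule as a starting point rather than a recommendation.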