
Single LLM vs Multi-Agent: A Real-World Contract Review Comparison

PAK4L Engineering · March 4, 2026 · 8 min read

Everyone knows LLMs can review documents. But how does a single-pass review from a frontier model compare to a multi-agent system with specialized experts? We ran a real test to find out — and the results surprised us.

We took an Italian IT services contract — a realistic 12-article agreement between a supplier (Alfa Tecnologie S.r.l.) and a client (Beta Consulting S.p.A.) — and submitted it to three different systems with identical instructions: "Identify ALL issues, risks, and unfavorable clauses."

The Setup

The contract is approximately 4,800 characters long and covers standard clauses: scope, duration, payment, SLAs, intellectual property, confidentiality, liability, data protection, subcontracting, termination, and jurisdiction. It was drafted with several embedded issues of varying severity.

We tested three approaches:

  • Gemini 3.1 Pro (single-pass) — Google's frontier model, temperature 0.3, 8K max tokens
  • GPT-5.2 (single-pass) — OpenAI's latest reasoning model, temperature 0.3, 8K max tokens
  • PAK4L Multi-Agent — 9 specialized agents (Clarity Expert, Structure Analyst, Logic Checker, Italian/EU Law Expert, Completeness Auditor, Style Editor, Consistency Checker, Audience Alignment, Fact Checker) + Coordinator + Final Evaluator

All three received the same document. The single-pass models got a simple prompt asking for a complete legal review. PAK4L used its standard pipeline with dynamic agent selection.
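PAK4L's actual selection logic is not described in detail here, but the idea behind dynamic agent selection can be sketched as a core bench of always-on agents plus specialists triggered by simple document signals (the trigger keywords below are illustrative assumptions, not PAK4L's real rules):

```python
# Hypothetical sketch of dynamic agent selection: a core bench always
# runs, and specialists are added when document signals suggest them.
CORE_AGENTS = [
    "Clarity Expert", "Structure Analyst", "Logic Checker",
    "Completeness Auditor", "Consistency Checker",
]

# Assumed trigger keywords -- purely illustrative.
SPECIALIST_TRIGGERS = {
    "Italian/EU Law Expert": ("gdpr", "s.r.l.", "s.p.a.", "foro"),
    "Fact Checker": ("statistics", "benchmark"),
    "Style Editor": ("marketing", "tone"),
}

def select_agents(text: str) -> list[str]:
    """Pick the core agents, then any specialist whose keywords appear."""
    lowered = text.lower()
    selected = list(CORE_AGENTS)
    for agent, keywords in SPECIALIST_TRIGGERS.items():
        if any(k in lowered for k in keywords):
            selected.append(agent)
    return selected
```

For the contract in this test, mentions of "S.r.l.", "S.p.A.", and GDPR would pull in the Italian/EU Law Expert alongside the core bench.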

Results at a Glance

Here is the raw comparison:

  • Gemini 3.1 Pro: 13 issues found, 49 seconds, ~10,000 characters of analysis
  • GPT-5.2: 32 issues found, 95 seconds, ~18,700 characters of analysis
  • PAK4L: 29 raw issues → 22 unique after deduplication, 9 agents + coordinator + evaluator, structured output with severity breakdown (4 Critical, 9 High, 6 Medium, 2 Low)

GPT-5.2 found the most individual issues (32). PAK4L found fewer (22 deduplicated) but with cross-validated severity, actionable fixes, and deliverables no single model produces. The difference is not in counting — it is in what you do with the findings.
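The 29 → 22 deduplication step can be sketched as a similarity merge: findings from different agents that describe the same problem in the same article collapse into one, keeping the highest reported severity. This is an illustrative reconstruction, not PAK4L's actual dedup code:

```python
from difflib import SequenceMatcher

SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}

def dedupe(issues: list[dict], threshold: float = 0.8) -> list[dict]:
    """Merge near-duplicate findings (similar title, same article),
    keeping the highest severity any agent assigned."""
    unique: list[dict] = []
    for issue in issues:
        for kept in unique:
            similar = SequenceMatcher(
                None, issue["title"].lower(), kept["title"].lower()
            ).ratio() >= threshold
            if similar and issue.get("article") == kept.get("article"):
                # Same finding from another agent: keep the worse severity.
                if SEVERITY_RANK[issue["severity"]] > SEVERITY_RANK[kept["severity"]]:
                    kept["severity"] = issue["severity"]
                break
        else:
            unique.append(dict(issue))
    return unique
```

Note the side effect of merging: when two agents rate the same issue differently, the deduplicated finding inherits the more severe rating, which feeds the severity breakdown above.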

What Everyone Found

All three systems correctly identified the core contract problems. These findings were consistent across all approaches:

  • Intellectual property stays with the supplier despite client funding development (Art. 5)
  • Liability cap too low at 50% of annual fee, excluding data loss entirely (Art. 7)
  • GDPR clause too generic, missing formal Data Processor appointment under Art. 28 (Art. 8)
  • Subcontracting allowed without client authorization (Art. 9)
  • Termination penalty of 100% of remaining value is disproportionate (Art. 10)
  • Confidentiality period of only 1 year is insufficient (Art. 6)
  • Travel expenses "without limits" create uncontrollable costs (Art. 3)
  • Missing acceptance testing / collaudo procedure
  • Missing exit strategy and transition assistance
  • Vexatious clauses lacking the specific double-signature approval required by Art. 1341 of the Italian Civil Code

GPT-5.2 was particularly impressive here: it found issues like missing change control procedures, undefined maintenance scope, missing ADR clauses, and personnel requirements that even PAK4L's agents did not surface. When it comes to raw issue enumeration, frontier reasoning models are remarkably thorough.

So Why Multi-Agent?

If GPT-5.2 found more issues than PAK4L, why use multiple agents? Because document review is not a counting exercise. The value lies in depth of analysis, cross-validation, actionable output, and unique specialized findings that even the best single-pass model misses.

1. DORA Compliance (Structure Analyst)

PAK4L's Structure Analyst flagged that the contract fails to specify data processing locations, a requirement under the EU Digital Operational Resilience Act (DORA, Regulation 2022/2554). If the client operates in financial services, this omission creates significant regulatory exposure. Neither Gemini nor GPT-5.2 mentioned DORA at all — they focused entirely on GDPR. A dedicated compliance agent knows to check beyond the obvious framework.

2. Logical Paradox Detection (Logic Checker)

All three systems flagged the IP clause as problematic. But PAK4L's Logic Checker identified a deeper logical paradox: if the client terminates the contract due to the supplier's breach, the client still loses access to the software it paid to develop. This creates a perverse incentive where the supplier benefits from its own non-performance. GPT-5.2 described the clause as creating "lock-in" but did not identify this specific logical contradiction.

3. Cross-Clause Connections (Multiple Agents)

PAK4L's Completeness Auditor connected two separate clauses that no single-pass model linked: Art. 7 excludes liability for data loss, yet the contract has no backup or disaster recovery obligations. Excluding liability for data loss is only defensible if the supplier is contractually required to prevent it. This cross-clause analysis requires understanding the document as a system of interconnected obligations, not a list of independent articles.
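This kind of cross-clause check can be expressed as a rule over the contract as a whole rather than over any single article. The sketch below is an assumed rule for illustration, not PAK4L's implementation:

```python
def check_cross_clause(clauses: dict) -> list[str]:
    """Illustrative cross-clause rule: excluding liability for data
    loss is only defensible if the contract also imposes a backup or
    disaster-recovery obligation on the supplier."""
    findings = []
    excludes_data_loss = clauses.get("liability", {}).get("excludes_data_loss", False)
    has_backup_duty = "backup" in clauses or "disaster_recovery" in clauses
    if excludes_data_loss and not has_backup_duty:
        findings.append(
            "Liability clause excludes data loss, but no clause obliges "
            "the supplier to prevent it (backup/DR obligations missing)."
        )
    return findings
```

A single-pass model reading article by article can miss this because neither clause is wrong in isolation; the problem only exists in their combination.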

4. Cross-Validation Between Agents

When the Italian/EU Law Expert flagged GDPR Art. 28 violations, the Structure Analyst confirmed from an architecture perspective, and the Completeness Auditor added that the required DPA annex was entirely missing. Three independent agents, three different methodologies, one reinforced finding. In the debate log, the Italian/EU Law Expert wrote: "I confirm the Structure Analyst's finding on DORA: Art. 30(2)(b) requires specifying data processing locations."

GPT-5.2's 32 issues are generated by a single reasoning chain. If that chain has a blind spot (like DORA), nothing catches it. PAK4L's architecture makes blind spots structurally less likely because each agent has a different one.

Cross-validation does not increase the issue count — it increases the reliability of each finding. When three independent agents flag the same issue, you can be confident it is real, not a hallucination.
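As a toy model of that reliability effect: if each agent independently misses a real issue with some probability, confidence in a finding should grow with the number of independent agents that flag it. The formula below is illustrative (assuming a 50% miss rate per agent), not PAK4L's actual scoring:

```python
def cross_validate(flags: dict[str, list[str]]) -> dict[str, float]:
    """Toy confidence model: confidence = 1 - 0.5^n, where n is the
    number of independent agents that flagged the finding.
    (Illustrative formula only -- not PAK4L's real scoring.)"""
    counts: dict[str, int] = {}
    for agent, findings in flags.items():
        for finding in findings:
            counts[finding] = counts.get(finding, 0) + 1
    return {f: round(1 - 0.5 ** n, 3) for f, n in counts.items()}
```

Under this toy model, the GDPR Art. 28 finding confirmed by three agents would score 0.875, while a finding flagged by only one agent sits at 0.5 and warrants more scrutiny.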

The Output Gap

The most significant difference is not what was found but how it is delivered. GPT-5.2 produces an excellent narrative report — but it is still prose. PAK4L produces:

  • Structured JSON — every issue has a title, severity, category, location, evidence, confidence score, and suggested fix
  • Exact text replacements — not "rewrite the clause" but the specific old text and new text, ready to apply
  • Risk score (2/10) with executive summary and priority improvement matrix
  • Redline document — a track-changes DOCX showing all recommended modifications
  • Revised document — a clean version incorporating all fixes
  • Audit trail — which agent found what, how agents agreed or disagreed, and why

Converting GPT-5.2's 32-issue narrative report into actionable document changes requires a lawyer to read it, interpret each recommendation, and manually apply fixes. PAK4L's output is machine-readable and directly applicable — the "Fix" button in the Boardroom applies a suggested change in one click.
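The idea behind the exact-replacement model can be sketched in a few lines: because each fix carries the precise old text and new text, it either applies cleanly or fails loudly, with no interpretation step in between. This is an illustration of the concept, not PAK4L's code:

```python
def apply_fix(document: str, old_text: str, new_text: str) -> str:
    """Apply one structured fix. The old text must match exactly once,
    so an ambiguous or stale fix fails instead of silently corrupting
    the document -- the property a one-click 'Fix' button relies on."""
    if document.count(old_text) != 1:
        raise ValueError("old_text must match exactly once in the document")
    return document.replace(old_text, new_text)
```

A narrative recommendation like "extend the confidentiality period" cannot be applied this way; `apply_fix(doc, "1 (one) year", "5 (five) years")` can.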

When Single-Pass Is Enough

To be fair: GPT-5.2 is remarkably capable. For quick triage — determining whether a document needs detailed review — it provides fast, thorough screening. For simple documents with a narrow scope (standard NDAs, short ToS updates), a single frontier model is often sufficient and more cost-effective.

If your workflow is "read the report, then hand it to a lawyer," a single-pass LLM saves time. The issues it finds are real and well-articulated.

When You Need Multi-Agent

Multi-agent review becomes essential when:

  • You need cross-validated findings — not just one model's opinion, but independent confirmation from multiple specialized perspectives
  • You need regulatory completeness beyond GDPR — DORA, NIS2, sector-specific regulations that single models consistently overlook
  • You need actionable output — structured data, exact text fixes, redline documents, not just narrative prose
  • You need audit-grade transparency — showing exactly which expert found what, with evidence and confidence scores
  • The document is a system of interconnected clauses where cross-references between articles reveal issues invisible in isolation

Neither GPT-5.2 nor Gemini 3.1 Pro mentioned DORA compliance — a regulation that can result in significant fines for financial services entities. This is not a model quality issue; it is a depth-of-specialization issue. A dedicated compliance agent checks frameworks that a generalist model's attention does not reach.

Methodology Notes

For transparency: Gemini used temperature 0.3 with 8K max output tokens. GPT-5.2 used temperature 0.3 with 8K max completion tokens. Both received an identical prompt asking for severity ratings, article references, descriptions, and recommendations. The PAK4L run used the standard production pipeline with Gemini 3.1 Pro for coordination and Gemini 3 Flash for specialized agents. All tests were run on the same day against the same unmodified contract.
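The run matrix above can be captured as a small config for reproducibility. Model identifiers are taken from the post; the parameter names are generic placeholders, not any specific SDK's argument names:

```python
# Shared generation settings for the two single-pass runs.
# "max_output_tokens" is a generic placeholder name, not a
# particular SDK's parameter.
SHARED_CONFIG = {"temperature": 0.3, "max_output_tokens": 8192}

RUNS = [
    {"system": "Gemini 3.1 Pro", "mode": "single-pass", **SHARED_CONFIG},
    {"system": "GPT-5.2", "mode": "single-pass", **SHARED_CONFIG},
    {"system": "PAK4L", "mode": "multi-agent",
     "coordinator": "Gemini 3.1 Pro", "specialists": "Gemini 3 Flash"},
]
```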

The comparison script and raw outputs are available in our benchmark repository for reproducibility.

Ready to try PAK4L?

Upload a document and see multi-agent review in action.

Get Started Free