Engineering

Measuring AI Reliability: Our Repeatability Benchmark

PAK4L Engineering · March 4, 2026 · 7 min read

If you run the same document through an AI review system on Monday and get 10 critical issues, but on Tuesday you get only 6, several of them different from Monday's, can you trust either result? Repeatability is the foundation of trust in any analytical system.

PAK4L uses multiple specialized agents that run LLMs with non-zero temperature — this introduces inherent stochasticity. We built a rigorous repeatability benchmark to measure exactly how much this affects the output.

Test Protocol

For each test document, we run 5 identical, independent reviews with the same configuration. No caching, no shortcuts — each run is a fresh LLM invocation. We tested three document types: an IT services contract, a corporate privacy policy, and a public procurement proposal.
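The protocol itself is simple to express. A minimal sketch, assuming a hypothetical run_review() callable that performs one fresh, uncached review of a document and returns its list of findings (the names here are illustrative, not PAK4L's actual API):

```python
from typing import Callable

def collect_runs(run_review: Callable[[str], list[str]],
                 document: str, n_runs: int = 5) -> list[list[str]]:
    """Invoke run_review n_runs times, fully independently: no shared
    cache, no reuse of earlier results. Each call is a fresh review."""
    return [run_review(document) for _ in range(n_runs)]
```

Every downstream metric in this post is computed over the lists returned by a loop like this one.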

The Challenge: Matching Issues Across Runs

The hardest part of measuring repeatability isn't running the reviews — it's comparing the results. The same issue gets described differently across runs:

  • Run 1: "Missing GDPR Art. 28 data processing agreement"
  • Run 2: "Non-compliance with GDPR requirements for controller-processor relationships"
  • Run 3: "Absence of DPA clauses required under Art. 28 GDPR"

These are clearly the same defect with different wording. Simple text similarity (TF-IDF, cosine embeddings) fails on legal text where domain semantics matter more than lexical overlap. We use an LLM-based semantic clustering approach: a fast model reads all issues across all runs and groups them by underlying document defect.
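A minimal sketch of how such a clustering prompt might be assembled. The prompt wording and the requested JSON shape are assumptions for illustration, and the actual LLM call is elided:

```python
def build_clustering_prompt(runs: list[list[str]]) -> str:
    """Flatten every finding from every run into one numbered list and
    ask a fast model to group entries that describe the same underlying
    document defect. (Prompt wording is illustrative, not PAK4L's.)"""
    numbered = [f"(run {r}) {issue}"
                for r, issues in enumerate(runs, start=1)
                for issue in issues]
    listing = "\n".join(f"{i}. {text}"
                        for i, text in enumerate(numbered, start=1))
    return (
        "Group the findings below by the underlying document defect they "
        "describe. Two findings belong together if fixing one defect "
        "would resolve both. Reply with JSON: a list of clusters, each a "
        "list of finding numbers.\n\n" + listing
    )
```

The model's cluster assignments then give each finding a canonical defect ID, which is what the run-to-run comparisons operate on.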

Results

All three documents received identical scores (5/10) across all 5 runs. The overall quality assessment is perfectly stable.

But the real story is in the issue-level analysis:

  • Jaccard Similarity 0.54-0.72: Any two runs share 54-72% of their findings
  • Core Issue Rate 43-65%: Nearly half to two-thirds of unique issues appear in 80%+ of runs
  • 100% Frequency Issues: 6-10 critical issues found in every single run, per document
  • Signal/Noise Ratio up to 15.0: The procurement document had zero peripheral issues — every finding was reproducible
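Once findings are mapped to canonical defect IDs, these run-level metrics reduce to set arithmetic. A sketch, assuming each run is represented as a set of cluster IDs:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two runs: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def core_issue_rate(runs: list[set[str]], threshold: float = 0.8) -> float:
    """Fraction of unique issues that appear in >= threshold of runs
    (0.8 matches the '80%+ of runs' criterion used here)."""
    all_issues = set().union(*runs)
    core = [issue for issue in all_issues
            if sum(issue in run for run in runs) / len(runs) >= threshold]
    return len(core) / len(all_issues)
```

For example, two runs sharing two of four unique defects score a Jaccard similarity of 0.5.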

Severity Calibration

We measure severity consistency using Linear-Weighted Cohen's Kappa — the standard metric for ordinal scale agreement. This treats a 1-level disagreement (HIGH vs CRITICAL) as partial agreement rather than total disagreement.

Our weighted kappa ranges from 0.40 to 0.59 across documents. For context, human inter-rater agreement on legal severity classifications typically ranges from 0.30 to 0.60. The AI's severity calibration is comparable to expert human variability.
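Linear-weighted kappa is straightforward to compute from scratch. A sketch for two severity ratings per issue on a k-level ordinal scale, with severities encoded as integers 0..k-1 (e.g. LOW=0 through CRITICAL=k-1):

```python
def linear_weighted_kappa(a: list[int], b: list[int], k: int) -> float:
    """Linear-weighted Cohen's kappa for two raters on an ordinal scale
    with k >= 2 levels. A 1-level disagreement earns partial credit via
    the weight 1 - |i - j| / (k - 1)."""
    n = len(a)
    # Observed weighted agreement across all rated items.
    po = sum(1 - abs(x - y) / (k - 1) for x, y in zip(a, b)) / n
    # Expected weighted agreement from each rater's marginal distribution.
    pa = [a.count(i) / n for i in range(k)]
    pb = [b.count(j) / n for j in range(k)]
    pe = sum(pa[i] * pb[j] * (1 - abs(i - j) / (k - 1))
             for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)
```

Perfect agreement yields 1.0, and chance-level agreement yields 0.0, so values in the 0.40-0.59 range indicate moderate agreement well above chance.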

What the System Always Finds

The most important finding: every critical regulatory and legal issue is caught in every single run. The variance is concentrated in lower-severity, subjective findings — exactly where human reviewers also disagree. Things like "clarity could be improved" or "audience alignment is suboptimal" vary between runs, but "GDPR Art. 28 violation" never escapes detection.

Ready to try PAK4L?

Upload a document and see multi-agent review in action.

Get Started Free