Second benchmark edition shows major gains in open-ended compliance work, shifting the focus from model choice to real-world deployment.
AI has crossed a practical threshold in compliance & ethics. The EQS AI Benchmark Volume 2 shows that the latest generation of AI models not only improves performance, but can now reliably handle multi-step compliance workflows – a capability that was out of reach just six months ago.
Building on the first volume published in October 2025, EQS Group tested four newly released frontier AI models on the same set of 120 real-world compliance tasks. The updated benchmark, created in collaboration with the German association Berufsverband der Compliance Manager e.V. (BCM), now compares a total of ten leading models, providing a direct view of how the latest generation performs against last year’s frontier.
Frontier models converge at the top
In Volume 2, OpenAI’s GPT-5.4 leads the benchmark with a score of 87.6%, closely followed by Google’s Gemini 3.1 Pro (87.4%) and Anthropic’s Claude Opus 4.6 (86.1%). The top three models are separated by just 1.5 percentage points. This clustering signals a clear shift: while performance gains continue, leading models are approaching a practical ceiling for general compliance tasks, making deployment strategy more important than marginal differences in model capability.
Biggest gains in open-ended compliance work
The most meaningful improvements appear in open-ended tasks such as drafting reports, policies, or investigation plans – tasks that closely mirror the work compliance teams deliver to internal stakeholders, management, and regulators. Across all vendors, performance on these tasks increased significantly, by as much as 17 to 18 percentage points compared to the first report, moving outputs from “usable with heavy editing” to “usable with light review.”
Agentic compliance workflows cross a key threshold
The most important finding of the benchmark lies beyond individual task performance: AI models are now approaching the capability needed to support multi-step compliance workflows end-to-end. In a simulated Conflict of Interest process – covering classification, risk assessment, review routing, and mitigation – a single frontier model (GPT-5.4) scored above 90% on each individual workflow step. While the benchmark did not test a fully connected agentic workflow, the results indicate that such workflows are becoming significantly more feasible than they were just six months ago.
“The benchmark shows how quickly AI is becoming a real driver of innovation in Compliance,” said Dr. Martin Benda, President of BCM. “The opportunity now is to translate these capabilities into practical applications – in a way that strengthens both effectiveness and responsible oversight.”
“Six months ago, the question was whether AI could support real compliance work. Today, the question is how we design workflows around it,” said Moritz Homann, Head of AI at EQS Group. “Agentic compliance is no longer a question of feasibility, but of design, especially where to place the right human oversight. The latest models are strong enough to handle multi-step processes, but the real differentiator is the context around them: the tools and checkpoints that make AI reliable in practice.”