Lowering the Risk of High-Stakes AI

Lucas Chapin

Head of Data

How we build compliance agents you can trust

Introduction

Imagine your senior compliance analyst walks into your office one day with a bold proclamation: they’ve automated all compliance work! No more manual reviews. No more tedious escalations. They've deployed an AI bot that reads every alert, drafts every SAR narrative, follows up on every RFI, and closes every low-risk case before your anachronistic human team has had their morning coffee.

You almost congratulate them.

Then you ask what data it can access.

“All of it,” they say.

“What permissions does it have?”

“Um… all of them?”

“How do you know it’s doing a good job? Can we tell if it’s accurate? Would we know if something goes wrong?”

“Uhh…”

The truth is that new tools like OpenClaw and UI-based agent builders have made building powerful agents much more accessible. Getting a proof of concept running is very easy provided that you’re willing to overlook security, auditability, and governance.

Automation in Compliance

As technologists, we are always looking for ways to make things faster and easier. It’s the drumbeat that brought us online banking, digitized KYC, remote deposit capture, real-time fraud detection, and countless other innovations in the financial world.

The next frontier is clearly AI agents. At Hummingbird, we’ve spent much of the last year working to develop agents (see our product announcement). And this year? We're doubling down on that focus.

We also recognize, however, that every industry is buzzing with bold proclamations about their own AI agents and what they can do. Which is why we’ve decided to dedicate this post to explaining to you not just what our agents can do, but how we’re actually building them.

The Challenge with Agents

When developing agents, the core challenge is clear. Asking an agent to make a decision is easy. Knowing it made the right one is harder.

Failure modes abound. Missing data, uncorroborated sources, edge cases, and evolving crime typologies – these are all things that make building compliance agents difficult. Difficult and high stakes. The cost of being cavalier about automation can be catastrophic. In compliance work, it’s not just a bad user experience but regulatory exposure that’s on the line.

At the same time, ignoring the potential of AI agents is shortsighted, and the surest way to end up in real trouble. That is especially true when, as we believe, automation doesn’t have to come at the cost of increased risk, but can in fact strengthen control and oversight when managed correctly.

Building with Confidence

Here’s the main thing to keep in mind, and the core of our engineering philosophy at Hummingbird.

Confidence isn't a feature you ship; it's a system you build.

Broadly speaking, we build confidence into our agents in three ways:

  1. Evals during product R&D
  2. Live monitoring of real-world agent execution
  3. Gradual deployment through different agent modes

Let’s discuss each of these in more detail:

Evals

Evals are the heart of R&D for AI products. In traditional software development, you validate whether a product works by running it through a variety of different tests, such as unit and integration tests. These tests can tell you if the software functions as expected and identify instances where new features may break older functionality.

While still useful in building AI products, tests alone are insufficient. Take our AI Assistant for example. A unit test is very good at telling us whether clicking the submit button has completed the expected action of sending the user input off to an LLM. But it falls hopelessly short where assessing the quality of the response is concerned. For example, if a user puts in an open-ended prompt to analyze transactions, we’d expect a high-quality response that analyzes those transactions in the context of the broader case and through the lens of an analyst working the alert.

The question, then, is how to reliably evaluate for these broader, more open-ended quality benchmarks. It’s not impossible – but it takes a significant amount of domain expertise, as well as a high level of effort and strong attention to detail. Writing good evals is where we lean on our in-house investigators and customer design partners to work directly with our engineering team. Together, we build realistic test scenarios with high-quality answers.

Since we can’t cover the full range of tasks an agent may have to perform across different scenarios and contexts, we extend testing with LLM-as-a-Judge: we feed in examples of good responses to guide the AI in making its own judgment and scoring the quality of responses numerically. We run multiple evaluations on every execution of our AI agents to check details such as whether the response is factually backed by the case data and whether the right MCP tools were invoked to extend the context available to the LLM.
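To make the idea concrete, here is a minimal Python sketch of an LLM-as-a-Judge harness. It is an illustration under stated assumptions, not Hummingbird’s actual implementation: the names `EvalCase`, `RUBRICS`, and `run_evals` are hypothetical, and the judge is a simple word-overlap stand-in so the example runs offline. In a real system the judge would be a call to an LLM that receives the rubric and example answers.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalCase:
    prompt: str            # what the analyst asked the agent
    reference_answer: str  # expert-written "good" answer
    agent_answer: str      # what the agent actually produced


# A judge takes (rubric, case) and returns a score between 0.0 and 1.0.
Judge = Callable[[str, EvalCase], float]

# Hypothetical rubrics; each one is scored separately for every case.
RUBRICS: Dict[str, str] = {
    "grounded_in_case_data": "Is every claim supported by the case data?",
    "matches_reference": "Does the answer reach the reference conclusion?",
}


def run_evals(
    cases: List[EvalCase], judge: Judge, threshold: float = 0.8
) -> Tuple[Dict[str, float], List[str]]:
    """Score every case on every rubric; return mean scores and failing rubrics."""
    scores = {
        name: mean(judge(rubric, case) for case in cases)
        for name, rubric in RUBRICS.items()
    }
    failing = sorted(name for name, score in scores.items() if score < threshold)
    return scores, failing


def keyword_overlap_judge(rubric: str, case: EvalCase) -> float:
    """Offline stand-in for an LLM judge: share of reference words in the answer."""
    reference = set(case.reference_answer.lower().split())
    answer = set(case.agent_answer.lower().split())
    return len(reference & answer) / len(reference) if reference else 0.0
```

Keeping the judge behind a plain callable means the same harness can run with a cheap stub in unit tests and with a model-backed judge in a full eval run.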

Live Monitoring

While leveraging evals during product R&D is important, there are a couple of limitations when it comes to releasing an AI agent to a production environment:

  1. It’s impossible to predict in advance the full range of ways an agent may be used out in the wild.
  2. Financial crime evolves, and test scenarios covered today may be insufficient to keep up with new patterns of criminal activity and typologies in the future.

To address these limitations, we run our existing evaluations on data from real-world agent usage, which helps us ensure that what our evals demonstrate in a testing instance holds true in practice. We also create AI-based classifications of new scenarios encountered by our agents. This allows us to flag new use cases where our coverage may be more limited. We can then create new evals to test these novel scenarios, so that we keep up with the range of ways our customers use our agents.
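The coverage check described above can be sketched in a few lines of Python. This is a hypothetical illustration: it assumes each production run and each eval case has already been tagged with a scenario label (by an AI classifier or otherwise), and the function name `find_coverage_gaps` and the `min_eval_cases` cutoff are our own inventions for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_coverage_gaps(
    production_labels: Iterable[str],
    eval_labels: Iterable[str],
    min_eval_cases: int = 5,
) -> List[Tuple[str, int]]:
    """Return (scenario, production_count) pairs that have too few eval cases.

    Results are ordered by how often the scenario shows up in production,
    so the most common uncovered scenarios come first.
    """
    seen = Counter(production_labels)
    covered = Counter(eval_labels)
    return [
        (scenario, count)
        for scenario, count in seen.most_common()
        if covered[scenario] < min_eval_cases
    ]
```

Sorting by production frequency is a design choice: it points eval-writing effort at the gaps customers actually hit most often.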

Agent Modes

The last piece of our confidence system, and perhaps the most important, is our agent implementation process. Here at Hummingbird, we bring customers along on the journey to using reliable agents. Why do we believe this is so important? Well, for one, it’s the best form of user training. But equally important is the fact that – as with any new technology – seeing is believing.

For example, take our review agent. The goal of this agent is to complete a case workflow – what we call a "review" – automatically on behalf of a customer. But deploying it without thorough testing and oversight is unthinkable. Even with audit logs designed for explainability, too much can be overlooked.

So instead we have built distinct implementation “modes.” Think of these like a career ladder – each time an agent graduates from one to another, it gains an increased level of autonomy (and responsibility). But the implementation is always controlled by the customer, and is dictated by their confidence in the agent’s accuracy and reliability in real-world scenarios.

Here are our current agent implementation modes.

Observe mode

This is where we deploy an agent in the background without it interacting with the user's workflow at all. For example, an alert fires and creates a review, and the agent makes predictions about how that review will be completed and what decisions will be made, but only stores its predictions in our backend system. Once the human investigator completes the review, we have precise metrics on how closely the agent’s decisions match the human investigator’s. If the discrepancy between the agent’s prediction and the analyst’s real-world actions is too great, we can make prompt or model changes, build new tools, adjust agent workflows, or bring in additional data.
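As a rough sketch of how such a comparison might be computed, the Python below pairs each stored agent prediction with the human’s final decision and reports the agreement rate plus a breakdown of disagreements. The `ShadowResult` record and `agreement_report` function are hypothetical names for illustration, and a real comparison would likely cover more than a single decision field.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ShadowResult:
    review_id: str
    agent_decision: str  # stored silently, never shown to the analyst
    human_decision: str  # recorded when the investigator completes the review


def agreement_report(results: List[ShadowResult]) -> Dict[str, object]:
    """Summarize how often the agent's prediction matched the human outcome."""
    if not results:
        return {"agreement_rate": None, "disagreements": {}}
    matches = sum(r.agent_decision == r.human_decision for r in results)
    # Count each (agent, human) mismatch pair to show where the agent goes wrong.
    disagreements: Counter = Counter(
        (r.agent_decision, r.human_decision)
        for r in results
        if r.agent_decision != r.human_decision
    )
    rate: Optional[float] = matches / len(results)
    return {"agreement_rate": rate, "disagreements": dict(disagreements)}
```

The disagreement breakdown matters as much as the headline rate: an agent that closes cases a human would have escalated is a very different problem from one that escalates too eagerly.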

Recommend mode

In this mode, the agent starts to make recommendations and surface them in the UI. Human investigators still have full control, and can accept or reject the suggestions from the agent at any point. In this mode, we also surface the rationale for the agent’s recommendations to give more context, but there’s always a human in the loop to make the final call. It’s worth noting that some of our customers may decide to never progress beyond recommend mode and always retain human oversight over any and all decisions – something we fully support.

Automate mode

Here, the agent makes decisions in our platform and completes reviews with full autonomy. This mode is clearly the most powerful – and accordingly we recommend it only be implemented either in lower-risk scenarios or where observed data from human-reviewed agent performance shows that the agent is consistently making good decisions. It’s important to note that we still recommend that some percentage of automated reviews be QA-ed by a human investigator to ensure that performance does not degrade over time as the financial crime landscape changes.
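One simple way to pick which automated reviews go to human QA is deterministic, hash-based sampling, sketched below in Python. This is an assumption for illustration rather than a description of our platform: the function name `selected_for_qa` and the `sample_rate` parameter are hypothetical, and other sampling schemes (for example, weighting by risk) are equally plausible.

```python
import hashlib


def selected_for_qa(review_id: str, sample_rate: float) -> bool:
    """Decide whether an automated review should be routed to human QA.

    Hashing the review ID makes the decision stable: the same review always
    gets the same answer, which keeps the QA sample auditable after the fact.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(review_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in the range 0..9999
    return bucket < sample_rate * 10_000
```

For instance, `selected_for_qa(review_id, 0.10)` would route roughly one in ten automated reviews to an investigator.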

Note that the implementation modes are not all or nothing, and it often makes sense to mix and match across agents. For example, a customer may have an agent run in automate mode for an L1 triage review but have recommend mode on with human in the loop for L2 and beyond. Another customer may have automate mode for SAR filings but only for alerts where they file a SAR 100% of the time, such as potential CSAM exposure.
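The mix-and-match idea can be pictured as a per-review-type configuration. The Python sketch below is illustrative only: `AgentMode`, `MODE_BY_REVIEW_TYPE`, and the review-type keys are hypothetical names, and the fallback to observe mode for unconfigured review types is our assumption for the example.

```python
from enum import Enum
from typing import Dict


class AgentMode(Enum):
    OBSERVE = "observe"      # predictions stored, nothing shown to the analyst
    RECOMMEND = "recommend"  # suggestions surfaced, human makes the final call
    AUTOMATE = "automate"    # agent completes the review itself


# Hypothetical customer configuration: one mode per review type.
MODE_BY_REVIEW_TYPE: Dict[str, AgentMode] = {
    "l1_triage": AgentMode.AUTOMATE,
    "l2_investigation": AgentMode.RECOMMEND,
}


def resolve_mode(review_type: str, config: Dict[str, AgentMode]) -> AgentMode:
    """Unconfigured review types fall back to the least autonomous mode."""
    return config.get(review_type, AgentMode.OBSERVE)


def handle_agent_decision(
    review_type: str, decision: str, config: Dict[str, AgentMode]
) -> Dict[str, str]:
    """Route the agent's decision according to the configured mode."""
    mode = resolve_mode(review_type, config)
    if mode is AgentMode.AUTOMATE:
        return {"action": "apply", "decision": decision}
    if mode is AgentMode.RECOMMEND:
        return {"action": "suggest", "decision": decision}
    return {"action": "log_only", "decision": decision}
```

Defaulting to the most conservative mode means a new review type never becomes automated by accident; someone has to opt in explicitly.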

The important part is showing customers direct evidence of how the agents are performing, so they can make the best decisions for their programs. Not every customer ends up in automate mode, and that's by design. Different teams have different risk appetites. The modes aren’t a ladder to be climbed blindly; they’re a menu to pick from.

Conclusion

Compliance professionals, like workers in nearly every industry these days, are inundated with bold claims about AI agents. And we wholeheartedly believe they’re bringing huge improvements! But we also recognize how the fatigue sets in when vendors all constantly overpromise what their agents can do.

In compliance, each customer has unique needs and operates in a different regulatory environment. Deploying an out-of-the-box agent to automate complex compliance work – especially without significant testing – is dangerously reckless.

Our approach, which we stand behind, is to build our agents in a way that lets our customers grow their confidence in them through use. As agents learn from direct feedback under human oversight, using them becomes the way our customers make smart, clear-eyed decisions about how and where they want to automate their workflows.
