Developing a Framework for AI Model Selection

Jesse Reiss

CTO & Co-Founder


When we published our first Hummingbird Labs post, LLMs and Generative AI technology were just beginning to capture public attention. OpenAI’s GPT-4 had just been released, and was the cutting edge of model technology. It was by far the most well-known model option for GenAI work. Notably, however, it also wasn’t yet widely available for testing.

At the end of that post, we expressed our concerns about being too quick to implement SaaS-hosted AI models, pointing out the potential risks involved in blindly opening the doors to untested technologies. We concluded that the best course of action for compliance teams looking to implement AI solutions was to “lead with safety.”

Since then, however, we have seen massive changes to the AI landscape, and it’s worth re-examining things in light of recent industry developments.

The Latest News on AI Model Development

The race to create the best AI model is on! OpenAI is no longer the only (or even the main) player in the space. Amazon Web Services has announced a partnership with Anthropic to make Claude AI models natively available in its cloud environments. (Anthropic, meanwhile, has released Claude 2 and Claude 3 – each of which has brought more advanced functionality to their AI model family and pushed them into direct competition with GPT-4 in terms of capability.) Google, for their part, joined the race by launching their Gemini models, and have already pushed the initial launch version forward with a 1.5 release.

So – what does this latest round of news developments tell us? Importantly, it tells us that all three of the major cloud hosting providers now have their own hosted AI model available for commercial use. This is good news for us! These models all benefit from the intense security protocols of their parent companies, creating opportunities for teams like ours, for whom accuracy, transparency, and data security are paramount. As AI models have time to mature, the data privacy and IT security practices that surround them mature as well.

Here at Hummingbird, we haven’t yet committed to a single model provider for our AI feature development. But these recent industry advancements give us the knowledge and insight we need in order to be able to build with confidence.

Model Choice and Why It Matters

As we’ve begun to integrate AI models into Hummingbird to deliver solutions for the compliance industry, the question of how best to evaluate different models (and the quality of their results) has become increasingly important.

After all, while a few benchmarks (such as context window, average response time, cost-per-token, and input format support) are helpful for general model comparison, none of them gives any indication of the quality of results. Actually comparing the quality of AI model output requires a much more hands-on approach, with ample time spent fussing over prompt engineering and careful consideration of answers. It’s important to consider things such as performance, context, hosting questions, and prompt management. Each of these represents an area of inquiry that – if an educated decision is to be made about model selection – needs to be explored in detail.

The S.A.F.E. Model Evaluation Framework

The time we’ve spent experimenting in this arena has led us to develop a framework for assessing models and their results. We named it the S.A.F.E. Evaluation Framework after its four main tenets – Security, Accuracy, Faithfulness, and Efficiency.

While there is a wide array of literature on generative model testing (including some with evaluation frameworks), we believe ours is most in line with the unique needs of a compliance use case. This framework reflects our belief that the best compliance AI solutions will be those that are designed for a specific compliance use case, and which have been evaluated holistically – from context, to prompts, to model selection.

Here’s a look at the S.A.F.E. Framework in action:


Security

How good is the solution at securing the data sent to it and ensuring that training data and prompt details won’t be leaked?

Why it matters:
Not every application needs to worry about data exposure. But in compliance, everything is sensitive. We wanted to be sure that any training data, fine tuning, and prompts are fully secured from unintentional exposure.

A Quick Note on Security: It’s difficult to say exactly what “good IT security” looks like, as each security standard must necessarily be measured against the application it’s meant to protect. Generally speaking, however, IT security programs are designed to act as a series of layers, each of which helps mitigate risk and repel threats. Security professionals often refer to this type of program as the “Swiss cheese model.” The idea is that no security control is perfect, and each will have areas of weakness (just like each slice of Swiss cheese will have holes in it), so the best approach is to add more layers. The more layers you add, the less likely it is that any one attack will be able to penetrate all of them.

This is why we’re happy to see AI model providers partnering with cloud hosting vendors to provide more layers of security. AI models now benefit from virtual networking to keep traffic off the public internet, threat detection services to help detect attacks, data-loss prevention controls to limit the impact of an attack, confidential computing to keep data encrypted at all times, and security certifications to ensure that the needed controls are in place at all layers of providing these services.
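To put a rough number on the layered model described above, here is a minimal sketch. It assumes that each layer fails independently of the others, which real controls rarely do perfectly, and the failure rates are made-up illustrative values rather than measurements. The function name `breach_probability` is hypothetical.

```python
from math import prod


def breach_probability(layer_failure_rates):
    """Chance that a single attack slips through every layer.

    Assumption: layers fail independently, so the individual
    failure probabilities simply multiply together.
    """
    for rate in layer_failure_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each failure rate must be between 0 and 1")
    return prod(layer_failure_rates)


# Illustrative (invented) failure rates for each security layer.
two_layers = breach_probability([0.2, 0.2])
three_layers = breach_probability([0.2, 0.2, 0.2])
print(f"two layers: {two_layers:.3f}, three layers: {three_layers:.3f}")
```

Under these assumptions, each added layer multiplies the chance of a full breach by that layer's failure rate, which is why more slices mean fewer holes that line up.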


Accuracy

How accurately does the solution answer direct questions? Is it prone to hallucinations or embellishments, or is it direct and matter-of-fact?

Why it matters:
In financial compliance, the implications of incorrect decision making can be dire. If criminal activity is missed, regulatory action is swift to follow. If compliance teams are going to use an AI model’s analysis as part of their work process, they need to be certain that the AI’s responses are reliable and accurate.


Faithfulness

How faithfully can the solution represent the relevant information? Does it capture the full context or does it skip over critical details?

Why it matters:
Nearly as important as correctness is completeness: we can only rely on an AI for help if we can be certain that it is considering all the information available. This is especially true in financial crime compliance, where critical details can be hiding in unexpected places.


Efficiency

How quickly can the solution generate a response? Is the AI model multi-modal? Can it accept files directly as input or does it require an augmentation to first convert the file to text, adding latency? What is the cost of running the solution? How much will uncertainty about model efficiency contribute to potential cost overruns?

Why it matters:
It’s funny to say, given the bleeding-edge nature of AI technology, but AI models are slow! Over the past decade, we’ve become accustomed to applications that respond to prompts in less than a second. The newest AI models, however, routinely take up to half a minute to respond! Obviously, model efficiency will improve in time, but for now, it’s important that we take into account all the ways we might potentially improve model performance. This includes asking questions about single-call vs. multiple prompts, direct input vs. pre-processing, and computing in series vs. parallelization.
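The series vs. parallel question can be sketched in a few lines. This is an illustration only: `fake_model_call` is a hypothetical stand-in that sleeps for a fraction of a second instead of calling a real model, and the prompts are invented. It assumes the prompts are independent of one another, so they can safely run at the same time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for a slow, network-bound model request."""
    time.sleep(0.05)  # simulated latency; real calls can take many seconds
    return f"response to: {prompt}"


def run_in_series(prompts):
    # One call after another: total time is roughly the sum of all latencies.
    return [fake_model_call(p) for p in prompts]


def run_in_parallel(prompts, max_workers=4):
    # Threads suit I/O-bound waits; pool.map keeps results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_model_call, prompts))


def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Invented example prompts; any independent prompts would do.
prompts = [
    "Summarize the alert",
    "List the counterparties",
    "Describe the transaction pattern",
    "Draft the narrative",
]
series_result, series_seconds = timed(run_in_series, prompts)
parallel_result, parallel_seconds = timed(run_in_parallel, prompts)
print(f"series: {series_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both approaches return the same answers. Parallel calls only help when prompts do not depend on each other's output, and a real provider's rate limits may cap how many requests can run at once.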

Digging Deeper: The Math Behind AI Model Evaluation

Included in any evaluation of AI application accuracy and performance is an analysis of the underlying model architecture. Model architecture analyses are especially important when we seek to quantify something as intangible as "correctness" or "faithfulness," amongst models whose responses can vary unpredictably with each interaction.

What does an architecture analysis examine? It begins with the identification of the type of architecture underpinning an application. In the world of AI, this can include everything from neural networks (both convolutional and feedforward) to transformer networks and large language models (LLMs). When working with text, these architectures typically rely on a concept known as "embeddings": numeric representations, produced by a model, that turn text into sequences of numbers. Picture these sequences as vectors, directional arrows in a vast, multidimensional space. If two arrows in this space point in the same direction, the texts they represent are remarkably similar. If the arrows are perpendicular, the texts have little or nothing in common. This geometric approach uses a measure known as cosine similarity, which gauges how closely two vectors (and therefore two texts) align. Representing text as vectors is what gives us a robust mathematical way of quantifying the similarity or relevance of text.
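Cosine similarity itself is simple to compute: it is the dot product of two vectors divided by the product of their lengths. The sketch below uses tiny two-dimensional vectors chosen by hand for illustration. Real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors, not real embeddings.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel arrows
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # right angle
print(same_direction, perpendicular)
```

Arrows pointing the same way score close to 1 regardless of their length, and perpendicular arrows score 0, which matches the geometric picture above.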


Let’s use a basic example as a way of understanding this concept. Consider the following block of text, sourced by asking ChatGPT to generate 100 words on the history of banking.


Using the text as a starting point for our assessment, we then develop a few prompts with expected answers:


We then compare the results produced by different models against our expected response, which allows us to assess the correctness of responses. An answer of “Modern banking originated in Rome in the 4th century BCE” to the first question, for example, will have low statistical similarity to our sample answer, while an answer like “Banking in its current form originated in Babylon roughly 4,000 years ago” will have a high degree of similarity.
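A minimal sketch of this comparison follows, with two loud assumptions. First, the `expected` answer is a hypothetical stand-in, since the sample answer itself is not reproduced here. Second, simple word counts are swapped in for real embeddings so the example runs without any model; word counts capture shared vocabulary, not meaning, so a real evaluation would use model-generated embeddings instead.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical expected answer, invented for this illustration.
expected = "Banking originated in ancient Babylon around 4,000 years ago"

rome_answer = "Modern banking originated in Rome in the 4th century BCE"
babylon_answer = (
    "Banking in its current form originated in Babylon roughly 4,000 years ago"
)

rome_score = cosine_similarity(bag_of_words(expected), bag_of_words(rome_answer))
babylon_score = cosine_similarity(
    bag_of_words(expected), bag_of_words(babylon_answer)
)
print(f"Rome answer: {rome_score:.2f}, Babylon answer: {babylon_score:.2f}")
```

Even with this crude representation, the Babylon answer scores noticeably higher than the Rome answer against the expected text.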

These prompt/response pairs can help evaluate both the correctness and the faithfulness of the model being evaluated. While looking at accuracy, we can also measure the speed of the responses, comparing them between models. (This is, of course, a remarkably simple example – our real-world examples include much more context with several complex and sometimes dynamic prompts. We also try to generate as many test sets as possible in order to develop a deep and comprehensive set of results.)
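One way to collect both measurements in a single pass is a small harness like the sketch below. Everything in it is hypothetical: `model_a` and `model_b` are canned functions standing in for real model calls, the single test pair is invented, and `token_overlap` is a deliberately crude scorer that could be swapped for an embedding-based similarity.

```python
import time
from statistics import mean


def token_overlap(expected: str, actual: str) -> float:
    """Crude stand-in scorer: share of expected words found in the answer."""
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words:
        return 0.0
    return len(expected_words & actual_words) / len(expected_words)


def evaluate(model, test_set, scorer=token_overlap):
    """Run every prompt through the model, recording score and latency."""
    scores, latencies = [], []
    for prompt, expected in test_set:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(scorer(expected, answer))
    return {"mean_score": mean(scores), "mean_latency_s": mean(latencies)}


def model_a(prompt: str) -> str:
    """Hypothetical fast model that returns a canned answer."""
    return "Banking originated in Babylon"


def model_b(prompt: str) -> str:
    """Hypothetical slower model that returns a different canned answer."""
    time.sleep(0.01)  # simulated extra latency
    return "Banking originated in Rome"


# Invented prompt/expected pair; a real test set would hold many of these.
test_set = [("Where did banking originate?", "Babylon")]

models = {"model_a": model_a, "model_b": model_b}
report = {name: evaluate(fn, test_set) for name, fn in models.items()}
for name, result in report.items():
    print(name, result)
```

Keeping the scorer as a parameter means the same loop can be reused as the scoring method improves, and averaging over a larger test set smooths out run-to-run variation in both answers and timings.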

Wrap Up

Developing AI products will never be a strictly linear process, and it’s worth being skeptical of anyone who claims otherwise. For our part, we endeavor to bring to our AI feature development the same level of involvement and attention to detail that we bring to all our projects. In compliance, it’s only by using development tools like the S.A.F.E. Framework that we are able to create AI features capable of matching the high degree of accuracy and efficiency demanded by our industry.

We ask a lot of our models, to be sure. But no more than we ask of ourselves as product developers, or than will be asked of the compliance professionals who use our platform.
