Skip to main content

Financial risk management in the era of GenAI

Managing risk with generative models is more challenging than traditional ML. Learn how to get started.
Created on January 8|Last edited on January 10
Financial institutions have long employed machine learning models and traditional AI for everything from credit lending and fraud detection to understanding market trends and investing with that understanding. But the emergence of generative AI has unsettled old practices and traditions. Foundation models have brought new possibilities—indeed brand new use cases—in financial services but those use cases must be managed carefully in an industry where regulations abound.
In this post, we’ll look into the key differences between traditional AI and this new, evolving paradigm. Later in the week, we’ll cover how financial services are employing GenAI today, how you should be thinking about model risk management with these new, non-deterministic models, and the tools you can employ to stay both innovative and compliant.

Key differences between traditional AI and generative AI

There are several important distinctions between the traditional models employed for decades in finance and the generative models organizations are investing in today. In this section, we’ll look at some of those contrasts and how you can navigate them intelligently.

Data and scope

Traditional models are trained for a purpose on carefully scoped data. A fraud detection database, for example, usually contains myriad legitimate transactions, a small subset of fraudulent ones, and synthetic fraud data to prevent the model from overfitting. While these datasets can be vast, they’re also niche and tailored for a specific purpose.
The datasets used to train foundation models are orders of magnitude larger. They’re trained on a veritable internet’s worth of data and are, by design, far more general: the same foundation model can be prompted to solve—or aid in solving—a wide variety of tasks.
This is not a better or worse distinction, it’s simply a difference. Many organizations pair traditional fraud detection models with foundation models. Foundation models, for example, aren’t static or rule-based and can detect novel patterns or even understand normal behavior for a particular user in a more nuanced way than traditional models can, owing to the fact they’re trained on massive amounts of data.

Ownership

Few organizations have the team or budget to train a foundation model like GPT from scratch. These are months- or years-long endeavors, costing tens of millions of dollars and even then, it’s unlikely internal teams can reach and maintain the performance of the most cutting-edge models currently available.
For this reason, most institutions employ either open-source models like the Llama family or name-brand models like GPT. It’s worth noting that open-source models will often bring some insights about their underlying data or parameters, where closed models will not generally provide this.
No matter which foundation models you are using, you’re going to want to provide them with context-specific information for your particular use case. For example, fine-tuning can help these models become more performant in your domain—we’ll get to this a bit later—but you’ll want to be more careful from a governance perspective. You want to start using these GenAI models cautiously at first while you’re understanding what they excel at and on which tasks traditional models may perform better.

Explainability

Traditional models I brought more explainability, even if you were using complex neural networks. You could still create a proxy model to understand some of the features and behaviors, for example. With foundation models, you’ll rely more on empirical data, model outputs, and similar evidence to showcase whether the model is working as intended. Interrogating individual outputs can be helpful, but often, you’ll want to settle on a suite a metrics you’re trying to improve or a relevant benchmark you can gauge progress against.
This is why you’ll often see foundation models deployed internally for use cases like looking through massive corpuses of data. They’re incredibly impressive at understanding vast volumes of text and the outputs shouldn’t run into the sort of regulatory hurdles an external-facing application might. That doesn’t mean you can’t use GenAI applications that your customers might interact with. Rather, you should absolutely consider what internal processes can be streamlined or improved.

Governance and model risk management (MRM) in the era of GenAI

Model risk management is common practice in financial services. Most organizations have dedicated teams of model builders that pass those models to their risk management team who in turn assess and measure the risk of those models making unwanted or, potentially, illegal suggestions or decisions.
It’s worth noting that laws like Federal Reserve’s Supervisory Letter SR 11-7 don’t expressly mention generative AI. The technology is simply too new. Some of the best guidance we have currently comes from NIST AI 600-1, issued in the summer of 2024. The framework categorizes risks into various types such as technical (model malfunctions, confabulation), misuse (malicious uses like cyberattacks or disinformation), and societal risks (environmental impacts or ethical concerns). NIST AI 600-1 provides organizations with a roadmap to mitigate these risks, offering over 400 specific recommendations on governance, transparency, and content management, all geared towards making AI systems more secure and responsible.
All told, you should understand that foundation models introduce brand new avenues for efficiency and ROI but they bring with them new and different risks from traditional models. And while the list below can’t cover all 400 recommendations, there are some proactive steps you can take to mitigate your risk.

Update data use and retention policies

While you won’t have oversight into the underlying data used to train a foundation model, you will have oversight into the data you put into these models. This can include any datasets you use to fine-tune a foundation model but crucially, any prompts your end users send to these models.
After all, one of the biggest advantages of foundation models is they reduce the barrier for entry to use. Software engineers and non-technical users can interact with LLM-powered applications to query databases, assist in decision-making, creating financial documents, researching loan applicants, and dozens of other use cases. Keeping track of each interaction is incredibly important, both from a model performance perspective and a governance one. We’ll walk through some tooling for this in a few pages.

Broaden your definition of MRM and include more stakeholders

Oftentimes, your MRM team was sufficient to assess the risk of traditional machine learningI models. They could look at qualitative risks, model architecture, decisions made, and the underlying data that trained the model.
This changes when you’re using foundation models. After all, your MRM team won’t have a lot of the information they rely on for more classical models. Many organizations are creating new frameworks and committees, often involving decision makers from their legal and compliance departments, data science, AI ethics boards, and subject matter experts who can help validate model outputs.
It’s worth keeping in mind that not every foundation model will require this many stakeholders. Internal use cases like building a database of SEC filings and a GenAI app to query and understand them likely don’t require nearly as much oversight. This is a big part of the reason many financial institutions are rapidly building solutions like these versus rolling out autonomous decision-making applications that could affect their customers. Those can, of course, still be deployed but they should be interrogated heavily and tracked extensively as they’re more prone to regulatory pressures.

Educate your team

With traditional AI, many of your internal users didn’t need to know how a model actually worked to use it. For example, your team could offer a loan based on the borrower’s financial history and information without knowing precisely how the model calculated the rate. Your MRM team’s endorsement of the model was enough.
In generative AI, you’ll need to educate your internal users about how to use these outputs. That means, in part, that you’ll want a quick, tight feedback loop about how these models and the applications they power are used. This again leads us back to the current preference for more internally-facing deployments of these models and augmenting traditional approaches with foundational models.

Inform your users

If your GenAI application is user-facing, it’s often smart to let customers know they’re interacting with a conversational agent. This reduces your risk, especially when paired with a flow that users can contact an employee for nuanced questions your agent isn’t equipped to handle or for when a user just wants a second opinion.

A simple rubric for leveraging GenAI

While there’s no one-size-fits-all solution, generally speaking, you want to consider three important factors with your foundation models. A “yes” answer to any of the following questions doesn’t mean you should abandon the initiative, but rather that you should employ a more conservative approach and consider the previous section while building and deploying.

Is your application client facing?

Client facing applications require more care. Even something like a chatbot should be tested thoroughly, making sure customers can’t push the chatbot to answer questions you don’t want answered or offering products that it shouldn’t.

Are there regulatory implications?

You’ll want to make sure any foundation models that touch regulated areas of your business can be explained and are transparent. You’ll want a durable system of record for the underlying fine-tuning datasets, prompt inputs, and model outputs you can interrogate.

Are decisions automated?

Automated, decision-making agent behavior should also be scrupulously tested. For example, if an application can act in a way that directly affects a customer’s credit worthiness or financial security, you should be continuously testing and improving those apps.
For systems where any of the above is true, it’s recommended you leverage internal subject matter experts to help grade the efficacy of your models and steer them to become more performant.

Conclusion

We'll be back later next week with a look at some real-world use cases and tools that can help your organization innovate while staying compliant. In the meantime, feel free to check out some related articles:


Iterate on AI agents and models faster. Try Weights & Biases today.