Leveraging foundation models at financial institutions
How banks and other financial institutions are utilizing GenAI and what your organization can learn from them
This piece is a follow-up to our prior post, Financial risk management in the era of GenAI. Read it to learn how MRM teams are evolving to handle the non-deterministic outputs of LLMs.
GenAI models are transforming nearly every industry, but heavily regulated industries like finance need to approach their deployment with additional care. In our last piece, we looked at the differences between traditional and generative AI and how financial institutions are navigating the tradeoffs, with a special focus on model risk management. Today, we're going to look at some real-world use cases, what you need to get started, and the tools that will help you manage it all.
Real world use cases for foundation models in finance
Understanding market trends for smarter investments
GenAI applications often include analytical capabilities tailor-made for handling immense volumes of documents and finding patterns and insights in oceans of data that no person, or team of people, could possibly parse on their own.
Whether you're using internal documents or something public like SEC filings or board meeting minutes, you can fine-tune any foundation model to deliver tailored outputs for your business domain, data, and use case (fine-tuning means retraining a pre-trained model like GPT to perform specific tasks or work in specific domains). Commonly, teams build question-answering applications atop this data so non-technical users can query the information and make investment decisions.
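As a rough illustration, here's a minimal sketch of that fine-tuning workflow using the OpenAI Python client. The file name, model name, and question-answer pairs are all assumptions; your own corpus and provider will differ.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Hypothetical question-answer pairs curated from your document corpus.
examples = [
    {"question": "What risk factors did ACME Corp add in its latest 10-K?",
     "answer": "ACME added liquidity risk and supplier concentration risk."},
]

# Write the dataset in the chat-format JSONL the fine-tuning endpoint expects.
with open("filings_qa.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps({"messages": [
            {"role": "system", "content": "You answer questions about SEC filings."},
            {"role": "user", "content": ex["question"]},
            {"role": "assistant", "content": ex["answer"]},
        ]}) + "\n")

# Upload the dataset and launch the fine-tuning job.
training_file = client.files.create(file=open("filings_qa.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # any fine-tunable base model works here
)
print(f"Fine-tuning job started: {job.id}")
```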
One of the most powerful things here is that searching this data can be dynamic and iterative. Unlike querying a typical database, search doesn't need to be exact. It can be semantic, leveraging the model to understand the meaning of the query rather than just matching keywords. Additionally, conversations in GenAI applications let your investment team ask a series of probing questions, narrowing their requests to uncover trends that would otherwise be impossible to find.
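To make the semantic part concrete, here's a minimal sketch of embedding-based search with the OpenAI embeddings API and NumPy. The corpus, model name, and scoring are illustrative assumptions; production systems typically use a vector database rather than in-memory arrays.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts; the embedding model name is an assumption."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Hypothetical corpus of filing excerpts.
docs = [
    "ACME 10-K excerpt: management cites rising liquidity risk.",
    "Board minutes excerpt: the Q3 capital plan was approved.",
]
doc_vecs = embed(docs)

def search(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    # Cosine similarity: match on meaning, not keywords.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-sims)[:k]]

print(search("Which companies flagged funding pressure?"))
```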
Navigating regulatory regimes
In a very similar vein, institutions are fine-tuning foundation models on corpuses of legal documents to better understand and comply with the regulatory landscape. New rules and regulations—whether from government bodies or internal guidelines—can be continuously added to the underlying fine-tuning dataset so you can stay up to date in real time.
That last part is quite important. Foundation models are trained on historical data and can lag when conditions on the ground change. You can't rely on a pre-trained LLM to know about a law passed last week or a regulation arriving the week after. But you can easily add those documents to a fine-tuning pipeline and give teams access to a question-answering application like the one described above.
These are just two examples of the same underlying strategy: fine-tuning an LLM and giving your technical—and non-technical—teams a way of finding vital information in an immediate, conversational interface.
Chatbots and conversational agents
Chatbots are perhaps the most common user-facing GenAI applications today. If you're building one, consider precisely which kinds of behavior you want to automate. Chatbots can reduce the load on your staff, freeing them up to answer more nuanced questions, and they can be active twenty-four hours a day.
Some common uses for conversational agents:
- Customer support: Chatbots can handle a wide range of customer queries, such as checking account balances, transaction histories, and loan information.
- Fraud detection and alerts: Chatbots can alert users to potentially fraudulent transactions and assist with real-time security queries.
- Loan or credit assistance: Chatbots can assist with loan applications, offering pre-qualification checks, loan options, and real-time status updates on credit applications.
- Multilingual support: Chatbots can be trained to speak less common languages so users can get support in the language they understand best.
When training your conversational agent, it's important to ensure it answers only the questions you want it to rather than drawing freely on the vast datasets it was trained on. Foundation models know a lot of esoteric information, but you don't need them discussing it with your users.
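One simple way to enforce that is a restrictive system prompt with an explicit refusal path. A minimal sketch, assuming the OpenAI chat API; the prompt wording and model name are illustrative, and real deployments layer additional guardrails on top.

```python
from openai import OpenAI

client = OpenAI()

# The scope and refusal text here are illustrative assumptions, not a vetted policy.
SYSTEM_PROMPT = (
    "You are a retail banking support assistant. Only answer questions about "
    "account balances, transactions, and loan products. For anything else, "
    "reply exactly: 'I can only help with account and loan questions.'"
)

def answer(user_message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0,  # keep support answers predictable
    )
    return resp.choices[0].message.content

print(answer("What do you think of this week's election?"))  # should trigger the refusal
```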
Personalized financial planning
You often shouldn't create a single chatbot to answer every query. Some conversations are more sensitive or nuanced and deserve to be their own initiative altogether, involving one or several models working in tandem. Financial planning agents are a perfect example.
Financial planning agents can help assess a customer’s financial goals, current financial situation, and future objectives (retirement, education savings, etc.) and help create a tailored financial plan. They can provide investment advice based on those goals and also that customer’s particular appetite for risk or ethical investing concerns. They can help with estate and tax planning, and help a customer track goals over time.
To your users, these agents often look like typical chatbots (a simple conversational UI), but they can be much more powerful. For starters, they can carry logs of prior conversations and access far more information than a simple customer support bot might. They can also behave autonomously in scenarios that won't open you up to risk, such as surfacing information from internal documentation or describing certain products to a customer. You'll want to apply appropriate guardrails and interrogate model outputs continuously.
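As a small illustration of an output-side guardrail, the sketch below screens draft responses for phrases that read like unlicensed investment advice before they reach the customer. The pattern list is a stand-in; production systems combine rules, classifiers, and human review.

```python
import re

# Illustrative phrase list; not an exhaustive or vetted compliance policy.
BLOCKED_PATTERNS = [
    r"\bguaranteed returns?\b",
    r"\bput all (of )?your money\b",
    r"\bcannot lose\b",
]

def passes_guardrails(model_output: str) -> bool:
    """Return False if the draft contains any blocked phrase."""
    return not any(re.search(p, model_output, re.IGNORECASE) for p in BLOCKED_PATTERNS)

draft = "This fund offers guaranteed returns with no downside."
if passes_guardrails(draft):
    print(draft)
else:
    print("Response held for human review.")  # escalate rather than send
```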
Synthetic data creation
Foundation models can also be used to create the synthetic data other models need to perform well. A common example is fraud detection.
While fraud is prevalent, fraudulent transactions are a vanishingly small percentage of overall transactions on any given day. This class imbalance can skew models toward the majority class, leading them to label fraudulent transactions as legitimate. Generating additional data (in this case, synthetic fraudulent transactions) lets you augment your fraud detection training sets and reduce that bias.
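Here's a minimal sketch of prompting a foundation model for synthetic minority-class records. The schema and prompt are assumptions, and generated records should be validated against real distributions before you train on them.

```python
import json
from openai import OpenAI

client = OpenAI()

# Ask the model for synthetic fraud examples in a fixed, hypothetical schema.
prompt = (
    "Generate 5 synthetic fraudulent card transactions as a JSON array. "
    "Fields: amount_usd, merchant_category, hour_of_day, country, is_fraud. "
    "Return only the JSON array, no prose."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,  # higher temperature for more varied records
)

# A real pipeline would parse defensively; model output isn't guaranteed JSON.
records = json.loads(resp.choices[0].message.content)
print(records[0])
```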
There are other common use cases for synthetic data built with foundation models as well. For example, generating synthetic profiles for individuals or businesses with sparse credit histories enables the testing of credit scoring algorithms. Testing anomaly detection systems in trading platforms or banking operations is another. GenAI applications can generate rare events or outlier behaviors, such as unusual trading volumes, price spikes, or system failures, which help fine-tune systems that monitor real-time operations.
Research for loan applicants
Financial institutions do a lot of research on potential loan recipients, and much of it is rote, involving the same checks in the same databases, whether external or internal. GenAI apps can be tremendously useful here, especially when complemented with a common technique called retrieval-augmented generation, or RAG.
Essentially, RAG lets a model reference an outside database it wasn’t originally trained or fine-tuned on. For example, in this context, you could give your model access to simple web searches or some internal database you maintain but didn’t train your foundation model on.
That means, instead of having your team do tedious research on loan applicants, your application can query any number of relevant datasets and populate the necessary information for your team. You'll often still want experts making the final decisions, but they'll be applying their expertise instead of spending hours fetching the information they need. You could use a similar approach to generate myriad financial documents for internal or customer use.
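A minimal RAG sketch might look like the following, where `retrieve` is a hypothetical stand-in for your own search over internal or external sources.

```python
from openai import OpenAI

client = OpenAI()

def retrieve(query: str, k: int = 3) -> list[str]:
    # Hypothetical stand-in: in practice, an embedding search over
    # credit-bureau pulls, KYC checks, or internal notes.
    return [
        "Applicant has 7 years of on-time mortgage payments.",
        "No adverse media findings in the last 24 months.",
    ]

def answer_with_context(question: str) -> str:
    # Retrieved documents are injected into the prompt so the model answers
    # from data it was never trained on.
    context = "\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer_with_context("Summarize this applicant's repayment history."))
```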
Innovating with GenAI—while staying compliant—with Weights & Biases
We've covered some of the challenges of building GenAI applications in finance, but most of the technical challenges come down to your ability to explain and interpret your models and their outputs.
It’s important to note that at many institutions, leveraging foundation models means building applications that expose more of your team to the outputs of these models. That means individual customer service agents, loan review teams, and less technical market analysts.
W&B Weave can track every prompt your team sends a model and every output it sends back. You can dig into model traces to understand your model at a granular level, significantly improving your ability to interpret and explain model behavior. This also helps identify novel behaviors, challenging outputs, or hallucinations so your team can debug these interactions.
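In practice, that can be as simple as initializing Weave and decorating the functions you want traced. A minimal sketch; the project name and summarization function are assumptions.

```python
import weave
from openai import OpenAI

weave.init("filings-assistant")  # project name is an assumption

client = OpenAI()  # Weave's OpenAI integration logs each LLM call automatically

@weave.op()
def summarize_filing(text: str) -> str:
    # Each call is captured as a trace: inputs, outputs, latency, and the
    # nested LLM call underneath it.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.choices[0].message.content

summarize_filing("ACME Corp 10-K, Item 1A: Risk Factors ...")
```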
W&B Models has long served as the system of record for some of the most innovative players in machine learning, including OpenAI and NVIDIA, as well as dozens of financial institutions like the Royal Bank of Canada, Square, and many more. W&B Models is the AI system of record for pre-training, fine-tuning, governing, and managing models from experimentation to production, accelerating time to market. It helps teams run more experiments, analyze them interactively, and quickly build higher-quality models. You can centralize the tracking of models, datasets, metadata, and their lineage in the Registry to support governance, reproducibility, and CI/CD. Automated workflows for training, evaluation, and deployment enable rapid iteration for teams of all sizes.
Choosing the right model for the job
It’s common for institutions to both test competing foundation models against each other for particular use cases and employ several different LLMs (or different versions in the same family) to solve different tasks. After all, not every problem requires the most cutting edge model and there are significant cost and speed concerns to consider at scale.
W&B Weave excels at evaluating these models. It lets you interrogate overall performance and compare individual model outputs using the same input prompts. You can drill down into individual outputs and compare evaluation metrics. You can jump between trials and see model latency, model summaries, and overall aggregate metrics. And we always make the code available so you can see what generated the score. You can even ask internal experts to grade individual outputs and either approve or disapprove of their content with a simple thumbs up or down. W&B Weave also helps by estimating cost and latency so you can understand the tradeoffs before making them.
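A minimal evaluation sketch with Weave might look like this; the dataset, scorer, and stand-in model function are all illustrative assumptions.

```python
import asyncio
import weave

weave.init("model-comparison")  # project name is an assumption

# A tiny hypothetical evaluation set; real ones come from curated examples.
dataset = [
    {"question": "What is the overdraft fee?", "expected": "$35"},
    {"question": "What is the minimum balance?", "expected": "$500"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Scorer: receives dataset columns plus the model's output.
    return {"correct": expected in (output or "")}

@weave.op()
def candidate_model(question: str) -> str:
    # Stand-in for a call to whichever LLM you're evaluating.
    return "The overdraft fee is $35."

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(candidate_model))
```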

Increasing AI velocity across your organization
W&B Weave helps developers evaluate, monitor, and iterate continuously so they can deliver generative AI applications with confidence. It's a lightweight, developer-friendly toolkit for executing the full AI workflow.
You can run robust application evaluations, keep pace with new LLMs, and monitor applications in production while collaborating securely. W&B Weave is designed to overcome the limits of traditional software development tools and meet the needs of non-deterministic LLM-powered applications. It's framework and LLM agnostic, working out of the box with popular AI frameworks and LLMs such as OpenAI, Anthropic, Cohere, MistralAI, LangChain, LlamaIndex, DSPy, Cerebras, Google Gemini, Amazon Bedrock, Together AI, Groq, and more.
W&B Models helps teams leveraging traditional approaches by serving as a durable system of record for your entire machine learning organization. That helps with everything from maintaining velocity while fine-tuning the foundation models your team builds applications on top of, to the handoff when model builders pass their work to a risk management team. Logging all relevant information in a centralized system of record makes collaboration and reproducibility far easier, reducing manual handoffs and preserving data and model lineage so you can iterate and debug quickly.
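At its simplest, that system of record starts with logging configs and metrics to a central project. A minimal sketch; the project name, config values, and placeholder metrics are assumptions standing in for a real training loop.

```python
import wandb

# Project and config values are illustrative assumptions.
run = wandb.init(project="credit-risk-finetune", config={
    "base_model": "llama-3-8b",
    "learning_rate": 2e-5,
    "epochs": 3,
})

for epoch in range(run.config.epochs):
    # Placeholder metrics standing in for your real training loop.
    train_loss = 0.41 / (epoch + 1)
    val_auc = 0.90 + 0.01 * epoch
    run.log({"epoch": epoch, "train_loss": train_loss, "val_auc": val_auc})

run.finish()
```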
Understanding model behavior at a granular level
Foundation models are non-deterministic and more opaque than the models most financial institutions are familiar with. The steps between a user's input to an application built on top of these models and the model's eventual output can be hard to parse.
W&B Weave lifts the hood on these models and makes their behavior far more understandable. Instead of just seeing inputs and outputs, you get what's called a Trace: an interactive, step-by-step visualization of how your foundation model behaves.
This is incredibly useful for applications that employ agents. Agents can handle sequential reasoning and autonomously leverage external tools like web search or an internal database query (RAG pipelines are one common example). By drilling into a Trace, you can see what actions or decisions the model took at each step, uncovering areas where it's excelling and areas where you may need to update your approach. This significantly reduces the opacity of model behavior and gives you much deeper explainability for how your applications function.
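Nesting traced functions is what produces that step-by-step tree. A minimal sketch of a toy agent instrumented with Weave; the project name and tool function are assumptions.

```python
import weave

weave.init("planning-agent")  # project name is an assumption

@weave.op()
def search_internal_db(query: str) -> str:
    # Hypothetical tool call; shows up as a child span in the Trace.
    return "Customer risk profile: moderate."

@weave.op()
def plan_step(goal: str) -> str:
    profile = search_internal_db(goal)  # nested op becomes a nested trace step
    return f"Recommend a balanced portfolio given: {profile}"

@weave.op()
def run_agent(goal: str) -> str:
    # The top-level op ties every step into one inspectable Trace tree.
    return plan_step(goal)

run_agent("Retirement savings plan for a 40-year-old customer")
```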

Staying compliant
Even the best models can degrade over time. Whether it’s data drift, market conditions, new information or updated regulations, your team needs granular, detailed tracking to both make sure your models are behaving as intended and to debug them quickly if they aren’t.
W&B Weave excels at this for applications built with foundation models. Because you capture the inputs into those applications and the outputs from the underlying model, you have deep visibility into your model's behavior. You can use this to continuously monitor and validate models in production, or to debug models that aren't behaving appropriately. For example, you can employ the Traces feature mentioned above to understand how your models are making decisions. When a new model comes out, you can test it against existing ones to see whether performance actually improves. Put simply, without tracking on inputs and outputs, you'll struggle to iterate quickly and stay both compliant and performant.
W&B Models brings similar benefits to organizations training or fine-tuning traditional machine learning models. Your team can track and log all relevant model information (architecture, hyperparameters, git commits, model weights, GPU usage, datasets, predictions, and more) and interrogate model behavior collaboratively. Detailed data and lineage tracking means your team can understand which datasets inform which models and what dependencies exist in complex pipelines. This information can be vital, both for your model risk management team and to prove to regulators you tested diligently and used appropriate data. It also makes uncovering failure points—and debugging them—far easier than it would otherwise be.
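Lineage tracking like this typically runs through artifacts: declare the dataset version a run consumes and log the model it produces. A minimal sketch; the artifact names and checkpoint path are assumptions.

```python
import wandb

run = wandb.init(project="fraud-detection", job_type="train")  # names are assumptions

# Declare the exact dataset version this run consumes; W&B records the lineage.
dataset = run.use_artifact("transactions-2024:v3")  # hypothetical artifact
data_dir = dataset.download()

# ... train your model on the files in data_dir ...

# Log the resulting weights as a new artifact tied to this run and dataset.
model_art = wandb.Artifact("fraud-model", type="model")
model_art.add_file("model.pt")  # hypothetical checkpoint path
run.log_artifact(model_art)
run.finish()
```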
Conclusion
The machine learning landscape has changed. Financial institutions are increasingly experimenting with and deploying applications built with foundation models. Your organization needs modern tools to innovate with modern models.
Weights & Biases provides an end-to-end solution for teams building with foundation models and training traditional models. Our platform is used by thousands of companies to productionize machine learning at scale, including more than thirty foundation model builders like OpenAI, Cohere, and Meta. No matter what models you're training or which foundation models you're using to power your latest innovation, we can help.
You can learn more about how other financial institutions use Weights & Biases here. If you’d like to see how Weights & Biases can help your team deploy machine learning solutions faster while staying compliant, just reach out to us at contact@wandb.com. We’d love to see how we can help.