Why synthetic data for AML testing should not replicate production

5 Jun

A common question comes up whenever teams first consider synthetic data for AML testing.

They ask how closely the data matches their production environment, and treat that resemblance as the measure of quality.

It is an understandable instinct. Production data is what teams know, and it feels like the natural benchmark for anything that claims to test a system. But it is also the wrong question. Production data was never designed for testing, and replicating it simply reproduces its limitations. The more useful question is whether the data is fit for the purpose of measuring what a system detects and misses.

The problem with production data as a benchmark

Production transaction data has well-understood limitations when used for testing. It rarely carries reliable labels, because confirmed money laundering is rare and much laundering is never identified at all. It is usually confined to a single institution, which hides the cross-institution behaviour that real networks depend on. It reflects only the methodologies a firm has happened to encounter, and says nothing about the typologies it has not yet seen.

The operational layer above the transactions is weaker still. Alerts, cases, SARs and exits describe what a system and its investigators did, not what was actually present in the data. Alerts are typically fired by blunt rules with high false positive rates. They sit at the level of an individual transaction or entity rather than describing a coordinated network. Used as test data, these outcomes are incomplete, biased, and in places simply wrong. They tell you about the historic behaviour of a system, not about the financial crime the system was meant to find.

If the goal is to copy production as faithfully as possible, all of these weaknesses come along for the ride. Using production data as the seed for a synthetic set bakes the same gaps and biases into the result. That is rarely a useful foundation for objective testing.

Building from first principles instead

At Syntheticr, we took a different approach. Rather than trying to mirror any single institution's data, we set out to build the best possible dataset for testing and training financial crime systems, starting from first principles.

We began with the UK, using a proprietary blend of macroeconomic and national statistics to construct a population indicative of a national-scale economy. From that we built a representative population of individuals and medium-sized businesses, then connected them into the relationships that real economies are made of: families, employees and employers, friends, businesses and their customers. Agent-based modelling let each entity behave as it would in the real world. Companies pay individuals, individuals buy services, and money moves between families and businesses in the patterns you would expect to see across a genuine population.
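To make the agent-based idea concrete, here is a deliberately tiny sketch in Python. It is not Syntheticr's generator: the agent names, salary amount, purchase range and 30-day pay cycle are all illustrative assumptions. It only shows the general pattern of entities with relationships producing a transaction ledger as a by-product of their behaviour.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    agent_id: str
    kind: str            # "individual" or "business"
    balance_pence: int   # integer pence avoids float rounding


@dataclass(frozen=True)
class Transaction:
    day: int
    payer: str
    payee: str
    amount_pence: int


def simulate(days: int, seed: int = 0):
    """Run a toy economy and return (agents, ledger). All values are hypothetical."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    agents = {
        a.agent_id: a
        for a in [
            Agent("employer_ltd", "business", 5_000_000),
            Agent("corner_shop", "business", 500_000),
            Agent("alice", "individual", 100_000),
            Agent("bob", "individual", 100_000),
        ]
    }
    # Relationship layer: who works for whom.
    employer_of = {"alice": "employer_ltd", "bob": "employer_ltd"}
    ledger: list[Transaction] = []

    def pay(day: int, payer: str, payee: str, amount_pence: int) -> None:
        agents[payer].balance_pence -= amount_pence
        agents[payee].balance_pence += amount_pence
        ledger.append(Transaction(day, payer, payee, amount_pence))

    for day in range(1, days + 1):
        # Companies pay individuals: a salary every 30th day.
        if day % 30 == 0:
            for employee, employer in employer_of.items():
                pay(day, employer, employee, 200_000)
        # Individuals buy services: a random purchase on roughly half of days,
        # skipped if the person cannot afford it.
        for person in employer_of:
            amount = rng.randint(500, 6_000)
            if rng.random() < 0.5 and agents[person].balance_pence >= amount:
                pay(day, person, "corner_shop", amount)
    return agents, ledger


agents, ledger = simulate(days=60, seed=1)
print(len(ledger), "transactions generated")
```

Even at this scale, the ledger is an output of behaviour rather than a copy of any existing data, which is the property the approach relies on.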

In parallel we modelled criminal networks. These follow their own distinct money flows and methodologies, separate from legitimate activity, while still interacting with the wider world as real criminals do. That separation is what makes the ground truth reliable: every entity is either part of a defined criminal network, or not, and we know the difference.
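With that kind of ground truth, scoring a detection system becomes simple set arithmetic. The sketch below assumes a hypothetical layout in which each labelled network is a set of entity IDs and a monitoring system returns the set of entities it alerted on. The function name, network names and IDs are invented for illustration.

```python
def score_detection(networks: dict[str, set[str]], alerted: set[str]) -> dict:
    """Compare alerted entities against labelled criminal networks."""
    criminal = set().union(*networks.values())  # every labelled criminal entity
    detected = criminal & alerted
    return {
        # Share of criminal entities that were alerted on.
        "recall": len(detected) / len(criminal) if criminal else 0.0,
        # Share of alerts that pointed at a criminal entity.
        "precision": len(detected) / len(alerted) if alerted else 0.0,
        "missed": criminal - alerted,
        "false_positives": alerted - criminal,
        # Per-network view: how much of each network was surfaced.
        "network_coverage": {
            network_id: len(members & alerted) / len(members)
            for network_id, members in networks.items()
            if members
        },
    }


# Hypothetical labels and alerts.
networks = {
    "mule_ring": {"e1", "e2", "e3"},
    "shell_chain": {"e4", "e5"},
}
alerted = {"e1", "e4", "e5", "e9"}

result = score_detection(networks, alerted)
print(result["recall"], result["precision"])  # 0.6 0.75
```

The per-network coverage matters because an alert on one member of a network is not the same as finding the network. In this example one network is fully surfaced while only a third of the other is.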

The transactions are processed across six fictitious financial institutions of varying size and sophistication, so that networks span multiple banks exactly as they do in practice. The result is a two-year ecosystem of close to two billion transactions, performed by around two million entities, with clearly labelled criminal networks built around specific, named methodologies. Each institution also carries its own risk intelligence, generated through basic transaction monitoring, so the alerts, SARs and exits behave with the same imperfections found in the real world. The final months are left as a greenfield period with no risk intelligence, so teams can test transfer learning and avoid over-fitting their models.
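One way a team might use the greenfield period is as a strict time-based holdout: fit rules or models on the earlier months, where risk intelligence exists, and evaluate only on the final months. The sketch below assumes records are dictionaries with a `date` field and uses a made-up cutoff. The real length of the greenfield period is not stated here, so treat the dates as placeholders.

```python
from datetime import date


def split_greenfield(records: list[dict], cutoff: date):
    """Split records into (history, greenfield) around a cutoff date.

    history:    dated before the cutoff, where alerts, SARs and exits exist
    greenfield: dated on or after the cutoff, with no risk intelligence
    """
    history = [r for r in records if r["date"] < cutoff]
    greenfield = [r for r in records if r["date"] >= cutoff]
    return history, greenfield


# Hypothetical records and cutoff.
records = [
    {"txn_id": "t1", "date": date(2023, 3, 14)},
    {"txn_id": "t2", "date": date(2024, 8, 2)},
    {"txn_id": "t3", "date": date(2024, 11, 20)},
]
history, greenfield = split_greenfield(records, cutoff=date(2024, 10, 1))
print(len(history), len(greenfield))  # 2 1
```

Splitting on time rather than at random keeps information from the evaluation period out of training, which is the point of holding those months back.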

Fit for purpose, by design

That’s only the ‘standard’ dataset. Because our data is built by generators, rather than copied from a source, every part of it can be changed. Different input data produces a different population. Different network rules produce different organisational structures. Different methodologies produce different criminal activity, and different monitoring rules produce different risk intelligence. The data can be shaped to the specific demands of a use case, system, or team.
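As a small illustration of the last of those levers, the sketch below runs the same transactions through two versions of a simple threshold rule and gets two different sets of alerts. The rule, its thresholds and the transactions are hypothetical, and are far blunter than anything a real generator would use.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MonitoringRule:
    name: str
    threshold_pence: int  # alert when a single payment meets or exceeds this


def generate_alerts(transactions: list[tuple[str, int]], rule: MonitoringRule) -> set[str]:
    """Return the payers whose payments trip the rule."""
    return {payer for payer, amount in transactions if amount >= rule.threshold_pence}


# Hypothetical (payer, amount in pence) pairs.
transactions = [("e1", 950_000), ("e2", 1_200_000), ("e3", 400_000)]

strict = MonitoringRule("large_payment", threshold_pence=1_000_000)
loose = replace(strict, threshold_pence=500_000)  # same rule, one parameter changed

print(sorted(generate_alerts(transactions, strict)))  # ['e2']
print(sorted(generate_alerts(transactions, loose)))   # ['e1', 'e2']
```

The underlying activity is identical in both runs. Only the monitoring configuration changed, and the risk intelligence changed with it.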

So the question to ask of synthetic data is not how closely it resembles production. Production is the environment whose limitations created the testing problem in the first place.

The question is whether the data is fit for purpose: whether it lets you test, train and monitor the rules and models in your financial crime systems, and whether it gives you an objective, detailed and repeatable assessment of performance you can rely on.

AML Testing, Synthetic Data, Performance Scorecards, Transaction Monitoring

Anthony Cosgrove
