Technical Reference
Overview
This page provides structural and modelling context for the Syntheticr trial dataset. It is intended to support technical users processing the data within AML systems, rules engines, or machine learning workflows.
For detailed field-level definitions, refer to the Data Dictionary (XLSX) linked below.
Download Data Dictionary (XLSX) →
Dataset Overview
The Syntheticr trial dataset represents a single synthetic financial institution operating within a broader financial ecosystem.
The dataset spans 24 months of activity and includes:
Customer profiles (individual and business)
Customer-related parties
Financial transactions
Risk intelligence
AML alerts
The dataset is ISO 20022-aligned and structured to reflect operational banking environments.
Temporal Structure
The 24 months are divided into the following periods. Note that the first two overlap:
Months 1–6: Behavioural calibration period (no alerts generated)
Months 1–18: Risk intelligence available
Months 19–24: Unlabelled “greenfield” period
The final 6 months (months 19–24) contain no risk intelligence. This is intentional and designed to support unbiased evaluation and model validation.
Do not assume that risk intelligence data exists beyond month 18.
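The period boundaries above can be encoded as a small helper so that downstream code does not hard-code month numbers in several places. This is a minimal sketch: the function name and the returned keys are illustrative assumptions, and it assumes months are numbered 1 to 24 relative to the start of the dataset.

```python
def describe_month(month: int) -> dict:
    """Classify a dataset month (1-24) against the documented periods.

    The function name and the dictionary keys are hypothetical; only the
    month boundaries come from the reference text.
    """
    if not 1 <= month <= 24:
        raise ValueError(f"month must be between 1 and 24, got {month}")
    return {
        "calibration": month <= 6,                   # months 1-6, no alerts generated
        "risk_intelligence_available": month <= 18,  # months 1-18
        "greenfield": month >= 19,                   # months 19-24, unlabelled
    }


# Month 3 is both a calibration month and inside the risk intelligence window.
early = describe_month(3)
# Month 20 falls in the unlabelled greenfield period.
late = describe_month(20)
```

A check such as `describe_month(m)["risk_intelligence_available"]` can then guard any step that reads risk intelligence.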
Key Join Principles
When joining tables, always include bank_id in join conditions.
Core identifiers include:
entity_id (customer-level identifier)
transaction_id
alert_id (where applicable)
Transactions link to entities via entity_id.
Risk intelligence and AML alerts link to entities using entity_id.
Related party relationships link entities to other entities via defined relationship types.
Ensure join logic reflects the multi-table nature of the dataset rather than assuming a flattened structure.
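As an illustration of the join principle, the sketch below links transactions to entities on the composite key (bank_id, entity_id) using plain Python dictionaries. The sample rows and the `customer_type` column are invented for the example, and the possibility that an entity_id value repeats under a different bank_id is an assumption made to show why the composite key matters. Consult the Data Dictionary for the actual table and column names.

```python
def join_transactions_to_entities(transactions, entities):
    """Inner-join transaction rows to entity rows on (bank_id, entity_id)."""
    # Index entities by the composite key rather than by entity_id alone.
    entity_index = {(e["bank_id"], e["entity_id"]): e for e in entities}
    joined = []
    for txn in transactions:
        entity = entity_index.get((txn["bank_id"], txn["entity_id"]))
        if entity is None:
            continue  # no matching entity: dropped, as in an inner join
        joined.append({**entity, **txn})
    return joined


# Hypothetical sample rows: the same entity_id appears under two bank_id values.
entities = [
    {"bank_id": "B1", "entity_id": "E1", "customer_type": "individual"},
    {"bank_id": "B2", "entity_id": "E1", "customer_type": "business"},
]
transactions = [
    {"bank_id": "B1", "entity_id": "E1", "transaction_id": "T1"},
    {"bank_id": "B2", "entity_id": "E1", "transaction_id": "T2"},
]

rows = join_transactions_to_entities(transactions, entities)
```

Under these assumptions, joining on entity_id alone would attach both entity rows to each transaction, while the composite key yields exactly one match per transaction.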
Entity Layer
The customer profile tables contain:
KYC information
Customer type (individual or business)
Risk level indicators
Geographic and behavioural attributes
Business entities may include:
Directors
UBO/PSC relationships
Related party linkages
A customer risk level does not indicate confirmed criminality. It is a behavioural signal within the ecosystem.
Transaction Layer
Transaction tables represent a range of payment types, including:
Domestic transfers
International transfers
Card activity
Cash activity
Standing orders and direct debits
Transaction volume varies significantly across entities. This is intentional and reflects realistic customer behaviour distributions.
Detection performance may vary across transaction types and volume bands.
Risk Intelligence and Alerts
Risk intelligence includes:
Alerts
SAR indicators
Exit markers
Risk intelligence contains both true positives and false positives. This is deliberate and designed to reflect real-world signal-to-noise conditions.
SAR filing does not automatically result in exit.
Transaction activity may still appear after an exit marker, during the operational close-out window.
Risk intelligence should not be treated as a training label without understanding its structure and temporal limits.
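One way to make the close-out behaviour visible before building labels is to list transactions dated after an entity's exit marker. The sketch below is illustrative only: the `booking_date` field and the `exit_dates` mapping are hypothetical names, and the real exit marker and date fields should be taken from the Data Dictionary.

```python
from datetime import date


def post_exit_transactions(transactions, exit_dates):
    """Return transactions dated strictly after the entity's exit marker.

    exit_dates maps (bank_id, entity_id) to the exit date; entities without
    an exit marker are simply absent from the mapping.
    """
    flagged = []
    for txn in transactions:
        exit_date = exit_dates.get((txn["bank_id"], txn["entity_id"]))
        if exit_date is not None and txn["booking_date"] > exit_date:
            flagged.append(txn)
    return flagged


# Hypothetical sample data.
exit_dates = {("B1", "E1"): date(2024, 3, 1)}
transactions = [
    {"bank_id": "B1", "entity_id": "E1", "transaction_id": "T1", "booking_date": date(2024, 2, 20)},
    {"bank_id": "B1", "entity_id": "E1", "transaction_id": "T2", "booking_date": date(2024, 3, 5)},
    {"bank_id": "B1", "entity_id": "E2", "transaction_id": "T3", "booking_date": date(2024, 3, 5)},
]

after_exit = post_exit_transactions(transactions, exit_dates)
```

Whether such transactions are kept, excluded, or handled separately is a modelling decision. The point of the sketch is only to surface them rather than treat the exit marker as a hard cut-off.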
Network and Related Parties
The dataset includes:
Multi-entity networks
Cross-institutional networks
UBO/PSC links
Shared directors
Shared addresses
Family relationships
Network detection performance is a core dimension of the Syntheticr scorecard.
Users should consider both entity-level detection and network-level detection when evaluating system behaviour.
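For network-level evaluation, a common first step is to group entities into connected components using links such as shared directors or shared addresses. The sketch below uses a simple union-find structure. It is not a description of how Syntheticr constructs or scores networks, and the input format (pairs of entity identifiers) is an assumption.

```python
from collections import defaultdict


def connected_components(links):
    """Group entities into networks given an iterable of (entity_a, entity_b) links."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort members and components so the output is deterministic.
    return sorted((sorted(members) for members in groups.values()), key=lambda g: g[0])


# Hypothetical links, e.g. derived from shared directors or shared addresses.
# In practice each node could be a (bank_id, entity_id) tuple instead of a string.
links = [("E1", "E2"), ("E2", "E3"), ("E4", "E5")]
networks = connected_components(links)
```

Entity-level alerts can then be mapped onto these components to see how much of each network a system has surfaced.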
Scorecard Context
The Syntheticr scorecard evaluates detection performance against fully known ground truth.
Key performance concepts:
Detection rate: proportion of criminal entities detected
Precision: proportion of alerts that correctly identify criminal entities
Detection grade: entity-level grading based on detection rate
Detection rate should be interpreted alongside precision: optimising for detection rate alone encourages over-alerting, while optimising for precision alone encourages under-detection.
The scorecard measures detection capability only. It does not assess investigative workflow quality.
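The two headline metrics can be approximated locally from a list of alerts and a set of known criminal entities. The sketch below follows the definitions given above. It is not the scorecard's actual implementation, and the input shapes (a set of criminal entity identifiers and one alerted entity identifier per alert) are assumptions.

```python
def detection_rate(criminal_entities, alerted_entities):
    """Proportion of criminal entities that received at least one alert."""
    criminal = set(criminal_entities)
    if not criminal:
        return 0.0
    detected = criminal & set(alerted_entities)
    return len(detected) / len(criminal)


def precision(criminal_entities, alerted_entities):
    """Proportion of alerts that point at a criminal entity.

    alerted_entities holds one entry per alert, so repeated alerts on the
    same entity are each counted.
    """
    alerts = list(alerted_entities)
    if not alerts:
        return 0.0
    criminal = set(criminal_entities)
    correct = sum(1 for entity in alerts if entity in criminal)
    return correct / len(alerts)


# Hypothetical example: four criminal entities, five alerts.
criminal = {"E1", "E2", "E3", "E4"}
alerts = ["E1", "E1", "E2", "E8", "E9"]

rate = detection_rate(criminal, alerts)  # 2 of 4 criminal entities detected
prec = precision(criminal, alerts)       # 3 of 5 alerts hit a criminal entity
```

In this example the detection rate is 0.5 and the precision is 0.6. Adding alerts on non-criminal entities would lower precision while leaving the detection rate unchanged, which is why the two are read together.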
Data Dictionary
For complete field-level definitions and table structures:
Download Data Dictionary (XLSX) →
If you encounter ambiguity in table structure or joins, contact hello@syntheticr.ai.