Fraud detection · IBM TabFormer benchmark

On the fraud benchmark NVIDIA uses, the best score comes from the smallest model: Neospace's Cortex — ahead of NVIDIA's and Revolut's far larger ones

That benchmark is IBM's TabFormer, the synthetic card-fraud dataset this line of work is built on, and the dataset on which NVIDIA built its Transaction Foundation Model for fraud. Several models have worked the same task, turning a raw stream of card transactions into a fraud signal, including Revolut's 100M-parameter PRAGMA, which we adapted to this benchmark here. On the full, time-isolated test, Cortex scores higher than any of them, using about 8 million parameters.

0.99 AUPRC (F1 0.96) on the full 2.03M-transaction test, not a sample

974× better than a no-skill classifier at 0.102% fraud prevalence

~8M parameters, 4× smaller than NVIDIA's and 12× smaller than Revolut's PRAGMA

The problem

Fraud shows up in behavior, not in single fields

A single transaction rarely looks fraudulent on its own. The signal sits in the behavior around it: how recently the card was used, which merchants, the spending rhythm, and which pattern it breaks.

For years the standard approach was to hand-engineer features that approximate that behavior and feed them to a classifier. It plateaus quickly: on this benchmark, a strong gradient-boosted model on the raw columns tops out at AUPRC 0.14. That limit is what pushed the field toward transaction foundation models, which train on the raw sequence itself and learn the behavioral context that hand-built features tend to miss.

The models on this benchmark

The models that set the bar

Cortex was measured against the work that defined this problem. Three reference points set the bar, all on the same public benchmark.

The benchmark

IBM TabFormer

IBM's synthetic card-fraud dataset and the origin of this line of work. It is the standard these models are measured on.

The benchmark, held out by time

Foundation model

NVIDIA TFM

NVIDIA's open Transaction Foundation Model. Same dataset, same approach. NVIDIA publishes no metrics, so we ran the blueprint end to end.

~29M parameters · AUPRC 0.18

The heavyweight

Revolut PRAGMA-M

Revolut's transaction foundation model. Its attention is bidirectional, an awkward fit for a benchmark that labels every transaction; we cover that below.

~100M parameters · AUPRC 0.47–0.83

How Cortex works

The score is the detector

Cortex reads each cardholder's raw transaction sequence and produces one fraud score per transaction. That score already carries the behavioral context that drives fraud: recency, merchant patterns, spending rhythm.

On the IBM TabFormer test, held out by time (train 1991–2017, validation 2018, test 2019–2020; 2.03M transactions, 2,068 of them fraud, 0.102%), the Cortex score reaches AUPRC 0.99 and F1 0.96. On the same test, AUPRC is 0.14 for an XGBoost model over the dataset's 13 raw columns, 0.18 for NVIDIA's Transaction Foundation Model, and 0.83 for PRAGMA-M. The dataset and the target are held fixed and only the model changes. Cortex's model has about 8M parameters, against 29M for NVIDIA's and 100M for PRAGMA-M.

The other models hand an embedding to a separate downstream classifier. With Cortex the score itself is the detector, so there is no second model to fuse features or maintain. Across every comparison on this page the dataset, the target, and the time-isolated protocol stay fixed, and only the model and its readout change.

Two lanes on the same dataset, target, and time-isolated split: raw columns versus the Cortex fraud score.

Results · full time-isolated test

Results on the full time-isolated test

The test set is the whole held-out split, isolated by time. We train on 1991 to 2017, validate on 2018, and test on 2019 to 2020: 2,034,720 transactions, of which 2,068 are fraud (0.102%) — not a sample. The decision threshold is set on a pre-2019 validation split and applied frozen to the test years. We report AUPRC and F1 only, because at this prevalence AUROC saturates near 1.0 and stops telling strong models apart.
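A minimal sketch of that protocol, assuming each transaction is reduced to a year, a model score, and a label. The `Txn` type, the function names, and the toy rows are illustrative and are not taken from the published notebook. Picking the threshold that maximises validation F1 is one common rule; the page does not say which rule was actually used.

```python
from typing import Iterable, NamedTuple


class Txn(NamedTuple):
    year: int        # calendar year of the transaction
    score: float     # model fraud score
    is_fraud: bool   # ground-truth label


def split_by_year(rows: Iterable[Txn]) -> dict[str, list[Txn]]:
    """Assign each transaction to train / validation / test by calendar year."""
    splits: dict[str, list[Txn]] = {"train": [], "validation": [], "test": []}
    for row in rows:
        if row.year <= 2017:
            splits["train"].append(row)
        elif row.year == 2018:
            splits["validation"].append(row)
        else:  # 2019 and 2020 in this dataset
            splits["test"].append(row)
    return splits


def f1_at(threshold: float, rows: list[Txn]) -> float:
    """F1 of the rule `score >= threshold` on the given rows."""
    tp = sum(1 for r in rows if r.score >= threshold and r.is_fraud)
    fp = sum(1 for r in rows if r.score >= threshold and not r.is_fraud)
    fn = sum(1 for r in rows if r.score < threshold and r.is_fraud)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def pick_threshold(validation: list[Txn]) -> float:
    """Choose the validation score that maximises validation F1 (assumed rule)."""
    candidates = sorted({r.score for r in validation})
    return max(candidates, key=lambda t: f1_at(t, validation))


# Toy rows, invented for illustration only.
rows = [
    Txn(2016, 0.10, False), Txn(2017, 0.90, True),
    Txn(2018, 0.20, False), Txn(2018, 0.70, True), Txn(2018, 0.40, False),
    Txn(2019, 0.80, True), Txn(2019, 0.30, False), Txn(2020, 0.60, False),
]
splits = split_by_year(rows)
threshold = pick_threshold(splits["validation"])  # frozen from here on
test_f1 = f1_at(threshold, splits["test"])        # test rows never tune it
print(threshold, test_f1)
```

The point of the design is that the test years influence nothing: the threshold is fixed before any 2019–2020 row is read.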

Model	Readout	Params	AUPRC	F1
Raw features	13 columns into XGBoost	n/a	0.14	0.26
NVIDIA TFM	embedding plus raw	29M	0.18	0.23
Revolut PRAGMA-M	embedding plus raw into XGBoost	100M	0.47	0.60
Revolut PRAGMA-M	LoRA fine-tune	100M	0.83	0.81
Cortex	fraud score, standalone	~8M	0.99	0.96

The readout is how each model's output becomes a fraud prediction. NVIDIA's TFM fuses its embedding with the raw columns; its embedding alone falls below the raw baseline. PRAGMA-M appears twice — its embedding fused with the raw columns into XGBoost, and a LoRA fine-tune with a direct fraud head. Cortex's score is reported by itself. Raw, NVIDIA, and Cortex are scored on the identical 2.03M-transaction rows; NVIDIA's embedding fused with the raw columns lifts the raw baseline from 0.14 to 0.18. PRAGMA-M's fine-tune is scored on every fraud plus a sample of legitimate transactions and reweighted to the true 0.1% rate, since it labels one transaction per pass.
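The reweighting step for the PRAGMA-M fine-tune can be sketched as a weighted average precision. This assumes the usual inverse-sampling-rate estimator (every fraud kept with weight 1, each sampled legitimate row weighted by one over its sampling rate); the page does not spell out the exact estimator, and the function name and toy numbers below are illustrative.

```python
def weighted_average_precision(scores, labels, weights):
    """Average precision where each row counts `weight` times.

    Rows that share a score cross the threshold together, so ties are
    handled the same way as in a threshold-based precision-recall curve.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(w for w, y in zip(weights, labels) if y)
    tp = fp = ap = 0.0
    i = 0
    while i < len(order):
        j = i
        tp_gain = 0.0
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            k = order[j]
            if labels[k]:
                tp_gain += weights[k]
            else:
                fp += weights[k]
            j += 1
        tp += tp_gain
        if tp_gain:
            # recall increment times precision at this threshold
            ap += (tp_gain / total_pos) * (tp / (tp + fp))
        i = j
    return ap


# Toy sample: both frauds kept, legitimate rows sampled at 1 in 4.
scores = [0.9, 0.8, 0.7, 0.1]
labels = [True, False, True, False]
sampling_rate = 0.25
weights = [1.0 if y else 1.0 / sampling_rate for y in labels]

naive = weighted_average_precision(scores, labels, [1.0] * 4)
reweighted = weighted_average_precision(scores, labels, weights)
print(f"unweighted sample: {naive:.3f}, reweighted: {reweighted:.3f}")
```

Without the weights the sampled set looks far richer in fraud than the real stream, so precision is overstated; the weights restore the original class balance before the metric is computed.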

AUPRC and F1 across models, sized by parameter count, all on the full 2.03M-transaction time-isolated 2019–2020 test. Cortex runs at about 8M parameters and scores higher than every larger model; NVIDIA is shown as its embedding fused with raw features.

The Cortex fraud score is a standalone detector: AUPRC 0.99 and F1 0.96, about 7× the raw baseline's AUPRC and far above PRAGMA-M (0.83) and NVIDIA's TFM (0.18). It is a calibrated per-transaction probability, fed straight to the metric with no downstream classifier. It also reaches this with the smallest model by a wide margin, about 8M parameters versus NVIDIA's 29M and PRAGMA-M's 100M, which points to the representation rather than scale as the source of the result.

AUPRC is the cleaner axis at this prevalence (threshold-free). All metrics are shown rounded up to two decimals (Cortex's exact AUPRC is 0.989), except the raw-feature baseline, which is rounded to nearest so the bar everyone clears isn't flattered (0.142 → 0.14); the underlying figures are read from the committed results files.
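The headline ratios follow from the test-set counts, and the same counts show why rate-based metrics such as AUROC say little at this prevalence. The sketch below uses only figures quoted on this page, except for the `tpr` and `fpr` values, which are a hypothetical detector invented for illustration.

```python
n_test = 2_034_720    # transactions in the 2019-2020 test split
n_fraud = 2_068       # fraud labels in that split

prevalence = n_fraud / n_test            # expected AUPRC of a no-skill scorer
lift_over_random = 0.99 / prevalence     # uses the displayed, rounded AUPRC
lift_over_raw = 0.99 / 0.14              # Cortex vs. the raw-column XGBoost

# Hypothetical detector: catches 90% of fraud and flags only 1% of
# legitimate transactions. Both rates look excellent on a ROC curve.
tpr, fpr = 0.90, 0.01
true_pos = tpr * n_fraud
false_pos = fpr * (n_test - n_fraud)
precision = true_pos / (true_pos + false_pos)

print(f"prevalence        {prevalence:.3%}")
print(f"lift over random  {lift_over_random:.0f}x")
print(f"lift over raw     {lift_over_raw:.1f}x")
print(f"precision at 1% FPR, 90% TPR: {precision:.1%}")
```

With the displayed 0.99 the lift over a no-skill classifier comes to 974×; with the exact 0.989 it is about 973×. The hypothetical detector, despite its strong ROC rates, raises false alarms on more than nine in ten of its flags, which is the gap that AUPRC exposes and AUROC hides.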

Model sizes

Model sizes — Cortex is the smallest

All three models were run on the same IBM TabFormer dataset. Cortex delivers the strongest fraud detection while being ~4× smaller than NVIDIA's model and ~12× smaller than PRAGMA-M, and trains in well under half PRAGMA-M's time.

Model	Parameters	Training time
Cortex	~8M	~1h20m
NVIDIA TFM	29M	—
Revolut PRAGMA-M	100M	~3h30m

Training time is measured on 4×GB200, across pre-training and fine-tuning, for the two models we trained here; NVIDIA's blueprint ships a checkpoint, so its training was not re-run.

Comparison · Revolut PRAGMA

How this compares to Revolut's PRAGMA

PRAGMA is Revolut's transaction foundation model (arXiv:2604.08649): a 100M-parameter, encoder-only transformer trained with a self-supervised masked-modelling objective. We pretrained PRAGMA-M on the same 1991–2017 training data as Cortex and read it out two ways. Its frozen embedding fused with the raw columns into XGBoost reaches AUPRC 0.47 / F1 0.60; adapting it for fraud with LoRA — the fine-tuning recipe from the PRAGMA paper — and a direct head reaches 0.83 / 0.81. Both clear the raw baseline but fall short of Cortex's 0.99 / 0.96, at about 12× the size.

	PRAGMA-M	NeoLDM · Cortex
Parameters	~100M	~8M
Architecture	encoder-only, masked modelling	decoder, per-transaction score
Feature interaction	implicit (attention only)	direct
Training time	~3h30m	~1h20m
Best AUPRC	0.83	0.99

Fraud usually surfaces in combinations of fields: an amount, a merchant type, an hour, and a location, taken together. PRAGMA has no direct feature interaction; it relates fields only implicitly, through general-purpose attention. Cortex models those interactions directly. This benchmark also labels every transaction, and PRAGMA's attention is bidirectional, so only the transaction at a window's end can be scored without seeing its own future, which means one scored transaction per pass. We therefore score the fine-tune on every fraud plus a sample of legitimate transactions and reweight to the true 0.1% rate, which estimates the metric PRAGMA would reach across the whole set; Cortex scores a whole window in one pass. PRAGMA-M is the 100M-parameter variant from the paper, which we pretrained and fine-tuned on IBM TabFormer ourselves (Ostroukhov et al., PRAGMA: Revolut Foundation Model).
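The one-per-pass constraint can be seen from the attention masks alone. The sketch below compares a bidirectional mask with a causal one and lists the positions whose representation never attends to a later transaction. It assumes a standard causal mask on the decoder side, which this page does not describe in detail, and the helper names are illustrative.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def leak_free_positions(mask: list[list[bool]]) -> list[int]:
    """Positions that never attend to a later transaction in the window."""
    n = len(mask)
    return [i for i in range(n)
            if not any(mask[i][j] for j in range(i + 1, n))]


window = 5
bidirectional_ok = leak_free_positions(bidirectional_mask(window))
causal_ok = leak_free_positions(causal_mask(window))
print(bidirectional_ok)  # [4]
print(causal_ok)         # [0, 1, 2, 3, 4]
```

Under a bidirectional mask only the last position is free of future information, so scoring every transaction in a window of length n takes n passes; under a causal mask all n positions can be scored in one pass.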

Comparison · NVIDIA

Compared with NVIDIA's Transaction Foundation Model

NeoLDM grew up alongside NVIDIA's open Transaction Foundation Model developer blueprint — same dataset (IBM TabFormer), same idea of training a model on raw transactions and using its output for downstream fraud. NVIDIA's blueprint is a 29M-parameter Llama decoder; it pools the last-token hidden state into a 512-d embedding, reduces it to 64-d with PCA, and feeds XGBoost.

NVIDIA's repo ships no metrics in its notebook outputs, so we ran their blueprint end-to-end ourselves — their shipped checkpoint, their notebooks, the real IBM TabFormer data. Their own notebook calls Average Precision (AUPRC) “the operationally critical metric for fraud,” so we lead with it. On the full time-isolated 2019–2020 test (the same rows as Cortex), fusing the foundation-model embedding with the raw columns improves their fraud detection: the raw-feature baseline scores AUPRC 0.14, the embedding alone 0.01, and the two fused reach 0.18 — the combined number we report for NVIDIA.

Cortex's fraud score reaches 0.99 on the full time-isolated test — more than 5× higher, with a model ~4× smaller.

	NVIDIA TFM	NeoLDM · Cortex
Parameters	~29M	~8M
Output	512-d last token, then 64-d PCA embedding	per-transaction fraud score
Downstream	XGBoost	none, the score is the detector
Test set	full 2.03M, time-isolated	full 2.03M, time-isolated
Best AUPRC	0.18 embedding plus raw	0.99

NVIDIA publishes no absolute numbers, so we ran their blueprint ourselves on the full test split — the same rows as Cortex and the raw baseline. The ranking: Cortex (0.99) far ahead, then Revolut's PRAGMA-M (0.83), then NVIDIA's embedding plus raw (0.18), just above the raw baseline (0.14).

What it means

What the result means in practice

The result is not only academic. Each advantage maps to something practical.

Lower running cost

At about 8M parameters, Cortex is 4× smaller than NVIDIA's model and 12× smaller than Revolut's. That means less compute per inference, lower cost at production scale, and a model that is easier to deploy where the data already lives.

Fewer moving parts

The score is the detector, so there is no embedding-plus-classifier stack to build, tune, and maintain. That shortens the path from raw transactions to a decision, and the time it takes to get there.

Reproducible

We publish the full test split, the per-transaction scores, a runnable notebook, and the code under Apache-2.0. The numbers can be checked rather than taken on trust.

Reproduce it

Reproduce the results

The notebook recomputes the Cortex metrics from the published per-transaction scores. Without those scores it falls back to the committed results, so it renders either way. You do not need a GPU, the dataset, or a checkpoint just to see the numbers.

Cortex fraud score

The Cortex score on the full time-isolated test (AUPRC 0.99, 974× a no-skill classifier, F1 0.96), next to the raw 13-column baseline.

Per-transaction scores

Scores over all 24,386,900 transactions, published as 4 parquet shards, downloadable from the browser or with curl.

Dataset & scores

IBM TabFormer

IBM's synthetic credit-card-fraud dataset and the origin of this line of work (arXiv:2011.01843, IBM/TabFormer): 24,386,900 transactions from 2,000 cardholders (9 cards each; the paper's “20,000 users” is a typo), spanning 1991–2020 at 0.11% fraud, split by calendar time (train 1991–2017, validation 2018, test 2019–2020). The raw CSV is fetched from IBM under IBM's terms and is not redistributed here.

Cortex scores

Per-transaction fraud scores over all 24,386,900 transactions, published at embeddings.neospace.ai.

The raw baseline feeds the transaction table's columns to XGBoost directly — no temporal aggregation, no engineered features — so the comparison is clean: a learned Cortex representation versus the raw fields. For reference, the original TabFormer paper reports its best fraud-detection result at F1 ≈ 0.86 (TabBERT features plus an LSTM). That number is not comparable to the results here: the paper predicts over a window of 10 transactions (a fraud window every 10 rows, so roughly 10× higher effective prevalence), on its own windowed split — an easier framing than per-transaction detection at 0.1%. It is a historical reference point on this dataset, not a like-for-like baseline.
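The "roughly 10×" figure can be checked with a short calculation. It assumes a window counts as fraud when at least one of its 10 transactions is fraud, and that frauds fall independently across rows; real fraud clusters in time, so the true ratio on the dataset will be somewhat lower.

```python
p_txn = 0.001   # per-transaction fraud rate, roughly 0.1%
window = 10     # transactions per window in the windowed framing

# Probability that a window contains at least one fraud, under the
# independence assumption stated above.
p_window = 1 - (1 - p_txn) ** window
ratio = p_window / p_txn

print(f"window-level prevalence {p_window:.3%}, about {ratio:.1f}x per-transaction")
```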

References

NVIDIA Transaction Foundation Model — developer blueprint (decoder-only LM on tokenized transactions, last-token embedding → XGBoost): github.com/NVIDIA-AI-Blueprints/transaction-foundation-model.
Revolut PRAGMA — Ostroukhov et al. PRAGMA: Revolut Foundation Model. arXiv:2604.08649.
IBM TabFormer — Padhi, Schiff, Melnyk, Rigotti, Mroueh, Dognin, Ross, Nair, Altman. Tabular Transformers for Modeling Multivariate Time Series. ICASSP 2021. arXiv:2011.01843.
NV-Embed — Lee et al. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2405.17428 — the last-token-pooling recipe that turns a decoder's hidden state into a usable embedding.

Cortex, the transaction model behind NeoLDM