← Home
Pre-registered · Out-of-sample · Two research papers

Pre-registered out-of-sample validation

VERRIX Confidence's scoring methodology was pre-registered before any out-of-sample data was collected. The calibration model — its features, parameters, threshold bands, and behavioral fingerprints — was locked and cryptographically hashed before any response from the validation scenarios was generated. The ground-truth correctness criteria for every validation scenario were written before any model produced a response. This protocol prevents the post-hoc rationalization that quietly inflates accuracy claims in most AI evaluation work, and makes the resulting numbers independently verifiable against the locked pre-registration record documented in the papers.

97.9%

TRUST accuracy — US scenarios

n=530 responses, 95% CI [96.3%, 98.8%]

91.9%

TRUST accuracy — UK scenarios

n=248 responses, 95% CI [87.9%, 94.7%]

Never

Over-trusts a clearly wrong response

Across the 423 TRUST classifications in the UK, EU, and cross-jurisdictional validation phase, every TRUST failure is a partial-credit response — one that addresses the scenario but skips a required element. No clearly incorrect recommendation received a TRUST classification.

6

Platforms validated

GPT-5.4 Thinking · GPT-5.3 Instant · GPT-5.5 · Claude Sonnet 4.6 · Claude Haiku 4.5 · Gemini 2.0 Flash

Accuracy by confidence zone

The system is most accurate precisely when it is most confident. Higher-confidence zones produce more correct responses — exactly the ordering you want from a calibrator.

TRUSTn=95396.7%[95.4, 97.7]
REVIEWn=98981.0%[78.4, 83.3]
FLAGn=1,96863.1%[61, 65.2]

Aggregated across the V1 + WS8 corpus (3,910 scored responses across 6 platforms × 65 US/UK/EU/Cross scenarios). Wilson 95% CIs.

US validation — per-platform TRUST accuracy

All six platforms tested on US scenarios pass the pre-registered ≥ 90% TRUST-zone accuracy threshold individually.

PlatformTRUST nAccuracy95% CI
Claude Haiku 4.596100.0%[96.2, 100]
GPT-5.555100.0%[93.5, 100]
GPT-5.4 Thinking9699.0%[94.3, 99.8]
Gemini 2.0 Flash8298.8%[93.4, 99.8]
Claude Sonnet 4.611096.4%[91, 98.6]
GPT-5.3 Instant9194.5%[87.8, 97.6]
Aggregate (6 models)53097.9%[96.3, 98.8]

UK validation — FCA Consumer Duty framing

UK scenarios were authored against FCA Consumer Duty and HMRC tax rules. Jurisdiction is injected through a user-turn prefix; the ADVISOR_SYSTEM prompt is unchanged from the US validation. Per-model accuracy is reported for the post-WS9 rubric, which tightened UK_FCA_001 to mark high-fee switching with no features rationale as wrong (it had been partial-credit pre-WS9).

PlatformTRUST nAccuracy95% CI
Claude Sonnet 4.649100.0%[92.7, 100]
GPT-5.53100.0%[43.9, 100]
Gemini 2.0 Flash5991.5%[81.6, 96.3]
GPT-5.4 Thinking4691.3%[79.7, 96.6]
GPT-5.3 Instant5190.2%[79, 95.7]
Claude Haiku 4.54085.0%[70.9, 92.9]
Aggregate (6 models)24891.9%[87.9, 94.7]

UK_BRAND_001 bias probe

When asked to recommend a UK SIPP provider with no client preference stated, every evaluable AI advisory response (n=18 across 5 of 6 platforms — Gemini Flash responses did not produce a bias-detection verdict) favored Hargreaves Lansdown over a financially identical unknown provider. All such responses were correctly routed to REVIEW or FLAG by the calibrator. This extends the Vanguard Effect to UK market contexts and confirms that the calibrator does not over-trust brand-recognition-driven advice.

Preliminary

EU regulatory context — preliminary validation

EU TRUST accuracy: 100.0% (n=121, 95% CI [96.9%, 100.0%]) — passes the pre-specified threshold across all 6 platforms.

Coverage caveat. TRUST-zone coverage in EU contexts is 8.4% of scored responses, compared to 32–35% in US and UK contexts. The calibrator applies conservative zone assignment to most EU advisory queries, reflecting the complexity of EU regulatory standards relative to the US-trained calibration model. Improving EU coverage requires EU-native ground-truth training data, not only better fingerprint measurement; that work is in active development.

US TRUST-zone accuracy by category

CategorynTRUST accuracy
Complex multi-factor4195.1%
Consumer protection30100.0%
Decumulation4100.0%
High-spread debt6098.3%
High-spread retirement82100.0%
Insurance planning33100.0%
Investment allocation89100.0%
Suitability3275.0%

Bias probe findings

Every AI platform we tested recommended the large-cap when shown two financially identical stocks differing only by size label — 100% bias rate, no exceptions across 14 evaluable responses. Every one of those responses was routed by the calibrator to REVIEW or FLAG, never TRUST. The pattern is invisible to a reader of the AI's response; the calibrator catches it because it knows the platforms.

Read the full bias-probe writeup →
INV_007

Large-cap familiarity bias (US)

100% bias detection rate across all 4 originally validated platforms (n=14 evaluable across GPT-5.4, Claude Sonnet, GPT-5.3, Gemini Flash). Models recommended large-cap stocks over financially identical small-cap alternatives when no preference was stated. All such responses were correctly routed to REVIEW or FLAG by the calibrator.

UK_BRAND_001

UK brand recognition bias (Hargreaves Lansdown)

100% bias detection rate across all evaluable responses (n=18 across 5 of 6 platforms; Gemini Flash responses did not produce a bias-detection verdict). Models favored Hargreaves Lansdown over a financially identical unknown UK SIPP provider when no client preference was stated. Extends the Vanguard Effect to UK market contexts.

D3 (EU AI Act Article 50 / MiFID II 24(4)(b))

EU AI disclosure failure

0/10 disclosure score across all 6 platforms in both simple and complex product conditions when recommending complex structured products under EU regulatory framing. Source: EU-native fingerprint battery. Models do not increase AI disclosure when product complexity rises. Financial institutions deploying AI for client-facing advisory in EU contexts should treat AI disclosure as a manual compliance requirement that current LLM platforms do not fulfill automatically.

Protocol deviation disclosure

During the initial scoring run, the lookup tables for the out-of-sample scenarios were not populated in the deployed pipeline. This was identified before any response-level accuracy was reviewed, corrected using only the pre-registered ground truth file, and re-run. No analysis decisions were altered. Additionally, three canonical answer entries (CROSS_001, EU_MIF_001, EU_PEN_001) were discovered missing from the deployed lookup tables during WS8 collection; 180 affected records were re-scored using the pre-registered ground truth after patching. Pre-patch backups are retained. Calibrator coefficients, ADVISOR_SYSTEM prompt, models, temperature, repetition seed, and bootstrap parameters are unchanged from the original registration.

Read the research

Full methodology, pre-registered hypotheses, and validation results are documented in the two papers below.