Pre-registered study · 6 AI platforms · 24,880 trials

AI models each have their own unique, measurable and predictable biases in how they offer financial advice.

VERRIX maps your model’s ‘advice genome’ — so you know which responses to trust, before they reach your clients. One output at a time.

AI models tested · Bias dimensions measured · Clusters · Behavioral trials
For institutions

AI financial advisors are not neutral.

Each major model carries a measurable, predictable bias profile, which we call its advice genome. Across six platforms, 26 dimensions, and 4,495 scored responses, every model expresses structural preferences for established brands, technology sector exposure, and momentum amplification across market regimes. None of these biases is traceable to a human design decision. None is auditable through code review. All of them reach institutional capital through three channels (sanctioned enterprise tools, shadow AI usage, and embedded analytics) that existing AI governance and capability benchmarks do not measure.

VERRIX is the behavioral measurement layer.

Six platforms validated

GPT · Claude · Gemini

930+ TRUST classifications

Pre-registered, no clearly-wrong cases

Regulatory mapping

Reg BI · FCA Consumer Duty · MiFID II

Key finding

When we gave every major AI platform two investment funds identical in returns, fees, and risk — and changed only the provider name — every model recommended the well-known brand. Consistently. By a large margin.

The recommendation came from the training data. Not from the financial analysis.
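The design behind this finding can be illustrated with a short sketch. It assumes a hypothetical `Fund` record and prompt wording (neither comes from the study) and uses a made-up lesser-known provider name, "Northfield". The point is only that the two options share every financial attribute and differ in the provider string, and that each pair is rendered in both listing orders so that position does not confound brand.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Fund:
    """Hypothetical fund record; field names are illustrative only."""
    provider: str
    annual_return_pct: float   # trailing annual return
    expense_ratio_pct: float   # yearly fee
    volatility_pct: float      # simple risk proxy


def matched_pair(template: Fund, known: str, unknown: str) -> tuple[Fund, Fund]:
    """Build two funds that differ only in the provider name."""
    return replace(template, provider=known), replace(template, provider=unknown)


def describe(fund: Fund) -> str:
    return (
        f"{fund.provider} Index Fund: {fund.annual_return_pct}% annual return, "
        f"{fund.expense_ratio_pct}% expense ratio, {fund.volatility_pct}% volatility"
    )


def build_prompts(a: Fund, b: Fund) -> list[str]:
    """Render the same question in both listing orders."""
    question = "Which fund would you recommend for a long-term investor?"
    return [
        f"Option 1: {describe(x)}\nOption 2: {describe(y)}\n{question}"
        for x, y in ((a, b), (b, a))
    ]


template = Fund(provider="", annual_return_pct=7.0,
                expense_ratio_pct=0.10, volatility_pct=15.0)
# "Northfield" is an invented name standing in for a lesser-known provider.
known, unknown = matched_pair(template, "Vanguard", "Northfield")
prompts = build_prompts(known, unknown)
print(prompts[0])
```

If a model's choice changes between the two funds here, the provider name is the only attribute that can explain it, because every number in the two descriptions is the same.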

See the full Vanguard Effect finding →
TRUST-zone accuracy (%)
Pre-registered out-of-sample (n=953)
Platforms validated
US + UK regulatory contexts
TRUST > FLAG accuracy gap (pp)
Monotonic gradient confirmed

Pre-registered before data collection · 4,495 responses across 6 platforms × 65 US/UK/EU/Cross scenarios

What we found across all platforms

AI financial advisors have fingerprints.

Systematic patterns in what they recommend — based on framing, brand, and presentation — not just on financial facts. These patterns appear regardless of model architecture, provider, or training approach.

Pre-registered · Download the paper (PDF) · 95% CIs via bootstrap (10,000 resamples)
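As a rough illustration of how bootstrap intervals like these are typically computed, the sketch below implements a percentile bootstrap with 10,000 resamples. The binary outcome data, the `rate_gap` statistic, and all function names are assumptions made for the example; the paper's exact resampling scheme and statistic may differ.

```python
import random
from typing import Callable, Sequence


def rate_gap(group_a: Sequence[int], group_b: Sequence[int]) -> float:
    """Difference in recommendation rates between two binary samples."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


def bootstrap_ci(
    group_a: Sequence[int],
    group_b: Sequence[int],
    statistic: Callable[[Sequence[int], Sequence[int]], float] = rate_gap,
    n_resamples: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap: resample each group with replacement,
    recompute the statistic, and read the interval off the sorted results."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(
            rng.choices(group_a, k=len(group_a)),
            rng.choices(group_b, k=len(group_b)),
        )
        for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Made-up outcomes: 1 = the model gave the target recommendation, 0 = it did not.
condition_a = [1] * 80 + [0] * 20
condition_b = [1] * 30 + [0] * 70
lower, upper = bootstrap_ci(condition_a, condition_b)
print(f"gap = {rate_gap(condition_a, condition_b):.2f}, "
      f"95% CI = [{lower:.2f}, {upper:.2f}]")
```

Any two-sample statistic, such as an effect size, can be passed as `statistic` in place of the rate gap.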

E2 · Structural Preferences

The Vanguard Effect

All models show preference for well-known fund providers

E2
h = 1.30 · Very Large

When presented with functionally identical investment options from well-known and lesser-known providers, every model tested systematically recommended the recognized brand. Vanguard, Fidelity, and BlackRock received disproportionate recommendations regardless of actual fund characteristics.

Effect size · per model · h ∈ [0.45, 2.17]
GPT-5.3 Instant: +1.37
GPT-5.4 Thinking: +1.69
Gemini 2.0 Flash: +2.17
Claude Sonnet 4.6: +1.54
GPT-5.2: +0.59
Claude Haiku 4.5: +0.45
Lowest: Claude Haiku 4.5 (0.45) · Highest: Gemini 2.0 Flash (2.17)
GPT-5.3 Instant · GPT-5.4 Thinking · Gemini 2.0 Flash · Claude Sonnet 4.6 · GPT-5.2 · GPT-5.5
Implications

This "brand halo effect" may direct assets away from equivalent or superior products from smaller providers, potentially reducing competition and client returns.
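The h values on this page appear to be Cohen's h, an effect size for the difference between two proportions. That reading is an assumption, since the page does not name the statistic. Under it, h is the difference between the arcsine-transformed proportions, as in the sketch below. The proportions used are hypothetical, and the small/medium/large cutoffs are Cohen's conventional 0.2/0.5/0.8; how the page defines "Very Large" is not stated.

```python
import math


def cohen_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def conventional_label(h: float) -> str:
    """Cohen's rule-of-thumb labels (0.2 small, 0.5 medium, 0.8 large)."""
    size = abs(h)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


# Hypothetical rates: share of trials recommending the well-known brand
# versus the lesser-known one when the funds are otherwise identical.
h_example = cohen_h(0.90, 0.30)
print(f"h = {h_example:.2f} ({conventional_label(h_example)})")

# The statistic is bounded: proportions of 1.0 and 0.0 give h = pi (about 3.14).
h_max = cohen_h(1.0, 0.0)
print(f"maximum h = {h_max:.2f}")
```

Because h cannot exceed π, a per-model value of 3.14 means the two conditions were separated completely: one outcome in every trial of one condition and the opposite outcome in every trial of the other.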

E1 · Structural Preferences

Technology Sector Preference

Systematic over-weighting of technology investments

E1
h = 0.82 · Large

All models recommend higher allocations to technology stocks compared to equivalent investments in other sectors, even when fundamentals, valuations, and risk profiles are matched. This bias persists across bull, bear, and neutral market conditions.

Effect size · per model · h ∈ [0.00, 1.34]
GPT-5.3 Instant: +1.34
GPT-5.4 Thinking: +0.79
Gemini 2.0 Flash: +0.78
Claude Sonnet 4.6: +1.17
GPT-5.2: +0.83
Claude Haiku 4.5: +0.00
Lowest: Claude Haiku 4.5 (0.00) · Highest: GPT-5.3 Instant (1.34)
GPT-5.3 Instant · GPT-5.4 Thinking · Gemini 2.0 Flash · Claude Sonnet 4.6 · GPT-5.2 · GPT-5.5
Implications

At scale, AI advisors may amplify existing market concentration in tech, contributing to bubble dynamics and systematic risk for retail investors.

F2 · Suitability

Time Horizon Adaptation

Perfect differentiation between short- and long-term advice

F2
h = 2.62 · Ceiling Effect

This is one of the largest effects in the study, a statistical ceiling. All models appropriately recommend more conservative allocations for short-term goals and more aggressive allocations for long-term horizons. This represents strong suitability alignment.

Effect size · per model · h ∈ [0.00, 3.14]
GPT-5.3 Instant: +3.14
GPT-5.4 Thinking: +3.14
Gemini 2.0 Flash: +3.14
Claude Sonnet 4.6: +3.14
GPT-5.2: +3.14
Claude Haiku 4.5: +0.00
Lowest: Claude Haiku 4.5 (0.00) · Highest: GPT-5.2 (3.14)
GPT-5.3 Instant · GPT-5.4 Thinking · Gemini 2.0 Flash · Claude Sonnet 4.6 · GPT-5.2 · GPT-5.5
Implications

A positive finding: AI advisors correctly adapt risk recommendations to time horizon, a core regulatory requirement under FINRA Rule 2111 and MiFID II suitability standards.

D2 · Regulatory Compliance

Cost Disclosure Compliance

Universal ceiling effect on fee disclosure

D2
h = 2.89 · Ceiling Effect

When recommending investment products, all models proactively disclose fee information without being asked. This includes expense ratios, management fees, and trading costs. The effect approaches the statistical ceiling.

Effect size · per model · h ∈ [0.00, 0.00]
GPT-5.3 Instant: +0.00
GPT-5.4 Thinking: +0.00
Gemini 2.0 Flash: +0.00
Claude Sonnet 4.6: +0.00
GPT-5.2: +0.00
Claude Haiku 4.5: +0.00
Lowest: GPT-5.3 Instant (0.00) · Highest: Claude Haiku 4.5 (0.00)
GPT-5.3 Instant · GPT-5.4 Thinking · Gemini 2.0 Flash · Claude Sonnet 4.6 · GPT-5.2 · GPT-5.5
Implications

AI advisors meet SEC Reg BI Care Obligation requirements for cost disclosure. This suggests RLHF training has effectively embedded regulatory compliance for this dimension.

Model fingerprints

Six platforms. Six different profiles.

Each model carries systematic tendencies that shape its recommendations. Here is what we found.

GPT-5.3 Instant

The Directive Optimist

OpenAI

Direct recommendations with higher anchoring susceptibility

A4 · Strongest Anchoring Bias
h = 0.80

Most susceptible to arbitrary price anchors. When a scenario mentions "the stock was at $150 last month," recommendations shift significantly toward that reference point.

B5 · High Availability Cascade
h = 1.20

Heavily weights recent, memorable events. Media-salient information disproportionately influences recommendations.

G1 · Presentation Order Sensitivity
h = 0.45

Recommendations are influenced by the order in which options are listed. First-mentioned options receive a slight preference.

Strengths
  • Most direct recommendations
  • Fast, decisive guidance
  • Clean regulatory compliance
Watch when using
  • Anchoring to mentioned prices
  • Tech sector over-allocation
  • Recency bias in volatile markets
Deployment note

Consumer-default model likely influencing largest retail volume. Anchoring bias suggests price-sensitive recommendations.

Full fingerprint →

GPT-5.4 Thinking

The Deliberative Calibrator

OpenAI

Extended reasoning reduces some biases but introduces overconfidence

C2 · Highest Overconfidence
h = 0.68

Extended reasoning leads to more definitive predictions. Expresses higher certainty than warranted by available information.

A4 · Reduced Anchoring
h = 0.52

Chain-of-thought reasoning partially mitigates anchoring bias relative to GPT-5.3 Instant (h = 0.52 vs 0.80, a 35% smaller effect).

E3 · Geographic Home Bias
h = 0.71

Shows stronger preference for US/Anglo-American markets compared to other models.

Strengths
  • Deep analytical reasoning
  • Reduced heuristic biases
  • Well-calibrated probability statements
Watch when using
  • Overconfident predictions
  • Verbose output may bury key advice
  • Geographic concentration risk
Deployment note

Premium reasoning model for high-net-worth and advisory use cases. Geographic bias may amplify US equity flows.

Full fingerprint →

Gemini 2.0 Flash

The Consistent Optimist

Google

Most reliable patterns but strongest brand preferences

E2 · Strongest Brand Preference
h = 2.17

The highest brand bias in the study. Vanguard and major providers receive overwhelming preference over equivalent alternatives.

G3 · Lowest Context Sensitivity
h = 0.18

Most resistant to irrelevant contextual information. Maintains consistent recommendations across scenario variations.

E4 · Product Type Preference
h = 0.94

Strong systematic preference for ETFs over mutual funds regardless of client tax situation or trading patterns.

Strengths
  • Most consistent across scenarios
  • Reliable baseline compliance
  • Resistant to framing manipulation
Watch when using
  • Extreme brand concentration
  • May recommend Vanguard inappropriately
  • ETF bias regardless of suitability
Deployment note

Google ecosystem integration means broad consumer reach. Strongest brand bias creates predictable flow concentration.

Full fingerprint →

Claude Sonnet 4.6

The Cautious Contrarian

Anthropic

Strongest regulatory focus with distinctive bias profile

D3 · Best AI Disclosure
h = 1.84

Most consistent at identifying itself as AI and recommending human financial advisor consultation for complex decisions.

B5 · Highest Availability Bias
h = 1.59

Paradoxically, shows strongest weighting of recent/memorable events despite Constitutional AI training.

E1 · Lowest Sector Bias
h = 0.78

Least technology sector preference among all models. More balanced sector recommendations.

Strengths
  • Best regulatory compliance
  • Lowest sector concentration
  • Transparent about AI limitations
Watch when using
  • Higher refusal rate
  • May be overly cautious
  • Availability bias in volatile markets
Deployment note

Constitutional AI training yields highest compliance but paradoxical availability bias. Most balanced sector allocation.

Full fingerprint →

GPT-5.2

The Steady Traditionalist

OpenAI

Hype-resistant with strong status quo and home market preferences

B5 · Most Recency-Resistant
h = -1.61

Strongest resistance to availability/recency bias among all models. Less swayed by recent market events or media-salient information.

A6 · Strong Status Quo Bias
h = 1.69

Highest preference for maintaining current allocations. Recommends staying the course even when alternatives have equivalent fundamentals.

E3 · Geographic Home Bias
h = 1.29

Strong preference for US/Anglo-American markets. International diversification recommendations are systematically lower.

Strengths
  • Resistant to market hype
  • Stable recommendations during volatility
  • Lower brand preference bias
Watch when using
  • Status quo bias may prevent appropriate rebalancing
  • Strong home bias
  • Narrative susceptibility
Deployment note

Older GPT architecture shows a distinctive pattern: resistant to recency but anchored to the status quo. Lower brand bias than its successors.

Full fingerprint →
Release-day fingerprint · April 25, 2026

GPT-5.5

The Invariant

OpenAI

h ≈ 0 across 24 of 26 dimensions — the most accurate model VERRIX has profiled

h ≈ 0 across 24 of 26 dimensions
h ≈ 0

GPT-5.5 scored ~9/10 in both conditions across virtually every tested dimension — equal quality regardless of framing, anchoring, or ordering.

97.6% battery accuracy
97.6%

Highest in the VERRIX validation battery — compared to 37.6–43.5% for prior generations.

C4 / C5 · Only two detectable effects
h = ±0.26

C4 base-rate usage (h=−0.26) and C5 evidence updating (h=+0.26) are the only non-zero dimensions at the standard binarisation threshold — both below the medium-effect threshold, confirming measurement sensitivity.

Strengths
  • Effectively condition-invariant across framing dimensions
  • Highest battery accuracy in the VERRIX dataset
  • Equal quality in both conditions of every A/B scenario
Context
  • h≈0 is a finding, not a measurement failure
  • Release-day fingerprint only, not a full Wave 1 evaluation
  • VERRIX monitors for drift as the model matures
Deployment note

Release-day fingerprint, April 25, 2026. The accuracy jump (97.6% vs 37.6–43.5%) and h-flattening are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than systematically biased responses in one direction.

Full fingerprint →
Scenario validation

Tested in realistic advisory scenarios.

Beyond controlled experiments, we tested profiled AI advisors against professional advisory standards in 83 realistic client scenarios — the kind of multi-issue cases a financial advisor encounters daily.

50–86%

miss rate on Social Security timing

All five platforms failed to integrate claiming strategy into retirement advice in at least half of cases.

+10pp

deliberative advantage on complex cases

GPT-5.4 Thinking achieved 95% compliance on multi-domain scenarios vs 85% for GPT-5.3 Instant.

70–90%

range in triage accuracy

When clients present multiple issues, AI advisors range from 70% to 90% accuracy in identifying which problem should be addressed first.