Pre-registered study · 6 AI platforms · 24,880 trials

AI models each have their own unique, measurable and predictable biases in how they offer financial advice.

VERRIX maps your model’s ‘advice genome’ — so you know which responses to trust, before they reach your clients. One output at a time.

Check out our beta products · Read the research

The papers

Read the underlying research.

Two papers cover the empirical findings and the calibration model behind VERRIX Confidence. Both are free to download.

Paper 1 · April 2026

How AI Moves Markets

Systematic Biases and Behavioral Divergence in Major LLMs, and the Implications for Capital Flows and Governance

Six major AI platforms tested across 31 behavioral dimensions and 24,880 trials. Every model recommends well-known brands and large-cap stocks over financially identical alternatives. Each platform carries a distinct, predictable bias profile — an "advice genome" — with material implications for capital flows and regulatory oversight.

PDF · 748 KB · Download PDF

Paper 2 · April 2026

Using LLM Bias to Improve Financial Advice

How an understanding of model bias can be used to improve financial advice and investment decisions

A pre-registered, out-of-sample calibration model that scores any AI financial response as TRUST, REVIEW, or FLAG. AUC = 0.876, Brier score = 0.105. No clearly wrong recommendation received a TRUST rating across 423 TRUST classifications in UK, EU, and cross-jurisdictional validation.
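For readers unfamiliar with the two headline metrics, the sketch below computes a Brier score and a rank-based AUC on a small made-up sample, then maps a confidence score to the three zones. The function names, the sample data, and the 0.8 / 0.5 zone cut-offs are illustrative assumptions, not VERRIX's published thresholds or implementation.

```python
from typing import List


def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def auc(probs: List[float], outcomes: List[int]) -> float:
    """Chance that a random correct response outscores a random incorrect one.

    Ties between a correct and an incorrect response count as half a win.
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def zone(prob: float, trust_at: float = 0.8, review_at: float = 0.5) -> str:
    """Map a confidence score to a zone. Both cut-offs are hypothetical."""
    if prob >= trust_at:
        return "TRUST"
    if prob >= review_at:
        return "REVIEW"
    return "FLAG"


# Made-up data: predicted probability that a response is correct, and whether it was.
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
outcomes = [1, 1, 0, 1, 0]

print("Brier:", round(brier_score(probs, outcomes), 3))
print("AUC:", round(auc(probs, outcomes), 3))
print("Zones:", [zone(p) for p in probs])
```

A lower Brier score means better-calibrated probabilities; an AUC closer to 1 means correct responses are ranked above incorrect ones more often.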

PDF · 270 KB · Download PDF

For institutions

AI financial advisors are not neutral.

Each major model carries a measurable, predictable bias profile, which we call its advice genome. In a study of six platforms covering 26 dimensions and 4,495 scored responses, every model expresses structural preferences for established brands, technology sector exposure, and momentum amplification across market regimes. None of these biases is traceable to a human design decision. None is auditable through code review. All of them reach institutional capital through three channels (sanctioned enterprise tools, shadow AI usage, and embedded analytics) that existing AI governance and capability benchmarks do not measure.

VERRIX is the behavioral measurement layer.

Six platforms validated

GPT · Claude · Gemini

930+ TRUST classifications

Pre-registered, no clearly-wrong cases

Regulatory mapping

Reg BI · FCA Consumer Duty · MiFID II

For institutions → · For advisors →

Key finding

When we gave every major AI platform two investment funds identical in returns, fees, and risk — and changed only the provider name — every model recommended the well-known brand. Consistently. By a large margin.

The recommendation came from the training data. Not from the financial analysis.
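The design described above can be pictured as a matched-pair prompt: two fund descriptions that share every figure and differ only in provider name, with the listing order swapped so position cannot explain the choice. The sketch below is an illustration under assumed names; the fund figures, the provider label "Northfield Capital", and the prompt wording are invented for the example and are not taken from the study.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Fund:
    provider: str
    annual_return_pct: float
    expense_ratio_pct: float
    volatility_pct: float

    def describe(self) -> str:
        return (f"{self.provider} fund: {self.annual_return_pct}% annual return, "
                f"{self.expense_ratio_pct}% expense ratio, "
                f"{self.volatility_pct}% volatility")


def matched_pair(base: Fund, other_provider: str) -> Tuple[Fund, Fund]:
    """Return two funds identical in every field except the provider name."""
    return base, replace(base, provider=other_provider)


def build_prompts(a: Fund, b: Fund) -> List[str]:
    """Emit both listing orders so position effects can be averaged out."""
    template = "Which fund would you recommend?\n1. {}\n2. {}"
    return [template.format(a.describe(), b.describe()),
            template.format(b.describe(), a.describe())]


well_known, lesser_known = matched_pair(
    Fund("Vanguard", 7.0, 0.10, 15.0),  # placeholder figures
    "Northfield Capital",               # hypothetical lesser-known provider
)
prompts = build_prompts(well_known, lesser_known)
for prompt in prompts:
    print(prompt, end="\n\n")
```

Because the two descriptions are numerically identical, any consistent preference in the responses can only come from the name.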

See the full Vanguard Effect finding →

One question. Three ways to answer it.

Should I trust this AI response before I act on it?

Free · No account

VERRIX Checker

Paste any AI financial advisory response. Get a TRUST, REVIEW, or FLAG zone in seconds — with a feature breakdown showing why. For individual investors, RIAs, and compliance teams doing spot checks.

Try it now →

Pre-registered, OOS-validated

VERRIX Confidence

96.7% TRUST-zone accuracy — pre-registered, out-of-sample validated across 6 platforms in US and UK regulatory contexts (EU validation in development). The three-tier system (TRUST · REVIEW · FLAG) tells you when to act, when to check, and when to stop. For advisors and compliance teams who check AI responses systematically and need a record that they did.

See VERRIX Confidence →

Per evaluation

VERRIX Genome

The full 31-dimension advice genome of any AI advisory model — before you deploy it. Dollar-weighted accuracy, regulatory alignment scorecard, comparison to published baselines.

See VERRIX Genome →

96.7%

TRUST-zone accuracy

Pre-registered out-of-sample (n=953)

6

Platforms validated

US + UK regulatory contexts

TRUST > FLAG accuracy gap

Monotonic gradient confirmed
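A monotonic gradient means observed accuracy falls in order from TRUST to REVIEW to FLAG. The check below runs on made-up labelled outcomes; the counts and helper names are illustrative assumptions and do not reproduce the validation data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ZONES = ("TRUST", "REVIEW", "FLAG")


def accuracy_by_zone(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of correct responses within each zone."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for zone_name, correct in records:
        totals[zone_name] += 1
        hits[zone_name] += int(correct)
    return {z: hits[z] / totals[z] for z in ZONES if totals[z]}


def is_monotonic(acc: Dict[str, float]) -> bool:
    """True when accuracy strictly decreases from TRUST to REVIEW to FLAG."""
    values = [acc[z] for z in ZONES if z in acc]
    return all(a > b for a, b in zip(values, values[1:]))


# Hypothetical outcomes: (assigned zone, was the response actually correct?)
records = (
    [("TRUST", True)] * 19 + [("TRUST", False)] * 1
    + [("REVIEW", True)] * 7 + [("REVIEW", False)] * 3
    + [("FLAG", True)] * 4 + [("FLAG", False)] * 6
)

acc = accuracy_by_zone(records)
gap_pp = (acc["TRUST"] - acc["FLAG"]) * 100  # gap in percentage points

print(acc)
print("Monotonic:", is_monotonic(acc))
print("TRUST > FLAG gap (pp):", round(gap_pp))
```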

Pre-registered before data collection · 4,495 responses across 6 platforms × 65 US/UK/EU/Cross scenarios

What we found across all platforms

AI financial advisors have fingerprints.

Systematic patterns in what they recommend — based on framing, brand, and presentation — not just on financial facts. These patterns appear regardless of model architecture, provider, or training approach.

Pre-registered · Download the paper (PDF) · 95% CIs via bootstrap (10,000 resamples)
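A percentile bootstrap of the kind mentioned above can be sketched as follows. The example assumes that h is Cohen's h for two proportions (the page does not define it) and uses invented 0/1 outcomes for the two conditions; the resampling scheme is a generic one and may differ from the procedure in the paper.

```python
import math
import random
from typing import List, Tuple


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def bootstrap_ci(a: List[int], b: List[int], resamples: int = 10_000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for Cohen's h between two 0/1 samples."""
    rng = random.Random(seed)
    stats = []
    for _ in range(resamples):
        ra = rng.choices(a, k=len(a))  # resample each condition with replacement
        rb = rng.choices(b, k=len(b))
        stats.append(cohens_h(sum(ra) / len(ra), sum(rb) / len(rb)))
    stats.sort()
    lo_idx = round((1 - level) / 2 * resamples)
    hi_idx = round((1 + level) / 2 * resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical outcomes: 1 = model recommended the well-known brand.
condition_a = [1] * 32 + [0] * 8    # 80% in condition A
condition_b = [1] * 16 + [0] * 24   # 40% in condition B

point = cohens_h(sum(condition_a) / len(condition_a),
                 sum(condition_b) / len(condition_b))
low, high = bootstrap_ci(condition_a, condition_b)
print(f"h = {point:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```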

E2 · Structural Preferences

The Vanguard Effect

All models show preference for well-known fund providers

h = 1.30 · Very Large

When presented with functionally identical investment options from well-known vs. lesser-known providers, all five models systematically recommended the recognized brand. Vanguard, Fidelity, and BlackRock receive disproportionate recommendations regardless of actual fund characteristics.

Effect size per model · h ∈ [0.45, 2.17]

GPT-5.3 Instant: +1.37
GPT-5.4 Thinking: +1.69
Gemini 2.0 Flash: +2.17
Claude Sonnet 4.6: +1.54
GPT-5.2: +0.59
Claude Haiku 4.5: +0.45
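The headline h = 1.30 matches the unweighted mean of the six per-model values listed above. That reading is inferred from the arithmetic rather than stated on the page, so treat the aggregation rule as an assumption.

```python
from statistics import mean

# Per-model effect sizes for the Vanguard Effect, as listed on the page.
per_model_h = {
    "GPT-5.3 Instant": 1.37,
    "GPT-5.4 Thinking": 1.69,
    "Gemini 2.0 Flash": 2.17,
    "Claude Sonnet 4.6": 1.54,
    "GPT-5.2": 0.59,
    "Claude Haiku 4.5": 0.45,
}

# Assumption: the headline figure is a simple (unweighted) average.
headline = round(mean(per_model_h.values()), 2)
lowest = min(per_model_h, key=per_model_h.get)
highest = max(per_model_h, key=per_model_h.get)

print(f"mean h = {headline:.2f}")
print(f"range: {per_model_h[lowest]} ({lowest}) to {per_model_h[highest]} ({highest})")
```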

Implications

This "brand halo effect" may direct assets away from equivalent or superior products from smaller providers, potentially reducing competition and client returns.

E1 · Structural Preferences

Technology Sector Preference

Systematic over-weighting of technology investments

h = 0.82 · Large

All models recommend higher allocations to technology stocks compared to equivalent investments in other sectors, even when fundamentals, valuations, and risk profiles are matched. This bias persists across bull, bear, and neutral market conditions.

Effect size per model · h ∈ [0.00, 1.34]

GPT-5.3 Instant: +1.34
GPT-5.4 Thinking: +0.79
Gemini 2.0 Flash: +0.78
Claude Sonnet 4.6: +1.17
GPT-5.2: +0.83
Claude Haiku 4.5: +0.00

Implications

At scale, AI advisors may amplify existing market concentration in tech, contributing to bubble dynamics and systematic risk for retail investors.

F2 · Suitability

Time Horizon Adaptation

Perfect differentiation between short- and long-term advice

h = 2.62 · Ceiling Effect

This is one of the largest effects in the study, at the statistical ceiling. All models appropriately recommend more conservative allocations for short-term goals and more aggressive allocations for long-term horizons. This represents strong suitability alignment.
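A per-model value of 3.14 is what Cohen's h gives when one condition produces the behaviour every time and the other never does, since the statistic is bounded by π. The sketch below assumes h on this page is Cohen's h for two proportions (the page does not define it) and labels sizes with Cohen's conventional 0.2 / 0.5 / 0.8 benchmarks. VERRIX's own cut-offs for "Very Large" and "Ceiling Effect" are not stated, so they are not reproduced here.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def conventional_label(h: float) -> str:
    """Cohen's rule-of-thumb labels; not VERRIX's own categories."""
    size = abs(h)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Ceiling case: behaviour appears in 100% of one condition and 0% of the other.
ceiling = cohens_h(1.0, 0.0)
print(ceiling == math.pi)            # the maximum possible h is pi (about 3.14)

# No difference between conditions gives h = 0.
print(cohens_h(0.5, 0.5))

# Hypothetical example: 90% vs 55% of responses show the behaviour.
example = cohens_h(0.90, 0.55)
print(round(example, 2), conventional_label(example))
```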

Effect size per model · h ∈ [0.00, 3.14]

GPT-5.3 Instant: +3.14
GPT-5.4 Thinking: +3.14
Gemini 2.0 Flash: +3.14
Claude Sonnet 4.6: +3.14
GPT-5.2: +3.14
Claude Haiku 4.5: +0.00

Implications

A positive finding: AI advisors correctly adapt risk recommendations to time horizon, a core regulatory requirement under FINRA Rule 2111 and MiFID II suitability standards.

D2 · Regulatory Compliance

Cost Disclosure Compliance

Universal ceiling effect on fee disclosure

h = 2.89 · Ceiling Effect

When recommending investment products, all models proactively disclose fee information without being asked. This includes expense ratios, management fees, and trading costs. The effect approaches the statistical ceiling.

Implications

AI advisors meet SEC Reg BI Care Obligation requirements for cost disclosure. This suggests RLHF training has effectively embedded regulatory compliance for this dimension.

Model fingerprints

Six platforms. Six different profiles.

Each model carries systematic tendencies that shape its recommendations. Here is what we found.

GPT-5.3 Instant

“The Directive Optimist”

OpenAI

Direct recommendations with higher anchoring susceptibility

A4 · Strongest Anchoring Bias

h = 0.80

Most susceptible to arbitrary price anchors. When a scenario mentions "the stock was at $150 last month," recommendations shift significantly toward that reference point.

B5 · High Availability Cascade

h = 1.20

Heavily weights recent, memorable events. Media-salient information disproportionately influences recommendations.

G1 · Presentation Order Sensitivity

h = 0.45

Recommendations are influenced by the order in which options are listed. First-mentioned options receive a slight preference.

Strengths

+ Most direct recommendations
+ Fast, decisive guidance
+ Clean regulatory compliance

Watch when using

! Anchoring to mentioned prices
! Tech sector over-allocation
! Recency bias in volatile markets

Deployment note

Consumer-default model likely influencing largest retail volume. Anchoring bias suggests price-sensitive recommendations.

Full fingerprint →

GPT-5.4 Thinking

“The Deliberative Calibrator”

OpenAI

Extended reasoning reduces some biases but introduces overconfidence

C2 · Highest Overconfidence

h = 0.68

Extended reasoning leads to more definitive predictions. Expresses higher certainty than warranted by available information.

A4 · Reduced Anchoring

h = 0.52

Chain-of-thought reasoning partially mitigates anchoring bias: the effect size is 35% lower than for GPT-5.3 Instant (0.52 vs 0.80).

E3 · Geographic Home Bias

h = 0.71

Shows stronger preference for US/Anglo-American markets compared to other models.

Strengths

+ Deep analytical reasoning
+ Reduced heuristic biases
+ Well-calibrated probability statements

Watch when using

! Overconfident predictions
! Verbose output may bury key advice
! Geographic concentration risk

Deployment note

Premium reasoning model for high-net-worth and advisory use cases. Geographic bias may amplify US equity flows.

Full fingerprint →

Gemini 2.0 Flash

“The Consistent Optimist”

Google

Most reliable patterns but strongest brand preferences

E2 · Strongest Brand Preference

h = 2.17

The highest brand bias in the study. Vanguard and major providers receive overwhelming preference over equivalent alternatives.

G3 · Lowest Context Sensitivity

h = 0.18

Most resistant to irrelevant contextual information. Maintains consistent recommendations across scenario variations.

E4 · Product Type Preference

h = 0.94

Strong systematic preference for ETFs over mutual funds regardless of client tax situation or trading patterns.

Strengths

+ Most consistent across scenarios
+ Reliable baseline compliance
+ Resistant to framing manipulation

Watch when using

! Extreme brand concentration
! May recommend Vanguard inappropriately
! ETF bias regardless of suitability

Deployment note

Google ecosystem integration means broad consumer reach. Strongest brand bias creates predictable flow concentration.

Full fingerprint →

Claude Sonnet 4.6

“The Cautious Contrarian”

Anthropic

Strongest regulatory focus with distinctive bias profile

D3 · Best AI Disclosure

h = 1.84

Most consistent at identifying itself as AI and recommending human financial advisor consultation for complex decisions.

B5 · Highest Availability Bias

h = 1.59

Paradoxically, shows strongest weighting of recent/memorable events despite Constitutional AI training.

E1 · Lowest Sector Bias

h = 0.78

Least technology sector preference among all models. More balanced sector recommendations.

Strengths

+ Best regulatory compliance
+ Lowest sector concentration
+ Transparent about AI limitations

Watch when using

! Higher refusal rate
! May be overly cautious
! Availability bias in volatile markets

Deployment note

Constitutional AI training yields highest compliance but paradoxical availability bias. Most balanced sector allocation.

Full fingerprint →

GPT-5.2

“The Steady Traditionalist”

OpenAI

Hype-resistant with strong status quo and home market preferences

B5 · Most Recency-Resistant

h = -1.61

Strongest resistance to availability/recency bias among all models. Less swayed by recent market events or media-salient information.

A6 · Strong Status Quo Bias

h = 1.69

Highest preference for maintaining current allocations. Recommends staying the course even when alternatives have equivalent fundamentals.

E3 · Geographic Home Bias

h = 1.29

Strong preference for US/Anglo-American markets. International diversification recommendations are systematically lower.

Strengths

+ Resistant to market hype
+ Stable recommendations during volatility
+ Lower brand preference bias

Watch when using

! Status quo bias may prevent appropriate rebalancing
! Strong home bias
! Narrative susceptibility

Deployment note

Older GPT architecture shows distinctive pattern: resistant to recency but anchored to status quo. Lower brand bias than successors.

Full fingerprint →

Release-day fingerprint · April 25, 2026

GPT-5.5

“The Invariant”

OpenAI

h ≈ 0 across 24 of 26 dimensions — the most accurate model VERRIX has profiled

h ≈ 0 across 24 of 26 dimensions

h ≈ 0

GPT-5.5 scored ~9/10 in both conditions across virtually every tested dimension — equal quality regardless of framing, anchoring, or ordering.

97.6% battery accuracy

97.6%

Highest in the VERRIX validation battery — compared to 37.6–43.5% for prior generations.

C4 / C5 · Only two detectable effects

h = ±0.26

C4 base-rate usage (h=−0.26) and C5 evidence updating (h=+0.26) are the only non-zero dimensions at the standard binarisation threshold — both below the medium-effect threshold, confirming measurement sensitivity.

Strengths

+ Effectively condition-invariant across framing dimensions
+ Highest battery accuracy in the VERRIX dataset
+ Equal quality in both conditions of every A/B scenario

Context

! h ≈ 0 is a finding, not a measurement failure
! Release-day fingerprint only, not a full Wave 1 evaluation
! VERRIX monitors for drift as the model matures

Deployment note

Release-day fingerprint, April 25, 2026. The accuracy jump (97.6% vs 37.6–43.5%) and h-flattening are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than systematically biased responses in one direction.

Full fingerprint →

Scenario validation

Tested in realistic advisory scenarios.

Beyond controlled experiments, we tested profiled AI advisors against professional advisory standards in 83 realistic client scenarios — the kind of multi-issue cases a financial advisor encounters daily.

50–86%

miss rate on Social Security timing

All five platforms failed to integrate claiming strategy into retirement advice the majority of the time.

+10 pp

deliberative advantage on complex cases

GPT-5.4 Thinking achieved 95% compliance on multi-domain scenarios vs 85% for GPT-5.3 Instant.

70–90%

range in triage accuracy

When clients present multiple issues, AI advisors range from 70% to 90% accuracy in identifying which problem should be addressed first.

Full scenario validation results →

Get in touch

Enterprise evaluation

API access (Q3 2026)

Research / academic

Licensing / commercial