AI models each have their own unique, measurable and predictable biases in how they offer financial advice.
VERRIX maps your model’s ‘advice genome’ — so you know which responses to trust, before they reach your clients. One output at a time.
Read the underlying research.
Two papers cover the empirical findings and the calibration model behind VERRIX Confidence. Both are free to download.
Paper 1 · April 2026
How AI Moves Markets
Systematic Biases and Behavioral Divergence in Major LLMs, and the Implications for Capital Flows and Governance
Six major AI platforms tested across 31 behavioral dimensions and 24,880 trials. Every model recommends well-known brands and large-cap stocks over financially identical alternatives. Each platform carries a distinct, predictable bias profile — an "advice genome" — with material implications for capital flows and regulatory oversight.
Paper 2 · April 2026
Using LLM Bias to Improve Financial Advice
How an understanding of model bias can be used to improve financial advice and investment decisions
A pre-registered, out-of-sample calibration model that scores any AI financial response as TRUST, REVIEW, or FLAG. AUC = 0.876, Brier = 0.105. No clearly wrong recommendation received a TRUST score across 423 TRUST classifications in UK, EU, and cross-jurisdictional validation.
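For readers less familiar with the two headline metrics, here is a minimal sketch of how a Brier score and an AUC are computed from predicted probabilities and binary outcomes. The data below is a hypothetical toy sample, not VERRIX data, and the function names are illustrative rather than taken from the paper.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome.

    Lower is better; 0.0 is a perfect score.
    """
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def auc(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Chance that a random positive case outscores a random negative case.

    Ties count as half a win. 0.5 is chance level; 1.0 is perfect ranking.
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy sample: predicted probability that a response is sound,
# and whether it actually was (1) or was not (0).
toy_probs = [0.9, 0.8, 0.6, 0.3, 0.2]
toy_outcomes = [1, 1, 0, 1, 0]

toy_brier = brier_score(toy_probs, toy_outcomes)
toy_auc = auc(toy_probs, toy_outcomes)
```

Brier rewards well-calibrated probabilities, while AUC rewards correct ranking, which is why a calibration model is usually reported with both.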
AI financial advisors are not neutral.
Each major model carries a measurable, predictable bias profile, which we call its advice genome. Across six platforms, 26 dimensions, and 4,495 scored responses, every model expresses structural preferences for established brands, technology sector exposure, and momentum amplification across market regimes. None of these biases is traceable to a human design decision. None is auditable through code review. All of them reach institutional capital through three channels (sanctioned enterprise tools, shadow AI usage, and embedded analytics) that existing AI governance and capability benchmarks do not measure.
VERRIX is the behavioral measurement layer.
Six platforms validated
GPT · Claude · Gemini
930+ TRUST classifications
Pre-registered, no clearly-wrong cases
Regulatory mapping
Reg BI · FCA Consumer Duty · MiFID II
When we gave every major AI platform two investment funds identical in returns, fees, and risk — and changed only the provider name — every model recommended the well-known brand. Consistently. By a large margin.
The recommendation came from the training data. Not from the financial analysis.
See the full Vanguard Effect finding →
One question. Three ways to answer it.
Should I trust this AI response before I act on it?
Pre-registered before data collection · 4,495 responses across 6 platforms × 65 US/UK/EU/Cross scenarios
AI financial advisors have fingerprints.
Systematic patterns in what they recommend — based on framing, brand, and presentation — not just on financial facts. These patterns appear regardless of model architecture, provider, or training approach.
Pre-registered · Download the paper (PDF) · 95% CIs via bootstrap (10,000 resamples)
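As a rough illustration of what "95% CIs via bootstrap (10,000 resamples)" means in practice, the sketch below computes a percentile bootstrap interval for a proportion. It assumes the simple percentile method, and the paper may use a different variant. The trial data and the `bootstrap_ci` helper are hypothetical.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def bootstrap_ci(
    data: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(fmean(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical trials: 1 = model picked the well-known brand, 0 = it did not.
trials = [1] * 34 + [0] * 6
observed_rate = fmean(trials)
ci_low, ci_high = bootstrap_ci(trials)
```

If the whole interval sits above 0.5, the brand preference in this toy sample is unlikely to be a coin-flip artifact.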
The Vanguard Effect
All models show preference for well-known fund providers
When presented with functionally identical investment options from well-known vs. lesser-known providers, all five models systematically recommended the recognized brand. Vanguard, Fidelity, and BlackRock receive disproportionate recommendations regardless of actual fund characteristics.
This "brand halo effect" may direct assets away from equivalent or superior products from smaller providers, potentially reducing competition and client returns.
Technology Sector Preference
Systematic over-weighting of technology investments
All models recommend higher allocations to technology stocks compared to equivalent investments in other sectors, even when fundamentals, valuations, and risk profiles are matched. This bias persists across bull, bear, and neutral market conditions.
At scale, AI advisors may amplify existing market concentration in tech, contributing to bubble dynamics and systematic risk for retail investors.
Time Horizon Adaptation
Perfect differentiation between short- and long-term advice
This is the largest effect in the study — a statistical ceiling. All models appropriately recommend more conservative allocations for short-term goals and more aggressive allocations for long-term horizons. This represents strong suitability alignment.
A positive finding: AI advisors correctly adapt risk recommendations to time horizon, a core regulatory requirement under FINRA Rule 2111 and MiFID II suitability standards.
Cost Disclosure Compliance
Universal ceiling effect on fee disclosure
When recommending investment products, all models proactively disclose fee information without being asked. This includes expense ratios, management fees, and trading costs. The effect approaches the statistical ceiling.
AI advisors meet SEC Reg BI Care Obligation requirements for cost disclosure. This suggests RLHF training has effectively embedded regulatory compliance for this dimension.
Six platforms. Six different profiles.
Each model carries systematic tendencies that shape its recommendations. Here is what we found.
GPT-5.3 Instant
“The Directive Optimist”
Direct recommendations with higher anchoring susceptibility
Most susceptible to arbitrary price anchors. When a scenario mentions "the stock was at $150 last month," recommendations shift significantly toward that reference point.
Heavily weights recent, memorable events. Media-salient information disproportionately influences recommendations.
Recommendations are influenced by the order in which options are listed. First-mentioned options receive a slight preference.
- +Most direct recommendations
- +Fast, decisive guidance
- +Clean regulatory compliance
- !Anchoring to mentioned prices
- !Tech sector over-allocation
- !Recency bias in volatile markets
Consumer-default model likely influencing largest retail volume. Anchoring bias suggests price-sensitive recommendations.
GPT-5.4 Thinking
“The Deliberative Calibrator”
Extended reasoning reduces some biases but introduces overconfidence
Extended reasoning leads to more definitive predictions. Expresses higher certainty than warranted by available information.
Chain-of-thought reasoning partially mitigates anchoring bias compared to GPT-5.3 Instant (effect size 35% smaller).
Shows stronger preference for US/Anglo-American markets compared to other models.
- +Deep analytical reasoning
- +Reduced heuristic biases
- +Well-calibrated probability statements
- !Overconfident predictions
- !Verbose output may bury key advice
- !Geographic concentration risk
Premium reasoning model for high-net-worth and advisory use cases. Geographic bias may amplify US equity flows.
Gemini 2.0 Flash
“The Consistent Optimist”
Most reliable patterns but strongest brand preferences
The highest brand bias in the study. Vanguard and major providers receive overwhelming preference over equivalent alternatives.
Most resistant to irrelevant contextual information. Maintains consistent recommendations across scenario variations.
Strong systematic preference for ETFs over mutual funds regardless of client tax situation or trading patterns.
- +Most consistent across scenarios
- +Reliable baseline compliance
- +Resistant to framing manipulation
- !Extreme brand concentration
- !May recommend Vanguard inappropriately
- !ETF bias regardless of suitability
Google ecosystem integration means broad consumer reach. Strongest brand bias creates predictable flow concentration.
Claude Sonnet 4.6
“The Cautious Contrarian”
Strongest regulatory focus with distinctive bias profile
Most consistent at identifying itself as AI and recommending human financial advisor consultation for complex decisions.
Paradoxically, shows strongest weighting of recent/memorable events despite Constitutional AI training.
Least technology sector preference among all models. More balanced sector recommendations.
- +Best regulatory compliance
- +Lowest sector concentration
- +Transparent about AI limitations
- !Higher refusal rate
- !May be overly cautious
- !Availability bias in volatile markets
Constitutional AI training yields highest compliance but paradoxical availability bias. Most balanced sector allocation.
GPT-5.2
“The Steady Traditionalist”
Hype-resistant with strong status quo and home market preferences
Strongest resistance to availability/recency bias among all models. Less swayed by recent market events or media-salient information.
Highest preference for maintaining current allocations. Recommends staying the course even when alternatives have equivalent fundamentals.
Strong preference for US/Anglo-American markets. International diversification recommendations are systematically lower.
- +Resistant to market hype
- +Stable recommendations during volatility
- +Lower brand preference bias
- !Status quo bias may prevent appropriate rebalancing
- !Strong home bias
- !Narrative susceptibility
Older GPT architecture shows a distinctive pattern: resistant to recency but anchored to the status quo. Lower brand bias than its successors.
GPT-5.5
“The Invariant”
h ≈ 0 across 24 of 26 dimensions — the most accurate model VERRIX has profiled
GPT-5.5 scored ~9/10 in both conditions across virtually every tested dimension — equal quality regardless of framing, anchoring, or ordering.
97.6% accuracy, the highest in the VERRIX validation battery, compared to 37.6–43.5% for prior generations.
C4 base-rate usage (h=−0.26) and C5 evidence updating (h=+0.26) are the only non-zero dimensions at the standard binarisation threshold — both below the medium-effect threshold, confirming measurement sensitivity.
- +Effectively condition-invariant across framing dimensions
- +Highest battery accuracy in the VERRIX dataset
- +Equal quality in both conditions of every A/B scenario
- !h≈0 is a finding, not a measurement failure
- !Release-day fingerprint only — not a full Wave 1 evaluation
- !VERRIX monitors for drift as the model matures
Release-day fingerprint, April 25, 2026. The accuracy jump (97.6% vs 37.6–43.5%) and h-flattening are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than systematically biased responses in one direction.
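The page does not define h. The sketch below assumes it is Cohen's h, the standard effect size for the difference between two proportions, with Cohen's conventional thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large). The proportions used are hypothetical, chosen only to show how a value near 0.26 can arise.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def magnitude(h: float) -> str:
    """Label |h| using Cohen's conventional thresholds."""
    size = abs(h)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical pass rates in condition A vs condition B of one A/B scenario.
h_same = cohens_h(0.90, 0.90)   # identical rates in both conditions
h_shift = cohens_h(0.63, 0.50)  # a modest gap between conditions
```

Under this reading, identical pass rates in both conditions give h = 0, which is what "condition-invariant" means, and a gap like the hypothetical one above lands in the small range, below the medium-effect threshold of 0.5.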
Tested in realistic advisory scenarios.
Beyond controlled experiments, we tested profiled AI advisors against professional advisory standards in 83 realistic client scenarios — the kind of multi-issue cases a financial advisor encounters daily.
50–86%
miss rate on Social Security timing
All five platforms failed to integrate claiming strategy into retirement advice at least half the time.
+10 pts
deliberative advantage on complex cases
GPT-5.4 Thinking achieved 95% compliance on multi-domain scenarios vs 85% for GPT-5.3 Instant.
70–90%
range in triage accuracy
When clients present multiple issues, AI advisors range from 70% to 90% accuracy in identifying which problem should be addressed first.