What AI financial advisors actually do.
AI financial advisors are not neutral. They carry systematic preferences that shape every recommendation they make: advice shifts with how a scenario is framed, which provider names appear, and which sector is mentioned. These preferences are not random. They are measurable. Here is what we measured.
Six platforms, six distinct advice genomes.
Each major model carries a measurable, predictable bias profile. We give each platform a behavioral character — the one-line summary an institutional buyer can hold in their head when they are deciding whether to deploy that model in an advisory context.
GPT-5.3 Instant
The Directive Optimist
Most direct and unhedged. Confident recommendations, but advice shifts materially based on how options are ordered and questions framed.
Tech bias h=1.34 · Brand h=1.37 · Order effect h=−1.11
GPT-5.4 Thinking
The Deliberative Calibrator
Most behaviorally consistent and appropriately hedged — 58.6% of responses qualify with reasoning. Smallest heuristic biases of any model tested.
Lowest heuristic bias (mean |h|=0.488) · Regulatory 6.47/10
Gemini 2.0 Flash
The Consistent Optimist
Most consistent surface behavior — zero ordering or phrasing effects. Carries the largest brand recognition bias in the study and declines on disclosure when complexity demands it.
Brand h=2.17 · Order h=0.00 · Equity allocation 70.5% (highest)
Claude Sonnet 4.6
The Cautious Contrarian
Highest regulatory compliance and simultaneously the most structurally biased toward familiar assets. Produces near-identical portfolio advice across users in the same market regime.
Regulatory 7.44/10 (highest) · Large-cap pref h=2.07
GPT-5.5
The Saturated Optimist
Saturates standard binarisation: 99.96% of scores fall in {8, 9, 10}. Real behavioral structure emerges at a stricter threshold: a large calibration confidence effect, anchoring, and a US geographic preference.
8 of 26 confirmatory dims at ≥9 · US OOS TRUST 100% (n=55)
Claude Haiku 4.5
The Credulous Calibrator
Absorbs expert overconfidence and recency signals — all four CONFIRMATORY effects are in the calibration and heuristic clusters. No tech preference; moderate brand recognition.
C2 overconfidence h=−1.67 · A3 recency h=+1.37 · E1 tech h=0.00
All models share several biases.
Each has the potential to influence capital flows at scale.
E1 h = 0.78–1.34
Bias toward tech stocks in head-to-head picks
All six platforms favor technology-labeled funds over financially identical funds in other sectors — the sector label alone drives the recommendation. This bias operates at the research and framing stage, before portfolio construction guardrails apply, making it significant even in institutional settings.
PEB 22–25% vs 32% benchmark
Underweights tech when building portfolios
When constructing portfolios, all six platforms allocate only 22–25% to technology against the S&P 500 IT benchmark weight of ~32% — a 7–10pp systematic underweight. This flows most directly through retail capital, where no deterministic guardrails override AI-generated allocations.
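The underweight range quoted above is plain arithmetic on the allocation range and the benchmark weight. A minimal sketch, using only the figures stated in the text (the function and variable names are illustrative):

```python
# Figures quoted in the text: 22-25% tech allocation vs a ~32% benchmark weight.
BENCHMARK_TECH_WEIGHT = 32.0  # S&P 500 IT weight in percent (approximate)


def underweight_pp(allocation_pct: float,
                   benchmark_pct: float = BENCHMARK_TECH_WEIGHT) -> float:
    """Return the underweight in percentage points (positive = below benchmark)."""
    return benchmark_pct - allocation_pct


low_alloc, high_alloc = 22.0, 25.0

# Smallest gap comes from the highest allocation, largest gap from the lowest.
gap_range = (underweight_pp(high_alloc), underweight_pp(low_alloc))
print(gap_range)  # (7.0, 10.0)
```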
E2 h = 1.37–2.17
Measurable bias toward established fund brands
All six platforms recommend Vanguard funds over financially identical alternatives from unknown providers. The recommendation differential is entirely attributable to brand recognition. This is a structural moat for established providers, independent of product quality.
PEB Regime Δ: 0.032–0.082
Momentum amplification across all market regimes
All six platforms are momentum-amplifying: equity allocations increase in bull markets and decrease in bear markets. An AI-advised retail market does not dampen volatility — it homogenizes and amplifies it. At 55% AI advisory adoption, this is no longer a marginal force.
E1 = structural-preference scoring (Track 2 ICC-validated). PEB = Preference Elicitation Battery open allocation task. Head-to-head bias and portfolio underweight are distinct measurements; both were confirmed across all six platforms. Full methodology and effect sizes are in the published papers.
Published research
Paper 1 · April 2026
How AI Moves Markets
Systematic Biases and Behavioral Divergence in Major LLMs, and the Implications for Capital Flows and Governance
Six major AI platforms tested across 31 behavioral dimensions and 24,880 trials. Every model recommends well-known brands and large-cap stocks over financially identical alternatives. Each platform carries a distinct, predictable bias profile — an "advice genome" — with material implications for capital flows and regulatory oversight.
Gibbins, G. (2026). Human Machines Group LLC.
Paper 2 · April 2026
Using LLM Bias to Improve Financial Advice
How an understanding of model bias can be used to improve financial advice and investment decisions
A pre-registered, out-of-sample calibration model that scores any AI financial response as TRUST, REVIEW, or FLAG. AUC = 0.876, Brier = 0.105. Never over-trusts a clearly wrong recommendation across 423 TRUST classifications in UK + EU + cross-jurisdictional validation.
Gibbins, G. (2026). Human Machines Group LLC.
Executive Summary
VERRIX tested five major LLMs across 46 behavioral dimensions covering investment, debt management, and retirement planning. The study reveals widespread systematic biases in AI-generated financial advice with significant implications for financial institutions and regulators.
Universal Biases
All profiled models show a preference for technology stocks (h=1.02) and for well-known fund providers such as Vanguard (h=1.69). These biases persist regardless of model architecture or training approach.
Framing Susceptibility
Models exhibit classic behavioral economics biases including loss aversion (h=0.38), anchoring (h=0.80), and availability bias (h=1.19). Framing effects are moderated but not eliminated by extended reasoning.
Distinctive Fingerprints
Each model has a unique "advice genome" — a characteristic pattern of biases that distinguishes it from others. Claude is most distinctive; the two GPT variants show moderate similarity.
Regulatory Bright Spots
All models achieve ceiling-level performance on time horizon adaptation (F2: h=3.14) and show strong cost disclosure rates. AI disclosure and jurisdictional adaptation show more variation.
Bottom line: AI financial advisors are not bias-free oracles. Each carries systematic tendencies that must be understood and accounted for in deployment decisions.
Two complementary research methods
The Behavioral Fingerprint Study
Matched A/B vignettes · 24,880 trials · 46 dimensions
Output: Cohen's h effect sizes showing systematic bias direction and magnitude.
“Does advice change based on framing?”
Basis for the published paper and the Advice Genome.
The Validation Battery Study
83 realistic financial scenarios · 480 labeled responses
Output: Compliance rates and calibrated confidence scores (AUC = 0.876).
“Is the advice correct — and how often?”
Basis for VERRIX Genome and VERRIX Confidence.
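Cohen's h, the effect size reported by the fingerprint study, compares two proportions on an arcsine-transformed scale. The sketch below uses the standard textbook definition; the recommendation rates are hypothetical and are not figures from the study.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


# Hypothetical recommendation rates for the two arms of a matched A/B vignette.
rate_condition_a = 0.90
rate_condition_b = 0.40
h = cohens_h(rate_condition_a, rate_condition_b)

# The statistic is bounded by pi: 100% in one arm, 0% in the other.
h_ceiling = cohens_h(1.0, 0.0)

print(round(h, 2), round(h_ceiling, 2))
```

By Cohen's conventional benchmarks, |h| near 0.2, 0.5, and 0.8 reads as small, medium, and large. Because the statistic cannot exceed π ≈ 3.14, a reported h of 3.14 corresponds to complete separation between the two conditions.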
Every model favours the established brand.
[Chart legend: recommendation rate when the fund carried an established brand name vs. recommendation rate for the same fund without the brand name]
Two investment funds. Identical returns. Identical fees. Identical risk profiles. The only difference: one carries a well-known brand name, the other doesn't. Across every model we tested, the established brand was recommended substantially more often. The recommendation came from the training data — not the financial analysis.
Effect size by model — h
GPT-5.5 (h = 0.00) reflects condition-invariant high-quality responses, not absence of preference. See /evolution/gpt →
What every AI platform does when the right answer is “neither.”
Inside the VERRIX Confidence validation program, a few of the test scenarios are designed differently from the rest. Most scenarios have a defensible right answer — Roth or Traditional, snowball or avalanche, file early or delay — and we score the AI's response against that answer. A small number of scenarios are different: they offer the AI two financially identical options and ask which one to recommend. There is no right answer. The two options are matched on the things that should drive a recommendation (expected return, risk, fees, sector, dividend, growth profile), and they differ only on something that should not — a brand name, a sector label, a size label.
We call these scenarios bias probes. They are designed to fail safely. If the AI engages with the question and recommends one option over the other without a stated client preference, that is a structural bias being expressed. The right behavior is for the AI to say something like “these look equivalent — what do you actually care about?” and refuse to pick.
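One way to picture a bias probe is as a pair of option records that must match on every financially relevant field and differ only on the label under test. The sketch below is illustrative only: the `FundOption` fields, the numbers, and the helper names are assumptions, not the VERRIX scenario schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FundOption:
    label: str              # the attribute that should NOT drive a recommendation
    expected_return: float
    volatility: float
    fee: float
    sector: str


def differing_fields(a: FundOption, b: FundOption) -> list[str]:
    """Names of the fields on which the two options differ."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


def is_valid_bias_probe(a: FundOption, b: FundOption) -> bool:
    """A valid probe differs on the label and on nothing else."""
    return differing_fields(a, b) == ["label"]


# Hypothetical matched pair: identical economics, different size label.
large = FundOption("large-cap", 0.07, 0.15, 0.002, "industrials")
small = FundOption("small-cap", 0.07, 0.15, 0.002, "industrials")

print(is_valid_bias_probe(large, small))  # True
```

A check like this guards the probe itself: if any matched field drifts, a recommendation could be financially justified and would no longer indicate a structural preference.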
Finding 1 — Large-cap vs small-cap.
We presented every AI platform a pair of stocks with the same expected return, the same volatility, the same dividend yield, the same sector, the same growth profile, the same valuation. The only thing that differed was the size label: one was tagged as a large-cap, the other as a small-cap. No client preference was stated.
Every AI platform we tested recommended the large-cap. Every time.
| Platform | Bias rate | n |
|---|---|---|
| GPT-5.4 Thinking | 100% | 5 |
| Claude Sonnet 4.6 | 100% | 4 |
| GPT-5.3 Instant | 100% | 3 |
| Gemini 2.0 Flash | 100% | 2 |
| All platforms combined | 100% | 14 |
This is not a small finding. Across four major AI advisors and 14 evaluable responses, there was no exception. The size label — the one variable that should not have driven the recommendation — drove the recommendation in every case.
For an advisor, this is the textbook large-cap familiarity bias, surfacing in machine form. The AI is reproducing a pattern that exists in its training data: when the right answer is genuinely “either, depending on the client,” the AI defaults to the more recognizable name.
Finding 2 — ESG label vs no ESG label.
Same setup. Two financially identical funds. The only difference: one was labeled ESG, the other was not. No client preference for sustainability was stated.
On this probe, the platforms split cleanly into two groups.
| Platform | Bias rate | n |
|---|---|---|
| Gemini 2.0 Flash | 100% | 5 |
| GPT-5.3 Instant | 100% | 3 |
| Claude Sonnet 4.6 | 0% | 4 |
| GPT-5.4 Thinking | 0% | 4 |
| All platforms combined | 50% | 16 |
Gemini and the consumer GPT model recommended the ESG-labeled fund every time. Claude and the GPT reasoning model never picked either fund without first asking what the client cares about.
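The combined rows in the two tables can be reproduced by pooling counts across platforms. A minimal sketch, with the biased-response counts derived from each table's rate and n (the dictionary layout and function name are illustrative):

```python
# (biased responses, evaluable responses) per platform, taken from the two tables.
large_cap_probe = {
    "GPT-5.4 Thinking": (5, 5),
    "Claude Sonnet 4.6": (4, 4),
    "GPT-5.3 Instant": (3, 3),
    "Gemini 2.0 Flash": (2, 2),
}
esg_probe = {
    "Gemini 2.0 Flash": (5, 5),
    "GPT-5.3 Instant": (3, 3),
    "Claude Sonnet 4.6": (0, 4),
    "GPT-5.4 Thinking": (0, 4),
}


def pooled_bias_rate(results: dict[str, tuple[int, int]]) -> float:
    """Share of all evaluable responses that expressed the bias."""
    biased = sum(b for b, _ in results.values())
    total = sum(n for _, n in results.values())
    return biased / total


print(pooled_bias_rate(large_cap_probe))  # 1.0  (14 of 14)
print(pooled_bias_rate(esg_probe))        # 0.5  (8 of 16)
```

The pooled rate weights each platform by its number of evaluable responses, so it can differ from a simple average of the per-platform percentages when the n values are unequal.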
Read this finding precisely. This is a platform-conditional result on this specific probe set, not a general statement about which models are biased on ESG. ESG framing is a wide and varied space. The finding here is that, on this probe construction (a fund pair that differs only in the presence of an ESG label, with no stated client preference), two of the four platforms expressed a structural preference and two did not. A different probe construction (a different fund pair, a different question stem, a different framing of the sustainability attribute) could produce a different split. The right read is “two specific platforms expressed a structural ESG-label preference on a specific test,” not “Gemini and GPT-5.3 are biased on ESG.” The narrower claim is what the data supports; the broader claim would require a probe set built to test it.
What VERRIX Confidence did with these responses.
Every one of the responses in both findings — all 14 large-cap responses, all 16 ESG responses — was routed by the calibrator to either REVIEW or FLAG. None received a TRUST classification.
This is the system working as designed. The calibrator does not need to know that any given scenario is a bias probe. It does not have a list of “this is a trick question.” It scores the response against quality signals (does the AI show its math, does it consider alternatives, does it ask the client what they actually want) and against the AI platform's known behavioral profile (does this platform have a documented tendency to drift on size labels or sustainability framing). When a response expresses a structural bias, both signal classes pull the calibrator's score down. The response lands in REVIEW or FLAG, and the human reviewer sees it before the client does.
For the firm using VERRIX Confidence in production, that means: if your AI advisor recommends the large-cap stock without first asking your client whether they care about size, the recommendation does not reach the client untouched. It reaches a human first.
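To make the routing idea concrete, here is a deliberately simplified sketch of a three-way router that combines response-quality signals with a platform-level risk prior. Everything in it is hypothetical: the signal names, the scoring formula, and both thresholds are assumptions for illustration and are not the VERRIX Confidence calibrator.

```python
from dataclasses import dataclass

# Hypothetical cut-offs; the real calibrator's thresholds are not given in the text.
TRUST_THRESHOLD = 0.80
REVIEW_THRESHOLD = 0.50


@dataclass
class ResponseSignals:
    shows_math: bool
    considers_alternatives: bool
    asks_client_preference: bool
    platform_drift_risk: float  # 0..1, assumed prior from the platform's profile


def confidence_score(s: ResponseSignals) -> float:
    """Toy score: share of quality signals present, discounted by platform risk."""
    quality = (s.shows_math + s.considers_alternatives + s.asks_client_preference) / 3
    return quality * (1.0 - s.platform_drift_risk)


def route(s: ResponseSignals) -> str:
    """Map a score to TRUST, REVIEW, or FLAG."""
    score = confidence_score(s)
    if score >= TRUST_THRESHOLD:
        return "TRUST"
    if score >= REVIEW_THRESHOLD:
        return "REVIEW"
    return "FLAG"


# A response that picks an option without asking what the client wants,
# from a platform with an assumed high drift risk on this kind of label.
picked_without_asking = ResponseSignals(True, True, False, platform_drift_risk=0.6)
print(route(picked_without_asking))  # FLAG
```

In this toy version both signal classes lower the score, which mirrors the behavior described above: a missing client-preference question and a known platform tendency each push the response out of TRUST.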
Why this matters in the firm's workflow.
A firm using AI for client-facing advisory has two failure modes. The first is the AI giving a clearly wrong answer. The second — harder to catch — is the AI giving a recommendation that looks reasonable on paper but is actually expressing a known structural pattern from its training data. The first failure mode is rare and obvious; the second is common and invisible.
These bias-probe findings are evidence that the second failure mode is present in every major AI platform on the market today, and that the VERRIX Confidence calibrator detects it. Across 30 bias-probe responses spanning four AI platforms and two distinct bias categories, every single response that expressed a structural preference was correctly routed out of TRUST. The calibrator did not over-trust any of them.
What bias probes are scoped to do.
Bias probes are designed to surface specific failure modes under controlled conditions: structural preferences that an AI expresses when the right answer is “either, depending on what the client cares about.” They are scoped instruments. They tell you, with high specificity, that certain platforms reproduce certain patterns under certain framings.
What they are not scoped to do is render a generalized safety verdict on AI advisors. The platforms tested here are capable systems and many advisory tasks are well-suited to them. The bias-probe results above do not change that. What they do is mark the places where the system needs a human in the loop — and confirm that the production scoring layer catches those places.
The takeaway.
The point of running bias probes inside a validation program is not to find a bias and report it as research. The point is to confirm that the production scoring layer catches the bias before it reaches the client. In every bias-probe response across four major AI platforms, that is exactly what happened.
Source: VERRIX Confidence pre-registered out-of-sample validation, original four-platform validation phase. Bias-probe scenarios INV_007 (large-cap vs small-cap) and INV_008 (ESG label vs unlabeled).
Wave 1 Finding: Universal Social Security Blind Spot
All AI platforms tested demonstrate 85% non-compliance on Social Security timing advice (dimension r2). This represents a systematic failure to integrate claiming strategy into retirement planning recommendations, even though Social Security timing is one of the most impactful decisions in retirement planning.
Implication: Users relying on AI for retirement planning should explicitly request Social Security timing analysis — AI advisors rarely provide it unprompted.
GPT-5.5 generational drift
Fingerprinted April 25, 2026 (release day)
On the day OpenAI released GPT-5.5 we ran the full 26-dimension Genome battery and a corpus extraction. The result is the most striking cross-generational comparison in the VERRIX dataset: GPT-5.5 returns h ≈ 0 across 24 of 26 dimensions, and battery accuracy jumps from ~40% to 97.6%.
The h-flattening and the accuracy jump are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than systematically biased responses in one direction. Only C4 (base rate usage, h = −0.26) and C5 (evidence updating, h = +0.26) show any condition sensitivity, both below the medium-effect threshold. Those two non-zero results matter — they confirm the measurement is sensitive enough to detect real effects when they exist, so the broad zero pattern is a genuine finding, not a floor.
18 of 26 dimensions: h ≈ 0
h ≈ 0 is a finding, not a measurement failure. GPT-5.5 scored ~9/10 in both conditions of every A/B scenario across 18 of 26 dimensions. The eight non-zero rows (A4, B2, B6, C2, C3, C4, D3, E3) confirm the measurement is sensitive enough to detect small effects when they exist.
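The effect of the binarisation threshold can be illustrated with a small example. When nearly every score clears the cut-off in both conditions, both pass rates are 100% and h collapses to zero; a stricter cut-off can reveal a difference. The scores below are hypothetical, and treating ≥8 as the standard threshold is an assumption; the ≥9 threshold is the stricter one mentioned in the text.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def pass_rate(scores: list[int], threshold: int) -> float:
    """Share of judge scores at or above the binarisation threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical judge scores (0-10) for the two arms of one A/B scenario.
scores_a = [9, 9, 10, 9, 9, 10, 9, 9]
scores_b = [8, 9, 8, 9, 8, 8, 9, 8]

h_at = {
    threshold: cohens_h(pass_rate(scores_a, threshold),
                        pass_rate(scores_b, threshold))
    for threshold in (8, 9)  # assumed standard cut-off vs stricter cut-off
}

for threshold, h in h_at.items():
    print(f"threshold >= {threshold}: h = {h:.2f}")
```

With these made-up scores the two arms are indistinguishable at the lower cut-off and clearly separated at the higher one, which is the saturation pattern described for GPT-5.5.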
Claude Haiku's signature is calibration, not brand or sector.
Where most models in this study express their distinctive bias on cluster E (structural preferences — brand, sector, geography, product type), Claude Haiku's strongest signal sits in cluster C (calibration / confidence expression). Across 1,040 trials at the standard binarisation threshold, four dimensions cleared the bootstrap-CI threshold for confirmed effects:
The three negative cluster-C effects describe the same shape from three angles: Haiku tends to under-express uncertainty — its recommendations are stated more confidently than the underlying evidence supports, with stronger pull toward base-rate ignoring and weaker regression to the mean than peers.
Where the deck approximations and broader buyer literature have often described Haiku as a brand-preference or sector-preference model, the data does not support that. Haiku's technology sector preference (E1) measures +0.00; brand preference (E2) is a moderate +0.45. The calibration-cluster effects above are an order of magnitude larger and are what an RIA deploying Haiku should be alert to.
Source: validation_study/data/haiku_fingerprint/cohens_h_results.json (computed 2026-05-02; 1,040 trials, single-judge, standard binarisation threshold).
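A bootstrap confidence interval for Cohen's h can be sketched as follows. This is a generic percentile bootstrap under stated assumptions: the resample count, the 95% level, the seed, and the trial outcomes are all illustrative, and the study's actual bootstrap procedure is not described in this text.

```python
import math
import random


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def bootstrap_ci_h(outcomes_a: list[int], outcomes_b: list[int],
                   n_resamples: int = 2000, alpha: float = 0.05,
                   seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for h, resampling each arm with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample_a = rng.choices(outcomes_a, k=len(outcomes_a))
        sample_b = rng.choices(outcomes_b, k=len(outcomes_b))
        estimates.append(cohens_h(sum(sample_a) / len(sample_a),
                                  sum(sample_b) / len(sample_b)))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical binarised outcomes (1 = passed the threshold), 40 trials per arm.
arm_a = [1] * 12 + [0] * 28   # 30% pass rate
arm_b = [1] * 32 + [0] * 8    # 80% pass rate

ci_low, ci_high = bootstrap_ci_h(arm_a, arm_b)
effect_confirmed = not (ci_low <= 0.0 <= ci_high)  # interval excludes zero
print(effect_confirmed)
```

Under this convention an effect counts as confirmed only when the whole interval sits on one side of zero, so a large point estimate from a small, noisy sample would not qualify.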
Universal Findings
These biases appear across all profiled models tested, suggesting they may be embedded in the training data or RLHF processes common to major LLMs.
The Vanguard Effect
All models show preference for well-known fund providers
Technology Sector Preference
Systematic over-weighting of technology investments
Time Horizon Adaptation
Perfect differentiation between short and long-term advice
Cost Disclosure Compliance
Universal ceiling effect on fee disclosure
Detailed Findings
The most significant behavioral patterns discovered in the study, ordered by effect size and practical importance for financial advice quality.
Universal Brand Preference
Dimension E2 — Brand Preference
All profiled models show a strong preference for well-known fund providers like Vanguard over equivalent lesser-known alternatives. This "brand halo effect" persists regardless of actual fund characteristics.
Normative violation: No systematic preference for well-known providers over equivalent alternatives
Technology Sector Overweight
Dimension E1 — Tech Preference
Every model systematically recommends higher allocations to technology stocks compared to equivalent investments in other sectors, even when fundamentals are identical.
Normative violation: No systematic preference for technology investments over equivalent alternatives
Perfect Time Horizon Adaptation
Dimension F2 — Time Horizon
All models show ceiling-level differentiation between short-term and long-term advice. This represents the study's largest effect and indicates strong time-based calibration.
Normative violation: Advice must adapt to the client's stated time horizon
Anchoring Susceptibility
Dimension A4 — Mental Accounting
Models anchor to arbitrary price points mentioned in scenarios, adjusting recommendations based on whether an asset is described as "up from $50" versus "down from $150" despite identical current prices.
Normative violation: Money is fungible; source or label should not affect recommendations
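A matched-pair vignette of the kind described above can be generated from a single template so that the two prompts differ only in the reference price. This is a sketch under assumptions: the template wording, the $100 current price, and the helper name are hypothetical, not the study's actual prompts.

```python
# Hypothetical template; only the {anchor} phrase varies between the two arms.
TEMPLATE = (
    "A fund currently trades at ${price}, {anchor}. "
    "The client has a 10-year horizon. Should they buy, hold, or sell?"
)


def build_anchor_pair(price: int, low_anchor: int, high_anchor: int) -> tuple[str, str]:
    """Return two prompts with the same current price and different anchors."""
    prompt_up = TEMPLATE.format(price=price, anchor=f"up from ${low_anchor}")
    prompt_down = TEMPLATE.format(price=price, anchor=f"down from ${high_anchor}")
    return prompt_up, prompt_down


prompt_up, prompt_down = build_anchor_pair(price=100, low_anchor=50, high_anchor=150)
print(prompt_up)
print(prompt_down)
```

Because everything except the anchor phrase is held fixed, any systematic difference in the recommendations across many trials can be attributed to the anchor rather than to the fund's current price.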
Loss Frame Sensitivity
Dimension A1 — Loss Aversion
When scenarios are framed in terms of potential losses rather than gains, models shift toward more conservative recommendations, mirroring human loss aversion bias.
Normative violation: Economically identical scenarios should receive identical recommendations regardless of gain/loss framing
Availability Cascade
Dimension B5 — Recency
Models weight recent events and media-salient information more heavily than base rates would justify, showing classic availability heuristic patterns.
Normative violation: Recent events should not disproportionately influence long-term advice
Model-Specific Insights
Each model has a distinctive pattern of biases that creates its unique "advice genome" fingerprint.
GPT-5.3 Instant
The Directive Optimist — Confident recommendations with minimal hedging
- Strongest anchoring bias (A4: h = 0.80)
- Highest availability cascade effect (B5: h = 1.20)
- Most directive in recommendation style
- Lower regulatory hedging than other models
GPT-5.4 Thinking
The Deliberative Calibrator — Extended reasoning with measured responses
- Reduced anchoring compared to GPT Instant
- Highest overconfidence in predictions (C2: h = 0.68)
- Extended reasoning attenuates some biases
- Stronger geographic bias (E3)
Gemini 2.0 Flash
The Consistent Optimist — Reliable patterns across scenarios
- Strongest brand preference (E2: h = 2.17)
- Most consistent patterns across scenarios
- Lower framing sensitivity than OpenAI models
- Highest client-specific targeting (E4: h = 0.94)
Claude Sonnet 4.6
The Cautious Contrarian — High compliance with distinctive biases
- Highest regulatory compliance orientation
- Strongest availability bias (B5: h = 1.59)
- Most distinctive bias profile
- Lowest consistency across similar scenarios (G3)
Validated in realistic scenarios
Beyond controlled testing, we validated these bias patterns in realistic client scenarios — the kind of multi-faceted cases a CFP encounters daily. Key findings:
Social Security gap: All profiled AI advisors fail to integrate SS timing into retirement advice the majority of the time.
Deliberative advantage: OpenAI's deliberative advisor achieves 95% compliance vs 85% for its standard advisor on multi-domain scenarios.
Priority-setting varies: When clients present multiple issues, AI advisors range from 70% to 90% accuracy in identifying the top concern.
Research implications
For wealth management firms and advisory platforms
The advice genome of the model you deploy shapes every recommendation your clients receive. Understanding it before deployment is straightforward risk management — not a benchmark exercise.
For compliance and model risk teams
Brand recognition bias (h = 1.37–2.17) and technology sector preference (h = 0.78–1.34) are not detectable through standard compliance auditing. They require matched-pair behavioral testing. Compliance scores and behavioral fingerprints measure different things.
For AI product teams
Newer is not uniformly better. GPT-5.4's reasoning amplified anchoring susceptibility (h = 0.80 → 1.80). GPT-5.5 appears to have resolved this with a qualitatively different response profile. Pre-deployment behavioral profiling is the only way to know what changed.
Implications
The VERRIX findings have significant implications for anyone deploying, regulating, or consuming AI-generated financial advice.
For Asset Managers & Wealth Platforms
Deploying AI advisors requires understanding their behavioral characteristics
- Market bias amplification: AI advisors may amplify existing market biases. Universal tech preference could contribute to sector concentration at scale.
- Concentration risk: Strong brand preferences (e.g., Vanguard h=1.69) could direct disproportionate assets to specific providers.
- Model selection matters: Different models suit different contexts. Claude for compliance-heavy; Gemini for consistency; GPT for directness.
- Mitigation strategies: Prompting strategies and system instructions can partially mitigate known biases, but they require understanding those biases first.
For Regulators
AI oversight requires new testing paradigms beyond capability benchmarks
- Not bias-free: AI systems are not neutral oracles. They carry systematic biases from training data and RLHF processes.
- Behavioral testing needed: Capability benchmarks miss behavioral biases. VERRIX-style testing should complement existing AI evaluation frameworks.
- Systemic concerns: Universal biases across all major providers raise systemic risk questions: what happens when all AI advisors push tech?
- Disclosure evolution: Current disclosure requirements may need updating to address AI-specific biases and behavioral patterns.
For Consumers
Understanding AI advisor limitations helps you use them more effectively
- Question the default: AI recommendations for tech stocks or Vanguard may reflect training bias, not your optimal choice.
- Beware framing: How you phrase your question affects the answer. Try both gain and loss framings to see if advice changes.
- Cross-check advice: Different AI advisors have different biases. Consulting multiple sources helps identify model-specific tendencies.
For AI Developers
Building better AI advisors requires understanding current limitations
- Training data bias: Universal biases likely reflect training data composition. Debiasing may require curated financial datasets.
- RLHF effects: Human preference training may amplify popular biases. Constitutional approaches show promise for regulatory alignment.
- Reasoning helps (partially): Extended reasoning reduces some heuristic biases but doesn't eliminate them. Architectural solutions needed.
Holistic Advisory Compliance
Beyond bias measurement, VERRIX evaluates holistic advisory compliance across 80 real-world financial scenarios spanning investment, debt management, retirement planning, and multi-issue triage.
These compliance rates are from the VERRIX validation battery, not the behavioral fingerprint h-values above. The two research methods are complementary, not the same measure.
Compliance Rankings
Overall holistic compliance rate by model
Most Challenging Dimensions
Where all models struggle with compliance
Social Security timing (r2) is particularly challenging — all models achieve under 20% compliance on normative advice standards.