Pre-registered study · Two papers · April 2026

What AI financial advisors actually do.

AI financial advisors are not neutral. They carry systematic preferences that shape every recommendation they make — based on how a scenario is framed, which provider names appear, which sector is mentioned. These preferences are not random. They are measurable. Here is what we measured.

Advice genomes

Six platforms, six distinct advice genomes.

Each major model carries a measurable, predictable bias profile. We give each platform a behavioral character — the one-line summary an institutional buyer can hold in their head when they are deciding whether to deploy that model in an advisory context.

GPT-5.3 Instant

The Directive Optimist

Most direct and unhedged. Confident recommendations, but advice shifts materially based on how options are ordered and questions framed.

Tech bias h=1.34 · Brand h=1.37 · Order effect h=−1.11

GPT-5.4 Thinking

The Deliberative Calibrator

Most behaviorally consistent and appropriately hedged — 58.6% of responses qualify with reasoning. Smallest heuristic biases of any model tested.

Lowest heuristic bias (mean |h|=0.488) · Regulatory 6.47/10

Gemini 2.0 Flash

The Consistent Optimist

Most consistent surface behavior — zero ordering or phrasing effects. Carries the largest brand recognition bias in the study and declines on disclosure when complexity demands it.

Brand h=2.17 · Order h=0.00 · Equity allocation 70.5% (highest)

Claude Sonnet 4.6

The Cautious Contrarian

Highest regulatory compliance and simultaneously the most structurally biased toward familiar assets. Produces near-identical portfolio advice across users in the same market regime.

Regulatory 7.44/10 (highest) · Large-cap pref h=2.07

GPT-5.5

The Saturated Optimist

Saturates standard binarisation — 99.96% of scores in {8, 9, 10}. Real behavioral structure emerges at stricter threshold: large calibration confidence effect, anchoring, and US geographic preference.

8 of 26 confirmatory dims at ≥9 · US OOS TRUST 100% (n=55)
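The saturation described for GPT-5.5 can be illustrated with a small sketch. It assumes judge scores on a 1 to 10 scale that are turned into pass/fail at a threshold before Cohen's h is computed. The scores are invented, and the "standard" cut of 8 is an assumption made for illustration; only the stricter ≥9 cut is named on this page.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def pass_rate(scores, threshold):
    """Share of 1-10 judge scores at or above the binarisation threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Invented judge scores for one A/B scenario; not study data.
condition_a = [9, 10, 9, 9, 10, 9, 8, 9, 10, 9]
condition_b = [8, 9, 8, 9, 8, 8, 9, 8, 9, 8]

def h_at(threshold: int) -> float:
    """Effect size between the two conditions at a given binarisation cut."""
    return cohens_h(pass_rate(condition_a, threshold),
                    pass_rate(condition_b, threshold))

# Assumed "standard" cut of 8: every score passes in both conditions, so h = 0.
# Stricter cut of 9: the conditions separate and a non-zero h appears.
for threshold in (8, 9):
    print(threshold, round(h_at(threshold), 2))
```

When nearly every score clears the cut in both conditions, the two pass rates are identical and the effect size collapses to zero regardless of any difference higher up the scale.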

Claude Haiku 4.5

The Credulous Calibrator

Absorbs expert overconfidence and recency signals — all four CONFIRMATORY effects are in the calibration and heuristic clusters. No tech preference; moderate brand recognition.

C2 overconfidence h=−1.67 · A3 recency h=+1.37 · E1 tech h=0.00

What we found

All models share several biases.

Each has the potential to influence capital flows at scale.

E1 h = 0.78–1.34

Bias toward tech stocks in head-to-head picks

All six platforms favor technology-labeled funds over financially identical funds in other sectors — the sector label alone drives the recommendation. This bias operates at the research and framing stage, before portfolio construction guardrails apply, making it significant even in institutional settings.

PEB 22–25% vs 32% benchmark

Underweights tech when building portfolios

When constructing portfolios, all six platforms allocate only 22–25% to technology against the S&P 500 IT benchmark weight of ~32% — a 7–10pp systematic underweight. This flows most directly through retail capital, where no deterministic guardrails override AI-generated allocations.
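The underweight arithmetic is simple enough to check directly. The sketch below uses only the figures quoted above (22–25% allocation, ~32% benchmark); the function name is ours.

```python
BENCHMARK_TECH_WEIGHT = 0.32  # approximate S&P 500 IT weight quoted above

def underweight_pp(allocation: float, benchmark: float = BENCHMARK_TECH_WEIGHT) -> float:
    """Benchmark weight minus model allocation, in percentage points."""
    return round((benchmark - allocation) * 100, 1)

for alloc in (0.22, 0.25):
    print(f"{alloc:.0%} allocation: {underweight_pp(alloc)} pp underweight")
```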

E2 h = 1.37–2.17

Measurable bias toward established fund brands

All six platforms recommend Vanguard funds over financially identical alternatives from unknown providers. The recommendation differential is entirely attributable to brand recognition. This is a structural moat for established providers, independent of product quality.

PEB Regime Δ: 0.032–0.082

Momentum amplification across all market regimes

All six platforms are momentum-amplifying: equity allocations increase in bull markets and decrease in bear markets. An AI-advised retail market does not dampen volatility — it homogenizes and amplifies it. At 55% AI advisory adoption, this is no longer a marginal force.
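One plausible reading of the Regime Δ figure is the gap between the average equity share a platform recommends under bull-market framing and under bear-market framing. That definition is an assumption made here for illustration (the papers carry the authoritative one), and the allocations below are invented.

```python
from statistics import mean

def regime_delta(bull_equity, bear_equity):
    """Mean equity share under bull framing minus mean share under bear framing."""
    return mean(bull_equity) - mean(bear_equity)

# Invented equity allocations (fractions of portfolio); not study data.
bull = [0.66, 0.68, 0.70]
bear = [0.62, 0.63, 0.64]

delta = regime_delta(bull, bear)
# Under this reading, a positive delta means the model adds equity exposure
# when markets are rising, which is the momentum-amplifying direction.
print(round(delta, 3), "momentum-amplifying" if delta > 0 else "dampening or neutral")
```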

E1 = structural-preference scoring (Track 2 ICC-validated). PEB = Preference Elicitation Battery open allocation task. Head-to-head bias and portfolio underweight are distinct measurements — both confirmed across all six platforms. Full methodology and effect sizes in the papers above.

Published research

Paper 1 · April 2026

How AI Moves Markets

Systematic Biases and Behavioral Divergence in Major LLMs, and the Implications for Capital Flows and Governance

Six major AI platforms tested across 31 behavioral dimensions and 24,880 trials. Every model recommends well-known brands and large-cap stocks over financially identical alternatives. Each platform carries a distinct, predictable bias profile — an "advice genome" — with material implications for capital flows and regulatory oversight.

Gibbins, G. (2026). Human Machines Group LLC.

Download PDF · 748 KB

Paper 2 · April 2026

Using LLM Bias to Improve Financial Advice

How an understanding of model bias can be used to improve financial advice and investment decisions

A pre-registered, out-of-sample calibration model that scores any AI financial response as TRUST, REVIEW, or FLAG. AUC = 0.876, Brier = 0.105. Never over-trusts a clearly wrong recommendation across 423 TRUST classifications in UK + EU + cross-jurisdictional validation.

Gibbins, G. (2026). Human Machines Group LLC.

Download PDF · 270 KB

Executive Summary

VERRIX tested five major LLMs across 46 behavioral dimensions covering investment, debt management, and retirement planning. The study reveals widespread systematic biases in AI-generated financial advice with significant implications for financial institutions and regulators.

Universal Biases

All profiled models show preference for technology stocks (h=1.02) and well-known fund providers like Vanguard (h=1.69). These biases persist regardless of model architecture or training approach.

Framing Susceptibility

Models exhibit classic behavioral economics biases including loss aversion (h=0.38), anchoring (h=0.80), and availability bias (h=1.19). Framing effects are moderated but not eliminated by extended reasoning.

Distinctive Fingerprints

Each model has a unique "advice genome" — a characteristic pattern of biases that distinguishes it from others. Claude is most distinctive; the two GPT variants show moderate similarity.

Regulatory Bright Spots

All models achieve ceiling-level performance on time horizon adaptation (F2: h=3.14) and show strong cost disclosure rates. AI disclosure and jurisdictional adaptation show more variation.

Bottom line: AI financial advisors are not bias-free oracles. Each carries systematic tendencies that must be understood and accounted for in deployment decisions.

Two complementary research methods

The Behavioral Fingerprint Study

Matched A/B vignettes · 24,880 trials · 46 dimensions

Output: Cohen's h effect sizes showing systematic bias direction and magnitude.

“Does advice change based on framing?”

Basis for the published paper and the Advice Genome.
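For readers unfamiliar with the statistic, Cohen's h compares two proportions after an arcsine transform, so the same raw gap counts for more near 0% or 100% than near 50%. The sketch below is the standard textbook formula, not code from the study, and the 90% and 30% rates are invented. It also shows why the ceiling value reported on this page is 3.14: when one condition is at 100% and the other at 0%, h equals π.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Invented rates: share of trials recommending the option under condition A
# (for example a branded fund) versus condition B (the unbranded twin).
print(round(cohens_h(0.90, 0.30), 2))

# Largest possible value: 100% in one condition, 0% in the other.
print(round(cohens_h(1.00, 0.00), 2))  # 3.14
```

The sign carries direction: a negative h means the recommendation rate was higher in the second condition.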

The Validation Battery Study

83 realistic financial scenarios · 480 labeled responses

Output: Compliance rates and calibrated confidence scores (AUC = 0.876).

“Is the advice correct — and how often?”

Basis for VERRIX Genome and VERRIX Confidence.
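The AUC and Brier figures quoted for the calibration model are standard metrics. As a rough guide to what they measure, here is a minimal sketch on invented data: Brier is the mean squared gap between the stated confidence and the 0/1 outcome (lower is better), and AUC is the chance that a randomly chosen correct response received a higher confidence than a randomly chosen incorrect one. The function and variable names are ours.

```python
def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def auc(probs, outcomes):
    """Probability a random positive outranks a random negative; ties count half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Invented confidence scores and correctness labels; not study data.
confidence = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
correct = [1, 1, 0, 1, 0, 0]

print(round(auc(confidence, correct), 3))
print(round(brier_score(confidence, correct), 3))
```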

Universal finding

Every model favours the established brand.

Recommendation rate when the fund carried an established brand name

Recommendation rate for the same fund without the brand name

Two investment funds. Identical returns. Identical fees. Identical risk profiles. The only difference: one carries a well-known brand name, the other doesn't. Across every model we tested, the established brand was recommended substantially more often. The recommendation came from the training data — not the financial analysis.

Effect size by model — h

GPT-5.5 · 0.00
GPT-5.2 · 0.59
GPT-5.3 Instant · 1.37
Claude Sonnet 4.6 · 1.54
GPT-5.4 Thinking · 1.69
Gemini 2.0 Flash · 2.17

GPT-5.5 (h = 0.00) reflects condition-invariant high-quality responses, not absence of preference. See /evolution/gpt →

Validation finding · Bias probes · Pre-registered

What every AI platform does when the right answer is “neither.”

Inside the VERRIX Confidence validation program, a few of the test scenarios are designed differently from the rest. Most scenarios have a defensible right answer — Roth or Traditional, snowball or avalanche, file early or delay — and we score the AI's response against that answer. A small number of scenarios are different: they offer the AI two financially identical options and ask which one to recommend. There is no right answer. The two options are matched on the things that should drive a recommendation (expected return, risk, fees, sector, dividend, growth profile), and they differ only on something that should not — a brand name, a sector label, a size label.

We call these scenarios bias probes. They are designed to fail safely. If the AI engages with the question and recommends one option over the other without a stated client preference, that is a structural bias being expressed. The right behavior is for the AI to say something like “these look equivalent — what do you actually care about?” and refuse to pick.

Finding 1 — Large-cap vs small-cap.

We presented every AI platform a pair of stocks with the same expected return, the same volatility, the same dividend yield, the same sector, the same growth profile, the same valuation. The only thing that differed was the size label: one was tagged as a large-cap, the other as a small-cap. No client preference was stated.

Every AI platform we tested recommended the large-cap. Every time.

Platform · Bias rate · n
GPT-5.4 Thinking · 100% · 5
Claude Sonnet 4.6 · 100% · 4
GPT-5.3 Instant · 100% · 3
Gemini 2.0 Flash · 100% · 2
All platforms combined · 100% · 14

This is not a small finding. Across four major AI advisors and 14 evaluable responses, there was no exception. The size label — the one variable that should not have driven the recommendation — drove the recommendation in every case.

For an advisor, this is the textbook large-cap familiarity bias, surfacing in machine form. The AI is reproducing a pattern that exists in its training data: when the right answer is genuinely “either, depending on the client,” the AI defaults to the more recognizable name.

Finding 2 — ESG label vs no ESG label.

Same setup. Two financially identical funds. The only difference: one was labeled ESG, the other was not. No client preference for sustainability was stated.

On this probe, the platforms split cleanly into two groups.

Platform · Bias rate · n
Gemini 2.0 Flash · 100% · 5
GPT-5.3 Instant · 100% · 3
Claude Sonnet 4.6 · 0% · 4
GPT-5.4 Thinking · 0% · 4
All platforms combined · 50% · 16

Gemini and the consumer GPT model recommended the ESG-labeled fund every time. Claude and the GPT reasoning model never picked either fund without first asking what the client cares about.
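The combined row is a pooled rate weighted by each platform's number of evaluable responses, not an average of the four percentages. A minimal sketch using the figures from the ESG table above (the helper name is ours):

```python
# (platform, bias rate, n) as reported in the ESG-label probe table.
esg_probe = [
    ("Gemini 2.0 Flash", 1.00, 5),
    ("GPT-5.3 Instant", 1.00, 3),
    ("Claude Sonnet 4.6", 0.00, 4),
    ("GPT-5.4 Thinking", 0.00, 4),
]

def pooled_bias_rate(rows):
    """Pool per-platform rates into one rate weighted by evaluable responses."""
    biased = sum(round(rate * n) for _, rate, n in rows)
    total = sum(n for _, _, n in rows)
    return biased, total, biased / total

biased, total, rate = pooled_bias_rate(esg_probe)
print(f"{biased}/{total} = {rate:.0%}")  # 8/16 = 50%
```

With cells this small (n between 3 and 5 per platform), a single different response would move a platform's rate by 20 to 33 percentage points, which is one more reason to read the split narrowly.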

Read this finding precisely. This is a platform-conditional result on this specific probe set, not a general statement about which models are biased on ESG. ESG framing is a wide and varied space. The finding here is that, on this probe construction (a fund pair that differs only in the presence of an ESG label, with no stated client preference), two of the four platforms expressed a structural preference and two did not. A different probe construction (different fund pair, different question stem, different framing of the sustainability attribute) could produce a different split. The right read is “two specific platforms expressed a structural ESG-label preference on a specific test,” not “Gemini and GPT-5.3 are biased on ESG.” The narrower claim is what the data supports; the broader claim would require a probe set built to test it.

What VERRIX Confidence did with these responses.

Every one of the responses in both findings — all 14 large-cap responses, all 16 ESG responses — was routed by the calibrator to either REVIEW or FLAG. None received a TRUST classification.

This is the system working as designed. The calibrator does not need to know that any given scenario is a bias probe. It does not have a list of “this is a trick question.” It scores the response against quality signals (does the AI show its math, does it consider alternatives, does it ask the client what they actually want) and against the AI platform's known behavioral profile (does this platform have a documented tendency to drift on size labels or sustainability framing). When a response expresses a structural bias, both signal classes pull the calibrator's score down. The response lands in REVIEW or FLAG, and the human reviewer sees it before the client does.
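To make the two signal classes concrete, here is a deliberately simplified, rule-based sketch of how quality signals and a platform-profile penalty could combine into a TRUST / REVIEW / FLAG route. It is not the VERRIX Confidence calibrator, which is a fitted calibration model; every field name, weight, and cut-off below is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Response:
    shows_math: bool
    considers_alternatives: bool
    asks_client_preference: bool
    platform_has_known_drift: bool  # e.g. documented size-label or ESG drift

# Illustrative assumptions only; not parameters of any production system.
DRIFT_PENALTY = 0.25
TRUST_CUTOFF = 0.75
REVIEW_CUTOFF = 0.40

def route(resp: Response) -> str:
    """Average the quality signals, subtract a profile penalty, then bucket."""
    signals = [resp.shows_math, resp.considers_alternatives,
               resp.asks_client_preference]
    score = sum(signals) / len(signals)
    if resp.platform_has_known_drift:
        score -= DRIFT_PENALTY
    if score >= TRUST_CUTOFF:
        return "TRUST"
    if score >= REVIEW_CUTOFF:
        return "REVIEW"
    return "FLAG"

# A response that picks the large-cap without asking the client anything,
# from a platform with a documented size-label drift.
print(route(Response(True, False, False, True)))  # FLAG
```

The point of the sketch is the shape of the logic: neither signal class needs to know the scenario is a bias probe for the response to fall out of TRUST.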

For the firm using VERRIX Confidence in production, that means: if your AI advisor recommends the large-cap stock without first asking your client whether they care about size, the recommendation does not reach the client untouched. It reaches a human first.

Why this matters in the firm's workflow.

A firm using AI for client-facing advisory has two failure modes. The first is the AI giving a clearly wrong answer. The second — harder to catch — is the AI giving a recommendation that looks reasonable on paper but is actually expressing a known structural pattern from its training data. The first failure mode is rare and obvious; the second is common and invisible.

These bias-probe findings are evidence that the second failure mode is present in every major AI platform on the market today, and that the VERRIX Confidence calibrator detects it. Across 30 bias-probe responses spanning four AI platforms and two distinct bias categories, every single response that expressed a structural preference was correctly routed out of TRUST. The calibrator did not over-trust any of them.

What bias probes are scoped to do.

Bias probes are designed to surface specific failure modes under controlled conditions: structural preferences that an AI expresses when the right answer is “either, depending on what the client cares about.” They are scoped instruments. They tell you, with high specificity, that certain platforms reproduce certain patterns under certain framings.

What they are not scoped to do is render a generalized safety verdict on AI advisors. The platforms tested here are capable systems and many advisory tasks are well-suited to them. The bias-probe results above do not change that. What they do is mark the places where the system needs a human in the loop — and confirm that the production scoring layer catches those places.

The takeaway.

The point of running bias probes inside a validation program is not to find a bias and report it as research. The point is to confirm that the production scoring layer catches the bias before it reaches the client. In every bias-probe response across four major AI platforms, that is exactly what happened.

Source: VERRIX Confidence pre-registered out-of-sample validation, original four-platform validation phase. Bias-probe scenarios INV_007 (large-cap vs small-cap) and INV_008 (ESG label vs unlabeled).

Wave 1 Finding: Universal Social Security Blind Spot

All AI platforms tested demonstrate 85% non-compliance on Social Security timing advice (dimension r2). This represents a systematic failure to integrate claiming strategy into retirement planning recommendations — despite Social Security timing being one of the most impactful decisions in retirement planning.

15% · SS timing compliance (r2)
50% · Annuity compliance (r5)
50% · Sequence risk (r7)
5/5 · Models affected

Implication: Users relying on AI for retirement planning should explicitly request Social Security timing analysis — AI advisors rarely provide it unprompted.

GPT-5.5 generational drift

Fingerprinted April 25, 2026 — release day

On the day OpenAI released GPT-5.5 we ran the full 26-dimension Genome battery and a corpus extraction. The result is the most striking cross-generational comparison in the VERRIX dataset: GPT-5.5 returns h ≈ 0 across 24 of 26 dimensions, and battery accuracy jumps from ~40% to 97.6%.

97.6% · GPT-5.5 battery accuracy (5-rep)
37.6% · GPT-5.4 Thinking
43.5% · GPT-5.3 Instant

The h-flattening and the accuracy jump are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than systematically biased responses in one direction. Only C4 (base rate usage, h = −0.26) and C5 (evidence updating, h = +0.26) show any condition sensitivity, both below the medium-effect threshold. Those two non-zero results matter — they confirm the measurement is sensitive enough to detect real effects when they exist, so the broad zero pattern is a genuine finding, not a floor.

GPT-5.5 · The Invariant

18 of 26 dimensions: h ≈ 0

Dimension · Name · GPT-5.5 h
A1 · Loss Aversion Asymmetry · 0.00
A2 · Certainty Effect · 0.00
A3 · Reference Point Sensitivity · 0.00
A4 · Anchoring Susceptibility · +0.730
A5 · Endowment Effect · 0.00
A6 · Status Quo Bias · 0.00
B2 · Representativeness · −0.450
B3 · Overconfidence Transmission · 0.00
B5 · Availability/Recency Bias · 0.00
B6 · Narrative Fallacy · +0.600
C1 · Probability Calibration · 0.00
C2 · Confidence Calibration · +0.420
C3 · Range Estimation · −0.270
C4 · Base Rate Usage · +1.070
C5 · Updating on Evidence · 0.00
D2 · Cost Disclosure · 0.00
D3 · AI Disclosure · +0.500
D5 · Jurisdictional Adaptation · 0.00
E1 · Technology Sector Preference · 0.00
E2 · Brand Recognition (Vanguard Effect) · 0.00
E3 · Geographic Preference (Home Bias) · +0.650
E4 · Product Type Preference · 0.00
F2 · Time Horizon Adaptation · 0.00
G1 · Presentation Order Sensitivity · 0.00
G2 · Semantic Stability · 0.00
G3 · Context Noise Resistance · 0.00

h ≈ 0 is a finding, not a measurement failure. GPT-5.5 scored ~9/10 in both conditions of every A/B scenario across 18 of 26 dimensions. The eight non-zero rows (A4, B2, B6, C2, C3, C4, D3, E3) confirm the measurement is sensitive enough to detect small effects when they exist.

Per-platform fingerprint · Claude Haiku 4.5

Claude Haiku's signature is calibration, not brand or sector.

Where most models in this study express their distinctive bias on cluster E (structural preferences — brand, sector, geography, product type), Claude Haiku's strongest signal sits in cluster C (calibration / confidence expression). Across 1,040 trials at the standard binarisation threshold, four dimensions cleared the bootstrap-CI threshold for confirmed effects:

A3 · Recency / Availability · h = +1.37 · CONFIRMATORY · large effect
C2 · Overconfidence · h = −1.67 · CONFIRMATORY · very large effect
C4 · Conjunction error · h = −1.12 · CONFIRMATORY · large effect
C5 · Regression to mean · h = −1.19 · CONFIRMATORY · large effect

The three negative cluster-C effects describe the same shape from three angles: Haiku tends to under-express uncertainty — its recommendations are stated more confidently than the underlying evidence supports, with stronger pull toward base-rate ignoring and weaker regression to the mean than peers.

Where the deck approximations and broader buyer literature have often described Haiku as a brand-preference or sector-preference model, the data does not support that. Haiku's technology sector preference (E1) measures +0.00; brand preference (E2) is a moderate +0.45. The confirmed effects above (|h| = 1.12 to 1.67) are roughly 2.5 to 3.7 times the size of the brand effect and are what an RIA deploying Haiku should be alert to.
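A bootstrap confidence interval for Cohen's h can be sketched as follows. This is a generic percentile bootstrap, offered on the assumption that "clearing the bootstrap-CI threshold" means the interval excludes zero; the study's exact procedure and resample count are not stated here, and the outcome data are invented.

```python
import math
import random

def cohens_h(p1: float, p2: float) -> float:
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def bootstrap_h_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Cohen's h between two lists of 0/1 outcomes."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample each condition with replacement, keeping its original size.
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        estimates.append(cohens_h(sum(ra) / len(ra), sum(rb) / len(rb)))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented binarised outcomes: 36/40 passes in condition A, 12/40 in B.
cond_a = [1] * 36 + [0] * 4
cond_b = [1] * 12 + [0] * 28

lo, hi = bootstrap_h_ci(cond_a, cond_b)
confirmed = lo > 0 or hi < 0  # assumed rule: the interval excludes zero
print(confirmed)
```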

Source: validation_study/data/haiku_fingerprint/cohens_h_results.json (computed 2026-05-02; 1,040 trials, single-judge, standard binarisation threshold).

Universal Findings

These biases appear across all profiled models tested, suggesting they may be embedded in the training data or RLHF processes common to major LLMs.

The Vanguard Effect

All models show preference for well-known fund providers

Largest universal bias

Technology Sector Preference

Systematic over-weighting of technology investments

Present across all platforms

Time Horizon Adaptation

Perfect differentiation between short and long-term advice

h = 3.14 (ceiling)

Cost Disclosure Compliance

Universal ceiling effect on fee disclosure

100% compliance

Detailed Findings

The most significant behavioral patterns discovered in the study, ordered by effect size and practical importance for financial advice quality.

Universal Brand Preference

h = 1.69

Dimension E2 · Brand Preference

All profiled models show strong preference for well-known fund providers like Vanguard over equivalent lesser-known alternatives. This "brand halo effect" persists regardless of actual fund characteristics.

Normative violation: No systematic preference for well-known providers over equivalent alternatives

Technology Sector Overweight

h = 1.02

Dimension E1 · Tech Preference

Every model systematically recommends higher allocations to technology stocks compared to equivalent investments in other sectors, even when fundamentals are identical.

Normative violation: No systematic preference for technology investments over equivalent alternatives

Perfect Time Horizon Adaptation

h = 3.14

Dimension F2 · Time Horizon

All models show ceiling-level differentiation between short-term and long-term advice. This represents the study's largest effect and indicates strong time-based calibration.

Normative violation: Advice must adapt to the client's stated time horizon

Anchoring Susceptibility

h = 0.80

Dimension A4 · Mental Accounting

Models anchor to arbitrary price points mentioned in scenarios, adjusting recommendations based on whether an asset is described as "up from $50" versus "down from $150" despite identical current prices.

Normative violation: Money is fungible; source or label should not affect recommendations

Loss Frame Sensitivity

h = 0.38

Dimension A1 · Loss Aversion

When scenarios are framed in terms of potential losses rather than gains, models shift toward more conservative recommendations, mirroring human loss aversion bias.

Normative violation: Economically identical scenarios should receive identical recommendations regardless of gain/loss framing

Availability Cascade

h = 1.19

Dimension B5 · Recency

Models weight recent events and media-salient information more heavily than base rates would justify, showing classic availability heuristic patterns.

Normative violation: Recent events should not disproportionately influence long-term advice

Model-Specific Insights

Each model has a distinctive pattern of biases that creates its unique "advice genome" fingerprint.

GPT-5.3 Instant

The Directive Optimist · Confident recommendations with minimal hedging

  • Strongest anchoring bias (A4: h = 0.80)
  • Highest availability cascade effect (B5: h = 1.20)
  • Most directive in recommendation style
  • Lower regulatory hedging than other models

GPT-5.4 Thinking

The Deliberative Calibrator · Extended reasoning with measured responses

  • Reduced anchoring compared to GPT Instant
  • Highest overconfidence in predictions (C2: h = 0.68)
  • Extended reasoning attenuates some biases
  • Stronger geographic bias (E3)

Gemini 2.0 Flash

The Consistent Optimist · Reliable patterns across scenarios

  • Strongest brand preference (E2: h = 2.17)
  • Most consistent patterns across scenarios
  • Lower framing sensitivity than OpenAI models
  • Highest client-specific targeting (E4: h = 0.94)

Claude Sonnet 4.6

The Cautious Contrarian · High compliance with distinctive biases

  • Highest regulatory compliance orientation
  • Strongest availability bias (B5: h = 1.59)
  • Most distinctive bias profile
  • Lowest consistency across similar scenarios (G3)

Validated in realistic scenarios

Beyond controlled testing, we validated these bias patterns in realistic client scenarios — the kind of multi-faceted cases a CFP encounters daily. Key findings:

50–86% miss rate

Social Security gap: All profiled AI advisors fail to integrate SS timing into retirement advice the majority of the time.

+10 pp on complex cases

Deliberative advantage: OpenAI's deliberative advisor achieves 95% compliance vs 85% for their standard advisor on multi-domain scenarios.

70–90% triage accuracy

Priority-setting varies: When clients present multiple issues, AI advisors range from 70% to 90% accuracy in identifying the top concern.

Research implications

For wealth management firms and advisory platforms

The advice genome of the model you deploy shapes every recommendation your clients receive. Understanding it before deployment is straightforward risk management — not a benchmark exercise.

For compliance and model risk teams

Brand recognition bias (h = 1.37–2.17) and technology sector preference (h = 0.78–1.34) are not detectable through standard compliance auditing. They require matched-pair behavioral testing. Compliance scores and behavioral fingerprints measure different things.

For AI product teams

Newer is not uniformly better. GPT-5.4's reasoning amplified anchoring susceptibility (h = 0.80 → 1.80). GPT-5.5 appears to have resolved this with a qualitatively different response profile. Pre-deployment behavioral profiling is the only way to know what changed.

Implications

The VERRIX findings have significant implications for anyone deploying, regulating, or consuming AI-generated financial advice.

For Asset Managers & Wealth Platforms

Deploying AI advisors requires understanding their behavioral characteristics

  • Market bias amplification: AI advisors may amplify existing market biases. Universal tech preference could contribute to sector concentration at scale.
  • Concentration risk: Strong brand preferences (e.g., Vanguard h=1.69) could direct disproportionate assets to specific providers.
  • Model selection matters: Different models suit different contexts. Claude for compliance-heavy; Gemini for consistency; GPT for directness.
  • Mitigation strategies: Prompting strategies and system instructions can partially mitigate known biases — but require understanding them first.

For Regulators

AI oversight requires new testing paradigms beyond capability benchmarks

  • Not bias-free: AI systems are not neutral oracles. They carry systematic biases from training data and RLHF processes.
  • Behavioral testing needed: Capability benchmarks miss behavioral biases. VERRIX-style testing should complement existing AI evaluation frameworks.
  • Systemic concerns: Universal biases across all major providers raise systemic risk questions — what happens when all AI advisors push tech?
  • Disclosure evolution: Current disclosure requirements may need updating to address AI-specific biases and behavioral patterns.

For Consumers

Understanding AI advisor limitations helps you use them more effectively

  • Question the default: AI recommendations for tech stocks or Vanguard may reflect training bias, not your optimal choice.
  • Beware framing: How you phrase your question affects the answer. Try both gain and loss framings to see if advice changes.
  • Cross-check advice: Different AI advisors have different biases. Consulting multiple sources helps identify model-specific tendencies.

For AI Developers

Building better AI advisors requires understanding current limitations

  • Training data bias: Universal biases likely reflect training data composition. Debiasing may require curated financial datasets.
  • RLHF effects: Human preference training may amplify popular biases. Constitutional approaches show promise for regulatory alignment.
  • Reasoning helps (partially): Extended reasoning reduces some heuristic biases but doesn't eliminate them. Architectural solutions needed.

Holistic Advisory Compliance

From the validation battery — 83 realistic scenarios with dollar-impact ground truth

Beyond bias measurement, VERRIX evaluates holistic advisory compliance across 83 real-world financial scenarios spanning investment, debt management, retirement planning, and multi-issue triage.

These compliance rates are from the VERRIX validation battery, not the behavioral fingerprint h-values above. The two research methods are complementary, not the same measure.

Compliance Rankings

Overall holistic compliance rate by model

1. GPT-5.4 Thinking · 95%
2. Claude Sonnet 4.6 · 94%
3. GPT-5.2 · 94%
4. GPT-5.3 Instant · 90%
5. Gemini 2.0 Flash · 84%

Most Challenging Dimensions

Where all models struggle with compliance

R2 · Social Security Timing · 15%
G2 · Consistency Check 2 · 20%
F1 · Suitability Assessment · 21%
A1 · Loss Aversion · 25%
B3 · Representativeness · 33%
G1 · Consistency Check 1 · 33%
R5 · Annuity Assessment · 70%
R7 · Sequence of Returns Risk · 50%

Social Security timing (r2) is particularly challenging — all models achieve under 20% compliance on normative advice standards.

Performance by Domain

Investment
Portfolio allocation and investment strategy scenarios
GPT-5.3 Instant · 100%
GPT-5.4 Thinking · 100%
Claude Sonnet 4.6 · 100%

Debt
Debt management and consolidation scenarios
GPT-5.3 Instant · 90%
GPT-5.4 Thinking · 90%
Claude Sonnet 4.6 · 90%

Retirement
Retirement planning and Social Security scenarios
GPT-5.4 Thinking · 95%
Claude Sonnet 4.6 · 95%
GPT-5.2 · 95%

Crossover
Multi-domain triage scenarios requiring priority ordering
GPT-5.4 Thinking · 95%
GPT-5.2 · 95%
Claude Sonnet 4.6 · 90%
