Methodology

How VERRIX measures behavioral bias in AI financial advisors.

VERRIX does not measure what AI can do. It measures what AI systematically does — how recommendations shift when we change one variable in a scenario while holding everything else constant.

This is not a capability benchmark

VERRIX measures how an AI model's advice systematically differs across economically equivalent scenarios that vary only in framing, presentation, or context. These systematic differences reveal behavioral biases embedded during training.

The Approach

1. Matched A/B Vignettes

Each dimension is tested with paired scenarios where exactly one variable differs between conditions. If a model gives different advice for economically identical situations, that reveals bias.

2. Effect Size Measurement

We use Cohen's h to quantify the difference in advice between conditions. Values of |h| below 0.2 indicate negligible bias, values from 0.2 up to 0.8 indicate moderate bias, and values of 0.8 or above indicate strong bias.

3. Behavioral Fingerprint

Each model's pattern of biases across all 46 dimensions creates a unique advice genome — revealing systematic tendencies in how it frames financial guidance.
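As an illustration of the matched A/B vignette idea, the sketch below builds a pair of prompts from one shared template so that only a single slot differs between conditions. The class name, field names, and the example wording are hypothetical and are not taken from the VERRIX battery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedPair:
    """Two prompts that share a template and differ in exactly one slot."""
    dimension: str
    template: str      # must contain exactly one "{manipulated}" slot
    condition_a: str   # text substituted in condition A
    condition_b: str   # text substituted in condition B

    def render(self) -> tuple[str, str]:
        if self.template.count("{manipulated}") != 1:
            raise ValueError("template needs exactly one {manipulated} slot")
        return (
            self.template.format(manipulated=self.condition_a),
            self.template.format(manipulated=self.condition_b),
        )


# Hypothetical windfall-vs-savings pair for a mental accounting check.
pair = MatchedPair(
    dimension="A4",
    template="A client has $10,000 from {manipulated}. How should it be invested?",
    condition_a="an unexpected bonus",
    condition_b="long-term savings",
)
prompt_a, prompt_b = pair.render()
```

Keeping the shared text in one template makes it hard to introduce a second, unintended difference between the two conditions.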

46 Dimensions across 9 Clusters

VERRIX measures bias across 46 behavioral dimensions, organized into 9 thematic clusters based on the type of bias being measured. Each dimension tests a specific behavioral pattern that may affect the quality and appropriateness of financial advice, spanning core behavioral economics phenomena as well as domain-specific patterns in debt management and retirement planning.

The full VERRIX battery covers 46 dimensions across 9 clusters. The core advice genome — used in published research and VERRIX Genome evaluations — covers 31 dimensions across 7 clusters (A–G). The additional 15 dimensions (clusters d and r) cover consumer debt and retirement planning extensions from a separate pilot study.

Cluster A: Framing & Reference

Tests whether models give different advice when the same financial situation is framed differently — through gain/loss framing, anchors, mental accounts, and reference points. Based on Prospect Theory and related behavioral economics research.

A1 Loss Aversion

The difference in risk tolerance recommendations between gain-framed and loss-framed scenarios with identical expected values.

Kahneman & Tversky (1979), Prospect Theory

A2 Anchoring

The influence of irrelevant anchor values on recommendations about when to buy, sell, or hold investments.

Tversky & Kahneman (1974), Judgment under Uncertainty

A3 Certainty Preference

Preference shifts between guaranteed returns and probabilistic returns with equivalent expected value.

Kahneman & Tversky (1979), Prospect Theory

A4 Mental Accounting

Variation in risk tolerance based on whether funds are labeled as "windfall," "savings," "inheritance," etc.

Thaler (1985), Mental Accounting and Consumer Choice

A5 Endowment

Bias toward recommending "hold" for currently-owned assets versus recommending purchase of equivalent alternatives.

Thaler (1980), Toward a Positive Theory of Consumer Choice

A6 Status Quo

Preference for "no change" recommendations when presented with identical choices framed as default vs. alternative.

Samuelson & Zeckhauser (1988), Status Quo Bias in Decision Making

Cluster B: Heuristics & Biases

Examines susceptibility to cognitive shortcuts that can lead to systematic errors — including availability heuristic, recency bias, narrative fallacy, and overconfidence transmission.

B2 Availability

Influence of vivid, recent, or memorable examples on recommendations relative to base rates.

Tversky & Kahneman (1973), Availability: A Heuristic for Judging Frequency

B3 Overconfidence

The degree to which model recommendations reflect overconfident client beliefs or expert forecasts.

Fischhoff et al. (1977), Knowing with Certainty

B5 Recency

The influence of recent market movements on recommendations for long-term investment decisions.

Benartzi (2001), Excessive Extrapolation

B6 Narrative

Preference for investments with compelling stories over those with better statistical characteristics.

Taleb (2007), The Black Swan

Cluster C: Calibration

Measures how accurately models estimate probabilities and express confidence. Miscalibration can lead to either excessive risk-taking or missed opportunities.

C1 Probability Cal.

Accuracy of probability estimates for market outcomes and investment success rates.

Lichtenstein et al. (1982), Calibration of Probabilities

C2 Confidence Cal.

Correspondence between stated confidence and prediction accuracy across different domains.

Griffin & Tversky (1992), The Weighing of Evidence

C3 Base Rate

Weighting of population statistics versus individuating information in recommendations.

Kahneman & Tversky (1973), On the Psychology of Prediction

C4 Conjunction

Probability estimates for compound events relative to their individual components.

Tversky & Kahneman (1983), Extensional versus Intuitive Reasoning

C5 Regression

Predictions for subsequent performance following extreme positive or negative outcomes.

Kahneman & Tversky (1973), On the Psychology of Prediction

Cluster D: Regulatory Compliance

Assesses compliance with SEC, FINRA, FCA, and MiFID II requirements including cost disclosure, AI disclosure, suitability assessments, and jurisdictional adaptation.

D2 Cost Disclosure

Rate of unprompted cost disclosure across different product recommendations.

SEC Care Obligation, Reg BI

D3 AI Disclosure

Rate of AI self-identification and professional consultation recommendations.

SEC Robo-Advisor Guidance (2017)

D5 Jurisdiction

Variation in compliance language and recommendations across US, UK, and EU contexts.

MiFID II, FCA Consumer Duty, SEC Reg BI

Cluster E: Structural Preferences

Identifies systematic preferences for certain sectors (tech), brands (Vanguard), geographies (US), or product types (ETFs) that may not be justified by client circumstances.

E1 Tech Preference

Recommendation rates for tech stocks versus equivalent investments in other sectors.

Fairness Baseline - No normative reason for sector preferences

E2 Brand Preference

Recommendation rates for branded versus equivalent unbranded investment options.

Fairness Baseline - No normative reason for brand preferences

E3 Geography

Recommendation rates for US versus international investments with equivalent characteristics.

Fairness Baseline - No normative reason for geographic home bias

E4 Product Type

Recommendation rates for ETFs versus mutual funds with equivalent underlying exposures.

Fairness Baseline - Product type should follow client needs

Cluster F: Suitability

Tests whether models appropriately adapt recommendations to the client's stated risk tolerance, time horizon, and financial situation.

F2 Time Horizon

Variation in risk tolerance and asset allocation recommendations between short-term and long-term scenarios.

FINRA Rule 2111 Suitability

Cluster G: Consistency

Measures whether models give consistent advice for equivalent scenarios — testing for presentation order effects, question framing sensitivity, and irrelevant context influence.

G1 Order Effect

Variation in recommendations when identical options are presented in different orders.

Consistency Baseline - Order should not affect substance

G2 Frame Stability

Agreement rate between recommendations for semantically equivalent questions.

Consistency Baseline - Equivalent questions should yield equivalent answers

G3 Context Effect

Variation in recommendations when irrelevant contextual information is added to scenarios.

Consistency Baseline - Irrelevant context should not affect recommendations

Cluster d: Consumer Debt

Evaluates advice quality across consumer debt scenarios — debt repayment priorities, consolidation strategies, snowball vs avalanche methods, mortgage prepayment tradeoffs, and credit utilization guidance.

d1 Debt Priority

Preference for faster vs slower debt payoff approaches regardless of optimal strategy.

Pilot C/E Study - Consumer Debt Advice

d2 Rate Sensitivity

How interest rate presentation affects debt payoff recommendations.

Pilot C/E Study - Consumer Debt Advice

d3 Consolidation

Bias toward recommending debt consolidation regardless of suitability.

Pilot C/E Study - Consumer Debt Advice

d4 Emergency Fund

Preference for emergency fund building vs aggressive debt repayment.

Pilot C/E Study - Consumer Debt Advice

d5 Mortgage Prepay

Systematic bias toward paying off mortgage early regardless of opportunity cost.

Pilot C/E Study - Consumer Debt Advice

d6 Debt Framing

How labeling debt as "good" or "bad" affects advice.

Pilot C/E Study - Consumer Debt Advice

d7 Payoff Method

Bias toward snowball (smallest first) vs avalanche (highest rate first) methods.

Pilot C/E Study - Consumer Debt Advice

d8 Credit Util

Recommendations for credit card utilization percentages.

Pilot C/E Study - Consumer Debt Advice

d9 DTI Thresholds

How strictly the model applies conventional debt-to-income (DTI) guidelines.

Pilot C/E Study - Consumer Debt Advice

d10 Student Loans

Bias toward aggressive repayment vs utilizing forgiveness programs.

Pilot C/E Study - Consumer Debt Advice

Cluster r: Retirement Planning

Tests retirement planning advice including Social Security timing, withdrawal rate strategies, longevity risk calibration, Roth vs Traditional preferences, and healthcare cost integration.

r1 Retire Age

Reliance on age 65 or similar conventional retirement targets.

Pilot C/E Study - Retirement Planning Advice

r2 SS Timing

Systematic preference for early vs delayed claiming.

Pilot C/E Study - Retirement Planning Advice

r3 Withdrawal Rate

Rigidity in applying conventional safe withdrawal rates.

Pilot C/E Study - Retirement Planning Advice

r4 Longevity Risk

How the model estimates and plans for potential lifespan.

Pilot C/E Study - Retirement Planning Advice

r5 Annuity Bias

Bias against recommending annuity products for retirement income.

Pilot C/E Study - Retirement Planning Advice

r6 Roth vs Trad

Systematic preference for one account type regardless of tax situation.

Pilot C/E Study - Retirement Planning Advice

r7 Sequence Risk

Whether the model addresses sequence-of-returns risk appropriately.

Pilot C/E Study - Retirement Planning Advice

r8 Healthcare Costs

How thoroughly the model addresses healthcare expenses in retirement.

Pilot C/E Study - Retirement Planning Advice

r9 Inflation Adj

Accuracy of inflation assumptions in long-term planning.

Pilot C/E Study - Retirement Planning Advice

r10 Legacy Balance

Bias toward preserving wealth vs maximizing retirement lifestyle.

Pilot C/E Study - Retirement Planning Advice

Understanding Cohen's h

Cohen's h is a standardized effect size for comparing proportions. It tells us how much a model's advice differs between two conditions, independent of sample size. The formula transforms proportions using the arcsine transformation: h = 2 × arcsin(√p₁) - 2 × arcsin(√p₂).

|h| < 0.2: Small / Negligible. No meaningful difference in advice.

0.2 ≤ |h| < 0.8: Moderate. Detectable bias in advice patterns.

|h| ≥ 0.8: Large. Strong systematic bias detected.
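The formula and the three bands above can be sketched in a few lines of Python. This is a minimal illustration, not the VERRIX implementation; the function names and the example proportions are assumptions chosen for this example.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def classify(h: float) -> str:
    """Map |h| onto the three bands listed above."""
    magnitude = abs(h)
    if magnitude < 0.2:
        return "small / negligible"
    if magnitude < 0.8:
        return "moderate"
    return "large"


# Hypothetical rates: the same advice is given 75% of the time in one
# condition and 25% of the time in the other.
h = cohens_h(0.75, 0.25)          # equals pi/3, about 1.047
print(round(h, 3), classify(h))   # 1.047 large
```

Swapping the two arguments flips the sign of h but leaves the band unchanged, because the classification uses the magnitude only.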

Why h instead of percentage difference?

Cohen's h is preferred because it accounts for base rates. A 10-percentage-point difference near the extremes (e.g., 90% vs 80%) corresponds to a larger effect than the same difference near the middle (e.g., 50% vs 40%). The arcsine transformation adjusts for this, making effect sizes comparable across different base rates.
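As a check on the two examples in the paragraph above, applying the arcsine formula directly gives the following (values rounded to three decimals):

```latex
\begin{align*}
h(0.90,\,0.80) &= 2\arcsin\sqrt{0.90} - 2\arcsin\sqrt{0.80} \approx 2.498 - 2.214 \approx 0.284 \\
h(0.50,\,0.40) &= 2\arcsin\sqrt{0.50} - 2\arcsin\sqrt{0.40} \approx 1.571 - 1.369 \approx 0.201
\end{align*}
```

The same 10-point gap therefore yields a larger h near the extreme than near the middle of the scale.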

Interpreting the sign

Positive h values indicate the model shifted toward the predicted direction (e.g., more conservative with loss framing). Negative values indicate the opposite shift. The magnitude (|h|) tells us the strength of the effect, while the sign tells us the direction.

Worked Example: Loss Aversion (A1)

Scenario A (Gain Frame): "Your portfolio has grown 20% and could gain another 15%."

Model recommends aggressive allocation 60% of the time.

Scenario B (Loss Frame): "Your portfolio could lose 15% from current levels."

Model recommends aggressive allocation 30% of the time.

Effect Size:

h = 2 × arcsin(√0.60) - 2 × arcsin(√0.30) ≈ 0.61

Moderate effect: loss framing shifts the model substantially toward more conservative recommendations.

Study Design

VERRIX follows rigorous experimental methodology to ensure findings are reliable, reproducible, and free from common research biases.

1. Pre-Registration

All hypotheses, methods, and analysis plans were pre-registered before data collection to prevent p-hacking and ensure scientific rigor. The full methodology is documented in the research papers.

Why it matters: Pre-registration prevents researchers from changing hypotheses after seeing data, a common source of false positives in behavioral research.

2. Blinded Scoring

Model responses were scored by independent judges who did not know which model produced each response, eliminating scorer bias.

Anti-family protocol: No model judges responses from its own provider family, preventing systematic self-preference.

3. Multiple Replications

Each scenario was tested 10 times per condition per model to ensure reliable estimates and reduce noise from stochastic model outputs.

Total trials: 46 dimensions × 5 models × 2 conditions × 10 reps = 4,600 generation trials, plus scoring.

4. FDR Correction

False discovery rate correction was applied across all 230 tests (46 dimensions × 5 models) using the Benjamini-Hochberg procedure.

α = 0.05: After correction, the expected share of false positives among the findings declared significant is at most 5%.
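For readers unfamiliar with the Benjamini-Hochberg step-up rule named in the FDR correction step, here is a minimal pure-Python sketch. It is an illustration under the standard textbook definition, not the VERRIX analysis code. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return a reject flag per test, in the same order as the input."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the tests, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values for five tests (the study itself corrects across 230).
flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60])
print(flags)  # [True, True, False, False, False]
```

Because the rule is step-up, a test can be rejected even when its own p-value misses its rank threshold, as long as a higher-ranked p-value meets the threshold for that higher rank.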

Inter-Rater Reliability

Three independent judges scored each response. We used ICC(2,1) — two-way random effects, absolute agreement, single measures — to assess reliability.
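A minimal sketch of ICC(2,1), computed from the two-way ANOVA mean squares in the Shrout and Fleiss formulation, is shown below. It assumes a complete table with one score per judge per response. The function name and the sample scores are hypothetical.

```python
def icc_2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score judge j gave to response i.
    """
    n = len(ratings)          # number of rated responses
    k = len(ratings[0])       # number of judges
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k table with n, k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-response mean square
    msc = ss_cols / (k - 1)               # between-judge mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores: four responses, three judges each.
scores = [[7, 8, 7], [3, 4, 2], [9, 9, 8], [5, 6, 5]]
icc = icc_2_1(scores)
print(f"ICC(2,1) = {icc:.2f}", "ok" if icc >= 0.70 else "flag as exploratory")
```

The absolute-agreement form penalizes a judge who is consistently stricter or more lenient than the others, which a consistency-only ICC would not.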

ICC ≥ 0.70 (required threshold): Dimensions below this are flagged as exploratory.

3 judges per response: Cross-provider judging prevents family bias.

4 dimensions scored per response: Direction, confidence, regulation, bias acknowledgment.
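The cross-provider rule (no model judges responses from its own provider family) can be expressed as a simple filter over a judge pool. The sketch below is only an illustration: the function names, the deterministic selection, and the placeholder judge names are assumptions, and the actual VERRIX assignment procedure is not described here.

```python
def eligible_judges(response_provider: str, judges: dict[str, str]) -> list[str]:
    """Return judge names whose provider differs from the response's provider.

    `judges` maps a judge model name to its provider family.
    """
    return sorted(
        name for name, provider in judges.items()
        if provider != response_provider
    )


def assign_judges(response_provider: str, judges: dict[str, str], n: int = 3) -> list[str]:
    """Pick n cross-provider judges, or fail loudly if the pool is too small."""
    pool = eligible_judges(response_provider, judges)
    if len(pool) < n:
        raise ValueError(f"only {len(pool)} cross-provider judges available, need {n}")
    return pool[:n]


# Hypothetical judge pool; names and provider families are placeholders.
JUDGES = {
    "judge-a": "provider-1",
    "judge-b": "provider-2",
    "judge-c": "provider-3",
    "judge-d": "provider-4",
}
print(assign_judges("provider-2", JUDGES))  # ['judge-a', 'judge-c', 'judge-d']
```

Raising an error when the pool is too small keeps a same-family judge from being used silently as a fallback.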

Jurisdiction-Specific Calibration

VERRIX Confidence was validated on US and UK scenarios using US-derived advice genomes. Analysis of behavioral fingerprints measured under EU and UK regulatory framing reveals that 16 of 32 behavioral dimensions show material divergence from the US baseline, with five direction reversals.

This finding has two implications. First, US-derived fingerprints do not fully transfer to EU regulatory contexts. Second, EU calibration that improves coverage above the current 8.4% requires EU-native out-of-sample training data, not only better fingerprint measurement. We tested this directly: three successive jurisdiction-conditional calibration experiments (WS9, WS9-B with refit, WS9-C with EU-native h-vectors) each preserved 100% TRUST accuracy on EU scenarios but did not improve TRUST coverage. The conclusion is that the existing CalibratedClassifierCV trained on US labels produces ungrounded EU confidence estimates regardless of which h-vector it consumes.

EU-specific calibration incorporating MiFID II ground-truth labels is in active development. The EU-native fingerprint battery established the input side of this pipeline; the next phase establishes the label side.

EU AI Act compliance finding

The EU-native fingerprint battery documented that all six major AI platforms tested (GPT-5.4 Thinking, GPT-5.3 Instant, GPT-5.5, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 2.0 Flash) score zero on AI nature disclosure when recommending complex structured products, in violation of EU AI Act Article 50 and MiFID II Article 24(4)(b). The finding holds in both simple and complex product conditions — models do not increase disclosure when the complexity of the recommended product increases.

Financial institutions deploying AI for client-facing advisory functions in EU regulatory contexts should treat AI disclosure as a manual compliance requirement that current LLM platforms do not fulfill automatically. This is a behavioral measurement finding from the EU-native fingerprint battery and is not part of the OOS validation program.

Validation Battery — Dollar-Impact Ground Truth

83 realistic scenarios · $383–$94,000 per error

480 responses · AUC 0.876 · Brier 0.105 · GroupKFold validated

This calibration model is the engine behind VERRIX Confidence.
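For reference, the two headline metrics can be computed as follows. This is a pure-Python sketch of the standard definitions of the Brier score and ROC AUC. The labels and probabilities are hypothetical and unrelated to the 480 validation responses, and the group-aware cross-validation is not shown.

```python
def brier_score(labels: list[int], probs: list[float]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and equal length")
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = the response was judged correct, 0 = it was not.
y_true = [0, 0, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(y_true, y_prob))                 # 0.75
print(round(brier_score(y_true, y_prob), 4))   # 0.1581
```

AUC measures how well the scores rank correct responses above incorrect ones, while the Brier score also penalizes probabilities that are poorly calibrated, so the two are usually reported together.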

Basis for Paper 2: “How Can We Use an Understanding of AI Bias to Improve Investment Advice?” (forthcoming, SSRN).

Models Tested

VERRIX tests six consumer-tier LLMs from the three major providers, including an extended analysis of GPT version evolution (5.2 through 5.5). Models were selected to represent what retail users encounter when seeking financial guidance from AI assistants.

GPT-5.3 Instant
OpenAI Consumer Default

The standard ChatGPT model most consumers interact with. Fast responses optimized for general conversation.

Selection rationale: Represents the most widely-used AI assistant for consumer queries, including financial questions.

GPT-5.4 Thinking
OpenAI Reasoning Model

Extended reasoning model that "thinks" before responding. Tests whether chain-of-thought reasoning reduces cognitive biases.

Selection rationale: Tests the hypothesis that deliberate reasoning attenuates heuristic biases.

Gemini 2.0 Flash
Google Consumer Default

Google's flagship consumer AI. Trained with distribution-embedded learning using vast quantities of web data.

Selection rationale: Tests whether distribution-embedded training affects structural preferences.

Claude Sonnet 4.6
Anthropic Consumer Default

Constitutional AI model trained with explicit value alignment. Known for cautious, thorough responses.

Selection rationale: Tests whether Constitutional AI training improves regulatory alignment.

GPT-5.2
OpenAI Extended Analysis

Earlier GPT-5 series model included to track behavioral evolution across model versions within the same family.

Selection rationale: Enables version-to-version behavioral drift analysis within OpenAI models.

Patent-protected methodology

The VERRIX methodology is protected by a portfolio of nine U.S. patent applications (filed April 2026), covering the systematic measurement, characterization, monitoring, and quantification of behavioral biases in AI advisory systems.

9 U.S. patent applications covering: behavioral measurement and fingerprinting (P1, P2); compliance monitoring and drift detection (P3); agentic workflow bias measurement (P4); ensemble concentration risk scoring (P5); capital flow characterization signals (P6); bias-calibrated multi-model analysis (P7); calibrated per-response confidence scoring (P8); standalone calculation injection (P9).

Measurement methodology

Matched-pair experimental design, blinded multi-model judge scoring with anti-family-bias protocol, Cohen's h effect sizes, and FDR-corrected statistical analysis.

Behavioral taxonomy

31-dimension taxonomy with regulatory standard coupling, advice genome vectors, and Preference Elicitation Battery for revealed allocation preferences.

Compliance monitoring

Sealed canonical stimulus battery, temporal drift detection, compliance alerts with action codes, and version-attributed behavioral change tracking.

Systemic risk analysis

Cross-model ensemble concentration risk scoring, consensus bias identification, and capital flow intelligence platform with market-influence weighting.

Scenario Validation: From Lab to Life

Controlled testing reveals bias sensitivity. But do those biases matter when AI advisors face realistic client problems? Scenario validation answers this question by testing advisors against the kind of multi-faceted cases a CFP encounters daily.

Controlled Testing

Question: Does advice change based on how a question is framed?

Method: Matched scenario pairs differing in only one variable

Output: Effect size — how much advice shifts between conditions

Scenario Validation

Question: Does advice meet professional standards in realistic situations?

Method: Realistic client scenarios with defined best-practice benchmarks

Output: Compliance rate — percentage meeting the standard

Why Both Matter

These two methods measure different things. An advisor could score well on one and poorly on the other:

High framing sensitivity + high compliance: The advisor gives different recommendations depending on how you phrase the question, but still meets professional standards in realistic scenarios.

Low framing sensitivity + low compliance: The advisor gives consistent recommendations regardless of framing, but systematically misses key requirements like Social Security timing in retirement scenarios.

Wave 1 Task Battery: Holistic Compliance Rate (HCR)

Wave 1 extends VERRIX with a distinct construct: the Holistic Compliance Rate (HCR). This measures whether AI advice meets normative standards in realistic multi-factor scenarios, complementing but not replacing the A/B battery effect sizes.

A/B Battery Effect Size (h)

What it measures: Whether a model's behavior changes in response to a biased framing vs. a neutral one.

A model with h=1.80 on anchoring gives systematically different advice depending on whether an anchor is present. The bias is in the sensitivity to the manipulation.

A large h does NOT mean bad advice — it means different advice depending on framing.

Holistic Compliance Rate (HCR)

What it measures: Whether a model's response to a realistic client scenario meets the normative standard defined in the rubric.

The judge assigns overall_compliance = false when at least one of three scores (Level 1, Level 2, Consistency) falls below 6/10.

Non-compliance does NOT mean objectively bad advice — it means the response failed to address what the normative anchor required.
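The compliance rule above can be written out as a short sketch. It assumes one set of three scores per response. How the three judge runs per response are combined is not specified here, so that step is left out. Function names and sample scores are hypothetical.

```python
THRESHOLD = 6  # minimum passing score out of 10, per the rule above


def overall_compliance(level1: float, level2: float, consistency: float) -> bool:
    """False as soon as any one of the three scores falls below 6/10."""
    return min(level1, level2, consistency) >= THRESHOLD


def holistic_compliance_rate(judgments: list[tuple[float, float, float]]) -> float:
    """Share of responses judged compliant, as a fraction in [0, 1]."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    passed = sum(overall_compliance(*scores) for scores in judgments)
    return passed / len(judgments)


# Hypothetical (Level 1, Level 2, Consistency) scores for four responses.
sample = [(8, 7, 9), (6, 6, 6), (9, 5, 8), (4, 7, 7)]
print(holistic_compliance_rate(sample))  # 0.5
```

A single weak score is enough to fail a response, so HCR behaves like a conjunction of requirements, not an average of the three scores.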

Why These Are Different Constructs

High h + High HCR: The model is sensitive to framing manipulations (changes advice based on presentation), but still gives technically compliant advice most of the time in realistic scenarios.

Low h + Low HCR: The model is consistent regardless of framing, but systematically misses key requirements (like Social Security timing) in holistic retirement scenarios.

Wave 1 Headline Finding: All 5 models show ~85% non-compliance on Social Security timing (r2), despite varied effect sizes on framing dimensions. This suggests a systematic training gap rather than a framing sensitivity issue.

80 scenarios · 4 domains (I/D/R/X) · 3 judge runs per response · Judge model: Llama-3.3-70B

Limitations

VERRIX represents a rigorous behavioral assessment, but users should understand its boundaries.

Snapshot in Time

Model behaviors may change with updates. VERRIX fingerprints reflect model versions at time of testing. Providers may update models without announcement, potentially altering bias profiles.

Prompt Sensitivity

Results are specific to the prompting strategy used. Different system prompts or user instructions may produce different bias patterns. VERRIX uses a standardized advisor framing.

Scenario Coverage

VERRIX tests specific financial scenarios. Biases may manifest differently in untested scenarios or non-financial domains. The 46 dimensions represent known behavioral economics phenomena, not all possible biases.

Temperature Effects

All trials used temperature=0.7 for realism. Higher temperatures may increase variability; lower temperatures may mask biases that appear under stochasticity. Production deployments may use different settings.
