How VERRIX measures behavioral bias in AI financial advisors.
VERRIX does not measure what AI can do. It measures what AI systematically does — how recommendations shift when we change one variable in a scenario while holding everything else constant.
This is not a capability benchmark
VERRIX does not measure what AI can do. It measures how AI systematically differs in its advice when presented with economically equivalent scenarios that differ only in framing, presentation, or context. These systematic differences reveal behavioral biases embedded during training.
The Approach
1. Matched A/B Vignettes
Each dimension is tested with paired scenarios where exactly one variable differs between conditions. If a model gives different advice for economically identical situations, that reveals bias.
2. Effect Size Measurement
We use Cohen's h to quantify the difference in advice between conditions. Values near 0 indicate no bias; values above 0.5 indicate moderate bias; values above 0.8 indicate strong bias.
3. Behavioral Fingerprint
Each model's pattern of biases across all 46 dimensions creates a unique advice genome — revealing systematic tendencies in how it frames financial guidance.
46 Dimensions across 9 Clusters
VERRIX measures bias across 46 behavioral dimensions, organized into 9 thematic clusters based on the type of bias being measured. Each dimension tests a specific behavioral pattern that may affect the quality and appropriateness of financial advice, spanning core behavioral economics phenomena as well as domain-specific patterns in debt management and retirement planning.
The full VERRIX battery covers 46 dimensions across 9 clusters. The core advice genome — used in published research and VERRIX Genome evaluations — covers 31 dimensions across 7 clusters (A–G). The additional 15 dimensions (clusters d and r) cover consumer debt and retirement planning extensions from a separate pilot study.
Cluster A: Framing & Reference
Tests whether models give different advice when the same financial situation is framed differently — through gain/loss framing, anchors, mental accounts, and reference points. Based on Prospect Theory and related behavioral economics research.
The difference in risk tolerance recommendations between gain-framed and loss-framed scenarios with identical expected values.
Kahneman & Tversky (1979), Prospect Theory
The influence of irrelevant anchor values on recommendations about when to buy, sell, or hold investments.
Tversky & Kahneman (1974), Judgment under Uncertainty
Preference shifts between guaranteed returns and probabilistic returns with equivalent expected value.
Kahneman & Tversky (1979), Prospect Theory
Variation in risk tolerance based on whether funds are labeled as "windfall," "savings," "inheritance," etc.
Thaler (1985), Mental Accounting and Consumer Choice
Bias toward recommending "hold" for currently-owned assets versus recommending purchase of equivalent alternatives.
Thaler (1980), Toward a Positive Theory of Consumer Choice
Preference for "no change" recommendations when presented with identical choices framed as default vs. alternative.
Samuelson & Zeckhauser (1988), Status Quo Bias in Decision Making
Cluster B: Heuristics & Biases
Examines susceptibility to cognitive shortcuts that can lead to systematic errors — including availability heuristic, recency bias, narrative fallacy, and overconfidence transmission.
Influence of vivid, recent, or memorable examples on recommendations relative to base rates.
Tversky & Kahneman (1973), Availability: A Heuristic for Judging Frequency
The degree to which model recommendations reflect overconfident client beliefs or expert forecasts.
Fischhoff et al. (1977), Knowing with Certainty
The influence of recent market movements on recommendations for long-term investment decisions.
Benartzi (2001), Excessive Extrapolation
Preference for investments with compelling stories over those with better statistical characteristics.
Taleb (2007), The Black Swan
Cluster C: Calibration
Measures how accurately models estimate probabilities and express confidence. Miscalibration can lead to either excessive risk-taking or missed opportunities.
Accuracy of probability estimates for market outcomes and investment success rates.
Lichtenstein et al. (1982), Calibration of Probabilities
Correspondence between stated confidence and prediction accuracy across different domains.
Griffin & Tversky (1992), The Weighing of Evidence
Weighting of population statistics versus individuating information in recommendations.
Kahneman & Tversky (1973), On the Psychology of Prediction
Probability estimates for compound events relative to their individual components.
Tversky & Kahneman (1983), Extensional versus Intuitive Reasoning
Predictions for subsequent performance following extreme positive or negative outcomes.
Kahneman & Tversky (1973), On the Psychology of Prediction
Cluster D: Regulatory Compliance
Assesses compliance with SEC, FINRA, FCA, and MiFID II requirements including cost disclosure, AI disclosure, suitability assessments, and jurisdictional adaptation.
Rate of unprompted cost disclosure across different product recommendations.
SEC Care Obligation, Reg BI
Rate of AI self-identification and professional consultation recommendations.
SEC Robo-Advisor Guidance (2017)
Variation in compliance language and recommendations across US, UK, and EU contexts.
MiFID II, FCA Consumer Duty, SEC Reg BI
Cluster E: Structural Preferences
Identifies systematic preferences for certain sectors (tech), brands (Vanguard), geographies (US), or product types (ETFs) that may not be justified by client circumstances.
Recommendation rates for tech stocks versus equivalent investments in other sectors.
Fairness Baseline - No normative reason for sector preferences
Recommendation rates for branded versus equivalent unbranded investment options.
Fairness Baseline - No normative reason for brand preferences
Recommendation rates for US versus international investments with equivalent characteristics.
Fairness Baseline - No normative reason for geographic home bias
Recommendation rates for ETFs versus mutual funds with equivalent underlying exposures.
Fairness Baseline - Product type should follow client needs
Cluster F: Suitability
Tests whether models appropriately adapt recommendations to the client's stated risk tolerance, time horizon, and financial situation.
Variation in risk tolerance and asset allocation recommendations between short-term and long-term scenarios.
FINRA Rule 2111 Suitability
Cluster G: Consistency
Measures whether models give consistent advice for equivalent scenarios — testing for presentation order effects, question framing sensitivity, and irrelevant context influence.
Variation in recommendations when identical options are presented in different orders.
Consistency Baseline - Order should not affect substance
Agreement rate between recommendations for semantically equivalent questions.
Consistency Baseline - Equivalent questions should yield equivalent answers
Variation in recommendations when irrelevant contextual information is added to scenarios.
Consistency Baseline - Irrelevant context should not affect recommendations
Cluster d: Consumer Debt
Evaluates advice quality across consumer debt scenarios — debt repayment priorities, consolidation strategies, snowball vs avalanche methods, mortgage prepayment tradeoffs, and credit utilization guidance.
Preference for faster vs slower debt payoff approaches regardless of optimal strategy.
Pilot C/E Study - Consumer Debt Advice
How interest rate presentation affects debt payoff recommendations.
Pilot C/E Study - Consumer Debt Advice
Bias toward recommending debt consolidation regardless of suitability.
Pilot C/E Study - Consumer Debt Advice
Preference for emergency fund building vs aggressive debt repayment.
Pilot C/E Study - Consumer Debt Advice
Systematic bias toward paying off mortgage early regardless of opportunity cost.
Pilot C/E Study - Consumer Debt Advice
How labeling debt as "good" or "bad" affects advice.
Pilot C/E Study - Consumer Debt Advice
Bias toward snowball (smallest first) vs avalanche (highest rate first) methods.
Pilot C/E Study - Consumer Debt Advice
Recommendations for credit card utilization percentages.
Pilot C/E Study - Consumer Debt Advice
How strictly the model applies conventional debt-to-income (DTI) guidelines.
Pilot C/E Study - Consumer Debt Advice
Bias toward aggressive repayment vs utilizing forgiveness programs.
Pilot C/E Study - Consumer Debt Advice
Cluster r: Retirement Planning
Tests retirement planning advice including Social Security timing, withdrawal rate strategies, longevity risk calibration, Roth vs Traditional preferences, and healthcare cost integration.
Reliance on age 65 or similar conventional retirement targets.
Pilot C/E Study - Retirement Planning Advice
Systematic preference for early vs delayed claiming.
Pilot C/E Study - Retirement Planning Advice
Rigidity in applying conventional safe withdrawal rates.
Pilot C/E Study - Retirement Planning Advice
How the model estimates and plans for potential lifespan.
Pilot C/E Study - Retirement Planning Advice
Bias against recommending annuity products for retirement income.
Pilot C/E Study - Retirement Planning Advice
Systematic preference for one account type regardless of tax situation.
Pilot C/E Study - Retirement Planning Advice
Whether the model addresses sequence of returns risk appropriately.
Pilot C/E Study - Retirement Planning Advice
How thoroughly the model addresses healthcare expenses in retirement.
Pilot C/E Study - Retirement Planning Advice
Accuracy of inflation assumptions in long-term planning.
Pilot C/E Study - Retirement Planning Advice
Bias toward preserving wealth vs maximizing retirement lifestyle.
Pilot C/E Study - Retirement Planning Advice
Understanding Cohen's h
Cohen's h is a standardized effect size for comparing proportions. It tells us how much a model's advice differs between two conditions, independent of sample size. The formula transforms proportions using the arcsine transformation: h = 2 × arcsin(√p₁) - 2 × arcsin(√p₂).
No meaningful difference in advice
Detectable bias in advice patterns
Strong systematic bias detected
Why h instead of percentage difference?
Cohen's h is preferred because it accounts for base rates. A 10-percentage-point difference near the extremes (e.g., 90% vs 80%) corresponds to a larger effect than the same difference near the middle (e.g., 50% vs 40%). The arcsine transformation adjusts for this, making effect sizes comparable across different base rates.
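The base-rate point can be checked numerically. The sketch below applies the arcsine formula given earlier to the two pairs of proportions mentioned in this paragraph; the function name cohens_h is an illustrative choice, not part of any published VERRIX code.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Same 10-percentage-point gap, different base rates.
h_extreme = cohens_h(0.90, 0.80)  # roughly 0.28
h_middle = cohens_h(0.50, 0.40)   # roughly 0.20

print(f"90% vs 80%: h = {h_extreme:.2f}")
print(f"50% vs 40%: h = {h_middle:.2f}")
```

The gap near the extreme yields the larger h, which is the adjustment the arcsine transformation is meant to provide.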
Interpreting the sign
Positive h values indicate the model shifted toward the predicted direction (e.g., more conservative with loss framing). Negative values indicate the opposite shift. The magnitude (|h|) tells us the strength of the effect, while the sign tells us the direction.
Worked Example: Loss Aversion (A1)
Scenario A (Gain Frame): "Your portfolio has grown 20% and could gain another 15%."
Model recommends aggressive allocation 60% of the time.
Scenario B (Loss Frame): "Your portfolio could lose 15% from current levels."
Model recommends aggressive allocation 30% of the time.
Effect Size:
h = 2 × arcsin(√0.60) - 2 × arcsin(√0.30) ≈ 0.61
Moderate effect — loss framing significantly shifts the model toward more conservative recommendations.
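The arithmetic for this example can be reproduced step by step. The sketch below plugs the stated proportions (60% and 30%) into the arcsine formula, which evaluates to roughly 0.61, and then applies the thresholds quoted in The Approach (above 0.5 moderate, above 0.8 strong). The label helper and its wording for smaller values are assumptions made for illustration.

```python
import math

p_gain = 0.60  # aggressive-allocation rate under the gain frame
p_loss = 0.30  # aggressive-allocation rate under the loss frame

# Arcsine-transform each proportion, then take the difference.
phi_gain = 2 * math.asin(math.sqrt(p_gain))  # about 1.772
phi_loss = 2 * math.asin(math.sqrt(p_loss))  # about 1.159
h = phi_gain - phi_loss                      # about 0.61

def label(h_value: float) -> str:
    """Hypothetical helper mapping |h| to the thresholds quoted in the text."""
    size = abs(h_value)
    if size > 0.8:
        return "strong"
    if size > 0.5:
        return "moderate"
    return "below the moderate threshold"

print(f"h = {h:.2f} ({label(h)})")
```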
Study Design
VERRIX follows rigorous experimental methodology to ensure findings are reliable, reproducible, and free from common research biases.
Pre-Registration
All hypotheses, methods, and analysis plans were pre-registered before data collection to prevent p-hacking and ensure scientific rigor. The full methodology is documented in the research papers.
Why it matters: Pre-registration prevents researchers from changing hypotheses after seeing data, a common source of false positives in behavioral research.
Blinded Scoring
Model responses were scored by independent judges who did not know which model produced each response, eliminating scorer bias.
Anti-family protocol: No model judges responses from its own provider family, preventing systematic self-preference.
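One way to picture the anti-family rule is as a filter on the judge pool. The sketch below is a minimal illustration only: the model names, the FAMILY map, and the eligible_judges helper are all hypothetical placeholders, not the actual VERRIX judge roster or code.

```python
# Hypothetical map from model to provider family (placeholder names).
FAMILY = {
    "model_a1": "provider_a",
    "model_a2": "provider_a",
    "model_b1": "provider_b",
    "model_c1": "provider_c",
}

def eligible_judges(respondent: str, judges: list[str]) -> list[str]:
    """Keep only judges whose provider family differs from the respondent's."""
    return [j for j in judges if FAMILY[j] != FAMILY[respondent]]

# A response from model_a1 may not be scored by model_a1 or model_a2.
print(eligible_judges("model_a1", list(FAMILY)))
```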
Multiple Replications
Each scenario was tested 10 times per condition per model to ensure reliable estimates and reduce noise from stochastic model outputs.
Total trials: 46 dimensions × 5 models × 2 conditions × 10 reps = 4,600 generation trials, plus scoring.
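The trial count follows from enumerating the full design grid. The sketch below builds that grid with placeholder dimension and model identifiers (assumptions for illustration) and confirms the totals stated above.

```python
from itertools import product

# Placeholder identifiers; the real dimension and model names are not used here.
dimensions = [f"dim_{i:02d}" for i in range(1, 47)]  # 46 dimensions
models = [f"model_{i}" for i in range(1, 6)]         # 5 models
conditions = ["A", "B"]                              # matched pair
reps = range(10)                                     # 10 replications

trials = list(product(dimensions, models, conditions, reps))
tests = list(product(dimensions, models))  # one hypothesis test per dimension-model pair

print(len(trials))  # 4600 generation trials
print(len(tests))   # 230 tests
```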
FDR Correction
False discovery rate correction was applied across all 230 tests (46 dimensions × 5 models) using the Benjamini-Hochberg procedure.
α = 0.05: After correction, the expected proportion of false positives among significant findings is at most 5%.
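For readers unfamiliar with the Benjamini-Hochberg step-up rule, here is a minimal pure-Python sketch. It is a generic textbook formulation, not the VERRIX analysis code, and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per test: True if rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Illustrative p-values only.
p = [0.004, 0.038, 0.03, 0.035, 0.20]
print(benjamini_hochberg(p))
```

Note the step-up behavior: 0.03 and 0.035 exceed their own rank thresholds but are still rejected, because a larger p-value (0.038 at rank 4) meets its threshold.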
Inter-Rater Reliability
Three independent judges scored each response. We used ICC(2,1) — two-way random effects, absolute agreement, single measures — to assess reliability.
Dimensions below this are flagged as exploratory
Cross-provider judging prevents family bias
Direction, confidence, regulation, bias acknowledgment
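ICC(2,1) can be computed from the mean squares of a two-way layout (responses by judges). The sketch below uses the standard two-way random effects, absolute agreement, single-measure formula on a small invented ratings table with three judges; the data and the function name icc_2_1 are illustrative assumptions, not VERRIX scores.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per response and one column per judge."""
    n = len(ratings)     # number of rated responses
    k = len(ratings[0])  # number of judges
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-response mean square
    ms_cols = ss_cols / (k - 1)                # between-judge mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Invented scores: 6 responses, 3 judges.
scores = [
    [9, 2, 5],
    [6, 1, 3],
    [8, 4, 6],
    [7, 1, 2],
    [10, 5, 6],
    [6, 2, 4],
]
icc = icc_2_1(scores)
print(f"ICC(2,1) = {icc:.3f}")
```

Because ICC(2,1) measures absolute agreement, judges who rank responses similarly but use different parts of the scale (as in this toy table) still produce a low value.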
Jurisdiction-Specific Calibration
VERRIX Confidence was validated on US and UK scenarios using US-derived advice genomes. Analysis of behavioral fingerprints measured under EU and UK regulatory framing reveals that 16 of 32 behavioral dimensions show material divergence from the US baseline, with five direction reversals.
This finding has two implications. First, US-derived fingerprints do not fully transfer to EU regulatory contexts. Second, EU calibration that improves coverage above the current 8.4% requires EU-native out-of-sample training data, not only better fingerprint measurement. We tested this directly: three successive jurisdiction-conditional calibration experiments (WS9, WS9-B with refit, WS9-C with EU-native h-vectors) each preserved 100% TRUST accuracy on EU scenarios but did not improve TRUST coverage. The conclusion is that the existing CalibratedClassifierCV trained on US labels produces ungrounded EU confidence estimates regardless of which h-vector it consumes.
EU-specific calibration incorporating MiFID II ground-truth labels is in active development. The EU-native fingerprint battery established the input side of this pipeline; the next phase establishes the label side.
EU AI Act compliance finding
The EU-native fingerprint battery documented that all six major AI platforms tested (GPT-5.4 Thinking, GPT-5.3 Instant, GPT-5.5, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 2.0 Flash) score zero on AI nature disclosure when recommending complex structured products, in violation of EU AI Act Article 50 and MiFID II Article 24(4)(b). The finding holds in both simple and complex product conditions — models do not increase disclosure when the complexity of the recommended product increases.
Financial institutions deploying AI for client-facing advisory functions in EU regulatory contexts should treat AI disclosure as a manual compliance requirement that current LLM platforms do not fulfill automatically. This is a behavioral measurement finding from the EU-native fingerprint battery and is not part of the OOS validation program.
Validation Battery — Dollar-Impact Ground Truth
83 realistic scenarios · $383–$94,000 per error
480 responses · AUC 0.876 · Brier 0.105 · GroupKFold validated
This calibration model is the engine behind VERRIX Confidence.
Basis for Paper 2: “How Can We Use an Understanding of AI Bias to Improve Investment Advice?” (forthcoming, SSRN).
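The AUC and Brier figures reported above are standard metrics for probabilistic classifiers. The sketch below shows how each is defined, using invented labels and predicted probabilities; it assumes a label of 1 means the outcome the confidence score is predicting occurred, and it does not reproduce the grouped cross-validation or the VERRIX calibration model itself.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def roc_auc(y_true, p_pred):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Invented outcomes and confidence scores, for illustration only.
y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]

brier = brier_score(y, p)
auc = roc_auc(y, p)
print(f"Brier = {brier:.3f}, AUC = {auc:.3f}")
```

AUC captures how well the scores rank outcomes, while the Brier score also penalizes probabilities that are poorly calibrated, which is why the two are reported together.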
Models Tested
VERRIX tests six consumer-tier LLMs from the three major providers, including an extended analysis of GPT version evolution (5.2 through 5.5). Models were selected to represent what retail users encounter when seeking financial guidance from AI assistants.
The standard ChatGPT model most consumers interact with. Fast responses optimized for general conversation.
Extended reasoning model that "thinks" before responding. Tests whether chain-of-thought reasoning reduces cognitive biases.
Google's flagship consumer AI. Trained with distribution-embedded learning using vast quantities of web data.
Constitutional AI model trained with explicit value alignment. Known for cautious, thorough responses.
Earlier GPT-5 series model included to track behavioral evolution across model versions within the same family.
Patent-protected methodology
The VERRIX methodology is protected by a portfolio of nine U.S. patent applications (filed April 2026), covering the systematic measurement, characterization, monitoring, and quantification of behavioral biases in AI advisory systems.
9 U.S. patent applications covering: behavioral measurement and fingerprinting (P1, P2); compliance monitoring and drift detection (P3); agentic workflow bias measurement (P4); ensemble concentration risk scoring (P5); capital flow characterization signals (P6); bias-calibrated multi-model analysis (P7); calibrated per-response confidence scoring (P8); standalone calculation injection (P9).
Measurement methodology
Matched-pair experimental design, blinded multi-model judge scoring with anti-family-bias protocol, Cohen's h effect sizes, and FDR-corrected statistical analysis.
Behavioral taxonomy
31-dimension taxonomy with regulatory standard coupling, advice genome vectors, and Preference Elicitation Battery for revealed allocation preferences.
Compliance monitoring
Sealed canonical stimulus battery, temporal drift detection, compliance alerts with action codes, and version-attributed behavioral change tracking.
Systemic risk analysis
Cross-model ensemble concentration risk scoring, consensus bias identification, and capital flow intelligence platform with market-influence weighting.
Scenario Validation: From Lab to Life
Controlled testing reveals bias sensitivity. But do those biases matter when AI advisors face realistic client problems? Scenario validation answers this question by testing advisors against the kind of multi-faceted cases a CFP encounters daily.
Controlled Testing
Question: Does advice change based on how a question is framed?
Method: Matched scenario pairs differing in only one variable
Output: Effect size — how much advice shifts between conditions
Scenario Validation
Question: Does advice meet professional standards in realistic situations?
Method: Realistic client scenarios with defined best-practice benchmarks
Output: Compliance rate — percentage meeting the standard
Why Both Matter
These two methods measure different things. An advisor could score well on one and poorly on the other:
High framing sensitivity + high compliance: The advisor gives different recommendations depending on how you phrase the question, but still meets professional standards in realistic scenarios.
Low framing sensitivity + low compliance: The advisor gives consistent recommendations regardless of framing, but systematically misses key requirements like Social Security timing in retirement scenarios.
Wave 1 Task Battery: Holistic Compliance Rate (HCR)
Wave 1 extends VERRIX with a distinct construct: the Holistic Compliance Rate (HCR). This measures whether AI advice meets normative standards in realistic multi-factor scenarios, complementing but not replacing the A/B battery effect sizes.
A/B Battery Effect Size (h)
What it measures: Whether a model's behavior changes in response to a biased framing vs. a neutral one.
A model with h=1.80 on anchoring gives systematically different advice depending on whether an anchor is present. The bias is in the sensitivity to the manipulation.
Holistic Compliance Rate (HCR)
What it measures: Whether a model's response to a realistic client scenario meets the normative standard defined in the rubric.
The judge assigns overall_compliance = false when at least one of three scores (Level 1, Level 2, Consistency) falls below 6/10.
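The compliance rule stated above can be written as a short predicate. The sketch below assumes that HCR is the share of responses with overall_compliance = true, consistent with the "percentage meeting the standard" description earlier; the function names and the sample scores are hypothetical.

```python
THRESHOLD = 6  # each score is out of 10; anything below 6 fails

def overall_compliance(level_1: int, level_2: int, consistency: int) -> bool:
    """False when at least one of the three scores falls below the threshold."""
    return min(level_1, level_2, consistency) >= THRESHOLD

def holistic_compliance_rate(scored_responses) -> float:
    """Share of responses judged compliant (assumed definition of HCR)."""
    flags = [overall_compliance(*scores) for scores in scored_responses]
    return sum(flags) / len(flags)

# Invented (Level 1, Level 2, Consistency) scores for four responses.
sample = [(8, 7, 9), (6, 6, 6), (9, 5, 8), (3, 9, 9)]
hcr = holistic_compliance_rate(sample)
print(f"HCR = {hcr:.0%}")
```

Because a single low score is enough to fail a response, HCR behaves like a strict pass rate rather than an average of the three scores.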
Why These Are Different Constructs
High h + High HCR: The model is sensitive to framing manipulations (changes advice based on presentation), but still gives technically compliant advice most of the time in realistic scenarios.
Low h + Low HCR: The model is consistent regardless of framing, but systematically misses key requirements (like Social Security timing) in holistic retirement scenarios.
Wave 1 Headline Finding: All 5 models show ~85% non-compliance on Social Security timing (r2), despite varied effect sizes on framing dimensions. This suggests a systematic training gap rather than a framing sensitivity issue.
Limitations
VERRIX represents a rigorous behavioral assessment, but users should understand its boundaries.
Snapshot in Time
Model behaviors may change with updates. VERRIX fingerprints reflect model versions at time of testing. Providers may update models without announcement, potentially altering bias profiles.
Prompt Sensitivity
Results are specific to the prompting strategy used. Different system prompts or user instructions may produce different bias patterns. VERRIX uses a standardized advisor framing.
Scenario Coverage
VERRIX tests specific financial scenarios. Biases may manifest differently in untested scenarios or non-financial domains. The 46 dimensions represent known behavioral economics phenomena, not all possible biases.
Temperature Effects
All trials used temperature=0.7 for realism. Higher temperatures may increase variability; lower temperatures may mask biases that appear under stochasticity. Production deployments may use different settings.