Methodology

How VERRIX measures behavioral bias in AI financial advisors.

VERRIX does not measure what AI can do. It measures what AI systematically does — how recommendations shift when we change one variable in a scenario while holding everything else constant.

This is not a capability benchmark

A capability benchmark asks what a model can do. VERRIX instead asks whether a model's advice changes across economically equivalent scenarios that differ only in framing, presentation, or context. Systematic differences of this kind reveal behavioral biases embedded during training.

The Approach

1. Matched A/B Vignettes

Each dimension is tested with paired scenarios where exactly one variable differs between conditions. If a model gives different advice for economically identical situations, that reveals bias.
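A matched pair can be sketched as one template filled in twice, with only the manipulated variable changed. This is a minimal illustration, not the VERRIX implementation: the `VignettePair` class, the `make_pair` helper, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VignettePair:
    """Two prompts that are identical except for one manipulated variable."""
    dimension: str
    condition_a: str
    condition_b: str


def make_pair(dimension: str, template: str, variable: str,
              value_a: str, value_b: str) -> VignettePair:
    """Fill the same template twice, changing only the named variable."""
    if value_a == value_b:
        raise ValueError("the two conditions must differ in the manipulated variable")
    return VignettePair(
        dimension=dimension,
        condition_a=template.format(**{variable: value_a}),
        condition_b=template.format(**{variable: value_b}),
    )


# Hypothetical mental-accounting prompt: only the label on the money changes.
template = "A client has $10,000 from {source}. How much of it should go into equities?"
pair = make_pair("A4", template, "source", "a tax refund windfall", "long-term savings")
print(pair.condition_a)
print(pair.condition_b)
```

Because both prompts come from the same template, any difference in the model's answers can be attributed to the single substituted phrase.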

2. Effect Size Measurement

We use Cohen's h, an effect size for the difference between two proportions, to quantify how much advice differs between conditions. Values near 0 indicate no bias, values above 0.5 in magnitude indicate moderate bias, and values above 0.8 indicate strong bias.
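For reference, Cohen's h is the difference between the arcsine-transformed proportions, h = 2·arcsin(√p_A) − 2·arcsin(√p_B). The sketch below computes it and applies the thresholds stated above. The function names and the example proportions are illustrative assumptions, as is the use of the absolute value when labeling.

```python
import math


def cohens_h(p_a: float, p_b: float) -> float:
    """Cohen's h for two proportions (positive when p_a > p_b)."""
    for p in (p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))


def bias_label(h: float) -> str:
    """Label an effect size using the 0.5 and 0.8 thresholds from the text."""
    size = abs(h)  # assumption: direction is ignored when labeling
    if size > 0.8:
        return "strong"
    if size > 0.5:
        return "moderate"
    return "below the moderate threshold"


# Hypothetical rates: the model recommends "hold" in 70% of condition A runs
# and 40% of condition B runs.
h = cohens_h(0.70, 0.40)
print(round(h, 3), bias_label(h))
```

With these made-up rates the effect falls between 0.5 and 0.8, so it is labeled moderate.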

3. Behavioral Fingerprint

Each model's pattern of biases across all 46 dimensions creates a unique advice genome — revealing systematic tendencies in how it frames financial guidance.
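One way to picture a fingerprint is as a mapping from dimension code to effect size. The sketch below is purely illustrative: the effect sizes are invented, and the Euclidean distance is an assumed comparison method that the source does not specify.

```python
import math


def strongest_dimensions(fingerprint: dict[str, float], top: int = 3) -> list[str]:
    """Return the dimension codes with the largest absolute effect sizes."""
    return sorted(fingerprint, key=lambda d: abs(fingerprint[d]), reverse=True)[:top]


def fingerprint_distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance over the dimensions both fingerprints share."""
    shared = sorted(a.keys() & b.keys())
    if not shared:
        raise ValueError("fingerprints share no dimensions")
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in shared))


# Invented effect sizes for two hypothetical models.
model_x = {"A1": 0.62, "A2": 0.10, "G1": -0.85}
model_y = {"A1": 0.12, "A2": 0.10, "G1": -0.05}

print(strongest_dimensions(model_x, top=2))
print(round(fingerprint_distance(model_x, model_y), 3))
```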

46 Dimensions across 9 Clusters

VERRIX measures bias across 46 behavioral dimensions, organized into 9 thematic clusters based on the type of bias being measured. Each dimension tests a specific behavioral pattern that may affect the quality and appropriateness of financial advice, spanning core behavioral economics phenomena as well as domain-specific patterns in debt management and retirement planning.

The core advice genome, used in published research and VERRIX Genome evaluations, covers 31 of these dimensions across 7 clusters (A–G). The remaining 15 dimensions (clusters d and r) are consumer debt and retirement planning extensions from a separate pilot study.

Cluster A: Framing & Reference

Tests whether models give different advice when the same financial situation is framed differently — through gain/loss framing, anchors, mental accounts, and reference points. Based on Prospect Theory and related behavioral economics research.

A1 Loss Aversion

The difference in risk tolerance recommendations between gain-framed and loss-framed scenarios with identical expected values.

Kahneman & Tversky (1979), Prospect Theory

A2 Anchoring

The influence of irrelevant anchor values on recommendations about when to buy, sell, or hold investments.

Tversky & Kahneman (1974), Judgment under Uncertainty

A3 Certainty Preference

Preference shifts between guaranteed returns and probabilistic returns with equivalent expected value.

Kahneman & Tversky (1979), Prospect Theory

A4 Mental Accounting

Variation in risk tolerance based on whether funds are labeled as "windfall," "savings," "inheritance," etc.

Thaler (1985), Mental Accounting and Consumer Choice

A5 Endowment

Bias toward recommending "hold" for currently-owned assets versus recommending purchase of equivalent alternatives.

Thaler (1980), Toward a Positive Theory of Consumer Choice

A6 Status Quo

Preference for "no change" recommendations when presented with identical choices framed as default vs. alternative.

Samuelson & Zeckhauser (1988), Status Quo Bias in Decision Making

Cluster B: Heuristics & Biases

Examines susceptibility to cognitive shortcuts that can lead to systematic errors — including availability heuristic, recency bias, narrative fallacy, and overconfidence transmission.

B2 Availability

Influence of vivid, recent, or memorable examples on recommendations relative to base rates.

Tversky & Kahneman (1973), Availability: A Heuristic for Judging Frequency

B3 Overconfidence

The degree to which model recommendations reflect overconfident client beliefs or expert forecasts.

Fischhoff et al. (1977), Knowing with Certainty

B5 Recency

The influence of recent market movements on recommendations for long-term investment decisions.

Benartzi (2001), Excessive Extrapolation

B6 Narrative

Preference for investments with compelling stories over those with better statistical characteristics.

Taleb (2007), The Black Swan

Cluster C: Calibration

Measures how accurately models estimate probabilities and express confidence. Miscalibration can lead to either excessive risk-taking or missed opportunities.

C1 Probability Calibration

Accuracy of probability estimates for market outcomes and investment success rates.

Lichtenstein et al. (1982), Calibration of Probabilities

C2 Confidence Calibration

Correspondence between stated confidence and prediction accuracy across different domains.

Griffin & Tversky (1992), The Weighing of Evidence
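Two common ways to summarize calibration are the gap between mean stated confidence and the observed hit rate, and the Brier score (mean squared error of the probability estimates). The sketch below shows both. The source does not say which metric VERRIX uses, so these are assumptions, and the sample numbers are invented.

```python
def calibration_gap(confidences: list[float], outcomes: list[bool]) -> float:
    """Mean stated confidence minus hit rate; positive values mean overconfidence."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    mean_confidence = sum(confidences) / len(confidences)
    hit_rate = sum(1 for o in outcomes if o) / len(outcomes)
    return mean_confidence - hit_rate


def brier_score(confidences: list[float], outcomes: list[bool]) -> float:
    """Mean squared difference between stated probability and the 0/1 outcome."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - (1.0 if o else 0.0)) ** 2
               for p, o in zip(confidences, outcomes)) / len(confidences)


# Invented example: four predictions with stated confidence and whether each came true.
stated = [0.9, 0.8, 0.9, 0.7]
came_true = [True, False, True, False]
print(round(calibration_gap(stated, came_true), 3))
print(round(brier_score(stated, came_true), 4))
```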

C3 Base Rate

Weighting of population statistics versus individuating information in recommendations.

Kahneman & Tversky (1973), On the Psychology of Prediction

C4 Conjunction

Probability estimates for compound events relative to their individual components.

Tversky & Kahneman (1983), Extensional versus Intuitive Reasoning
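The conjunction rule says the probability of a compound event cannot exceed the probability of either component: P(A and B) ≤ min(P(A), P(B)). A minimal check, with a hypothetical function name and invented estimates, might look like this:

```python
def violates_conjunction_rule(p_a: float, p_b: float, p_a_and_b: float,
                              tolerance: float = 1e-9) -> bool:
    """True when the compound estimate exceeds the smaller component estimate."""
    return p_a_and_b > min(p_a, p_b) + tolerance


# Invented estimates: "rates rise" 0.30, "stocks fall" 0.60,
# "rates rise and stocks fall" 0.40. The compound estimate exceeds 0.30.
print(violates_conjunction_rule(0.30, 0.60, 0.40))
```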

C5 Regression

Predictions for subsequent performance following extreme positive or negative outcomes.

Kahneman & Tversky (1973), On the Psychology of Prediction

Cluster D: Regulatory Compliance

Assesses compliance with SEC, FINRA, FCA, and MiFID II requirements including cost disclosure, AI disclosure, suitability assessments, and jurisdictional adaptation.

D2 Cost Disclosure

Rate of unprompted cost disclosure across different product recommendations.

SEC Care Obligation, Reg BI

D3 AI Disclosure

Rate of AI self-identification and professional consultation recommendations.

SEC Robo-Advisor Guidance (2017)

D5 Jurisdiction

Variation in compliance language and recommendations across US, UK, and EU contexts.

MiFID II, FCA Consumer Duty, SEC Reg BI

Cluster E: Structural Preferences

Identifies systematic preferences for certain sectors (tech), brands (Vanguard), geographies (US), or product types (ETFs) that may not be justified by client circumstances.

E1 Tech Preference

Recommendation rates for tech stocks versus equivalent investments in other sectors.

Fairness Baseline - No normative reason for sector preferences

E2 Brand Preference

Recommendation rates for branded versus equivalent unbranded investment options.

Fairness Baseline - No normative reason for brand preferences

E3 Geography

Recommendation rates for US versus international investments with equivalent characteristics.

Fairness Baseline - No normative reason for geographic home bias

E4 Product Type

Recommendation rates for ETFs versus mutual funds with equivalent underlying exposures.

Fairness Baseline - Product type should follow client needs

Cluster F: Suitability

Tests whether models appropriately adapt recommendations to the client's stated risk tolerance, time horizon, and financial situation.

F2 Time Horizon

Variation in risk tolerance and asset allocation recommendations between short-term and long-term scenarios.

FINRA Rule 2111 Suitability

Cluster G: Consistency

Measures whether models give consistent advice for equivalent scenarios — testing for presentation order effects, question framing sensitivity, and irrelevant context influence.

G1 Order Effect

Variation in recommendations when identical options are presented in different orders.

Consistency Baseline - Order should not affect substance

G2 Frame Stability

Agreement rate between recommendations for semantically equivalent questions.

Consistency Baseline - Equivalent questions should yield equivalent answers
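An agreement rate of this kind can be sketched as the share of question pairs for which the two recommendations match. The label set ("buy", "hold", "sell") and the exact-match criterion are assumptions made for illustration.

```python
def agreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of pairs in which both phrasings produced the same recommendation."""
    if not pairs:
        raise ValueError("need at least one pair")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)


# Invented recommendations for four semantically equivalent question pairs.
observed = [("hold", "hold"), ("buy", "hold"), ("sell", "sell"), ("buy", "buy")]
print(agreement_rate(observed))
```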

G3 Context Effect

Variation in recommendations when irrelevant contextual information is added to scenarios.

Consistency Baseline - Irrelevant context should not affect recommendations

Cluster d: Consumer Debt

Evaluates advice quality across consumer debt scenarios — debt repayment priorities, consolidation strategies, snowball vs avalanche methods, mortgage prepayment tradeoffs, and credit utilization guidance.

d1 Debt Priority

Preference for faster vs slower debt payoff approaches regardless of optimal strategy.

Pilot C/E Study - Consumer Debt Advice

d2 Rate Sensitivity

How interest rate presentation affects debt payoff recommendations.

Pilot C/E Study - Consumer Debt Advice

d3 Consolidation

Bias toward recommending debt consolidation regardless of suitability.

Pilot C/E Study - Consumer Debt Advice

d4 Emergency Fund

Preference for emergency fund building vs aggressive debt repayment.

Pilot C/E Study - Consumer Debt Advice

d5 Mortgage Prepay

Systematic bias toward paying off mortgage early regardless of opportunity cost.

Pilot C/E Study - Consumer Debt Advice

d6 Debt Framing

How labeling debt as "good" or "bad" affects advice.

Pilot C/E Study - Consumer Debt Advice

d7 Payoff Method

Bias toward the snowball method (smallest balance first) versus the avalanche method (highest interest rate first).

Pilot C/E Study - Consumer Debt Advice
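The two payoff methods differ only in how debts are ordered. The sketch below makes that concrete with a hypothetical `Debt` record and invented balances and rates.

```python
from typing import NamedTuple


class Debt(NamedTuple):
    name: str
    balance: float  # outstanding balance in dollars
    apr: float      # annual percentage rate, e.g. 24.0 for 24%


def snowball_order(debts: list[Debt]) -> list[str]:
    """Snowball: pay the smallest balance first."""
    return [d.name for d in sorted(debts, key=lambda d: d.balance)]


def avalanche_order(debts: list[Debt]) -> list[str]:
    """Avalanche: pay the highest interest rate first."""
    return [d.name for d in sorted(debts, key=lambda d: d.apr, reverse=True)]


# Invented debts for illustration.
debts = [
    Debt("credit card", 4000.0, 24.0),
    Debt("car loan", 9000.0, 6.0),
    Debt("store card", 800.0, 15.0),
]
print(snowball_order(debts))
print(avalanche_order(debts))
```

Here the two methods disagree on the first target (store card versus credit card), which is the kind of scenario where a systematic preference for one method would show up.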

d8 Credit Utilization

Recommendations for credit card utilization percentages.

Pilot C/E Study - Consumer Debt Advice

d9 DTI Thresholds

How strictly the model applies conventional debt-to-income (DTI) guidelines.

Pilot C/E Study - Consumer Debt Advice

d10 Student Loans

Bias toward aggressive repayment vs utilizing forgiveness programs.

Pilot C/E Study - Consumer Debt Advice

Cluster r: Retirement Planning

Tests retirement planning advice including Social Security timing, withdrawal rate strategies, longevity risk calibration, Roth vs Traditional preferences, and healthcare cost integration.

r1 Retire Age

Reliance on age 65 or similar conventional retirement targets.