
Compare Models

Compare the advice genomes of different AI financial advisors side by side. Select 2-4 models to visualize how their advice patterns differ across all VERRIX dimensions.

How to read this comparison

  • Radar Chart: Shows each model's normalized bias profile. Center = no bias, outer edge = strong effect.
  • Divergence Table: Highlights dimensions where selected models differ most — key decision factors when choosing between them.
  • Cluster Summary: Aggregate view of how models differ across thematic categories of bias.

Key cross-model findings

No model is uniformly best

Each model shows distinct strengths and weaknesses across the 46 dimensions. Optimal model selection depends on which biases are most critical for your use case.

Family effects are real

Models from the same provider tend to cluster together on several dimensions, suggesting training data and RLHF approaches create family-level bias signatures.

Maximum divergence on framing

Cluster A (Framing and Reference Dependence) shows the highest cross-model divergence, meaning advice can differ substantially based solely on model selection when framing effects are present.

Compliance varies widely

Cluster D (Regulatory Compliance) shows divergence up to 1.47 between models on AI disclosure, with Claude and GPT-5.4 leading and older models lagging.

Interpreting effect sizes (Cohen's h)

  • Negligible (|h| < 0.20): No practical difference between conditions.
  • Small (0.20 ≤ |h| < 0.50): Detectable bias; may affect edge cases.
  • Medium (0.50 ≤ |h| < 0.80; shown as "Moderate" in the divergence table): Meaningful bias; likely affects advice quality.
  • Large (|h| ≥ 0.80): Strong systematic bias; significant concern.
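Cohen's h measures the gap between two proportions on an arcsine-square-root scale: h = 2·asin(√p1) − 2·asin(√p2). The page does not say which proportions VERRIX compares, so the sketch below assumes p1 and p2 are the share of responses showing a target behavior under two conditions (for example, a loss-framed versus a gain-framed prompt), then buckets |h| with the thresholds listed above. The function names `cohens_h` and `effect_label` are hypothetical, not part of any VERRIX API.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Signed Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"proportion must be in [0, 1], got {p}")
    # The arcsine-square-root transform makes equal values of h roughly
    # equally detectable regardless of the baseline proportion.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def effect_label(h: float) -> str:
    """Bucket |h| using the legend's thresholds (no 'Very Large' tier here)."""
    magnitude = abs(h)
    if magnitude < 0.20:
        return "Negligible"
    if magnitude < 0.50:
        return "Small"
    if magnitude < 0.80:
        return "Medium"
    return "Large"


# Hypothetical rates: 70% of loss-framed prompts vs 40% of gain-framed prompts
# produce the same conservative recommendation.
h = cohens_h(0.70, 0.40)
print(f"h = {h:+.2f} ({effect_label(h)})")  # h = +0.61 (Medium)
```

The tables on this page also use a "Very Large" label whose cut-off is not stated in the legend, so the sketch stops at "Large" rather than guessing a threshold.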


Behavioral Fingerprint Comparison

Hover over any point on the radar to see detailed dimension information and effect sizes. The shape of each trace reveals the model's characteristic bias signature.

[Radar chart: one axis per dimension (A1-A6, B2, B3, B5, B6, C1-C5, D2, D3, D5, E1-E4, F2, G1-G3, d1-d10, r1-r10). Clusters: A: Framing & Reference; B: Heuristics & Biases; C: Calibration; D: Regulatory Compliance; E: Structural Preferences; F: Suitability; G: Consistency; d: Consumer Debt; r: Retirement Planning.]

Reading the Radar

  • Center (50): No systematic bias detected (h ≈ 0).
  • Middle ring (25-75): Small to moderate effects (0.20 ≤ |h| < 0.80).
  • Outer edge (0 or 100): Large systematic bias (|h| ≥ 0.80).

Largest Divergences

Dimensions where selected models differ most in their behavioral tendencies. These divergence points are the most important factors when deciding between models.


Cluster                 GPT-5.3 Instant     Claude Sonnet 4.6   Divergence
Structural Preferences  -0.28 (Small)       +2.07 (Very Large)  2.36
Framing & Reference     +0.80 (Large)       -0.13 (Negligible)  0.94
Regulatory Compliance   +0.43 (Small)       -0.33 (Small)       0.76
Framing & Reference     +0.88 (Large)       +0.12 (Negligible)  0.76
Framing & Reference     +1.77 (Very Large)  +1.02 (Large)       0.75
Consistency             -0.71 (Moderate)    -1.33 (Very Large)  0.62
Structural Preferences  +0.00 (Negligible)  +0.62 (Moderate)    0.62
Calibration             -0.37 (Small)       -0.98 (Large)       0.61
Calibration             +0.34 (Small)       -0.22 (Small)       0.56
Framing & Reference     -0.45 (Small)       +0.08 (Negligible)  0.53
Understanding divergence: The divergence score is the absolute difference between the highest and lowest effect sizes across the selected models. A divergence above 0.5 indicates that the models will give noticeably different advice on this dimension; above 1.0, their approaches differ substantially.
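The divergence rule above can be sketched in a few lines. This assumes each model's results are available as a mapping from dimension to signed Cohen's h; the keys `row_1` to `row_3` are hypothetical placeholders for the first three table rows (the page does not show their dimension codes), while the h values are copied from the table. The function names are illustrative only.

```python
from typing import Dict, List, Tuple

# Signed Cohen's h per dimension for each selected model.
# Keys are hypothetical placeholders; values come from the table above.
profiles: Dict[str, Dict[str, float]] = {
    "GPT-5.3 Instant": {"row_1": -0.28, "row_2": 0.80, "row_3": 0.43},
    "Claude Sonnet 4.6": {"row_1": 2.07, "row_2": -0.13, "row_3": -0.33},
}


def divergence(values: List[float]) -> float:
    """Gap between the highest and lowest effect size across models."""
    return max(values) - min(values)


def divergence_level(score: float) -> str:
    """Thresholds from the text; the wording of the lowest tier is our own."""
    if score > 1.0:
        return "substantially different approaches"
    if score > 0.5:
        return "noticeably different advice"
    return "below the stated thresholds"


def rank_divergences(
    profiles: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Divergence for every dimension that all models share, largest first."""
    shared = set.intersection(*(set(scores) for scores in profiles.values()))
    ranked = [
        (dim, divergence([scores[dim] for scores in profiles.values()]))
        for dim in sorted(shared)
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


for dim, score in rank_divergences(profiles):
    print(f"{dim}: {score:.2f} ({divergence_level(score)})")
```

With these rounded inputs the first two rows come out as 2.35 and 0.93 rather than the displayed 2.36 and 0.94, presumably because the page computes divergence from unrounded effect sizes. Taking max minus min keeps the definition meaningful for 3 or 4 selected models, not just pairs.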

Strategic model pairing

When using multiple models for second opinions or ensemble approaches, pair models with complementary bias profiles for maximum independent signal.

Recommended pairings

GPT-5.3 + Claude Sonnet (high independence)

Best for framing-sensitive scenarios. Claude's resistance to loss aversion (A1: h = 0.28) complements GPT-5.3's framing susceptibility on the same dimension (A1: h = 0.89).

Gemini Flash + GPT-5.4 (high independence)

Best for heuristic-prone queries. GPT-5.4's extended reasoning counters Gemini's availability bias (B1: h=1.12).

Low-value pairings

GPT-5.3 + GPT-5.4 (correlated biases)

Same family, similar training data. High correlation on Clusters A and E means second opinion adds little new information.

Same-provider models (family effect)

RLHF training on similar feedback creates shared blind spots. Cross-provider pairing provides more signal.
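The page does not say how "independence" or "correlated biases" is measured. One simple proxy, assumed here rather than taken from VERRIX, is the Pearson correlation between two models' signed effect-size profiles over the dimensions both were scored on: values near +1 suggest a second opinion largely repeats the first model's tendencies, while values near zero or below suggest the pair errs in different directions. The function name and all profile numbers are hypothetical.

```python
import math
from typing import Dict


def profile_correlation(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Pearson correlation of two models' signed effect sizes on shared dimensions."""
    dims = sorted(set(a) & set(b))
    if len(dims) < 2:
        raise ValueError("need at least two shared dimensions")
    xs = [a[d] for d in dims]
    ys = [b[d] for d in dims]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Made-up profiles for illustration only, not VERRIX measurements.
model_a = {"dim_1": 0.9, "dim_2": 0.5, "dim_3": 0.3, "dim_4": 1.2}
model_b = {"dim_1": 0.8, "dim_2": 0.6, "dim_3": 0.2, "dim_4": 1.1}   # similar shape
model_c = {"dim_1": 0.2, "dim_2": 0.7, "dim_3": 0.9, "dim_4": -0.1}  # different shape

print(f"a vs b: {profile_correlation(model_a, model_b):+.2f}")  # high: little new signal
print(f"a vs c: {profile_correlation(model_a, model_c):+.2f}")  # negative: complementary
```

Correlation ignores overall magnitude, so two models with the same bias shape but different strengths still count as correlated; divergence on individual dimensions remains the better guide for a specific query type.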

Cluster-Level Summary

How do the selected models compare across the major categories of behavioral bias? Higher divergence indicates more disagreement between models in that category.

Cluster A

Framing & Reference

How models respond to gain/loss framing, anchors, and reference points

6 dimensions · Avg: 0.52
Most divergent: A4 (Mental Accounting)

Cluster B

Heuristics & Biases

Susceptibility to cognitive shortcuts like availability and recency

4 dimensions · Avg: 0.34
Most divergent: B5 (Recency)

Cluster C

Calibration

Accuracy of probability estimates and confidence levels

5 dimensions · Avg: 0.29
Most divergent: C4 (Conjunction)

Cluster D

Regulatory Compliance

Adherence to regulatory disclosure and suitability requirements

3 dimensions · Avg: 0.29
Most divergent: D5 (Jurisdiction)

Cluster E

Structural Preferences

Systematic preferences for certain sectors, brands, or geographies

4 dimensions · Avg: 0.83
Most divergent: E4 (Product Type)

Cluster F

Suitability

Adaptation to client risk tolerance and time horizon

1 dimension · Avg: 0.00

Cluster G

Consistency

Consistency of advice across equivalent scenarios

3 dimensions · Avg: 0.33
Most divergent: G3 (Context Effect)

Cluster d

Consumer Debt

Consumer debt management strategies and repayment recommendations

10 dimensions · Avg: 0.00

Cluster r

Retirement Planning

Retirement planning decisions including Social Security and withdrawal strategies

10 dimensions · Avg: 0.00
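The page does not define "Avg"; it appears to be the mean per-dimension divergence within each cluster, where a dimension's cluster is the leading character of its code (so "d" for Consumer Debt is distinct from "D" for Regulatory Compliance). A minimal sketch under that assumption, which also picks each cluster's most divergent dimension; the function name and the input scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Optional, Tuple


def summarize_clusters(
    divergences: Dict[str, float],
) -> Dict[str, Tuple[int, float, Optional[str]]]:
    """Map each cluster code to (dimension count, mean divergence, most divergent dimension)."""
    by_cluster: Dict[str, Dict[str, float]] = defaultdict(dict)
    for dim, score in divergences.items():
        # Cluster = first character of the code; case matters ("d" vs "D").
        by_cluster[dim[0]][dim] = score
    summary = {}
    for cluster, dims in sorted(by_cluster.items()):
        top = max(dims, key=lambda d: dims[d])
        # Like the F, d, and r cards above, name no "most divergent"
        # dimension when every score in the cluster is zero.
        summary[cluster] = (len(dims), mean(list(dims.values())), top if dims[top] > 0 else None)
    return summary


# Hypothetical per-dimension divergence scores, not measured results.
scores = {"A1": 0.20, "A4": 0.90, "B5": 0.40, "B6": 0.10, "D5": 0.60, "d1": 0.0, "d2": 0.0}
for cluster, (n, avg, top) in summarize_clusters(scores).items():
    print(f"Cluster {cluster}: {n} dimensions, Avg: {avg:.2f}, most divergent: {top or 'n/a'}")
```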