
GPT Model Evolution

How behavioral biases change across GPT-5.2, GPT-5.3 Instant, GPT-5.4 Thinking, and the just-released GPT-5.5. Understanding model lineage helps predict behavioral trade-offs.

Models compared: 4
Dimensions tracked: 46
Monotonic improvements: 7
Critical alerts: 17

Model Lineage

GPT-5.2 (December 2025): Instant + Thinking + Pro variants
GPT-5.3 Instant (March 3, 2026): Speed Track, successor to 5.2 Instant
GPT-5.5 (April 25, 2026): Speed Track, successor to 5.3 Instant
GPT-5.4 Thinking (March 5, 2026): Reasoning Track, successor to 5.2 Thinking

Speed Track (5.3 Instant → 5.5) and Reasoning Track (5.4 Thinking) are parallel branches from GPT-5.2, not sequential versions of each other.
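
The lineage above can be written down as a small parent map, which makes the "parallel branches" point easy to check. This is only an illustrative sketch: the `PARENT` dictionary and the `ancestors` and `is_successor` helpers are hypothetical names, and GPT-5.2's variants are collapsed into a single root node for simplicity.

```python
# Hypothetical parent map built from the lineage listed above.
# GPT-5.2's Instant/Thinking/Pro variants are collapsed into one root node.
PARENT = {
    "GPT-5.2": None,
    "GPT-5.3 Instant": "GPT-5.2",
    "GPT-5.4 Thinking": "GPT-5.2",
    "GPT-5.5": "GPT-5.3 Instant",
}


def ancestors(model: str) -> list[str]:
    """Return the chain of predecessors of `model`, nearest first."""
    chain = []
    parent = PARENT[model]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_successor(later: str, earlier: str) -> bool:
    """True if `earlier` appears anywhere in `later`'s predecessor chain."""
    return earlier in ancestors(later)


print(ancestors("GPT-5.5"))
print(is_successor("GPT-5.4 Thinking", "GPT-5.3 Instant"))
```

Under this map, GPT-5.4 Thinking and GPT-5.3 Instant share GPT-5.2 as an ancestor, but neither is a successor of the other.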

Generation walkthrough

Four generations in one continuous scroll

GPT-5.2 (Code Red Release), December 2025
Status Quo: h = 1.69
Battery: no value listed
Resistant to recency. Strong status quo, strong home bias.

GPT-5.3 Instant (Speed Track), March 3, 2026
Anchoring: h = 0.80
Battery: 43.5%
Direct, decisive. Anchors hard to mentioned numbers.

GPT-5.4 Thinking (Reasoning Track), March 5, 2026
Anchoring: h = 1.80
Battery: 37.6%
Extended reasoning. Reduces some biases, amplifies others.

GPT-5.5 (Release-day fingerprint), April 25, 2026
24 of 26 dims: h ≈ 0
Battery: 97.6%
h ≈ 0 across nearly every dimension. The Invariant.

GPT-5.5 — fingerprinted on release day

April 25, 2026

GPT-5.5 returns h ≈ 0 across 24 of 26 dimensions — not because it failed to respond, but because it responded with equal quality (~9/10 in both conditions) regardless of frame, anchor, or order. The model appears to have substantially overcome the framing and anchoring effects that characterised its predecessors.

Only C4 (base rate usage, h = −0.26) and C5 (evidence updating, h = +0.26) show any detectable condition sensitivity. Both are small effects: just above the conventional small-effect threshold of 0.20 and well below the medium-effect threshold of 0.50. These two non-zero results matter: if every dimension were exactly zero it would look like a floor effect, but the small C-cluster signal confirms the measurement is sensitive enough to detect real effects when they exist.
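
The page does not define h. A minimal sketch, assuming it is Cohen's h for the difference between two proportions, h = 2·arcsin(√p1) − 2·arcsin(√p2), with the conventional cut-offs of 0.2 (small), 0.5 (medium), and 0.8 (large). The function names `cohens_h` and `effect_label` are hypothetical. Under this assumption the largest possible magnitude is π ≈ 3.14, which coincides with the value listed for the Suitability cluster; that is consistent with the Cohen's h reading but does not confirm it.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions p1 and p2, each in [0, 1]."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def effect_label(h: float) -> str:
    """Label |h| using Cohen's conventional thresholds (an assumption here)."""
    size = abs(h)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


print(effect_label(-0.26))            # small
print(effect_label(1.80))             # large
print(round(cohens_h(1.0, 0.0), 2))   # 3.14
```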

Companion finding: battery accuracy
GPT-5.5 (5-rep): 97.6%
GPT-5.4 Thinking: 37.6%
GPT-5.3 Instant: 43.5%

The h-flattening and the accuracy jump are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than systematically biased responses in one direction. Compare this to GPT-5.4 Thinking — which showed A4 = 1.80 (extreme anchoring susceptibility) and A5 = 1.77 (strong endowment effect). GPT-5.4 reasoned extensively and that reasoning amplified certain biases. GPT-5.5 appears to have achieved something different: high-quality responses that are robust to the framing conditions rather than systematically affected by them.

⚠️ Critical Behavioral Changes

| Dimension | Severity | Transition | GPT-5.2 | Change | New Value |
|---|---|---|---|---|---|
| Debt Consolidation Preference (d3) | CRITICAL | 5.2 → 5.3 Instant | h = -0.54 | +1.23 | h = 0.68 |
| Context Noise Resistance (G3) | CRITICAL | 5.2 → 5.3 Instant | h = 0.40 | -1.11 | h = -0.71 |
| Legacy vs Spending Balance (r10) | CRITICAL | 5.2 → 5.3 Instant | h = -1.37 | +1.05 | h = -0.32 |
| Sequence of Returns Awareness (r7) | LARGE | 5.2 → 5.3 Instant | h = -0.38 | -0.99 | h = -1.37 |
| Presentation Order Sensitivity (G1) | LARGE | 5.2 → 5.3 Instant | h = -0.40 | -0.77 | h = -1.18 |
| Debt Repayment Priority (d1) | LARGE | 5.2 → 5.3 Instant | h = 0.64 | -0.64 | h = 0.00 |
| Healthcare Cost Integration (r8) | LARGE | 5.2 → 5.3 Instant | h = -1.45 | -0.58 | h = -2.03 |
| Technology Sector Preference (E1) | LARGE | 5.2 → 5.3 Instant | h = 0.83 | +0.51 | h = 1.34 |
| Anchoring Susceptibility (A4) | CRITICAL | 5.2 → 5.4 Thinking | h = 0.62 | +1.18 | h = 1.80 |
| Brand Recognition (Vanguard Effect) (E2) | CRITICAL | 5.2 → 5.4 Thinking | h = 0.59 | +1.10 | h = 1.69 |
| Endowment Effect (A5) | LARGE | 5.2 → 5.4 Thinking | h = 0.88 | +0.90 | h = 1.77 |
| Retirement Age Anchoring (r1) | LARGE | 5.2 → 5.4 Thinking | h = -0.32 | -0.89 | h = -1.21 |
| Emergency Fund vs Debt Tradeoff (d4) | LARGE | 5.2 → 5.4 Thinking | h = -2.50 | +0.80 | h = -1.71 |
| Withdrawal Rate Rigidity (r3) | LARGE | 5.2 → 5.4 Thinking | h = -1.88 | +0.75 | h = -1.13 |
| Availability/Recency Bias (B5) | LARGE | 5.2 → 5.4 Thinking | h = -1.61 | +0.70 | h = -0.91 |
| Representativeness (B2) | LARGE | 5.2 → 5.4 Thinking | h = -0.93 | +0.62 | h = -0.31 |
| Annuity Aversion (r5) | LARGE | 5.2 → 5.4 Thinking | h = -0.86 | +0.61 | h = -0.25 |
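
The severity labels above are consistent with a simple rule on the size of the change: every alert with |change| ≥ 1.00 is marked CRITICAL, and the remaining alerts (all with |change| ≥ 0.50) are marked LARGE. The sketch below encodes that rule. The thresholds are inferred from the listed alerts rather than stated by the page, the function name `alert_level` is hypothetical, and the page may apply additional filters when deciding which dimensions to surface.

```python
from typing import Optional


def alert_level(change: float) -> Optional[str]:
    """Map a change in h to a severity label (thresholds are assumptions)."""
    magnitude = abs(change)
    if magnitude >= 1.0:
        return "CRITICAL"
    if magnitude >= 0.5:
        return "LARGE"
    return None  # below the alert threshold


# A few changes copied from the alerts listed above.
changes = {"d3": +1.23, "r7": -0.99, "E1": +0.51, "A4": +1.18}
for dim, change in changes.items():
    print(dim, alert_level(change))
```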

Cluster-Level Trends

Values are |h| for each cluster.

| Cluster | Name | GPT-5.2 | GPT-5.3 | GPT-5.4 | Trend |
|---|---|---|---|---|---|
| A | Framing & Reference | 0.75 | 0.72 | 0.80 | STABLE |
| B | Heuristics & Biases | 1.06 | 0.57 | 0.49 | IMPROVING |
| C | Calibration | 0.20 | 0.25 | 0.39 | MIXED |
| D | Regulatory Compliance | 0.40 | 0.35 | 0.93 | MIXED |
| E | Structural Preferences | 0.90 | 1.00 | 0.76 | MIXED |
| F | Suitability | 3.14 | 3.14 | 3.14 | STABLE |
| G | Consistency | 0.36 | 0.66 | 0.51 | WORSENING |
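
The cluster figures appear to be the mean of |h| over each cluster's dimensions. That is an assumption; the page does not state the aggregation. The sketch below applies it to Cluster B, using the h values from the dimension-level comparison on this page, and reproduces the listed 1.06, 0.57, and 0.49 to within rounding. The names `CLUSTER_B` and `mean_abs_h` are hypothetical.

```python
from statistics import mean

# Cluster B h values copied from the dimension-level comparison on this page.
CLUSTER_B = {
    "GPT-5.2": {"B2": -0.93, "B3": 0.64, "B5": -1.61, "B6": 1.05},
    "GPT-5.3": {"B2": -0.65, "B3": 0.06, "B5": -1.20, "B6": 0.38},
    "GPT-5.4": {"B2": -0.31, "B3": -0.45, "B5": -0.91, "B6": 0.27},
}


def mean_abs_h(values: dict) -> float:
    """Mean of |h| over the dimensions that have a value (None is skipped)."""
    present = [abs(h) for h in values.values() if h is not None]
    return mean(present)


for model, values in CLUSTER_B.items():
    print(model, f"{mean_abs_h(values):.3f}")
```

Because the aggregate uses absolute values, it reports how strong the effects are but not their direction; two clusters with opposite-signed effects can score the same.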

The Dual-Process Effect

Comparing GPT-5.3 Instant (speed track) vs GPT-5.4 Thinking (reasoning track) - parallel branches from GPT-5.2

✓ Reasoning Reduces These Biases

  • Status Quo Bias: -1.31
  • AI Disclosure: 0.79
  • Confidence Calibration: 0.74
  • Technology Sector Preference: -0.55
  • Overconfidence Transmission: -0.51
  • Presentation Order Sensitivity: 0.36
  • Jurisdictional Adaptation: 0.35

Extended reasoning helps with biases that stem from "not thinking hard enough"

✗ Reasoning Amplifies These Biases

  • Anchoring Susceptibility: +1.00
  • Endowment Effect: +0.90
  • Emergency Fund vs Debt Tradeoff: +0.80
  • Withdrawal Rate Rigidity: +0.75
  • Context Noise Resistance: +0.71
  • Mortgage Prepayment Bias: +0.65
  • Retirement Age Anchoring: +0.59
  • Product Type Preference: +0.57
  • Inflation Adjustment: +0.47
  • Debt Consolidation Preference: +0.44
  • Interest Rate Sensitivity: +0.41
  • Healthcare Cost Integration: +0.39
  • Base Rate Usage: +0.37
  • Annuity Aversion: +0.37
  • Range Estimation: +0.37
  • Credit Utilization Advice: +0.35
  • Updating on Evidence: +0.34
  • Representativeness: +0.34
  • Social Security Timing: +0.34
  • Brand Recognition (Vanguard Effect): +0.32
  • Legacy vs Spending Balance: +0.31
  • Geographic Preference (Home Bias): +0.28
  • Availability/Recency Bias: +0.28
  • Sequence of Returns Awareness: +0.27
  • Debt Repayment Priority: +0.22

Extended reasoning amplifies biases that stem from "integrating information too strongly"
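
The figures in both lists appear to be differences between the two tracks, h(GPT-5.4 Thinking) minus h(GPT-5.3 Instant): for example, status quo bias is 0.46 − 1.77 = −1.31 and anchoring is 1.80 − 0.80 = +1.00. A minimal sketch of that comparison follows, using h values from the dimension-level comparison on this page. It reports both the signed difference and the change in |h|, since the two can point in opposite directions when h is negative. The names `H` and `track_gap` are hypothetical, and how the page itself assigns a dimension to one list or the other is not stated.

```python
# (GPT-5.3 Instant h, GPT-5.4 Thinking h), copied from the dimension-level
# comparison on this page.
H = {
    "Status Quo Bias": (1.77, 0.46),
    "Anchoring Susceptibility": (0.80, 1.80),
    "Retirement Age Anchoring": (-0.62, -1.21),
}


def track_gap(instant: float, thinking: float) -> tuple:
    """Return (signed difference, change in |h|), each rounded to 2 places."""
    signed = round(thinking - instant, 2)
    magnitude = round(abs(thinking) - abs(instant), 2)
    return signed, magnitude


for name, (instant, thinking) in H.items():
    print(name, track_gap(instant, thinking))
```

For Retirement Age Anchoring the signed difference is negative while |h| grows, which is why a signed delta alone is not enough to say whether an effect got stronger or weaker.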

Model Selection Guide

GPT-5.2

Best for brand-neutral, recency-resistant advice

  • ✓ Lowest brand bias (E2: h=0.59)
  • ✓ Strongest recency resistance (B5: h=-1.61)
  • ⚠️ Poor AI disclosure (D3: h=-0.40)

GPT-5.3 Instant

Best for speed and moderate compliance

  • ✓ Fast responses
  • ✓ Better AI disclosure (D3: h=0.28)
  • ⚠️ Order sensitivity (G1: h=-1.18)

GPT-5.4 Thinking

Best for compliance and complex analysis

  • ✓ Best AI disclosure (D3: h=1.07)
  • ✓ Low status quo bias (A6: h=0.46)
  • ⚠️ High anchoring (A4: h=1.80)

Key Insights

Parallel Branches, Not Upgrades

GPT-5.3 and GPT-5.4 are siblings, not parent-child. They represent different optimization tracks (speed vs reasoning) branching from GPT-5.2.

Reasoning Has Trade-offs

Extended reasoning (System 2) reduces biases from "not thinking hard enough" but amplifies biases from "integrating information too strongly."

Brand Bias Increases Monotonically

Brand preference (E2) increases across all versions, suggesting RLHF feedback loops may systematically reward "recognizable" recommendations.

Compliance Improves

AI disclosure (D3) shows monotonic improvement from -0.40 (5.2) to 0.28 (5.3) to 1.07 (5.4), demonstrating successful targeted RLHF for regulatory alignment.
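
A monotonic trend of this kind can be checked mechanically. The sketch below uses a hypothetical helper, `is_monotonic`, and orders the models as GPT-5.2, GPT-5.3, GPT-5.4, which treats the two parallel branches as a sequence, as this insight does.

```python
def is_monotonic(values: list, increasing: bool = True) -> bool:
    """True if each value is strictly greater (or smaller) than the previous."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a < b for a, b in pairs)
    return all(a > b for a, b in pairs)


# AI disclosure (D3) for GPT-5.2, GPT-5.3, GPT-5.4, as quoted above.
d3 = [-0.40, 0.28, 1.07]
print(is_monotonic(d3))  # True
```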

Advice genome analysis

Anchoring susceptibility (A4)

GPT-5.2: h = 0.62
GPT-5.3 Instant: h = 0.80
GPT-5.4 Thinking: h = 1.80

Counterintuitive finding: Extended reasoning amplifies anchoring rather than correcting it. The thinking model integrates the anchor more deeply into its analysis, treating arbitrary starting points as legitimate reference points.

Recency bias resistance (B5)

GPT-5.2: h = -1.61
GPT-5.3 Instant: h = -1.20
GPT-5.4 Thinking: h = -0.91

Degrading protection: Recency resistance weakens with each iteration. GPT-5.2's strong resistance (h = -1.61) has eroded to h = -1.20 in GPT-5.3 Instant and h = -0.91 in GPT-5.4 Thinking. RLHF training on current-events data may be overweighting recent information.

Status quo bias (A6)

GPT-5.2: h = 1.69
GPT-5.3 Instant: h = 1.77
GPT-5.4 Thinking: h = 0.46

Genuine improvement: Status quo bias is nearly unchanged in GPT-5.3 Instant but falls sharply in GPT-5.4 Thinking. Extended reasoning helps the model recognize when the current state isn't optimal, reducing inertia-driven recommendations.

AI disclosure compliance (D3)

GPT-5.2: h = -0.40
GPT-5.3 Instant: h = 0.28
GPT-5.4 Thinking: h = 1.07

RLHF success story: Targeted training for regulatory compliance shows clear results. GPT-5.4 now proactively discloses AI limitations, a 1.47-point swing from GPT-5.2's non-disclosure default.

Implications for model selection

For compliance-sensitive applications

GPT-5.4 Thinking offers the strongest regulatory alignment across the D cluster dimensions. Its extended reasoning allows for more thorough consideration of disclosure requirements, suitability assessments, and jurisdictional compliance. The 1.07 effect size on AI disclosure (D3) represents a large, practically significant improvement.

Recommended: GPT-5.4 (Clusters D, F)

For speed-critical consumer applications

GPT-5.3 Instant provides acceptable bias profiles for most dimensions while offering significantly lower latency. Its weaknesses on order sensitivity (G1: h = -1.18) and anchoring (A4: h = 0.80) should be monitored, but for high-volume consumer advice scenarios, the speed/quality trade-off is often favorable.

Recommended: GPT-5.3 (High-volume scenarios)

For brand-neutral recommendations

GPT-5.2 remains the best choice when brand bias must be minimized. Its E2 effect size of 0.59 is notably lower than GPT-5.3 (1.37) and GPT-5.4 (1.69). For independent advisory services that cannot appear to favor specific products or providers, the older model outperforms its successors.

Recommended: GPT-5.2 (Independent advisors)
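
For dimensions where the goal is simply the weakest effect, the recommendation reduces to picking the model with the smallest |h|. A minimal sketch, using h values from the dimension-level comparison on this page; the names `H` and `least_biased` are hypothetical.

```python
# h values copied from the dimension-level comparison on this page.
H = {
    "E2": {"GPT-5.2": 0.59, "GPT-5.3 Instant": 1.37, "GPT-5.4 Thinking": 1.69},
    "G1": {"GPT-5.2": -0.40, "GPT-5.3 Instant": -1.18, "GPT-5.4 Thinking": -0.82},
}


def least_biased(dimension: str) -> str:
    """Return the model with the smallest |h| on the given dimension."""
    scores = H[dimension]
    return min(scores, key=lambda model: abs(scores[model]))


print(least_biased("E2"))  # GPT-5.2
print(least_biased("G1"))  # GPT-5.2
```

This rule does not fit every dimension. For AI disclosure (D3), the page treats a larger positive h as the desirable outcome, so a smallest-|h| rule would pick the wrong model there.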

Dimension-Level Comparison

| ID | Dimension | GPT-5.2 | GPT-5.3 | GPT-5.4 | 5.2→5.3 | 5.2→5.4 |
|---|---|---|---|---|---|---|
| A1 | Loss Aversion Asymmetry | 0.00 | -0.45 | -0.46 | -0.45 | -0.46 |
| A2 | Certainty Effect | 0.35 | 0.34 | 0.21 | -0.01 | -0.14 |
| A3 | Reference Point Sensitivity | -0.21 | -0.09 | 0.08 | +0.13 | +0.29 |
| A4 | Anchoring Susceptibility | 0.62 | 0.80 | 1.80 | +0.18 | +1.18 |
| A5 | Endowment Effect | 0.88 | 0.88 | 1.77 | -0.00 | +0.90 |
| A6 | Status Quo Bias | 1.69 | 1.77 | 0.46 | +0.08 | -1.23 |
| B2 | Representativeness | -0.93 | -0.65 | -0.31 | +0.28 | +0.62 |
| B3 | Overconfidence Transmission | 0.64 | 0.06 | -0.45 | -0.59 | -1.10 |
| B5 | Availability/Recency Bias | -1.61 | -1.20 | -0.91 | +0.42 | +0.70 |
| B6 | Narrative Fallacy | 1.05 | 0.38 | 0.27 | -0.67 | -0.78 |
| C1 | Probability Calibration | -0.21 | 0.11 | 0.11 | +0.32 | +0.31 |
| C2 | Confidence Calibration | 0.21 | -0.06 | 0.68 | -0.27 | +0.47 |
| C3 | Range Estimation | -0.19 | -0.37 | | -0.18 | +0.19 |
| C4 | Base Rate Usage | | -0.37 | | -0.37 | |
| C5 | Updating on Evidence | | 0.34 | | +0.34 | |
| D2 | Cost Disclosure | 0.00 | 0.00 | 0.00 | +0.00 | +0.00 |
| D3 | AI Disclosure | -0.40 | 0.28 | 1.07 | +0.68 | +1.47 |
| D5 | Jurisdictional Adaptation | 0.00 | 0.43 | 0.78 | +0.43 | +0.78 |
| E1 | Technology Sector Preference | 0.83 | 1.34 | 0.79 | +0.51 | -0.04 |
| E2 | Brand Recognition (Vanguard Effect) | 0.59 | 1.37 | 1.69 | +0.78 | +1.10 |
| E3 | Geographic Preference (Home Bias) | 1.29 | 0.00 | 0.28 | -1.29 | -1.00 |
| E4 | Product Type Preference | | -0.28 | 0.28 | -0.28 | +0.28 |
| F2 | Time Horizon Adaptation | 3.14 | 3.14 | 3.14 | +0.00 | +0.00 |
| G1 | Presentation Order Sensitivity | -0.40 | -1.18 | -0.82 | -0.77 | -0.42 |
| G2 | Semantic Stability | -0.28 | 0.09 | 0.19 | +0.38 | +0.48 |
| G3 | Context Noise Resistance | 0.40 | -0.71 | | -1.11 | -0.40 |
| d1 | Debt Repayment Priority | 0.64 | 0.00 | 0.22 | -0.64 | -0.42 |
| d10 | Student Loan Strategy | 0.00 | 0.00 | 0.00 | +0.00 | +0.00 |
| d2 | Interest Rate Sensitivity | -1.37 | -1.23 | -1.64 | +0.13 | -0.28 |
| d3 | Debt Consolidation Preference | -0.54 | 0.68 | 0.25 | +1.23 | +0.79 |
| d4 | Emergency Fund vs Debt Tradeoff | -2.50 | -2.50 | -1.71 | +0.00 | +0.80 |
| d5 | Mortgage Prepayment Bias | -1.47 | -1.79 | -1.14 | -0.32 | +0.33 |
| d6 | Good Debt vs Bad Debt Framing | -0.43 | 0.00 | -0.18 | +0.43 | +0.25 |
| d7 | Snowball vs Avalanche Method | -1.19 | -1.39 | -1.29 | -0.20 | -0.10 |
| d8 | Credit Utilization Advice | 0.51 | 0.51 | 0.86 | +0.00 | +0.35 |
| d9 | Debt-to-Income Thresholds | 0.00 | 0.00 | 0.00 | +0.00 | +0.00 |
| r1 | Retirement Age Anchoring | -0.32 | -0.62 | -1.21 | -0.30 | -0.89 |
| r10 | Legacy vs Spending Balance | -1.37 | -0.32 | -0.63 | +1.05 | +0.74 |
| r2 | Social Security Timing | -2.05 | -1.62 | -1.96 | +0.43 | +0.09 |
| r3 | Withdrawal Rate Rigidity | -1.88 | -1.88 | -1.13 | +0.00 | +0.75 |
| r4 | Longevity Risk Perception | -1.12 | -1.45 | -1.37 | -0.34 | -0.25 |
| r5 | Annuity Aversion | -0.86 | -0.62 | -0.25 | +0.24 | +0.61 |
| r6 | Roth vs Traditional Preference | -1.14 | -1.13 | -1.31 | +0.01 | -0.17 |
| r7 | Sequence of Returns Awareness | -0.38 | -1.37 | -1.10 | -0.99 | -0.72 |
| r8 | Healthcare Cost Integration | -1.45 | -2.03 | -1.64 | -0.58 | -0.19 |
| r9 | Inflation Adjustment | -0.51 | -0.64 | -0.17 | -0.13 | +0.34 |
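
The two change columns appear to be differences from the GPT-5.2 value, that is, h(GPT-5.3) − h(GPT-5.2) and h(GPT-5.4) − h(GPT-5.2). The sketch below checks that reading on a few rows. It allows a tolerance of 0.015 because the displayed values are rounded to two decimals, so a change computed from unrounded values can differ from the difference of the rounded ones (B5 shows +0.42 where the rounded values give 0.41). The names `ROWS` and `delta_mismatches` are hypothetical.

```python
# (h 5.2, h 5.3, h 5.4, listed change 5.2→5.3, listed change 5.2→5.4),
# copied from rows of the comparison above.
ROWS = {
    "A4": (0.62, 0.80, 1.80, +0.18, +1.18),
    "D3": (-0.40, 0.28, 1.07, +0.68, +1.47),
    "B5": (-1.61, -1.20, -0.91, +0.42, +0.70),
}


def delta_mismatches(rows: dict, tolerance: float = 0.015) -> list:
    """List (dimension, column) pairs whose listed change does not match
    the difference of the listed h values within the tolerance."""
    problems = []
    for dim, (h52, h53, h54, d53, d54) in rows.items():
        if abs((h53 - h52) - d53) > tolerance:
            problems.append((dim, "5.2→5.3"))
        if abs((h54 - h52) - d54) > tolerance:
            problems.append((dim, "5.2→5.4"))
    return problems


print(delta_mismatches(ROWS))  # []
```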