
GPT Model Evolution

How behavioral biases change across GPT-5.2, GPT-5.3 Instant, GPT-5.4 Thinking, and the just-released GPT-5.5. Understanding model lineage helps predict behavioral trade-offs.

Model Lineage

GPT-5.2

December 2025

Instant + Thinking + Pro variants

GPT-5.3 Instant

March 3, 2026

Speed Track

Successor to 5.2 Instant

GPT-5.5

April 25, 2026

Speed Track

Successor to 5.3 Instant

GPT-5.4 Thinking

March 5, 2026

Reasoning Track

Successor to 5.2 Thinking

Speed Track (5.3 Instant → 5.5) and Reasoning Track (5.4 Thinking) are parallel branches from GPT-5.2, not sequential versions of each other.

Generation walkthrough

Four generations in one continuous scroll

Code Red Release

GPT-5.2

December 2025

Status Quo

h = 1.69

Battery

—

Resistant to recency. Strong status quo, strong home bias.

Speed Track

GPT-5.3 Instant

March 3, 2026

Anchoring

h = 0.80

Battery

43.5%

Direct, decisive. Anchors hard to mentioned numbers.

Reasoning Track

GPT-5.4 Thinking

March 5, 2026

Anchoring

h = 1.80

Battery

37.6%

Extended reasoning. Reduces some biases, amplifies others.

Release-day fingerprint

GPT-5.5

April 25, 2026

24 of 26 dims

h ≈ 0

Battery

97.6%

h ≈ 0 across nearly every dimension. The Invariant.

GPT-5.5 — fingerprinted on release day

April 25, 2026

GPT-5.5 returns h ≈ 0 across 24 of 26 dimensions — not because it failed to respond, but because it responded with equal quality (~9/10 in both conditions) regardless of frame, anchor, or order. The model appears to have substantially overcome the framing and anchoring effects that characterised its predecessors.

Only C4 (base rate usage, h = −0.26) and C5 (evidence updating, h = +0.26) show any detectable condition sensitivity. Both sit just above the conventional small-effect threshold of 0.20 and well below the medium-effect threshold of 0.50. These two non-zero results matter: if every dimension were exactly zero it would look like a floor effect, but the small C-cluster signal indicates the measurement is sensitive enough to detect real effects when they exist.
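The page never defines h. The value 3.14 reported for Cluster F equals the maximum of Cohen's h (π), so the sketch below assumes h is Cohen's h for two proportions, read against the conventional 0.20 / 0.50 / 0.80 cut-offs. The function names and the "negligible" label are illustrative and do not come from the evaluation pipeline.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def effect_label(h: float) -> str:
    """Conventional size labels for |h| (assumed, not stated on this page)."""
    size = abs(h)
    if size < 0.20:
        return "negligible"
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "medium"
    return "large"


# Values quoted on this page
print(effect_label(-0.26))  # C4 on GPT-5.5
print(effect_label(1.80))   # A4 on GPT-5.4 Thinking

# The largest possible h (proportions 1.0 vs 0.0) is pi
print(round(cohens_h(1.0, 0.0), 2))
```

Under these assumptions C4 and C5 on GPT-5.5 fall in the small band, while A4 on GPT-5.4 Thinking falls in the large band.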

Companion finding — battery accuracy

97.6%

GPT-5.5 (5-rep)

37.6%

GPT-5.4 Thinking

43.5%

GPT-5.3 Instant

The h-flattening and the accuracy jump are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than responses biased systematically in one direction. Compare this to GPT-5.4 Thinking, which showed A4 = 1.80 (extreme anchoring susceptibility) and A5 = 1.77 (strong endowment effect). GPT-5.4 reasoned extensively, and that reasoning amplified certain biases. GPT-5.5 appears to have achieved something different: high-quality responses that are robust to the framing conditions rather than systematically affected by them.

⚠️ Critical Behavioral Changes

GPT-5.2 → GPT-5.3 Instant

Code	Dimension	Severity	GPT-5.2 h	Change	New h
d3	Debt Consolidation Preference	CRITICAL	-0.54	+1.23	0.68
G3	Context Noise Resistance	CRITICAL	0.40	-1.11	-0.71
r10	Legacy vs Spending Balance	CRITICAL	-1.37	+1.05	-0.32
r7	Sequence of Returns Awareness	LARGE	-0.38	-0.99	-1.37
G1	Presentation Order Sensitivity	LARGE	-0.40	-0.77	-1.18
d1	Debt Repayment Priority	LARGE	0.64	-0.64	0.00
r8	Healthcare Cost Integration	LARGE	-1.45	-0.58	-2.03
E1	Technology Sector Preference	LARGE	0.83	+0.51	1.34

GPT-5.2 → GPT-5.4 Thinking

Code	Dimension	Severity	GPT-5.2 h	Change	New h
A4	Anchoring Susceptibility	CRITICAL	0.62	+1.18	1.80
E2	Brand Recognition (Vanguard Effect)	CRITICAL	0.59	+1.10	1.69
A5	Endowment Effect	LARGE	0.88	+0.90	1.77
r1	Retirement Age Anchoring	LARGE	-0.32	-0.89	-1.21
d4	Emergency Fund vs Debt Tradeoff	LARGE	-2.50	+0.80	-1.71
r3	Withdrawal Rate Rigidity	LARGE	-1.88	+0.75	-1.13
B5	Availability/Recency Bias	LARGE	-1.61	+0.70	-0.91
B2	Representativeness	LARGE	-0.93	+0.62	-0.31
r5	Annuity Aversion	LARGE	-0.86	+0.61	-0.25
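The severity labels in this section appear to follow the size of the change in h. A minimal sketch, assuming CRITICAL means |change| ≥ 1.00 and LARGE means |change| ≥ 0.50: both thresholds are inferred from the listed values rather than documented, and the "MINOR" label is hypothetical. The second helper checks that each alert's old value plus its change matches the new value within two-decimal rounding.

```python
def severity(change: float) -> str:
    """Label a change in h. Thresholds are inferred from this page's alerts."""
    size = abs(change)
    if size >= 1.00:
        return "CRITICAL"
    if size >= 0.50:
        return "LARGE"
    return "MINOR"  # hypothetical label; the page only shows the two above


def is_consistent(old: float, change: float, new: float, tol: float = 0.015) -> bool:
    """True if old + change equals new, allowing for two-decimal rounding."""
    return abs((old + change) - new) <= tol


# (code, GPT-5.2 h, change, new h, label shown on the page)
alerts = [
    ("d3", -0.54, +1.23, 0.68, "CRITICAL"),
    ("r7", -0.38, -0.99, -1.37, "LARGE"),
    ("A4", 0.62, +1.18, 1.80, "CRITICAL"),
    ("d4", -2.50, +0.80, -1.71, "LARGE"),
]

for code, old, change, new, shown in alerts:
    print(code, severity(change) == shown, is_consistent(old, change, new))
```

The tolerance of 0.015 allows for each of the three displayed figures having been rounded independently.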

Cluster-Level Trends

Cluster	Name	GPT-5.2 |h|	GPT-5.3 |h|	GPT-5.4 |h|	Trend
A	Framing & Reference	0.75	0.72	0.80	STABLE
B	Heuristics & Biases	1.06	0.57	0.49	IMPROVING
C	Calibration	0.20	0.25	0.39	MIXED
D	Regulatory Compliance	0.40	0.35	0.93	MIXED
E	Structural Preferences	0.90	1.00	0.76	MIXED
F	Suitability	3.14	3.14	3.14	STABLE
G	Consistency	0.36	0.66	0.51	WORSENING
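The page does not say how the cluster figures are aggregated. The sketch below assumes each one is a plain mean of |h| over the cluster's dimensions that have a reported value. Under that assumption it reproduces two of the figures above from the per-dimension values in the Dimension-Level Comparison. The function name is illustrative.

```python
from statistics import mean
from typing import Optional, Sequence


def cluster_abs_mean(values: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of |h| over reported values; None marks a missing cell."""
    reported = [abs(v) for v in values if v is not None]
    return mean(reported) if reported else None


# Per-dimension h values copied from the dimension-level table
cluster_a_gpt54 = [-0.46, 0.21, 0.08, 1.80, 1.77, 0.46]  # A1..A6, GPT-5.4
cluster_g_gpt53 = [-1.18, 0.09, -0.71]                   # G1..G3, GPT-5.3

print(round(cluster_abs_mean(cluster_a_gpt54), 2))
print(round(cluster_abs_mean(cluster_g_gpt53), 2))
```

Because the aggregate uses absolute values, a cluster can look stable while individual dimensions swing in opposite directions, as A4 and A6 do in Cluster A.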

The Dual-Process Effect

Comparing GPT-5.3 Instant (speed track) vs GPT-5.4 Thinking (reasoning track) - parallel branches from GPT-5.2

✓ Reasoning Reduces These Biases

Status Quo Bias: -1.31
AI Disclosure: 0.79
Confidence Calibration: 0.74
Technology Sector Preference: -0.55
Overconfidence Transmission: -0.51
Presentation Order Sensitivity: 0.36
Jurisdictional Adaptation: 0.35

Extended reasoning helps with biases that stem from "not thinking hard enough"

✗ Reasoning Amplifies These Biases

Anchoring Susceptibility: +1.00
Endowment Effect: +0.90
Emergency Fund vs Debt Tradeoff: +0.80
Withdrawal Rate Rigidity: +0.75
Context Noise Resistance: +0.71
Mortgage Prepayment Bias: +0.65
Retirement Age Anchoring: +0.59
Product Type Preference: +0.57
Inflation Adjustment: +0.47
Debt Consolidation Preference: +0.44
Interest Rate Sensitivity: +0.41
Healthcare Cost Integration: +0.39
Base Rate Usage: +0.37
Annuity Aversion: +0.37
Range Estimation: +0.37
Credit Utilization Advice: +0.35
Updating on Evidence: +0.34
Representativeness: +0.34
Social Security Timing: +0.34
Brand Recognition (Vanguard Effect): +0.32
Legacy vs Spending Balance: +0.31
Geographic Preference (Home Bias): +0.28
Availability/Recency Bias: +0.28
Sequence of Returns Awareness: +0.27
Debt Repayment Priority: +0.22

Extended reasoning amplifies biases that stem from "integrating information too strongly"
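For the rows checked here, the figures in the "reduces" list match GPT-5.4 Thinking minus GPT-5.3 Instant in the Dimension-Level Comparison. The sketch below recomputes four of them under that reading. How the page decides which list a delta belongs to is not stated, so the code only computes and ranks the differences. The dictionary layout and function name are illustrative.

```python
# (GPT-5.3 Instant h, GPT-5.4 Thinking h), copied from the dimension-level table
scores = {
    "Status Quo Bias (A6)": (1.77, 0.46),
    "Anchoring Susceptibility (A4)": (0.80, 1.80),
    "AI Disclosure (D3)": (0.28, 1.07),
    "Technology Sector Preference (E1)": (1.34, 0.79),
}


def track_deltas(pairs: dict) -> dict:
    """Reasoning-track score minus speed-track score, rounded to 2 decimals."""
    return {
        name: round(thinking - instant, 2)
        for name, (instant, thinking) in pairs.items()
    }


deltas = track_deltas(scores)

# Rank by magnitude, largest shift first
for name, delta in sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {delta:+.2f}")
```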

Model Selection Guide

GPT-5.2

Best for brand-neutral, recency-resistant advice

✓ Lowest brand bias (E2: h=0.59)
✓ Strongest recency resistance (B5: h=-1.61)
⚠️ Poor AI disclosure (D3: h=-0.40)

GPT-5.3 Instant

Best for speed and moderate compliance

✓ Fast responses
✓ Better AI disclosure (D3: h=0.28)
⚠️ Order sensitivity (G1: h=-1.18)

GPT-5.4 Thinking

Best for compliance and complex analysis

✓ Best AI disclosure (D3: h=1.07)
✓ Low status quo bias (A6: h=0.46)
⚠️ High anchoring (A4: h=1.80)

Key Insights

Parallel Branches, Not Upgrades

GPT-5.3 and GPT-5.4 are siblings, not parent-child. They represent different optimization tracks (speed vs reasoning) branching from GPT-5.2.

Reasoning Has Trade-offs

Extended reasoning (System 2) reduces biases from "not thinking hard enough" but amplifies biases from "integrating information too strongly."

Brand Bias Increases Monotonically

Brand preference (E2) increases across all versions, suggesting RLHF feedback loops may systematically reward "recognizable" recommendations.

Compliance Improves

AI disclosure (D3) shows monotonic improvement from -0.40 (5.2) to 0.28 (5.3) to 1.07 (5.4), demonstrating successful targeted RLHF for regulatory alignment.
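The monotonic claims above can be checked mechanically. A minimal sketch, assuming the models are ordered by release date (GPT-5.2, GPT-5.3 Instant, GPT-5.4 Thinking) even though the last two are parallel branches. The values come from the Dimension-Level Comparison and the function name is illustrative.

```python
def trend(values: list) -> str:
    """Classify a series as strictly increasing, strictly decreasing, or mixed."""
    pairs = list(zip(values, values[1:]))
    if all(a < b for a, b in pairs):
        return "increasing"
    if all(a > b for a, b in pairs):
        return "decreasing"
    return "mixed"


# h for GPT-5.2, GPT-5.3 Instant, GPT-5.4 Thinking (release order)
series = {
    "E2 Brand Recognition": [0.59, 1.37, 1.69],
    "D3 AI Disclosure": [-0.40, 0.28, 1.07],
    "G1 Presentation Order Sensitivity": [-0.40, -1.18, -0.82],
}

for name, values in series.items():
    print(name, trend(values))
```

E2 and D3 both rise at every step, while G1 does not move in a single direction.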

Advice genome analysis

Anchoring susceptibility (A4)

GPT-5.2: h = 0.62

GPT-5.3 Instant: h = 0.80

GPT-5.4 Thinking: h = 1.80

Counterintuitive finding: Extended reasoning amplifies anchoring rather than correcting it. The thinking model integrates the anchor more deeply into its analysis, treating arbitrary starting points as legitimate reference points.

Recency bias resistance (B5)

GPT-5.2: h = -1.61

GPT-5.3 Instant: h = -1.20

GPT-5.4 Thinking: h = -0.91

Degrading protection: Recency resistance is weaker in both successors. GPT-5.2's strong resistance (h = -1.61) has eroded in the later versions. RLHF training on current-events data may be overweighting recent information.

Status quo bias (A6)

GPT-5.2: h = 1.69

GPT-5.3 Instant: h = 1.77

GPT-5.4 Thinking: h = 0.46

Improvement on the reasoning track: Status quo bias stays high on GPT-5.3 Instant (h = 1.77, against 1.69 for GPT-5.2) but drops sharply on GPT-5.4 Thinking (h = 0.46). Extended reasoning helps the model recognize when the current state isn't optimal, reducing inertia-driven recommendations.

AI disclosure compliance (D3)

GPT-5.2: h = -0.40

GPT-5.3 Instant: h = 0.28

GPT-5.4 Thinking: h = 1.07

RLHF success story: Targeted training for regulatory compliance shows clear results. GPT-5.4 now proactively discloses AI limitations, a 1.47-point swing from GPT-5.2's non-disclosure default.

Implications for model selection

For compliance-sensitive applications

GPT-5.4 Thinking offers the strongest regulatory alignment across the D cluster dimensions. Its extended reasoning allows for more thorough consideration of disclosure requirements, suitability assessments, and jurisdictional compliance. The 1.07 effect size on AI disclosure (D3) represents a large, practically significant improvement.

Recommended: GPT-5.4 (Clusters D, F)

For speed-critical consumer applications

GPT-5.3 Instant provides acceptable bias profiles for most dimensions while offering significantly lower latency. Its weaknesses on order sensitivity (G1: h = -1.18) and anchoring (A4: h = 0.80) should be monitored, but for high-volume consumer advice scenarios, the speed/quality trade-off is often favorable.

Recommended: GPT-5.3 (High-volume scenarios)

For brand-neutral recommendations

GPT-5.2 remains the best choice when brand bias must be minimized. Its E2 effect size of 0.59 is notably lower than GPT-5.3 (1.37) and GPT-5.4 (1.69). For independent advisory services that cannot appear to favor specific products or providers, the older model outperforms its successors.

Recommended: GPT-5.2 (Independent advisors)
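The recommendations above come down to picking a model by its score on the dimension that matters most for the use case. A minimal sketch, using E2 and D3 values from the Dimension-Level Comparison. It assumes that for a bias dimension such as E2 the smallest |h| is preferable, and that for D3 a higher h means better disclosure, as the text implies. The helper names are illustrative.

```python
# h per model, copied from the dimension-level table
E2_BRAND = {"GPT-5.2": 0.59, "GPT-5.3 Instant": 1.37, "GPT-5.4 Thinking": 1.69}
D3_DISCLOSURE = {"GPT-5.2": -0.40, "GPT-5.3 Instant": 0.28, "GPT-5.4 Thinking": 1.07}


def lowest_bias(scores: dict) -> str:
    """Model whose effect size is closest to zero on a bias dimension."""
    return min(scores, key=lambda model: abs(scores[model]))


def highest_score(scores: dict) -> str:
    """Model with the highest h, for dimensions where higher is better."""
    return max(scores, key=lambda model: scores[model])


print("Brand-neutral advice:", lowest_bias(E2_BRAND))
print("AI disclosure:", highest_score(D3_DISCLOSURE))
```

A single-dimension lookup like this ignores the trade-offs described earlier. For example, the model that scores best on D3 also carries the highest anchoring susceptibility.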

Dimension-Level Comparison

Code	Dimension	GPT-5.2	GPT-5.3	GPT-5.4	5.2→5.3	5.2→5.4
A1	Loss Aversion Asymmetry	0.00	-0.45	-0.46	↓ -0.45	↓ -0.46
A2	Certainty Effect	0.35	0.34	0.21	→ -0.01	↓ -0.14
A3	Reference Point Sensitivity	-0.21	-0.09	0.08	↓ +0.13	↓ +0.29
A4	Anchoring Susceptibility	0.62	0.80	1.80	↓ +0.18	↓ +1.18
A5	Endowment Effect	0.88	0.88	1.77	→ -0.00	↓ +0.90
A6	Status Quo Bias	1.69	1.77	0.46	→ +0.08	↑ -1.23
B2	Representativeness	-0.93	-0.65	-0.31	↓ +0.28	↓ +0.62
B3	Overconfidence Transmission	0.64	0.06	-0.45	↑ -0.59	↑ -1.10
B5	Availability/Recency Bias	-1.61	-1.20	-0.91	↓ +0.42	↓ +0.70
B6	Narrative Fallacy	1.05	0.38	0.27	↑ -0.67	↑ -0.78
C1	Probability Calibration	-0.21	0.11	0.11	↑ +0.32	↑ +0.31
C2	Confidence Calibration	0.21	-0.06	0.68	↓ -0.27	↑ +0.47
C3	Range Estimation	-0.19	-0.37	—	↓ -0.18	↓ +0.19
C4	Base Rate Usage	—	-0.37	—	↓ -0.37	—
C5	Updating on Evidence	—	0.34	—	↑ +0.34	—
D2	Cost Disclosure	0.00	0.00	0.00	→ +0.00	→ +0.00
D3	AI Disclosure	-0.40	0.28	1.07	↑ +0.68	↑ +1.47
D5	Jurisdictional Adaptation	0.00	0.43	0.78	↑ +0.43	↑ +0.78
E1	Technology Sector Preference	0.83	1.34	0.79	↓ +0.51	→ -0.04
E2	Brand Recognition (Vanguard Effect)	0.59	1.37	1.69	↓ +0.78	↓ +1.10
E3	Geographic Preference (Home Bias)	1.29	0.00	0.28	↑ -1.29	↑ -1.00
E4	Product Type Preference	—	-0.28	0.28	↓ -0.28	↓ +0.28
F2	Time Horizon Adaptation	3.14	3.14	3.14	→ +0.00	→ +0.00
G1	Presentation Order Sensitivity	-0.40	-1.18	-0.82	↓ -0.77	↓ -0.42
G2	Semantic Stability	-0.28	0.09	0.19	↑ +0.38	↑ +0.48
G3	Context Noise Resistance	0.40	-0.71	—	↓ -1.11	↑ -0.40
d1	Debt Repayment Priority	0.64	0.00	0.22	↓ -0.64	↓ -0.42
d10	Student Loan Strategy	0.00	0.00	0.00	→ +0.00	→ +0.00
d2	Interest Rate Sensitivity	-1.37	-1.23	-1.64	↓ +0.13	↓ -0.28
d3	Debt Consolidation Preference	-0.54	0.68	0.25	↓ +1.23	↓ +0.79
d4	Emergency Fund vs Debt Tradeoff	-2.50	-2.50	-1.71	→ +0.00	↓ +0.80
d5	Mortgage Prepayment Bias	-1.47	-1.79	-1.14	↓ -0.32	↓ +0.33
d6	Good Debt vs Bad Debt Framing	-0.43	0.00	-0.18	↓ +0.43	↓ +0.25
d7	Snowball vs Avalanche Method	-1.19	-1.39	-1.29	↓ -0.20	→ -0.10
d8	Credit Utilization Advice	0.51	0.51	0.86	→ +0.00	↓ +0.35
d9	Debt-to-Income Thresholds	0.00	0.00	0.00	→ +0.00	→ +0.00
r1	Retirement Age Anchoring	-0.32	-0.62	-1.21	↓ -0.30	↓ -0.89
r10	Legacy vs Spending Balance	-1.37	-0.32	-0.63	↓ +1.05	↓ +0.74
r2	Social Security Timing	-2.05	-1.62	-1.96	↓ +0.43	→ +0.09
r3	Withdrawal Rate Rigidity	-1.88	-1.88	-1.13	→ +0.00	↓ +0.75
r4	Longevity Risk Perception	-1.12	-1.45	-1.37	↓ -0.34	↓ -0.25
r5	Annuity Aversion	-0.86	-0.62	-0.25	↓ +0.24	↓ +0.61
r6	Roth vs Traditional Preference	-1.14	-1.13	-1.31	→ +0.01	↓ -0.17
r7	Sequence of Returns Awareness	-0.38	-1.37	-1.10	↓ -0.99	↓ -0.72
r8	Healthcare Cost Integration	-1.45	-2.03	-1.64	↓ -0.58	↓ -0.19
r9	Inflation Adjustment	-0.51	-0.64	-0.17	↓ -0.13	↓ +0.34
