GPT Model Evolution
How behavioral biases change across GPT-5.2, GPT-5.3 Instant, GPT-5.4 Thinking, and the just-released GPT-5.5. Understanding model lineage helps predict behavioral trade-offs.
Model Lineage
Speed Track (5.3 Instant → 5.5) and Reasoning Track (5.4 Thinking) are parallel branches from GPT-5.2, not sequential versions of each other.
Four generations in one continuous scroll
| Label | Model | Released | Headline dimension | h | Battery | Summary |
|---|---|---|---|---|---|---|
| Code Red Release | GPT-5.2 | December 2025 | Status Quo | 1.69 | — | Resistant to recency. Strong status quo, strong home bias. |
| Speed Track | GPT-5.3 Instant | March 3, 2026 | Anchoring | 0.80 | 43.5% | Direct, decisive. Anchors hard to mentioned numbers. |
| Reasoning Track | GPT-5.4 Thinking | March 5, 2026 | Anchoring | 1.80 | 37.6% | Extended reasoning. Reduces some biases, amplifies others. |
| Release-day fingerprint | GPT-5.5 | April 25, 2026 | 24 of 26 dims | ≈ 0 | 97.6% | h ≈ 0 across nearly every dimension. The Invariant. |
GPT-5.5 — fingerprinted on release day
April 25, 2026
GPT-5.5 returns h ≈ 0 across 24 of 26 dimensions, not because it failed to respond, but because it responded with equal quality (~9/10 in both conditions) regardless of frame, anchor, or order. The model appears to have substantially overcome the framing and anchoring effects that characterised its predecessors.
Only C4 (base rate usage, h = −0.26) and C5 (evidence updating, h = +0.26) show any detectable condition sensitivity. Both are small effects: just above the conventional small-effect threshold of 0.20 and well below the medium-effect threshold of 0.50. These two non-zero results matter: if every dimension were exactly zero it would look like a floor effect, but the small C-cluster signal suggests the measurement is sensitive enough to detect real effects when they exist.
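The effect sizes quoted here can be read against the usual cutoffs with a short sketch. It assumes that h is Cohen's h for two proportions, which the page does not state explicitly, and it uses Cohen's conventional cutoffs of 0.2, 0.5, and 0.8. The function names `cohens_h` and `magnitude` are illustrative, not part of any tooling described on this page.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def magnitude(h: float) -> str:
    """Label |h| with Cohen's conventional cutoffs (assumed: 0.2, 0.5, 0.8)."""
    size = abs(h)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# C4 and C5 for GPT-5.5, and A4 for GPT-5.4 Thinking, as quoted on this page.
for label, h in [("C4", -0.26), ("C5", 0.26), ("A4 (GPT-5.4)", 1.80)]:
    print(label, magnitude(h))

# The largest possible value: one condition always "yes", the other never.
print(round(cohens_h(1.0, 0.0), 2))  # 3.14 (equals pi)
```

Under this definition the maximum of |h| is π, which is consistent with the 3.14 values that appear in the dimension-level table.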
The h-flattening and the accuracy jump are two sides of the same coin: GPT-5.5 gives correct, high-quality responses across conditions rather than systematically biased responses in one direction. Compare this to GPT-5.4 Thinking — which showed A4 = 1.80 (extreme anchoring susceptibility) and A5 = 1.77 (strong endowment effect). GPT-5.4 reasoned extensively and that reasoning amplified certain biases. GPT-5.5 appears to have achieved something different: high-quality responses that are robust to the framing conditions rather than systematically affected by them.
⚠️ Critical Behavioral Changes
Cluster-Level Trends
[Chart: trends by cluster. Clusters shown: Framing & Reference, Heuristics & Biases, Calibration, Regulatory Compliance, Structural Preferences, Suitability, Consistency]
The Dual-Process Effect
Comparing GPT-5.3 Instant (speed track) with GPT-5.4 Thinking (reasoning track), two parallel branches from GPT-5.2.
✓ Reasoning Reduces These Biases
Extended reasoning helps with biases that stem from "not thinking hard enough"
✗ Reasoning Amplifies These Biases
Extended reasoning amplifies biases that stem from "integrating information too strongly"
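One way to derive such a split from the dimension-level values is sketched below. The h values are copied from the Dimension-Level Comparison table for five dimensions where this page treats a higher h as more bias (A4, A5, A6, B6, E2). That reading, the choice of dimensions, and the helper name `split_by_reasoning_effect` are assumptions made for illustration. The page's own lists are not recoverable from the text.

```python
# (GPT-5.3 Instant, GPT-5.4 Thinking) h values from the comparison table.
# Assumption: for these dimensions a higher h means a stronger bias.
H = {
    "A4 Anchoring": (0.80, 1.80),
    "A5 Endowment Effect": (0.88, 1.77),
    "A6 Status Quo": (1.77, 0.46),
    "B6 Narrative Fallacy": (0.38, 0.27),
    "E2 Brand Recognition": (1.37, 1.69),
}

def split_by_reasoning_effect(h_values):
    """Partition dimensions by whether the Thinking model's h is lower."""
    reduced, amplified = [], []
    for dim, (instant, thinking) in h_values.items():
        # Ties would land in "amplified"; none occur in this data.
        (reduced if thinking < instant else amplified).append(dim)
    return reduced, amplified

reduced, amplified = split_by_reasoning_effect(H)
print("reduced by reasoning:", reduced)
print("amplified by reasoning:", amplified)
```

On these five dimensions, status quo bias and narrative fallacy fall in the reduced group, while anchoring, endowment effect, and brand recognition fall in the amplified group.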
Model Selection Guide
GPT-5.2
Best for brand-neutral, recency-resistant advice
- ✓ Lowest brand bias (E2: h=0.59)
- ✓ Strongest recency resistance (B5: h=-1.61)
- ⚠️ Poor AI disclosure (D3: h=-0.40)
GPT-5.3 Instant
Best for speed and moderate compliance
- ✓ Fast responses
- ✓ Better AI disclosure (D3: h=0.28)
- ⚠️ Order sensitivity (G1: h=-1.18)
GPT-5.4 Thinking
Best for compliance and complex analysis
- ✓ Best AI disclosure (D3: h=1.07)
- ✓ Low status quo bias (A6: h=0.46)
- ⚠️ High anchoring (A4: h=1.80)
Key Insights
Parallel Branches, Not Upgrades
GPT-5.3 and GPT-5.4 are siblings, not parent-child. They represent different optimization tracks (speed vs reasoning) branching from GPT-5.2.
Reasoning Has Trade-offs
Extended reasoning (System 2) reduces biases from "not thinking hard enough" but amplifies biases from "integrating information too strongly."
Brand Bias Increases Monotonically
Brand preference (E2) increases across GPT-5.2, GPT-5.3 Instant, and GPT-5.4 Thinking (h = 0.59, 1.37, 1.69), suggesting RLHF feedback loops may systematically reward "recognizable" recommendations.
Compliance Improves
AI disclosure (D3) shows monotonic improvement from -0.40 (5.2) to 0.28 (5.3) to 1.07 (5.4), demonstrating successful targeted RLHF for regulatory alignment.
Advice genome analysis
Anchoring susceptibility (A4)
Counterintuitive finding: Extended reasoning amplifies anchoring rather than correcting it. The thinking model integrates the anchor more deeply into its analysis, treating arbitrary starting points as legitimate reference points.
Recency bias resistance (B5)
Degrading protection: Recency resistance weakens in both later models. GPT-5.2's strong resistance (h = -1.61) has eroded to -1.20 in GPT-5.3 Instant and -0.91 in GPT-5.4 Thinking. RLHF training on current-events data may be overweighting recent information.
Status quo bias (A6)
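Genuine improvement: Status quo bias drops sharply in GPT-5.4 Thinking (h = 0.46, down from 1.69 in GPT-5.2), while GPT-5.3 Instant is essentially unchanged (h = 1.77). Extended reasoning appears to help the model recognize when the current state isn't optimal, reducing inertia-driven recommendations.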
AI disclosure compliance (D3)
RLHF success story: Targeted training for regulatory compliance shows clear results. GPT-5.4 now proactively discloses AI limitations, a 1.47-point swing from GPT-5.2's non-disclosure default.
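The trend claims for these four dimensions can be checked against the table with a small sketch. Values are copied from the Dimension-Level Comparison table in the order GPT-5.2, GPT-5.3 Instant, GPT-5.4 Thinking. Treating that order as a sequence is itself an assumption, since 5.3 and 5.4 are described as parallel branches. The `trend` helper is illustrative.

```python
# h per dimension, ordered (GPT-5.2, GPT-5.3 Instant, GPT-5.4 Thinking).
SERIES = {
    "A4": (0.62, 0.80, 1.80),
    "B5": (-1.61, -1.20, -0.91),
    "A6": (1.69, 1.77, 0.46),
    "D3": (-0.40, 0.28, 1.07),
}

def trend(values):
    """Return 'increasing', 'decreasing', or 'mixed' for a short series."""
    pairs = list(zip(values, values[1:]))
    if all(b > a for a, b in pairs):
        return "increasing"
    if all(b < a for a, b in pairs):
        return "decreasing"
    return "mixed"

for dim, vals in SERIES.items():
    swing = round(vals[-1] - vals[0], 2)  # GPT-5.4 minus GPT-5.2
    print(f"{dim}: {trend(vals)}, swing {swing:+.2f}")
```

For D3 the swing is 1.07 − (−0.40) = 1.47, which matches the figure quoted above. A6 comes out as mixed rather than strictly decreasing, because GPT-5.3 Instant (1.77) sits slightly above GPT-5.2 (1.69).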
Implications for model selection
For compliance-sensitive applications
GPT-5.4 Thinking offers the strongest regulatory alignment across the D cluster dimensions. Its extended reasoning allows for more thorough consideration of disclosure requirements, suitability assessments, and jurisdictional compliance. The 1.07 effect size on AI disclosure (D3) represents a large, practically significant improvement.
For speed-critical consumer applications
GPT-5.3 Instant provides acceptable bias profiles for most dimensions while offering significantly lower latency. Its weaknesses on order sensitivity (G1: h = -1.18) and anchoring (A4: h = 0.80) should be monitored, but for high-volume consumer advice scenarios, the speed/quality trade-off is often favorable.
For brand-neutral recommendations
GPT-5.2 remains the best choice when brand bias must be minimized. Its E2 effect size of 0.59 is notably lower than GPT-5.3 (1.37) and GPT-5.4 (1.69), increases of 0.78 and 1.10 respectively. For independent advisory services that cannot appear to favor specific products or providers, the older model outperforms its successors.
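The selection logic in this section can be expressed as a lookup over the table values. This is a minimal sketch under stated assumptions: lower h is treated as better for the bias dimensions (E2, A4) and higher h as better for AI disclosure (D3), following how this page describes them. The `best_model` helper and the `HIGHER_IS_BETTER` set are hypothetical names, and latency is not modelled.

```python
# h values from the Dimension-Level Comparison table.
H = {
    "E2": {"GPT-5.2": 0.59, "GPT-5.3 Instant": 1.37, "GPT-5.4 Thinking": 1.69},
    "A4": {"GPT-5.2": 0.62, "GPT-5.3 Instant": 0.80, "GPT-5.4 Thinking": 1.80},
    "D3": {"GPT-5.2": -0.40, "GPT-5.3 Instant": 0.28, "GPT-5.4 Thinking": 1.07},
}

# Assumption: for compliance dimensions a higher h is the desirable direction.
HIGHER_IS_BETTER = {"D3"}

def best_model(dimension: str) -> str:
    """Pick the model with the most favourable h on one dimension."""
    scores = H[dimension]
    pick = max if dimension in HIGHER_IS_BETTER else min
    return pick(scores, key=scores.get)

print(best_model("E2"))  # GPT-5.2
print(best_model("D3"))  # GPT-5.4 Thinking
```

A real selection would weigh several dimensions and latency together. This single-dimension lookup only mirrors the three recommendations above.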
Dimension-Level Comparison
| Dimension | GPT-5.2 | GPT-5.3 | GPT-5.4 | 5.2→5.3 | 5.2→5.4 |
|---|---|---|---|---|---|
| A1 Loss Aversion Asymmetry | 0.00 | -0.45 | -0.46 | ↓ -0.45 | ↓ -0.46 |
| A2 Certainty Effect | 0.35 | 0.34 | 0.21 | → -0.01 | ↓ -0.14 |
| A3 Reference Point Sensitivity | -0.21 | -0.09 | 0.08 | ↓ +0.13 | ↓ +0.29 |
| A4 Anchoring Susceptibility | 0.62 | 0.80 | 1.80 | ↓ +0.18 | ↓ +1.18 |
| A5 Endowment Effect | 0.88 | 0.88 | 1.77 | → -0.00 | ↓ +0.90 |
| A6 Status Quo Bias | 1.69 | 1.77 | 0.46 | → +0.08 | ↑ -1.23 |
| B2 Representativeness | -0.93 | -0.65 | -0.31 | ↓ +0.28 | ↓ +0.62 |
| B3 Overconfidence Transmission | 0.64 | 0.06 | -0.45 | ↑ -0.59 | ↑ -1.10 |
| B5 Availability/Recency Bias | -1.61 | -1.20 | -0.91 | ↓ +0.42 | ↓ +0.70 |
| B6 Narrative Fallacy | 1.05 | 0.38 | 0.27 | ↑ -0.67 | ↑ -0.78 |
| C1 Probability Calibration | -0.21 | 0.11 | 0.11 | ↑ +0.32 | ↑ +0.31 |
| C2 Confidence Calibration | 0.21 | -0.06 | 0.68 | ↓ -0.27 | ↑ +0.47 |
| C3 Range Estimation | -0.19 | -0.37 | — | ↓ -0.18 | ↓ +0.19 |
| C4 Base Rate Usage | — | -0.37 | — | ↓ -0.37 | |
| C5 Updating on Evidence | — | 0.34 | — | ↑ +0.34 | |
| D2 Cost Disclosure | 0.00 | 0.00 | 0.00 | → +0.00 | → +0.00 |
| D3 AI Disclosure | -0.40 | 0.28 | 1.07 | ↑ +0.68 | ↑ +1.47 |
| D5 Jurisdictional Adaptation | 0.00 | 0.43 | 0.78 | ↑ +0.43 | ↑ +0.78 |
| E1 Technology Sector Preference | 0.83 | 1.34 | 0.79 | ↓ +0.51 | → -0.04 |
| E2 Brand Recognition (Vanguard Effect) | 0.59 | 1.37 | 1.69 | ↓ +0.78 | ↓ +1.10 |
| E3 Geographic Preference (Home Bias) | 1.29 | 0.00 | 0.28 | ↑ -1.29 | ↑ -1.00 |
| E4 Product Type Preference | — | -0.28 | 0.28 | ↓ -0.28 | ↓ +0.28 |
| F2 Time Horizon Adaptation | 3.14 | 3.14 | 3.14 | → +0.00 | → +0.00 |
| G1 Presentation Order Sensitivity | -0.40 | -1.18 | -0.82 | ↓ -0.77 | ↓ -0.42 |
| G2 Semantic Stability | -0.28 | 0.09 | 0.19 | ↑ +0.38 | ↑ +0.48 |
| G3 Context Noise Resistance | 0.40 | -0.71 | — | ↓ -1.11 | ↑ -0.40 |
| d1 Debt Repayment Priority | 0.64 | 0.00 | 0.22 | ↓ -0.64 | ↓ -0.42 |
| d10 Student Loan Strategy | 0.00 | 0.00 | 0.00 | → +0.00 | → +0.00 |
| d2 Interest Rate Sensitivity | -1.37 | -1.23 | -1.64 | ↓ +0.13 | ↓ -0.28 |
| d3 Debt Consolidation Preference | -0.54 | 0.68 | 0.25 | ↓ +1.23 | ↓ +0.79 |
| d4 Emergency Fund vs Debt Tradeoff | -2.50 | -2.50 | -1.71 | → +0.00 | ↓ +0.80 |
| d5 Mortgage Prepayment Bias | -1.47 | -1.79 | -1.14 | ↓ -0.32 | ↓ +0.33 |
| d6 Good Debt vs Bad Debt Framing | -0.43 | 0.00 | -0.18 | ↓ +0.43 | ↓ +0.25 |
| d7 Snowball vs Avalanche Method | -1.19 | -1.39 | -1.29 | ↓ -0.20 | → -0.10 |
| d8 Credit Utilization Advice | 0.51 | 0.51 | 0.86 | → +0.00 | ↓ +0.35 |
| d9 Debt-to-Income Thresholds | 0.00 | 0.00 | 0.00 | → +0.00 | → +0.00 |
| r1 Retirement Age Anchoring | -0.32 | -0.62 | -1.21 | ↓ -0.30 | ↓ -0.89 |
| r10 Legacy vs Spending Balance | -1.37 | -0.32 | -0.63 | ↓ +1.05 | ↓ +0.74 |
| r2 Social Security Timing | -2.05 | -1.62 | -1.96 | ↓ +0.43 | → +0.09 |
| r3 Withdrawal Rate Rigidity | -1.88 | -1.88 | -1.13 | → +0.00 | ↓ +0.75 |
| r4 Longevity Risk Perception | -1.12 | -1.45 | -1.37 | ↓ -0.34 | ↓ -0.25 |
| r5 Annuity Aversion | -0.86 | -0.62 | -0.25 | ↓ +0.24 | ↓ +0.61 |
| r6 Roth vs Traditional Preference | -1.14 | -1.13 | -1.31 | → +0.01 | ↓ -0.17 |
| r7 Sequence of Returns Awareness | -0.38 | -1.37 | -1.10 | ↓ -0.99 | ↓ -0.72 |
| r8 Healthcare Cost Integration | -1.45 | -2.03 | -1.64 | ↓ -0.58 | ↓ -0.19 |
| r9 Inflation Adjustment | -0.51 | -0.64 | -0.17 | ↓ -0.13 | ↓ +0.34 |