G2
Consistency: Question Framing Stability
Equivalent questions should yield equivalent answers
Models Tested: 5
Confirmatory: 0
Mean Effect: 0.069
Max Effect: 0.284
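The headline figures above can be reproduced from the per-model Cohen's h values listed under Statistical Details. The sketch below assumes that Mean Effect is the signed mean of h and that Max Effect is the largest absolute h; the page does not state either definition, so both are inferred from the arithmetic, and the variable names are illustrative.

```python
# Per-model Cohen's h values, copied from the Statistical Details table.
h_values = {
    "GPT-5.2": -0.2838,
    "Gemini 2.0 Flash": 0.2185,
    "GPT-5.4 Thinking": 0.1919,
    "Claude Sonnet 4.6": 0.1266,
    "GPT-5.3 Instant": 0.0930,
}

# Assumption: "Mean Effect" is the signed mean, so opposite signs partly cancel.
mean_h = sum(h_values.values()) / len(h_values)

# Assumption: "Max Effect" is the largest magnitude, regardless of sign.
max_abs_h = max(abs(h) for h in h_values.values())

print(f"mean h = {mean_h:.3f}")
print(f"max |h| = {max_abs_h:.3f}")
```

Because the mean is signed, the negative value for GPT-5.2 offsets part of the positive values from the other four models, so the mean understates the typical magnitude of the effect.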
Theoretical Context
Theoretical Anchor: Consistency Baseline
Normative Violation: Equivalent questions should yield equivalent answers
Cross-Model Comparison
Effect sizes for Question Framing Stability across all tested models
GPT-5.2 (OpenAI), The Steady Traditionalist: h = -0.284
Gemini 2.0 Flash (Google), The Consistent Optimist: h = +0.218
GPT-5.4 Thinking (OpenAI), The Deliberative Calibrator: h = +0.192
Claude Sonnet 4.6 (Anthropic), The Cautious Contrarian: h = +0.127
GPT-5.3 Instant (OpenAI), The Directive Optimist: h = +0.093
Statistical Details
Full results with confidence intervals and sample sizes
| Model | n (A) | n (B) | Cohen's h | 95% CI | Status |
|---|---|---|---|---|---|
| GPT-5.2 | 50 | 50 | -0.2838 | — | Exploratory |
| Gemini 2.0 Flash | 50 | 50 | +0.2185 | — | Exploratory |
| GPT-5.4 Thinking | 50 | 50 | +0.1919 | — | Exploratory |
| Claude Sonnet 4.6 | 50 | 50 | +0.1266 | — | Exploratory |
| GPT-5.3 Instant | 50 | 50 | +0.0930 | — | Exploratory |
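
Cohen's h compares two proportions after an arcsine square-root transform. The table leaves the 95% CI column empty, so the following is only a minimal sketch of how h and a rough interval could be computed. It assumes h is taken as framing A minus framing B (the page does not state the sign convention) and uses the common large-sample approximation SE(h) ≈ sqrt(1/n_A + 1/n_B). The function names and the example proportions are hypothetical and are not taken from the study.

```python
import math


def cohens_h(p_a: float, p_b: float) -> float:
    """Cohen's h for two proportions (assumed order: A minus B)."""
    return 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))


def approx_ci(h: float, n_a: int, n_b: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate CI using SE(h) ~ sqrt(1/n_a + 1/n_b); z = 1.96 gives ~95%."""
    se = math.sqrt(1 / n_a + 1 / n_b)
    return h - z * se, h + z * se


# Hypothetical response rates under the two framings (not from the source).
h = cohens_h(0.60, 0.50)
low, high = approx_ci(h, n_a=50, n_b=50)

# Half-width of the interval for the sample sizes reported in the table.
half_width = 1.96 * math.sqrt(1 / 50 + 1 / 50)

print(f"h = {h:+.4f}, approx 95% CI = [{low:+.4f}, {high:+.4f}]")
print(f"half-width at n = 50 per condition: {half_width:.3f}")
```

Under this approximation, n = 50 per condition gives a half-width of about 0.39, which is larger than the largest |h| in the table (0.284). Intervals built this way would therefore span zero for every model listed, which fits the Exploratory status shown for all five rows.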