Products

Three products. One question.

Should I trust this AI financial response before I act on it?

VERRIX Checker answers it for free in 30 seconds. VERRIX Confidence answers it systematically in production. VERRIX Genome answers it at the model level — before a single response reaches anyone.

All three products are in beta

Pre-deployment

VERRIX Genome

Request a Genome evaluation →

Before you deploy an AI advisory model, you need to know what it will actually do. Not whether it can answer financial questions — that is a capability question. Whether its recommendations shift based on how a scenario is framed, which fund providers it systematically favours, and whether its reasoning amplifies or attenuates specific biases. VERRIX Genome runs a validated 83-scenario dollar-impact battery and delivers the answer.

How it works

We run the full validation battery on any model or fine-tuned variant. Every scenario carries a pre-specified correct answer and a dollar stake — ranging from $383 to $94,000 per error.

What you get

  • 31-dimension advice genome across 7 clusters (A–G)
  • Dollar-weighted accuracy by scenario type and advisory domain
  • Failure-mode analysis — where the model is wrong and by how much
  • Comparison to published baselines for all six profiled models
  • Regulatory alignment scorecard (Reg BI · FCA Consumer Duty · MiFID II)
  • Drift threshold calibration for ongoing monitoring
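As a rough illustration of what "dollar-weighted accuracy" can mean, the sketch below weights each scenario by its dollar stake, so a miss on a large-stake scenario costs more than a miss on a small one. The formula (correct stake divided by total stake), the `ScenarioResult` type, and the sample scenarios are assumptions made for illustration; only the $383 and $94,000 stake bounds come from the text above.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """One battery scenario (hypothetical structure, for illustration only)."""
    scenario_id: str
    domain: str
    stake_usd: float  # dollar impact of getting this scenario wrong
    correct: bool     # did the model match the pre-specified answer?


def dollar_weighted_accuracy(results):
    """Share of total dollar stake that sits on correctly answered scenarios.

    Assumed definition: sum(stake of correct scenarios) / sum(all stakes).
    """
    total = sum(r.stake_usd for r in results)
    if total == 0:
        raise ValueError("no dollar stake to weight by")
    return sum(r.stake_usd for r in results if r.correct) / total


# Made-up scenarios; the two extreme stakes mirror the quoted range.
results = [
    ScenarioResult("s01", "retirement", 383.0, True),
    ScenarioResult("s02", "retirement", 94_000.0, False),
    ScenarioResult("s03", "mortgage", 12_500.0, True),
]

unweighted = sum(r.correct for r in results) / len(results)
weighted = dollar_weighted_accuracy(results)
print(f"unweighted accuracy: {unweighted:.3f}")
print(f"dollar-weighted accuracy: {weighted:.3f}")
```

In this toy example two of three scenarios are correct, but the single miss carries most of the dollar stake, so the weighted figure falls well below the unweighted one.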

Case study — GPT-5.5

GPT-5.5 was fingerprinted on its release day, April 25, 2026. Battery accuracy was 97.6%, versus 37.6–43.5% for the prior generation. The behavioral shift was detectable within 24 hours of model availability, before anyone was relying on it.

Per-evaluation fee. Quarterly subscription available. Report delivered within 48–72 hours of model API access. Indicative pricing on request — email hello@human-machines.com.

In production · pre-registered, OOS-validated

VERRIX Confidence

96.7% TRUST-zone accuracy, validated out-of-sample across 6 platforms in US and UK regulatory contexts. EU validation in development.

Or try the free demo →

Not every AI response deserves the same level of trust. VERRIX Confidence scores each response and routes it into one of three tiers — telling you when to act, when to check, and when to stop, without needing to know the right answer in advance.

TRUST · Score ≥ 0.60 · Act on this response: the model's known behavioral tendencies are not strongly active on this question type.
REVIEW · Score 0.30–0.60 · Have a human check this first: moderate uncertainty about reliability.
FLAG · Score < 0.30 · Do not act on this response: the model's known biases are likely active on this question type.
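The three tiers amount to a threshold lookup on the score. A minimal sketch, using the 0.60 and 0.30 cut-offs stated above; the function and constant names are hypothetical, and the handling of boundary values follows the "≥ 0.60" and "< 0.30" wording:

```python
TRUST_MIN = 0.60   # scores at or above this are TRUST
REVIEW_MIN = 0.30  # scores at or above this (and below TRUST_MIN) are REVIEW


def route(score: float) -> str:
    """Map a calibrated score in [0, 1] to a TRUST / REVIEW / FLAG tier."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= TRUST_MIN:
        return "TRUST"
    if score >= REVIEW_MIN:
        return "REVIEW"
    return "FLAG"


for s in (0.82, 0.60, 0.45, 0.30, 0.12):
    print(f"{s:.2f} -> {route(s)}")
```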

Zone accuracy figures reflect performance on a pre-registered out-of-sample validation program across 6 platforms × 65 US/UK/EU/Cross-jurisdiction scenarios (4,495 scored responses, monotonic gradient TRUST 97% > REVIEW 81% > FLAG 63%). Individual response accuracy in deployment may differ. VERRIX Confidence is a quality scoring tool, not a guarantee of correctness.

96.7% · TRUST-zone accuracy (OOS)

6 · platforms validated

4,495 · scored responses

34pp · TRUST > FLAG accuracy gap

Pre-registered before data collection · 4,495 responses across 6 platforms × 65 US/UK/EU/Cross scenarios · monotonic gradient confirmed
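A "monotonic gradient" here means that accuracy falls strictly from TRUST to REVIEW to FLAG. The sketch below shows one way to compute per-zone accuracy from scored responses and check that ordering. The helper names and the toy responses are hypothetical; the 97%, 81%, and 63% figures are the rounded zone accuracies quoted on this page.

```python
from collections import defaultdict

ZONES = ("TRUST", "REVIEW", "FLAG")


def zone_accuracy(scored):
    """Per-zone accuracy from an iterable of (zone, was_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for zone, correct in scored:
        totals[zone] += 1
        hits[zone] += int(correct)
    return {z: hits[z] / totals[z] for z in ZONES if totals[z]}


def is_monotonic(acc):
    """True when accuracy strictly decreases TRUST > REVIEW > FLAG."""
    return acc["TRUST"] > acc["REVIEW"] > acc["FLAG"]


# Toy scored responses (invented for illustration).
toy = [("TRUST", True), ("TRUST", True), ("REVIEW", True),
       ("REVIEW", False), ("FLAG", False), ("FLAG", False)]
print(zone_accuracy(toy), is_monotonic(zone_accuracy(toy)))

# Rounded zone accuracies as quoted on this page.
published = {"TRUST": 0.97, "REVIEW": 0.81, "FLAG": 0.63}
gap_pp = round((published["TRUST"] - published["FLAG"]) * 100)
print(is_monotonic(published), f"{gap_pp}pp TRUST-to-FLAG gap")
```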

Across the 423 TRUST classifications in the UK, EU, and cross-jurisdictional validation phase, no clearly incorrect response received a TRUST classification. Every TRUST-zone failure is a partial-credit response — one that addresses the scenario correctly but skips a required element. The worst-case TRUST outcome is an incomplete answer, not a wrong one.

How the score is computed

Every response is reduced to 11 features and run through a calibrated logistic regression. Five features capture response quality (composite quality, explicit calculation present, regulatory factor cited, clear recommendation made, alternatives considered); six come from the model's advice genome (composite and weighted Cohen's h, factor signal alignment, calculation performed, language confidence, agreement with prior advice). The calibrator is CalibratedClassifierCV wrapping LogisticRegression, fit with 5-fold GroupKFold cross-validation so no scenario appears in both training and validation. The output probability maps to TRUST (≥ 0.60), REVIEW (0.30–0.60), or FLAG (< 0.30).
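The setup described above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the production pipeline: the feature matrix, labels, scenario IDs, and the choice of sigmoid calibration are all assumptions. The point it shows is the grouping: splits are generated by `GroupKFold` on scenario ID, so every response to a given scenario lands on one side of each fold.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in data: 600 responses, 11 features, 60 scenarios.
n_responses, n_features, n_scenarios = 600, 11, 60
X = rng.normal(size=(n_responses, n_features))
groups = rng.integers(0, n_scenarios, size=n_responses)  # scenario ID per response
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n_responses) > 0).astype(int)

# Precompute 5 grouped folds so no scenario is split across train/validation.
splits = list(GroupKFold(n_splits=5).split(X, y, groups))

# Calibrated logistic regression; "sigmoid" is an assumed calibration method.
calibrator = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=splits
)
calibrator.fit(X, y)

# Probability that a response is correct, to be mapped onto the three tiers.
scores = calibrator.predict_proba(X[:5])[:, 1]
print(np.round(scores, 3))
```

Passing precomputed splits keeps the grouping explicit and avoids depending on how a given scikit-learn version forwards `groups` through `fit`.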

What VERRIX Confidence does not do

  • It does not tell you the correct answer. It tells you how consistent the response is with patterns that were correct in the validation dataset — not what the response should have said.
  • It does not generalise unconditionally outside the training distribution. The validated battery covers framing, high-stakes planning, consumer protection, and structural-preference scenarios; novel scenario types fall back to the confidence score alone, without arithmetic verification.
  • It does not replace human review for high-stakes decisions. REVIEW and FLAG zones are explicit hand-offs to a human reviewer; even TRUST-zone responses warrant review when the dollar impact of an error is large.
  • It is not a regulatory compliance certification. The validation studies measure agreement with regulatory standards in specific scenarios — they do not constitute legal advice or an attestation that any particular response complies with FINRA, SEC, FCA, or MiFID II rules.
  • It does not catch every error type. Fabricated facts that read as fluent and self-consistent can score in the TRUST zone if they hit the quality and fingerprint signals the model was trained on. Arithmetic verification (when scenarios match) is the second line of defense; human review is the third.

Try it now

Free in beta. Enterprise pricing for production volume — email hello@human-machines.com.

Free · No account required

VERRIX Checker

Try VERRIX Checker →

Paste any AI financial advisory response on a supported scenario. VERRIX Checker generates an independent arithmetic check — without seeing the original recommendation — and scores the response using the VERRIX Confidence calibration model. You get:

  • A calibrated confidence score — TRUST, REVIEW, or FLAG
  • An independent arithmetic check — generated without seeing the original recommendation, confirming whether the numbers support the advice
  • Detection of any invented figures in the calculation

26 supported scenarios including Social Security timing, 401k match optimisation, Roth vs Traditional, mortgage overpayment vs investing, and UK and EU pension scenarios.

5 free checks/day · 3 model comparisons/day · No account required

Market context

Where VERRIX fits with existing approaches.

AI governance platforms (Credo AI, Holistic AI, Arthur AI) audit policy and process — they do not measure behavioral bias in advisory contexts using controlled matched-pair methodology. Capability benchmarks (Vals AI, FinBen, FINOS FinLLM) measure whether a model can answer financial questions correctly — they do not measure whether the answer changes systematically based on framing or which providers the model prefers. VERRIX is the behavioral measurement layer those tools do not cover.

AI Governance

Credo AI · Holistic AI · Arthur AI

Audits AI process, policy, and documentation. Essential for enterprise AI governance. Does not measure domain-specific behavioral bias in advisory contexts.

Financial AI Benchmarks

Vals AI · FinBen · FINOS FinLLM

Measure capability — can the model answer financial questions correctly? Do not measure whether advice changes systematically based on how the same situation is framed.

VERRIX Confidence

The behavioral measurement layer

Pre-registered behavioral bias measurement with dollar-impact ground truth. Per-response calibrated quality scoring validated across US and UK regulatory contexts. The only platform answering both the pre-deployment behavioral question and the in-production quality question.

These are complementary, not competing categories. A firm using an AI governance platform for policy compliance still cannot answer the behavioral bias question without VERRIX.

When does Confidence need a Genome first?

Confidence relies on a per-model advice genome. For the six publicly profiled models, the fingerprint is already in the VERRIX database. For new public models we haven't profiled yet, or for private and fine-tuned deployments, the calibrator operates conservatively until a fingerprint is collected.

Scenario A

Public profiled model

GPT-5.4 Thinking · GPT-5.3 · GPT-5.5 · Claude Sonnet 4.6 · Claude Haiku 4.5 · Gemini 2.0 Flash

Confidence works directly. No Genome needed.

Fingerprints for these six models are already in the VERRIX database. Calibration metrics (AUC 0.876, Brier 0.105, 97.9% US TRUST-zone accuracy across 530 classifications) were measured on responses from these models in the pre-registered out-of-sample validation program.

Scenario B

UK regulatory framing

GPT-5.4 Thinking · GPT-5.3 · GPT-5.5 · Claude Sonnet 4.6 · Claude Haiku 4.5 · Gemini 2.0 Flash

Validated at 91.9% TRUST-zone accuracy, with US-derived fingerprints.

Across 248 UK TRUST classifications under FCA Consumer Duty framing, accuracy is 91.9% (95% CI [87.9, 94.7]). Failures are limited to partial-credit responses and are never clearly wrong. EU validation is in development; under EU regulatory framing the calibrator currently routes most queries to REVIEW pending EU-specific calibration.
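For readers who want to sanity-check the interval, the sketch below computes a Wilson score interval for a binomial proportion. Two assumptions are made here that the page does not state: that the interval is a Wilson interval, and that 91.9% of 248 corresponds to 228 correct classifications. Under those assumptions the result agrees with the quoted bounds to one decimal place.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 228 of 248 UK TRUST classifications correct (91.9%).
low, high = wilson_interval(228, 248)
print(f"accuracy {228 / 248:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```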

Scenario C

Private or fine-tuned deployment

Azure-hosted GPT · fine-tuned variants · enterprise stacks

Run VERRIX Genome first.

For models VERRIX has never seen — your firm's fine-tuned deployment, an Azure-hosted variant, an internal model — Confidence cannot calibrate without behavioral ground truth. Genome collects the fingerprint over an A/B vignette battery; once captured, Confidence operates with full calibration.

In development · Request early access

What we are building next.

Four products in development, each patent-pending.

In development

VERRIX Monitor

Behavioral compliance monitoring with drift detection.

AI model providers update their models without announcing behavioral changes. VERRIX Monitor applies a sealed canonical stimulus battery on a recurring schedule and alerts you when a model update has shifted compliance-relevant dimensions — generating structured alerts (REVIEW, ESCALATE, SUSPEND) mapped to regulatory frameworks.

Who it's for

Compliance teams, model risk officers, and regulatory bodies deploying AI advisory systems at scale.

What it does

  • Scheduled re-evaluation using a sealed, cryptographically-verified stimulus battery
  • Dimension-level drift detection with compliance-direction flagging
  • Structured machine-readable alerts when drift moves toward non-compliance
  • Trajectory analysis across multiple monitoring periods
  • Regulatory framework mapping (Reg BI, FCA Consumer Duty, MiFID II)

Patent-pending (HMG-2026-P3)

Request early access →

In development

VERRIX Signal

AI advisory allocation preferences as capital flow signals.

VERRIX Signal measures how AI advisors allocate across asset classes, sectors, and market regimes — and generates characterization signals that describe where AI-advised retail capital is directed before flows are visible in fund data.

Who it's for

Institutional investors, systematic funds, and risk managers who need to understand the directional influence of AI advice at scale.

What it does

  • Preference Elicitation Battery measuring AI allocation preferences across market regimes
  • Condition sensitivity scoring — how strongly preferences shift in bull, bear, and volatility regimes
  • Drift detection when AI allocation preferences change following model updates
  • Ensemble concentration risk scoring when multiple AI advisors converge on the same allocation
  • Capital flow characterization signals as leading indicators of AI-advised retail positioning

Patent-pending (HMG-2026-P6) · Target 2027

Register interest →

In development

VERRIX Systemic

Cross-model ensemble concentration risk measurement.

When multiple AI advisors simultaneously share systematic biases, the aggregate effect on capital markets is amplified. VERRIX Systemic quantifies this — computing cross-model behavioral similarity scores, identifying consensus bias dimensions, and generating ensemble concentration risk scores for regulators and institutional risk managers.

Who it's for

Systemic risk teams, central banks, financial regulators, and institutions with AI-aware macro risk mandates.

What it does

  • Pairwise behavioral similarity matrices across AI advisor populations
  • Consensus bias dimension identification (where multiple models agree on a bias direction)
  • Ensemble Concentration Risk Score — a scalar capturing aggregate bias alignment
  • Co-recommendation probability estimates for correlated positioning
  • Trend monitoring for increasing or decreasing systemic concentration

Patent-pending (HMG-2026-P5)

Register interest →

In development

VERRIX Ensemble

Genome-calibrated multi-model advisory analysis.

When multiple AI advisors agree, it usually looks like strong signal. It is often correlated bias. VERRIX Ensemble decomposes multi-model advisory outputs into what is predicted by each model's advice genome versus what is genuinely independent analytical signal — weighting cross-model agreement by fingerprint independence.

Who it's for

Investment analysts, portfolio managers, and research teams already using multiple AI advisory tools in parallel.

What it does

  • Bias Saturation Index (BSI) for each model on each query — fraction of output predicted by fingerprint
  • Decomposition of multi-model outputs into bias-explained and residual signal components
  • Independence-weighted agreement scoring — agreement across models with divergent fingerprints carries more evidential weight
  • Fingerprint-unexplained divergence detection — the signal most likely to contain genuine analytical content
  • Correlated-bias detection when agreement reflects shared training tendencies, not independent analysis

Patent-pending (HMG-2026-P7)

Register interest →

All four products are patent-pending. Early access requests are reviewed individually.