How It Works
Every dataset item or trace is analyzed against Valiqor’s failure taxonomy. For each applicable subcategory, a classification decision is produced along with a severity score and a confidence level.

Failure Decisions

Each subcategory check produces one of four decisions:

| Decision | Meaning |
|---|---|
| `fail` | Failure detected — evidence supports this classification |
| `pass` | No failure — the system behaved correctly |
| `unsure` | Ambiguous — insufficient evidence to decide |
| `not_applicable` | This subcategory doesn’t apply to this app type |
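As a sketch, tallying decisions from a run might look like this (the record shape and subcategory names are invented for illustration; only the decision values come from the table above):

```python
from collections import Counter

# Hypothetical result records; only the "decision" values are from the
# table above -- the record shape is invented for illustration.
results = [
    {"subcategory": "hallucination", "decision": "fail"},
    {"subcategory": "toxicity", "decision": "pass"},
    {"subcategory": "pii_exposure", "decision": "not_applicable"},
    {"subcategory": "grounding", "decision": "unsure"},
]

counts = Counter(r["decision"] for r in results)
failures = [r for r in results if r["decision"] == "fail"]
```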
Severity
Severity measures how bad a failure is, on a 0–5 scale. Higher severity means greater potential impact on the user, the business, or safety. Valiqor computes severity automatically based on the failure type, its context, and how frequently it recurs.

| Severity | Interpretation |
|---|---|
| 0–1 | Minor — cosmetic or low-risk issue |
| 2–3 | Moderate — user-visible failure that should be investigated |
| 4–5 | Critical — potential financial, legal, or safety impact |
Failures associated with high-risk security categories (e.g. self-harm, PII exposure, hate speech) are automatically escalated to critical severity.
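A minimal sketch of the banding and escalation logic described above (the function names and the exact category identifiers are assumptions, not Valiqor's API):

```python
# Assumed identifiers for the high-risk categories named in the note above.
HIGH_RISK_CATEGORIES = {"self_harm", "pii_exposure", "hate_speech"}

def effective_severity(score: int, category: str) -> int:
    """Escalate high-risk security categories to critical (5)."""
    return 5 if category in HIGH_RISK_CATEGORIES else score

def severity_band(score: int) -> str:
    """Map a 0-5 severity score to the interpretation bands in the table."""
    if score <= 1:
        return "minor"
    if score <= 3:
        return "moderate"
    return "critical"
```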
Frequency Amplification
When the same failure type recurs across multiple items in a dataset, its severity is amplified. Isolated issues score lower than systemic patterns.

Confidence
Confidence measures how certain Valiqor is about a classification, on a 0.0–1.0 scale. Confidence increases when multiple independent signals agree:

- Rule-based detectors confirm the failure
- LLM judge classifies the failure
- Evaluation metrics corroborate the finding
- Security classifiers flag related content
| Confidence | Interpretation |
|---|---|
| 0.8–1.0 | High — multiple signals agree |
| 0.5–0.7 | Moderate — some evidence, but not conclusive |
| < 0.5 | Low — limited evidence, consider manual review |
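One way to act on these bands in client code (a sketch; the thresholds mirror the table, the function and routing names are invented):

```python
def confidence_route(confidence: float) -> str:
    """Route a finding based on the confidence bands in the table above."""
    if confidence >= 0.8:
        return "automate"       # multiple signals agree; safe to act on
    if confidence >= 0.5:
        return "spot-check"     # some evidence, but not conclusive
    return "manual-review"      # limited evidence; review by hand
```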
Reading FATag Results
Every failure is returned as an `FATag` with these key fields:
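Based on the fields discussed in the sections that follow, the shape can be pictured roughly like this (a sketch assuming these field names; the real `FATag` class may differ):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FATagSketch:
    """Rough shape of a failure tag; field names are taken from this page."""
    subcategory: str
    decision: str                      # "fail" | "pass" | "unsure" | "not_applicable"
    severity: int                      # 0-5 scale
    confidence: float                  # 0.0-1.0 scale
    judge_rationale: Optional[str] = None
    evidence: list = field(default_factory=list)
    eval_metric_values: dict = field(default_factory=dict)

tag = FATagSketch(subcategory="hallucination", decision="fail",
                  severity=4, confidence=0.9)
```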
Judge Rationale
For LLM-judge-detected failures, the `judge_rationale` field contains the judge’s explanation:
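Accessing it might look like this (the tag here is a stand-in dict and the rationale text is invented):

```python
# Stand-in tag; in practice this would come from a Valiqor run result.
tag = {
    "decision": "fail",
    "judge_rationale": "The answer states a figure that does not appear in the retrieved context.",
}

# Only judge-detected failures carry a rationale, so guard before reading it.
if tag["decision"] == "fail" and tag.get("judge_rationale"):
    print(f"Judge: {tag['judge_rationale']}")
```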
Evidence
Each tag includes structured evidence linking back to the original data:

| Field | Description |
|---|---|
| `evidence_type` | Kind of evidence (e.g. "span", "claim", "metric") |
| `description` | Human-readable explanation |
| `source` | Where it came from ("trace", "eval", "security") |
| `content_snippet` | Relevant text excerpt |
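For example, filtering a tag's evidence by source (the entries are invented; the field names match the table above):

```python
# Invented evidence entries using the fields from the table above.
evidence = [
    {"evidence_type": "span", "description": "Unsupported claim in answer",
     "source": "trace", "content_snippet": "Revenue grew 40% last quarter."},
    {"evidence_type": "metric", "description": "Low faithfulness score",
     "source": "eval", "content_snippet": ""},
]

# Keep only evidence that points back into the original trace.
trace_evidence = [e for e in evidence if e["source"] == "trace"]
```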
Eval Metric Values
The `eval_metric_values` dict shows which evaluation metrics were used as supporting evidence:
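A sketch of inspecting it (the metric names, values, and threshold are invented):

```python
# Hypothetical eval_metric_values contents for one tag.
eval_metric_values = {"faithfulness": 0.42, "answer_relevancy": 0.88}

# Metrics below an illustrative 0.5 threshold are the likely culprits.
weak_metrics = {name: v for name, v in eval_metric_values.items() if v < 0.5}
```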
Automation Flags
The `FARunResult` summary includes built-in flags for automation:
| Flag | When It Triggers |
|---|---|
| `should_alert` | Critical failures with high confidence |
| `should_gate_ci` | Failures severe enough to block deployment |
| `needs_human_review` | High-severity failures where evidence is ambiguous |
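In a CI pipeline, the gating flag could be consumed like this (the summary dict is a stand-in for a real `FARunResult`):

```python
# Stand-in for a run-result summary exposing the flags in the table above.
summary = {"should_alert": False, "should_gate_ci": True, "needs_human_review": False}

# Exit non-zero so the CI job fails and blocks deployment.
exit_code = 1 if summary["should_gate_ci"] else 0
```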
Summary Statistics
Interpreting Results
High Severity + High Confidence → Act Now
Reliable, serious failures. Set up automated alerts and CI gates.

High Severity + Low Confidence → Review

The system suspects a serious failure but evidence is ambiguous. Queue for human review — the `needs_human_review` flag catches these automatically.
Low Severity + High Confidence → Monitor
Real but minor issues. Track trends with `get_trends()` — if frequency increases, severity will be amplified.
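Frequency amplification can be pictured with a toy model (an assumption for illustration; Valiqor's actual amplification formula is not documented here):

```python
def amplified_severity(base: int, occurrences: int) -> int:
    """Toy model: recurring failures gain up to +2 severity, capped at 5."""
    bump = min(max(occurrences - 1, 0), 2)
    return min(base + bump, 5)
```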