> **Documentation Index:** fetch the complete documentation index at https://docs.valiqor.com/llms.txt to discover all available pages before exploring further.
Valiqor’s Failure Analysis doesn’t just detect failures — it traces each
failure back to its root cause, assigns a severity score, computes
a confidence level, and suggests remediation.
## How It Works
Every dataset item or trace is analyzed against Valiqor’s
failure taxonomy. For each applicable
subcategory, a classification decision is produced along with a severity
score and confidence level.
```
Input (prompt, response, context)
              ↓
      Evidence Collection
              ↓
    Failure Classification
              ↓
    FATag per subcategory
(decision + severity + confidence + evidence)
```
## Failure Decisions

Each subcategory check produces one of four decisions:

| Decision | Meaning |
|---|---|
| `fail` | Failure detected: evidence supports this classification |
| `pass` | No failure: the system behaved correctly |
| `unsure` | Ambiguous: insufficient evidence to decide |
| `not_applicable` | This subcategory doesn’t apply to this app type |
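As a quick sanity check on a run, the four decisions can be tallied across tags. A minimal sketch; the `SimpleTag` stand-in is illustrative, while real FATag objects expose the same `decision` field alongside the other fields documented on this page:

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative stand-in: a real FATag carries more fields than `decision`.
@dataclass
class SimpleTag:
    decision: str

def tally_decisions(tags):
    """Count how many subcategory checks landed in each decision."""
    counts = Counter(t.decision for t in tags)
    # Ensure every decision appears, even with a zero count.
    return {d: counts.get(d, 0)
            for d in ("fail", "pass", "unsure", "not_applicable")}

tags = [SimpleTag("fail"), SimpleTag("pass"),
        SimpleTag("pass"), SimpleTag("unsure")]
print(tally_decisions(tags))
# {'fail': 1, 'pass': 2, 'unsure': 1, 'not_applicable': 0}
```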
## Severity
Severity measures how bad a failure is, on a 0–5 scale.
Higher severity means greater potential impact on the user, the business, or
safety. Valiqor computes severity automatically based on the failure type,
its context, and how frequently it recurs.
| Severity | Interpretation |
|---|---|
| 0–1 | Minor: cosmetic or low-risk issue |
| 2–3 | Moderate: user-visible failure that should be investigated |
| 4–5 | Critical: potential financial, legal, or safety impact |
Failures associated with high-risk security categories
(e.g. self-harm, PII exposure, hate speech) are automatically escalated
to critical severity.
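When filtering results, the bands in the table above can be mirrored in code. A minimal sketch; the table lists whole-number ranges, so where fractional scores fall at the band edges is an assumption made here, not documented behaviour:

```python
def severity_band(severity: float) -> str:
    """Map a 0.0-5.0 severity score to the bands in the table above.

    Cut-offs for fractional scores (e.g. 1.5) are an assumption here.
    """
    if severity >= 4.0:
        return "critical"
    if severity >= 2.0:
        return "moderate"
    return "minor"

print(severity_band(4.6))  # critical
```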
### Frequency Amplification
When the same failure type recurs across multiple items in a dataset,
severity is amplified. Isolated issues score lower than systemic patterns.
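Amplification itself happens inside Valiqor, but the same systemic patterns can be surfaced client-side by counting how often each subcategory fails across a run. A sketch using only the documented `subcategory_name` and `decision` fields; `Tag` is a stand-in and `min_count` is an illustrative threshold:

```python
from collections import Counter
from types import SimpleNamespace as Tag  # illustrative stand-in for an FATag

def recurring_failures(tags, min_count=2):
    """Subcategories that fail on min_count or more items,
    most frequent first: these are the systemic patterns."""
    counts = Counter(t.subcategory_name
                     for t in tags if t.decision == "fail")
    return [(name, n) for name, n in counts.most_common()
            if n >= min_count]

tags = [Tag(subcategory_name="Unsupported factual claim", decision="fail"),
        Tag(subcategory_name="Unsupported factual claim", decision="fail"),
        Tag(subcategory_name="Tone mismatch", decision="fail")]
print(recurring_failures(tags))
# [('Unsupported factual claim', 2)]
```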
## Confidence
Confidence measures how certain Valiqor is about a classification,
on a 0.0–1.0 scale.
Confidence increases when multiple independent signals agree:
- Rule-based detectors confirm the failure
- LLM judge classifies the failure
- Evaluation metrics corroborate the finding
- Security classifiers flag related content
When signals disagree, confidence is reduced and the result may be
flagged for human review.
| Confidence | Interpretation |
|---|---|
| 0.8–1.0 | High: multiple signals agree |
| 0.5–0.7 | Moderate: some evidence, but not conclusive |
| < 0.5 | Low: limited evidence, consider manual review |
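As with severity, the confidence bands can be mirrored for triage. A minimal sketch; the table leaves a gap between 0.7 and 0.8, which this sketch closes at 0.8 (an assumption, not documented behaviour):

```python
def confidence_band(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to the bands in the table above."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "moderate"
    return "low"

print(confidence_band(0.75))  # moderate
```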
## Reading FATag Results
Every failure is returned as an FATag with these key fields:
```python
tag = result.failure_tags[0]

print(tag.decision)            # "fail", "pass", "unsure", "not_applicable"
print(tag.bucket_name)         # e.g. "Hallucination & Grounding"
print(tag.subcategory_name)    # e.g. "Unsupported factual claim"
print(tag.severity)            # 0.0 – 5.0
print(tag.confidence)          # 0.0 – 1.0
print(tag.detector_type_used)  # "deterministic", "llm_judge", or "hybrid"
```
### Judge Rationale

For LLM-judge-detected failures, the `judge_rationale` field contains the
judge’s explanation:

```python
if tag.judge_rationale:
    print(f"Why: {tag.judge_rationale}")
```
### Evidence
Each tag includes structured evidence linking back to the original data:
```python
for evidence in tag.evidence_items:
    print(f"  [{evidence.evidence_type}] {evidence.description}")
    if evidence.content_snippet:
        print(f"    → {evidence.content_snippet}")
```
| Field | Description |
|---|---|
| `evidence_type` | Kind of evidence (e.g. `"span"`, `"claim"`, `"metric"`) |
| `description` | Human-readable explanation |
| `source` | Where it came from (`"trace"`, `"eval"`, `"security"`) |
| `content_snippet` | Relevant text excerpt |
### Eval Metric Values

The `eval_metric_values` dict shows which evaluation metrics were used as
supporting evidence:
```python
for metric, score in tag.eval_metric_values.items():
    print(f"  {metric}: {score:.3f}")

# e.g. hallucination: 0.850, context_precision: 0.420
```
## Automation Flags
The FARunResult summary includes built-in flags for automation:
```python
result = client.failure_analysis.run(dataset=data)

if result.summary.should_alert:
    # Trigger a Slack / PagerDuty notification
    send_alert(result)

if result.summary.should_gate_ci:
    # Block the CI/CD pipeline
    sys.exit(1)

if result.summary.needs_human_review:
    # Queue for manual inspection
    create_review_ticket(result)
```
| Flag | When It Triggers |
|---|---|
| `should_alert` | Critical failures with high confidence |
| `should_gate_ci` | Failures severe enough to block deployment |
| `needs_human_review` | High-severity failures where evidence is ambiguous |
## Summary Statistics

```python
summary = result.summary

print(f"Failures: {summary.total_failures_detected}")
print(f"Passes: {summary.total_passes}")
print(f"Items with failures: {summary.items_with_failures}")
print(f"Overall severity: {summary.overall_severity:.1f}")
print(f"Buckets affected: {summary.buckets_affected}")
```
## Interpreting Results
### High Severity + High Confidence → Act Now
Reliable, serious failures. Set up automated alerts and CI gates.
### High Severity + Low Confidence → Review

The system suspects a serious failure but the evidence is ambiguous. Queue it
for human review; the `needs_human_review` flag catches these automatically.
### Low Severity + High Confidence → Monitor

Real but minor issues. Track trends with `get_trends()`: if frequency
increases, severity will be amplified.
### Unsure Decision → Investigate
The detector couldn’t reach a conclusion. This typically means the input
data is insufficient for classification (e.g. missing context for RAG
checks). Provide richer data for better results.
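The four responses above can be collapsed into a simple triage helper. A sketch under assumed cut-offs; both thresholds are illustrative, and Valiqor's own internal thresholds behind flags like `should_alert` are not documented here (`Tag` is a stand-in for an FATag):

```python
from types import SimpleNamespace as Tag  # illustrative stand-in for an FATag

def recommended_action(tag, severity_cutoff=3.5, confidence_cutoff=0.8):
    """Map a tag to one of the four responses described above.

    Both cut-offs are illustrative, not Valiqor's internal values.
    Low-severity, low-confidence tags are left to "monitor" here,
    a quadrant the guidance above does not cover explicitly.
    """
    if tag.decision == "unsure":
        return "investigate"
    serious = tag.severity >= severity_cutoff
    confident = tag.confidence >= confidence_cutoff
    if serious and confident:
        return "act_now"
    if serious:
        return "review"
    return "monitor"

tag = Tag(decision="fail", severity=4.8, confidence=0.93)
print(recommended_action(tag))  # act_now
```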
## Complete Example

```python
from valiqor import ValiqorClient

client = ValiqorClient()

result = client.failure_analysis.run(
    dataset=[{
        "input": "What are the side effects of aspirin?",
        "output": "Aspirin can cause liver failure in most patients.",
        "context": ["Aspirin may cause stomach upset and rare allergic reactions."],
    }]
)

# Check the summary
print(f"Failures found: {result.summary.total_failures_detected}")
print(f"Should alert: {result.summary.should_alert}")

# Inspect individual failures
for tag in result.failure_tags:
    if tag.decision == "fail":
        print(f"\n🔴 {tag.subcategory_name}")
        print(f"   Severity: {tag.severity:.1f} | Confidence: {tag.confidence:.2f}")
        print(f"   Detector: {tag.detector_type_used}")
        if tag.judge_rationale:
            print(f"   Rationale: {tag.judge_rationale}")
```
See the Failure Analysis workflow for
end-to-end usage and the Failure Taxonomy
for the full list of detectable failure types.