On the previous page, you ran Failure Analysis on a sample dataset and saw it detect a hallucination. Now let's walk through the result in detail and fix the issue.
## The result hierarchy
Every Failure Analysis result follows this structure:
```
FARunResult
├── summary           → FailureSummary (aggregate stats)
├── failure_tags[]    → FailureTag[] (per-item classifications)
│   ├── bucket_name         → "Hallucination"
│   ├── subcategory_name    → "Entity Fabrication"
│   ├── decision            → "fail"
│   ├── severity            → 4
│   ├── confidence          → 0.95
│   ├── judge_rationale     → "The output states Berlin..."
│   └── scoring_breakdown   → { ... }
├── eval_metrics      → Metric scores (if eval was run)
└── inputs            → Original data you submitted
```
## The FailureSummary
The `result.summary` object gives you the big picture at a glance:

```python
summary = result.summary

print(f"Total items analyzed: {summary.total_items}")
print(f"Items with failures: {summary.items_with_failures}")
print(f"Items all passed: {summary.items_all_passed}")
print(f"Total failures: {summary.total_failures_detected}")
print(f"Total passes: {summary.total_passes}")
print(f"Overall severity: {summary.overall_severity}/5")
print(f"Overall confidence: {summary.overall_confidence}")
print(f"Primary failure: {summary.primary_failure_name}")
print(f"Buckets affected: {summary.buckets_affected}")
```
### Key decision fields
| Field | What it means |
|---|---|
| `should_alert` | `True` if severity is high enough to warrant immediate attention |
| `should_gate_ci` | `True` if this failure should block a CI/CD pipeline |
| `needs_human_review` | `True` if the LLM judge is uncertain and a human should review |
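Taken together, these three booleans can drive a simple triage policy. A minimal sketch, using a stand-in dataclass for the summary (in practice you would read these fields off `result.summary`):

```python
from dataclasses import dataclass

@dataclass
class Summary:
    # Stand-in for the FailureSummary decision fields
    should_alert: bool
    should_gate_ci: bool
    needs_human_review: bool

def triage(summary: Summary) -> str:
    # Escalate in order of urgency: CI gate, then alert, then review
    if summary.should_gate_ci:
        return "block-deploy"
    if summary.should_alert:
        return "page-oncall"
    if summary.needs_human_review:
        return "queue-for-review"
    return "ok"

print(triage(Summary(should_alert=False, should_gate_ci=False, needs_human_review=True)))
# queue-for-review
```

Checking `should_gate_ci` first means a CI-blocking failure always wins, even when the judge also flagged the item for review.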
## Reading a FailureTag
Each `FailureTag` represents one classification decision for one item:

```python
for tag in result.failure_tags:
    print(f"Bucket: {tag.bucket_name}")
    print(f"Subcategory: {tag.subcategory_name}")
    print(f"Decision: {tag.decision}")       # "pass", "fail", or "unsure"
    print(f"Severity: {tag.severity}/5")     # 0 = no issue, 5 = critical
    print(f"Confidence: {tag.confidence}")   # 0.0 to 1.0
    print(f"Detector: {tag.detector_type_used}")
    print(f"Item index: {tag.item_index}")   # Which dataset item
    print()

    if tag.judge_rationale:
        print(f"Rationale: {tag.judge_rationale}")
    print(f"Breakdown: {tag.scoring_breakdown}")
    print(f"Evidence: {tag.evidence_items}")
```
## Understanding severity
| Severity | Meaning | Example |
|---|---|---|
| 0 | No issue | Correct answer, well-grounded |
| 1 | Minor | Slight wording imprecision |
| 2 | Low | Missing context but not harmful |
| 3 | Medium | Partial hallucination, some facts wrong |
| 4 | High | Major factual error, contradicts source |
| 5 | Critical | Dangerous misinformation, safety risk |
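In practice you will often act only on the top of this scale. A small sketch that keeps High (4) and Critical (5) findings, worst first; the tags here are plain dicts standing in for `FailureTag` objects:

```python
# Stand-in tags; in practice these come from result.failure_tags
tags = [
    {"subcategory_name": "Entity Fabrication", "severity": 4},
    {"subcategory_name": "Wording Imprecision", "severity": 1},
    {"subcategory_name": "Unsafe Advice", "severity": 5},
]

# Keep only severity 4+ findings, sorted worst first
critical = sorted(
    (t for t in tags if t["severity"] >= 4),
    key=lambda t: t["severity"],
    reverse=True,
)
for t in critical:
    print(f"{t['subcategory_name']}: severity {t['severity']}/5")
# Unsafe Advice: severity 5/5
# Entity Fabrication: severity 4/5
```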
## Understanding confidence
- 0.9–1.0: The judge is highly certain about this classification.
- 0.6–0.9: Likely correct, but some ambiguity exists.
- Below 0.6: The judge is uncertain — consider human review. These often come with `decision: "unsure"`.
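These thresholds translate directly into a routing rule. A minimal sketch, treating tags as dicts with the `decision` and `confidence` fields described above (the 0.6 cutoff is this page's suggestion, not a hard API rule):

```python
def route(tag: dict) -> str:
    # Low confidence or an explicit "unsure" decision goes to a human
    if tag["decision"] == "unsure" or tag["confidence"] < 0.6:
        return "human-review"
    return "auto"

print(route({"decision": "fail", "confidence": 0.95}))    # auto
print(route({"decision": "unsure", "confidence": 0.55}))  # human-review
```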
## Root cause analysis
The `judge_rationale` field explains exactly what went wrong and why:

```
Rationale: The output states Berlin is the capital of France,
which directly contradicts the provided context that identifies
Paris as the capital. This is a factual fabrication where the
model generated an entity (Berlin) that is not supported by
any of the provided context passages.
```
The `scoring_breakdown` field provides structured evidence:

```python
breakdown = tag.scoring_breakdown

# Typically includes:
# - Which context passages were relevant
# - What the correct answer should be
# - Why the output diverges from the context
```
## Fix the prompt and re-run
Now that you know the root cause, fix the issue. In this quickstart example, the “fix” is simply providing the correct output — but in real applications, you’d adjust your prompt, retrieval pipeline, or guardrails.
### Before (failing)

```python
dataset_failing = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Berlin.",  # ← Wrong
        "context": ["The capital of France is Paris."],
    }
]
```
### After (fixed)

```python
dataset_fixed = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",  # ← Correct
        "context": ["The capital of France is Paris."],
    }
]
```
### Re-run to confirm

```python
result = client.failure_analysis.run(dataset=dataset_fixed)

print(f"Failures: {result.summary.total_failures_detected}")
print(f"Passes: {result.summary.total_passes}")
# Expected:
# Failures: 0
# Passes: 1
```
When failures drop to zero, your fix is confirmed. In a real workflow, you’d commit the prompt change and add this as a regression test.
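Such a regression test might look like the sketch below. The `client` here is a minimal stub so the example runs standalone; in your suite you would construct the real SDK client instead, and the assertion stays the same:

```python
# Minimal stub standing in for the real client so this sketch runs
# on its own. Replace it with your configured SDK client.
class _StubSummary:
    total_failures_detected = 0

class _StubResult:
    summary = _StubSummary()

class _StubFailureAnalysis:
    def run(self, dataset):
        return _StubResult()

class _StubClient:
    failure_analysis = _StubFailureAnalysis()

client = _StubClient()

def test_capital_of_france_is_grounded():
    dataset = [{
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "context": ["The capital of France is Paris."],
    }]
    result = client.failure_analysis.run(dataset=dataset)
    assert result.summary.total_failures_detected == 0

test_capital_of_france_is_grounded()
print("regression test passed")
```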
## Filtering failures programmatically
For larger datasets, filter and sort failures by severity:
```python
from collections import defaultdict

# Get only failures (skip passes and unsure)
failures = [t for t in result.failure_tags if t.decision == "fail"]

# Sort by severity (worst first)
failures.sort(key=lambda t: t.severity, reverse=True)

# Group by bucket
by_bucket = defaultdict(list)
for tag in failures:
    by_bucket[tag.bucket_name].append(tag)

for bucket, tags in by_bucket.items():
    print(f"\n{bucket} ({len(tags)} failures):")
    for tag in tags:
        print(f"  - {tag.subcategory_name} (severity: {tag.severity})")
```
## CI/CD gating
Use `should_gate_ci` to block deployments when critical failures are found:

```python
result = client.failure_analysis.run(dataset=test_cases)

if result.summary.should_gate_ci:
    print("❌ Critical failures detected — blocking deployment")
    exit(1)
else:
    print("✅ All checks passed")
```