On the previous page, you ran Failure Analysis on a sample dataset and saw it detect a hallucination. Now let's understand the result in detail and fix the issue.

The result hierarchy

Every Failure Analysis result follows this structure:
FARunResult
├── summary          → FailureSummary (aggregate stats)
├── failure_tags[]   → FailureTag[] (per-item classifications)
│   ├── bucket_name        → "Hallucination"
│   ├── subcategory_name   → "Entity Fabrication"
│   ├── decision           → "fail"
│   ├── severity           → 4
│   ├── confidence         → 0.95
│   ├── judge_rationale    → "The output states Berlin..."
│   └── scoring_breakdown  → { ... }
├── eval_metrics     → Metric scores (if eval was run)
└── inputs           → Original data you submitted

The FailureSummary

The result.summary gives you the big picture at a glance:
summary = result.summary

print(f"Total items analyzed:  {summary.total_items}")
print(f"Items with failures:   {summary.items_with_failures}")
print(f"Items all passed:      {summary.items_all_passed}")
print(f"Total failures:        {summary.total_failures_detected}")
print(f"Total passes:          {summary.total_passes}")
print(f"Overall severity:      {summary.overall_severity}/5")
print(f"Overall confidence:    {summary.overall_confidence}")
print(f"Primary failure:       {summary.primary_failure_name}")
print(f"Buckets affected:      {summary.buckets_affected}")

Key decision fields

| Field | What it means |
|---|---|
| should_alert | True if severity is high enough to warrant immediate attention |
| should_gate_ci | True if this failure should block a CI/CD pipeline |
| needs_human_review | True if the LLM judge is uncertain and a human should review |
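These flags are designed to drive automation. Below is a minimal sketch of routing logic built on them; note that `SummaryFlags` is a plain stand-in dataclass for illustration, not the SDK's actual `FailureSummary` class, and the action names are assumptions.

```python
from dataclasses import dataclass

# Stand-in for FailureSummary's decision fields. The field names match
# the table above, but this dataclass is illustrative, not the SDK's.
@dataclass
class SummaryFlags:
    should_alert: bool
    should_gate_ci: bool
    needs_human_review: bool

def route(summary: SummaryFlags) -> list[str]:
    """Map the decision flags to follow-up actions (names are examples)."""
    actions = []
    if summary.should_alert:
        actions.append("page-oncall")
    if summary.should_gate_ci:
        actions.append("block-pipeline")
    if summary.needs_human_review:
        actions.append("queue-for-review")
    return actions or ["no-op"]
```

For example, a result that alerts and needs review but does not gate CI would route to both the on-call page and the review queue.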

Reading a FailureTag

Each FailureTag represents one classification decision for one item:
for tag in result.failure_tags:
    print(f"Bucket:       {tag.bucket_name}")
    print(f"Subcategory:  {tag.subcategory_name}")
    print(f"Decision:     {tag.decision}")      # "pass", "fail", or "unsure"
    print(f"Severity:     {tag.severity}/5")     # 0 = no issue, 5 = critical
    print(f"Confidence:   {tag.confidence}")     # 0.0 to 1.0
    print(f"Detector:     {tag.detector_type_used}")
    print(f"Item index:   {tag.item_index}")     # Which dataset item
    print()
    if tag.judge_rationale:
        print(f"Rationale:    {tag.judge_rationale}")
    print(f"Breakdown:    {tag.scoring_breakdown}")
    print(f"Evidence:     {tag.evidence_items}")

Understanding severity

| Severity | Meaning | Example |
|---|---|---|
| 0 | No issue | Correct answer, well-grounded |
| 1 | Minor | Slight wording imprecision |
| 2 | Low | Missing context but not harmful |
| 3 | Medium | Partial hallucination, some facts wrong |
| 4 | High | Major factual error, contradicts source |
| 5 | Critical | Dangerous misinformation, safety risk |
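If you want to work with these levels in code, a small helper like the one below mirrors the table. This is an illustrative sketch, not part of the SDK; the alert threshold of 4 is an assumption you should tune to your own risk tolerance.

```python
def severity_label(severity: int) -> str:
    """Return the human-readable name for a 0-5 severity level."""
    labels = ["No issue", "Minor", "Low", "Medium", "High", "Critical"]
    if not 0 <= severity <= 5:
        raise ValueError(f"severity must be 0-5, got {severity}")
    return labels[severity]

def warrants_alert(severity: int, threshold: int = 4) -> bool:
    # Treat High (4) and Critical (5) as alert-worthy. The threshold is
    # an assumption for this sketch, not a built-in SDK cutoff.
    return severity >= threshold
```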

Understanding confidence

  • 0.9–1.0: The judge is highly certain about this classification.
  • 0.6–0.9: Likely correct but some ambiguity exists.
  • Below 0.6: The judge is uncertain — consider human review. These often come with decision: "unsure".
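A common pattern is to split tags into an auto-handled set and a human-review queue based on these bands. The sketch below uses plain dicts as stand-ins for FailureTag objects (the field names match, but the data is illustrative), and the 0.6 threshold is an assumption taken from the guidance above.

```python
# Plain dicts standing in for FailureTag objects; illustrative data.
tags = [
    {"decision": "fail", "confidence": 0.95},
    {"decision": "unsure", "confidence": 0.45},
    {"decision": "fail", "confidence": 0.55},
]

REVIEW_THRESHOLD = 0.6  # assumption: route anything below this to a human

needs_review = [
    t for t in tags
    if t["confidence"] < REVIEW_THRESHOLD or t["decision"] == "unsure"
]
auto_handled = [t for t in tags if t not in needs_review]

print(f"{len(needs_review)} tag(s) queued for human review")
```

Here the 0.45 "unsure" tag and the 0.55 "fail" tag both land in the review queue, while the high-confidence 0.95 tag is handled automatically.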

Root cause analysis

The judge_rationale field explains exactly what went wrong and why:
Rationale: The output states Berlin is the capital of France,
which directly contradicts the provided context that identifies
Paris as the capital. This is a factual fabrication where the
model generated an entity (Berlin) that is not supported by
any of the provided context passages.
The scoring_breakdown provides structured evidence:
breakdown = tag.scoring_breakdown
# Typically includes:
# - Which context passages were relevant
# - What the correct answer should be
# - Why the output diverges from the context

Fix the prompt and re-run

Now that you know the root cause, fix the issue. In this quickstart example, the “fix” is simply providing the correct output — but in real applications, you’d adjust your prompt, retrieval pipeline, or guardrails.

Before (failing)

dataset_failing = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Berlin.",  # ← Wrong
        "context": ["The capital of France is Paris."],
    }
]

After (fixed)

dataset_fixed = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",  # ← Correct
        "context": ["The capital of France is Paris."],
    }
]

Re-run to confirm

result = client.failure_analysis.run(dataset=dataset_fixed)

print(f"Failures: {result.summary.total_failures_detected}")
print(f"Passes:   {result.summary.total_passes}")
# Expected:
#   Failures: 0
#   Passes: 1
When failures drop to zero, your fix is confirmed. In a real workflow, you’d commit the prompt change and add this as a regression test.

Filtering failures programmatically

For larger datasets, filter and sort failures by severity:
# Get only failures (skip passes and unsure)
failures = [t for t in result.failure_tags if t.decision == "fail"]

# Sort by severity (worst first)
failures.sort(key=lambda t: t.severity, reverse=True)

# Group by bucket
from collections import defaultdict
by_bucket = defaultdict(list)
for tag in failures:
    by_bucket[tag.bucket_name].append(tag)

for bucket, tags in by_bucket.items():
    print(f"\n{bucket} ({len(tags)} failures):")
    for tag in tags:
        print(f"  - {tag.subcategory_name} (severity: {tag.severity})")

CI/CD gating

Use should_gate_ci to block deployments when critical failures are found:
import sys

result = client.failure_analysis.run(dataset=test_cases)

if result.summary.should_gate_ci:
    print("❌ Critical failures detected — blocking deployment")
    sys.exit(1)  # non-zero exit code fails the CI job
else:
    print("✅ All checks passed")