On the previous page, you ran Failure Analysis on a sample dataset and saw it detect a hallucination. Now let’s understand the result in detail and fix the issue.
The `judge_rationale` field explains exactly what went wrong and why:
```
Rationale: The output states Berlin is the capital of France, which
directly contradicts the provided context that identifies Paris as the
capital. This is a factual fabrication where the model generated an
entity (Berlin) that is not supported by any of the provided context
passages.
```
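To pull these rationales out programmatically, read them off the failing tags. A minimal sketch, assuming the `result` object from the previous page’s run:

```python
# Print the judge's rationale for every failing tag.
for tag in result.failure_tags:
    if tag.decision == "fail":
        print(f"{tag.subcategory_name}: {tag.judge_rationale}")
```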
The `scoring_breakdown` field provides structured evidence:
```python
breakdown = tag.scoring_breakdown
# Typically includes:
# - Which context passages were relevant
# - What the correct answer should be
# - Why the output diverges from the context
```
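Because the exact shape of the breakdown can vary by judge, a generic dump is the safest way to inspect it. A minimal sketch:

```python
from pprint import pprint

# Dump the structured evidence for each failing tag; pprint handles
# whatever nesting the breakdown happens to have.
for tag in result.failure_tags:
    if tag.decision == "fail":
        pprint(tag.scoring_breakdown)
```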
Now that you know the root cause, fix the issue. In this quickstart example, the “fix” is simply providing the correct output — but in real applications, you’d adjust your prompt, retrieval pipeline, or guardrails.
```python
dataset_failing = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Berlin.",  # ← Wrong
        "context": ["The capital of France is Paris."],
    }
]
```
```python
dataset_fixed = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",  # ← Correct
        "context": ["The capital of France is Paris."],
    }
]
```
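To confirm the fix, re-run the analysis on the corrected dataset and check that no failures remain. A minimal sketch; `run_failure_analysis` here is a hypothetical stand-in for whatever entry point you used on the previous page, since the exact call isn’t shown in this section:

```python
# Hypothetical entry point: substitute the actual function you used
# to run Failure Analysis on the previous page.
result = run_failure_analysis(dataset_fixed)

# With the corrected output, no failing tags should remain.
failures = [t for t in result.failure_tags if t.decision == "fail"]
assert not failures, f"Still failing: {[t.subcategory_name for t in failures]}"
print("No failures detected; fix verified.")
```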
For larger datasets, filter and sort failures by severity:
```python
from collections import defaultdict

# Get only failures (skip passes and unsure)
failures = [t for t in result.failure_tags if t.decision == "fail"]

# Sort by severity (worst first)
failures.sort(key=lambda t: t.severity, reverse=True)

# Group by bucket
by_bucket = defaultdict(list)
for tag in failures:
    by_bucket[tag.bucket_name].append(tag)

for bucket, tags in by_bucket.items():
    print(f"\n{bucket} ({len(tags)} failures):")
    for tag in tags:
        print(f"  - {tag.subcategory_name} (severity: {tag.severity})")
```
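From here, you might want to hand the triage off for review. A minimal sketch that reuses the `failures` list and tag attributes from the snippet above and writes them to a CSV file (the filename is illustrative):

```python
import csv

# Persist one row per failure so the triage can be shared outside the session.
with open("failure_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["bucket", "subcategory", "severity", "rationale"])
    for tag in failures:
        writer.writerow(
            [tag.bucket_name, tag.subcategory_name, tag.severity, tag.judge_rationale]
        )
```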