Valiqor’s Failure Analysis doesn’t just detect failures — it traces each failure back to its root cause, assigns a severity score, computes a confidence level, and suggests remediation.

How It Works

Every dataset item or trace is analyzed against Valiqor’s failure taxonomy. For each applicable subcategory, a classification decision is produced along with a severity score and confidence level.
Input (prompt, response, context)
        ↓
Evidence Collection
        ↓
Failure Classification
        ↓
FATag per subcategory
(decision + severity + confidence + evidence)

Failure Decisions

Each subcategory check produces one of four decisions:
| Decision | Meaning |
|---|---|
| `fail` | Failure detected — evidence supports this classification |
| `pass` | No failure — the system behaved correctly |
| `unsure` | Ambiguous — insufficient evidence to decide |
| `not_applicable` | This subcategory doesn't apply to this app type |
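Downstream code typically branches on these decisions. A minimal sketch of tallying them across a run (the literal list below stands in for the `tag.decision` values you would read off `result.failure_tags`):

```python
from collections import Counter

# Stand-in for [tag.decision for tag in result.failure_tags]
decisions = ["fail", "pass", "pass", "unsure", "not_applicable", "fail"]

counts = Counter(decisions)
print(counts["fail"])    # number of detected failures
print(counts["unsure"])  # items that may need richer input data
```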

Severity

Severity measures how bad a failure is, on a 0–5 scale. Higher severity means greater potential impact on the user, the business, or safety. Valiqor computes severity automatically based on the failure type, its context, and how frequently it recurs.
| Severity | Interpretation |
|---|---|
| 0–1 | Minor — cosmetic or low-risk issue |
| 2–3 | Moderate — user-visible failure that should be investigated |
| 4–5 | Critical — potential financial, legal, or safety impact |

Failures associated with high-risk security categories (e.g. self-harm, PII exposure, hate speech) are automatically escalated to critical severity.
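The escalation behavior can be pictured as follows. This is an illustrative sketch, not Valiqor's internal logic, and the category identifiers are hypothetical:

```python
# Hypothetical identifiers for illustration only
HIGH_RISK_CATEGORIES = {"self_harm", "pii_exposure", "hate_speech"}

def escalate(category: str, severity: float) -> float:
    # High-risk security categories are forced to critical severity (5.0);
    # everything else keeps its computed score.
    return 5.0 if category in HIGH_RISK_CATEGORIES else severity

print(escalate("pii_exposure", 2.0))  # 5.0
print(escalate("formatting", 2.0))    # 2.0
```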

Frequency Amplification

When the same failure type recurs across multiple items in a dataset, severity is amplified. Isolated issues score lower than systemic patterns.
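One way to picture amplification is a per-recurrence bump with a cap at the top of the scale. The formula below is an illustrative stand-in, not Valiqor's actual scoring:

```python
from collections import Counter

def amplified_severity(base: float, occurrences: int,
                       step: float = 0.5, cap: float = 5.0) -> float:
    """Illustrative rule: each recurrence beyond the first adds `step`,
    capped at the top of the 0-5 scale."""
    return min(cap, base + step * (occurrences - 1))

# A subcategory failing on 4 items outscores the same failure seen once.
subcategories = ["Unsupported factual claim"] * 4 + ["Tone mismatch"]
counts = Counter(subcategories)
print(amplified_severity(2.0, counts["Unsupported factual claim"]))  # 3.5
print(amplified_severity(2.0, counts["Tone mismatch"]))              # 2.0
```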

Confidence

Confidence measures how certain Valiqor is about a classification, on a 0.0–1.0 scale. Confidence increases when multiple independent signals agree:
  • Rule-based detectors confirm the failure
  • LLM judge classifies the failure
  • Evaluation metrics corroborate the finding
  • Security classifiers flag related content
When signals disagree, confidence is reduced and the result may be flagged for human review.

| Confidence | Interpretation |
|---|---|
| 0.8–1.0 | High — multiple signals agree |
| 0.5–0.7 | Moderate — some evidence, but not conclusive |
| < 0.5 | Low — limited evidence, consider manual review |
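These bands are easy to mirror in your own triage code. A minimal sketch, assuming the cutoffs from the table above:

```python
def confidence_band(confidence: float) -> str:
    # Thresholds mirror the documented bands.
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "moderate"
    return "low"

print(confidence_band(0.92))  # high
print(confidence_band(0.60))  # moderate
print(confidence_band(0.30))  # low
```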

Reading FATag Results

Every failure is returned as an FATag with these key fields:
tag = result.failure_tags[0]

print(tag.decision)            # "fail", "pass", "unsure", "not_applicable"
print(tag.bucket_name)         # e.g. "Hallucination & Grounding"
print(tag.subcategory_name)    # e.g. "Unsupported factual claim"
print(tag.severity)            # 0.0 – 5.0
print(tag.confidence)          # 0.0 – 1.0
print(tag.detector_type_used)  # "deterministic", "llm_judge", or "hybrid"

Judge Rationale

For LLM-judge-detected failures, the judge_rationale field contains the judge’s explanation:
if tag.judge_rationale:
    print(f"Why: {tag.judge_rationale}")

Evidence

Each tag includes structured evidence linking back to the original data:
for evidence in tag.evidence_items:
    print(f"  [{evidence.evidence_type}] {evidence.description}")
    if evidence.content_snippet:
        print(f"    → {evidence.content_snippet}")

| Field | Description |
|---|---|
| `evidence_type` | Kind of evidence (e.g. `"span"`, `"claim"`, `"metric"`) |
| `description` | Human-readable explanation |
| `source` | Where it came from (`"trace"`, `"eval"`, `"security"`) |
| `content_snippet` | Relevant text excerpt |

Eval Metric Values

The eval_metric_values dict shows which evaluation metrics were used as supporting evidence:
for metric, score in tag.eval_metric_values.items():
    print(f"  {metric}: {score:.3f}")
# e.g. hallucination: 0.85, context_precision: 0.42

Automation Flags

The FARunResult summary includes built-in flags for automation:
import sys

result = client.failure_analysis.run(dataset=data)

if result.summary.should_alert:
    # Trigger Slack / PagerDuty notification
    send_alert(result)

if result.summary.should_gate_ci:
    # Block the CI/CD pipeline
    sys.exit(1)

if result.summary.needs_human_review:
    # Queue for manual inspection
    create_review_ticket(result)

| Flag | When It Triggers |
|---|---|
| `should_alert` | Critical failures with high confidence |
| `should_gate_ci` | Failures severe enough to block deployment |
| `needs_human_review` | High-severity failures where evidence is ambiguous |

Summary Statistics

summary = result.summary
print(f"Failures: {summary.total_failures_detected}")
print(f"Passes: {summary.total_passes}")
print(f"Items with failures: {summary.items_with_failures}")
print(f"Overall severity: {summary.overall_severity:.1f}")
print(f"Buckets affected: {summary.buckets_affected}")

Interpreting Results

High Severity + High Confidence → Act Now

Reliable, serious failures. Set up automated alerts and CI gates.

High Severity + Low Confidence → Review

The system suspects a serious failure but evidence is ambiguous. Queue for human review — the needs_human_review flag catches these automatically.

Low Severity + High Confidence → Monitor

Real but minor issues. Track trends with get_trends() — if frequency increases, severity will be amplified.

Unsure Decision → Investigate

The detector couldn’t reach a conclusion. This typically means the input data is insufficient for classification (e.g. missing context for RAG checks). Provide richer data for better results.
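The four quadrants above can be collapsed into a small triage helper. This is a sketch with illustrative thresholds (severity ≥ 4.0 as "high", confidence ≥ 0.8 as "high"), not Valiqor's built-in flag logic:

```python
def triage(severity: float, confidence: float) -> str:
    """Map a (severity, confidence) pair to a recommended action.
    Thresholds are illustrative, not Valiqor's internal cutoffs."""
    high_sev = severity >= 4.0
    high_conf = confidence >= 0.8
    if high_sev and high_conf:
        return "act_now"        # alert / gate CI
    if high_sev:
        return "human_review"   # serious but ambiguous
    if high_conf:
        return "monitor"        # real but minor; watch trends
    return "investigate"        # weak signal; check input data

print(triage(4.5, 0.9))  # act_now
print(triage(1.0, 0.9))  # monitor
```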

Complete Example

from valiqor import ValiqorClient

client = ValiqorClient()
result = client.failure_analysis.run(
    dataset=[{
        "input": "What are the side effects of aspirin?",
        "output": "Aspirin can cause liver failure in most patients.",
        "context": ["Aspirin may cause stomach upset and rare allergic reactions."]
    }]
)

# Check summary
print(f"Failures found: {result.summary.total_failures_detected}")
print(f"Should alert: {result.summary.should_alert}")

# Inspect individual failures
for tag in result.failure_tags:
    if tag.decision == "fail":
        print(f"\n🔴 {tag.subcategory_name}")
        print(f"   Severity: {tag.severity:.1f} | Confidence: {tag.confidence:.2f}")
        print(f"   Detector: {tag.detector_type_used}")
        if tag.judge_rationale:
            print(f"   Rationale: {tag.judge_rationale}")
See the Failure Analysis workflow for end-to-end usage and the Failure Taxonomy for the full list of detectable failure types.