Valiqor’s Failure Analysis doesn’t just detect failures — it traces each failure back to its root cause, assigns a severity score, computes a confidence level, and suggests remediation.

How It Works

Every dataset item or trace is analyzed against Valiqor’s failure taxonomy. For each applicable subcategory, a classification decision is produced along with a severity score and confidence level.
Input (prompt, response, context)
        ↓
Evidence Collection
        ↓
Failure Classification
        ↓
FATag per subcategory
(decision + severity + confidence + evidence)

Failure Decisions

Each subcategory check produces one of four decisions:
| Decision | Meaning |
|---|---|
| `fail` | Failure detected — evidence supports this classification |
| `pass` | No failure — the system behaved correctly |
| `unsure` | Ambiguous — insufficient evidence to decide |
| `not_applicable` | This subcategory doesn't apply to this app type |
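Downstream code typically branches on these decisions. A minimal sketch of tallying them across a run (the literal list below stands in for the `tag.decision` values you would read off `result.failure_tags`):

```python
from collections import Counter

# Stand-in for [tag.decision for tag in result.failure_tags]
decisions = ["fail", "pass", "pass", "unsure", "not_applicable", "fail"]

counts = Counter(decisions)
print(counts["fail"])    # number of detected failures
print(counts["unsure"])  # items that may need richer input data
```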

Severity

Severity measures how bad a failure is, on a 0–5 scale. Higher severity means greater potential impact on the user, the business, or safety. Valiqor computes severity automatically based on the failure type, its context, and how frequently it recurs.
| Severity | Interpretation |
|---|---|
| 0–1 | Minor — cosmetic or low-risk issue |
| 2–3 | Moderate — user-visible failure that should be investigated |
| 4–5 | Critical — potential financial, legal, or safety impact |

Failures associated with high-risk security categories (e.g. self-harm, PII exposure, hate speech) are automatically escalated to critical severity.
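The escalation behavior can be pictured as follows. This is an illustrative sketch, not Valiqor's internal logic, and the category identifiers are hypothetical:

```python
# Hypothetical identifiers for illustration only
HIGH_RISK_CATEGORIES = {"self_harm", "pii_exposure", "hate_speech"}

def escalate(category: str, severity: float) -> float:
    # High-risk security categories are forced to critical severity (5.0);
    # everything else keeps its computed score.
    return 5.0 if category in HIGH_RISK_CATEGORIES else severity

print(escalate("pii_exposure", 2.0))  # 5.0
print(escalate("formatting", 2.0))    # 2.0
```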

Frequency Amplification

When the same failure type recurs across multiple items in a dataset, severity is amplified. Isolated issues score lower than systemic patterns.
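One way to picture amplification is a per-recurrence bump with a cap at the top of the scale. The formula below is an illustrative stand-in, not Valiqor's actual scoring:

```python
from collections import Counter

def amplified_severity(base: float, occurrences: int,
                       step: float = 0.5, cap: float = 5.0) -> float:
    """Illustrative rule: each recurrence beyond the first adds `step`,
    capped at the top of the 0-5 scale."""
    return min(cap, base + step * (occurrences - 1))

# A subcategory failing on 4 items outscores the same failure seen once.
subcategories = ["Unsupported factual claim"] * 4 + ["Tone mismatch"]
counts = Counter(subcategories)
print(amplified_severity(2.0, counts["Unsupported factual claim"]))  # 3.5
print(amplified_severity(2.0, counts["Tone mismatch"]))              # 2.0
```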

Confidence

Confidence measures how certain Valiqor is about a classification, on a 0.0–1.0 scale. Confidence increases when multiple independent signals agree:
  • Rule-based detectors confirm the failure
  • LLM judge classifies the failure
  • Evaluation metrics corroborate the finding
  • Security classifiers flag related content
When signals disagree, confidence is reduced and the result may be flagged for human review.

| Confidence | Interpretation |
|---|---|
| 0.8–1.0 | High — multiple signals agree |
| 0.5–0.7 | Moderate — some evidence, but not conclusive |
| < 0.5 | Low — limited evidence, consider manual review |
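These bands are easy to mirror in your own triage code. A minimal sketch, assuming the cutoffs from the table above:

```python
def confidence_band(confidence: float) -> str:
    # Thresholds mirror the documented bands.
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "moderate"
    return "low"

print(confidence_band(0.92))  # high
print(confidence_band(0.60))  # moderate
print(confidence_band(0.30))  # low
```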

Reading FATag Results

Every failure is returned as an FATag with these key fields:
tag = result.failure_tags[0]

print(tag.decision)            # "fail", "pass", "unsure", "not_applicable"
print(tag.bucket_name)         # e.g. "Hallucination & Grounding"
print(tag.subcategory_name)    # e.g. "Unsupported factual claim"
print(tag.severity)            # 0.0 – 5.0
print(tag.confidence)          # 0.0 – 1.0
print(tag.detector_type_used)  # "deterministic", "llm_judge", or "hybrid"

Judge Rationale

For LLM-judge-detected failures, the judge_rationale field contains the judge’s explanation:
if tag.judge_rationale:
    print(f"Why: {tag.judge_rationale}")

Evidence

Each tag includes structured evidence linking back to the original data:
for evidence in tag.evidence_items:
    print(f"  [{evidence.evidence_type}] {evidence.description}")
    if evidence.content_snippet:
        print(f"    → {evidence.content_snippet}")

| Field | Description |
|---|---|
| `evidence_type` | Kind of evidence (e.g. `"span"`, `"claim"`, `"metric"`) |
| `description` | Human-readable explanation |
| `source` | Where it came from (`"trace"`, `"eval"`, `"security"`) |
| `content_snippet` | Relevant text excerpt |

Eval Metric Values

The eval_metric_values dict shows which evaluation metrics were used as supporting evidence:
for metric, score in tag.eval_metric_values.items():
    print(f"  {metric}: {score:.3f}")
# e.g. hallucination: 0.85, context_precision: 0.42

Automation Flags

The FARunResult summary includes built-in flags for automation:
import sys

result = client.failure_analysis.run(dataset=data)

if result.summary.should_alert:
    # Trigger Slack / PagerDuty notification
    send_alert(result)

if result.summary.should_gate_ci:
    # Block the CI/CD pipeline
    sys.exit(1)

if result.summary.needs_human_review:
    # Queue for manual inspection
    create_review_ticket(result)

| Flag | When It Triggers |
|---|---|
| `should_alert` | Critical failures with high confidence |
| `should_gate_ci` | Failures severe enough to block deployment |
| `needs_human_review` | High-severity failures where evidence is ambiguous |

Summary Statistics

summary = result.summary
print(f"Failures: {summary.total_failures_detected}")
print(f"Passes: {summary.total_passes}")
print(f"Items with failures: {summary.items_with_failures}")
print(f"Overall severity: {summary.overall_severity:.1f}")
print(f"Buckets affected: {summary.buckets_affected}")

Interpreting Results

High Severity + High Confidence → Act Now

Reliable, serious failures. Set up automated alerts and CI gates.

High Severity + Low Confidence → Review

The system suspects a serious failure but evidence is ambiguous. Queue for human review — the needs_human_review flag catches these automatically.

Low Severity + High Confidence → Monitor

Real but minor issues. Track trends with get_trends() — if frequency increases, severity will be amplified.

Unsure Decision → Investigate

The detector couldn’t reach a conclusion. This typically means the input data is insufficient for classification (e.g. missing context for RAG checks). Provide richer data for better results.
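The four quadrants above can be collapsed into a small triage helper. This is a sketch with illustrative thresholds (severity ≥ 4.0 as "high", confidence ≥ 0.8 as "high"), not Valiqor's built-in flag logic:

```python
def triage(severity: float, confidence: float) -> str:
    """Map a (severity, confidence) pair to a recommended action.
    Thresholds are illustrative, not Valiqor's internal cutoffs."""
    high_sev = severity >= 4.0
    high_conf = confidence >= 0.8
    if high_sev and high_conf:
        return "act_now"        # alert / gate CI
    if high_sev:
        return "human_review"   # serious but ambiguous
    if high_conf:
        return "monitor"        # real but minor; watch trends
    return "investigate"        # weak signal; check input data

print(triage(4.5, 0.9))  # act_now
print(triage(1.0, 0.9))  # monitor
```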

Complete Example

from valiqor import ValiqorClient

client = ValiqorClient()
result = client.failure_analysis.run(
    dataset=[{
        "input": "What are the side effects of aspirin?",
        "output": "Aspirin can cause liver failure in most patients.",
        "context": ["Aspirin may cause stomach upset and rare allergic reactions."]
    }]
)

# Check summary
print(f"Failures found: {result.summary.total_failures_detected}")
print(f"Should alert: {result.summary.should_alert}")

# Inspect individual failures
for tag in result.failure_tags:
    if tag.decision == "fail":
        print(f"\n🔴 {tag.subcategory_name}")
        print(f"   Severity: {tag.severity:.1f} | Confidence: {tag.confidence:.2f}")
        print(f"   Detector: {tag.detector_type_used}")
        if tag.judge_rationale:
            print(f"   Rationale: {tag.judge_rationale}")
See the Failure Analysis workflow for end-to-end usage and the Failure Taxonomy for the full list of detectable failure types.