Valiqor’s Failure Analysis engine classifies every failure into a two-level taxonomy: top-level buckets (L1) group related failure types, and subcategories (L2) identify the specific failure mode. This taxonomy is the foundation that powers severity scoring, root cause detection, and actionable remediation suggestions.

## Documentation Index

Fetch the complete documentation index at https://docs.valiqor.com/llms.txt and use it to discover all available pages before exploring further.
## Taxonomy Hierarchy
### Buckets (L1)

There are 4 failure buckets in the current taxonomy (v0):

| Bucket ID | Name | Description |
|---|---|---|
| `instruction_compliance` | Instruction & Task Compliance | Failures related to following instructions, constraints, and task requirements |
| `hallucination_grounding` | Hallucination & Grounding | Failures related to factual accuracy, fabrication, and grounding |
| `retrieval_failures` | Retrieval (RAG) Failures | Failures in retrieval-augmented generation pipelines |
| `tool_failures` | Tool & Function Failures | Failures in tool selection, invocation, and output handling |
### Subcategories (L2)

Each bucket contains specific failure subcategories. There are 15 subcategories in the current taxonomy.

#### Instruction & Task Compliance

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `task_not_followed` | Task not followed | Model fails to follow explicit user or system instructions | LLM Judge | All |
| `partial_task_completion` | Partial task completion | Some required steps or outputs are missing from the final response | LLM Judge | All |
| `wrong_intent_resolution` | Wrong intent resolution | Model solves a different problem than the user intended | LLM Judge | All |
| `output_format_non_compliance` | Output format non-compliance | Response does not follow the required output schema or format | Deterministic | All |
#### Hallucination & Grounding

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `unsupported_factual_claim` | Unsupported factual claim | Model makes factual claims not supported by retrieved context or verified knowledge | LLM Judge | All |
| `fabricated_details` | Fabricated details or entities | Model invents entities, APIs, facts, or details | Hybrid | All |
| `ungrounded_answer` | Ungrounded answer vs context | Answer is not supported by retrieved documents or tool outputs | LLM Judge | RAG, Agent, Agentic RAG |
#### Retrieval (RAG) Failures

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `missing_retrieval` | No retrieval when needed | Model answers without retrieval when external knowledge is required | Hybrid | RAG, Agent, Agentic RAG |
| `low_relevance_retrieval` | Low relevance retrieval | Retrieved documents are irrelevant to the user query | LLM Judge | RAG, Agent, Agentic RAG |
| `insufficient_coverage` | Insufficient retrieval coverage | Retrieved context misses key facts required to answer the query | LLM Judge | RAG, Agent, Agentic RAG |
| `citation_failure` | Citation / attribution failure | Answer does not cite or incorrectly cites retrieved sources | Hybrid | RAG, Agent, Agentic RAG |
#### Tool & Function Failures

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `wrong_tool_selected` | Wrong tool selected | Agent selects an inappropriate tool for the task | LLM Judge | Agent, Tool Agent, Agentic RAG |
| `tool_not_invoked` | Tool not invoked when required | Agent fails to call a tool despite the task requiring it | Hybrid | Agent, Tool Agent, Agentic RAG |
| `invalid_tool_arguments` | Invalid tool arguments | Tool is invoked with malformed or incorrect parameters | Deterministic | Agent, Tool Agent, Agentic RAG |
| `tool_output_misused` | Tool output misused | Agent misinterprets or ignores the tool output in the final response | LLM Judge | Agent, Tool Agent, Agentic RAG |
## Application Types

The `applies_to` field controls which subcategories are active for your application type. Set this via the `feature_kind` parameter in `run()`:
| Type | Value | Description |
|---|---|---|
| All | all | Universal — applies to every app type |
| RAG | rag | Retrieval-augmented generation |
| Agent | agent | Autonomous agent with tool use |
| Agentic RAG | agentic_rag | Agent + retrieval |
| Tool Agent | tool_agent | Agent focused on tool orchestration |
| Multimodal | multimodal | Multi-modal applications |
With `feature_kind="rag"`, only subcategories that apply to RAG apps (plus universal ones) are evaluated. This keeps results relevant and reduces LLM judge calls.
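The `applies_to` rules from the tables above can be sketched as plain data. This is a hypothetical representation, not the engine's internal one; only the bucket and subcategory IDs come from the documentation:

```python
# Hypothetical sketch of the v0 taxonomy: each subcategory maps to the
# set of application types it applies to ("all" means universal).
TAXONOMY_V0: dict[str, dict[str, set[str]]] = {
    "instruction_compliance": {
        "task_not_followed": {"all"},
        "partial_task_completion": {"all"},
        "wrong_intent_resolution": {"all"},
        "output_format_non_compliance": {"all"},
    },
    "hallucination_grounding": {
        "unsupported_factual_claim": {"all"},
        "fabricated_details": {"all"},
        "ungrounded_answer": {"rag", "agent", "agentic_rag"},
    },
    "retrieval_failures": {
        "missing_retrieval": {"rag", "agent", "agentic_rag"},
        "low_relevance_retrieval": {"rag", "agent", "agentic_rag"},
        "insufficient_coverage": {"rag", "agent", "agentic_rag"},
        "citation_failure": {"rag", "agent", "agentic_rag"},
    },
    "tool_failures": {
        "wrong_tool_selected": {"agent", "tool_agent", "agentic_rag"},
        "tool_not_invoked": {"agent", "tool_agent", "agentic_rag"},
        "invalid_tool_arguments": {"agent", "tool_agent", "agentic_rag"},
        "tool_output_misused": {"agent", "tool_agent", "agentic_rag"},
    },
}


def active_subcategories(feature_kind: str) -> list[str]:
    """Subcategory IDs evaluated for a given feature_kind: everything
    marked universal ("all") plus anything listing that app type."""
    return [
        sub_id
        for subs in TAXONOMY_V0.values()
        for sub_id, applies_to in subs.items()
        if "all" in applies_to or feature_kind in applies_to
    ]
```

For `feature_kind="rag"` this yields the 7 universal subcategories plus the 4 retrieval-specific ones, while the 4 tool-failure subcategories are skipped entirely.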
## Detection Approaches

Each subcategory uses one of three detection approaches:

### Deterministic

Rule-based checks that don’t require an LLM. Fast, cheap, and highly reproducible. Used for format validation, tool argument checking, and pattern matching.
### LLM Judge

An LLM evaluates the input/output against the subcategory definition. Used for semantic analysis such as hallucination detection, intent resolution, and relevance assessment.
### Hybrid

A deterministic pre-filter followed by LLM judge confirmation. Combines the speed of rules with the accuracy of LLM judgment. Used for fabrication detection and citation verification.
## Metric Correlations

Each subcategory is linked to evaluation metrics that provide supporting evidence. For example, `unsupported_factual_claim` correlates with metrics like `hallucination` and `factual_accuracy`.
When Failure Analysis detects a failure, it cross-references evaluation
metric scores to strengthen or weaken its confidence in the classification.
This means running evaluations alongside Failure Analysis produces
higher-quality results.
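One way this cross-referencing could work is a simple confidence nudge; the actual weighting Valiqor applies is not specified here, so the thresholds and step size below are purely illustrative:

```python
def adjust_confidence(
    base: float,
    correlated_metrics: list[str],
    metric_scores: dict[str, float],
    threshold: float = 0.5,
    step: float = 0.1,
) -> float:
    """Nudge a classification's confidence using correlated metric scores.

    A low score on a correlated metric (the output looks bad on that
    axis) strengthens the failure classification; a high score weakens
    it. Metrics without scores leave the confidence unchanged.
    """
    confidence = base
    for metric in correlated_metrics:
        score = metric_scores.get(metric)
        if score is None:
            continue
        confidence += step if score < threshold else -step
    return max(0.0, min(1.0, confidence))  # clamp to [0, 1]
```

If no evaluations were run alongside Failure Analysis, `metric_scores` is empty and the base confidence passes through untouched, which matches the observation that pairing the two produces higher-quality results.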
## Taxonomy Versioning

The taxonomy is versioned and frozen to ensure reproducibility:

- v0 — 4 buckets, 15 subcategories (current)
- New subcategories can be added in future versions without breaking existing classifications
- Each `FATag` result includes the taxonomy version used
The taxonomy is designed to be extensible. Future versions may add new
buckets (e.g. multi-modal failures) or subcategories without changing
existing classifications.
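A minimal sketch of what a versioned tag result might look like; aside from the `FATag` name and the presence of a taxonomy version, every field here is an assumption:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FATag:
    """Hypothetical shape of a Failure Analysis tag result; field names
    other than the taxonomy version are illustrative."""
    bucket_id: str                 # L1 bucket, e.g. "tool_failures"
    subcategory_id: str            # L2 subcategory, e.g. "wrong_tool_selected"
    confidence: float              # classification confidence in [0, 1]
    taxonomy_version: str = "v0"   # frozen taxonomy version used
```

Pinning the version on every tag means results classified under v0 remain interpretable even after later taxonomy versions add new buckets or subcategories.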