Valiqor’s Failure Analysis engine classifies every failure into a two-level taxonomy: top-level buckets (L1) group related failure types, and subcategories (L2) identify the specific failure mode. This taxonomy is the foundation that powers severity scoring, root cause detection, and actionable remediation suggestions.

Taxonomy Hierarchy

Failure Taxonomy
├── Bucket (L1)                 # High-level failure category
│   ├── Subcategory (L2)        # Specific failure mode
│   │   ├── detection_approach  # How this failure is detected
│   │   ├── primary_metrics     # Eval metrics that evidence this failure
│   │   └── applies_to          # Which app types this applies to
│   ├── Subcategory (L2)
│   └── ...
├── Bucket (L1)
│   └── ...
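
As a rough illustration of the hierarchy above, the two levels could be modelled with a pair of dataclasses. This is only a sketch: the class names `Bucket` and `Subcategory` and the string values used for `detection_approach` and `applies_to` are assumptions for illustration, not the SDK's actual types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Subcategory:
    """Hypothetical model of an L2 entry; field names mirror the tree above."""
    id: str
    name: str
    detection_approach: str     # assumed values: "deterministic", "llm_judge", "hybrid"
    primary_metrics: List[str]  # eval metrics that evidence this failure
    applies_to: List[str]       # app types, e.g. ["all"] or ["rag", "agent"]


@dataclass(frozen=True)
class Bucket:
    """Hypothetical model of an L1 entry that groups related subcategories."""
    id: str
    name: str
    subcategories: List[Subcategory] = field(default_factory=list)


# One bucket with one of its subcategories, taken from the tables on this page.
tool_failures = Bucket(
    id="tool_failures",
    name="Tool & Function Failures",
    subcategories=[
        Subcategory(
            id="invalid_tool_arguments",
            name="Invalid tool arguments",
            detection_approach="deterministic",
            primary_metrics=[],  # metric names are not listed here, so left empty
            applies_to=["agent", "tool_agent", "agentic_rag"],
        )
    ],
)
```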

Buckets (L1)

There are 4 failure buckets in the current taxonomy (v0):
| Bucket ID | Name | Description |
|---|---|---|
| instruction_compliance | Instruction & Task Compliance | Failures related to following instructions, constraints, and task requirements |
| hallucination_grounding | Hallucination & Grounding | Failures related to factual accuracy, fabrication, and grounding |
| retrieval_failures | Retrieval (RAG) Failures | Failures in retrieval-augmented generation pipelines |
| tool_failures | Tool & Function Failures | Failures in tool selection, invocation, and output handling |

Subcategories (L2)

Each bucket contains specific failure subcategories. There are 15 subcategories in the current taxonomy:

Instruction & Task Compliance

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| task_not_followed | Task not followed | Model fails to follow explicit user or system instructions | LLM Judge | All |
| partial_task_completion | Partial task completion | Some required steps or outputs are missing in the final response | LLM Judge | All |
| wrong_intent_resolution | Wrong intent resolution | Model solves a different problem than what the user intended | LLM Judge | All |
| output_format_non_compliance | Output format non-compliance | Response does not follow the required output schema or format | Deterministic | All |

Hallucination & Grounding

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| unsupported_factual_claim | Unsupported factual claim | Model makes factual claims not supported by retrieved context or verified knowledge | LLM Judge | All |
| fabricated_details | Fabricated details or entities | Model invents entities, APIs, facts, or details | Hybrid | All |
| ungrounded_answer | Ungrounded answer vs context | Answer is not supported by retrieved documents or tool outputs | LLM Judge | RAG, Agent, Agentic RAG |

Retrieval (RAG) Failures

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| missing_retrieval | No retrieval when needed | Model answers without retrieval when external knowledge is required | Hybrid | RAG, Agent, Agentic RAG |
| low_relevance_retrieval | Low relevance retrieval | Retrieved documents are irrelevant to the user query | LLM Judge | RAG, Agent, Agentic RAG |
| insufficient_coverage | Insufficient retrieval coverage | Retrieved context misses key facts required to answer the query | LLM Judge | RAG, Agent, Agentic RAG |
| citation_failure | Citation / attribution failure | Answer does not cite or incorrectly cites retrieved sources | Hybrid | RAG, Agent, Agentic RAG |

Tool & Function Failures

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| wrong_tool_selected | Wrong tool selected | Agent selects an inappropriate tool for the task | LLM Judge | Agent, Tool Agent, Agentic RAG |
| tool_not_invoked | Tool not invoked when required | Agent fails to call a tool despite the task requiring it | Hybrid | Agent, Tool Agent, Agentic RAG |
| invalid_tool_arguments | Invalid tool arguments | Tool is invoked with malformed or incorrect parameters | Deterministic | Agent, Tool Agent, Agentic RAG |
| tool_output_misused | Tool output misused | Agent misinterprets or ignores the tool output in the final response | LLM Judge | Agent, Tool Agent, Agentic RAG |

Application Types

The applies_to field on each subcategory determines whether that subcategory is active for your application type. Set your application type via the feature_kind parameter in run():
| Type | Value | Description |
|---|---|---|
| All | all | Universal: applies to every app type |
| RAG | rag | Retrieval-augmented generation |
| Agent | agent | Autonomous agent with tool use |
| Agentic RAG | agentic_rag | Agent + retrieval |
| Tool Agent | tool_agent | Agent focused on tool orchestration |
| Multimodal | multimodal | Multi-modal applications |

When you specify feature_kind="rag", only subcategories that apply to RAG apps (plus universal ones) are evaluated. This keeps results relevant and reduces LLM judge calls.
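
The filtering rule described above can be sketched in a few lines. The `APPLIES_TO` mapping below copies a handful of rows from the subcategory tables, and `active_subcategories` is a hypothetical helper written for illustration; the engine's real implementation is not shown on this page and may differ.

```python
from typing import Dict, List

# A subset of the subcategory tables: subcategory ID -> app types it applies to.
APPLIES_TO: Dict[str, List[str]] = {
    "task_not_followed": ["all"],
    "output_format_non_compliance": ["all"],
    "ungrounded_answer": ["rag", "agent", "agentic_rag"],
    "low_relevance_retrieval": ["rag", "agent", "agentic_rag"],
    "wrong_tool_selected": ["agent", "tool_agent", "agentic_rag"],
    "invalid_tool_arguments": ["agent", "tool_agent", "agentic_rag"],
}


def active_subcategories(feature_kind: str) -> List[str]:
    """Keep universal subcategories plus those that list the given app type."""
    return [
        sub_id
        for sub_id, kinds in APPLIES_TO.items()
        if "all" in kinds or feature_kind in kinds
    ]


rag_subs = active_subcategories("rag")
tool_agent_subs = active_subcategories("tool_agent")
```

In this sketch, `rag_subs` contains the two universal subcategories and the two retrieval-related ones, while the tool subcategories are skipped, so no judge calls would be spent on them.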

Detection Approaches

Each subcategory uses one of three detection approaches:

Deterministic

Rule-based checks that don’t require an LLM. Fast, cheap, and highly reproducible. Used for format validation, tool argument checking, and pattern matching.
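
To make the idea concrete, here is a minimal sketch of a rule-based format check in the spirit of output_format_non_compliance. It assumes the required format is a JSON object with a known set of keys; the function name `violates_output_format` is hypothetical and is not part of the Valiqor SDK.

```python
import json
from typing import Iterable


def violates_output_format(response: str, required_keys: Iterable[str]) -> bool:
    """Return True if the response is not a JSON object containing every required key."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return True  # not valid JSON at all
    if not isinstance(payload, dict):
        return True  # valid JSON, but not an object
    return any(key not in payload for key in required_keys)


ok = violates_output_format('{"answer": "42", "sources": []}', ["answer", "sources"])
missing_key = violates_output_format('{"answer": "42"}', ["answer", "sources"])
not_json = violates_output_format("The answer is 42.", ["answer"])
```

Because the check involves no model call, the same input always yields the same verdict.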

LLM Judge

An LLM evaluates the input/output against the subcategory definition. Used for semantic analysis like hallucination detection, intent resolution, and relevance assessment.

Hybrid

Deterministic pre-filter followed by LLM judge confirmation. Combines speed of rules with accuracy of LLM judgment. Used for fabrication detection and citation verification.
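
The hybrid flow can be sketched as a cheap rule that decides whether the judge needs to run at all. The example below uses citation checking and assumes citations appear as bracketed numbers such as [1]. The helper names and the `judge` callable are hypothetical stand-ins, and a real pre-filter may use different rules.

```python
import re
from typing import Callable, List


def citation_prefilter(answer: str, num_sources: int) -> bool:
    """Cheap rule: flag answers with no citations or with out-of-range citation numbers."""
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    if not cited:
        return True
    return any(n < 1 or n > num_sources for n in cited)


def detect_citation_failure(
    answer: str,
    sources: List[str],
    judge: Callable[[str, List[str]], bool],
) -> bool:
    """Run the judge only on candidates that the pre-filter flags."""
    if not citation_prefilter(answer, len(sources)):
        return False  # cheap path: no judge call needed
    return judge(answer, sources)


def stub_judge(answer: str, sources: List[str]) -> bool:
    """Placeholder for an LLM judge call; always confirms the failure."""
    return True


sources = ["doc about France", "doc about Germany"]
flagged = detect_citation_failure("Paris is the capital of France [3].", sources, stub_judge)
clean = detect_citation_failure("Paris is the capital of France [1].", sources, stub_judge)
```

One consequence of this design is that an answer passing the pre-filter is never seen by the judge, so the rule should be permissive enough not to hide real failures.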

Metric Correlations

Each subcategory is linked to evaluation metrics that provide supporting evidence. For example, unsupported_factual_claim correlates with metrics like hallucination and factual_accuracy. When Failure Analysis detects a failure, it cross-references evaluation metric scores to strengthen or weaken its confidence in the classification. This means running evaluations alongside Failure Analysis produces higher-quality results.
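
The exact way metric scores feed into confidence is not documented here. The sketch below shows one plausible scheme purely as an illustration: the blending formula, the `weight` value, and the assumption that a high hallucination score or a low factual_accuracy score counts as evidence of failure are all hypothetical.

```python
from typing import Sequence


def adjust_confidence(
    base: float,
    failure_evidence: Sequence[float],
    weight: float = 0.4,
) -> float:
    """Shift a base confidence up or down using metric evidence in [0, 1].

    Evidence above 0.5 strengthens the classification and evidence below 0.5
    weakens it. This formula is an assumed example, not Valiqor's scoring rule.
    """
    if not failure_evidence:
        return base  # no metrics available: keep the original confidence
    mean_evidence = sum(failure_evidence) / len(failure_evidence)
    adjusted = base + weight * (mean_evidence - 0.5)
    return max(0.0, min(1.0, adjusted))


# Hypothetical metric scores for a response tagged unsupported_factual_claim.
hallucination = 0.8      # assumed: higher means more hallucination
factual_accuracy = 0.3   # assumed: higher means more accurate

# Convert both metrics so that higher values mean stronger evidence of failure.
evidence = [hallucination, 1.0 - factual_accuracy]
confidence = adjust_confidence(0.7, evidence)
```

With no metric scores available, the function returns the base confidence unchanged, which matches the idea that evaluations improve the classification but are not required for it.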

Taxonomy Versioning

The taxonomy is versioned and frozen to ensure reproducibility:
  • v0 — 4 buckets, 15 subcategories (current)
  • New subcategories can be added in future versions without breaking existing classifications
  • Each FATag result includes the taxonomy version used
The taxonomy is designed to be extensible: future versions may also add new buckets (e.g. multimodal failures) without changing existing classifications.
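
Since each result carries its taxonomy version, results can be grouped by version before they are compared across runs. The sketch below represents tags as plain dictionaries; the key names `subcategory_id` and `taxonomy_version` are assumptions for illustration and may not match the actual FATag fields.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_taxonomy_version(tags: List[dict]) -> Dict[str, List[dict]]:
    """Bucket tag records by the taxonomy version they were classified under."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for tag in tags:
        grouped[tag["taxonomy_version"]].append(tag)
    return dict(grouped)


# Hypothetical tag records; only the version string "v0" comes from this page.
tags = [
    {"subcategory_id": "tool_not_invoked", "taxonomy_version": "v0"},
    {"subcategory_id": "citation_failure", "taxonomy_version": "v0"},
]
by_version = group_by_taxonomy_version(tags)
```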

SDK Access

You can browse the taxonomy programmatically:
```python
from valiqor import ValiqorClient

client = ValiqorClient()
fa = client.failure_analysis

# Get full taxonomy
taxonomy = fa.get_taxonomy()

# Get subcategories for a specific bucket
subs = fa.get_subcategories(bucket_id="hallucination_grounding")

# Get details for a specific bucket
details = fa.get_bucket_details(bucket_id="tool_failures")
```
See the Failure Analysis workflow for complete usage examples.