Valiqor’s Failure Analysis engine classifies every failure into a two-level taxonomy: top-level buckets (L1) group related failure types, and subcategories (L2) identify the specific failure mode. This taxonomy is the foundation that powers severity scoring, root cause detection, and actionable remediation suggestions.

## Documentation Index

Fetch the complete documentation index at https://docs.valiqor.com/llms.txt and use it to discover all available pages before exploring further.
## Taxonomy Hierarchy
### Buckets (L1)

There are 4 failure buckets in the current taxonomy (v0):

| Bucket ID | Name | Description |
|---|---|---|
| `instruction_compliance` | Instruction & Task Compliance | Failures related to following instructions, constraints, and task requirements |
| `hallucination_grounding` | Hallucination & Grounding | Failures related to factual accuracy, fabrication, and grounding |
| `retrieval_failures` | Retrieval (RAG) Failures | Failures in retrieval-augmented generation pipelines |
| `tool_failures` | Tool & Function Failures | Failures in tool selection, invocation, and output handling |
### Subcategories (L2)

Each bucket contains specific failure subcategories. There are 15 subcategories in the current taxonomy.

#### Instruction & Task Compliance

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `task_not_followed` | Task not followed | Model fails to follow explicit user or system instructions | LLM Judge | All |
| `partial_task_completion` | Partial task completion | Some required steps or outputs are missing from the final response | LLM Judge | All |
| `wrong_intent_resolution` | Wrong intent resolution | Model solves a different problem than the user intended | LLM Judge | All |
| `output_format_non_compliance` | Output format non-compliance | Response does not follow the required output schema or format | Deterministic | All |
#### Hallucination & Grounding

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `unsupported_factual_claim` | Unsupported factual claim | Model makes factual claims not supported by retrieved context or verified knowledge | LLM Judge | All |
| `fabricated_details` | Fabricated details or entities | Model invents entities, APIs, facts, or details | Hybrid | All |
| `ungrounded_answer` | Ungrounded answer vs context | Answer is not supported by retrieved documents or tool outputs | LLM Judge | RAG, Agent, Agentic RAG |
#### Retrieval (RAG) Failures

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `missing_retrieval` | No retrieval when needed | Model answers without retrieval when external knowledge is required | Hybrid | RAG, Agent, Agentic RAG |
| `low_relevance_retrieval` | Low relevance retrieval | Retrieved documents are irrelevant to the user query | LLM Judge | RAG, Agent, Agentic RAG |
| `insufficient_coverage` | Insufficient retrieval coverage | Retrieved context misses key facts required to answer the query | LLM Judge | RAG, Agent, Agentic RAG |
| `citation_failure` | Citation / attribution failure | Answer does not cite or incorrectly cites retrieved sources | Hybrid | RAG, Agent, Agentic RAG |
#### Tool & Function Failures

| ID | Name | Definition | Detection | Applies To |
|---|---|---|---|---|
| `wrong_tool_selected` | Wrong tool selected | Agent selects an inappropriate tool for the task | LLM Judge | Agent, Tool Agent, Agentic RAG |
| `tool_not_invoked` | Tool not invoked when required | Agent fails to call a tool despite the task requiring it | Hybrid | Agent, Tool Agent, Agentic RAG |
| `invalid_tool_arguments` | Invalid tool arguments | Tool is invoked with malformed or incorrect parameters | Deterministic | Agent, Tool Agent, Agentic RAG |
| `tool_output_misused` | Tool output misused | Agent misinterprets or ignores the tool output in the final response | LLM Judge | Agent, Tool Agent, Agentic RAG |
## Application Types

The `applies_to` field controls which subcategories are active for your application type. Set this via the `feature_kind` parameter in `run()`:
| Type | Value | Description |
|---|---|---|
| All | all | Universal — applies to every app type |
| RAG | rag | Retrieval-augmented generation |
| Agent | agent | Autonomous agent with tool use |
| Agentic RAG | agentic_rag | Agent + retrieval |
| Tool Agent | tool_agent | Agent focused on tool orchestration |
| Multimodal | multimodal | Multi-modal applications |
With `feature_kind="rag"`, only subcategories that apply to RAG apps (plus universal ones) are evaluated. This keeps results relevant and reduces LLM judge calls.
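The `applies_to` rules from the tables above can be sketched as plain data. This is a hypothetical representation, not the engine's internal one; only the bucket and subcategory IDs come from the documentation:

```python
# Hypothetical sketch of the v0 taxonomy: each subcategory maps to the
# set of application types it applies to ("all" means universal).
TAXONOMY_V0: dict[str, dict[str, set[str]]] = {
    "instruction_compliance": {
        "task_not_followed": {"all"},
        "partial_task_completion": {"all"},
        "wrong_intent_resolution": {"all"},
        "output_format_non_compliance": {"all"},
    },
    "hallucination_grounding": {
        "unsupported_factual_claim": {"all"},
        "fabricated_details": {"all"},
        "ungrounded_answer": {"rag", "agent", "agentic_rag"},
    },
    "retrieval_failures": {
        "missing_retrieval": {"rag", "agent", "agentic_rag"},
        "low_relevance_retrieval": {"rag", "agent", "agentic_rag"},
        "insufficient_coverage": {"rag", "agent", "agentic_rag"},
        "citation_failure": {"rag", "agent", "agentic_rag"},
    },
    "tool_failures": {
        "wrong_tool_selected": {"agent", "tool_agent", "agentic_rag"},
        "tool_not_invoked": {"agent", "tool_agent", "agentic_rag"},
        "invalid_tool_arguments": {"agent", "tool_agent", "agentic_rag"},
        "tool_output_misused": {"agent", "tool_agent", "agentic_rag"},
    },
}


def active_subcategories(feature_kind: str) -> list[str]:
    """Subcategory IDs evaluated for a given feature_kind: everything
    marked universal ("all") plus anything listing that app type."""
    return [
        sub_id
        for subs in TAXONOMY_V0.values()
        for sub_id, applies_to in subs.items()
        if "all" in applies_to or feature_kind in applies_to
    ]
```

For `feature_kind="rag"` this yields the 7 universal subcategories plus the 4 retrieval-specific ones, while the 4 tool-failure subcategories are skipped entirely.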
## Detection Approaches

Each subcategory uses one of three detection approaches:

### Deterministic

Rule-based checks that don’t require an LLM. Fast, cheap, and highly reproducible. Used for format validation, tool argument checking, and pattern matching.
### LLM Judge

An LLM evaluates the input/output against the subcategory definition. Used for semantic analysis such as hallucination detection, intent resolution, and relevance assessment.
### Hybrid

A deterministic pre-filter followed by LLM judge confirmation. Combines the speed of rules with the accuracy of LLM judgment. Used for fabrication detection and citation verification.
## Metric Correlations

Each subcategory is linked to evaluation metrics that provide supporting evidence. For example, `unsupported_factual_claim` correlates with metrics like `hallucination` and `factual_accuracy`.
When Failure Analysis detects a failure, it cross-references evaluation
metric scores to strengthen or weaken its confidence in the classification.
This means running evaluations alongside Failure Analysis produces
higher-quality results.
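One way this cross-referencing could work is a simple confidence nudge; the actual weighting Valiqor applies is not specified here, so the thresholds and step size below are purely illustrative:

```python
def adjust_confidence(
    base: float,
    correlated_metrics: list[str],
    metric_scores: dict[str, float],
    threshold: float = 0.5,
    step: float = 0.1,
) -> float:
    """Nudge a classification's confidence using correlated metric scores.

    A low score on a correlated metric (the output looks bad on that
    axis) strengthens the failure classification; a high score weakens
    it. Metrics without scores leave the confidence unchanged.
    """
    confidence = base
    for metric in correlated_metrics:
        score = metric_scores.get(metric)
        if score is None:
            continue
        confidence += step if score < threshold else -step
    return max(0.0, min(1.0, confidence))  # clamp to [0, 1]
```

If no evaluations were run alongside Failure Analysis, `metric_scores` is empty and the base confidence passes through untouched, which matches the observation that pairing the two produces higher-quality results.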
## Taxonomy Versioning

The taxonomy is versioned and frozen to ensure reproducibility:

- v0 — 4 buckets, 15 subcategories (current)
- New subcategories can be added in future versions without breaking existing classifications
- Each `FATag` result includes the taxonomy version used
The taxonomy is designed to be extensible. Future versions may add new
buckets (e.g. multi-modal failures) or subcategories without changing
existing classifications.
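A minimal sketch of what a versioned tag result might look like; aside from the `FATag` name and the presence of a taxonomy version, every field here is an assumption:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FATag:
    """Hypothetical shape of a Failure Analysis tag result; field names
    other than the taxonomy version are illustrative."""
    bucket_id: str                 # L1 bucket, e.g. "tool_failures"
    subcategory_id: str            # L2 subcategory, e.g. "wrong_tool_selected"
    confidence: float              # classification confidence in [0, 1]
    taxonomy_version: str = "v0"   # frozen taxonomy version used
```

Pinning the version on every tag means results classified under v0 remain interpretable even after later taxonomy versions add new buckets or subcategories.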