Valiqor finds why your AI app fails
Most evaluation tools tell you something is wrong. Valiqor tells you what went wrong, why it happened, and how to fix it, with severity scoring, root-cause analysis, and actionable remediation.
You can get started in 5 minutes, with no tracing or instrumentation needed. Just pass your existing AI inputs and outputs to Failure Analysis and see results immediately.
Five modules, one SDK
Valiqor’s Python SDK exposes five modules through a single client:
| Module | What it does | When to use it |
|---|---|---|
| Failure Analysis | Classifies failures into buckets & subcategories, scores severity (0–5), explains root cause | First thing to run — works on any input/output data |
| Evaluation | Runs metric-based checks (hallucination, relevance, coherence, etc.) | When you need granular metric scores |
| Security | Runs red-team attacks, checks for prompt injection, data leakage, jailbreaks | Before deploying to production |
| Tracing | Auto-captures LLM calls, tool use, retrieval steps as structured traces | When you want continuous monitoring in production |
| Scanner | Scans repositories and prompts for best-practice violations | During code review or CI |
Two ways to get started
Quick start — Run on existing data
Pass your AI inputs/outputs directly. No tracing, no instrumentation. See failure results in under 5 minutes.
Full observability — Add tracing
Instrument your LLM calls with auto-tracing. Run Failure Analysis on traces for continuous production monitoring.
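For the quick-start path, the only preparation is shaping existing logs into input/output pairs. The sketch below shows one way to do that from JSONL log text. It deliberately stops short of calling Valiqor, because the SDK's call signature is not shown on this page. The `Interaction` type, the `load_interactions` helper, and the `prompt` and `response` log field names are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Interaction:
    """One logged exchange: what the app received and what it returned."""
    input: str
    output: str


def load_interactions(jsonl_text: str) -> list[Interaction]:
    """Parse JSONL log text into input/output pairs, skipping incomplete rows."""
    interactions = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        # "prompt" and "response" are assumed field names; adapt to your logs.
        prompt, response = row.get("prompt"), row.get("response")
        if prompt and response:
            interactions.append(Interaction(input=prompt, output=response))
    return interactions


# Illustrative sample data, not real application logs.
sample_log = "\n".join([
    json.dumps({"prompt": "What is the refund window?", "response": "30 days."}),
    json.dumps({"prompt": "Who founded the company?"}),  # no response logged
])
interactions = load_interactions(sample_log)
print(len(interactions))
```

Rows without both an input and an output are dropped here, since there is nothing to analyze without the model's response.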
Who is Valiqor for?
Valiqor is built for AI engineers and teams who build applications on top of:
- LLM providers: OpenAI, Anthropic, Google, Mistral, Cohere, and any OpenAI-compatible API
- Orchestration frameworks: LangChain, CrewAI, LlamaIndex, Haystack
- Custom pipelines: RAG systems, agent loops, multi-step chains
What you get
Failure taxonomy
Every failure is classified into buckets (e.g., Hallucination, Context Ignorance) and subcategories (e.g., Entity Fabrication, Contradicts Source).
Root-cause analysis
Each failure includes a judge_rationale explaining exactly what went wrong, plus a scoring_breakdown with evidence.
Severity & confidence scoring
Severity from 0 (no issue) to 5 (critical). Confidence from 0.0 to 1.0. Filter and prioritize failures automatically.
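To make the filtering and prioritizing concrete, here is a minimal sketch that models a failure record with the fields described above and triages a small list. It is not the SDK's actual result type. The `Failure` class, the `prioritize` helper, the thresholds, and the sample records (including which subcategory sits under which bucket) are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical stand-in for one classified failure."""
    bucket: str
    subcategory: str
    severity: int      # 0 (no issue) to 5 (critical)
    confidence: float  # 0.0 to 1.0
    judge_rationale: str


def prioritize(failures, min_severity=3, min_confidence=0.7):
    """Keep severe, high-confidence failures, most severe first."""
    kept = [
        f for f in failures
        if f.severity >= min_severity and f.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda f: (-f.severity, -f.confidence))


# Illustrative records only; the bucket/subcategory pairing is assumed.
failures = [
    Failure("Hallucination", "Entity Fabrication", 5, 0.92, "Names a product absent from the source."),
    Failure("Context Ignorance", "Contradicts Source", 3, 0.55, "Answer conflicts with the retrieved passage."),
    Failure("Hallucination", "Entity Fabrication", 4, 0.81, "Cites a person not mentioned in the context."),
    Failure("Context Ignorance", "Contradicts Source", 1, 0.95, "Minor wording mismatch with the source."),
]

urgent = prioritize(failures)
by_bucket = Counter(f.bucket for f in failures)

for f in urgent:
    print(f.severity, f.confidence, f.bucket, f.subcategory)
print(by_bucket.most_common())
```

Requiring both a severity floor and a confidence floor keeps low-confidence judgments out of the urgent queue, while the per-bucket tally shows where failures cluster.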
Ready? See your first failure →
Run Failure Analysis on your existing data in under 5 minutes.