Evaluations run metric-based quality checks on your AI outputs. Unlike Failure Analysis, which classifies root causes, evaluations give you numeric scores for specific quality dimensions.
```python
from valiqor import ValiqorClient

client = ValiqorClient(api_key="vq_...", project_name="my-app")

result = client.eval.evaluate(
    dataset=[
        {
            "input": "What is the capital of France?",
            "output": "The capital of France is Paris.",
            "expected": "Paris",
            "context": ["France is a country in Europe. Its capital is Paris."],
        }
    ],
    metrics=["factual_accuracy", "answer_relevance", "coherence"],
)

print(f"Overall score: {result.overall_score}")
print(f"Run ID: {result.run_id}")
```
```python
result = client.eval.evaluate(
    dataset=[...],        # List of {input, output, context, expected} dicts
    metrics=["..."],      # Metric keys to compute (required)
    project_name=None,    # Overrides client-level project_name
    run_name=None,        # Optional label for this run
    metadata=None,        # Optional metadata dict
    openai_api_key=None,  # Your OpenAI key for LLM judges (BYOK)
)
```
If you have a captured trace, evaluate it directly:
```python
import json

# Load a trace file
with open("valiqor_output/traces/trace.json") as f:
    trace = json.load(f)

result = client.eval.evaluate_trace(
    trace=trace,
    metrics=["hallucination", "answer_relevance", "context_recall"],
    run_name="trace-eval-v1",
)
```
`evaluate_trace()` takes a trace dict (the full JSON object), not a trace ID string. Load the trace data first, either with `client.trace_query.get_full_trace()` or from a local file.
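For example, a minimal sketch of fetching the full trace from the backend before evaluating it. The trace ID value and the `trace_id` keyword name below are illustrative assumptions, not confirmed parts of the API:

```python
# Sketch only: the trace ID value and keyword name are assumptions.
trace = client.trace_query.get_full_trace(trace_id="trace_abc123")

result = client.eval.evaluate_trace(
    trace=trace,  # pass the full trace dict, not the ID string
    metrics=["hallucination", "answer_relevance"],
)
print(f"Overall score: {result.overall_score}")
```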
For large datasets (≥20 rows or ≥5 LLM metrics), use explicit async:
```python
job = client.eval.evaluate_async(
    dataset=large_dataset,
    metrics=["hallucination", "answer_relevance", "coherence"],
    run_name="large-batch-v1",
)

# Wait with progress
result = job.wait(
    on_progress=lambda s: print(f"{s.progress_percent:.0f}%")
)

# Or poll manually
import time

while job.is_running():
    status = job.status()
    print(f"{status.progress_percent:.0f}% ({status.current_item}/{status.total_items})")
    time.sleep(2)

result = job.result()
```
Even with `evaluate()` (not async), the backend may decide to process large datasets asynchronously. The SDK handles this transparently and auto-polls until the result is ready.
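As a rough sketch, the calling code stays the same regardless of dataset size; `large_dataset` here is assumed to be a list of row dicts like the quick-start example above:

```python
# Same synchronous call; if the backend switches to async processing,
# the SDK polls in the background and returns the final result.
result = client.eval.evaluate(
    dataset=large_dataset,  # assumed: a long list of {input, output, ...} dicts
    metrics=["hallucination", "coherence"],
)
print(f"Overall score: {result.overall_score}")
```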
```python
items = client.eval.get_run_items(
    run_id=result.run_id,
    limit=10,
    offset=0,
)

for item in items:
    print(f"Item: {item}")

# Deep-dive into a single item
detail = client.eval.get_item_detail(
    run_id=result.run_id,
    item_id="item_xyz",
)
```
```python
# List available metric templates
templates = client.eval.list_metric_templates()
for t in templates:
    print(f"{t.key}: {t.display_name}")

# List metrics configured for your project
project_metrics = client.eval.list_project_metrics(project_name="my-app")

# Add a new metric to your project
client.eval.add_project_metric(
    metric_key="hallucination",
    display_name="Hallucination Score",
    project_name="my-app",
)

# Get project stats
stats = client.eval.get_project_stats(project_name="my-app")
```
```bash
# Run evaluation
valiqor eval run --dataset test_data.json --metrics hallucination,coherence \
  --project-name my-app --run-name "test-v1"

# Check status (for async runs)
valiqor eval status --run-id run_xyz

# Get results
valiqor eval result --run-id run_xyz --output results.json

# List past runs
valiqor eval list --project-name my-app --limit 10

# List available metrics
valiqor eval metrics
```
Evaluation Model →
How LLM judges score, thresholds, and metric details.