Evaluations run metric-based quality checks on your AI outputs. Unlike Failure Analysis, which classifies root causes, evaluations assign numeric scores along specific quality dimensions.
## Quick start

```python
from valiqor import ValiqorClient

client = ValiqorClient(api_key="vq_...", project_name="my-app")

result = client.eval.evaluate(
    dataset=[
        {
            "input": "What is the capital of France?",
            "output": "The capital of France is Paris.",
            "expected": "Paris",
            "context": ["France is a country in Europe. Its capital is Paris."],
        }
    ],
    metrics=["factual_accuracy", "answer_relevance", "coherence"],
)

print(f"Overall score: {result.overall_score}")
print(f"Run ID: {result.run_id}")
```
## Full evaluate() signature

```python
result = client.eval.evaluate(
    dataset=[...],        # List of {input, output, context, expected} dicts
    metrics=["..."],      # Metric keys to compute (required)
    project_name=None,    # Overrides client-level project_name
    run_name=None,        # Optional label for this run
    metadata=None,        # Optional metadata dict
    openai_api_key=None,  # Your OpenAI key for LLM judges (BYOK)
)
```
| Field | Type | Required | Description |
|---|---|---|---|
| `input` | `str` | ✅ | User prompt / query |
| `output` | `str` | ✅ | Model's response |
| `context` | `list[str]` | Optional | Retrieved documents (for RAG metrics) |
| `expected` | `str` | Optional | Expected / ground-truth output (for comparison metrics) |
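Before submitting a dataset, it can help to check rows against the field requirements above. The sketch below is a hypothetical helper, not part of the Valiqor SDK; it only encodes the table's required/optional fields and their types.

```python
# Hypothetical helper (not part of the Valiqor SDK): validates dataset rows
# against the field table above before calling evaluate().

REQUIRED = {"input": str, "output": str}
OPTIONAL = {"context": list, "expected": str}

def validate_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row is valid."""
    problems = []
    for field, typ in REQUIRED.items():
        if field not in row:
            problems.append(f"missing required field: {field}")
        elif not isinstance(row[field], typ):
            problems.append(f"{field} must be {typ.__name__}")
    for field, typ in OPTIONAL.items():
        if field in row and not isinstance(row[field], typ):
            problems.append(f"{field} must be {typ.__name__}")
    return problems

row = {"input": "What is the capital of France?", "output": "Paris."}
print(validate_row(row))  # → []
```

Running this over the whole dataset before an evaluation avoids burning an async run on malformed rows.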
## Available metrics

### Heuristic metrics (fast, no LLM needed)
| Metric | Description |
|---|---|
| `contains` | Checks if output contains expected string |
| `equals` | Exact match between output and expected |
| `levenshtein` | Edit-distance similarity score |
| `regex_match` | Regex pattern match against output |
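To build intuition for what these heuristic metrics compute, here is a rough pure-Python sketch. These are illustrative approximations, not the SDK's implementations; in particular, `difflib`'s ratio is used as a stand-in for a true normalized Levenshtein distance.

```python
import re
from difflib import SequenceMatcher  # stand-in for a real Levenshtein metric

def contains(output: str, expected: str) -> float:
    # 1.0 if the expected string appears anywhere in the output.
    return 1.0 if expected in output else 0.0

def equals(output: str, expected: str) -> float:
    # Exact string match.
    return 1.0 if output == expected else 0.0

def levenshtein_similarity(output: str, expected: str) -> float:
    # Approximation: the real metric likely normalizes edit distance
    # by string length; difflib's ratio behaves similarly in spirit.
    return SequenceMatcher(None, output, expected).ratio()

def regex_match(output: str, pattern: str) -> float:
    # 1.0 if the pattern matches anywhere in the output.
    return 1.0 if re.search(pattern, output) else 0.0

print(contains("The capital of France is Paris.", "Paris"))  # → 1.0
```

Because these need no LLM calls, they are cheap enough to run on every row of even large datasets.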
### LLM-based metrics (use an LLM judge)
| Metric | Description |
|---|---|
| `hallucination` | Detects fabricated or unsupported claims |
| `answer_relevance` | How relevant the answer is to the input question |
| `context_precision` | How precisely the context was used (RAG) |
| `context_recall` | How completely the context was utilized (RAG) |
| `coherence` | Logical flow and consistency of the response |
| `fluency` | Grammar, readability, and naturalness |
| `factual_accuracy` | Correctness of facts stated in the output |
| `task_adherence` | How well the output follows the given instructions |
| `response_completeness` | Whether the response fully addresses the query |
LLM-based metrics use OpenAI GPT-4o by default. You can use your own key via the `openai_api_key` parameter or the `VALIQOR_OPENAI_API_KEY` environment variable. See BYOK.
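The key-resolution order is presumably "explicit parameter wins over environment variable"; the sketch below illustrates that assumption (it is not SDK code, and the BYOK docs are the authoritative source).

```python
import os

def resolve_openai_key(openai_api_key=None):
    # Assumed precedence: an explicitly passed key beats the
    # VALIQOR_OPENAI_API_KEY environment variable. Verify against
    # the BYOK documentation.
    return openai_api_key or os.environ.get("VALIQOR_OPENAI_API_KEY")

os.environ["VALIQOR_OPENAI_API_KEY"] = "sk-env"
print(resolve_openai_key())          # → sk-env
print(resolve_openai_key("sk-arg"))  # → sk-arg
```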
## Evaluate from a trace

If you have a captured trace, evaluate it directly:
```python
import json

# Load a trace file
with open("valiqor_output/traces/trace.json") as f:
    trace = json.load(f)

result = client.eval.evaluate_trace(
    trace=trace,
    metrics=["hallucination", "answer_relevance", "context_recall"],
    run_name="trace-eval-v1",
)
```
`evaluate_trace()` takes a trace dict (the full JSON object), not a trace ID string. Load the trace data first using `client.trace_query.get_full_trace()` or from a local file.
## Async evaluation

For large datasets (≥20 rows or ≥5 LLM metrics), use explicit async:
```python
import time

job = client.eval.evaluate_async(
    dataset=large_dataset,
    metrics=["hallucination", "answer_relevance", "coherence"],
    run_name="large-batch-v1",
)

# Wait with progress
result = job.wait(
    on_progress=lambda s: print(f"{s.progress_percent:.0f}%")
)

# Or poll manually
while job.is_running():
    status = job.status()
    print(f"{status.progress_percent:.0f}% ({status.current_item}/{status.total_items})")
    time.sleep(2)
result = job.result()
```
Even with `evaluate()` (not the async variant), the backend may decide to process large datasets asynchronously. The SDK handles this transparently: it auto-polls until the result is ready.
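The auto-polling the SDK performs can be sketched as a generic poll-with-backoff loop. This is a self-contained illustration of the pattern, not the SDK's internal code; the `get_status` callable and its `state` field are assumptions for the sketch.

```python
import time

def poll_until_done(get_status, interval=0.5, max_interval=8.0, timeout=300.0):
    """Poll get_status() until it reports a terminal state, with
    exponential backoff between checks (sketch of the pattern only)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status["state"] in ("succeeded", "failed"):
            return status
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # back off up to a cap
    raise TimeoutError("evaluation did not finish in time")

# Simulated backend: finishes on the third status check.
calls = iter([{"state": "running"}, {"state": "running"}, {"state": "succeeded"}])
print(poll_until_done(lambda: next(calls), interval=0.01))  # → {'state': 'succeeded'}
```

Exponential backoff keeps the polling cheap for long-running jobs while staying responsive for quick ones.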
## Reading results

### Overall score

```python
print(f"Overall score: {result.overall_score}")
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Total items: {result.total_items}")
```
### Per-metric scores

```python
metrics = client.eval.get_run_metrics(run_id=result.run_id)
for metric in metrics:
    print(f"{metric.key}: {metric.score:.3f} (weight: {metric.weight})")
```
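Since each metric carries a weight, a plausible reading is that the overall score is a weight-normalized average of the per-metric scores. The sketch below illustrates that assumption with plain dicts; verify it against an actual run before relying on it.

```python
def overall_score(metrics):
    # Assumption: overall score = weight-normalized average of
    # per-metric scores. Plain dicts stand in for the SDK's objects.
    total_weight = sum(m["weight"] for m in metrics)
    return sum(m["score"] * m["weight"] for m in metrics) / total_weight

metrics = [
    {"key": "hallucination", "score": 0.9, "weight": 2.0},
    {"key": "coherence", "score": 0.6, "weight": 1.0},
]
print(round(overall_score(metrics), 3))  # → 0.8
```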
### Per-item details

```python
items = client.eval.get_run_items(
    run_id=result.run_id,
    limit=10,
    offset=0,
)
for item in items:
    print(f"Item: {item}")

# Deep-dive into a single item
detail = client.eval.get_item_detail(
    run_id=result.run_id,
    item_id="item_xyz",
)
```
## Trends and comparison

### Trends over time

```python
trends = client.eval.get_trends(
    project_name="my-app",
    metric="hallucination",
)
for point in trends:
    print(f"{point.date}: {point.score:.3f}")
```
### Compare runs

```python
comparison = client.eval.compare_runs(
    run_ids=["run_1", "run_2", "run_3"],  # 2-5 runs
)
for run in comparison:
    print(f"Run {run.run_id}: {run.overall_score:.3f}")
```
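A common follow-up to a comparison is picking the best-scoring run. The helper below is hypothetical post-processing, shown with plain dicts in place of the SDK's run objects.

```python
def best_run(runs):
    # Hypothetical helper: select the run with the highest overall score.
    # Plain dicts stand in for compare_runs() result objects.
    return max(runs, key=lambda r: r["overall_score"])

runs = [
    {"run_id": "run_1", "overall_score": 0.72},
    {"run_id": "run_2", "overall_score": 0.81},
    {"run_id": "run_3", "overall_score": 0.77},
]
print(best_run(runs)["run_id"])  # → run_2
```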
## Project metrics management

```python
# List available metric templates
templates = client.eval.list_metric_templates()
for t in templates:
    print(f"{t.key}: {t.display_name}")

# List metrics configured for your project
project_metrics = client.eval.list_project_metrics(project_name="my-app")

# Add a new metric to your project
client.eval.add_project_metric(
    metric_key="hallucination",
    display_name="Hallucination Score",
    project_name="my-app",
)

# Get project stats
stats = client.eval.get_project_stats(project_name="my-app")
```
## CLI

```bash
# Run evaluation
valiqor eval run --dataset test_data.json --metrics hallucination,coherence \
  --project-name my-app --run-name "test-v1"

# Check status (for async runs)
valiqor eval status --run-id run_xyz

# Get results
valiqor eval result --run-id run_xyz --output results.json

# List past runs
valiqor eval list --project-name my-app --limit 10

# List available metrics
valiqor eval metrics
```