Evaluations run metric-based quality checks on your AI outputs. Unlike Failure Analysis, which classifies root causes, evaluations produce numeric scores for specific quality dimensions.

Quick start

from valiqor import ValiqorClient

client = ValiqorClient(api_key="vq_...", project_name="my-app")

result = client.eval.evaluate(
    dataset=[
        {
            "input": "What is the capital of France?",
            "output": "The capital of France is Paris.",
            "expected": "Paris",
            "context": ["France is a country in Europe. Its capital is Paris."],
        }
    ],
    metrics=["factual_accuracy", "answer_relevance", "coherence"],
)

print(f"Overall score: {result.overall_score}")
print(f"Run ID: {result.run_id}")

Full evaluate() signature

result = client.eval.evaluate(
    dataset=[...],              # List of {input, output, context, expected} dicts
    metrics=["..."],            # Metric keys to compute (required)
    project_name=None,          # Overrides client-level project_name
    run_name=None,              # Optional label for this run
    metadata=None,              # Optional metadata dict
    openai_api_key=None,        # Your OpenAI key for LLM judges (BYOK)
)

Dataset item format

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `input` | `str` | Yes | User prompt / query |
| `output` | `str` | Yes | Model's response |
| `context` | `list[str]` | Optional | Retrieved documents (for RAG metrics) |
| `expected` | `str` | Optional | Expected / ground-truth output (for comparison metrics) |
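For a client-side sanity check before submitting a dataset, a small validator along these lines can catch malformed items early. This helper is hypothetical and not part of the SDK; the backend performs its own validation.

```python
REQUIRED_FIELDS = {"input", "output"}
OPTIONAL_FIELDS = {"context", "expected"}

def validate_item(item: dict) -> dict:
    """Check one dataset item against the field table above (illustrative only)."""
    missing = REQUIRED_FIELDS - item.keys()
    if missing:
        raise ValueError(f"dataset item missing required fields: {sorted(missing)}")
    unknown = item.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"dataset item has unexpected fields: {sorted(unknown)}")
    if "context" in item and not isinstance(item["context"], list):
        raise TypeError("context must be a list of strings")
    return item

validate_item({
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
    "context": ["France is a country in Europe. Its capital is Paris."],
})
```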

Available metrics

Heuristic metrics (fast, no LLM needed)

| Metric | Description |
| --- | --- |
| `contains` | Checks if the output contains the expected string |
| `equals` | Exact match between output and expected |
| `levenshtein` | Edit-distance similarity score |
| `regex_match` | Regex pattern match against the output |
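To give a sense of how the `levenshtein` metric scores outputs, here is a standalone sketch of normalized edit-distance similarity. The SDK's exact normalization may differ; this is an illustration, not the library's implementation.

```python
def levenshtein_similarity(output: str, expected: str) -> float:
    """Normalized edit-distance similarity in [0, 1]; 1.0 means identical strings."""
    m, n = len(output), len(expected)
    if m == 0 and n == 0:
        return 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == expected[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    # Normalize by the longer string so the score lands in [0, 1].
    return 1.0 - prev[n] / max(m, n)

print(levenshtein_similarity("Paris", "Paris"))   # identical -> 1.0
print(levenshtein_similarity("Pariss", "Paris"))  # one edit over six characters
```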

LLM-based metrics (use an LLM judge)

| Metric | Description |
| --- | --- |
| `hallucination` | Detects fabricated or unsupported claims |
| `answer_relevance` | How relevant the answer is to the input question |
| `context_precision` | How precisely the context was used (RAG) |
| `context_recall` | How completely the context was utilized (RAG) |
| `coherence` | Logical flow and consistency of the response |
| `fluency` | Grammar, readability, and naturalness |
| `factual_accuracy` | Correctness of facts stated in the output |
| `task_adherence` | How well the output follows the given instructions |
| `response_completeness` | Whether the response fully addresses the query |

LLM-based metrics use OpenAI GPT-4o by default. You can supply your own key via the openai_api_key parameter or the VALIQOR_OPENAI_API_KEY environment variable. See BYOK.
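The key-resolution order described above (explicit parameter first, then the environment variable) can be sketched as follows; `resolve_openai_key` is an illustrative helper, not an SDK function:

```python
import os

def resolve_openai_key(explicit_key=None):
    """Return the OpenAI key to use: an explicit argument wins,
    otherwise fall back to the VALIQOR_OPENAI_API_KEY env var."""
    return explicit_key or os.environ.get("VALIQOR_OPENAI_API_KEY")

# A key passed directly takes precedence over the environment.
os.environ["VALIQOR_OPENAI_API_KEY"] = "sk-from-env"
print(resolve_openai_key("sk-from-param"))  # sk-from-param
print(resolve_openai_key())                 # sk-from-env
```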

Evaluate from a trace

If you have a captured trace, you can evaluate it directly:

import json

# Load a trace file
with open("valiqor_output/traces/trace.json") as f:
    trace = json.load(f)

result = client.eval.evaluate_trace(
    trace=trace,
    metrics=["hallucination", "answer_relevance", "context_recall"],
    run_name="trace-eval-v1",
)

evaluate_trace() takes a trace dict (the full JSON object), not a trace ID string. Load the trace data first using client.trace_query.get_full_trace() or from a local file.
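A small guard (hypothetical, not part of the SDK) can catch the common mistake of passing a trace ID string where the full trace dict is expected:

```python
def ensure_trace_dict(trace):
    """Raise early if a trace ID string was passed instead of the loaded trace object."""
    if isinstance(trace, str):
        raise TypeError(
            "evaluate_trace() expects the full trace dict; load it first, "
            "e.g. via client.trace_query.get_full_trace() or json.load()"
        )
    return trace

trace = ensure_trace_dict({"trace_id": "t1", "spans": []})  # OK: a dict passes through
```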

Async evaluation

For large datasets (≥20 rows or ≥5 LLM metrics), use explicit async evaluation:

job = client.eval.evaluate_async(
    dataset=large_dataset,
    metrics=["hallucination", "answer_relevance", "coherence"],
    run_name="large-batch-v1",
)

# Wait with progress
result = job.wait(
    on_progress=lambda s: print(f"{s.progress_percent:.0f}%")
)

import time

# Or poll manually
while job.is_running():
    status = job.status()
    print(f"{status.progress_percent:.0f}% ({status.current_item}/{status.total_items})")
    time.sleep(2)

result = job.result()

Even with evaluate() (not async), the backend may decide to process large datasets asynchronously. The SDK handles this transparently, auto-polling until the result is ready.
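The auto-polling behavior can be approximated by a plain poll-until-done loop. The sketch below uses a stubbed job object so it runs standalone; it does not reflect the SDK's internal implementation.

```python
import time

class StubJob:
    """Stand-in for an async eval job; the real SDK object differs."""
    def __init__(self, total_polls=3):
        self._polls = 0
        self._total = total_polls

    def is_running(self):
        return self._polls < self._total

    def status(self):
        self._polls += 1  # each poll moves the stub closer to completion
        return self._polls

    def result(self):
        return {"overall_score": 0.91}

def wait_for(job, poll_interval=0.01):
    """Poll until the job stops running, then fetch the final result."""
    while job.is_running():
        job.status()
        time.sleep(poll_interval)
    return job.result()

print(wait_for(StubJob()))  # → {'overall_score': 0.91}
```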

Reading results

Overall score

print(f"Overall score: {result.overall_score}")
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Total items: {result.total_items}")

Per-metric scores

metrics = client.eval.get_run_metrics(run_id=result.run_id)

for metric in metrics:
    print(f"{metric.key}: {metric.score:.3f} (weight: {metric.weight})")

Per-item details

items = client.eval.get_run_items(
    run_id=result.run_id,
    limit=10,
    offset=0,
)

for item in items:
    print(f"Item: {item}")

# Deep-dive into a single item
detail = client.eval.get_item_detail(
    run_id=result.run_id,
    item_id="item_xyz",
)

Metric trends

trends = client.eval.get_trends(
    project_name="my-app",
    metric="hallucination",
)

for point in trends:
    print(f"{point.date}: {point.score:.3f}")

Compare runs

comparison = client.eval.compare_runs(
    run_ids=["run_1", "run_2", "run_3"],  # 2-5 runs
)

for run in comparison:
    print(f"Run {run.run_id}: {run.overall_score:.3f}")

Project metrics management

# List available metric templates
templates = client.eval.list_metric_templates()
for t in templates:
    print(f"{t.key}: {t.display_name}")

# List metrics configured for your project
project_metrics = client.eval.list_project_metrics(project_name="my-app")

# Add a new metric to your project
client.eval.add_project_metric(
    metric_key="hallucination",
    display_name="Hallucination Score",
    project_name="my-app",
)

# Get project stats
stats = client.eval.get_project_stats(project_name="my-app")

CLI

# Run evaluation
valiqor eval run --dataset test_data.json --metrics hallucination,coherence \
    --project-name my-app --run-name "test-v1"

# Check status (for async runs)
valiqor eval status --run-id run_xyz

# Get results
valiqor eval result --run-id run_xyz --output results.json

# List past runs
valiqor eval list --project-name my-app --limit 10

# List available metrics
valiqor eval metrics