Evaluations run metric-based quality checks on your AI outputs. Unlike Failure Analysis, which classifies root causes, evaluations produce numeric scores for specific quality dimensions.

Quick start

from valiqor import ValiqorClient

client = ValiqorClient(api_key="vq_...", project_name="my-app")

result = client.eval.evaluate(
    dataset=[
        {
            "input": "What is the capital of France?",
            "output": "The capital of France is Paris.",
            "expected": "Paris",
            "context": ["France is a country in Europe. Its capital is Paris."],
        }
    ],
    metrics=["factual_accuracy", "answer_relevance", "coherence"],
)

print(f"Overall score: {result.overall_score}")
print(f"Run ID: {result.run_id}")

Full evaluate() signature

result = client.eval.evaluate(
    dataset=[...],              # List of {input, output, context, expected} dicts
    metrics=["..."],            # Metric keys to compute (required)
    project_name=None,          # Overrides client-level project_name
    run_name=None,              # Optional label for this run
    metadata=None,              # Optional metadata dict
    openai_api_key=None,        # Your OpenAI key for LLM judges (BYOK)
)

Dataset item format

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `input` | `str` | Yes | User prompt / query |
| `output` | `str` | Yes | Model's response |
| `context` | `list[str]` | Optional | Retrieved documents (for RAG metrics) |
| `expected` | `str` | Optional | Expected / ground-truth output (for comparison metrics) |
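For a client-side sanity check before submitting a dataset, a small validator along these lines can catch malformed items early. This helper is hypothetical and not part of the SDK; the backend performs its own validation.

```python
REQUIRED_FIELDS = {"input", "output"}
OPTIONAL_FIELDS = {"context", "expected"}

def validate_item(item: dict) -> dict:
    """Check one dataset item against the field table above (illustrative only)."""
    missing = REQUIRED_FIELDS - item.keys()
    if missing:
        raise ValueError(f"dataset item missing required fields: {sorted(missing)}")
    unknown = item.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"dataset item has unexpected fields: {sorted(unknown)}")
    if "context" in item and not isinstance(item["context"], list):
        raise TypeError("context must be a list of strings")
    return item

validate_item({
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
    "context": ["France is a country in Europe. Its capital is Paris."],
})
```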

Available metrics

Heuristic metrics (fast, no LLM needed)

| Metric | Description |
| --- | --- |
| `contains` | Checks if the output contains the expected string |
| `equals` | Exact match between output and expected |
| `levenshtein` | Edit-distance similarity score |
| `regex_match` | Regex pattern match against the output |
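To give a sense of how the `levenshtein` metric scores outputs, here is a standalone sketch of normalized edit-distance similarity. The SDK's exact normalization may differ; this is an illustration, not the library's implementation.

```python
def levenshtein_similarity(output: str, expected: str) -> float:
    """Normalized edit-distance similarity in [0, 1]; 1.0 means identical strings."""
    m, n = len(output), len(expected)
    if m == 0 and n == 0:
        return 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == expected[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    # Normalize by the longer string so the score lands in [0, 1].
    return 1.0 - prev[n] / max(m, n)

print(levenshtein_similarity("Paris", "Paris"))   # identical -> 1.0
print(levenshtein_similarity("Pariss", "Paris"))  # one edit over six characters
```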

LLM-based metrics (use an LLM judge)

| Metric | Description |
| --- | --- |
| `hallucination` | Detects fabricated or unsupported claims |
| `answer_relevance` | How relevant the answer is to the input question |
| `context_precision` | How precisely the context was used (RAG) |
| `context_recall` | How completely the context was utilized (RAG) |
| `coherence` | Logical flow and consistency of the response |
| `fluency` | Grammar, readability, and naturalness |
| `factual_accuracy` | Correctness of facts stated in the output |
| `task_adherence` | How well the output follows the given instructions |
| `response_completeness` | Whether the response fully addresses the query |

LLM-based metrics use OpenAI GPT-4o by default. You can supply your own key via the openai_api_key parameter or the VALIQOR_OPENAI_API_KEY environment variable. See BYOK.
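The key-resolution order described above (explicit parameter first, then the environment variable) can be sketched as follows; `resolve_openai_key` is an illustrative helper, not an SDK function:

```python
import os

def resolve_openai_key(explicit_key=None):
    """Return the OpenAI key to use: an explicit argument wins,
    otherwise fall back to the VALIQOR_OPENAI_API_KEY env var."""
    return explicit_key or os.environ.get("VALIQOR_OPENAI_API_KEY")

# A key passed directly takes precedence over the environment.
os.environ["VALIQOR_OPENAI_API_KEY"] = "sk-from-env"
print(resolve_openai_key("sk-from-param"))  # sk-from-param
print(resolve_openai_key())                 # sk-from-env
```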

Evaluate from a trace

If you have a captured trace, you can evaluate it directly:

import json

# Load a trace file
with open("valiqor_output/traces/trace.json") as f:
    trace = json.load(f)

result = client.eval.evaluate_trace(
    trace=trace,
    metrics=["hallucination", "answer_relevance", "context_recall"],
    run_name="trace-eval-v1",
)

evaluate_trace() takes a trace dict (the full JSON object), not a trace ID string. Load the trace data first using client.trace_query.get_full_trace() or from a local file.
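A small guard (hypothetical, not part of the SDK) can catch the common mistake of passing a trace ID string where the full trace dict is expected:

```python
def ensure_trace_dict(trace):
    """Raise early if a trace ID string was passed instead of the loaded trace object."""
    if isinstance(trace, str):
        raise TypeError(
            "evaluate_trace() expects the full trace dict; load it first, "
            "e.g. via client.trace_query.get_full_trace() or json.load()"
        )
    return trace

trace = ensure_trace_dict({"trace_id": "t1", "spans": []})  # OK: a dict passes through
```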

Async evaluation

For large datasets (≥20 rows or ≥5 LLM metrics), use explicit async evaluation:

job = client.eval.evaluate_async(
    dataset=large_dataset,
    metrics=["hallucination", "answer_relevance", "coherence"],
    run_name="large-batch-v1",
)

# Wait with progress
result = job.wait(
    on_progress=lambda s: print(f"{s.progress_percent:.0f}%")
)

import time

# Or poll manually
while job.is_running():
    status = job.status()
    print(f"{status.progress_percent:.0f}% ({status.current_item}/{status.total_items})")
    time.sleep(2)

result = job.result()

Even with evaluate() (not async), the backend may decide to process large datasets asynchronously. The SDK handles this transparently, auto-polling until the result is ready.
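The auto-polling behavior can be approximated by a plain poll-until-done loop. The sketch below uses a stubbed job object so it runs standalone; it does not reflect the SDK's internal implementation.

```python
import time

class StubJob:
    """Stand-in for an async eval job; the real SDK object differs."""
    def __init__(self, total_polls=3):
        self._polls = 0
        self._total = total_polls

    def is_running(self):
        return self._polls < self._total

    def status(self):
        self._polls += 1  # each poll moves the stub closer to completion
        return self._polls

    def result(self):
        return {"overall_score": 0.91}

def wait_for(job, poll_interval=0.01):
    """Poll until the job stops running, then fetch the final result."""
    while job.is_running():
        job.status()
        time.sleep(poll_interval)
    return job.result()

print(wait_for(StubJob()))  # → {'overall_score': 0.91}
```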

Reading results

Overall score

print(f"Overall score: {result.overall_score}")
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Total items: {result.total_items}")

Per-metric scores

metrics = client.eval.get_run_metrics(run_id=result.run_id)

for metric in metrics:
    print(f"{metric.key}: {metric.score:.3f} (weight: {metric.weight})")

Per-item details

items = client.eval.get_run_items(
    run_id=result.run_id,
    limit=10,
    offset=0,
)

for item in items:
    print(f"Item: {item}")

# Deep-dive into a single item
detail = client.eval.get_item_detail(
    run_id=result.run_id,
    item_id="item_xyz",
)

Metric trends

trends = client.eval.get_trends(
    project_name="my-app",
    metric="hallucination",
)

for point in trends:
    print(f"{point.date}: {point.score:.3f}")

Compare runs

comparison = client.eval.compare_runs(
    run_ids=["run_1", "run_2", "run_3"],  # 2-5 runs
)

for run in comparison:
    print(f"Run {run.run_id}: {run.overall_score:.3f}")

Project metrics management

# List available metric templates
templates = client.eval.list_metric_templates()
for t in templates:
    print(f"{t.key}: {t.display_name}")

# List metrics configured for your project
project_metrics = client.eval.list_project_metrics(project_name="my-app")

# Add a new metric to your project
client.eval.add_project_metric(
    metric_key="hallucination",
    display_name="Hallucination Score",
    project_name="my-app",
)

# Get project stats
stats = client.eval.get_project_stats(project_name="my-app")

CLI

# Run evaluation
valiqor eval run --dataset test_data.json --metrics hallucination,coherence \
    --project-name my-app --run-name "test-v1"

# Check status (for async runs)
valiqor eval status --run-id run_xyz

# Get results
valiqor eval result --run-id run_xyz --output results.json

# List past runs
valiqor eval list --project-name my-app --limit 10

# List available metrics
valiqor eval metrics