Valiqor’s evaluation engine measures AI output quality using a combination of heuristic metrics (fast, deterministic) and LLM judge metrics (semantic, context-aware). This page explains what each metric measures, how scoring works, and how quality grades are assigned.

Metrics Overview

Valiqor ships with 17 built-in metrics across two categories:

4 Heuristic Metrics

Fast, deterministic checks that don’t require an LLM. Ideal for format validation, exact matching, and string comparison.

13 LLM Judge Metrics

Semantic evaluation using an LLM as a judge. Used for quality, relevance, hallucination detection, and task adherence.

Heuristic Metrics

These metrics run locally without any LLM calls. They return a score between 0.0 and 1.0:
| Metric | Description | Scoring |
| --- | --- | --- |
| contains | Checks if the output contains expected substrings | 1.0 if all expected strings found, 0.0 otherwise |
| equals | Exact string match between output and expected | 1.0 if exact match, 0.0 otherwise |
| regex_match | Tests output against a regular expression pattern | 1.0 if pattern matches, 0.0 otherwise |
| levenshtein | Edit distance between output and expected text | Normalized 0.0–1.0 (1.0 = identical) |
Usage example:
```python
result = client.eval.evaluate(
    dataset=[{
        "input": "What is 2+2?",
        "output": "The answer is 4",
        "expected": "4"
    }],
    metrics=["contains", "levenshtein"]
)
```
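The levenshtein score in the table above normalizes edit distance into the 0.0–1.0 range. One common normalization (a sketch; the exact formula Valiqor uses is not specified on this page) divides the edit distance by the length of the longer string:

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Return 1.0 for identical strings, approaching 0.0 as they diverge."""
    # Classic dynamic-programming edit distance.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    distance = dp[m][n]
    longest = max(m, n) or 1  # avoid division by zero for two empty strings
    return 1.0 - distance / longest
```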

LLM Judge Metrics

These metrics use an LLM to evaluate output quality semantically. Each metric sends a structured prompt to the judge model and receives a normalized score (0.0–1.0) with an explanation.

Quality & Relevance

| Metric | What It Measures |
| --- | --- |
| answer_relevance | How relevant the response is to the user’s question |
| coherence | Logical flow and consistency of the response |
| fluency | Language quality — grammar, readability, naturalness |
| response_completeness | Whether all parts of the question are addressed |
| task_adherence | How well the output follows specific instructions or constraints |

Factual Accuracy

| Metric | What It Measures |
| --- | --- |
| hallucination | Whether the output contains fabricated or unsupported claims |
| factual_accuracy | Whether statements are factually correct |

RAG-Specific

| Metric | What It Measures |
| --- | --- |
| context_precision | How much of the retrieved context is relevant to the query |
| context_recall | Whether the retrieved context covers all information needed |

Specialized

| Metric | What It Measures |
| --- | --- |
| intent_resolution | Whether the correct user intent was identified and addressed |
| tool_call_accuracy | Whether the right tools were called with correct arguments |
| moderation | Content safety — flags harmful, toxic, or inappropriate content |
| retrieval | Quality of the retrieval step (if applicable) |

How LLM Judging Works

Each LLM judge metric follows this flow:
Input Data (prompt, response, context)
  ↓
Structured Evaluation Prompt
  ↓
LLM Judge (default: gpt-4o)
  ↓
Structured Output (score + explanation)
  ↓
Normalized Score (0.0–1.0)
  1. The metric receives the dataset item (input, output, context, expected)
  2. A structured prompt is constructed for the specific metric
  3. The prompt is sent to the judge LLM
  4. The LLM returns a structured response with a score and rationale
  5. The score is normalized to the 0.0–1.0 range
The judge model defaults to gpt-4o but can be configured. You can use your own OpenAI API key via the openai_api_key parameter — see BYOK for details.
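The five steps above can be sketched as a small function. This is illustrative only — the function, prompt wording, and 0–10 raw scale are hypothetical, not Valiqor's internal implementation:

```python
import json

def judge(item: dict, metric: str, call_llm) -> dict:
    """Hypothetical sketch of one LLM-judge evaluation."""
    # Steps 1-2: receive the dataset item and build a structured prompt
    # for the specific metric.
    prompt = (
        f"Evaluate the response for '{metric}'.\n"
        f"Question: {item['input']}\n"
        f"Response: {item['output']}\n"
        'Reply as JSON: {"score": <0-10>, "reason": "..."}'
    )
    # Steps 3-4: send the prompt to the judge LLM and parse its
    # structured reply (score + rationale).
    reply = json.loads(call_llm(prompt))
    # Step 5: normalize the raw score to the 0.0-1.0 range.
    return {"value": reply["score"] / 10.0, "reason": reply["reason"]}
```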

Score Results

Each metric evaluation produces a ScoreResult:
| Field | Type | Description |
| --- | --- | --- |
| value | float | Score (typically 0.0–1.0) |
| name | str | Metric name |
| reason | str | LLM explanation for the score |
| metadata | dict | Additional data (confidence, model used, etc.) |
| scoring_failed | bool | Whether computation failed |
| execution_time_ms | float | Time taken to compute |
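Conceptually, a ScoreResult maps onto a dataclass like the following. This local definition only mirrors the fields in the table — it is not the SDK's actual class:

```python
from dataclasses import dataclass, field

@dataclass
class ScoreResult:
    """Illustrative stand-in for one metric evaluation result."""
    value: float                 # score, typically 0.0-1.0
    name: str                    # metric name, e.g. "hallucination"
    reason: str = ""             # LLM explanation for the score
    metadata: dict = field(default_factory=dict)  # confidence, model used, etc.
    scoring_failed: bool = False # whether computation failed
    execution_time_ms: float = 0.0
```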

Aggregation & Quality Grades

When evaluating a dataset with multiple items and metrics, Valiqor computes aggregate scores:

Aggregation

  • Per-metric average: Mean of all item scores for each metric
  • Overall quality: Mean of all per-metric averages
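The two aggregation steps are plain arithmetic — a per-metric mean over items, then a mean of those means:

```python
# Scores for 3 dataset items across 2 metrics (example values).
scores = {
    "hallucination": [0.9, 0.7, 0.8],
    "answer_relevance": [1.0, 0.6, 0.8],
}

# Per-metric average: mean of all item scores for each metric.
per_metric = {metric: sum(vals) / len(vals) for metric, vals in scores.items()}

# Overall quality: mean of all per-metric averages.
overall = sum(per_metric.values()) / len(per_metric)
# Here both per-metric means and the overall score come out to roughly 0.8.
```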

Quality Grades

The overall quality score is mapped to a letter grade:
| Grade | Score Range | Interpretation |
| --- | --- | --- |
| A | ≥ 0.9 | Excellent — system performing well |
| B | ≥ 0.8 | Good — minor improvements possible |
| C | ≥ 0.7 | Acceptable — meets minimum threshold |
| D | ≥ 0.6 | Below threshold — review recommended |
| F | < 0.6 | Poor — significant issues detected |
The default quality threshold is 0.7 (Grade C). Scores below this trigger insights and recommendations in the evaluation result.
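The grade mapping in the table can be expressed directly as a threshold cascade:

```python
def quality_grade(score: float) -> str:
    """Map an overall quality score (0.0-1.0) to a letter grade."""
    if score >= 0.9:
        return "A"   # excellent
    if score >= 0.8:
        return "B"   # good
    if score >= 0.7:
        return "C"   # acceptable (default threshold)
    if score >= 0.6:
        return "D"   # below threshold
    return "F"       # poor
```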

Metric Configuration

You can customize which metrics to run and configure them per-project:
```python
# Run specific metrics only
result = client.eval.evaluate(
    dataset=data,
    metrics=["hallucination", "answer_relevance", "task_adherence"]
)

# List available metric templates
templates = client.eval.list_metric_templates()

# Configure per-project metrics
client.eval.add_project_metric(
    project_name="my-chatbot",
    metric_key="hallucination"
)
```

Quality Threshold

The quality threshold defaults to 0.7 (Grade C) and can be adjusted per-project.

Dataset Format

Evaluation datasets are lists of dictionaries. Required and optional fields depend on the metrics being used:
```python
dataset = [
    {
        "input": "What is the capital of France?",    # Required: user prompt
        "output": "The capital of France is Paris.",   # Required: AI response
        "context": ["France is a country in Europe. Its capital is Paris."],  # Optional: for RAG metrics
        "expected": "Paris"                            # Optional: for heuristic metrics
    }
]
```
| Field | Required | Used By |
| --- | --- | --- |
| input | Yes | All metrics |
| output | Yes | All metrics |
| context | For RAG metrics | context_precision, context_recall, hallucination |
| expected | For heuristic metrics | contains, equals, regex_match, levenshtein |
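Based on the table above, a pre-flight check can catch missing fields before an evaluation runs. This helper is not part of the SDK — it is a sketch assuming the field requirements listed here:

```python
# Metric groups from the field-requirements table above.
HEURISTIC = {"contains", "equals", "regex_match", "levenshtein"}
NEEDS_CONTEXT = {"context_precision", "context_recall", "hallucination"}

def missing_fields(item: dict, metrics: list[str]) -> list[str]:
    """Return the fields this item lacks for the requested metrics."""
    missing = [f for f in ("input", "output") if f not in item]  # always required
    if HEURISTIC & set(metrics) and "expected" not in item:
        missing.append("expected")   # heuristic metrics compare against "expected"
    if NEEDS_CONTEXT & set(metrics) and "context" not in item:
        missing.append("context")    # RAG metrics need retrieved context
    return missing  # empty list means the item is usable for these metrics
```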

Where Metrics Are Used

Evaluation metrics appear throughout Valiqor:
  • Standalone evaluations — evaluate() runs metrics on a dataset
  • Trace evaluations — evaluate_trace() extracts data from a trace and evaluates it
  • Failure Analysis — metric scores serve as evidence for failure classification
  • Trends — get_trends() tracks metric scores over time
  • Comparisons — compare_runs() compares metrics across evaluation runs
See the Evaluations workflow for complete usage examples.