Valiqor’s evaluation engine measures AI output quality using a combination of heuristic metrics (fast, deterministic) and LLM judge metrics (semantic, context-aware). This page explains what each metric measures, how scoring works, and how quality grades are assigned.

Metrics Overview

Valiqor ships with 17 built-in metrics across two categories:

4 Heuristic Metrics

Fast, deterministic checks that don’t require an LLM. Ideal for format validation, exact matching, and string comparison.

13 LLM Judge Metrics

Semantic evaluation using an LLM as a judge. Used for quality, relevance, hallucination detection, and task adherence.

Heuristic Metrics

These metrics run locally without any LLM calls. They return a score between 0.0 and 1.0:
| Metric | Description | Scoring |
| --- | --- | --- |
| contains | Checks if the output contains expected substrings | 1.0 if all expected strings found, 0.0 otherwise |
| equals | Exact string match between output and expected | 1.0 if exact match, 0.0 otherwise |
| regex_match | Tests output against a regular expression pattern | 1.0 if pattern matches, 0.0 otherwise |
| levenshtein | Edit distance between output and expected text | Normalized 0.0–1.0 (1.0 = identical) |
Usage example:
```python
result = client.eval.evaluate(
    dataset=[{
        "input": "What is 2+2?",
        "output": "The answer is 4",
        "expected": "4"
    }],
    metrics=["contains", "levenshtein"]
)
```
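The levenshtein score in the table above normalizes edit distance into the 0.0–1.0 range. One common normalization (a sketch; the exact formula Valiqor uses is not specified on this page) divides the edit distance by the length of the longer string:

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Return 1.0 for identical strings, approaching 0.0 as they diverge."""
    # Classic dynamic-programming edit distance.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    distance = dp[m][n]
    longest = max(m, n) or 1  # avoid division by zero for two empty strings
    return 1.0 - distance / longest
```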

LLM Judge Metrics

These metrics use an LLM to evaluate output quality semantically. Each metric sends a structured prompt to the judge model and receives a normalized score (0.0–1.0) with an explanation.

Quality & Relevance

| Metric | What It Measures |
| --- | --- |
| answer_relevance | How relevant the response is to the user’s question |
| coherence | Logical flow and consistency of the response |
| fluency | Language quality — grammar, readability, naturalness |
| response_completeness | Whether all parts of the question are addressed |
| task_adherence | How well the output follows specific instructions or constraints |

Factual Accuracy

| Metric | What It Measures |
| --- | --- |
| hallucination | Whether the output contains fabricated or unsupported claims |
| factual_accuracy | Whether statements are factually correct |

RAG-Specific

| Metric | What It Measures |
| --- | --- |
| context_precision | How much of the retrieved context is relevant to the query |
| context_recall | Whether the retrieved context covers all information needed |

Specialized

| Metric | What It Measures |
| --- | --- |
| intent_resolution | Whether the correct user intent was identified and addressed |
| tool_call_accuracy | Whether the right tools were called with correct arguments |
| moderation | Content safety — flags harmful, toxic, or inappropriate content |
| retrieval | Quality of the retrieval step (if applicable) |

How LLM Judging Works

Each LLM judge metric follows this flow:
Input Data (prompt, response, context)
  ↓
Structured Evaluation Prompt
  ↓
LLM Judge (default: gpt-4o)
  ↓
Structured Output (score + explanation)
  ↓
Normalized Score (0.0–1.0)
  1. The metric receives the dataset item (input, output, context, expected)
  2. A structured prompt is constructed for the specific metric
  3. The prompt is sent to the judge LLM
  4. The LLM returns a structured response with a score and rationale
  5. The score is normalized to the 0.0–1.0 range
The judge model defaults to gpt-4o but can be configured. You can use your own OpenAI API key via the openai_api_key parameter — see BYOK for details.
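The five steps above can be sketched as a small function. This is illustrative only — the function, prompt wording, and 0–10 raw scale are hypothetical, not Valiqor's internal implementation:

```python
import json

def judge(item: dict, metric: str, call_llm) -> dict:
    """Hypothetical sketch of one LLM-judge evaluation."""
    # Steps 1-2: receive the dataset item and build a structured prompt
    # for the specific metric.
    prompt = (
        f"Evaluate the response for '{metric}'.\n"
        f"Question: {item['input']}\n"
        f"Response: {item['output']}\n"
        'Reply as JSON: {"score": <0-10>, "reason": "..."}'
    )
    # Steps 3-4: send the prompt to the judge LLM and parse its
    # structured reply (score + rationale).
    reply = json.loads(call_llm(prompt))
    # Step 5: normalize the raw score to the 0.0-1.0 range.
    return {"value": reply["score"] / 10.0, "reason": reply["reason"]}
```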

Score Results

Each metric evaluation produces a ScoreResult:
| Field | Type | Description |
| --- | --- | --- |
| value | float | Score (typically 0.0–1.0) |
| name | str | Metric name |
| reason | str | LLM explanation for the score |
| metadata | dict | Additional data (confidence, model used, etc.) |
| scoring_failed | bool | Whether computation failed |
| execution_time_ms | float | Time taken to compute |
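Conceptually, a ScoreResult maps onto a dataclass like the following. This local definition only mirrors the fields in the table — it is not the SDK's actual class:

```python
from dataclasses import dataclass, field

@dataclass
class ScoreResult:
    """Illustrative stand-in for one metric evaluation result."""
    value: float                 # score, typically 0.0-1.0
    name: str                    # metric name, e.g. "hallucination"
    reason: str = ""             # LLM explanation for the score
    metadata: dict = field(default_factory=dict)  # confidence, model used, etc.
    scoring_failed: bool = False # whether computation failed
    execution_time_ms: float = 0.0
```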

Aggregation & Quality Grades

When evaluating a dataset with multiple items and metrics, Valiqor computes aggregate scores:

Aggregation

  • Per-metric average: Mean of all item scores for each metric
  • Overall quality: Mean of all per-metric averages
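The two aggregation steps are plain arithmetic — a per-metric mean over items, then a mean of those means:

```python
# Scores for 3 dataset items across 2 metrics (example values).
scores = {
    "hallucination": [0.9, 0.7, 0.8],
    "answer_relevance": [1.0, 0.6, 0.8],
}

# Per-metric average: mean of all item scores for each metric.
per_metric = {metric: sum(vals) / len(vals) for metric, vals in scores.items()}

# Overall quality: mean of all per-metric averages.
overall = sum(per_metric.values()) / len(per_metric)
# Here both per-metric means and the overall score come out to roughly 0.8.
```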

Quality Grades

The overall quality score is mapped to a letter grade:
| Grade | Score Range | Interpretation |
| --- | --- | --- |
| A | ≥ 0.9 | Excellent — system performing well |
| B | ≥ 0.8 | Good — minor improvements possible |
| C | ≥ 0.7 | Acceptable — meets minimum threshold |
| D | ≥ 0.6 | Below threshold — review recommended |
| F | < 0.6 | Poor — significant issues detected |
The default quality threshold is 0.7 (Grade C). Scores below this trigger insights and recommendations in the evaluation result.
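The grade mapping in the table can be expressed directly as a threshold cascade:

```python
def quality_grade(score: float) -> str:
    """Map an overall quality score (0.0-1.0) to a letter grade."""
    if score >= 0.9:
        return "A"   # excellent
    if score >= 0.8:
        return "B"   # good
    if score >= 0.7:
        return "C"   # acceptable (default threshold)
    if score >= 0.6:
        return "D"   # below threshold
    return "F"       # poor
```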

Metric Configuration

You can customize which metrics to run and configure them per-project:
```python
# Run specific metrics only
result = client.eval.evaluate(
    dataset=data,
    metrics=["hallucination", "answer_relevance", "task_adherence"]
)

# List available metric templates
templates = client.eval.list_metric_templates()

# Configure per-project metrics
client.eval.add_project_metric(
    project_name="my-chatbot",
    metric_key="hallucination"
)
```

Quality Threshold

The quality threshold defaults to 0.7 (Grade C) and can be adjusted per-project.

Dataset Format

Evaluation datasets are lists of dictionaries. Required and optional fields depend on the metrics being used:
```python
dataset = [
    {
        "input": "What is the capital of France?",    # Required: user prompt
        "output": "The capital of France is Paris.",   # Required: AI response
        "context": ["France is a country in Europe. Its capital is Paris."],  # Optional: for RAG metrics
        "expected": "Paris"                            # Optional: for heuristic metrics
    }
]
```
| Field | Required | Used By |
| --- | --- | --- |
| input | Yes | All metrics |
| output | Yes | All metrics |
| context | For RAG metrics | context_precision, context_recall, hallucination |
| expected | For heuristic metrics | contains, equals, regex_match, levenshtein |
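Based on the table above, a pre-flight check can catch missing fields before an evaluation runs. This helper is not part of the SDK — it is a sketch assuming the field requirements listed here:

```python
# Metric groups from the field-requirements table above.
HEURISTIC = {"contains", "equals", "regex_match", "levenshtein"}
NEEDS_CONTEXT = {"context_precision", "context_recall", "hallucination"}

def missing_fields(item: dict, metrics: list[str]) -> list[str]:
    """Return the fields this item lacks for the requested metrics."""
    missing = [f for f in ("input", "output") if f not in item]  # always required
    if HEURISTIC & set(metrics) and "expected" not in item:
        missing.append("expected")   # heuristic metrics compare against "expected"
    if NEEDS_CONTEXT & set(metrics) and "context" not in item:
        missing.append("context")    # RAG metrics need retrieved context
    return missing  # empty list means the item is usable for these metrics
```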

Where Metrics Are Used

Evaluation metrics appear throughout Valiqor:
  • Standalone evaluations — evaluate() runs metrics on a dataset
  • Trace evaluations — evaluate_trace() extracts data from a trace and evaluates it
  • Failure Analysis — metric scores serve as evidence for failure classification
  • Trends — get_trends() tracks metric scores over time
  • Comparisons — compare_runs() compares metrics across evaluation runs
See the Evaluations workflow for complete usage examples.