Valiqor’s evaluation engine measures AI output quality using a combination of
heuristic metrics (fast, deterministic) and LLM judge metrics (semantic,
context-aware). This page explains what each metric measures, how scoring
works, and how quality grades are assigned.
Heuristic metrics run locally without any LLM calls and return a score
between 0.0 and 1.0:
| Metric | Description | Scoring |
| --- | --- | --- |
| `contains` | Checks if the output contains expected substrings | 1.0 if all expected strings are found, 0.0 otherwise |
| `equals` | Exact string match between output and expected | 1.0 if exact match, 0.0 otherwise |
| `regex_match` | Tests the output against a regular expression pattern | 1.0 if the pattern matches, 0.0 otherwise |
| `levenshtein` | Edit distance between output and expected text | Normalized 0.0–1.0 (1.0 = identical) |
Usage example:

```python
result = client.eval.evaluate(
    dataset=[{
        "input": "What is 2+2?",
        "output": "The answer is 4",
        "expected": "4"
    }],
    metrics=["contains", "levenshtein"]
)
```
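For intuition, the `levenshtein` score is an edit-distance ratio: the number of single-character edits needed to turn the output into the expected text, normalized by the longer string's length. The sketch below illustrates that normalization; it is not Valiqor's internal implementation, just the standard dynamic-programming edit distance with the 0.0–1.0 scaling described in the table.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_score(output: str, expected: str) -> float:
    # Normalize to 0.0-1.0, where 1.0 means the strings are identical.
    if not output and not expected:
        return 1.0
    distance = levenshtein_distance(output, expected)
    return 1.0 - distance / max(len(output), len(expected))
```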
LLM judge metrics use an LLM to evaluate output quality semantically. Each
metric sends a structured prompt to the judge model and receives a normalized
score (0.0–1.0) along with an explanation.
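Conceptually, each judge call bundles the input and output into a rubric-style prompt and parses a score and explanation from the model's reply. The sketch below shows that flow under stated assumptions: the prompt wording is hypothetical and `judge_model` is a placeholder for whatever LLM call your setup uses, not Valiqor's actual judge prompts.

```python
import json

def judge(metric: str, input_text: str, output_text: str, judge_model) -> dict:
    # Hypothetical rubric prompt; real judge prompts are metric-specific.
    prompt = (
        f"You are grading an AI response for the '{metric}' criterion.\n"
        f"User input: {input_text}\n"
        f"AI output: {output_text}\n"
        'Reply as JSON: {"score": <0.0-1.0>, "explanation": "<one sentence>"}'
    )
    reply = judge_model(prompt)  # placeholder for the actual LLM call
    result = json.loads(reply)
    # Clamp to the documented 0.0-1.0 range before returning.
    result["score"] = max(0.0, min(1.0, float(result["score"])))
    return result
```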
The default quality threshold is 0.7 (Grade C). Scores below this threshold
trigger insights and recommendations in the evaluation result. You can adjust
the threshold per project.
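In practice the threshold acts as a simple gate: any metric score under it gets flagged so the result can surface insights. A minimal sketch of that check, assuming a per-project threshold value (0.7 by default, as documented above); the helper name and return shape are illustrative, not part of the SDK:

```python
DEFAULT_QUALITY_THRESHOLD = 0.7  # Grade C, per the default above

def flag_low_scores(scores: dict[str, float],
                    threshold: float = DEFAULT_QUALITY_THRESHOLD) -> list[str]:
    # Return the metrics whose scores fall below the quality threshold;
    # these are the ones that trigger insights and recommendations.
    return [name for name, score in scores.items() if score < threshold]

flag_low_scores({"contains": 0.0, "levenshtein": 0.92})  # -> ["contains"]
```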
Evaluation datasets are lists of dictionaries. Required and optional fields
depend on the metrics being used:
```python
dataset = [
    {
        "input": "What is the capital of France?",    # Required: user prompt
        "output": "The capital of France is Paris.",  # Required: AI response
        "context": ["France is a country in Europe. Its capital is Paris."],  # Optional: for RAG metrics
        "expected": "Paris"                           # Optional: for heuristic metrics
    }
]
```
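As a rule of thumb, `expected` is needed by the heuristic metrics listed above and `context` by RAG-oriented judge metrics. The validation sketch below makes that mapping explicit; the metric-to-field table in it is illustrative rather than exhaustive, and the `regex_match` entry assumes the pattern is supplied via `expected`.

```python
# Illustrative mapping; extend it for the metrics your project actually uses.
FIELD_REQUIREMENTS = {
    "contains": "expected",
    "equals": "expected",
    "regex_match": "expected",  # assumed: pattern supplied via 'expected'
    "levenshtein": "expected",
}

def validate_dataset(dataset: list[dict], metrics: list[str]) -> list[str]:
    # Collect human-readable problems instead of raising, so every row is checked.
    problems = []
    for i, row in enumerate(dataset):
        for field in ("input", "output"):
            if field not in row:
                problems.append(f"row {i}: missing required field '{field}'")
        for metric in metrics:
            needed = FIELD_REQUIREMENTS.get(metric)
            if needed and needed not in row:
                problems.append(f"row {i}: metric '{metric}' needs field '{needed}'")
    return problems
```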