Metrics Overview
Valiqor ships with 17 built-in metrics across two categories:

4 Heuristic Metrics
Fast, deterministic checks that don't require an LLM. Ideal for format validation, exact matching, and string comparison.

13 LLM Judge Metrics
Semantic evaluation using an LLM as a judge. Used for quality, relevance, hallucination detection, and task adherence.
Heuristic Metrics
These metrics run locally without any LLM calls. They return a score between 0.0 and 1.0:

| Metric | Description | Scoring |
|---|---|---|
| contains | Checks if the output contains expected substrings | 1.0 if all expected strings found, 0.0 otherwise |
| equals | Exact string match between output and expected | 1.0 if exact match, 0.0 otherwise |
| regex_match | Tests output against a regular expression pattern | 1.0 if pattern matches, 0.0 otherwise |
| levenshtein | Edit distance between output and expected text | Normalized 0.0–1.0 (1.0 = identical) |
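Valiqor's internal implementation isn't shown here, but the normalization used by the levenshtein metric can be sketched with self-contained Python. This is an illustrative sketch only: the function names and the exact normalization formula (edit distance divided by the longer string's length) are assumptions consistent with the table above, not the library's actual code.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_score(output: str, expected: str) -> float:
    """Normalize edit distance to 0.0-1.0, where 1.0 means identical."""
    if not output and not expected:
        return 1.0  # two empty strings are identical
    dist = levenshtein_distance(output, expected)
    return 1.0 - dist / max(len(output), len(expected))
```

Identical strings score 1.0, and each edit reduces the score in proportion to the longer string's length, which keeps the result in the 0.0–1.0 range regardless of input size.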
LLM Judge Metrics
These metrics use an LLM to evaluate output quality semantically. Each metric sends a structured prompt to the judge model and receives a normalized score (0.0–1.0) with an explanation.

Quality & Relevance
| Metric | What It Measures |
|---|---|
| answer_relevance | How relevant the response is to the user's question |
| coherence | Logical flow and consistency of the response |
| fluency | Language quality — grammar, readability, naturalness |
| response_completeness | Whether all parts of the question are addressed |
| task_adherence | How well the output follows specific instructions or constraints |
Factual Accuracy
| Metric | What It Measures |
|---|---|
| hallucination | Whether the output contains fabricated or unsupported claims |
| factual_accuracy | Whether statements are factually correct |
RAG-Specific
| Metric | What It Measures |
|---|---|
| context_precision | How much of the retrieved context is relevant to the query |
| context_recall | Whether the retrieved context covers all information needed |
Specialized
| Metric | What It Measures |
|---|---|
| intent_resolution | Whether the correct user intent was identified and addressed |
| tool_call_accuracy | Whether the right tools were called with correct arguments |
| moderation | Content safety — flags harmful, toxic, or inappropriate content |
| retrieval | Quality of the retrieval step (if applicable) |
How LLM Judging Works
Each LLM judge metric follows this flow:

- The metric receives the dataset item (input, output, context, expected)
- A structured prompt is constructed for the specific metric
- The prompt is sent to the judge LLM
- The LLM returns a structured response with a score and rationale
- The score is normalized to the 0.0–1.0 range
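The flow above can be sketched end to end in Python. Everything here is an assumption for illustration: the prompt wording, the JSON response schema, the 0–10 raw scale, and the `call_judge_llm` stub (which stands in for a real call to the judge model) are invented for this sketch and are not Valiqor's internals.

```python
import json

def build_judge_prompt(metric: str, item: dict) -> str:
    # Step 2: construct a structured prompt for the specific metric.
    return (
        f"You are an evaluator for the '{metric}' metric.\n"
        f"Input: {item['input']}\n"
        f"Output: {item['output']}\n"
        'Respond with JSON: {"score": <0-10>, "rationale": "<string>"}'
    )

def call_judge_llm(prompt: str) -> str:
    # Step 3: stand-in for the real LLM call; a canned response
    # replaces the judge model here so the sketch is runnable.
    return '{"score": 8, "rationale": "Mostly relevant answer."}'

def judge(metric: str, item: dict) -> dict:
    prompt = build_judge_prompt(metric, item)  # step 2
    raw = call_judge_llm(prompt)               # step 3
    parsed = json.loads(raw)                   # step 4: structured response
    parsed["score"] = parsed["score"] / 10.0   # step 5: normalize to 0.0-1.0
    return parsed
```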
The judge model defaults to gpt-4o but can be configured. You can use your own OpenAI API key via the openai_api_key parameter — see BYOK for details.

Score Results
Each metric evaluation produces a ScoreResult:
| Field | Type | Description |
|---|---|---|
| value | float | Score (typically 0.0–1.0) |
| name | str | Metric name |
| reason | str | LLM explanation for the score |
| metadata | dict | Additional data (confidence, model used, etc.) |
| scoring_failed | bool | Whether computation failed |
| execution_time_ms | float | Time taken to compute |
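As a rough mental model, the fields above map onto a simple dataclass. The field names and types come from the table; the dataclass layout and default values are assumptions for illustration, not Valiqor's actual class definition.

```python
from dataclasses import dataclass, field

@dataclass
class ScoreResult:
    value: float                                   # score, typically 0.0-1.0
    name: str                                      # metric name
    reason: str = ""                               # LLM explanation for the score
    metadata: dict = field(default_factory=dict)   # confidence, model used, etc.
    scoring_failed: bool = False                   # whether computation failed
    execution_time_ms: float = 0.0                 # time taken to compute
```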
Aggregation & Quality Grades
When evaluating a dataset with multiple items and metrics, Valiqor computes aggregate scores:

Aggregation
- Per-metric average: Mean of all item scores for each metric
- Overall quality: Mean of all per-metric averages
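The two-level aggregation described above can be sketched in a few lines. This is an illustrative helper (the `aggregate` function name and input shape are assumptions), not the library's actual code: each metric is averaged over all items first, then those per-metric means are averaged into the overall quality score.

```python
from statistics import mean

def aggregate(scores: dict[str, list[float]]) -> tuple[dict[str, float], float]:
    """scores maps each metric name to its per-item scores."""
    per_metric = {m: mean(vals) for m, vals in scores.items()}  # per-metric average
    overall = mean(per_metric.values())                         # overall quality
    return per_metric, overall
```

Note that averaging per-metric means (rather than pooling all scores) weights every metric equally, even when metrics cover different numbers of items.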
Quality Grades
The overall quality score is mapped to a letter grade:

| Grade | Score Range | Interpretation |
|---|---|---|
| A | ≥ 0.9 | Excellent — system performing well |
| B | ≥ 0.8 | Good — minor improvements possible |
| C | ≥ 0.7 | Acceptable — meets minimum threshold |
| D | ≥ 0.6 | Below threshold — review recommended |
| F | < 0.6 | Poor — significant issues detected |
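The grade table translates directly into a threshold cascade. A minimal sketch (the `quality_grade` function name is an assumption; the cutoffs come from the table above):

```python
def quality_grade(score: float) -> str:
    """Map an overall quality score (0.0-1.0) to a letter grade."""
    if score >= 0.9:
        return "A"  # excellent
    if score >= 0.8:
        return "B"  # good
    if score >= 0.7:
        return "C"  # acceptable
    if score >= 0.6:
        return "D"  # below threshold
    return "F"      # poor
```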
Metric Configuration
You can customize which metrics to run and configure them per-project.

Quality Threshold
The default quality threshold is 0.7 (Grade C). Scores below this trigger insights and recommendations in the evaluation result. You can adjust this threshold per-project.

Dataset Format
Evaluation datasets are lists of dictionaries. Required and optional fields depend on the metrics being used:

| Field | Required | Used By |
|---|---|---|
| input | Yes | All metrics |
| output | Yes | All metrics |
| context | For RAG metrics | context_precision, context_recall, hallucination |
| expected | For heuristic metrics | contains, equals, regex_match, levenshtein |
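A dataset following this format might look like the example below. The field names come from the table above; the question/answer content is invented for illustration. `expected` feeds the heuristic metrics, while `context` feeds the RAG metrics, so items can omit either when the corresponding metrics aren't used.

```python
dataset = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "expected": "Paris",                            # for contains/equals/levenshtein
        "context": "France's capital city is Paris.",   # for RAG metrics
    },
    {
        # No "context": fine if no RAG metrics run on this item.
        "input": "Name the largest planet in the solar system.",
        "output": "Jupiter is the largest planet.",
        "expected": "Jupiter",
    },
]
```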
Where Metrics Are Used
Evaluation metrics appear throughout Valiqor:

- Standalone evaluations — evaluate() runs metrics on a dataset
- Trace evaluations — evaluate_trace() extracts data from a trace and evaluates
- Failure Analysis — FA uses metric scores as evidence for failure classification
- Trends — get_trends() tracks metric scores over time
- Comparisons — compare_runs() compares metrics across evaluation runs