The models below are dataclass objects with to_dict() and from_dict(cls, data) methods for serialization. Paginated models support len() and iteration.
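
The round trip through to_dict() and from_dict() can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `RunMetricSketch` is a hypothetical stand-in that copies the fields listed under RunMetric, and dropping unknown keys in from_dict() is an assumption.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass
class RunMetricSketch:
    """Hypothetical stand-in with the RunMetric fields; not the SDK class."""

    key: str
    display_name: str
    score: float
    value_type: str = "numeric"

    def to_dict(self) -> Dict[str, Any]:
        # asdict() converts the dataclass (and nested dataclasses) to plain dicts.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "RunMetricSketch":
        # Assumption: unknown keys are dropped rather than raising an error.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


metric = RunMetricSketch(key="hallucination", display_name="Hallucination", score=0.92)
payload = metric.to_dict()                     # plain dict, ready for json.dumps
restored = RunMetricSketch.from_dict(payload)  # compares equal to `metric`
```

Because dataclasses compare by field values, `restored == metric` holds after the round trip.
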
EvaluationResult
Returned by evaluate(), evaluate_trace(), and get_run().
| Field | Type | Description |
|---|---|---|
| run_id | str | Unique evaluation run identifier. |
| project_id | str | Project this run belongs to. |
| status | str | Run status: "completed", "running", "failed". |
| overall_score | Optional[float] | Weighted overall score (0.0–1.0). |
| aggregate_scores | Dict[str, float] | Per-metric aggregate scores. |
| total_items | int | Total items in the dataset. |
| items_evaluated | int | Items successfully evaluated. |
| metadata | Dict[str, Any] | Custom metadata attached to the run. |
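
Since overall_score is Optional, code that reads a result should handle None explicitly. The sketch below shows one way to summarize a run. `EvaluationResultSketch` and `summarize` are hypothetical names introduced for illustration, and the default values on the stand-in are assumptions rather than documented SDK defaults.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class EvaluationResultSketch:
    """Hypothetical stand-in with the EvaluationResult fields; not the SDK class."""

    run_id: str
    project_id: str
    status: str
    overall_score: Optional[float] = None
    aggregate_scores: Dict[str, float] = field(default_factory=dict)
    total_items: int = 0
    items_evaluated: int = 0
    metadata: Dict[str, Any] = field(default_factory=dict)


def summarize(result: EvaluationResultSketch) -> str:
    """Build a one-line summary, tolerating a missing overall score."""
    if result.overall_score is None:
        score = "n/a"  # e.g. the run is still in progress or has failed
    else:
        score = f"{result.overall_score:.2f}"
    coverage = f"{result.items_evaluated}/{result.total_items}"
    return f"{result.run_id} [{result.status}] score={score} items={coverage}"


done = EvaluationResultSketch(
    run_id="run_1", project_id="proj_1", status="completed",
    overall_score=0.91, aggregate_scores={"hallucination": 0.9},
    total_items=10, items_evaluated=9,
)
pending = EvaluationResultSketch(run_id="run_2", project_id="proj_1", status="running")
```
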
JobHandle
Returned by evaluate_async(). Wraps an async evaluation job with status polling and result retrieval.
| Property/Method | Return Type | Description |
|---|---|---|
| job_id | str | The async job identifier. |
| job_type | str | Always "evaluation". |
| status() | JobStatus | Poll current status. |
| is_running() | bool | Whether the job is still running. |
| is_completed() | bool | Whether the job completed successfully. |
| cancel() | Dict[str, Any] | Cancel the job. |
| result() | EvaluationResult | Block until complete and return result. |
| wait(poll_interval, timeout, on_progress) | JobStatus | Poll with optional progress callback. |
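
The polling pattern behind a call such as wait(poll_interval, timeout, on_progress) can be sketched generically. The code below is not the SDK's implementation: `wait_for`, `JobStatusSketch`, and the `TERMINAL` set are hypothetical, the default interval is an assumption, and the job is simulated with an iterator so the example runs without a server.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Statuses after which polling stops (taken from the JobStatus status values).
TERMINAL = {"completed", "failed", "cancelled"}


@dataclass
class JobStatusSketch:
    """Hypothetical stand-in carrying two of the JobStatus fields."""

    status: str
    progress_percent: float = 0.0


def wait_for(
    poll: Callable[[], JobStatusSketch],
    poll_interval: float = 2.0,
    timeout: Optional[float] = None,
    on_progress: Optional[Callable[[JobStatusSketch], None]] = None,
) -> JobStatusSketch:
    """Call `poll` until a terminal status is seen or `timeout` seconds pass."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        current = poll()
        if on_progress is not None:
            on_progress(current)  # report every observed status
        if current.status in TERMINAL:
            return current
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        time.sleep(poll_interval)


# Simulated job that moves through three states, one per poll.
_states = iter([
    JobStatusSketch("queued"),
    JobStatusSketch("running", 50.0),
    JobStatusSketch("completed", 100.0),
])
seen = []
final = wait_for(lambda: next(_states), poll_interval=0.0,
                 on_progress=lambda s: seen.append(s.status))
```

Using time.monotonic() for the deadline keeps the timeout unaffected by system clock adjustments.
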
JobStatus
Status of an async evaluation job.
| Field | Type | Description |
|---|---|---|
| job_id | str | Job identifier. |
| job_type | str | "evaluation". |
| status | str | "queued", "running", "completed", "failed", "cancelled". |
| progress_percent | float | Progress (0.0–100.0). |
| current_item | int | Current item being processed. |
| total_items | int | Total items. |
| started_at | Optional[str] | ISO timestamp. |
| finished_at | Optional[str] | ISO timestamp. |
| error | Optional[str] | Error message if failed. |
| result | Optional[EvaluationResult] | Result if completed. |

JobStatus also exposes is_running, is_completed, and is_failed (all bool).
RunMetric
Per-metric score for an evaluation run.
| Field | Type | Description |
|---|---|---|
| key | str | Metric key (e.g. "hallucination"). |
| display_name | str | Human-readable name. |
| score | float | Aggregate score. |
| value_type | str | "numeric" (default). |
EvalItemDetail
Per-item evaluation detail with scores and explanations.
| Field | Type | Description |
|---|---|---|
| id | str | Item identifier. |
| run_id | str | Parent run ID. |
| input | str | Input text. |
| output | str | Output text. |
| context | Optional[str] | Context text. |
| expected | Optional[str] | Expected output. |
| overall_score | Optional[float] | Item-level overall score. |
| metric_scores | Dict[str, float] | Per-metric scores for this item. |
| explanations | Dict[str, str] | Per-metric explanations. |
| metadata | Dict[str, Any] | Item metadata. |
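
Because metric_scores and explanations are both keyed by metric, the weakest metric for an item can be paired with its explanation. The helper below is a small sketch that works on plain dicts. `weakest_metric` is a hypothetical name, and treating a missing explanation as an empty string is an assumption.

```python
from typing import Dict, Optional, Tuple


def weakest_metric(
    metric_scores: Dict[str, float],
    explanations: Dict[str, str],
) -> Optional[Tuple[str, float, str]]:
    """Return (metric key, score, explanation) for the lowest-scoring metric."""
    if not metric_scores:
        return None  # nothing was scored for this item
    key = min(metric_scores, key=metric_scores.get)
    # Assumption: an item may lack an explanation for some metric.
    return key, metric_scores[key], explanations.get(key, "")


scores = {"hallucination": 0.4, "relevance": 0.9}
notes = {"hallucination": "Output cites a source not present in the context."}
worst = weakest_metric(scores, notes)
```
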
EvalItemsPage
Paginated list of EvalItemDetail. Supports len() and iteration.
| Field | Type |
|---|---|
| items | List[EvalItemDetail] |
| total | int |
| page | int |
| page_size | int |
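
Support for len() and iteration on a page model usually comes from `__len__` and `__iter__` delegating to the items list. The sketch below shows that pattern under stated assumptions: `EvalItemsPageSketch` and `page_count` are hypothetical, items are plain dicts instead of EvalItemDetail, and the 1-based page numbering and default page size are guesses rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class EvalItemsPageSketch:
    """Hypothetical stand-in with the EvalItemsPage fields; not the SDK class."""

    items: List[Dict[str, Any]] = field(default_factory=list)
    total: int = 0       # items across all pages
    page: int = 1        # assumption: pages are numbered from 1
    page_size: int = 50  # assumption: not a documented default

    def __len__(self) -> int:
        # Number of items on this page, not the overall total.
        return len(self.items)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        return iter(self.items)


def page_count(total: int, page_size: int) -> int:
    """Pages needed to hold `total` items (integer ceiling division)."""
    if page_size <= 0:
        return 0
    return -(-total // page_size)


page = EvalItemsPageSketch(items=[{"id": "a"}, {"id": "b"}], total=5, page=1, page_size=2)
ids = [item["id"] for item in page]  # iteration yields the page's items
```

Note that len(page) reflects the current page only; total carries the count across all pages.
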
EvalTrendPoint
A single data point in an evaluation trend.
| Field | Type |
|---|---|
| date | str |
| metric | str |
| score | float |
| run_count | int |
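
Trend points arrive as a flat list, so plotting usually starts by grouping them into one series per metric. The sketch below works on plain dicts with the EvalTrendPoint fields, such as to_dict() might produce. `series_by_metric` is a hypothetical helper, and it assumes date strings sort chronologically as text (true for ISO-style dates, which the source does not confirm).

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def series_by_metric(
    points: List[Dict[str, Any]],
) -> Dict[str, List[Tuple[str, float]]]:
    """Group trend points into per-metric (date, score) series, ordered by date."""
    series: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    # Assumption: dates are ISO-style strings, so text order is date order.
    for point in sorted(points, key=lambda p: p["date"]):
        series[point["metric"]].append((point["date"], point["score"]))
    return dict(series)


points = [
    {"date": "2024-01-02", "metric": "hallucination", "score": 0.8, "run_count": 3},
    {"date": "2024-01-01", "metric": "hallucination", "score": 0.7, "run_count": 2},
    {"date": "2024-01-01", "metric": "relevance", "score": 0.9, "run_count": 2},
]
grouped = series_by_metric(points)
```
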
EvalRunComparison
Result of comparing multiple evaluation runs.
| Field | Type |
|---|---|
| runs | List[Dict[str, Any]] |
| metrics | List[Dict[str, Any]] |
| overall_scores | List[float] |
MetricInfo
Metric template or project metric configuration.
| Field | Type | Default |
|---|---|---|
| key | str | — |
| display_name | str | — |
| definition | Optional[str] | None |
| value_type | str | "numeric" |
| category | Optional[str] | None |
ProjectInfo
| Field | Type | Default |
|---|---|---|
| id | str | — |
| name | str | — |
| key | Optional[str] | None |
| model_name | Optional[str] | None |
| created_at | Optional[str] | None |
AuthInfo
| Field | Type | Default |
|---|---|---|
| valid | bool | — |
| user_id | Optional[str] | None |
| org_id | Optional[str] | None |
| org_name | Optional[str] | None |
| plan | Optional[str] | None |
CancelResponse
| Field | Type | Default |
|---|---|---|
| status | str | — |
| message | str | "" |
| job_id | Optional[str] | None |
Related
- EvalClient — Methods returning these models
- Failure Analysis Models — FA data models
- Security Models — Security data models