padosoft/eval-harness
=====================

Laravel evaluation framework for RAG / LLM applications: golden datasets, exact-match + cosine-embedding + LLM-as-judge metrics, JSON + Markdown reports, Artisan-driven CI gate.

v1.4.0 (1mo ago) · 67.1k · Apache-2.0 · PHP · PHP ^8.3 · CI passing

> Laravel-native evaluation framework for RAG / LLM applications. Golden datasets in YAML, fifteen built-in metrics (including retrieval-ranking and ordinal scoring), judge calibration against human labels, production online monitoring with drift alerts, standalone output assertions, Markdown + JSON reports, and an Artisan CI gate. Stop shipping silent regressions in your AI pipeline.

[![Latest Version on Packagist](https://camo.githubusercontent.com/7cca0215e4600f120d40940d7b9937d983e4d4ba22b5675f6879b3d8af8dd7f7/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f7061646f736f66742f6576616c2d6861726e6573732e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/padosoft/eval-harness)[![Tests](https://github.com/padosoft/eval-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/padosoft/eval-harness/actions/workflows/ci.yml)[![Total Downloads](https://camo.githubusercontent.com/fdee439f67ff4c6e2da2e4cd4dbd310c4fb8bc443bc834fd7cbe97a91dcc62cc/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f64742f7061646f736f66742f6576616c2d6861726e6573732e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/padosoft/eval-harness)[![License](https://camo.githubusercontent.com/383ef3102cfe7599d61d03ef91fa7920a2e7935a0e2e70ea7dd5709049b014f3/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f6c2f7061646f736f66742f6576616c2d6861726e6573732e7376673f7374796c653d666c61742d737175617265)](LICENSE)[![PHP Version Require](https://camo.githubusercontent.com/79a9feaffcebeb649725dae52a05a085dab49a2b566796b720e67d27760ac273/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f646570656e64656e63792d762f7061646f736f66742f6576616c2d6861726e6573732f7068702e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/padosoft/eval-harness)

---

[![eval-harness report banner](https://raw.githubusercontent.com/padosoft/eval-harness/main/resources/banner.png)](https://raw.githubusercontent.com/padosoft/eval-harness/main/resources/banner.png)

Official Documentation
----------------------

📚 **Full documentation is available at [doc.eval-harness.padosoft.com](https://doc.eval-harness.padosoft.com/).**

The documentation site covers everything in depth: a five-minute quickstart, the fifteen built-in metrics with their underlying theory and formulas, guides for CI gating, judge calibration, online monitoring and adversarial testing, the batch/Horizon operations model, the architecture and decision records, and the full CLI, configuration, and report-API reference.

---

Table of Contents
-----------------

1. [Official Documentation](https://doc.eval-harness.padosoft.com/)
2. [Why eval-harness?](#why-eval-harness)
3. [Design rationale](#design-rationale)
4. [Features](#features)
5. [Comparison with alternatives](#comparison-with-alternatives)
6. [Installation](#installation)
7. [Quick start](#quick-start)
8. [Usage examples](#usage-examples)
9. [Web admin panel UI](#web-admin-panel-ui)
10. [Contract stability and migration](#contract-stability-and-migration)
11. [Configuration](#configuration)
12. [Architecture](#architecture)
13. [AI vibe-coding pack included](#ai-vibe-coding-pack-included)
14. [Testing](#testing)
15. [Roadmap](#roadmap)
16. [Contributing](#contributing)
17. [Security](#security)
18. [License](#license)

---

Why eval-harness?
-----------------

Imagine deploying a RAG-powered chatbot to production. Quality is great on launch day. Three months later, somebody:

- bumps the embedding model from `text-embedding-3-small` to `text-embedding-3-large`,
- swaps the chat model from `gpt-4o` to `gpt-4o-mini` for cost,
- tweaks the prompt template,
- updates `laravel/ai` from `^0.5` to `^0.6`,
- changes the chunker from 800-token sliding window to 1200-token semantic.

Each of those changes risks a quality regression, and you have **no programmatic signal** that quality survived the change. Your test suite green-lights the deployment because PHPUnit doesn't know what a "correct answer" looks like.

`padosoft/eval-harness` fixes that loop:

1. You curate a small golden dataset — a YAML file with 30-200 `(question, expected answer)` pairs that represent the queries you actually care about.
2. You declare metrics — exact-match for deterministic outputs, cosine-embedding for paraphrase tolerance, LLM-as-judge for subjective grading.
3. You wire up a callable that drives your real pipeline against the dataset.
4. CI runs `php artisan eval-harness:run rag.factuality` on every PR and gates the merge on the macro-F1 score.

Now your AI pipeline has the same regression protection your business logic has had for the last fifteen years.

---

Design rationale
----------------

The package is opinionated. Three decisions matter most:

**1. No SDK lock-in.** Every external call goes through Laravel's `Http::` facade — never an OpenAI / Anthropic / Vertex SDK. Tests substitute via `Http::fake()` for deterministic offline runs, and swapping providers is a config-file change, not a refactor.

**2. The dataset is YAML, not a Laravel model.** YAML is reviewable in pull requests, diffable across releases, and survives database wipes. The package never stores datasets in your DB — they live in `eval/golden/*.yml` next to your code.

**3. Failures are captured by default.** A timeout on sample 47 should not mask the macro-F1 score across 200 valid samples. Every metric exception is recorded against its `(sample, metric)` pair and surfaced in the final report, so the operator can investigate without re-running the whole 30-minute suite. Strict CI lanes can set `EVAL_HARNESS_RAISE_EXCEPTIONS=true` to abort on the first `MetricException` (a provider or metric contract error).
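
The capture-by-default behaviour can be pictured with a minimal sketch in plain PHP. It is not the package's runner: `scoreAll()`, the `$raiseExceptions` parameter, and the use of `RuntimeException` as a stand-in for `MetricException` are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: score every (sample, metric) pair and record failures
 * instead of aborting, unless $raiseExceptions is true.
 *
 * @param array<string, string>   $outputs  sample id => actual output
 * @param array<string, string>   $expected sample id => expected output
 * @param array<string, callable> $metrics  metric name => fn(expected, actual): float
 */
function scoreAll(array $outputs, array $expected, array $metrics, bool $raiseExceptions = false): array
{
    $scores = [];
    $failures = [];

    foreach ($outputs as $sampleId => $actual) {
        foreach ($metrics as $name => $metric) {
            try {
                $scores[$sampleId][$name] = $metric($expected[$sampleId], $actual);
            } catch (RuntimeException $e) { // stand-in for the package's MetricException
                if ($raiseExceptions) {
                    throw $e; // strict lane: abort on the first failure
                }
                // Default lane: remember which (sample, metric) pair failed and keep going.
                $failures[] = [
                    'sample' => (string) $sampleId,
                    'metric' => $name,
                    'error'  => $e->getMessage(),
                ];
            }
        }
    }

    return ['scores' => $scores, 'failures' => $failures];
}

// Usage: one metric always works, the other fails on an empty output.
$metrics = [
    'exact-match' => fn (string $expected, string $actual): float => $expected === $actual ? 1.0 : 0.0,
    'flaky' => function (string $expected, string $actual): float {
        if ($actual === '') {
            throw new RuntimeException('provider timeout');
        }
        return 1.0;
    },
];

$result = scoreAll(
    ['a' => 'Paris', 'b' => ''],
    ['a' => 'Paris', 'b' => '30 days from delivery.'],
    $metrics
);
```

In this sketch the failure on sample `b` lands in `$result['failures']` while every other score is still computed, which is the trade-off the paragraph above describes.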

These decisions cost some flexibility (you can't dispatch metrics across multiple processes yet — see Roadmap) but they keep the public surface small and the offline path fast.

---

Features
--------

- **Fifteen metrics out of the box** — `exact-match`, `contains`, `regex`, `rouge-l`, `citation-groundedness`, `cosine-embedding`, `bertscore-like`, `llm-as-judge`, `refusal-quality`, `ordinal-distance`, and the retrieval-ranking family (`retrieval-hit-at-k`, `retrieval-recall-at-k`, `retrieval-mrr`, `retrieval-ndcg-at-k`, `answer-containment-at-k`) — and a clean `Metric` interface for adding more.
- **Retrieval-ranking metrics** — domain-agnostic hit@k, recall@k, MRR, nDCG@k (binary or graded gains), and top-k answer containment over a ranked id/text list your retriever emits. A single tested source of truth for RAG ranking math (`metrics.retrieval.default_k` config + per-sample `metadata.k` override).
- **Ordinal / distance scoring** — `ordinal-distance` gives partial credit for ordered labels (e.g. low < medium < high < urgent): exact = 1.0, off-by-one = 0.5, further = 0.0.
- **Judge calibration** — `php artisan eval-harness:calibrate-judge` validates the LLM judge against human-labelled cases, reporting verdict agreement rate, a confusion matrix, a length-bias signal, and a self-preference guard (fails when the judge model equals the model under test). Gate CI on judge trustworthiness, not vibes.
- **Online / production monitoring** — `OnlineMonitor::capture()` samples a configurable fraction of live AI traffic, judges it on a queue, stores scores historically, charts pass-rate-over-time, and fires an `OnlinePassRateDropped` event when recent quality dips below threshold. Off by default; Horizon-ready; exposed read-only at `GET /{prefix}/online/{dataset}/trend` for a companion dashboard.
- **Strict-schema YAML loader** — versioned dataset contracts and actionable validation errors for malformed samples.
- **Deterministic LLM-as-judge** — temperature 0, seed 42, `response_format=json_object`. Strict-JSON parser rejects malformed responses instead of silently scoring 0.
- **Stable JSON report shape** — every payload carries explicit `schema_version` and `dataset_schema_version` fields. Wire into your CI dashboard once, then evolve additively.
- **Cohort-ready report data** — JSON and Markdown reports aggregate scores by `metadata.tags`, expose an explicit untagged bucket, and include per-metric score histograms for dashboards.
- **Citation evidence checks** — `citation-groundedness` can score simple citation markers or stricter `metadata.citation_evidence` spans that require both citation markers and quote text.
- **Opt-in adversarial lane** — `AdversarialDatasetFactory` and `php artisan eval-harness:adversarial` build and run safety regression seeds for prompt injection, jailbreaks, data leaks, SSRF, tool abuse, and similar red-team categories. JSON/Markdown reports add category and compliance-framework summaries. Optional manifests retain adversarial run summaries and, under tight retention, preserve the latest failure-free baseline for each compatible combination of report schema, dataset, metric names, and adversarial category/sample-count slice. `--regression-gate` fails CI when macro-F1 or configured metric aggregates drop, and `--promote-failures` writes failed samples back to a reloadable YAML dataset seed. Scheduler/CI guidance shows how to run the lane continuously without bundling a daemon in this package.
- **Standalone output assertions** — score saved JSON/YAML outputs with the same metrics and report contract, without invoking your agent in CI.
- **Usage summaries** — JSON and Markdown reports aggregate structured `usage` details for provider token counts, cost USD, and latency.
- **Runtime guardrails** — provider timeouts are normalized, optional retries cover Laravel HTTP connection failures plus HTTP 429/5xx, and strict mode can rethrow `MetricException` failures instead of capturing them.
- **Batch execution modes** — SUT runs flow through deterministic `SerialBatch` by default, or queue-backed `LazyParallelBatch` via `--batch=lazy-parallel` for Laravel queue/Horizon workers.
- **Operational batch profiles** — `--batch-profile=ci|smoke|nightly` applies named presets of batch defaults (concurrency, queue, timeouts, chunk size, rate limit, checkpoint cadence, result TTL). Explicit CLI flags always win; host apps can override or add profiles under `eval-harness.batches.profiles.*` in `config/eval-harness.php`.
- **Producer-side backpressure** — `--chunk-size=N`, `--rate-limit=N --rate-window-seconds=W`, and `--result-ttl-seconds` bound dispatch in flight, throttle to N samples per W-second sliding window (monotonic-clock math, amortized O(1)), and size result metadata TTL for delayed collection. Pass `none` on any nullable numeric flag to clear an inherited profile value for a one-off run.
- **Progress checkpoints and terminal status** — `--checkpoint-every=N` emits structured progress events through an optional `BatchProgressReporter` container binding (default `NullBatchProgressReporter`). Dashboards that need to distinguish a finished failed batch from a stalled one can implement the optional `BatchTerminalProgressReporter` sub-contract for explicit `success` / `failure` / `empty` terminal status with partial-wins tolerance on the failure path.
- **Live batch registry endpoints** — `GET /{prefix}/batches/live` and `GET /{prefix}/batches/{id}/progress` expose active lazy-parallel batch ids and compact progress counters through cache-backed read-only API contracts. The live registry is enabled by default and can be disabled with `eval-harness.batches.live_registry.enabled`.
- **Adversarial manifest discovery endpoints** — `GET /{prefix}/adversarial/manifests` and `GET /{prefix}/adversarial/manifests/{name}` enumerate adversarial run manifests written to a configured disk so a companion UI can browse compliance history without scraping the filesystem. Opt-in via `eval-harness.adversarial.manifests.{disk,path_prefix}`; the CLI `--adversarial-manifest=` flag is preserved for existing operators.
- **Report diff endpoint** — `GET /{prefix}/reports/{id}/diff/{otherId}` computes signed deltas (`macro_f1`, per-metric mean/`pass_rate`, per-cohort status with `added` / `removed` / `regressed` / `improved` / `stable`, `total_samples` / `total_failures`, adversarial categories when present) so a companion UI can show regression diffs side-by-side without fetching full reports. The prefix defaults to `eval-harness/api` and is configurable through `eval-harness.api.prefix`.
- **Dataset trend endpoint** — `GET /{prefix}/datasets/{name}/trend?limit=N` scans stored JSON reports for one dataset, skips malformed artifacts, caps `limit` at 100, caps scanned JSON files through `eval-harness.api.trend.max_files_scanned`, and returns chronological points with metrics, cohorts, usage, and the `eval-harness.report-api.v1.trend` schema discriminator.
- **Provider-agnostic** — works with OpenAI, OpenRouter, Regolo, Mistral, any OpenAI-compatible chat-completions endpoint.
- **No DB migrations required** — datasets are YAML, results are JSON. The package adds zero rows to your schema.
- **Artisan-driven CI gate** — `php artisan eval-harness:run` exits non-zero on any captured failure. Wire it into the same workflow your unit tests run in.
- **Architecture tests included** — every release proves it doesn't leak symbols from sibling packages, ever.
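
To make the scoring rules in the list above concrete, here is a minimal sketch of the ordinal-distance rule and the retrieval-ranking formulas in plain PHP. The function names are hypothetical and this is not the package's implementation; in particular, the nDCG variant shown uses linear gains with a log2 discount, which is one common formulation and an assumption here.

```php
<?php
declare(strict_types=1);

/** Ordinal partial credit: exact = 1.0, off-by-one = 0.5, further = 0.0. */
function ordinalDistanceScore(array $orderedLabels, string $expected, string $actual): float
{
    // $orderedLabels is assumed to be a plain list, lowest label first.
    $e = array_search($expected, $orderedLabels, true);
    $a = array_search($actual, $orderedLabels, true);
    if ($e === false || $a === false) {
        throw new InvalidArgumentException('Unknown label.');
    }

    return match (abs($e - $a)) {
        0 => 1.0,
        1 => 0.5,
        default => 0.0,
    };
}

/** hit@k: 1.0 if any relevant id appears in the top k, else 0.0. */
function hitAtK(array $ranked, array $relevant, int $k): float
{
    return array_intersect(array_slice($ranked, 0, $k), $relevant) === [] ? 0.0 : 1.0;
}

/** recall@k: fraction of the relevant ids that appear in the top k. */
function recallAtK(array $ranked, array $relevant, int $k): float
{
    $relevant = array_unique($relevant);
    if ($relevant === []) {
        return 0.0;
    }

    return count(array_intersect($relevant, array_slice($ranked, 0, $k))) / count($relevant);
}

/** Reciprocal rank of the first relevant id (MRR is the mean of this over samples). */
function reciprocalRank(array $ranked, array $relevant): float
{
    foreach (array_values($ranked) as $i => $id) {
        if (in_array($id, $relevant, true)) {
            return 1.0 / ($i + 1);
        }
    }

    return 0.0;
}

/** nDCG@k with graded gains given as id => gain (use gain 1 for binary relevance). */
function ndcgAtK(array $ranked, array $gains, int $k): float
{
    $dcg = 0.0;
    foreach (array_slice(array_values($ranked), 0, $k) as $i => $id) {
        $dcg += ($gains[$id] ?? 0.0) / log($i + 2, 2);
    }

    // Ideal ordering: the largest gains placed first.
    $ideal = array_values($gains);
    rsort($ideal);
    $idcg = 0.0;
    foreach (array_slice($ideal, 0, $k) as $i => $gain) {
        $idcg += $gain / log($i + 2, 2);
    }

    return $idcg > 0.0 ? $dcg / $idcg : 0.0;
}

// Usage: a retriever returned four ids; d1 and d2 are the relevant ones.
$ranked   = ['d3', 'd1', 'd7', 'd2'];
$relevant = ['d1', 'd2'];
$labels   = ['low', 'medium', 'high', 'urgent'];
```

With these inputs the first relevant document sits at rank 2, so hit@1 is 0.0, hit@2 is 1.0, recall@2 is 0.5, and the reciprocal rank is 0.5; predicting `urgent` when `high` was expected earns 0.5 under the ordinal rule.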

---

Comparison with alternatives
----------------------------

Status legend: `✅ YES` means first-class support, `⚠️ partial` means supported with limits or outside the Laravel-native path, and `❌ NO` means not a primary fit.

| Concern | **eval-harness** | OpenAI Evals | LangSmith | Ragas | Promptfoo | DeepEval |
| --- | --- | --- | --- | --- | --- | --- |
| Laravel-native package | **✅ YES - PHP/Laravel package** | ❌ NO - Python CLI/library | ❌ NO - hosted Python/TS workflow | ❌ NO - Python library | ❌ NO - Node/YAML CLI | ❌ NO - Python library |
| Runs inside your app container | **✅ YES - resolves Laravel services directly** | ⚠️ partial - custom completion functions | ⚠️ partial - SDK/API integration | ⚠️ partial - integrate from Python | ⚠️ partial - external CLI/provider call | ⚠️ partial - local Python runner |
| Local-first storage | **✅ YES - YAML datasets + JSON/Markdown reports** | ⚠️ partial - local logs or Snowflake | ❌ NO - LangSmith cloud workspace | ✅ YES - local datasets/results | ✅ YES - local YAML/results | ⚠️ partial - local evals, optional Confident AI cloud |
| Read-only report API | **✅ YES - opt-in Laravel routes for report listing/show, cohorts, histograms, artifact download, and row CSV export** | ⚠️ partial - custom artifact/API layer | ✅ YES - hosted experiment API | ⚠️ partial - custom app/API layer | ⚠️ partial - local result files/viewer workflows | ⚠️ partial - local results or hosted platform API |
| Built-in metrics | **✅ YES - offline exact/contains/regex/ROUGE-L/citation plus fakeable cosine/BERTScore-like/judge/refusal** | ⚠️ partial - custom eval code | ✅ YES - evaluators in platform/SDK | ✅ YES - RAG-focused metrics | ✅ YES - assertions and graders | ✅ YES - built-in metrics |
| Embedding semantic overlap | **✅ YES - cosine-embedding + bertscore-like via fakeable EmbeddingClient** | ⚠️ partial - custom embedding eval code | ⚠️ partial - SDK evaluator path | ✅ YES - RAG embedding metrics | ⚠️ partial - provider-backed similarity assertions | ✅ YES - semantic metrics |
| Deterministic no-network tests | **✅ YES - Http::fake, fake LLM/embedding clients** | ⚠️ partial - depends on eval | ⚠️ partial - cloud/API path common | ⚠️ partial - many metrics need LLMs | ⚠️ partial - assertions can be local, red team needs models | ⚠️ partial - metric dependent |
| LLM-as-judge | **✅ YES - schema-checked, fakeable judge client** | ✅ YES - model-graded evals | ✅ YES - evaluators | ✅ YES - LLM metrics | ✅ YES - rubric/grader assertions | ✅ YES - LLM metrics |
| Refusal quality / safety judge | **✅ YES - refusal-quality with required metadata + strict JSON schema** | ⚠️ partial - custom model-graded eval | ⚠️ partial - custom evaluator workflow | ⚠️ partial - custom LLM metric | ✅ YES - safety/red-team assertions | ✅ YES - safety metrics |
| Adversarial red-team seeds | **✅ YES - opt-in Laravel seed factory for 10 categories** | ⚠️ partial - custom eval registry | ⚠️ partial - custom datasets/evaluators | ⚠️ partial - RAG-focused tests | ✅ YES - red-team plugins | ✅ YES - safety test cases |
| Adversarial CLI lane | **✅ YES - `eval-harness:adversarial` with `eval:adversarial` alias, saved outputs, and batch options** | ⚠️ partial - custom eval runner scripts | ⚠️ partial - custom evaluator automation | ⚠️ partial - Python code orchestration | ✅ YES - red-team CLI workflow | ✅ YES - safety test runner |
| Adversarial compliance mapping | **✅ YES - JSON/Markdown category + OWASP/NIST/EU AI Act summaries** | ⚠️ partial - custom eval metadata | ⚠️ partial - custom evaluator metadata | ⚠️ partial - custom report code | ✅ YES - red-team category reporting | ⚠️ partial - safety metadata/reporting |
| Adversarial run history manifests | **✅ YES - local JSON manifest retains adversarial summaries and clean baselines** | ⚠️ partial - custom eval logs | ✅ YES - hosted experiment history | ⚠️ partial - custom persistence | ✅ YES - monitoring/history workflows | ⚠️ partial - platform/history workflow |
| Adversarial regression gate | **✅ YES - `--regression-gate` fails on macro-F1 or metric drops from local manifests** | ⚠️ partial - custom eval thresholds | ✅ YES - hosted experiment comparisons | ⚠️ partial - custom CI checks | ✅ YES - threshold/regression workflows | ✅ YES - test assertions/regression workflows |
| Scheduled/continuous monitoring | **✅ YES - Laravel Scheduler/CI cron guidance with manifests, Horizon queues, gates, and failure promotion** | ⚠️ partial - custom scheduler around eval runs | ✅ YES - hosted monitoring workflows | ⚠️ partial - custom scheduler around Python metrics | ✅ YES - CLI/CI monitoring workflows | ⚠️ partial - local runner or hosted platform workflow |
| Failure promotion to datasets | **✅ YES - `--promote-failures` exports failed adversarial samples to YAML seeds** | ⚠️ partial - custom eval scripts | ✅ YES - trace-to-dataset workflows | ⚠️ partial - custom dataset curation | ✅ YES - failure-driven test cases | ✅ YES - failed test cases can become datasets |
| Citation evidence spans | **✅ YES - `citation_evidence` requires marker + quote match** | ⚠️ partial - custom eval code | ⚠️ partial - custom evaluator workflow | ✅ YES - RAG faithfulness/context metrics | ⚠️ partial - custom assertions | ✅ YES - RAG faithfulness metrics |
| Cost/token/latency summaries | **✅ YES - built-in provider usage + JSON/Markdown summaries** | ⚠️ partial - custom logging | ✅ YES - experiment usage analytics | ✅ YES - usage/cost hooks | ⚠️ partial - provider output dependent | ⚠️ partial - metric/provider dependent |
| Runtime retry / strict exception controls | **✅ YES - normalized timeouts, connection/429/5xx retries, optional `raise_exceptions`** | ⚠️ partial - custom eval code | ⚠️ partial - SDK/platform behavior | ✅ YES - runtime metric settings | ⚠️ partial - provider/config dependent | ⚠️ partial - custom evaluator handling |
| Provider choice | **✅ YES - any OpenAI-compatible endpoint via Laravel HTTP** | ⚠️ partial - OpenAI API defaults, custom completion functions possible | ✅ YES - multi-provider ecosystem | ✅ YES - via integrations | ✅ YES - multi-provider | ✅ YES - multi-provider |
| CI gate | **✅ YES - Artisan command with non-zero failure exit** | ⚠️ partial - script around CLI/API | ⚠️ partial - API/automation hook | ⚠️ partial - custom script | ✅ YES - CLI gate | ✅ YES - test runner/CI flow |
| Queue/Horizon batch execution | **✅ YES - SerialBatch + LazyParallelBatch for Laravel queues/Horizon** | ❌ NO - not Laravel queues | ❌ NO - hosted tracing/evals | ❌ NO - not Laravel queues | ❌ NO - external CLI concurrency | ❌ NO - not Laravel queues |
| Named operational profiles | **✅ YES - `--batch-profile=ci\|smoke\|nightly` with host-app overrides under `eval-harness.batches.profiles.*`** | ⚠️ partial - custom run scripts | ⚠️ partial - hosted experiment configs | ❌ NO - custom Python wrappers | ⚠️ partial - YAML config presets | ❌ NO - custom Python wrappers |
| Producer-side backpressure | **✅ YES - `--chunk-size`, `--rate-limit`, `--rate-window-seconds` with monotonic sliding-window math** | ❌ NO - per-eval concurrency only | ⚠️ partial - hosted rate controls | ❌ NO - in-process only | ⚠️ partial - external concurrency flag | ❌ NO - in-process only |
| Progress checkpoints + terminal status | **✅ YES - `--checkpoint-every` plus optional `BatchProgressReporter` / `BatchTerminalProgressReporter` bindings** | ❌ NO - log-tail only | ✅ YES - hosted run progress | ⚠️ partial - custom callbacks | ⚠️ partial - CLI progress lines | ⚠️ partial - custom hooks |
| Eval sets / multi-dataset runs | **✅ YES - EvalSetDefinition + resumable manifests** | ✅ YES - `oaievalset` | ✅ YES - dataset experiments | ⚠️ partial - run multiple datasets in code | ✅ YES - suites/configs | ✅ YES - metric collections/test suites |
| Resume interrupted multi-dataset progress | **✅ YES - explicit per-dataset resume manifest** | ❌ NO - no mid-eval resume | ⚠️ partial - platform run history | ⚠️ partial - custom code | ⚠️ partial - rerun/filter workflows | ⚠️ partial - platform/regression workflows |
| Cohorts / tags / facets | **✅ YES - tag cohorts in JSON/Markdown** | ⚠️ partial - custom eval/reporting | ✅ YES - dataset filtering/metadata | ⚠️ partial - custom analysis | ✅ YES - metadata/config-driven views | ⚠️ partial - test metadata |
| Saved-output assertions | **✅ YES - `--outputs` and `Eval::scoreOutputs()`** | ⚠️ partial - custom eval code | ⚠️ partial - compare uploaded runs | ⚠️ partial - build dataset/results manually | ✅ YES - assertion-first workflow | ✅ YES - test-case assertions |
| Auditable in PR diff | **✅ YES - YAML datasets + stable JSON/Markdown artifacts** | ⚠️ partial - local YAML/code possible | ❌ NO - cloud-first | ✅ YES - code/data files | ✅ YES - YAML config | ✅ YES - Python test files |
| No vendor lock-in | **✅ YES - headless, local-first, provider-agnostic** | ⚠️ partial - OpenAI-oriented defaults | ❌ NO - LangSmith workspace | ✅ YES - OSS library | ✅ YES - OSS CLI | ⚠️ partial - OSS plus Confident AI option |
| Cost to evaluate 200 offline samples | **✅ YES - free for offline metrics and faked providers** | ⚠️ partial - depends on model calls | ❌ NO - cloud/API usage | ⚠️ partial - free only for non-LLM metrics | ⚠️ partial - free only for local assertions | ⚠️ partial - free only for local/non-LLM metrics |

The Python-stack tools are excellent if your stack is Python. If your RAG pipeline lives in a Laravel monolith, `eval-harness` is the shortest path from "we have AI in prod" to "we have a regression test for our AI in prod".

---

Installation
------------

```
composer require padosoft/eval-harness
```

The package is auto-discovered. No `config/app.php` edits required.

Optional config publishing:

```
php artisan vendor:publish --tag=eval-harness-config
```

This drops `config/eval-harness.php` into your app where you can override the embeddings + judge endpoints / models / API keys.

### Compatibility matrix

| eval-harness | PHP | Laravel | laravel/ai SDK | symfony/yaml |
| --- | --- | --- | --- | --- |
| 0.x (current) | 8.3 / 8.4 / 8.5 | 12.x / 13.x | ^0.6 | ^7 / ^8 |

---

Quick start
-----------

### 1. Curate a golden dataset

`eval/golden/factuality.yml`:

```
schema_version: eval-harness.dataset.v1
name: rag.factuality.fy2026
samples:
  - id: capital-france
    input:
      question: "What is the capital of France?"
    expected_output: "Paris"
    metadata:
      tags: [geography, easy]

  - id: refund-policy
    input:
      question: "How many days do I have to return an order?"
    expected_output: "30 days from delivery."
    metadata:
      tags: [policy, support]
```

`schema_version` is optional for existing datasets. If omitted, the loader defaults to `eval-harness.dataset.v1`.
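
As a rough illustration of that default, the sketch below normalizes an already-parsed dataset array in plain PHP. It is not the package's loader: `normalizeDataset()` is a hypothetical helper, and the list of required sample fields (`id`, `input`, `expected_output`) is an assumption drawn from the example above, not from the real schema.

```php
<?php
declare(strict_types=1);

const DEFAULT_DATASET_SCHEMA = 'eval-harness.dataset.v1';

/**
 * Hypothetical helper: apply the default schema_version and reject
 * obviously malformed samples with an actionable message.
 */
function normalizeDataset(array $raw): array
{
    // A missing schema_version falls back to the v1 contract.
    $raw['schema_version'] ??= DEFAULT_DATASET_SCHEMA;
    if ($raw['schema_version'] !== DEFAULT_DATASET_SCHEMA) {
        throw new InvalidArgumentException(
            'Unsupported schema_version: ' . json_encode($raw['schema_version'])
        );
    }

    $seenIds = [];
    foreach ($raw['samples'] ?? [] as $i => $sample) {
        if (!is_array($sample)) {
            throw new InvalidArgumentException("samples[$i] must be a mapping.");
        }
        // Assumed required fields, taken from the example dataset above.
        foreach (['id', 'input', 'expected_output'] as $field) {
            if (!array_key_exists($field, $sample)) {
                throw new InvalidArgumentException("samples[$i] is missing '$field'.");
            }
        }
        $id = (string) $sample['id'];
        if (isset($seenIds[$id])) {
            throw new InvalidArgumentException("samples[$i] repeats id '$id'.");
        }
        $seenIds[$id] = true;
    }

    return $raw;
}

// Usage: a dataset parsed from YAML without an explicit schema_version.
$dataset = normalizeDataset([
    'name' => 'rag.factuality.fy2026',
    'samples' => [
        [
            'id' => 'capital-france',
            'input' => ['question' => 'What is the capital of France?'],
            'expected_output' => 'Paris',
        ],
    ],
]);
```

After the call, `$dataset['schema_version']` holds `eval-harness.dataset.v1` even though the input omitted it.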

### 2. Wire up a registrar in your app

`app/Console/EvalRegistrar.php`:

```