Eval Datasheet
A datasheet for the golden evaluation set that measures specialist-agent quality. It describes what the dataset contains, how it is used, and the regression gate that protects against quality drift.
All evaluation data is synthetic. No real company names, financial data, or PII appears in the set, per the project's sensitive-data policy. Subjects use placeholders such as "Subject A" and "Acme".
What it contains
The golden set lives under tests/evals/ground_truth/:
- contracts/: synthetic contract documents, one markdown file per scenario. They cover golden-path cases (e.g. change-of-control, revenue schedules, SLAs, IP assignment, DPAs), cross-domain cases, sparse/minimal documents, and adversarial false-positive traps.
- expected/{agent}/{contract}.json: the expected result for each contract, per agent. Each file declares:
  - expected_findings: the findings an agent should produce, each with a category, a severity range (min_severity/max_severity), must_contain_keywords (with optional keyword_synonyms), a citation_must_reference file, and a required flag. An optional alternative_categories list accepts additional categories for a single finding whose risk legitimately spans domains (e.g. an SLA-triggered termination right an agent may file under sla_risk rather than termination); the keyword + citation checks still gate the match, so this widens category acceptance for one finding without weakening global matching.
  - expected_gaps: gap types the agent should report (e.g. Missing_Doc).
  - must_not_find: adversarial guards, i.e. categories the agent must not report on this document, each with a reason. These catch false positives and category confusion (for example, an employee-handbook document must not yield a contract-termination finding).
  - Adversarial files may also set ambiguity_zone, min_expected_findings, and max_expected_findings.
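To make the file shape concrete, here is a minimal sketch of one expected file and a matcher over it. The field names come from the list above. Everything else is a hypothetical illustration rather than the suite's actual implementation: the sample values, the numeric severity scale, the dictionary shape of keyword_synonyms, the shape of a produced finding (category, text, citation), the finding_matches helper, and the rule that every keyword (or one of its synonyms) must appear.

```python
import json

# Hypothetical expected file, e.g. expected/{agent}/{contract}.json.
# Field names follow the datasheet; all values are invented for illustration.
EXPECTED_JSON = """
{
  "expected_findings": [
    {
      "category": "termination",
      "alternative_categories": ["sla_risk"],
      "min_severity": 2,
      "max_severity": 3,
      "must_contain_keywords": ["terminate"],
      "keyword_synonyms": {"terminate": ["termination", "cancel"]},
      "citation_must_reference": "subject_a_sla.md",
      "required": true
    }
  ],
  "expected_gaps": ["Missing_Doc"],
  "must_not_find": [
    {"category": "ip_assignment", "reason": "document has no IP clause"}
  ]
}
"""


def finding_matches(expected: dict, produced: dict) -> bool:
    """Assumed semantics: category, keyword, and citation must all match."""
    # alternative_categories widens category acceptance for this one finding only.
    accepted = {expected["category"], *expected.get("alternative_categories", [])}
    if produced["category"] not in accepted:
        return False

    # Assumption: every keyword, or one of its synonyms, must appear in the text.
    text = produced["text"].lower()
    synonyms = expected.get("keyword_synonyms", {})
    for keyword in expected["must_contain_keywords"]:
        candidates = [keyword, *synonyms.get(keyword, [])]
        if not any(c.lower() in text for c in candidates):
            return False

    # The citation must reference the expected source file.
    return expected["citation_must_reference"] in produced["citation"]


expected = json.loads(EXPECTED_JSON)["expected_findings"][0]
produced = {
    "category": "sla_risk",  # filed under the alternative category
    "text": "Customer may cancel if uptime falls below the SLA.",
    "citation": "contracts/subject_a_sla.md",
}
print(finding_matches(expected, produced))
```

In this sketch the produced finding is accepted even though it was filed under sla_risk, because the keyword and citation checks still pass. Changing the category to something outside the accepted set, or removing the keyword, makes the match fail.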
Metrics
Per-agent metrics are computed by tests/evals/metrics.py and stored as the
baseline in tests/evals/baselines/latest.json:
- finding_recall: fraction of required expected findings that were surfaced. Recall is many-to-one: one produced finding that legitimately covers two expected risks (e.g. a clause whose text addresses both a non-compete and a termination provision) credits both. Each expected finding still independently requires its own category, keyword, and citation match, so a produced finding only satisfies multiple expected findings when it genuinely contains each one's discriminating keyword.
- finding_precision: fraction of produced findings that were expected (1:1 matched, so over-production still lowers precision).
- citation_accuracy: fraction of findings whose citation references the correct source.
- severity_accuracy: fraction of findings whose severity falls in the expected range.
- false_positive_rate: fraction of produced findings that hit a must_not_find guard.
- f1_score: harmonic mean of recall and precision.
The baseline file also records a finding_count per agent, plus the commit
and timestamp it was captured at.
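The difference between many-to-one recall and 1:1 precision is easier to see in code. The following is a minimal sketch, not the contents of tests/evals/metrics.py: the function bodies, the greedy 1:1 pairing, the conventions for empty inputs, and the toy keyword-only matcher are all assumptions made for illustration.

```python
from typing import Callable

Match = Callable[[dict, dict], bool]


def finding_recall(expected: list[dict], produced: list[dict], matches: Match) -> float:
    """Many-to-one: one produced finding may satisfy several expected findings."""
    required = [e for e in expected if e.get("required", True)]
    if not required:
        return 1.0  # assumption: nothing required means nothing was missed
    hit = sum(1 for e in required if any(matches(e, p) for p in produced))
    return hit / len(required)


def finding_precision(expected: list[dict], produced: list[dict], matches: Match) -> float:
    """1:1: each expected finding can vouch for at most one produced finding."""
    if not produced:
        return 1.0  # assumption: nothing produced means nothing unexpected
    unused = list(expected)
    matched = 0
    for p in produced:
        for e in unused:
            if matches(e, p):
                unused.remove(e)  # greedy pairing; this expected finding is used up
                matched += 1
                break
    return matched / len(produced)


def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    total = recall + precision
    return 0.0 if total == 0 else 2 * recall * precision / total


# Toy matcher: keyword only (the real match also gates on category and citation).
def keyword_match(e: dict, p: dict) -> bool:
    return e["keyword"] in p["text"]


expected = [
    {"keyword": "non-compete", "required": True},
    {"keyword": "termination", "required": True},
]
produced = [
    {"text": "the non-compete survives termination"},  # covers both expected risks
    {"text": "governing law is unclear"},              # extra, unmatched finding
]

recall = finding_recall(expected, produced, keyword_match)
precision = finding_precision(expected, produced, keyword_match)
print(recall, precision, f1_score(recall, precision))
```

Here the first produced finding credits both expected findings for recall, but it pairs with only one of them for precision, and the extra finding pairs with none, so over-production lowers precision without affecting recall.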
How it is used
The eval suite under tests/evals/ has two tiers:
- Deterministic tier: test_contract_tier.py and test_trigger_evals.py exercise matching/threshold logic and the cross-domain trigger rules with no model calls. These run in CI on every push (the -m "not eval" selection).
- Model-graded tier: test_agent_evals.py and test_cross_agent_evals.py run real specialist agents against the golden contracts and score them with the metrics above. These are marked with the eval pytest marker (defined in pyproject.toml) and require an API key.
Run them locally:
# Deterministic eval logic only (no API key)
pytest tests/evals/test_contract_tier.py tests/evals/test_trigger_evals.py -m "not eval"
# Full model-graded eval tier (requires ANTHROPIC_API_KEY or Bedrock creds)
pytest tests/evals/ -m eval
In CI, the model-graded tier runs on the main branch as a separate,
non-blocking job (it makes real model calls and reports quality regressions
without failing the build). The deterministic tier is part of the normal test
run. See .github/workflows/ci.yml for the exact job wiring.
Handling non-determinism (median-of-N)
LLM agents are stochastic, so recall/precision/F1 from a single run swing between
runs. The eval_results fixture (tests/evals/conftest.py) collapses that noise
by sampling each (agent, contract) pair DD_EVAL_SAMPLES times and taking the
median of each metric via aggregate_metrics_median:
- DD_EVAL_SAMPLES=1 (default): one sample; identical cost and behavior to a single run, for fast local iteration.
- CI main sets DD_EVAL_SAMPLES=3: the median of three runs removes the single-unlucky-draw swing. Median (not best-of-N) is deliberate: one lucky high draw cannot rescue a genuinely degraded agent.
A sample whose run did not complete successfully — an exception OR a non-success status such as a stalled-stream timeout — is dropped as inconclusive (infrastructure noise, not an agent miss); a successful run that genuinely produced zero findings is kept as a real recall failure. If fewer than a majority of an agent's samples for a contract succeed, that contract is excluded from the aggregate rather than counted as a zero. Per-agent recall and false-positive rate aggregate as the WORST contract (min recall / max fp), so a real per-contract regression cannot be averaged away.
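The sampling rules above can be sketched as follows. This is an illustration under stated assumptions, not the real aggregate_metrics_median: the sample shape (a status string and a recall value), the helper names, and the reading of "majority" as strictly more than half of the requested samples are all hypothetical.

```python
from statistics import median
from typing import Optional


def median_recall_for_contract(samples: list[dict]) -> Optional[float]:
    """Median recall over successful samples, or None if the contract is excluded."""
    # Exceptions and non-success statuses are inconclusive and dropped. A
    # successful run with zero findings stays in as a genuine recall of 0.0.
    ok = [s["recall"] for s in samples if s["status"] == "success"]
    # Assumption: a majority means strictly more than half of the samples.
    if len(ok) * 2 <= len(samples):
        return None
    return median(ok)


def agent_recall(per_contract: dict[str, list[dict]]) -> Optional[float]:
    """Worst-contract aggregation: the minimum of the per-contract medians."""
    medians = [
        m
        for m in (median_recall_for_contract(s) for s in per_contract.values())
        if m is not None
    ]
    return min(medians) if medians else None


runs = {
    "contract_a": [
        {"status": "success", "recall": 1.0},
        {"status": "success", "recall": 0.5},
        {"status": "success", "recall": 1.0},
    ],
    "contract_b": [
        {"status": "success", "recall": 0.0},  # real miss, kept
        {"status": "timeout", "recall": 0.0},  # infrastructure noise, dropped
        {"status": "success", "recall": 0.5},
    ],
    "contract_c": [
        {"status": "error", "recall": 0.0},
        {"status": "timeout", "recall": 0.0},
        {"status": "success", "recall": 1.0},  # only 1 of 3 succeeded: excluded
    ],
}
print(agent_recall(runs))
```

With these invented numbers, contract_a has a median of 1.0, contract_b has a median of 0.25 over its two successful samples, and contract_c is excluded, so the agent's recall is the worst remaining contract rather than an average. The false-positive rate would aggregate the same way with max in place of min.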
The near-deterministic quality metrics (citation accuracy, severity calibration)
are evaluated through a three-valued verdict (evaluate_verdict): a value inside
a small ambiguity band is inconclusive (logged, not auto-passed), and only a
value clearly below the band fails. Required-recall (≥0.80) and
false-positive-rate (≤0.15) are never banded — a missed required finding or a
forbidden finding is exactly what the suite exists to catch, so those stay hard
asserts. See tests/evals/test_eval_robustness.py for the anti-masking
invariants (a value below threshold - zone can never become non-FAIL).
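A minimal sketch of the three-valued verdict, assuming the ambiguity band sits directly below the threshold, that is, it covers values from threshold - zone up to the threshold. The function name evaluate_verdict_sketch, the Verdict enum, the example threshold and zone values, and the hard_gates_pass helper are hypothetical; only the 0.80 and 0.15 limits and the anti-masking invariant come from the text above.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    INCONCLUSIVE = "inconclusive"
    FAIL = "fail"


def evaluate_verdict_sketch(value: float, threshold: float, zone: float) -> Verdict:
    """Three-valued verdict for a banded, higher-is-better metric."""
    if value >= threshold:
        return Verdict.PASS
    if value >= threshold - zone:
        # Inside the ambiguity band: logged for review, not auto-passed.
        return Verdict.INCONCLUSIVE
    # Anti-masking invariant: anything below threshold - zone always fails.
    return Verdict.FAIL


def hard_gates_pass(recall: float, false_positive_rate: float) -> bool:
    """Required recall and false-positive rate are never banded."""
    return recall >= 0.80 and false_positive_rate <= 0.15


# Illustrative threshold and zone only; the real values live in the suite.
for value in (0.95, 0.87, 0.70):
    print(value, evaluate_verdict_sketch(value, threshold=0.90, zone=0.05))
```

Keeping the banded verdict separate from the hard gates means a noisy citation score can be flagged for review without ever softening the recall floor or the false-positive ceiling.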
The F1 regression gate
test_agent_evals.py enforces hard per-agent thresholds (recall ≥0.80,
false-positive rate ≤0.15) plus baseline-backed banded quality metrics (citation,
severity), and a secondary no-regression check: an agent's median f1_score
must not fall more than _F1_REGRESSION_TOLERANCE (0.15) below its stored
baseline in tests/evals/baselines/latest.json. The tolerance is sized to
observed run-to-run variance: recall is stable but precision swings ~0.10–0.13
(an agent surfaces a variable number of extra-but-valid findings), so a 0.05 band
flaked on honest noise. 0.15 absorbs precision noise while still catching a
genuine recall regression (a 1.0→0.67 recall drop moves f1 ~0.18); the hard
recall floor is the independent primary backstop, so a real quality drop fails on
two gates, not one. If there is no baseline for an agent, the check is skipped.
The baseline is captured with the same median-of-N methodology so the comparison
is apples-to-apples.
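The sizing argument can be checked with a little arithmetic. The sketch below assumes a precision of 0.90 for the baseline run, which is an illustrative value rather than a measured one, and the f1_regressed helper is hypothetical; only the 0.15 tolerance and the skip-when-no-baseline rule come from the text above.

```python
from typing import Optional

F1_REGRESSION_TOLERANCE = 0.15  # value of _F1_REGRESSION_TOLERANCE stated above


def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    total = recall + precision
    return 0.0 if total == 0 else 2 * recall * precision / total


def f1_regressed(current_f1: float, baseline_f1: Optional[float]) -> bool:
    """True when the median F1 fell more than the tolerance below the baseline."""
    if baseline_f1 is None:
        return False  # no stored baseline for this agent: the check is skipped
    return baseline_f1 - current_f1 > F1_REGRESSION_TOLERANCE


baseline = f1(recall=1.0, precision=0.90)   # assumed baseline run
noisy = f1(recall=1.0, precision=0.78)      # precision swings by 0.12, recall holds
degraded = f1(recall=0.67, precision=0.90)  # recall drops from 1.0 to 0.67

print(round(baseline - noisy, 3), f1_regressed(noisy, baseline))
print(round(baseline - degraded, 3), f1_regressed(degraded, baseline))
```

Under these assumptions the precision-only swing moves F1 by roughly 0.07, which would have tripped a 0.05 band but stays inside 0.15, while the recall drop moves F1 by roughly 0.18 and trips the gate. The degraded run also fails the hard recall floor of 0.80 independently.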
To intentionally move the baseline (after a deliberate, reviewed change), re-run
the eval tier with --update-baseline so latest.json captures the new
metrics, then commit the updated baseline.
Related Documentation
- System Card — anti-hallucination layers the evals validate
- Agent Customization — how agents are configured