Running Evals

Prerequisites

  • Python 3.12+
  • uv package manager
  • just command runner
  • API keys for the models you want to use

Setup

# Clone and install
git clone https://github.com/EHS-Data-Standards/soma-evals.git
cd soma-evals
just setup

# Set API keys
uv run llm keys set openai
uv run llm keys set anthropic
uv run llm keys set gemini

# Verify models are available
just list-models

CBORG Models

Models prefixed with cborg/ require access to Lawrence Berkeley National Lab's CBORG gateway. If you don't have CBORG access, use the OpenAI models directly or configure your own model aliases.

Running Evaluations

Single Ablation Level

# Run baseline (no schema context) with the standard tier
just run-baseline

# Run a specific level with a specific tier
just run-class-names tier=cheap
just run-full-classes tier=full
just run-with-enums tier=standard

All Levels

# Run all four ablation levels sequentially
just run-all

# With a specific tier
just run-all tier=full

Custom Paper

Override the default PDF and paper slug via environment variables:

EVAL_PDF=my-paper.pdf EVAL_SLUG=my-paper-slug just run-baseline

Or pass them directly to the CLI:

uv run python -m soma_evals run \
  --level baseline \
  --tier standard \
  --pdf my-paper.pdf \
  --paper-slug my-paper-slug
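
To run a custom paper through every ablation level, the CLI call above can be generated once per level. The following is a minimal sketch, not part of soma-evals: the helper names `build_run_command` and `build_all_commands` are hypothetical, and the non-baseline `--level` values are assumed to match the level names listed in the Justfile reference (`class_names`, `full_classes`, `with_enums`).

```python
import shlex

# Assumption: these level names (taken from the Justfile reference) are
# accepted as --level values. Only "baseline" is shown in the CLI example.
LEVELS = ["baseline", "class_names", "full_classes", "with_enums"]


def build_run_command(level, tier, pdf, paper_slug):
    """Return the argv list for one `soma_evals run` invocation."""
    return [
        "uv", "run", "python", "-m", "soma_evals", "run",
        "--level", level,
        "--tier", tier,
        "--pdf", pdf,
        "--paper-slug", paper_slug,
    ]


def build_all_commands(tier, pdf, paper_slug, levels=LEVELS):
    """Return one argv list per ablation level, in the given order."""
    return [build_run_command(level, tier, pdf, paper_slug) for level in levels]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing calls an API.
    for argv in build_all_commands("standard", "my-paper.pdf", "my-paper-slug"):
        print(shlex.join(argv))
```

To actually execute a command, each argv list could be passed to `subprocess.run(argv, check=True)`. Printing first makes it easy to confirm the flags before spending API credits.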

Output

Results are written to results/<level>/<model>/, with one run_metadata.yaml per level directory:

results/
└── baseline/
    ├── run_metadata.yaml
    ├── gpt-4o/montgomery2020-pm25-mucociliary.yaml
    ├── gpt-4o-mini/montgomery2020-pm25-mucociliary.yaml
    ├── cborg--claude-sonnet-4-6/montgomery2020-pm25-mucociliary.yaml
    └── cborg--gemini-2.5-flash/montgomery2020-pm25-mucociliary.yaml

Each model produces a YAML file with its structured extraction. The run_metadata.yaml file records token counts, latency, and status for each model.
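
The layout above can be inspected with a few lines of Python. This is a minimal sketch that relies only on the directory structure shown here; the function name `list_extractions` is hypothetical, and the sketch does not parse the YAML contents, since their schema is not described in this page.

```python
from pathlib import Path


def list_extractions(results_dir):
    """Map (level, model_dir) to the sorted extraction file names found there.

    Assumes the results/<level>/<model>/<paper-slug>.yaml layout shown above.
    run_metadata.yaml sits one directory higher, so the glob skips it.
    """
    found = {}
    root = Path(results_dir)
    for path in sorted(root.glob("*/*/*.yaml")):
        level, model_dir = path.parts[-3], path.parts[-2]
        found.setdefault((level, model_dir), []).append(path.name)
    return found


if __name__ == "__main__":
    for (level, model_dir), files in list_extractions("results").items():
        print(f"{level}/{model_dir}: {', '.join(files)}")
```

A listing like this is a quick way to spot a model that produced no extraction file for a given level before opening run_metadata.yaml for its status.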

Debugging

# Show what schema context looks like at each level
just show-context

# Run QC checks (no API calls)
just fix     # lint + format
just test    # pytest (excludes API tests)

Justfile Reference

Command                  Description
just setup               Install dependencies
just list-models         Show configured models and tiers
just run-baseline        Run baseline ablation level
just run-class-names     Run class_names level
just run-full-classes    Run full_classes level
just run-with-enums      Run with_enums level
just run-all             Run all four levels
just show-context        Print schema context at each level
just fix                 Auto-fix lint and format
just test                Run tests (no API calls)
just clean-results       Delete all results