
Prompt Template

The extraction prompt is defined in extract.yaml at the repository root. It has two parts: a system message and a user prompt.

System Message

You are a scientific data extraction assistant specializing in biological
assays and measurements. Your task is to extract structured data from
scientific text, identifying assays, measurements, protocols, study
subjects, and experimental conditions.

User Prompt

Extract structured assay and measurement data from the following scientific text.
Identify all assays, their measurements, study subjects, protocols, and
experimental conditions. Use ontology identifiers where possible (e.g.,
OBI, CHEBI, CL, UBERON, NCBITaxon, UO). Provide the extracted data in
a structured format (e.g., YAML).
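As a rough sketch, the two parts map onto a standard chat-completion message list. The function name below is illustrative and the strings are abbreviated; the real texts live in extract.yaml:

```python
# Hypothetical sketch: pairing the fixed system message with the
# filled-in user prompt. Abbreviated strings; not the actual template.
SYSTEM_MESSAGE = (
    "You are a scientific data extraction assistant specializing in "
    "biological assays and measurements."
)
USER_TEMPLATE = (
    "Extract structured assay and measurement data from the following "
    "scientific text. Provide the extracted data in a structured format "
    "(e.g., YAML).\n\n{source_text}"
)

def build_messages(source_text: str) -> list[dict]:
    """Return system + user messages for one extraction request."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_TEMPLATE.format(source_text=source_text)},
    ]
```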

Prompt Assembly

The prompt_builder.py module assembles the final prompt from three components:

┌─────────────────────────────────────────┐
│  ## Schema Context   (if not baseline)  │
│  [generated by schema_context.py]       │
│                                         │
│  [extraction instructions from template]│
│                                         │
│  ## Source Text                         │
│  [full text extracted from PDF]         │
└─────────────────────────────────────────┘

For baseline runs, the schema context section is omitted entirely; for all other ablation levels, it is prepended to the extraction instructions.

Ontology Targets

The prompt instructs models to use ontology identifiers from these vocabularies:

Prefix      Ontology                                    Domain
OBI         Ontology for Biomedical Investigations      Assay types, protocols
CHEBI       Chemical Entities of Biological Interest    Chemical compounds
CL          Cell Ontology                               Cell types
UBERON      Uber-anatomy Ontology                       Anatomical structures
NCBITaxon   NCBI Taxonomy                               Organisms
UO          Units of Measurement Ontology               Measurement units
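Identifiers from these vocabularies take the usual CURIE form, `PREFIX:digits` (e.g., `CHEBI:15377`). A small hypothetical check that an extracted identifier uses one of the six target prefixes:

```python
# Hypothetical validator for CURIEs restricted to the prompt's
# target ontology prefixes. Not part of the repository.
import re

TARGET_PREFIXES = {"OBI", "CHEBI", "CL", "UBERON", "NCBITaxon", "UO"}
CURIE_RE = re.compile(r"^(?P<prefix>[A-Za-z]+):(?P<local_id>\d+)$")

def is_target_curie(curie: str) -> bool:
    """True if curie is PREFIX:digits with a target ontology prefix."""
    m = CURIE_RE.match(curie)
    return bool(m) and m.group("prefix") in TARGET_PREFIXES
```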

Token Budget

The prompt length varies by ablation level due to the schema context:

Level          Approximate Prompt Size
baseline       ~18,700 tokens (paper text + instructions only)
class_names    ~19,500 tokens
full_classes   ~20,500 tokens
with_enums     ~21,700 tokens

These figures are approximate and vary with the length of the paper and the size of the schema.
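For a quick back-of-the-envelope check without a tokenizer, a common heuristic is roughly four characters per token for English prose. This sketch is an assumption for estimation only; the figures above would come from a real tokenizer:

```python
# Rough token estimate using the ~4 characters/token heuristic
# for English text. Illustrative only; not a real tokenizer.
def approx_tokens(text: str) -> int:
    """Estimate token count as len(text) / 4, rounded down (min 1)."""
    return max(1, len(text) // 4)
```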