
Prompt Template

The extraction prompt is defined in extract.yaml at the repository root. It has two parts: a system message and a user prompt.

System Message

You are a scientific data extraction assistant specializing in biological
assays and measurements. Your task is to extract structured data from
scientific text, identifying assays, measurements, protocols, study
subjects, and experimental conditions.

User Prompt

Extract structured assay and measurement data from the following scientific text.
Identify all assays, their measurements, study subjects, protocols, and
experimental conditions. Use ontology identifiers where possible (e.g.,
OBI, CHEBI, CL, UBERON, NCBITaxon, UO). Provide the extracted data in
a structured format (e.g., YAML).
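As a rough sketch, the two parts map onto a standard chat-completion message list. The function name below is illustrative and the strings are abbreviated; the real texts live in extract.yaml:

```python
# Hypothetical sketch: pairing the fixed system message with the
# filled-in user prompt. Abbreviated strings; not the actual template.
SYSTEM_MESSAGE = (
    "You are a scientific data extraction assistant specializing in "
    "biological assays and measurements."
)
USER_TEMPLATE = (
    "Extract structured assay and measurement data from the following "
    "scientific text. Provide the extracted data in a structured format "
    "(e.g., YAML).\n\n{source_text}"
)

def build_messages(source_text: str) -> list[dict]:
    """Return system + user messages for one extraction request."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_TEMPLATE.format(source_text=source_text)},
    ]
```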

Prompt Assembly

The prompt_builder.py module assembles the final prompt from three components:

┌─────────────────────────────────────────┐
│  ## Schema Context   (if not baseline)  │
│  [generated by schema_context.py]       │
│                                         │
│  [extraction instructions from template]│
│                                         │
│  ## Source Text                         │
│  [full text extracted from PDF]         │
└─────────────────────────────────────────┘

For baseline runs, the schema context section is omitted entirely; for all other ablation levels, it is prepended to the extraction instructions.

Ontology Targets

The prompt instructs models to use ontology identifiers from these vocabularies:

Prefix      Ontology                                    Domain
OBI         Ontology for Biomedical Investigations      Assay types, protocols
CHEBI       Chemical Entities of Biological Interest    Chemical compounds
CL          Cell Ontology                               Cell types
UBERON      Uber-anatomy Ontology                       Anatomical structures
NCBITaxon   NCBI Taxonomy                               Organisms
UO          Units of Measurement Ontology               Measurement units
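Identifiers from these vocabularies take the usual CURIE form, `PREFIX:digits` (e.g., `CHEBI:15377`). A small hypothetical check that an extracted identifier uses one of the six target prefixes:

```python
# Hypothetical validator for CURIEs restricted to the prompt's
# target ontology prefixes. Not part of the repository.
import re

TARGET_PREFIXES = {"OBI", "CHEBI", "CL", "UBERON", "NCBITaxon", "UO"}
CURIE_RE = re.compile(r"^(?P<prefix>[A-Za-z]+):(?P<local_id>\d+)$")

def is_target_curie(curie: str) -> bool:
    """True if curie is PREFIX:digits with a target ontology prefix."""
    m = CURIE_RE.match(curie)
    return bool(m) and m.group("prefix") in TARGET_PREFIXES
```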

Token Budget

The prompt length varies by ablation level due to the schema context:

Level          Approximate Prompt Size
baseline       ~18,700 tokens (paper text + instructions only)
class_names    ~19,500 tokens
full_classes   ~20,500 tokens
with_enums     ~21,700 tokens

These figures are approximate and vary with the length of the paper and the size of the schema.
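For a quick back-of-the-envelope check without a tokenizer, a common heuristic is roughly four characters per token for English prose. This sketch is an assumption for estimation only; the figures above would come from a real tokenizer:

```python
# Rough token estimate using the ~4 characters/token heuristic
# for English text. Illustrative only; not a real tokenizer.
def approx_tokens(text: str) -> int:
    """Estimate token count as len(text) / 4, rounded down (min 1)."""
    return max(1, len(text) // 4)
```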