Prompts
Prompt Template
The extraction prompt is defined in extract.yaml at the repository root.
It has two parts: a system message and a user prompt.
System Message
You are a scientific data extraction assistant specializing in biological
assays and measurements. Your task is to extract structured data from
scientific text, identifying assays, measurements, protocols, study
subjects, and experimental conditions.
User Prompt
Extract structured assay and measurement data from the following scientific text.
Identify all assays, their measurements, study subjects, protocols, and
experimental conditions. Use ontology identifiers where possible (e.g.,
OBI, CHEBI, CL, UBERON, NCBITaxon, UO). Provide the extracted data in
a structured format (e.g., YAML).
Prompt Assembly
The prompt_builder.py module assembles the final prompt from three components:
┌─────────────────────────────────────────┐
│ ## Schema Context (if not baseline)     │
│ [generated by schema_context.py]        │
│                                         │
│ [extraction instructions from template] │
│                                         │
│ ## Source Text                          │
│ [full text extracted from PDF]          │
└─────────────────────────────────────────┘
For baseline runs, the schema context section is omitted entirely. For all other ablation levels, it is placed before the extraction instructions.
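The assembly order in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not the actual contents of prompt_builder.py: the function name `build_prompt`, its parameters, and the blank-line separator between sections are all assumptions.

```python
from typing import Optional


def build_prompt(
    instructions: str,
    source_text: str,
    schema_context: Optional[str] = None,
) -> str:
    """Join the prompt components in the order shown in the diagram.

    Hypothetical helper: the real prompt_builder.py may differ.
    """
    sections = []
    # Baseline runs pass schema_context=None, so the section is omitted entirely.
    if schema_context:
        sections.append("## Schema Context\n" + schema_context.strip())
    # The extraction instructions always follow the (optional) schema context.
    sections.append(instructions.strip())
    # The paper text always comes last.
    sections.append("## Source Text\n" + source_text.strip())
    return "\n\n".join(sections)


# Placeholder inputs for illustration only.
baseline = build_prompt(
    "Extract structured assay data.",
    "Cells were incubated for 24 h.",
)
with_schema = build_prompt(
    "Extract structured assay data.",
    "Cells were incubated for 24 h.",
    schema_context="classes: Assay, Measurement",
)
```

Treating the schema context as an optional argument keeps the baseline and non-baseline paths in a single function, so the only difference between ablation levels is the string that is passed in.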
Ontology Targets
The prompt instructs models to use ontology identifiers from these vocabularies:
| Prefix | Ontology | Domain |
|---|---|---|
| OBI | Ontology for Biomedical Investigations | Assay types, protocols |
| CHEBI | Chemical Entities of Biological Interest | Chemical compounds |
| CL | Cell Ontology | Cell types |
| UBERON | Uber-anatomy Ontology | Anatomical structures |
| NCBITaxon | NCBI Taxonomy | Organisms |
| UO | Units of Measurement Ontology | Measurement units |
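As an illustration of how the prefixes in the table might be used downstream, the sketch below checks whether an identifier in model output carries one of the six target prefixes. It is not part of the documented pipeline: the function name `has_known_prefix` and the assumed `PREFIX:digits` identifier shape are hypothetical choices made for this example.

```python
import re

# Prefixes taken from the table above.
ONTOLOGY_PREFIXES = {"OBI", "CHEBI", "CL", "UBERON", "NCBITaxon", "UO"}

# Assumed identifier shape: PREFIX:LOCAL_ID with a numeric local ID.
CURIE_PATTERN = re.compile(r"^([A-Za-z]+):(\d+)$")


def has_known_prefix(identifier: str) -> bool:
    """Return True if the identifier looks like PREFIX:digits with a target prefix."""
    match = CURIE_PATTERN.match(identifier.strip())
    return bool(match) and match.group(1) in ONTOLOGY_PREFIXES


print(has_known_prefix("CHEBI:15377"))     # prefix is in the table
print(has_known_prefix("NCBITaxon:9606"))  # prefix is in the table
print(has_known_prefix("GO:0008150"))      # GO is not one of the target prefixes
```

The check is purely syntactic. It does not confirm that the term exists in the ontology or that it matches the entity described in the text.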
Token Budget
The prompt length varies by ablation level due to the schema context:
| Level | Approximate Prompt Size |
|---|---|
| baseline | ~18,700 tokens (paper text + instructions only) |
| class_names | ~19,500 tokens |
| full_classes | ~20,500 tokens |
| with_enums | ~21,700 tokens |
These figures are approximate and depend on the length of the specific paper and the size of the schema.
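Because the baseline prompt contains only the paper text and instructions, subtracting it from each level's total gives a rough estimate of how many tokens the schema context contributes. The short sketch below does that arithmetic with the approximate figures from the table; the names `PROMPT_TOKENS` and `schema_overhead` are introduced here for illustration.

```python
# Approximate prompt sizes from the table above, in tokens.
PROMPT_TOKENS = {
    "baseline": 18_700,
    "class_names": 19_500,
    "full_classes": 20_500,
    "with_enums": 21_700,
}


def schema_overhead(level: str) -> int:
    """Tokens added by the schema context relative to the baseline prompt."""
    return PROMPT_TOKENS[level] - PROMPT_TOKENS["baseline"]


for level in PROMPT_TOKENS:
    print(f"{level}: +{schema_overhead(level)} tokens")
```

With these figures the schema context accounts for roughly 800, 1,800, and 3,000 tokens at the class_names, full_classes, and with_enums levels respectively, so even the largest level adds only about 16% to the baseline prompt.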