Ablation Levels
The core of this evaluation is a schema ablation study with four cumulative levels. Each level adds more SOMA schema information to the LLM prompt, letting us measure the marginal impact of each type of schema context on extraction quality.
Overview
| Level | Schema Context | What's Added | Approx. Tokens |
|---|---|---|---|
| baseline | None | No schema; the LLM uses only training knowledge | ~18,700 |
| class_names | Class headers | Class names, descriptions, parent classes, URIs, mappings | ~19,500 |
| full_classes | + Slot definitions | All induced slots with ranges, cardinality, and constraints | ~20,500 |
| with_enums | + Enumerations | All enum definitions with permissible values and ontology meanings | ~21,700 |
Each level is cumulative — it includes everything from the previous level plus new context.
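Because the levels are cumulative, the token cost of each kind of schema context can be read off as the difference between adjacent rows of the table. The following is a minimal sketch of that arithmetic using the approximate counts above; the names `APPROX_TOKENS` and `marginal_tokens` are illustrative and are not taken from the project code.

```python
# Approximate prompt sizes from the overview table, in ablation order.
APPROX_TOKENS = {
    "baseline": 18_700,
    "class_names": 19_500,
    "full_classes": 20_500,
    "with_enums": 21_700,
}


def marginal_tokens(tokens_by_level: dict) -> dict:
    """Return the tokens each level adds relative to the level before it."""
    levels = list(tokens_by_level)
    return {
        cur: tokens_by_level[cur] - tokens_by_level[prev]
        for prev, cur in zip(levels, levels[1:])
    }


if __name__ == "__main__":
    for level, added in marginal_tokens(APPROX_TOKENS).items():
        print(f"{level}: +{added} tokens")
```

By these figures the complete schema adds roughly 3,000 tokens, about 16% on top of the baseline prompt.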
Level 1: baseline
Schema context injected: None
The prompt contains only the extraction instructions and the paper text. This is the control condition — it measures what each model can do purely from its pre-training knowledge of scientific data structures.
Prompt sent to the LLM
System:
You are a scientific data extraction assistant specializing in biological
assays and measurements. Your task is to extract structured data from
scientific text, identifying assays, measurements, protocols, study
subjects, and experimental conditions.
User:
Extract structured assay and measurement data from the following scientific text.
Identify all assays, their measurements, study subjects, protocols, and
experimental conditions. Use ontology identifiers where possible (e.g.,
OBI, CHEBI, CL, UBERON, NCBITaxon, UO). Provide the extracted data in
a structured format (e.g., YAML).
[paper text follows]
What to expect: Output structure is entirely model-dependent. Field naming is inconsistent across models. Ontology IDs appear sporadically. Controlled vocabulary terms are free text.
| | Baseline |
|---|---|
| Output structure | Model-dependent |
| Field naming | Inconsistent |
| Ontology IDs | Sporadic |
| Controlled vocab | Free text |
| Output consistency | Low |
Results: results/baseline/<model>/<paper>.yaml
(browse on GitHub)
Next level: class_names adds class names, descriptions, and ontology mappings.
Level 2: class_names
Schema context injected: Class names, descriptions, parent classes, URIs, and mappings.
Schema context prepended to prompt
# SOMA Schema Classes
## Assay
An experimental procedure to test a hypothesis or measure something.
Parent: NamedThing
URI: soma:Assay
Mappings: OBI:0000070
## Protocol
A detailed description of how an assay is performed.
Parent: NamedThing
URI: soma:Protocol
## Measurement
A quantitative or qualitative result from an assay.
Parent: NamedThing
URI: soma:Measurement
## Subject
An entity that is the focus of an investigation.
Parent: NamedThing
URI: soma:Subject
[... all SOMA classes listed ...]
What changed from baseline: This gives the LLM the vocabulary and hierarchy of the target data model. It knows what categories of information to extract (Assay, Protocol, Measurement, Subject, etc.) and has ontology URIs to anchor them.
| | Baseline | + Class Names |
|---|---|---|
| Output structure | Model-dependent | Aligned categories |
| Field naming | Inconsistent | Improved |
| Ontology IDs | Sporadic | Improved |
| Controlled vocab | Free text | Free text |
| Output consistency | Low | Medium |
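To make the class header layout concrete, here is a hedged sketch of how one class entry could be rendered as text. It is not the project's `schema_context.py`: the `ClassInfo` dataclass and `render_class_header` function are hypothetical stand-ins for the metadata that would be read from the schema, and the example values are copied from the Assay entry shown above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClassInfo:
    """Hypothetical stand-in for the class metadata read from the schema."""
    name: str
    description: str
    parent: Optional[str] = None
    uri: Optional[str] = None
    mappings: List[str] = field(default_factory=list)


def render_class_header(cls: ClassInfo) -> str:
    """Format one class in the layout shown in the class_names example."""
    lines = [f"## {cls.name}", cls.description]
    if cls.parent:
        lines.append(f"Parent: {cls.parent}")
    if cls.uri:
        lines.append(f"URI: {cls.uri}")
    if cls.mappings:  # the line is skipped when a class has no mappings
        lines.append(f"Mappings: {', '.join(cls.mappings)}")
    return "\n".join(lines)


assay = ClassInfo(
    name="Assay",
    description="An experimental procedure to test a hypothesis or measure something.",
    parent="NamedThing",
    uri="soma:Assay",
    mappings=["OBI:0000070"],
)
print(render_class_header(assay))
```

Skipping empty fields matches the example, where Protocol, Measurement, and Subject carry no Mappings line.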
Results: results/class_names/<model>/<paper>.yaml
(browse on GitHub)
Previous level: baseline — no schema context at all. Next level: full_classes adds slot definitions with types and cardinality.
Level 3: full_classes
Schema context injected: Everything from class_names, plus all slot (field) definitions for each class.
Schema context prepended to prompt
# SOMA Schema Classes and Slots
## Assay
An experimental procedure to test a hypothesis or measure something.
Parent: NamedThing
URI: soma:Assay
Mappings: OBI:0000070
Slots:
- name (string) [required, identifier]
- description (string)
- has_protocol (Protocol) [multivalued, inlined_as_list]
- has_measurement (Measurement) [multivalued, inlined_as_list]
- study_subjects (Subject) [multivalued]
- assay_type (AssayTypeEnum)
- gene_expression_method (GeneExpressionMethodEnum)
## Protocol
A detailed description of how an assay is performed.
Parent: NamedThing
URI: soma:Protocol
Slots:
- name (string) [required, identifier]
- description (string)
- protocol_type (ProtocolTypeEnum)
[... all classes with full slot definitions ...]
What changed from class_names: The LLM now knows the exact fields to extract for each class — their names, types, whether they're required, and whether they accept single or multiple values. This should produce output that is structurally conformant with the SOMA schema.
| | Baseline | + Class Names | + Full Classes |
|---|---|---|---|
| Output structure | Model-dependent | Aligned categories | Schema-conformant fields |
| Field naming | Inconsistent | Improved | Matching schema |
| Ontology IDs | Sporadic | Improved | Good |
| Controlled vocab | Free text | Free text | Free text |
| Output consistency | Low | Medium | High |
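The slot lines in the example follow a regular pattern: name, range in parentheses, then any flags in square brackets. The sketch below reproduces that pattern under stated assumptions. `SlotInfo` and `render_slot` are hypothetical names, and the flag set is limited to the four flags that appear in the example above; the real generator may emit others.

```python
from dataclasses import dataclass


@dataclass
class SlotInfo:
    """Hypothetical stand-in for one induced slot of a class."""
    name: str
    range: str = "string"
    required: bool = False
    identifier: bool = False
    multivalued: bool = False
    inlined_as_list: bool = False


def render_slot(slot: SlotInfo) -> str:
    """Format a slot as '- name (range) [flag, ...]', omitting empty brackets."""
    flags = [
        label
        for label, is_set in (
            ("required", slot.required),
            ("identifier", slot.identifier),
            ("multivalued", slot.multivalued),
            ("inlined_as_list", slot.inlined_as_list),
        )
        if is_set
    ]
    suffix = f" [{', '.join(flags)}]" if flags else ""
    return f"- {slot.name} ({slot.range}){suffix}"


print(render_slot(SlotInfo("name", required=True, identifier=True)))
print(render_slot(SlotInfo("has_protocol", "Protocol", multivalued=True, inlined_as_list=True)))
print(render_slot(SlotInfo("assay_type", "AssayTypeEnum")))
```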
Results: results/full_classes/<model>/<paper>.yaml
(browse on GitHub)
Previous level: class_names — class headers only, no slot details. Next level: with_enums adds enumeration values with ontology meanings.
Level 4: with_enums
Schema context injected: Everything from full_classes, plus all enumeration definitions with permissible values and ontology term mappings.
Schema context prepended to prompt
# SOMA Schema Classes and Slots
[... all classes with full slot definitions as in Level 3 ...]
# Enumerations
## AssayTypeEnum
Values:
- RNA_sequencing: Whole-transcriptome RNA-seq (meaning: OBI:0001271)
- chemical_analysis: Analytical chemistry assay (meaning: OBI:0000070)
- microscopy: Imaging assay (meaning: OBI:0000185)
- qRT_PCR: Quantitative RT-PCR (meaning: OBI:0002631)
- western_blot: Protein immunoblot (meaning: OBI:0000920)
## GeneExpressionMethodEnum
Values:
- qRT_PCR: Quantitative reverse-transcription PCR
- RNA_seq: RNA sequencing
- microarray: Gene expression microarray
[... all enumerations with permissible values and ontology meanings ...]
What changed from full_classes: The LLM now has controlled vocabularies with exact permitted values and their ontology term mappings. Instead of guessing "RNA-seq" vs "RNA sequencing" vs "RNAseq", it can use the canonical RNA_sequencing value from the enum. This is the most complete level of schema guidance.
| | Baseline | + Class Names | + Full Classes | + Enums |
|---|---|---|---|---|
| Output structure | Model-dependent | Aligned categories | Schema-conformant fields | Schema-conformant fields |
| Field naming | Inconsistent | Improved | Matching schema | Matching schema |
| Ontology IDs | Sporadic | Improved | Good | Best |
| Controlled vocab | Free text | Free text | Free text | Enum-aligned |
| Output consistency | Low | Medium | High | Highest |
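One simple way to see the difference between "Free text" and "Enum-aligned" is an exact-match check of extracted values against the permissible values. The sketch below is illustrative only: this section does not describe how the evaluation actually scores controlled vocabulary, and `enum_alignment` is a hypothetical helper. The enum contents are copied from the AssayTypeEnum example above.

```python
# Permissible values and ontology meanings from the AssayTypeEnum example.
ASSAY_TYPE_ENUM = {
    "RNA_sequencing": "OBI:0001271",
    "chemical_analysis": "OBI:0000070",
    "microscopy": "OBI:0000185",
    "qRT_PCR": "OBI:0002631",
    "western_blot": "OBI:0000920",
}


def enum_alignment(values, permissible):
    """Split extracted values into enum-aligned and free-text lists."""
    aligned = [v for v in values if v in permissible]
    free_text = [v for v in values if v not in permissible]
    return aligned, free_text


# Hypothetical assay_type values as a model might emit them.
extracted = ["RNA_sequencing", "RNA-seq", "RNAseq", "microscopy"]
aligned, free_text = enum_alignment(extracted, ASSAY_TYPE_ENUM)
print("aligned:", aligned)
print("free text:", free_text)
```

Only exact matches count as aligned here, so spelling variants such as "RNA-seq" fall into the free-text list.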
Results: results/with_enums/<model>/<paper>.yaml
(browse on GitHub)
Previous level: full_classes — classes and slots, but no enum values.
How schema context is generated
The schema context for each level is generated programmatically by schema_context.py,
which uses LinkML's SchemaView to introspect the SOMA schema and produce the text
blocks shown above.
┌─────────────────────────────────────────┐
│ ## Schema Context (if not baseline)     │
│ [generated by schema_context.py]        │
│                                         │
│ [extraction instructions from template] │
│                                         │
│ ## Source Text                          │
│ [full text extracted from PDF]          │
└─────────────────────────────────────────┘
For baseline runs, the schema context section is omitted entirely.
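The layout in the diagram can be sketched as a small assembly function. This is an assumption-laden illustration rather than the project's code: `build_user_prompt` and its parameters are hypothetical, and the section headings are taken from the diagram.

```python
from typing import Optional


def build_user_prompt(
    instructions: str,
    source_text: str,
    schema_context: Optional[str] = None,
) -> str:
    """Assemble the prompt sections in the order shown in the diagram."""
    sections = []
    if schema_context:  # baseline runs pass nothing, so no schema section
        sections.append(f"## Schema Context\n{schema_context}")
    sections.append(instructions)
    sections.append(f"## Source Text\n{source_text}")
    return "\n\n".join(sections)


baseline_prompt = build_user_prompt("Extract structured data.", "paper text")
schema_prompt = build_user_prompt(
    "Extract structured data.", "paper text", "# SOMA Schema Classes"
)
print(schema_prompt)
```

Keeping the schema section optional means all four levels share one code path, and only the generated context string differs between runs.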
To inspect the actual generated context at each level:
just show-context