Dataset containing abstracts from scientific journals and named entity annotations for relevant scientific words.

The data contains 350 samples in the train set and 100 samples in the test set. Each sample is a tokenized abstract from a scientific journal, and each token is annotated with a named entity tag. The dataset uses 6 named entity types: Task, Method, Material, Metric, OtherScientificTerm, and Generic. The dataset is a subset of the SciERC dataset.
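To make the per-token annotation concrete, the following is a minimal sketch of how one annotated sample might be represented and how consecutive tagged tokens can be grouped into entity mentions. The field names `tokens` and `tags`, the example sentence, the `"O"` tag for non-entity tokens, and the helper `extract_spans` are all illustrative assumptions; they are not taken from the dataset or from the loader's actual output format.

```python
# Hypothetical sample: tokens, tag values, and the "O" non-entity tag are
# illustrative assumptions, not records copied from the dataset.
sample = {
    "tokens": ["We", "evaluate", "a", "neural", "parser", "on", "the",
               "Penn", "Treebank", "using", "F1", "."],
    "tags": ["O", "O", "O", "Method", "Method", "O", "O",
             "Material", "Material", "O", "Metric", "O"],
}


def extract_spans(tokens, tags, outside="O"):
    """Group consecutive tokens that share the same entity tag into spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    current_tokens, current_tag = [], None
    for token, tag in zip(tokens, tags):
        if tag != outside and tag == current_tag:
            # Same entity tag as the previous token: extend the open span.
            current_tokens.append(token)
            continue
        if current_tokens:
            # The tag changed: close the span that was open.
            spans.append((" ".join(current_tokens), current_tag))
        # Start a new span only if this token carries an entity tag.
        current_tokens = [token] if tag != outside else []
        current_tag = tag if tag != outside else None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_tag))
    return spans


spans = extract_spans(sample["tokens"], sample["tags"])
for text, tag in spans:
    print(f"{tag}: {text}")
```

Because this sketch assumes plain tags without begin/inside prefixes, two adjacent mentions of the same type would be merged into one span; a BIO-style scheme, if the dataset uses one, would need the prefixes handled explicitly.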

Original publication: Luan, Yi, He, Luheng, Ostendorf, Mari, and Hajishirzi, Hannaneh. (2018). “Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

The SciERC dataset, in turn, was extracted from the S2ORC dataset.

Citation for the S2ORC dataset: Lo, Kyle, Wang, Lucy Lu, Neumann, Mark, Kinney, Rodney, and Weld, Daniel. (2020). “S2ORC: The Semantic Scholar Open Research Corpus.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.447. pp. 4969-4983.

The S2ORC dataset is licensed under the ODC-By 1.0 licence by the Allen Institute for AI (AllenAI).


load_data([data_format, include_properties, ...])

Loads and returns the SciERC Abstract NER dataset (token classification).